-
Commit2Vec: Learning Distributed Representations of Code Changes
Authors:
Rocìo Cabrera Lozoya,
Arnaud Baumann,
Antonino Sabetta,
Michele Bezzi
Abstract:
Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories).
In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits.
Because our method uses transfer learning (that is, we train a network on a "pretext task" for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two different pretext tasks versus a randomly initialized model.
Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset ($>10^6$ samples) were surpassed when pretraining on a smaller dataset ($>10^4$ samples) but for a pretext task that is more closely related to the target task.
Submitted 17 November, 2021; v1 submitted 18 November, 2019;
originally announced November 2019.
-
A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software
Authors:
Serena E. Ponta,
Henrik Plate,
Antonino Sabetta,
Michele Bezzi,
Cédric Dangremont
Abstract:
Advancing our understanding of software vulnerabilities, automating their identification, the analysis of their impact, and ultimately their mitigation is necessary to enable the development of software that is more secure. While operating a vulnerability assessment tool that we developed and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software and the commits fixing them. The data was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that we monitor on a continuous basis. From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of 624 vulnerabilities, 29 do not have a CVE identifier at all and 46, which do have a CVE identifier assigned by a numbering authority, are not available in the NVD yet. The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. These scripts also make it possible to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications). Our dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories. The release of this dataset and the supporting code as open-source will allow future research to be based on data of industrial relevance; also, it represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and the industry.
Submitted 19 March, 2019; v1 submitted 7 February, 2019;
originally announced February 2019.
-
A Practical Approach to the Automatic Classification of Security-Relevant Commits
Authors:
Antonino Sabetta,
Michele Bezzi
Abstract:
The lack of reliable sources of detailed information on the vulnerabilities of open-source software (OSS) components is a major obstacle to maintaining a secure software supply chain and an effective vulnerability management process. Standard sources of advisories and vulnerability data, such as the National Vulnerability Database (NVD), are known to suffer from poor coverage and inconsistent quality.
To reduce our dependency on these sources, we propose an approach that uses machine-learning to analyze source code repositories and to automatically identify commits that are security-relevant (i.e., that are likely to fix a vulnerability). We treat the source code changes introduced by commits as documents written in natural language, classifying them using standard document classification methods.
Combining independent classifiers that use information from different facets of commits, our method can yield high precision (80%) while ensuring acceptable recall (43%). In particular, the use of information extracted from the source code changes yields a substantial improvement over the best known approach in the state of the art, while requiring a significantly smaller amount of training data and employing a simpler architecture.
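As an illustration of the trade-off described in this abstract, the following minimal sketch combines the votes of two classifiers and measures precision and recall. The unanimity rule, the function names, and the toy votes are assumptions made for illustration; the abstract does not specify how the classifiers are combined, and the numbers below are not the paper's data.

```python
from typing import List, Sequence, Tuple


def combine_unanimous(*votes: Sequence[bool]) -> List[bool]:
    """Flag a commit only when every classifier flags it (assumed rule)."""
    return [all(v) for v in zip(*votes)]


def precision_recall(pred: Sequence[bool], truth: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for boolean predictions against boolean labels."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels and votes, one entry per commit.
truth = [True, True, True, False, False, False]
message_votes = [True, True, False, True, False, False]  # classifier on commit messages
patch_votes = [True, False, True, False, True, False]    # classifier on code changes

combined = combine_unanimous(message_votes, patch_votes)
print(precision_recall(message_votes, truth))
print(precision_recall(combined, truth))
```

On this toy input, requiring agreement removes the false positives of each individual classifier at the cost of missing some true fixes, which is the kind of high-precision, moderate-recall behaviour the abstract reports.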
Submitted 6 July, 2018;
originally announced July 2018.
-
Machine-Readable Privacy Certificates for Services
Authors:
Marco Anisetti,
Claudio A. Ardagna,
Michele Bezzi,
Ernesto Damiani,
Antonino Sabetta
Abstract:
Privacy-aware processing of personal data on the web of services requires managing a number of issues arising both from the technical and the legal domain. Several approaches have been proposed for matching privacy requirements (on the client's side) and privacy guarantees (on the service provider's side). Still, the assurance of effective data protection (when possible) relies on substantial human effort and exposes organizations to significant (non-)compliance risks. In this paper we put forward the idea that a privacy certification scheme producing and managing machine-readable artifacts in the form of privacy certificates can play an important role towards the solution of this problem. Digital privacy certificates represent the reasons why a privacy property holds for a service and describe the privacy measures supporting it. Also, privacy certificates can be used to automatically select services whose certificates match the client policies (privacy requirements).
Our proposal relies on an evolution of the conceptual model developed in the Assert4Soa project and on a certificate format specifically tailored to represent privacy properties. To validate our approach, we present a worked-out instance showing how the privacy property Retention-based unlinkability can be certified for a banking financial service.
Submitted 26 July, 2013;
originally announced July 2013.
-
Towards understanding and modelling office daily life
Authors:
Michele Bezzi,
Robin Groenevelt
Abstract:
Measuring and modeling human behavior is a very complex task. In this paper we present our initial thoughts on modeling and automatic recognition of some human activities in an office. We argue that to successfully model human activities, we need to consider both individual behavior and group dynamics. To demonstrate these theoretical approaches, we introduce an experimental system for analyzing everyday activity in our office.
Submitted 13 June, 2007;
originally announced June 2007.
-
Quantifying the information transmitted in a single stimulus
Authors:
Michele Bezzi
Abstract:
Shannon mutual information provides a measure of how much information is, on average, contained in a set of neural activities about a set of stimuli. It has been extensively used to study neural coding in different brain areas. To apply a similar approach to investigate single stimulus encoding, we need to introduce a quantity specific for a single stimulus. This quantity has been defined in the literature by four different measures, but none of them satisfies the same intuitive properties (non-negativity, additivity) that characterize mutual information. We present here a detailed analysis of the different meanings and properties of these four definitions. We show that all these measures satisfy, at least, a weaker additivity condition, i.e. limited to the response set. This allows us to use them for analysing correlated coding, as we illustrate in a toy example from hippocampal place cells.
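For reference, the standard definition of the mutual information between a stimulus set $S$ and a response set $R$ is given below, together with the averaging property that a stimulus-specific measure is usually expected to satisfy. The symbols $p(s,r)$, $p(s)$, $p(r)$, and $i(s)$ are notation assumed here and are not taken from the abstract; the four specific measures the paper compares are not reproduced.

```latex
I(S;R) = \sum_{s \in S} \sum_{r \in R} p(s,r)\, \log_2 \frac{p(s,r)}{p(s)\,p(r)},
\qquad
I(S;R) = \sum_{s \in S} p(s)\, i(s)
```

The second relation states that a single-stimulus quantity $i(s)$ should recover the mutual information when averaged over the stimulus distribution; different choices of $i(s)$ can satisfy it while differing in non-negativity or additivity.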
Submitted 23 January, 2006;
originally announced January 2006.
-
Mathematical modeling of filamentous microorganisms
Authors:
Michele Bezzi,
Andrea Ciliberto
Abstract:
Growth patterns generated by filamentous organisms (e.g. actinomycetes and fungi) involve spatial and temporal dynamics at different length scales. Several mathematical models have been proposed in the last thirty years to address these specific dynamics. Phenomenological macroscopic models are able to reproduce the temporal dynamics of colony-related quantities (e.g. colony growth rate) but do not explain the development of mycelial morphologies nor the single hyphal growth. Reaction-diffusion models are a bridge between macroscopic and microscopic worlds as they produce mean-field approximations of single-cell behaviors. Microscopic models describe intracellular events, such as branching, septation and translocation. Finally, completely discrete models, cellular automata, simulate the microscopic interaction among cells to reproduce emergent cooperative behaviors of large colonies. In this comment, we review a selection of models for each of these length scales, stressing their advantages and shortcomings.
Submitted 3 February, 2004;
originally announced February 2004.
-
Measuring information spatial densities
Authors:
Michele Bezzi,
Ines Samengo,
Stefan Leutgeb,
Sheri Mizumori
Abstract:
A novel definition of the stimulus-specific information is presented, which is particularly useful when the stimuli constitute a continuous and metric set, as for example, position in space. The approach allows one to build the spatial information distribution of a given neural response. The method is applied to the investigation of putative differences in the coding of position in hippocampus and lateral septum.
Submitted 9 November, 2001;
originally announced November 2001.
-
Redundancy and synergy arising from correlations in large ensembles
Authors:
Michele Bezzi,
Mathew E. Diamond,
Alessandro Treves
Abstract:
Multielectrode arrays allow recording of the activity of many single neurons, from which correlations can be calculated. The functional roles of correlations can be revealed by the measures of the information conveyed by neuronal activity; a simple formula has been shown to discriminate the information transmitted by individual spikes from the positive or negative contributions due to correlations (Panzeri et al., Proc. Roy. Soc. B, 266: 1001-1012 (1999)). The formula quantifies the corrections to the single-unit instantaneous information rate which result from correlations in spike emission between pairs of neurons. Positive corrections imply synergy, while negative corrections indicate redundancy. Here, this analysis, previously applied to recordings from small ensembles, is developed further by considering a model of a large ensemble, in which correlations among the signal and noise components of neuronal firing are small in absolute value and entirely random in origin. Even such small random correlations are shown to lead to large possible synergy or redundancy, whenever the time window for extracting information from neuronal firing extends to the order of the mean interspike interval. In addition, a sample of recordings from rat barrel cortex illustrates the mean time window at which such "corrections" dominate when correlations are, as often in the real brain, neither random nor small. The presence of correlations of this kind for a large ensemble of cells restricts further the time of validity of the expansion, unless what is decodable by the receiver is also taken into account.
Submitted 7 December, 2000;
originally announced December 2000.
-
Small world effects in evolution
Authors:
Franco Bagnoli,
Michele Bezzi
Abstract:
For asexual organisms point mutations correspond to local displacements in the genotypic space, while other genotypic rearrangements represent long-range jumps. We investigate the spreading properties of an initially homogeneous population in a flat fitness landscape, and the equilibrium properties on a smooth fitness landscape. We show that a small-world effect is present: even a small fraction of quenched long-range jumps makes the results indistinguishable from those obtained by assuming all mutations equiprobable. Moreover, we find that the equilibrium distribution is a Boltzmann one, in which the fitness plays the role of an energy, and mutations that of a temperature.
Submitted 9 February, 2001; v1 submitted 28 July, 2000;
originally announced July 2000.
-
Pattern formation by competition: a biological example
Authors:
M. Bezzi,
A. Ciliberto,
A. Mengoni
Abstract:
We present a simple model based on a reaction-diffusion equation to explain pattern formation in a multicellular bacterium (Streptomyces). We assume competition for resources as the basic mechanism that leads to pattern formation; in particular we are able to reproduce the spatial pattern formed by bacterial aerial mycelium in case of growth in minimal (low resources) and maximal (large resources) culture media.
Submitted 7 September, 1999;
originally announced September 1999.
-
An evolutionary model for simple ecosystems
Authors:
Franco Bagnoli,
Michele Bezzi
Abstract:
In this review some simple models of asexual populations evolving on smooth landscapes are studied. The basic model is based on a cellular automaton, which is analyzed here in the spatial mean-field limit. Firstly, the evolution on a fixed fitness landscape is considered. The correspondence between the time evolution of the population and equilibrium properties of a statistical mechanics system is investigated, finding the limits for which this mapping holds. The mutational meltdown, Eigen's error threshold and Muller's ratchet phenomena are studied in the framework of a simplified model. Finally, the shape of a quasi-species and the condition of coexistence of multiple species in a static fitness landscape are analyzed. In the second part, these results are applied to the study of the coexistence of quasi-species in the presence of competition, obtaining the conditions for a robust speciation effect in asexual populations.
Submitted 11 June, 1999;
originally announced June 1999.
-
Eigen's Error Threshold and Mutational Meltdown in a Quasispecies Model
Authors:
F. Bagnoli,
M. Bezzi
Abstract:
We introduce a toy model for interacting populations connected by mutations and limited by a shared resource. We study the presence of Eigen's error threshold and mutational meltdown. The phase diagram of the system shows that the extinction of the whole population due to mutational meltdown can occur well before an eventual error threshold transition.
Submitted 21 April, 1999; v1 submitted 30 July, 1998;
originally announced July 1998.
-
Phase Disorder Effects in a Cellular Automaton Model of Epidemic Propagation
Authors:
M. Bezzi,
R. Livi
Abstract:
A deterministic cellular automaton rule defined on the Moore neighbourhood is studied as a model of epidemic propagation. The directed nature of the interaction between cells allows one to introduce the dependence on a disorder parameter that determines the fraction of "in-phase" cells. Phase-disorder is shown to produce peculiar changes in the dynamical and statistical properties of the different evolution regimes obtained by varying the infection and the immunization periods. In particular, the finite-velocity spreading of perturbations, characterizing chaotic evolution, can be prevented by localization effects induced by phase-disorder, that may also yield spatial isotropy of the infection propagation as a statistical effect. Analogously, the structure of phase-synchronous ordered patterns is rapidly lost as soon as phase-disorder is increased, yielding a defect-mediated turbulent regime.
Submitted 5 May, 1998;
originally announced May 1998.
-
Species Formation in Simple Ecosystems
Authors:
Franco Bagnoli,
Michele Bezzi
Abstract:
In this paper we consider a microscopic model of a simple ecosystem. The basic ingredients of this model are individuals, and both the phenotypic and genotypic levels are taken into account. The model is based on a long-range cellular automaton (CA); introducing simple interactions between the individuals, we get some of the complex collective behaviors observed in a real ecosystem. Since our fitness function is smooth, the model does not exhibit the error threshold transition; on the other hand, the size of the total population is not kept constant, and the mutational meltdown transition is present. We study the effects of competition between genetically similar individuals and how it can lead to species formation. This speciation transition does not depend on the mutation rate. We also present an analytical approximation of the model.
Submitted 24 April, 1998; v1 submitted 3 April, 1998;
originally announced April 1998.
-
Speciation as Pattern Formation by Competition in a Smooth Fitness Landscape
Authors:
Franco Bagnoli,
Michele Bezzi
Abstract:
We investigate the problem of speciation and coexistence in simple ecosystems when the competition among individuals is included in the Eigen model for quasi-species. By suggesting an analogy between the competition among strains and the diffusion of a chemical inhibitor in a reaction-diffusion system, the speciation phenomenon is considered the analogue of chemical pattern formation in genetic space. In the limit of vanishing mutation rate we obtain analytically the conditions for speciation. Using different forms of the competition interaction we show that the speciation is absent for the genetic equivalent of a normal diffusing inhibitor, and is present for shorter-range interactions. The comparison with numerical simulations is very good.
Submitted 14 August, 1997;
originally announced August 1997.
-
Competition in a Fitness Landscape
Authors:
Franco Bagnoli,
Michele Bezzi
Abstract:
We present an extension of Eigen's model for quasi-species including the competition among individuals, proposed as the simplest mechanism for the formation of new species in a smooth fitness landscape. We are able to obtain analytically the critical threshold for species formation. The comparison with numerical simulations is very good.
Submitted 14 August, 1997; v1 submitted 13 February, 1997;
originally announced February 1997.
-
Transition between immune and disease states in a cellular automaton model of clonal immune response
Authors:
Michele Bezzi,
Franco Celada,
Stefano Ruffo,
Philip E. Seiden
Abstract:
In this paper we extend the Celada-Seiden (CS) model of the humoral immune response to include infectious virus and cytotoxic T lymphocytes (cellular response). The response of the system to virus involves a competition between the ability of the virus to kill the host cells and the host's ability to eliminate the virus. We find two basins of attraction in the dynamics of this system, one is identified with disease and the other with the immune state. There is also an oscillating state that exists on the border of these two stable states. Fluctuations in the population of virus or antibody can end the oscillation and drive the system into one of the stable states. The introduction of mechanisms of cross-regulation between the two responses can bias the system towards one of them. We also study a mean field model, based on coupled maps, to investigate virus-like infections. This simple model reproduces the attractors for average populations observed in the cellular automaton. All the dynamical behavior connected to spatial extension is lost, as is the oscillating feature. Thus the mean field approximation introduced with coupled maps destroys oscillations.
Submitted 2 January, 1997;
originally announced January 1997.