-
CRATOR: a Dark Web Crawler
Authors:
Daniel De Pascale,
Giuseppe Cascavilla,
Damian A. Tamburri,
Willem-Jan Van Den Heuvel
Abstract:
Dark web crawling is a complex process that involves specific methodologies and techniques to navigate the Tor network and extract data from hidden services. This study proposes a general dark web crawler designed to extract pages handling security protocols, such as captchas, efficiently. Our approach uses a combination of seed URL lists, link analysis, and scanning to discover new content. We also incorporate methods for user-agent rotation and proxy usage to maintain anonymity and avoid detection. We evaluate the effectiveness of our crawler using metrics such as coverage, performance and robustness. Our results demonstrate that our crawler effectively extracts pages handling security protocols while maintaining anonymity and avoiding detection. Our proposed dark web crawler can be used for various applications, including threat intelligence, cybersecurity, and online investigations.
Submitted 10 May, 2024;
originally announced May 2024.
-
Architectural Design Decisions for Self-Serve Data Platforms in Data Meshes
Authors:
Tom van Eijk,
Indika Kumara,
Dario Di Nucci,
Damian Andrew Tamburri,
Willem-Jan van den Heuvel
Abstract:
Data mesh is an emerging decentralized approach to managing and generating value from analytical enterprise data at scale. It shifts the ownership of the data to the business domains closest to the data, promotes sharing and managing data as autonomous products, and uses a federated and automated data governance model. The data mesh relies on a managed data platform that offers services to domain and governance teams to build, share, and manage data products efficiently. However, designing and implementing a self-serve data platform is challenging, and the platform engineers and architects must understand and choose the appropriate design options to ensure the platform will enhance the experience of domain and governance teams. For these reasons, this paper proposes a catalog of architectural design decisions and their corresponding decision options by systematically reviewing 43 industrial gray literature articles on self-serve data platforms in data mesh. Moreover, we used semi-structured interviews with six data engineering experts with data mesh experience to validate, refine, and extend the findings from the literature. Such a catalog of design decisions and options drawn from the state of practice shall aid practitioners in building data meshes while providing a baseline for further research on data mesh architectures.
Submitted 7 February, 2024;
originally announced February 2024.
-
When the Few Outweigh the Many: Illicit Content Recognition with Few-Shot Learning
Authors:
G. Cascavilla,
G. Catolino,
M. Conti,
D. Mellios,
D. A. Tamburri
Abstract:
The anonymity and untraceability benefits of the Dark web account for the exponentially-increased potential of its popularity while creating a suitable womb for many illicit activities, to date. Hence, in collaboration with cybersecurity and law enforcement agencies, research has provided approaches for recognizing and classifying illicit activities, most of which exploit textual content recognition over dark web markets; few such approaches use images that originated from dark web content. This paper investigates this alternative technique for recognizing illegal activities from images. In particular, we investigate label-agnostic learning techniques like One-Shot and Few-Shot learning featuring the use of Siamese neural networks, a state-of-the-art approach in the field. Our solution manages to handle small-scale datasets with promising accuracy. In particular, Siamese neural networks reach 90.9% on 20-Shot experiments over a 10-class dataset; this leads us to conclude that such models are a promising and cheaper alternative to the definition of automated law-enforcing machinery over the dark web.
Submitted 28 November, 2023;
originally announced November 2023.
-
Counter-terrorism in cyber-physical spaces: Best practices and technologies from the state of the art
Authors:
Giuseppe Cascavilla,
Damian A. Tamburri,
Francesco Leotta,
Massimo Mecella,
Willem-Jan Van Den Heuvel
Abstract:
Context: The demand for protection and security of physical spaces and urban areas increased with the escalation of terroristic attacks in recent years. We envision with the proposed cyber-physical systems and spaces, a city that would indeed become a smarter urbanistic object, proactively providing alerts and being protective against any threat. Objectives: This survey intends to provide a systematic multivocal literature survey comprising an updated, comprehensive and timely overview of the state of the art in counter-terrorism cyber-physical systems, hence aimed at the protection of cyber-physical spaces. It thereby provides guidelines to law enforcement agencies and practitioners, describing technologies and best practices for the protection of public spaces. Methods: We analyzed 112 papers collected from different online sources, both from the academic field and from websites and blogs ranging from 2004 till mid-2022. Results: a) There is no one single bullet-proof solution available for the protection of public spaces. b) From our analysis we found three major active fields for the protection of public spaces: Information Technologies, Architectural approaches, Organizational field. c) While academia suggests best practices and methodologies for the protection of urban areas, the market did not provide any type of implementation of such suggested approaches, which shows a lack of fertilization between academia and industry. Conclusion: The overall analysis has led us to state that there is no one single solution available; conversely, multiple methods and techniques can be put in place to guarantee safety and security in public spaces. The techniques range from architectural design, to rethink the design of public spaces keeping security into account in continuity, to emerging technologies such as AI and predictive surveillance.
Submitted 28 November, 2023;
originally announced November 2023.
-
Unveiling and unraveling aggregation and dispersion fallacies in group MCDM
Authors:
Majid Mohammadi,
Damian A. Tamburri,
Jafar Rezaei
Abstract:
Priorities in multi-criteria decision-making (MCDM) convey the relevance preference of one criterion over another, which is usually reflected by imposing the non-negativity and unit-sum constraints. The processing of such priorities is different than other unconstrained data, but this point is often neglected by researchers, which results in fallacious statistical analysis. This article studies three prevalent fallacies in group MCDM along with solutions based on compositional data analysis to avoid misusing statistical operations. First, we use a compositional approach to aggregate the priorities of a group of DMs and show that the outcome of the compositional analysis is identical to the normalized geometric mean, meaning that the arithmetic mean should be avoided. Furthermore, a new aggregation method is developed, which is a robust surrogate for the geometric mean. We also discuss the errors in computing measures of dispersion, including standard deviation and distance functions. Discussing the fallacies in computing the standard deviation, we provide a probabilistic criteria ranking by developing proper Bayesian tests, where we calculate the extent to which a criterion is more important than another. Finally, we explain the errors in computing the distance between priorities, and a clustering algorithm is specially tailored based on proper distance metrics.
Submitted 18 April, 2023;
originally announced April 2023.
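As a rough illustration of the aggregation claim in the abstract above, the following sketch combines the priority vectors of several decision-makers with the normalized geometric mean instead of the arithmetic mean. The function name and the sample priorities are hypothetical and are not taken from the paper; the robust surrogate method and the Bayesian tests that the abstract mentions are not shown.

```python
import math


def normalized_geometric_mean(priorities):
    """Aggregate priority vectors criterion by criterion.

    Each vector is assumed to hold strictly positive weights that sum to 1
    (an assumption of this sketch; zero weights would break the logarithm).
    """
    n_dms = len(priorities)
    n_criteria = len(priorities[0])
    geo = []
    for j in range(n_criteria):
        # Geometric mean of criterion j across decision-makers, via logs.
        log_sum = sum(math.log(p[j]) for p in priorities)
        geo.append(math.exp(log_sum / n_dms))
    # Re-normalize so the aggregated priorities again sum to 1.
    total = sum(geo)
    return [g / total for g in geo]


# Hypothetical priorities of three decision-makers over three criteria.
group = [
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
]
aggregated = normalized_geometric_mean(group)
```

The final division restores the unit-sum constraint, which the plain geometric mean does not preserve on its own.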
-
Data Mesh: a Systematic Gray Literature Review
Authors:
Abel Goedegebuure,
Indika Kumara,
Stefan Driessen,
Dario Di Nucci,
Geert Monsieur,
Willem-jan van den Heuvel,
Damian Andrew Tamburri
Abstract:
Data mesh is an emerging domain-driven decentralized data architecture that aims to minimize or avoid operational bottlenecks associated with centralized, monolithic data architectures in enterprises. The topic has piqued practitioners' interest, and there is considerable gray literature on it. At the same time, we observe a lack of academic attempts at defining and building upon the concept. Hence, in this article, we aim to start from the foundations and characterize the data mesh architecture regarding its design principles, architectural components, capabilities, and organizational roles. We systematically collected, analyzed, and synthesized 114 industrial gray literature articles. The review provides insights into practitioners' perspectives on the four key principles of data mesh: data as a product, domain ownership of data, self-serve data platform, and federated computational governance. Moreover, due to the comparability of data mesh and SOA (service-oriented architecture), we mapped the findings from the gray literature into the reference architectures from the SOA academic literature to create the reference architectures for describing three key dimensions of data mesh: organization of capabilities and roles, development, and runtime. Finally, we discuss open research issues in data mesh, partially based on the findings from the gray literature.
Submitted 1 June, 2024; v1 submitted 3 April, 2023;
originally announced April 2023.
-
Architecture Smells vs. Concurrency Bugs: an Exploratory Study and Negative Results
Authors:
Damian Andrew Tamburri,
Francesca Arcelli Fontana,
Riccardo Roveda,
Valentina Lenarduzzi
Abstract:
Technical debt occurs in many different forms across software artifacts. One such form is connected to software architectures where debt emerges in the form of structural anti-patterns across architecture elements, namely, architecture smells. As defined in the literature, "Architecture smells are recurrent architectural decisions that negatively impact internal system quality", thus increasing technical debt. In this paper, we aim at exploring whether there exist manifestations of architectural technical debt beyond decreased code or architectural quality, namely, whether there is a relation between architecture smells (which primarily reflect structural characteristics) and the occurrence of concurrency bugs (which primarily manifest at runtime). We study 125 releases of 5 large data-intensive software systems to reveal that (1) several architecture smells may in fact indicate the presence of concurrency problems likely to manifest at runtime but (2) smells are not correlated with concurrency in general -- rather, for specific concurrency bugs they must be combined with an accompanying articulation of specific project characteristics such as project distribution. As an example, a cyclic dependency could be present in the code, but the specific execution flow might never be executed at runtime.
Submitted 31 March, 2023;
originally announced March 2023.
-
Microservice Architecture Practices and Experience: a Focused Look on Docker Configuration Files
Authors:
Luciano Baresi,
Giovanni Quattrocchi,
Damian Andrew Tamburri
Abstract:
Cloud applications are more and more microservice-oriented, but a concrete charting of the microservices architecture landscape -- namely, the space of technical options available for microservice software architects in their decision-making -- is still very much lacking, thereby limiting the ability of software architects to properly evaluate their architectural decisions with sound experiential devices and/or practical design principles. On the one hand, microservices are fine-grained, loosely coupled services that communicate through lightweight protocols. On the other hand, each microservice can use a different software stack, be deployed and scaled independently or even executed in different containers, which provide isolation and a wide range of configuration options but also offer unforeseeable architectural interactions and under-explored architecture smells, with such experience captured mainly in software repositories where such solutions are cycled.
This paper adopts a mining software repositories (MSR) approach to capture the practice within the microservice architecture landscape, by eliciting and analysing Docker configuration files, Docker being the leading technical device to design for and implement modern microservices. Our analysis of Docker-based microservices gives an interesting summary of the current state of microservices practice and experience. Conversely, observing that all our datapoints have their own shape and characteristics, we conclude that further comparative assessment with industrial systems is needed to better address the recurring positive principles and patterns around microservices.
Submitted 6 December, 2022;
originally announced December 2022.
-
Blockchain-Oriented Services Computing in Action: Insights from a User Study
Authors:
Giovanni Quattrocchi,
Damian Andrew Tamburri,
Willem-Jan Van Den Heuvel
Abstract:
Blockchain architectures promise disruptive innovation but factually they pose many architectural restrictions to classical service-based applications and show considerable design, implementation, and operations overhead. Furthermore, the relation between such overheads and user benefits is not clear yet. To shed light on the aforementioned relations, a service-based blockchain architecture was designed and deployed as part of a field study in real-life experimentation. An observational approach was then performed to elaborate on the technology-acceptance of the service-based blockchain architecture in question. Evidence shows that the resulting architecture is, in principle, not different than other less complex equivalents; furthermore, the architectural limitations posed by the blockchain-oriented design demand a significant additional effort to be put onto even the simplest of functionalities. We conclude that further research shall be invested in clarifying further the design principles we learned as part of this study as well as any trade-offs posed by blockchain-oriented service design and operation.
Submitted 22 September, 2022;
originally announced September 2022.
-
A Declarative Modelling Framework for the Deployment and Management of Blockchain Applications
Authors:
Luciano Baresi,
Giovanni Quattrocchi,
Damian Andrew Tamburri,
Luca Terracciano
Abstract:
The deployment and management of Blockchain applications require non-trivial efforts given the unique characteristics of their infrastructure (i.e., immutability) and the complexity of the software systems being executed. The operation of Blockchain applications is still based on ad-hoc solutions that are error-prone, difficult to maintain and evolve, and do not manage their interactions with other infrastructures (e.g., a Cloud backend).
This paper proposes KATENA, a framework for the deployment and management of Blockchain applications. In particular, it focuses on applications that are compatible with Ethereum, a popular general-purpose Blockchain technology. KATENA provides i) a metamodel for defining Blockchain applications, ii) a set of processes to automate the deployment and management of defined models, and iii) an implementation of the approach based on TOSCA, a standard language for Infrastructure-as-Code, and xOpera, a TOSCA-compatible orchestrator. To evaluate the approach, we applied KATENA to model and deploy three real-world Blockchain applications, and showed that our solution reduces the amount of code required for their operations by up to 82.7%.
Submitted 12 September, 2022;
originally announced September 2022.
-
Real-world K-Anonymity Applications: the KGen approach and its evaluation in Fraudulent Transactions
Authors:
Daniel De Pascale,
Giuseppe Cascavilla,
Damian A. Tamburri,
Willem-Jan Van Den Heuvel
Abstract:
K-Anonymity is a property for the measurement, management, and governance of data anonymization. Many implementations of k-anonymity have been described in the state of the art, but most of them are not able to work with a large number of attributes in a "Big" dataset, i.e., a dataset drawn from Big Data. To address this significant shortcoming, we introduce and evaluate KGen, an approach to K-anonymity featuring Genetic Algorithms. KGen promotes such a meta-heuristic approach since it can solve the problem by finding a pseudo-optimal solution in a reasonable time over a considerable load of input. KGen allows the data manager to guarantee a high anonymity level while preserving the usability and preventing loss of information entropy over the data. Differently from other approaches that provide optimal global solutions catered for small datasets, KGen works properly also over Big datasets while still providing a good-enough solution. Evaluation results show how our approach can still work efficiently on a real-world dataset, provided by the Dutch Tax Authority, with 47 attributes (i.e., the columns of the dataset to be anonymized) and over 1.5K observations (i.e., the rows of that dataset), as well as on a dataset with 97 attributes and over 3942 observations.
Submitted 31 March, 2022;
originally announced April 2022.
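For readers unfamiliar with the property itself, the sketch below measures the k of a small table: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k rows. This only checks the property; it is not the KGen genetic algorithm, and the function name, column names, and records are hypothetical illustrations.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the rows are k-anonymous.

    Rows are grouped by their quasi-identifier values; k is the size of the
    smallest group. An empty table yields 0 by convention in this sketch.
    """
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalized records (age ranges, masked zip codes).
records = [
    {"age": "30-39", "zip": "50xx", "income": 41000},
    {"age": "30-39", "zip": "50xx", "income": 52000},
    {"age": "40-49", "zip": "51xx", "income": 38000},
    {"age": "40-49", "zip": "51xx", "income": 47000},
    {"age": "40-49", "zip": "51xx", "income": 60000},
]
k = k_anonymity(records, ["age", "zip"])
```

Adding more attributes to the quasi-identifier list can only keep k the same or lower it, which hints at why datasets with many attributes are hard to anonymize without heavy generalization.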
-
Internet-of-Things Architectures for Secure Cyber-Physical Spaces: the VISOR Experience Report
Authors:
Daniel De Pascale,
Giuseppe Cascavilla,
Mirella Sangiovanni,
Damian A. Tamburri,
Willem-Jan van den Heuvel
Abstract:
Internet of things (IoT) technologies are becoming a more and more widespread part of civilian life in common urban spaces, which are rapidly turning into cyber-physical spaces. Simultaneously, the fear of terrorism and crime in such public spaces is ever-increasing. Due to the resulting increased demand for security, video-based IoT surveillance systems have become an important area for research. Considering the large number of devices involved in the illicit recognition task, we conducted a field study in a Dutch Easter music festival in a national interest project called VISOR to select the most appropriate device configuration in terms of performance and results. We iteratively architected solutions for the security of cyber-physical spaces using IoT devices. We tested the performance of multiple federated devices encompassing drones, closed-circuit television, smart phone cameras, and smart glasses to detect real-case scenarios of potentially malicious activities such as mosh-pits and pick-pocketing. Our results pave the way to select optimal IoT architecture configurations -- i.e., a mix of CCTV, drones, smart glasses, and camera phones in our case -- to make safer cyber-physical spaces a reality.
Submitted 1 April, 2022;
originally announced April 2022.
-
In Search of Socio-Technical Congruence: A Large-Scale Longitudinal Study
Authors:
Wolfgang Mauerer,
Mitchell Joblin,
Damian A. Tamburri,
Carlos Paradis,
Rick Kazman,
Sven Apel
Abstract:
We report on a large-scale empirical study investigating the relevance of socio-technical congruence over key basic software quality metrics, namely, bugs and churn. In particular, we explore whether alignment or misalignment of social communication structures and technical dependencies in large software projects influences software quality. To this end, we have defined a quantitative and operational notion of socio-technical congruence, which we call socio-technical motif congruence (STMC). STMC is a measure of the degree to which developers working on the same file or on two related files need to communicate. As socio-technical congruence is a complex and multi-faceted phenomenon, the interpretability of the results is one of our main concerns, so we have employed a careful mixed-methods statistical analysis. In particular, we provide analyses with similar techniques as employed by seminal work in the field to ensure comparability of our results with the existing body of work. The major result of our study, based on an analysis of 25 large open-source projects, is that STMC is not related to project quality measures -- software bugs and churn -- in any temporal scenario. That is, we find no statistical relationship between the alignment of developer tasks and developer communications on the one hand, and project outcomes on the other hand. We conclude that, if congruence does matter as the literature shows, then its measurable effect lies elsewhere.
Submitted 17 May, 2021;
originally announced May 2021.
-
An efficient projection neural network for $\ell_1$-regularized logistic regression
Authors:
Majid Mohammadi,
Amir Ahooye Atashin,
Damian A. Tamburri
Abstract:
$\ell_1$ regularization has been used for logistic regression to circumvent the overfitting and use the estimated sparse coefficient for feature selection. However, the challenge of such a regularization is that the $\ell_1$ norm is not differentiable, making the standard algorithms for convex optimization not applicable to this problem. This paper presents a simple projection neural network for $\ell_1$-regularized logistic regression. In contrast to many available solvers in the literature, the proposed neural network does not require any extra auxiliary variable nor any smooth approximation, and its complexity is almost identical to that of the gradient descent for logistic regression without $\ell_1$ regularization, thanks to the projection operator. We also investigate the convergence of the proposed neural network by using the Lyapunov theory and show that it converges to a solution of the problem with any arbitrary initial value. The proposed neural solution significantly outperforms state-of-the-art methods with respect to the execution time and is competitive in terms of accuracy and AUROC.
Submitted 12 May, 2021;
originally announced May 2021.
-
Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers
Authors:
Therese Fehrer,
Rocío Cabrera Lozoya,
Antonino Sabetta,
Dario Di Nucci,
Damian A. Tamburri
Abstract:
The sources of reliable, code-level information about vulnerabilities that affect open-source software (OSS) are scarce, which hinders a broad adoption of advanced tools that provide code-level detection and assessment of vulnerable OSS dependencies.
In this paper, we study the extent to which the output of off-the-shelf static code analyzers can be used as a source of features to represent commits in Machine Learning (ML) applications. In particular, we investigate how such features can be used to construct embeddings and train ML models to automatically identify source code commits that contain vulnerability fixes.
We analyze such embeddings for security-relevant and non-security-relevant commits, and we show that, although in isolation they are not different in a statistically significant manner, it is possible to use them to construct an ML pipeline that achieves results comparable with the state of the art.
We also found that the combination of our method with commit2vec represents a tangible improvement over the state of the art in the automatic identification of commits that fix vulnerabilities: the ML models we construct and commit2vec are complementary, the former being more generally applicable, albeit not as accurate.
Submitted 7 May, 2021;
originally announced May 2021.
-
QSOC: Quantum Service-Oriented Computing
Authors:
Indika Kumara,
Willem-Jan Van Den Heuvel,
Damian A. Tamburri
Abstract:
Quantum computing is quickly turning from a promise to a reality, witnessing the launch of several cloud-based, general-purpose offerings, and IDEs. Unfortunately, however, existing solutions typically implicitly assume intimate knowledge about quantum computing concepts and operators. This paper introduces Quantum Service-Oriented Computing (QSOC), including a model-driven methodology to allow enterprise DevOps teams to compose, configure and operate enterprise applications without intimate knowledge on the underlying quantum infrastructure, advocating knowledge reuse, separation of concerns, resource optimization, and mixed quantum- & conventional QSOC applications.
Submitted 4 May, 2021;
originally announced May 2021.
-
DataOps for Societal Intelligence: a Data Pipeline for Labor Market Skills Extraction and Matching
Authors:
Damian Andrew Tamburri,
Willem-Jan Van den Heuvel,
Martin Garriga
Abstract:
Big Data analytics supported by AI algorithms can support skills localization and retrieval in the context of a labor market intelligence problem. We formulate and solve this problem through specific DataOps models, blending data sources from administrative and technical partners in several countries into cooperation, creating shared knowledge to support policy and decision-making. We then focus on the critical task of skills extraction from resumes and vacancies featuring state-of-the-art machine learning models. We showcase preliminary results with applied machine learning on real data from the employment agencies of the Netherlands and the Flemish region in Belgium. The final goal is to match these skills to standard ontologies of skills, jobs and occupations.
Submitted 5 April, 2021;
originally announced April 2021.
-
Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories
Authors:
Daan Hommersom,
Antonino Sabetta,
Bonaventura Coppola,
Dario Di Nucci,
Damian A. Tamburri
Abstract:
The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this paper, we present an approach that combines heuristics stemming from practical experience and machine-learning (ML) - specifically, natural language processing (NLP) - to address this problem. Our method consists of three phases. First, an advisory record containing key information about a vulnerability is extracted from an advisory (expressed in natural language). Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project by filtering out commits that are known to be irrelevant for the task at hand. Finally, for each such candidate commit, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. The feature vectors are then exploited for building a final ranked list of candidate fixing commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to interpret the predictions.
We evaluated our approach using a prototype implementation named FixFinder on a manually curated data set that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03% of the vulnerabilities (with a fix commit on the first position for 65.06% of the vulnerabilities). In conclusion, our method reduces considerably the effort needed to search OSS repositories for the commits that fix known vulnerabilities.
Submitted 10 May, 2023; v1 submitted 24 March, 2021;
originally announced March 2021.
-
Automated Test-Case Generation for Solidity Smart Contracts: the AGSolT Approach and its Evaluation
Authors:
Stefan Driessen,
Dario Di Nucci,
Geert Monsieur,
Damian A. Tamburri,
Willem-Jan van den Heuvel
Abstract:
Blockchain and smart contract technology are novel approaches to data and code management that facilitate trusted computing by allowing for development in a distributed and decentralized manner. Testing smart contracts comes with its own set of challenges which have not yet been fully identified and explored. Although existing tools can identify and discover known vulnerabilities and their interactions on the Ethereum blockchain through random search or symbolic execution, these tools generally do not produce test suites suitable for human oracles. In this paper, we present AGSOLT (Automated Generator of Solidity Test Suites). We demonstrate its efficiency by implementing two search algorithms to automatically generate test suites for stand-alone Solidity smart contracts, taking into account some of the blockchain-specific challenges. To test AGSOLT, we compared a random search algorithm and a genetic algorithm on a set of 36 real-world smart contracts. We found that AGSOLT is capable of achieving high branch coverage with both approaches and even discovered some errors in some of the most popular Solidity smart contracts on GitHub.
Submitted 15 April, 2022; v1 submitted 17 February, 2021;
originally announced February 2021.
-
DeepIaC: Deep Learning-Based Linguistic Anti-pattern Detection in IaC
Authors:
Nemania Borovits,
Indika Kumara,
Parvathy Krishnan,
Stefano Dalla Palma,
Dario Di Nucci,
Fabio Palomba,
Damian A. Tamburri,
Willem-Jan van den Heuvel
Abstract:
Linguistic anti-patterns are recurring poor practices concerning inconsistencies among the naming, documentation, and implementation of an entity. They impede readability, understandability, and maintainability of source code. This paper attempts to detect linguistic anti-patterns in infrastructure as code (IaC) scripts used to provision and manage computing environments. In particular, we consider inconsistencies between the logic/body of IaC code units and their names. To this end, we propose a novel automated approach that employs word embeddings and deep learning techniques. We build and use the abstract syntax tree of IaC code units to create their code embedments. Our experiments with a dataset systematically extracted from open source repositories show that our approach yields an accuracy between 0.785 and 0.915 in detecting inconsistencies.
Submitted 22 September, 2020;
originally announced September 2020.
-
Blockchain and Cryptocurrencies: a Classification and Comparison of Architecture Drivers
Authors:
Martin Garriga,
Stefano Dalla Palma,
Maximiliano Arias,
Alan De Renzis,
Remo Pareschi,
Damian Andrew Tamburri
Abstract:
Blockchain is a decentralized transaction and data management solution, the technological leap behind the success of Bitcoin and other cryptocurrencies. As the variety of existing blockchains and distributed ledgers continues to increase, adopters should focus on selecting the solution that best fits their needs and the requirements of their decentralized applications, rather than developing yet another blockchain from scratch. In this paper we present a conceptual framework to aid software architects, developers, and decision makers to adopt the right blockchain technology. The framework exposes the interrelation between technological decisions and architectural features, capturing the knowledge from existing academic literature, industrial products, technical forums/blogs, and experts' feedback. We empirically show the applicability of our framework by dissecting the platforms behind Bitcoin and other top 10 cryptocurrencies, aided by a focus group with researchers and industry practitioners. Then, we leverage the framework together with key notions of the Architectural Tradeoff Analysis Method (ATAM) to analyze four real-world blockchain case studies from industry and academia. Results show that applying our framework leads to a deeper understanding of the architectural tradeoffs, allowing adopters to assess technologies more objectively and select the one that best fits developers' needs, ultimately cutting costs, reducing time-to-market and accelerating return on investment.
Submitted 23 July, 2020;
originally announced July 2020.
-
Towards Semantic Detection of Smells in Cloud Infrastructure Code
Authors:
Indika Kumara,
Zoe Vasileiou,
Georgios Meditskos,
Damian A. Tamburri,
Willem-Jan Van Den Heuvel,
Anastasios Karakostas,
Stefanos Vrochidis,
Ioannis Kompatsiaris
Abstract:
Automated deployment and management of Cloud applications rely on descriptions of their deployment topologies, often referred to as Infrastructure Code. As the complexity of applications and their deployment models increases, developers inadvertently introduce software smells to such code specifications, for instance, violations of good coding practices, modular structure, and more. This paper presents a knowledge-driven approach enabling developers to identify the aforementioned smells in deployment descriptions. We detect smells with SPARQL-based rules over pattern-based OWL 2 knowledge graphs capturing deployment models. We show the feasibility of our approach with a prototype and three case studies.
Submitted 4 July, 2020;
originally announced July 2020.
-
Success and Failure in Software Engineering: a Followup Systematic Literature Review
Authors:
Damian A. Tamburri,
Fabio Palomba,
Rick Kazman
Abstract:
Success and failure in software engineering are still among the least understood phenomena in the discipline. In a recent special journal issue on the topic, Mantyla et al. started discussing these topics from different angles; the authors focused their contributions on offering a general overview of both topics without deeper detail. Recognising the importance and impact of the topic, we have executed a followup, more in-depth systematic literature review with additional analyses beyond what was previously provided. These new analyses offer: (a) a grounded theory of success and failure factors, harvesting 500+ factors from the literature; (b) 14 manually-validated clusters of factors that provide relevant areas for success- and failure-specific measurement and risk-analysis; (c) a quality model composed of previously unmeasured organizational structure quantities which are germane to software product, process, and community quality. We show that the topics of success and failure deserve further study as well as further automated tool support, e.g., monitoring tools and metrics able to track the factors and patterns emerging from our study. This paper provides managers with risks as well as a more fine-grained analysis of the parameters that can be appraised to anticipate the risks.
Submitted 22 June, 2020;
originally announced June 2020.
-
Towards a Catalogue of Software Quality Metrics for Infrastructure Code
Authors:
Stefano Dalla Palma,
Dario Di Nucci,
Fabio Palomba,
Damian A. Tamburri
Abstract:
Infrastructure-as-code (IaC) is a practice to implement continuous deployment by allowing management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. On the one hand, although IaC is an increasingly widely adopted practice, little is still known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC practice in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed. However, unlike general-purpose programming languages (GPLs), IaC scripts use domain-specific languages, and metrics used for GPLs may not be applicable for IaC scripts. This article proposes a catalogue consisting of 46 metrics to identify IaC properties focusing on Ansible, one of the most popular IaC languages to date, and shows how they can be used to analyze IaC scripts.
Submitted 7 July, 2020; v1 submitted 27 May, 2020;
originally announced May 2020.
-
Organisational Structure Patterns in Agile Teams: An Industrial Empirical Study
Authors:
Damian A. Tamburri,
Rick Kazman,
Hamed Fahimi
Abstract:
Forming members of an organization into coherent groups or communities is an important issue in any large-scale software engineering endeavour, especially so in agile software development teams which rely heavily on self-organisation and organisational flexibility. To address this problem, many researchers and practitioners have advocated a strategy of mirroring system structure and organisational structure, to simplify communication and coordination of collaborative work. But what are the patterns of organisation found in practice in agile software communities and how effective are those patterns? We address these research questions using mixed-methods research in industry, that is, interview surveys, focus-groups, and delphi studies of agile teams. In our study of 30 agile software organisations we found that, out of 7 organisational structure patterns that recur across our dataset, a single organisational pattern occurs over 37% of the time. This pattern: (a) reflects young communities (1-12 months old); (b) disappears in established ones (13+ months); (c) reflects the highest number of architecture issues reported. Finally, we observe a negative correlation between a proposed organisational measure and architecture issues. These insights may serve to aid architects in designing not only their architectures but also their communities to best support their co-evolution.
Submitted 16 April, 2020;
originally announced April 2020.
-
SDSN@RT: a middleware environment for single-instance multi-tenant cloud applications
Authors:
Indika Kumara,
Jun Han,
Alan Colman,
Willem-Jan van den Heuvel,
Damian A. Tamburri,
Malinda Kapuruge
Abstract:
With the Single-Instance Multi-Tenancy (SIMT) model for composite Software-as-a-Service (SaaS) applications, a single composite application instance can host multiple tenants, yielding the benefits of better service and resource utilization, and reduced operational cost for the SaaS provider. An SIMT application needs to share services and their aggregation (the application) among its tenants while supporting variations in the functional and performance requirements of the tenants. The SaaS provider requires a middleware environment that can deploy, enact and manage a designed SIMT application, to achieve the varied requirements of the different tenants in a controlled manner. This paper presents the SDSN@RT (Software-Defined Service Networks @ RunTime) middleware environment that can meet the aforementioned requirements. SDSN@RT represents an SIMT composite cloud application as a multi-tenant service network, where the same service network simultaneously hosts a set of virtual service networks (VSNs), one for each tenant. A service network connects a set of services, and coordinates the interactions between them. A VSN realizes the requirements for a specific tenant and can be deployed, configured, and logically isolated in the service network at runtime. SDSN@RT also supports the monitoring and runtime changes of the deployed multi-tenant service networks. We show the feasibility of SDSN@RT with a prototype implementation, and demonstrate its capabilities to host SIMT applications and support their changes with a case study. The performance study of the prototype implementation shows that the runtime capabilities of our middleware incur little overhead.
Submitted 10 February, 2020;
originally announced February 2020.
-
Blockchains: a Systematic Multivocal Literature Review
Authors:
Bert-Jan Butijn,
Damian A. Tamburri,
Willem-Jan Van Den Heuvel
Abstract:
Blockchain technology has gained tremendous popularity both in practice and academia. The goal of this article is to develop a coherent overview of the state of the art in blockchain technology, using a systematic (i.e., protocol-based, replicable), multivocal (i.e., featuring both white and grey literature alike) literature review, to (1) define blockchain technology, (2) elaborate on its architecture options and (3) trade-offs, as well as understanding (4) the current applications and challenges, as evident from the state of the art. We derive a systematic definition of blockchain technology, based on a formal concept analysis. Further on, we flesh out an overview of blockchain technology elaborated by means of Grounded-Theory.
Submitted 26 November, 2019;
originally announced November 2019.
-
Towards Surgically-Precise Technical Debt Estimation: Early Results and Research Roadmap
Authors:
Valentina Lenarduzzi,
Antonio Martini,
Davide Taibi,
Damian Andrew Tamburri
Abstract:
The concept of technical debt has been explored from many perspectives but its precise estimation is still under heavy empirical and experimental inquiry. We aim to understand whether, by harnessing approximate, data-driven, machine-learning approaches, it is possible to improve the current techniques for technical debt estimation, as represented by a top industry quality analysis tool such as SonarQube. For the sake of simplicity, we focus on relatively simple regression modelling techniques and apply them to modelling the additional project cost connected to the sub-optimal conditions existing in the projects under study. Our results show that current techniques can be improved towards a more precise estimation of technical debt, and the case study shows promising results towards more accurate estimation of technical debt.
Submitted 2 August, 2019;
originally announced August 2019.