Search | arXiv e-print repository

doi 10.1109/TDSC.2023.3346195

Highly Available Blockchain Nodes With N-Version Design

Authors: Javier Ron, César Soto-Valero, Long Zhang, Benoit Baudry, Martin Monperrus

Abstract: As all software, blockchain nodes are exposed to faults in their underlying execution stack. Unstable execution environments can disrupt the availability of blockchain nodes interfaces, resulting in downtime for users. This paper introduces the concept of N-version Blockchain nodes. This new type of node relies on simultaneous execution of different implementations of the same blockchain protocol,… ▽ More As all software, blockchain nodes are exposed to faults in their underlying execution stack. Unstable execution environments can disrupt the availability of blockchain nodes interfaces, resulting in downtime for users. This paper introduces the concept of N-version Blockchain nodes. This new type of node relies on simultaneous execution of different implementations of the same blockchain protocol, in the line of Avizienis' N-version programming vision. We design and implement an N-version blockchain node prototype in the context of Ethereum, called N-ETH. We show that N-ETH is able to mitigate the effects of unstable execution environments and significantly enhance availability under environment faults. To simulate unstable execution environments, we perform fault injection at the system-call level. Our results show that existing Ethereum node implementations behave asymmetrically under identical instability scenarios. N-ETH leverages this asymmetric behavior available in the diverse implementations of Ethereum nodes to provide increased availability, even under our most aggressive fault-injection strategies. We are the first to validate the relevance of N-version design in the domain of blockchain infrastructure. From an industrial perspective, our results are of utmost importance for businesses operating blockchain nodes, including Google, ConsenSys, and many other major blockchain companies. △ Less

Submitted 7 February, 2024; v1 submitted 25 March, 2023; originally announced March 2023.

Journal ref: IEEE Transactions on Dependable and Secure Computing, 2023

arXiv:2303.11102 [pdf, other]

doi 10.1109/MSEC.2023.3302956

Challenges of Producing Software Bill Of Materials for Java

Authors: Musard Balliu, Benoit Baudry, Sofia Bobadilla, Mathias Ekstedt, Martin Monperrus, Javier Ron, Aman Sharma, Gabriel Skoglund, César Soto-Valero, Martin Wittlinger

Abstract: Software bills of materials (SBOM) promise to become the backbone of software supply chain hardening. We deep-dive into 6 tools and the accuracy of the SBOMs they produce for complex open-source Java projects. Our novel insights reveal some hard challenges for the accurate production and usage of SBOMs. Software bills of materials (SBOM) promise to become the backbone of software supply chain hardening. We deep-dive into 6 tools and the accuracy of the SBOMs they produce for complex open-source Java projects. Our novel insights reveal some hard challenges for the accurate production and usage of SBOMs. △ Less

Submitted 7 June, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Journal ref: IEEE Security & Privacy, 2023

arXiv:2302.08370 [pdf, other]

Automatic Specialization of Third-Party Java Dependencies

Authors: César Soto-Valero, Deepika Tiwari, Tim Toady, Benoit Baudry

Abstract: Large-scale code reuse significantly reduces both development costs and time. However, the massive share of third-party code in software projects poses new challenges, especially in terms of maintenance and security. In this paper, we propose a novel technique to specialize dependencies of Java projects, based on their actual usage. Given a project and its dependencies, we systematically identify… ▽ More Large-scale code reuse significantly reduces both development costs and time. However, the massive share of third-party code in software projects poses new challenges, especially in terms of maintenance and security. In this paper, we propose a novel technique to specialize dependencies of Java projects, based on their actual usage. Given a project and its dependencies, we systematically identify the subset of each dependency that is necessary to build the project, and we remove the rest. As a result of this process, we package each specialized dependency in a JAR file. Then, we generate specialized dependency trees where the original dependencies are replaced by the specialized versions. This allows building the project with significantly less third-party code than the original. As a result, the specialized dependencies become a first-class concept in the software supply chain, rather than a transient artifact in an optimizing compiler toolchain. We implement our technique in a tool called DepTrim, which we evaluate with 30 notable open-source Java projects. DepTrim specializes a total of 343 (86.6%) dependencies across these projects, and successfully rebuilds each project with a specialized dependency tree. Moreover, through this specialization, DepTrim removes a total of 57,444 (42.2%) classes from the dependencies, reducing the ratio of dependency classes to project classes from 8.7x in the original projects to 5.0x after specialization. These novel results indicate that dependency specialization significantly reduces the share of third-party code in Java projects. △ Less

Submitted 13 October, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: 17 pages, 2 figures, 4 tables, 1 algorithm, 2 code listings, 3 equations

arXiv:2202.07029 [pdf, other]

doi 10.1109/MC.2022.3175542

The Multibillion Dollar Software Supply Chain of Ethereum

Authors: César Soto-Valero, Martin Monperrus, Benoit Baudry

Abstract: The rise of blockchain technologies has triggered tremendous research interest, coding efforts, and monetary investments in the last decade. Ethereum is the single largest programmable blockchain platform today. It features cryptocurrency trading, digital art, and decentralized finance through smart contracts. So-called Ethereum nodes operate the blockchain, relying on a vast supply chain of third… ▽ More The rise of blockchain technologies has triggered tremendous research interest, coding efforts, and monetary investments in the last decade. Ethereum is the single largest programmable blockchain platform today. It features cryptocurrency trading, digital art, and decentralized finance through smart contracts. So-called Ethereum nodes operate the blockchain, relying on a vast supply chain of third-party software dependencies maintained by diverse organizations. These software suppliers have a direct impact on the reliability and the security of Ethereum. In this article, we perform an analysis of the software supply chain of Java Ethereum nodes and distill the challenges of maintaining and securing this blockchain technology. △ Less

Submitted 8 August, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: 8 pages, 2 figures, 2 tables

Journal ref: IEEE Computer, 2022

arXiv:2105.14226 [pdf, other]

A Longitudinal Analysis of Bloated Java Dependencies

Authors: César Soto-Valero, Thomas Durieux, Benoit Baudry

Abstract: We study the evolution and impact of bloated dependencies in a single software ecosystem: Java/Maven. Bloated dependencies are third-party libraries that are packaged in the application binary but are not needed to run the application. We analyze the history of 435 Java projects. This historical data includes 48,469 distinct dependencies, which we study across a total of 31,515 versions of Maven d… ▽ More We study the evolution and impact of bloated dependencies in a single software ecosystem: Java/Maven. Bloated dependencies are third-party libraries that are packaged in the application binary but are not needed to run the application. We analyze the history of 435 Java projects. This historical data includes 48,469 distinct dependencies, which we study across a total of 31,515 versions of Maven dependency trees. Bloated dependencies steadily increase over time, and 89.02% of the direct dependencies that are bloated remain bloated in all subsequent versions of the studied projects. This empirical evidence suggests that developers can safely remove a bloated dependency. We further report novel insights regarding the unnecessary maintenance efforts induced by bloat. We find that 22% of dependency updates performed by developers are made on bloated dependencies and that Dependabot suggests a similar ratio of updates on bloated dependencies. △ Less

Submitted 29 May, 2021; originally announced May 2021.

Comments: In Proceeding of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'2021)

arXiv:2103.09672 [pdf, other]

DUETS: A Dataset of Reproducible Pairs ofJava Library-Clients

Authors: Thomas Durieux, César Soto-Valero, Benoit Baudry

Abstract: Software engineering researchers look for software artifacts to study their characteristics or to evaluate new techniques. In this paper, we introduce DUETS, a new dataset of software libraries and their clients. This dataset can be exploited to gain many different insights, such as API usage, usage inputs, or novel observations about the test suites of clients and libraries. DUETS is meant to sup… ▽ More Software engineering researchers look for software artifacts to study their characteristics or to evaluate new techniques. In this paper, we introduce DUETS, a new dataset of software libraries and their clients. This dataset can be exploited to gain many different insights, such as API usage, usage inputs, or novel observations about the test suites of clients and libraries. DUETS is meant to support both static and dynamic analysis. This means that the libraries and the clients compile correctly, they are executable and their test suites pass. The dataset is composed of open-source projects that have more than five stars on GitHub. The final dataset contains 395 libraries and 2,874 clients. Additionally, we provide the raw data that we use to create this dataset, such as 34,560 pom.xml files or the complete file list from 34,560 projects. This dataset can be used to study how libraries are used by their clients or as a list of software projects that successfully build. The client's test suite can be used as an additional verification step for code transformation techniques that modify the libraries. △ Less

Submitted 17 March, 2021; originally announced March 2021.

Comments: 5 pages, accepted in Mining Software Repositories Conference 2021

arXiv:2010.07827 [pdf, other]

Interpretation of Swedish Sign Language using Convolutional Neural Networks and Transfer Learning

Authors: Gustaf Halvardsson, Johanna Peterson, César Soto-Valero, Benoit Baudry

Abstract: The automatic interpretation of sign languages is a challenging task, as it requires the usage of high-level vision and high-level motion processing systems for providing accurate image perception. In this paper, we use Convolutional Neural Networks (CNNs) and transfer learning in order to make computers able to interpret signs of the Swedish Sign Language (SSL) hand alphabet. Our model consist of… ▽ More The automatic interpretation of sign languages is a challenging task, as it requires the usage of high-level vision and high-level motion processing systems for providing accurate image perception. In this paper, we use Convolutional Neural Networks (CNNs) and transfer learning in order to make computers able to interpret signs of the Swedish Sign Language (SSL) hand alphabet. Our model consist of the implementation of a pre-trained InceptionV3 network, and the usage of the mini-batch gradient descent optimization algorithm. We rely on transfer learning during the pre-training of the model and its data. The final accuracy of the model, based on 8 study subjects and 9,400 images, is 85%. Our results indicate that the usage of CNNs is a promising approach to interpret sign languages, and transfer learning can be used to achieve high testing accuracy despite using a small training dataset. Furthermore, we describe the implementation details of our model to interpret signs as a user-friendly web application. △ Less

Submitted 15 October, 2020; originally announced October 2020.

arXiv:2008.08401 [pdf, other]

Coverage-Based Debloating for Java Bytecode

Authors: César Soto-Valero, Thomas Durieux, Nicolas Harrand, Benoit Baudry

Abstract: Software bloat is code that is packaged in an application but is actually not necessary to run the application. The presence of software bloat is an issue for security, for performance, and for maintenance. In this paper, we introduce a novel technique for debloating, which we call coverage-based debloating. We implement the technique for one single language: Java bytecode. We leverage a combinati… ▽ More Software bloat is code that is packaged in an application but is actually not necessary to run the application. The presence of software bloat is an issue for security, for performance, and for maintenance. In this paper, we introduce a novel technique for debloating, which we call coverage-based debloating. We implement the technique for one single language: Java bytecode. We leverage a combination of state-of-the-art Java bytecode coverage tools to precisely capture what parts of a project and its dependencies are used when running with a specific workload. Then, we automatically remove the parts that are not covered, in order to generate a debloated version of the project. We succeed to debloat 211 library versions from a dataset of 94 unique open-source Java libraries. The debloated versions are syntactically correct and preserve their original behavior according to the workload. Our results indicate that 68.3% of the libraries' bytecode and 20.3% of their total dependencies can be removed through coverage-based debloating. For the first time in the literature on software debloating, we assess the utility of debloated libraries with respect to client applications that reuse them. We select 988 client projects that either have a direct reference to the debloated library in their source code or which test suite covers at least one class of the libraries that we debloat. Our results show that 81.5% of the clients, with at least one test that uses the library, successfully compile and pass their test suite when the original library is replaced by its debloated version. △ Less

Submitted 19 May, 2022; v1 submitted 19 August, 2020; originally announced August 2020.

arXiv:2005.11315 [pdf, other]

doi 10.1016/j.jss.2020.110645

Java Decompiler Diversity and its Application to Meta-decompilation

Authors: Nicolas Harrand, César Soto-Valero, Martin Monperrus, Benoit Baudry

Abstract: During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, decompilation, which aims at producing source code from bytecode, relies on strategies to reconstruct the information that has been lost. Different Java decompilers use distinct strategies to achieve proper decompila… ▽ More During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, decompilation, which aims at producing source code from bytecode, relies on strategies to reconstruct the information that has been lost. Different Java decompilers use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled has a direct impact on the quality of the source code produced by decompilers. In this paper, we assess the strategies of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion and semantic equivalence modulo inputs. Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. The highest ranking decompiler in this study produces syntactically correct, and semantically equivalent code output for 84%, respectively 78%, of the classes in our dataset. Our results demonstrate that each decompiler correctly handles a different set of bytecode classes. We propose a new decompiler called Arlecchino that leverages the diversity of existing decompilers. To do so, we merge partial decompilation into a new one based on compilation errors. Arlecchino handles 37.6% of bytecode classes that were previously handled by no decompiler. We publish the sources of this new bytecode decompiler. △ Less

Submitted 21 May, 2020; originally announced May 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1908.06895

Journal ref: Journal of Systems and Software, 2020

arXiv:2001.07808 [pdf, other]

doi 10.1007/s10664-020-09914-8

A Comprehensive Study of Bloated Dependencies in the Maven Ecosystem

Authors: César Soto-Valero, Nicolas Harrand, Martin Monperrus, Benoit Baudry

Abstract: Build automation tools and package managers have a profound influence on software development. They facilitate the reuse of third-party libraries, support a clear separation between the application's code and its external dependencies, and automate several software development tasks. However, the wide adoption of these tools introduces new challenges related to dependency management. In this paper… ▽ More Build automation tools and package managers have a profound influence on software development. They facilitate the reuse of third-party libraries, support a clear separation between the application's code and its external dependencies, and automate several software development tasks. However, the wide adoption of these tools introduces new challenges related to dependency management. In this paper, we propose an original study of one such challenge: the emergence of bloated dependencies. Bloated dependencies are libraries that the build tool packages with the application's compiled code but that are actually not necessary to build and run the application. This phenomenon artificially grows the size of the built binary and increases maintenance effort. We propose a tool, called DepClean, to analyze the presence of bloated dependencies in Maven artifacts. We analyze 9,639 Java artifacts hosted on Maven Central, which include a total of 723,444 dependency relationships. Our key result is that 75.1% of the analyzed dependency relationships are bloated. In other words, it is feasible to reduce the number of dependencies of Maven artifacts up to 1/4 of its current count. We also perform a qualitative study with 30 notable open-source projects. Our results indicate that developers pay attention to their dependencies and are willing to remove bloated dependencies: 18/21 answered pull requests were accepted and merged by developers, removing 131 dependencies in total. △ Less

Submitted 21 January, 2020; originally announced January 2020.

Comments: Manuscript submitted to Empirical Software Engineering (EMSE)

Journal ref: Empirical Software Engineering, 2021

arXiv:1908.09757 [pdf, other]

API Beauty is in the eye of the Clients: 2.2 Million Maven Dependencies reveal the Spectrum of Client-API Usages

Authors: Nicolas Harrand, Amine Benelallam, César Soto-Valero, François Bettega, Olivier Barais, Benoit Baudry

Abstract: Hyrum's law states a common observation in the software industry: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody". Meanwhile, recent research results seem to contradict this observation when they state that "for most APIs, there is a small number of features that are actually… ▽ More Hyrum's law states a common observation in the software industry: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody". Meanwhile, recent research results seem to contradict this observation when they state that "for most APIs, there is a small number of features that are actually used". We investigate this seeming paradox between the observations in industry and the research literature, with a large scale empirical study of client API relationships in one single ecosystem: Maven Central. We study the 94 most popular libraries in Maven Central, as well as the 829,410 client artifacts that declare a dependency to these libraries and that are available in Maven Central, summing up to 2.2M dependencies. Our analysis indicates the existence of a wide spectrum of API usages, with enough clients, most API types end up being used at least once. Our second key observation is that, for all libraries, there is a small set of API types that are used by the vast majority of its clients. The practical consequences of this study are two-fold: (i) it is possible for API maintainers to find an essential part of their API on which they can focus their efforts; (ii) API developers should limit the public API elements to the set of features for which they are ready to have users. △ Less

Submitted 19 October, 2021; v1 submitted 26 August, 2019; originally announced August 2019.

Comments: 15 pages, 10 figures, 3 tables, 2 listings

Journal ref: Journal of Systems and Software 2021

arXiv:1908.06895 [pdf, other]

doi 10.1109/scam.2019.00019

The Strengths and Behavioral Quirks of Java Bytecode Decompilers

Authors: Nicolas Harrand, César Soto-Valero, Martin Monperrus, Benoit Baudry

Abstract: During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, the decompilation process, which aims at producing source code from bytecode, must establish some strategies to reconstruct the information that has been lost. Modern Java decompilers tend to use distinct strategies… ▽ More During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, the decompilation process, which aims at producing source code from bytecode, must establish some strategies to reconstruct the information that has been lost. Modern Java decompilers tend to use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled has a direct impact on the quality of the source code produced by decompilers. We study the effectiveness of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion and semantic equivalence modulo inputs. This study relies on a benchmark set of 14 real-world open-source software projects to be decompiled (2041 classes in total). Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. Even the highest ranking decompiler in this study produces syntactically correct output for 84% of classes of our dataset and semantically equivalent code output for 78% of classes. △ Less

Submitted 19 August, 2019; originally announced August 2019.

Comments: 11 pages, 6 figures, 9 listings, 3 tables

Journal ref: Proceedings of the 19th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2019)

arXiv:1903.05394 [pdf, other]

doi 10.1109/MSR.2019.00059

The Emergence of Software Diversity in Maven Central

Authors: César Soto-Valero, Amine Benelallam, Nicolas Harrand, Olivier Barais, Benoit Baudry

Abstract: Maven artifacts are immutable: an artifact that is uploaded on Maven Central cannot be removed nor modified. The only way for developers to upgrade their library is to release a new version. Consequently, Maven Central accumulates all the versions of all the libraries that are published there, and applications that declare a dependency towards a library can pick any version. In this work, we hypot… ▽ More Maven artifacts are immutable: an artifact that is uploaded on Maven Central cannot be removed nor modified. The only way for developers to upgrade their library is to release a new version. Consequently, Maven Central accumulates all the versions of all the libraries that are published there, and applications that declare a dependency towards a library can pick any version. In this work, we hypothesize that the immutability of Maven artifacts and the ability to choose any version naturally support the emergence of software diversity within Maven Central. We analyze 1,487,956 artifacts that represent all the versions of 73,653 libraries. We observe that more than 30% of libraries have multiple versions that are actively used by latest artifacts. In the case of popular libraries, more than 50% of their versions are used. We also observe that more than 17% of libraries have several versions that are significantly more used than the other versions. Our results indicate that the immutability of artifacts in Maven Central does support a sustained level of diversity among versions of libraries in the repository. △ Less

Submitted 14 March, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

Comments: Accepted for publication in 16th International Conference on Mining Software Repositories (MSR) at Montréal, Canada

Showing 1–13 of 13 results for author: Soto-Valero, C