Search | arXiv e-print repository

doi 10.1145/3627106.3627138

On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI

Authors: Piergiorgio Ladisa, Serena Elisa Ponta, Nicola Ronzoni, Matias Martinez, Olivier Barais

Abstract: Current software supply chains heavily rely on open-source packages hosted in public repositories. Given the popularity of ecosystems like npm and PyPI, malicious users started to spread malware by publishing open-source packages containing malicious code. Recent works apply machine learning techniques to detect malicious packages in the npm ecosystem. However, the scarcity of samples poses a challenge to the application of machine learning techniques in other ecosystems. Despite the differences between JavaScript and Python, the open-source software supply chain attacks targeting such languages show noticeable similarities (e.g., use of installation scripts, obfuscated strings, URLs). In this paper, we present a novel approach that involves a set of language-independent features and the training of models capable of detecting malicious packages in npm and PyPI by capturing their commonalities. This methodology allows us to train models on a diverse dataset encompassing multiple languages, thereby overcoming the challenge of limited sample availability. We evaluate the models both in a controlled experiment (where labels of data are known) and in the wild by scanning newly uploaded packages for both npm and PyPI for 10 days. We find that our approach successfully detects malicious packages for both npm and PyPI. Over an analysis of 31,292 packages, we reported 58 previously unknown malicious packages (38 for npm and 20 for PyPI), which were consequently removed from the respective repositories.

Submitted 14 October, 2023; originally announced October 2023.

Comments: Proceedings of Annual Computer Security Applications Conference (ACSAC '23), December 4--8, 2023, Austin, TX, USA

arXiv:2307.09087 [pdf, other]

doi 10.1145/3605770.3625212

The Hitchhiker's Guide to Malicious Third-Party Dependencies

Authors: Piergiorgio Ladisa, Merve Sahin, Serena Elisa Ponta, Marco Rosa, Matias Martinez, Olivier Barais

Abstract: The increasing popularity of certain programming languages has spurred the creation of ecosystem-specific package repositories and package managers. Such repositories (e.g., npm, PyPI) serve as public databases that users can query to retrieve packages for various functionalities, whereas package managers automatically handle dependency resolution and package installation on the client side. These mechanisms enhance software modularization and accelerate implementation. However, they have become a target for malicious actors seeking to propagate malware on a large scale. In this work, we show how attackers can leverage capabilities of popular package managers and languages to achieve arbitrary code execution on victim machines, thereby realizing open-source software supply chain attacks. Based on the analysis of 7 ecosystems, we identify 3 install-time and 4 runtime techniques, and we provide recommendations describing how to reduce the risk when consuming third-party dependencies. We will provide proof-of-concepts that demonstrate the identified techniques. Furthermore, we describe evasion strategies employed by attackers to circumvent detection mechanisms.

Submitted 6 October, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: Proceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (SCORED '23), November 30, 2023, Copenhagen, Denmark

arXiv:2304.05200 [pdf, other]

Journey to the Center of Software Supply Chain Attacks

Authors: Piergiorgio Ladisa, Serena Elisa Ponta, Antonino Sabetta, Matias Martinez, Olivier Barais

Abstract: This work discusses open-source software supply chain attacks and proposes a general taxonomy describing how attackers conduct them. We then provide a list of safeguards to mitigate such attacks. We present our tool "Risk Explorer for Software Supply Chains" to explore such information and we discuss its industrial use-cases.

Submitted 11 April, 2023; originally announced April 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2204.04008

arXiv:2210.03998 [pdf, other]

Towards the Detection of Malicious Java Packages

Authors: Piergiorgio Ladisa, Henrik Plate, Matias Martinez, Olivier Barais, Serena Elisa Ponta

Abstract: Open-source software supply chain attacks aim at infecting downstream users by poisoning open-source packages. The common way of consuming such artifacts is through package repositories, and the development of vetting strategies to detect such attacks is ongoing research. Despite its popularity, the Java ecosystem is the least explored one in the context of supply chain attacks. In this paper we present indicators of malicious behavior that can be observed statically through the analysis of Java bytecode. Then we evaluate how such indicators and their combinations perform when detecting malicious code injections. We do so by injecting three malicious payloads taken from real-world examples into the Top-10 most popular Java libraries from libraries.io. We found that the analysis of strings in the constant pool and of sensitive APIs in the bytecode instructions aids in the task of detecting malicious Java packages by significantly reducing the information to inspect, thus also making manual triage possible.

Submitted 8 October, 2022; originally announced October 2022.

arXiv:2205.08350 [pdf, other]

RISCLESS: A Reinforcement Learning Strategy to Exploit Unused Cloud Resources

Authors: Sidahmed Yalles, Mohamed Handaoui, Jean-Emile Dartois, Olivier Barais, Laurent d'Orazio, Jalil Boukhobza

Abstract: One of the main objectives of Cloud Providers (CP) is to guarantee the Service-Level Agreement (SLA) of customers while reducing operating costs. To achieve this goal, CPs have built large-scale datacenters. This leads, however, to underutilized resources and an increase in costs. A way to improve the utilization of resources is to reclaim the unused parts and resell them at a lower price. Providing SLA guarantees to customers on reclaimed resources is a challenge due to their high volatility. Some state-of-the-art solutions consider keeping a proportion of resources free to absorb sudden variation in workloads. Others consider stable resources on top of the volatile ones to fill in for the lost resources. However, these strategies either reduce the amount of reclaimable resources or operate on less volatile ones such as Amazon Spot instances. In this paper, we propose RISCLESS, a Reinforcement Learning strategy to exploit unused Cloud resources. Our approach consists of using a small proportion of stable on-demand resources alongside the ephemeral ones in order to guarantee customers' SLA and reduce the overall costs. The approach decides when and how many stable resources to allocate in order to fulfill customers' demands. RISCLESS improved the CPs' profits by an average of 15.9% compared to state-of-the-art strategies. It also reduced the SLA violation time by an average of 36.7% while increasing the amount of used ephemeral resources by 19.5% on average.

Submitted 28 April, 2022; originally announced May 2022.

arXiv:2204.04008 [pdf, other]

doi 10.1109/SP46215.2023.00010

Taxonomy of Attacks on Open-Source Software Supply Chains

Authors: Piergiorgio Ladisa, Henrik Plate, Matias Martinez, Olivier Barais

Abstract: The widespread dependency on open-source software makes it a fruitful target for malicious actors, as demonstrated by recurring attacks. The complexity of today's open-source supply chains results in a significant attack surface, giving attackers numerous opportunities to reach the goal of injecting malicious code into open-source artifacts that is then downloaded and executed by victims. This work proposes a general taxonomy for attacks on open-source supply chains, independent of specific programming languages or ecosystems, and covering all supply chain stages from code contributions to package distribution. Taking the form of an attack tree, it covers 107 unique vectors, linked to 94 real-world incidents, and mapped to 33 mitigating safeguards. User surveys conducted with 17 domain experts and 134 software developers positively validated the correctness, comprehensiveness and comprehensibility of the taxonomy, as well as its suitability for various use-cases. Survey participants also assessed the utility and costs of the identified safeguards, and whether they are used.

Submitted 19 April, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

Journal ref: 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, US, 2023 pp. 1509-1526

arXiv:2009.11208 [pdf, other]

ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization Of Ephemeral Cloud Resources

Authors: Mohamed Handaoui, Jean-Emile Dartois, Jalil Boukhobza, Olivier Barais, Laurent d'Orazio

Abstract: Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures, which leads to low resource utilization. One way to improve resource utilization and thus reduce the total cost of ownership is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources needs to meet the expectations of customers in terms of Quality of Service. The goal is thus to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers have to estimate their future utilization to provide availability guarantees. The prediction should consider a safety margin for resources to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions consider a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a unique fixed margin does not account for workload variations over time, which may lead to SLA violations and/or poor utilization. In order to tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the ephemeral resources' utilization in the cloud. ReLeaSER dynamically tunes the safety margin at the host-level for each resource metric. The strategy learns from past prediction errors (that caused SLA violations). Our solution significantly reduces the SLA violation penalties, by 2.7x on average and up to 3.4x. It also considerably improves the CPs' potential savings, by 27.6% on average and up to 43.6%.

Submitted 10 December, 2020; v1 submitted 23 September, 2020; originally announced September 2020.

arXiv:1908.09757 [pdf, other]

API Beauty is in the eye of the Clients: 2.2 Million Maven Dependencies reveal the Spectrum of Client-API Usages

Authors: Nicolas Harrand, Amine Benelallam, César Soto-Valero, François Bettega, Olivier Barais, Benoit Baudry

Abstract: Hyrum's law states a common observation in the software industry: "With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody". Meanwhile, recent research results seem to contradict this observation when they state that "for most APIs, there is a small number of features that are actually used". We investigate this seeming paradox between the observations in industry and the research literature, with a large scale empirical study of client API relationships in one single ecosystem: Maven Central. We study the 94 most popular libraries in Maven Central, as well as the 829,410 client artifacts that declare a dependency to these libraries and that are available in Maven Central, summing up to 2.2M dependencies. Our analysis indicates the existence of a wide spectrum of API usages: with enough clients, most API types end up being used at least once. Our second key observation is that, for all libraries, there is a small set of API types that are used by the vast majority of its clients. The practical consequences of this study are two-fold: (i) it is possible for API maintainers to find an essential part of their API on which they can focus their efforts; (ii) API developers should limit the public API elements to the set of features for which they are ready to have users.

Submitted 19 October, 2021; v1 submitted 26 August, 2019; originally announced August 2019.

Comments: 15 pages, 10 figures, 3 tables, 2 listings

Journal ref: Journal of Systems and Software 2021

arXiv:1903.05394 [pdf, other]

doi 10.1109/MSR.2019.00059

The Emergence of Software Diversity in Maven Central

Authors: César Soto-Valero, Amine Benelallam, Nicolas Harrand, Olivier Barais, Benoit Baudry

Abstract: Maven artifacts are immutable: an artifact that is uploaded on Maven Central cannot be removed nor modified. The only way for developers to upgrade their library is to release a new version. Consequently, Maven Central accumulates all the versions of all the libraries that are published there, and applications that declare a dependency towards a library can pick any version. In this work, we hypothesize that the immutability of Maven artifacts and the ability to choose any version naturally support the emergence of software diversity within Maven Central. We analyze 1,487,956 artifacts that represent all the versions of 73,653 libraries. We observe that more than 30% of libraries have multiple versions that are actively used by latest artifacts. In the case of popular libraries, more than 50% of their versions are used. We also observe that more than 17% of libraries have several versions that are significantly more used than the other versions. Our results indicate that the immutability of artifacts in Maven Central does support a sustained level of diversity among versions of libraries in the repository.

Submitted 14 March, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

Comments: Accepted for publication in 16th International Conference on Mining Software Repositories (MSR) at Montréal, Canada

arXiv:1901.05392 [pdf, other]

The Maven Dependency Graph: a Temporal Graph-based Representation of Maven Central

Authors: Amine Benelallam, Nicolas Harrand, César Soto Valero, Benoit Baudry, Olivier Barais

Abstract: The Maven Central Repository provides an extraordinary source of data to understand complex architecture and evolution phenomena among Java applications. As of September 6, 2018, this repository includes 2.8M artifacts (compiled piece of code implemented in a JVM-based language), each of which is characterized with metadata such as exact version, date of upload and list of dependencies towards other artifacts. Today, one who wants to analyze the complete ecosystem of Maven artifacts and their dependencies faces two key challenges: (i) this is a huge data set; and (ii) dependency relationships among artifacts are not modeled explicitly and cannot be queried. In this paper, we present the Maven Dependency Graph. This open source data set provides two contributions: a snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database in which we explicitly model all dependencies; an open source infrastructure to query this huge dataset.

Submitted 16 January, 2019; originally announced January 2019.

Comments: 5 pages, 2 figures, 2 tables

arXiv:1704.04378 [pdf, other]

doi 10.1145/3079368.3079394

Weaving Rules into Models@run.time for Embedded Smart Systems

Authors: Ludovic Mouline, Thomas Hartmann, François Fouquet, Yves Le Traon, Johann Bourcier, Olivier Barais

Abstract: Smart systems are characterised by their ability to analyse measured data live and to react to changes according to expert rules. Therefore, such systems exploit appropriate data models together with actions, triggered by domain-related conditions. The challenge at hand is that smart systems usually need to process thousands of updates to detect which rules need to be triggered, often even on restricted hardware like a Raspberry Pi. Although various approaches have been investigated to efficiently check conditions on data models, they either assume that the models fit into main memory or rely on high-latency persistent storage systems that severely damage the reactivity of smart systems. To tackle this challenge, we propose a novel composition process, which weaves executable rules into a data model with lazy loading abilities. We quantitatively show, on a smart building case study, that our approach can handle, at low latency, big sets of rules on top of large-scale data models on restricted hardware.

Submitted 14 April, 2017; originally announced April 2017.

Comments: pre-print version, published in the proceedings of MOMO-17 Workshop

arXiv:1405.6817 [pdf, other]

Kevoree Modeling Framework (KMF): Efficient modeling techniques for runtime use

Authors: Fouquet Francois, Grégory Nain, Brice Morin, Erwan Daubert, Olivier Barais, Noël Plouzeau, Jean-Marc Jézéquel

Abstract: The creation of Domain Specific Languages (DSL) counts as one of the main goals in the field of Model-Driven Software Engineering (MDSE). The main purpose of these DSLs is to facilitate the manipulation of domain specific concepts, by providing developers with specific tools for their domain of expertise. A natural approach to create DSLs is to reuse existing modeling standards and tools. In this area, the Eclipse Modeling Framework (EMF) has rapidly become the de facto standard in MDSE for building Domain Specific Languages (DSL) and tools based on generative techniques. However, the use of EMF generated tools in domains like Internet of Things (IoT), Cloud Computing or Models@Runtime reaches several limitations. In this paper, we identify several properties the generated tools must comply with to be usable in other domains than desktop-based software systems. We then challenge EMF on these properties and describe our approach to overcome the limitations. Our approach, implemented in the Kevoree Modeling Framework (KMF), is finally evaluated according to the identified properties and compared to EMF.

Submitted 27 May, 2014; originally announced May 2014.

Comments: ISBN 978-2-87971-131-7; N° TR-SnT-2014-11 (2014)

Report number: TR-SnT-2014-11

arXiv:1306.0760 [pdf, other]

doi 10.1007/s10270-013-0354-4

Mashup of Meta-Languages and its Implementation in the Kermeta Language Workbench

Authors: Jean-Marc Jézéquel, Benoit Combemale, Olivier Barais, Martin Monperrus, François Fouquet

Abstract: With the growing use of domain-specific languages (DSL) in industry, DSL design and implementation goes far beyond an activity for a few experts only and becomes a challenging task for thousands of software engineers. DSL implementation indeed requires engineers to care for various concerns, from abstract syntax, static semantics, behavioral semantics, to extra-functional issues such as run-time performance. This paper presents an approach that uses one meta-language per language implementation concern. We show that the usage and combination of those meta-languages is simple and intuitive enough to deserve the term "mashup". We evaluate the approach by completely implementing the non trivial fUML modeling language, a semantically sound and executable subset of the Unified Modeling Language (UML).

Submitted 4 June, 2013; originally announced June 2013.

Comments: Published in Software and Systems Modeling (2013)

Journal ref: Software and Systems Modeling, Springer Verlag, volume 14, 2015

arXiv:0804.1696 [pdf, ps, other]

A classification of invasive patterns in AOP

Authors: Freddy Munoz, Benoit Baudry, Olivier Barais

Abstract: Aspect-Oriented Programming (AOP) improves modularity by encapsulating crosscutting concerns into aspects. Some mechanisms to compose aspects allow invasiveness as a means to integrate concerns. Invasiveness means that AOP languages have unrestricted access to program properties. Such languages are interesting because they allow performing complex operations and better introduce functionalities. In this report we present a classification of invasive patterns in AOP. This classification characterizes the aspects' invasive behavior and allows developers to abstract about the aspect incidence over the program they crosscut.

Submitted 24 April, 2008; v1 submitted 10 April, 2008; originally announced April 2008.

Report number: RR-6501

Showing 1–14 of 14 results for author: Barais, O