Search | arXiv e-print repository

SAFE: Self-Attentive Function Embeddings for Binary Similarity

Authors: Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, Roberto Baldoni

Abstract: The binary similarity problem consists in determining if two functions are similar by only considering their compiled form. Advanced techniques for binary similarity recently gained momentum as they can be applied in several fields, such as copyright disputes, malware analysis, vulnerability detection, etc., and thus have an immediate practical impact. Current solutions compare functions by first transforming their binary code into multi-dimensional vector representations (embeddings), and then comparing vectors through simple and efficient geometric operations. However, embeddings are usually derived from binary code using manual feature extraction, which may fail to consider important function characteristics, or may consider features that are not important for the binary similarity problem. In this paper we propose SAFE, a novel architecture for the embedding of functions based on a self-attentive neural network. SAFE works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions (i.e., it does not incur the computational overhead of building or manipulating control flow graphs), and is more general as it works on stripped binaries and on multiple architectures. We report the results from a quantitative and qualitative analysis that show how SAFE provides a noticeable performance improvement with respect to previous solutions. Furthermore, we show how clusters of our embedding vectors are closely related to the semantics of the implemented algorithms, paving the way for further interesting applications (e.g. semantic-based binary function search).

Submitted 19 December, 2019; v1 submitted 13 November, 2018; originally announced November 2018.

Comments: Published in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA) 2019

arXiv:1810.09683 [pdf, other]

Unsupervised Features Extraction for Binary Similarity Using Graph Embedding Neural Networks

Authors: Roberto Baldoni, Giuseppe Antonio Di Luna, Luca Massarelli, Fabio Petroni, Leonardo Querzoni

Abstract: In this paper we consider the binary similarity problem, which consists in determining if two binary functions are similar only considering their compiled form. This problem is known to be crucial in several application scenarios, such as copyright disputes, malware analysis, vulnerability detection, etc. The current state-of-the-art solutions in this field work by creating an embedding model that maps binary functions into vectors in $\mathbb{R}^{n}$. Such an embedding model captures syntactic and semantic similarity between binaries, i.e., similar binary functions are mapped to points that are close in the vector space. This strategy has many advantages; one of them is the possibility to precompute embeddings of several binary functions, and then compare them with simple geometric operations (e.g., dot product). In [32] functions are first transformed into Annotated Control Flow Graphs (ACFGs) constituted by manually engineered features and then graphs are embedded into vectors using a deep neural network architecture. In this paper we propose and test several ways to compute annotated control flow graphs that use unsupervised approaches for feature learning, without incurring a human bias. Our methods are inspired by techniques used in the natural language processing community (e.g., we use word2vec to encode assembly instructions). We show that our approach is indeed successful, and it leads to better performance than previous state-of-the-art solutions. Furthermore, we report on a qualitative analysis of function embeddings. We found interesting cases in which embeddings are clustered according to the semantics of the original binary function.

Submitted 13 November, 2018; v1 submitted 23 October, 2018; originally announced October 2018.
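
Illustrative note: the abstract above mentions that precomputed function embeddings can be compared with simple geometric operations such as a dot product. The following is a minimal sketch of that idea only, not the authors' implementation. It uses cosine similarity (a length-normalized dot product) as an assumed choice of metric, and the function names, the `index` dictionary, and the three-dimensional vectors are all hypothetical placeholders rather than outputs of any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity (normalized dot product) of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, index):
    """Return (name, score) for the indexed embedding closest to `query`."""
    scored = ((name, cosine_similarity(query, vec)) for name, vec in index.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical precomputed embeddings; real ones would come from a trained model
# and have many more dimensions.
index = {
    "memcpy_like": [0.9, 0.1, 0.0],
    "strlen_like": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of an unknown function

best_name, best_score = most_similar(query, index)
print(best_name, round(best_score, 3))
```

Because the index is computed once, each lookup costs only one dot product per stored function, which is the practical advantage the abstract points to.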

arXiv:1710.08189 [pdf, other]

doi 10.1016/j.cose.2018.11.001

Survey of Machine Learning Techniques for Malware Analysis

Authors: Daniele Ucci, Leonardo Aniello, Roberto Baldoni

Abstract: Coping with malware is getting more and more challenging, given their relentless growth in complexity and volume. One of the most common approaches in literature is using machine learning techniques, to automatically learn models and patterns behind such complexity, and to develop technologies to keep pace with malware evolution. This survey aims at providing an overview on the way machine learning has been used so far in the context of malware analysis in Windows environments, i.e. for the analysis of Portable Executables. We systematize surveyed papers according to their objectives (i.e., the expected output), what information about malware they specifically use (i.e., the features), and what machine learning techniques they employ (i.e., what algorithm is used to process the input and produce the output). We also outline a number of issues and challenges, including those concerning the used datasets, and identify the main current topical trends and how to possibly advance them. In particular, we introduce the novel concept of malware analysis economics, regarding the study of existing trade-offs among key metrics, such as analysis accuracy and economical costs.

Submitted 26 November, 2018; v1 submitted 23 October, 2017; originally announced October 2017.

Comments: 55 pages, 4 figures, 8 tables, added new references, corrected typos, and revised manuscript. Forthcoming in Computers & Security

arXiv:1709.00875 [pdf, other]

Android Malware Family Classification Based on Resource Consumption over Time

Authors: Luca Massarelli, Leonardo Aniello, Claudio Ciccotelli, Leonardo Querzoni, Daniele Ucci, Roberto Baldoni

Abstract: The vast majority of today's mobile malware targets Android devices. This has pushed the research effort in Android malware analysis in the last years. An important task of malware analysis is the classification of malware samples into known families. Static malware analysis is known to fall short against techniques that change static characteristics of the malware (e.g. code obfuscation), while dynamic analysis has proven effective against such techniques. To the best of our knowledge, the most notable work on Android malware family classification purely based on dynamic analysis is DroidScribe. With respect to DroidScribe, our approach is easier to reproduce. Our methodology only employs publicly available tools, does not require any modification to the emulated environment or Android OS, and can collect data from physical devices. The latter is a key factor, since modern mobile malware can detect the emulated environment and hide their malicious behavior. Our approach relies on resource consumption metrics available from the proc file system. Features are extracted through detrended fluctuation analysis and correlation. Finally, an SVM is employed to classify malware into families. We provide an experimental evaluation on malware samples from the Drebin dataset, where we obtain a classification accuracy of 82%, proving that our methodology achieves an accuracy comparable to that of DroidScribe. Furthermore, we make the software we developed publicly available, to ease the reproducibility of our results.

Submitted 4 September, 2017; originally announced September 2017.

Comments: Extended Version

arXiv:1704.05521 [pdf, ps, other]

Building Regular Registers with Rational Malicious Servers and Anonymous Clients -- Extended Version

Authors: Antonella Del Pozzo, Silvia Bonomi, Riccardo Lazzeretti, Roberto Baldoni

Abstract: The paper addresses the problem of emulating a regular register in a synchronous distributed system where clients invoking ${\sf read}()$ and ${\sf write}()$ operations are anonymous while server processes maintaining the state of the register may be compromised by rational adversaries (i.e., a server might behave as a \emph{rational malicious Byzantine} process). We first model our problem as a Bayesian game between a client and a rational malicious server where the equilibrium depends on the decisions of the malicious server (behaving correctly and not being detected by clients vs. returning a wrong register value to clients with the risk of being detected and then excluded from the computation). We prove such an equilibrium exists and finally we design a protocol implementing the regular register that forces the rational malicious server to behave correctly.

Submitted 18 April, 2017; originally announced April 2017.

Comments: Extended version of paper accepted at 2017 International Symposium on Cyber Security Cryptography and Machine Learning (CSCML 2017)

MSC Class: 68

arXiv:1610.00502 [pdf, other]

A Survey of Symbolic Execution Techniques

Authors: Roberto Baldoni, Emilio Coppa, Daniele Cono D'Elia, Camil Demetrescu, Irene Finocchi

Abstract: Many security and software testing applications require checking whether certain properties of a program hold for any possible usage scenario. For instance, a tool for identifying software vulnerabilities may need to rule out the existence of any backdoor to bypass a program's authentication. One approach would be to test the program using different, possibly random inputs. As the backdoor may only be hit for very specific program workloads, automated exploration of the space of possible inputs is of the essence. Symbolic execution provides an elegant solution to the problem, by systematically exploring many possible execution paths at the same time without necessarily requiring concrete inputs. Rather than taking on fully specified input values, the technique abstractly represents them as symbols, resorting to constraint solvers to construct actual instances that would cause property violations. Symbolic execution has been incubated in dozens of tools developed over the last four decades, leading to major practical breakthroughs in a number of prominent software reliability applications. The goal of this survey is to provide an overview of the main ideas, challenges, and solutions developed in the area, distilling them for a broad audience. The present survey has been accepted for publication at ACM Computing Surveys. If you are considering citing this survey, we would appreciate if you could use the following BibTeX entry: http://goo.gl/Hf5Fvc

Submitted 2 May, 2018; v1 submitted 3 October, 2016; originally announced October 2016.

Comments: This is the authors pre-print copy. If you are considering citing this survey, we would appreciate if you could use the following BibTeX entry: http://goo.gl/Hf5Fvc

Journal ref: ACM Computing Surveys 51(3), 2018. BibTeX entry: http://goo.gl/Hf5Fvc

arXiv:1505.03509 [pdf, other]

Investigating the Cost of Anonymity on Dynamic Networks

Authors: Giuseppe Antonio Di Luna, Roberto Baldoni

Abstract: In this paper we study the difficulty of counting nodes in a synchronous dynamic network where nodes share the same identifier, they communicate by using a broadcast with unlimited bandwidth and, at each synchronous round, network topology may change. To count in such a setting, it has been shown that the presence of a leader is necessary. We focus on a particularly interesting subset of dynamic networks, namely \textit{Persistent Distance} - ${\cal G}(\text{PD})_{h}$, in which each node has a fixed distance from the leader across rounds and such distance is at most $h$. In these networks the dynamic diameter $D$ is at most $2h$. We prove the number of rounds for counting in ${\cal G}(\text{PD})_{2}$ is at least logarithmic with respect to the network size $|V|$. Thanks to this result, we show that counting on any dynamic anonymous network with $D$ constant w.r.t. $|V|$ takes at least $D + \Omega(\log |V|)$ rounds, where $\Omega(\log |V|)$ represents the additional cost to be paid for handling anonymity. To the best of our knowledge this is the first non-trivial, i.e. different from $\Omega(D)$, lower bound on counting in anonymous interval connected networks with broadcast and unlimited bandwidth.

Submitted 13 May, 2015; originally announced May 2015.

arXiv:1405.2992 [pdf, other]

Correlating power consumption and network traffic for improving data centers resiliency

Authors: Roberto Baldoni, Mario Caruso, Adriano Cerocchi, Claudio Ciccotelli, Luca Montanari, Luca Nicoletti

Abstract: The deployment of business critical applications and information infrastructures is moving to the cloud. This means they are hosted in large scale data centers with other business applications and infrastructures with fewer (or no) mission critical constraints. This mixed and complex environment makes the process of monitoring critical applications and handling (detecting and recovering from) possible failures of data center servers that could affect responsiveness and/or reliability of mission critical applications very challenging. Monitoring mechanisms used in data centers are usually intrusive in the sense that they need to install agents on each single server. This has considerable drawbacks: huge usage of human resources to install and patch the system and interference with the critical application because agents share application resources. In order to detect (and possibly predict) failures in data centers the paper makes a first attempt at showing the correlation between network traffic and servers' power consumption. This is an important step in deriving non-intrusive monitoring systems, as both network traffic and power consumption can be captured without installing any software at the servers. This will in turn improve the overall resiliency of the data center and its self-managing capacity.

Submitted 12 May, 2014; originally announced May 2014.

Comments: EDCC-2014, BIG4CIP-2014

Report number: BIG4CIP/2014/02
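
Illustrative note: the abstract above describes looking for a correlation between network traffic and server power consumption, without stating which statistic is used. As a minimal sketch under that caveat, the snippet below computes a Pearson correlation coefficient, which is an assumed choice and not necessarily the paper's method. The sample series `traffic_mbps` and `power_watts` are made-up values for illustration, not measurements from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    if len(xs) < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the mean (unnormalized covariance
    # and variances; the common 1/n factor cancels in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up samples taken at the same instants, for illustration only.
traffic_mbps = [10.0, 20.0, 30.0, 40.0, 50.0]
power_watts = [110.0, 118.0, 131.0, 139.0, 152.0]

r = pearson(traffic_mbps, power_watts)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 on healthy servers would suggest that a sudden divergence between the two signals is worth flagging, which matches the non-intrusive monitoring goal stated in the abstract, since both series can be collected without installing software on the servers.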

arXiv:cs/9910019 [pdf, ps, other]

Consistent Checkpointing in Distributed Databases: Towards a Formal Approach

Authors: R. Baldoni, F. Quaglia, M. Raynal

Abstract: Whether it is for audit or for recovery purposes, data checkpointing is an important problem of distributed database systems. Actually, transactions establish dependence relations on data checkpoints taken by data object managers. So, given an arbitrary set of data checkpoints (including at least a single data checkpoint from a data manager, and at most a data checkpoint from each data manager), an important question is the following one: ``Can these data checkpoints be members of a same consistent global checkpoint?''. This paper answers this question by providing a necessary and sufficient condition suited for database systems. Moreover, to show the usefulness of this condition, two {\em non-intrusive} data checkpointing protocols are derived from this condition. It is also interesting to note that this paper, by exhibiting ``correspondences'', establishes a bridge between the data object/transaction model and the process/message-passing model.

Submitted 22 October, 1999; originally announced October 1999.

Comments: 13 pages, 3 figures

Report number: Rapporto di Ricerca, Dipartimento di Informatica e Sistemistica, Universita' di Roma "La Sapienza"-(Italy) n. 27-97, July 1997

ACM Class: C.2.4; H.2

Showing 1–9 of 9 results for author: Baldoni, R