Search | arXiv e-print repository

Assemblage: Automatic Binary Dataset Construction for Machine Learning

Authors: Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski

Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis… ▽ More Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from https://assemblage-dataset.net △ Less

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2311.02206 [pdf, other]

Modern Datalog on the GPU

Authors: Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Kristopher Micinski, Sidharth Kumar

Abstract: Modern deductive database engines (e.g., LogicBlox and Soufflé) enable their users to write declarative queries which compute recursive deductions over extensional data, leaving their high-performance operationalization (query planning, semi-naïve evaluation, and parallelization) to the engine. Such engines form the backbone of modern high-throughput applications in static analysis, security audit… ▽ More Modern deductive database engines (e.g., LogicBlox and Soufflé) enable their users to write declarative queries which compute recursive deductions over extensional data, leaving their high-performance operationalization (query planning, semi-naïve evaluation, and parallelization) to the engine. Such engines form the backbone of modern high-throughput applications in static analysis, security auditing, social-media mining, and business analytics. State-of-the-art engines are built upon nested loop joins over explicit representations (e.g., BTrees and tries) and ubiquitously employ range indexing to accelerate iterated joins. In this work, we present GDlog: a GPU-based deductive analytics engine (implemented as a CUDA library) which achieves significant performance improvements (5--10x or more) versus prior systems. GDlog is powered by a novel range-indexed SIMD datastructure: the hash-indexed sorted array (HISA). We perform extensive evaluation on GDlog, comparing it against both CPU and GPU-based hash tables and Datalog engines, and using it to support a range of large-scale deductive queries including reachability, same generation, and context-sensitive program analysis. Our experiments show that GDlog achieves performance competitive with modern SIMD hash tables and beats prior work by an order of magnitude in runtime while offering more favorable memory footprint. △ Less

Submitted 10 March, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

arXiv:2211.11573 [pdf, other]

Higher-Order, Data-Parallel Structured Deduction

Authors: Thomas Gilray, Arash Sahebolamri, Sidharth Kumar, Kristopher Micinski

Abstract: State-of-the-art Datalog engines include expressive features such as ADTs (structured heap values), stratified aggregation and negation, various primitive operations, and the opportunity for further extension using FFIs. Current parallelization approaches for state-of-art Datalogs target shared-memory locking data-structures using conventional multi-threading, or use the map-reduce model for distr… ▽ More State-of-the-art Datalog engines include expressive features such as ADTs (structured heap values), stratified aggregation and negation, various primitive operations, and the opportunity for further extension using FFIs. Current parallelization approaches for state-of-art Datalogs target shared-memory locking data-structures using conventional multi-threading, or use the map-reduce model for distributed computing. Furthermore, current state-of-art approaches cannot scale to formal systems which pervasively manipulate structured data due to their lack of indexing for structured data stored in the heap. In this paper, we describe a new approach to data-parallel structured deduction that involves a key semantic extension of Datalog to permit first-class facts and higher-order relations via defunctionalization, an implementation approach that enables parallelism uniformly both across sets of disjoint facts and over individual facts with nested structure. We detail a core language, $DL_s$, whose key invariant (subfact closure) ensures that each subfact is materialized as a top-class fact. We extend $DL_s$ to Slog, a fully-featured language whose forms facilitate leveraging subfact closure to rapidly implement expressive, high-performance formal systems. We demonstrate Slog by building a family of control-flow analyses from abstract machines, systematically, along with several implementations of classical type systems (such as STLC and LF). We performed experiments on EC2, Azure, and ALCF's Theta at up to 1000 threads, showing orders-of-magnitude scalability improvements versus competing state-of-art systems. △ Less

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2107.12909 [pdf, ps, other]

So You Want to Analyze Scheme Programs With Datalog?

Authors: Davis Ross Silverman, Yihao Sun, Kristopher Micinski, Thomas Gilray

Abstract: Static analysis approximates the results of a program by examining only its syntax. For example, control-flow analysis (CFA) determines which syntactic lambdas (for functional languages) or (for object-oriented) methods may be invoked at each call site within a program. Rich theoretical results exist studying control flow analysis for Scheme-like languages, but implementations are often complex an… ▽ More Static analysis approximates the results of a program by examining only its syntax. For example, control-flow analysis (CFA) determines which syntactic lambdas (for functional languages) or (for object-oriented) methods may be invoked at each call site within a program. Rich theoretical results exist studying control flow analysis for Scheme-like languages, but implementations are often complex and specialized. By contrast, object-oriented languages (Java in particular) enjoy high-precision control-flow analyses that scale to thousands (or more) of lines of code. State-of-the-art implementations (such as DOOP on Soufflé) structure the analysis using Horn-SAT (Datalog) to enable compilation of the analysis to efficient implementations such as high-performance relational algebra kernels. In this paper, we present an implementation of control-flow analysis for a significant subset of Scheme (including set!, call/cc, and primitive operations) using the Soufflé Datalog engine. We present an evaluation on a worst-case term demonstrating the polynomial complexity of our m-CFA and remark upon scalability results using Soufflé. △ Less

Submitted 27 July, 2021; originally announced July 2021.

arXiv:2101.04718 [pdf, other]

Declarative Demand-Driven Reverse Engineering

Authors: Yihao Sun, Jeffrey Ching, Kristopher Micinski

Abstract: Binary reverse engineering is a challenging task because it often necessitates reasoning using both domain-specific knowledge (e.g., understanding entrypoint idioms common to an ABI) and logical inference (e.g., reconstructing interprocedural control flow). To help perform these tasks, reverse engineers often use toolkits (such as IDA Pro or Ghidra) that allow them to interactively explicate prope… ▽ More Binary reverse engineering is a challenging task because it often necessitates reasoning using both domain-specific knowledge (e.g., understanding entrypoint idioms common to an ABI) and logical inference (e.g., reconstructing interprocedural control flow). To help perform these tasks, reverse engineers often use toolkits (such as IDA Pro or Ghidra) that allow them to interactively explicate properties of binaries. We argue that deductive databases serve as a natural abstraction for interfacing between visualization-based binary analysis tools and high-performance logical inference engines that compute facts about binaries. In this paper, we present a vision for the future in which reverse engineers use a visualization-based tool to understand binaries while simultaneously querying a logical-inference engine to perform arbitrarily-complex deductive inference tasks. We call our vision declarative demand-driven reverse engineering (D^3RE for short), and sketch a formal semantics whose goal is to mediate interaction between a logical-inference engine (such Souffle) and a reverse engineering tool. We describe aprototype tool, d3re, which are using to explore the D^3RE vision. While still a prototype, we have used d3re to reimplement several common querying tasks on binaries. Our evaluation demonstrates that d3re enables both better performance and more succinct implementation of these common RE tasks. △ Less

Submitted 12 January, 2021; originally announced January 2021.

arXiv:1912.00317 [pdf, other]

An Observational Investigation of Reverse Engineers' Processes

Authors: Daniel Votipka, Seth M. Rabin, Kristopher Micinski, Jeffrey S. Foster, Michelle L. Mazurek

Abstract: Reverse engineering is a complex process essential to software-security tasks such as vulnerability discovery and malware analysis. Significant research and engineering effort has gone into develo** tools to support reverse engineers. However, little work has been done to understand the way reverse engineers think when analyzing programs, leaving tool developers to make interface design decision… ▽ More Reverse engineering is a complex process essential to software-security tasks such as vulnerability discovery and malware analysis. Significant research and engineering effort has gone into develo** tools to support reverse engineers. However, little work has been done to understand the way reverse engineers think when analyzing programs, leaving tool developers to make interface design decisions based only on intuition. This paper takes a first step toward a better understanding of reverse engineers' processes, with the goal of producing insights for improving interaction design for reverse engineering tools. We present the results of a semi-structured, observational interview study of reverse engineers (N=16). Each observation investigated the questions reverse engineers ask as they probe a program, how they answer these questions, and the decisions they make throughout the reverse engineering process. From the interview responses, we distill a model of the reverse engineering process, divided into three phases: overview, sub-component scanning, and focused experimentation. Each analysis phase's results feed the next as reverse engineers' mental representations become more concrete. We find that reverse engineers typically use static methods in the first two phases, but dynamic methods in the final phase, with experience playing large, but varying, roles in each phase. % and the role of experience varies between phases. Based on these results, we provide five interaction design guidelines for reverse engineering tools. △ Less

Submitted 30 November, 2019; originally announced December 2019.

Comments: 22 pages, 6 figures, to appear at the 2020 USENIX Security Symposium

arXiv:1807.09377 [pdf, other]

Racets: Faceted Execution in Racket

Authors: Kristopher Micinski, Zhanpeng Wang, Thomas Gilray

Abstract: Faceted Execution is a linguistic paradigm for dynamic information-flow control. Under faceted execution, secure program data is represented by faceted values: decision trees that encode how the data should appear to its owner (represented by a label) versus everyone else. When labels are allowed to be first-class (i.e., predicates that decide at runtime which data to reveal), faceted execution en… ▽ More Faceted Execution is a linguistic paradigm for dynamic information-flow control. Under faceted execution, secure program data is represented by faceted values: decision trees that encode how the data should appear to its owner (represented by a label) versus everyone else. When labels are allowed to be first-class (i.e., predicates that decide at runtime which data to reveal), faceted execution enables policy-agnostic programming: a programming style that allows privacy policies for data to be enforced independently of code that computes on that data. To date, implementations of faceted execution are relatively heavyweight: requiring either changing the language runtime or the application code (e.g., by using monads). Following Racket's languages-as-libraries approach, we present Racets: an implementation of faceted execution as a library of macros. Given Racket's highly-expressive macro system, our implementation follows relatively directly from the semantics of faceted execution. To demonstrate how Racets can be used for policy-agnostic programming, we use it to build a web-based game of Battleship. Our implementation sheds light on several interesting issues in interacting with code written without faceted execution. Our Racets implementation is open source, under development, and available online. △ Less

Submitted 24 July, 2018; originally announced July 2018.

arXiv:1504.03711 [pdf, other]

Checking Interaction-Based Declassification Policies for Android Using Symbolic Execution

Authors: Kristopher Micinski, Jonathan Fetter-Degges, **seong Jeon, Jeffrey S. Foster, Michael R. Clarkson

Abstract: Mobile apps can access a wide variety of secure information, such as contacts and location. However, current mobile platforms include only coarse access control mechanisms to protect such data. In this paper, we introduce interaction-based declassification policies, in which the user's interactions with the app constrain the release of sensitive information. Our policies are defined extensionally,… ▽ More Mobile apps can access a wide variety of secure information, such as contacts and location. However, current mobile platforms include only coarse access control mechanisms to protect such data. In this paper, we introduce interaction-based declassification policies, in which the user's interactions with the app constrain the release of sensitive information. Our policies are defined extensionally, so as to be independent of the app's implementation, based on sequences of security-relevant events that occur in app runs. Policies use LTL formulae to precisely specify which secret inputs, read at which times, may be released. We formalize a semantic security condition, interaction-based noninterference, to define our policies precisely. Finally, we describe a prototype tool that uses symbolic execution to check interaction-based declassification policies for Android, and we show that it enforces policies correctly on a set of apps. △ Less

Submitted 29 July, 2015; v1 submitted 14 April, 2015; originally announced April 2015.

Comments: This research was supported in part by NSF grants CNS-1064997 and 1421373, AFOSR grants FA9550-12-1-0334 and FA9550-14-1-0334, a partnership between UMIACS and the Laboratory for Telecommunication Sciences, and the National Security Agency

arXiv:1401.4492 [pdf, ps, other]

Temporal Logics for Hyperproperties

Authors: Michael R. Clarkson, Bernd Finkbeiner, Masoud Koleini, Kristopher K. Micinski, Markus N. Rabe, César Sánchez

Abstract: Two new logics for verification of hyperproperties are proposed. Hyperproperties characterize security policies, such as noninterference, as a property of sets of computation paths. Standard temporal logics such as LTL, CTL, and CTL* can refer only to a single path at a time, hence cannot express many hyperproperties of interest. The logics proposed here, HyperLTL and HyperCTL*, add explicit and s… ▽ More Two new logics for verification of hyperproperties are proposed. Hyperproperties characterize security policies, such as noninterference, as a property of sets of computation paths. Standard temporal logics such as LTL, CTL, and CTL* can refer only to a single path at a time, hence cannot express many hyperproperties of interest. The logics proposed here, HyperLTL and HyperCTL*, add explicit and simultaneous quantification over multiple paths to LTL and to CTL*. This kind of quantification enables expression of hyperproperties. A model checking algorithm for the proposed logics is given. For a fragment of HyperLTL, a prototype model checker has been implemented. △ Less

Submitted 20 January, 2014; v1 submitted 17 January, 2014; originally announced January 2014.

arXiv:1306.5678 [pdf, ps, other]

A Temporal Logic of Security

Authors: Masoud Koleini, Michael R. Clarkson, Kristopher K. Micinski

Abstract: A new logic for verification of security policies is proposed. The logic, HyperLTL, extends linear-time temporal logic (LTL) with connectives for explicit and simultaneous quantification over multiple execution paths, thereby enabling HyperLTL to express information-flow security policies that LTL cannot. A model-checking algorithm for a fragment of HyperLTL is given, and the algorithm is implemen… ▽ More A new logic for verification of security policies is proposed. The logic, HyperLTL, extends linear-time temporal logic (LTL) with connectives for explicit and simultaneous quantification over multiple execution paths, thereby enabling HyperLTL to express information-flow security policies that LTL cannot. A model-checking algorithm for a fragment of HyperLTL is given, and the algorithm is implemented in a prototype model checker. The class of security policies expressible in HyperLTL is characterized by an arithmetic hierarchy of hyperproperties. △ Less

Submitted 9 July, 2013; v1 submitted 24 June, 2013; originally announced June 2013.

Showing 1–10 of 10 results for author: Micinski, K