-
Revealing the Utilized Rank of Subspaces of Learning in Neural Networks
Authors:
Isha Garg,
Christian Koguchi,
Eshan Verma,
Daniel Ulbricht
Abstract:
In this work, we study how well the learned weights of a neural network utilize the space available to them. This notion is related to capacity, but additionally incorporates the interaction of the network architecture with the dataset. Most learned weights appear to be full rank, and are therefore not amenable to low rank decomposition. This deceptively implies that the weights are utilizing the…
▽ More
In this work, we study how well the learned weights of a neural network utilize the space available to them. This notion is related to capacity, but additionally incorporates the interaction of the network architecture with the dataset. Most learned weights appear to be full rank, and are therefore not amenable to low rank decomposition. This deceptively implies that the weights are utilizing the entire space available to them. We propose a simple data-driven transformation that projects the weights onto the subspace where the data and the weight interact. This preserves the functional map** of the layer and reveals its low rank structure. In our findings, we conclude that most models utilize a fraction of the available space. For instance, for ViTB-16 and ViTL-16 trained on ImageNet, the mean layer utilization is 35% and 20% respectively. Our transformation results in reducing the parameters to 50% and 25% respectively, while resulting in less than 0.2% accuracy drop after fine-tuning. We also show that self-supervised pre-training drives this utilization up to 70%, justifying its suitability for downstream tasks.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
Authors:
Narek Maloyan,
Ekansh Verma,
Bulat Nutfullin,
Bislan Ashinov
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, but their vulnerability to trojan or backdoor attacks poses significant security risks. This paper explores the challenges and insights gained from the Trojan Detection Competition 2023 (TDC2023), which focused on identifying and evaluating trojan attacks on LLMs. We investigate the difficulty of distinguish…
▽ More
Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, but their vulnerability to trojan or backdoor attacks poses significant security risks. This paper explores the challenges and insights gained from the Trojan Detection Competition 2023 (TDC2023), which focused on identifying and evaluating trojan attacks on LLMs. We investigate the difficulty of distinguishing between intended and unintended triggers, as well as the feasibility of reverse engineering trojans in real-world scenarios. Our comparative analysis of various trojan detection methods reveals that achieving high Recall scores is significantly more challenging than obtaining high Reverse-Engineering Attack Success Rate (REASR) scores. The top-performing methods in the competition achieved Recall scores around 0.16, comparable to a simple baseline of randomly sampling sentences from a distribution similar to the given training prefixes. This finding raises questions about the detectability and recoverability of trojans inserted into the model, given only the harmful targets. Despite the inability to fully solve the problem, the competition has led to interesting observations about the viability of trojan detection and improved techniques for optimizing LLM input prompts. The phenomenon of unintended triggers and the difficulty in distinguishing them from intended triggers highlights the need for further research into the robustness and interpretability of LLMs. The TDC2023 has provided valuable insights into the challenges and opportunities associated with trojan detection in LLMs, laying the groundwork for future research in this area to ensure their safety and reliability in real-world applications.
△ Less
Submitted 21 April, 2024;
originally announced April 2024.
-
AI ATAC 1: An Evaluation of Prominent Commercial Malware Detectors
Authors:
Robert A. Bridges,
Brian Weber,
Justin M. Beaver,
Jared M. Smith,
Miki E. Verma,
Savannah Norem,
Kevin Spakes,
Cory Watson,
Jeff A. Nichols,
Brian Jewell,
Michael. D. Iannacone,
Chelsey Dunivan Stahl,
Kelly M. T. Huffer,
T. Sean Oesch
Abstract:
This work presents an evaluation of six prominent commercial endpoint malware detectors, a network malware detector, and a file-conviction algorithm from a cyber technology vendor. The evaluation was administered as the first of the Artificial Intelligence Applications to Autonomous Cybersecurity (AI ATAC) prize challenges, funded by / completed in service of the US Navy. The experiment employed 1…
▽ More
This work presents an evaluation of six prominent commercial endpoint malware detectors, a network malware detector, and a file-conviction algorithm from a cyber technology vendor. The evaluation was administered as the first of the Artificial Intelligence Applications to Autonomous Cybersecurity (AI ATAC) prize challenges, funded by / completed in service of the US Navy. The experiment employed 100K files (50/50% benign/malicious) with a stratified distribution of file types, including ~1K zero-day program executables (increasing experiment size two orders of magnitude over previous work). We present an evaluation process of delivering a file to a fresh virtual machine donning the detection technology, waiting 90s to allow static detection, then executing the file and waiting another period for dynamic detection; this allows greater fidelity in the observational data than previous experiments, in particular, resource and time-to-detection statistics. To execute all 800K trials (100K files $\times$ 8 tools), a software framework is designed to choreographed the experiment into a completely automated, time-synced, and reproducible workflow with substantial parallelization. A cost-benefit model was configured to integrate the tools' recall, precision, time to detection, and resource requirements into a single comparable quantity by simulating costs of use. This provides a ranking methodology for cyber competitions and a lens through which to reason about the varied statistical viewpoints of the results. These statistical and cost-model results provide insights on state of commercial malware detection.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
Time-Based CAN Intrusion Detection Benchmark
Authors:
Deborah H. Blevins,
Pablo Moriano,
Robert A. Bridges,
Miki E. Verma,
Michael D. Iannacone,
Samuel C Hollifield
Abstract:
Modern vehicles are complex cyber-physical systems made of hundreds of electronic control units (ECUs) that communicate over controller area networks (CANs). This inherited complexity has expanded the CAN attack surface which is vulnerable to message injection attacks. These injections change the overall timing characteristics of messages on the bus, and thus, to detect these malicious messages, t…
▽ More
Modern vehicles are complex cyber-physical systems made of hundreds of electronic control units (ECUs) that communicate over controller area networks (CANs). This inherited complexity has expanded the CAN attack surface which is vulnerable to message injection attacks. These injections change the overall timing characteristics of messages on the bus, and thus, to detect these malicious messages, time-based intrusion detection systems (IDSs) have been proposed. However, time-based IDSs are usually trained and tested on low-fidelity datasets with unrealistic, labeled attacks. This makes difficult the task of evaluating, comparing, and validating IDSs. Here we detail and benchmark four time-based IDSs against the newly published ROAD dataset, the first open CAN IDS dataset with real (non-simulated) stealthy attacks with physically verified effects. We found that methods that perform hypothesis testing by explicitly estimating message timing distributions have lower performance than methods that seek anomalies in a distribution-related statistic. In particular, these "distribution-agnostic" based methods outperform "distribution-based" methods by at least 55% in area under the precision-recall curve (AUC-PR). Our results expand the body of knowledge of CAN time-based IDSs by providing details of these methods and reporting their results when tested on datasets with real advanced attacks. Finally, we develop an after-market plug-in detector using lightweight hardware, which can be used to deploy the best performing IDS method on nearly any vehicle.
△ Less
Submitted 14 January, 2021;
originally announced January 2021.
-
A Comprehensive Guide to CAN IDS Data & Introduction of the ROAD Dataset
Authors:
Miki E. Verma,
Robert A. Bridges,
Michael D. Iannacone,
Samuel C. Hollifield,
Pablo Moriano,
Steven C. Hespeler,
Bill Kay,
Frank L. Combs
Abstract:
Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions on CANs. Producing vehicular CAN data with a variety of intrusions is out of reach for most researchers as it requires expensive assets and expertise. To assist researchers, we…
▽ More
Although ubiquitous in modern vehicles, Controller Area Networks (CANs) lack basic security properties and are easily exploitable. A rapidly growing field of CAN security research has emerged that seeks to detect intrusions on CANs. Producing vehicular CAN data with a variety of intrusions is out of reach for most researchers as it requires expensive assets and expertise. To assist researchers, we present the first comprehensive guide to the existing open CAN intrusion datasets, including a quality analysis of each dataset and an enumeration of each's benefits, drawbacks, and suggested use case. Current public CAN IDS datasets are limited to real fabrication (simple message injection) attacks and simulated attacks often in synthetic data, which lack fidelity. In general, the physical effects of attacks on the vehicle are not verified in the available datasets. Only one dataset provides signal-translated data but not a corresponding raw binary version. Overall, the available data pigeon-holes CAN IDS works into testing on limited, often inappropriate data (usually with attacks that are too easily detectable to truly test the method), and this lack data has stymied comparability and reproducibility of results. As our primary contribution, we present the ROAD (Real ORNL Automotive Dynamometer) CAN Intrusion Dataset, consisting of over 3.5 hours of one vehicle's CAN data. ROAD contains ambient data recorded during a diverse set of activities, and attacks of increasing stealth with multiple variants and instances of real fuzzing, fabrication, and unique advanced attacks, as well as simulated masquerade attacks. To facilitate benchmarking CAN IDS methods that require signal-translated inputs, we also provide the signal time series format for many of the CAN captures. Our contributions aim to facilitate appropriate benchmarking and needed comparability in the CAN IDS field.
△ Less
Submitted 7 February, 2024; v1 submitted 28 December, 2020;
originally announced December 2020.
-
Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection
Authors:
Robert A. Bridges,
Sean Oesch,
Miki E. Verma,
Michael D. Iannacone,
Kelly M. T. Huffer,
Brian Jewell,
Jeff A. Nichols,
Brian Weber,
Justin M. Beaver,
Jared M. Smith,
Daniel Scofield,
Craig Miles,
Thomas Plummer,
Mark Daniell,
Anne M. Tall
Abstract:
In this paper, we present a scientific evaluation of four prominent malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously- and never-before-seen files? Is it worth purchasing a network-level malware detector? To identify weaknesses, we tested each tool against 3,536 total files (2,554 or 72\% malicious, 982 or…
▽ More
In this paper, we present a scientific evaluation of four prominent malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously- and never-before-seen files? Is it worth purchasing a network-level malware detector? To identify weaknesses, we tested each tool against 3,536 total files (2,554 or 72\% malicious, 982 or 28\% benign) of a variety of file types, including hundreds of malicious zero-days, polyglots, and APT-style files, delivered on multiple protocols. We present statistical results on detection time and accuracy, consider complementary analysis (using multiple tools together), and provide two novel applications of the recent cost-benefit evaluation procedure of Iannacone \& Bridges. While the ML-based tools are more effective at detecting zero-day files and executables, the signature-based tool may still be an overall better option. Both network-based tools provide substantial (simulated) savings when paired with either host tool, yet both show poor detection rates on protocols other than HTTP or SMTP. Our results show that all four tools have near-perfect precision but alarmingly low recall, especially on file types other than executables and office files -- 37% of malware tested, including all polyglot files, were undetected. Priorities for researchers and takeaways for end users are given.
△ Less
Submitted 17 August, 2022; v1 submitted 16 December, 2020;
originally announced December 2020.
-
FairMixRep : Self-supervised Robust Representation Learning for Heterogeneous Data with Fairness constraints
Authors:
Souradip Chakraborty,
Ekansh Verma,
Saswata Sahoo,
Jyotishka Datta
Abstract:
Representation Learning in a heterogeneous space with mixed variables of numerical and categorical types has interesting challenges due to its complex feature manifold. Moreover, feature learning in an unsupervised setup, without class labels and a suitable learning loss function, adds to the problem complexity. Further, the learned representation and subsequent predictions should not reflect disc…
▽ More
Representation Learning in a heterogeneous space with mixed variables of numerical and categorical types has interesting challenges due to its complex feature manifold. Moreover, feature learning in an unsupervised setup, without class labels and a suitable learning loss function, adds to the problem complexity. Further, the learned representation and subsequent predictions should not reflect discriminatory behavior towards certain sensitive groups or attributes. The proposed feature map should preserve maximum variations present in the data and needs to be fair with respect to the sensitive variables. We propose, in the first phase of our work, an efficient encoder-decoder framework to capture the mixed-domain information. The second phase of our work focuses on de-biasing the mixed space representations by adding relevant fairness constraints. This ensures minimal information loss between the representations before and after the fairness-preserving projections. Both the information content and the fairness aspect of the final representation learned has been validated through several metrics where it shows excellent performance. Our work (FairMixRep) addresses the problem of Mixed Space Fair Representation learning from an unsupervised perspective and learns a Universal representation that is timely, unique, and a novel research contribution.
△ Less
Submitted 14 October, 2020; v1 submitted 7 October, 2020;
originally announced October 2020.
-
CAN-D: A Modular Four-Step Pipeline for Comprehensively Decoding Controller Area Network Data
Authors:
Miki E. Verma,
Robert A. Bridges,
Jordan J. Sosnowski,
Samuel C. Hollifield,
Michael D. Iannacone
Abstract:
CANs are a broadcast protocol for real-time communication of critical vehicle subsystems. Original equipment manufacturers of passenger vehicles hold secret their map**s of CAN data to vehicle signals, and these definitions vary according to make, model, and year. Without these map**s, the wealth of real-time vehicle information hidden in the CAN packets is uninterpretable, impeding vehicle-re…
▽ More
CANs are a broadcast protocol for real-time communication of critical vehicle subsystems. Original equipment manufacturers of passenger vehicles hold secret their map**s of CAN data to vehicle signals, and these definitions vary according to make, model, and year. Without these map**s, the wealth of real-time vehicle information hidden in the CAN packets is uninterpretable, impeding vehicle-related research. Guided by the 4-part CAN signal definition, we present CAN-D (CAN-Decoder), a modular, 4-step pipeline for identifying each signal's boundaries (start bit, length), endianness (byte order), signedness (bit-to-integer encoding), and by leveraging diagnostic standards, augmenting a subset of the extracted signals with physical interpretation. We provide a comprehensive review of the CAN signal reverse engineering research. Previous methods ignore endianness and signedness, rendering them incapable of decoding many standard CAN signal definitions. Incorporating endianness grows the search space from 128 to 4.72E21 signal tokenizations and introduces a web of changing dependencies. We formulate, formally analyze, and provide an efficient solution to an optimization problem, allowing identification of the optimal set of signal boundaries and byte orderings. We provide two novel, state-of-the-art signal boundary classifiers-both superior to previous approaches in precision and recall in three different test scenarios-and the first signedness classification algorithm which exhibits a $>$97\% F-score. CAN-D is the only solution with the potential to extract any CAN signal. In evaluation on 10 vehicles, CAN-D's average $\ell^1$ error is 5x better than all previous methods and exhibits lower ave. error, even when considering only signals that meet prior methods' assumptions. CAN-D is implemented in lightweight hardware, allowing for an OBD-II plugin for real-time in-vehicle CAN decoding.
△ Less
Submitted 22 June, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
ACTT: Automotive CAN Tokenization and Translation
Authors:
Miki E. Verma,
Robert A. Bridges,
Samuel C. Hollifield
Abstract:
Modern vehicles contain scores of Electrical Control Units (ECUs) that broadcast messages over a Controller Area Network (CAN). Vehicle manufacturers rely on security through obscurity by concealing their unique map** of CAN messages to vehicle functions which differs for each make, model, year, and even trim. This poses a major obstacle for after-market modifications notably performance tuning…
▽ More
Modern vehicles contain scores of Electrical Control Units (ECUs) that broadcast messages over a Controller Area Network (CAN). Vehicle manufacturers rely on security through obscurity by concealing their unique map** of CAN messages to vehicle functions which differs for each make, model, year, and even trim. This poses a major obstacle for after-market modifications notably performance tuning and in-vehicle network security measures. We present ACTT: Automotive CAN Tokenization and Translation, a novel, vehicle-agnostic, algorithm that leverages available diagnostic information to parse CAN data into meaningful messages, simultaneously cutting binary messages into tokens, and learning the translation to map these contiguous bits to the value of the vehicle function communicated.
△ Less
Submitted 19 November, 2018;
originally announced November 2018.
-
Defining a Metric Space of Host Logs and Operational Use Cases
Authors:
Miki E. Verma,
Robert A. Bridges
Abstract:
Host logs, in particular, Windows Event Logs, are a valuable source of information often collected by security operation centers (SOCs). The semi-structured nature of host logs inhibits automated analytics, and while manual analysis is common, the sheer volume makes manual inspection of all logs impossible. Although many powerful algorithms for analyzing time-series and sequential data exist, util…
▽ More
Host logs, in particular, Windows Event Logs, are a valuable source of information often collected by security operation centers (SOCs). The semi-structured nature of host logs inhibits automated analytics, and while manual analysis is common, the sheer volume makes manual inspection of all logs impossible. Although many powerful algorithms for analyzing time-series and sequential data exist, utilization of such algorithms for most cyber security applications is either infeasible or requires tailored, research-intensive preparations. In particular, basic mathematic and algorithmic developments for providing a generalized, meaningful similarity metric on system logs is needed to bridge the gap between many existing sequential data mining methods and this currently available but under-utilized data source. In this paper, we provide a rigorous definition of a metric product space on Windows Event Logs, providing an embedding that allows for the application of established machine learning and time-series analysis methods. We then demonstrate the utility and flexibility of this embedding with multiple use-cases on real data: (1) comparing known infected to new host log streams for attack detection and forensics, (2) collapsing similar streams of logs into semantically-meaningful groups (by user, by role), thereby reducing the quantity of data but not the content, (3) clustering logs as well as short sequences of logs to identify and visualize user behaviors and background processes over time. Overall, we provide a metric space framework for general host logs and log sequences that respects semantic similarity and facilitates a wide variety of data science analytics to these logs without data-specific preparations for each.
△ Less
Submitted 1 November, 2018;
originally announced November 2018.