-
Honeyfile Camouflage: Hiding Fake Files in Plain Sight
Authors:
Roelien C. Timmer,
David Liebowitz,
Surya Nepal,
Salil S. Kanhere
Abstract:
Honeyfiles are a particularly useful type of honeypot: fake files deployed to detect and infer information from malicious behaviour. This paper considers the challenge of naming honeyfiles so they are camouflaged when placed amongst real files in a file system. Based on cosine distances in semantic vector spaces, we develop two metrics for filename camouflage: one based on simple averaging and one on clustering with mixture fitting. We evaluate and compare the metrics, showing that both perform well on a publicly available GitHub software repository dataset.
Submitted 10 May, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
NASA Science Mission Directorate Knowledge Graph Discovery
Authors:
Roelien C. Timmer,
Fech Scen Khoo,
Megan Mark,
Marcella Scoczynski Ribeiro Martins,
Anamaria Berea,
Gregory Renard,
Kaylin Bugbee
Abstract:
The size of the National Aeronautics and Space Administration (NASA) Science Mission Directorate (SMD) is growing exponentially, allowing researchers to make discoveries. However, making discoveries is challenging and time-consuming due to the size of the data catalogs, and as many concepts and data are indirectly connected. This paper proposes a pipeline to generate knowledge graphs (KGs) representing different NASA SMD domains. These KGs can be used as the basis for dataset search engines, saving researchers time and supporting them in finding new connections. We collected textual data and used several modern natural language processing (NLP) methods to create the nodes and the edges of the KGs. We explore the cross-domain connections, discuss our challenges, and provide future directions to inspire researchers working on similar challenges.
Submitted 20 March, 2023;
originally announced March 2023.
-
Deception for Cyber Defence: Challenges and Opportunities
Authors:
David Liebowitz,
Surya Nepal,
Kristen Moore,
Cody J. Christopher,
Salil S. Kanhere,
David Nguyen,
Roelien C. Timmer,
Michael Longland,
Keerth Rathakumar
Abstract:
Deception is rapidly growing as an important tool for cyber defence, complementing existing perimeter security measures to rapidly detect breaches and data theft. One of the factors limiting the use of deception has been the cost of generating realistic artefacts by hand. Recent advances in Machine Learning have, however, created opportunities for scalable, automated generation of realistic deceptions. This vision paper describes the opportunities and challenges involved in developing models to mimic many common elements of the IT stack for deception effects.
Submitted 15 August, 2022;
originally announced August 2022.
-
TSM: Measuring the Enticement of Honeyfiles with Natural Language Processing
Authors:
Roelien C. Timmer,
David Liebowitz,
Surya Nepal,
Salil Kanhere
Abstract:
Honeyfile deployment is a useful breach detection method in cyber deception that can also inform defenders about the intent and interests of intruders and malicious insiders. A key property of a honeyfile, enticement, is the extent to which the file can attract an intruder to interact with it. We introduce a novel metric, Topic Semantic Matching (TSM), which uses topic modelling to represent files in the repository and semantic matching in an embedding vector space to compare honeyfile text and topic words robustly. We also present a honeyfile corpus created with different Natural Language Processing (NLP) methods. Experiments show that TSM is effective in inter-corpus comparisons and is a promising tool to measure the enticement of honeyfiles. TSM is the first measure to use NLP techniques to quantify the enticement of honeyfile content that compares the essential topical content of local contexts to honeyfiles and is robust to paraphrasing.
Submitted 14 March, 2022;
originally announced March 2022.
-
Can pre-trained Transformers be used in detecting complex sensitive sentences? -- A Monsanto case study
Authors:
Roelien C. Timmer,
David Liebowitz,
Surya Nepal,
Salil S. Kanhere
Abstract:
Each and every organisation releases information in a variety of forms ranging from annual reports to legal proceedings. Such documents may contain sensitive information and releasing them openly may lead to the leakage of confidential information. Detection of sentences that contain sensitive information in documents can help organisations prevent the leakage of valuable confidential information. This is especially challenging when such sentences contain a substantial amount of information or are paraphrased versions of known sensitive content. Current approaches to sensitive information detection in such complex settings are based on keyword-based approaches or standard machine learning models. In this paper, we wish to explore whether pre-trained transformer models are well suited to detect complex sensitive information. Pre-trained transformers are typically trained on an enormous amount of text and therefore readily learn grammar, structure and other linguistic features, making them particularly attractive for this task. Through our experiments on the Monsanto trial data set, we observe that the fine-tuned Bidirectional Encoder Representations from Transformers (BERT) transformer model performs better than traditional models. We experimented with four different categories of documents in the Monsanto dataset and observed that BERT achieves better F2 scores by 24.13% to 65.79% for GHOST, 30.14% to 54.88% for TOXIC, 39.22% for CHEMI, 53.57% for REGUL compared to existing sensitive information detection models.
Submitted 13 March, 2022;
originally announced March 2022.
-
Streaming readout for next generation electron scattering experiment
Authors:
Fabrizio Ameli,
Marco Battaglieri,
Vladimir V. Berdnikov,
Mariangela Bondì,
Sergey Boyarinov,
Nathan Brei,
Laura Cappelli,
Andrea Celentano,
Tommaso Chiarusi,
Raffaella De Vita,
Cristiano Fanelli,
Vardan Gyurjyan,
David Lawrence,
Patrick Moran,
Paolo Musico,
Carmelo Pellegrino,
Alessandro Pilloni,
Ben Raydo,
Carl Timmer,
Maurizio Ungaro,
Simone Vallarino
Abstract:
Current and future experiments at the high intensity frontier are expected to produce an enormous amount of data that needs to be collected and stored for offline analysis. Thanks to the continuous progress in computing and networking technology, it is now possible to replace the standard 'triggered' data acquisition systems with a new, simplified and outperforming scheme. 'Streaming readout' (SRO) DAQ aims to replace the hardware-based trigger with a much more powerful and flexible software-based one, that considers the whole detector information for efficient real-time data tagging and selection. Considering the crucial role of DAQ in an experiment, validation with on-field tests is required to demonstrate SRO performance. In this paper we report results of the on-beam validation of the Jefferson Lab SRO framework. We exposed different detectors (PbWO4-based electromagnetic calorimeters and a plastic scintillator hodoscope) to the Hall-D electron-positron secondary beam and to the Hall-B production electron beam, with increasingly complex experimental conditions. By comparing the data collected with the SRO system against the traditional DAQ, we demonstrate that the SRO performs as expected. Furthermore, we provide evidence of its superiority in implementing sophisticated AI-supported algorithms for real-time data analysis and reconstruction.
Submitted 7 February, 2022;
originally announced February 2022.
-
SAMPA Based Streaming Readout Data Acquisition Prototype
Authors:
E. Jastrzembski,
D. Abbott,
J. Gu,
V. Gyurjyan,
G. Heyes,
B. Moffit,
E. Pooser,
C. Timmer,
A. Hellman
Abstract:
We have assembled a small-scale streaming data acquisition system based on the SAMPA front-end ASIC. We report on measurements performed on the SAMPA chip and preliminary cosmic ray data acquired from a Gas Electron Multiplier (GEM) detector read out using the SAMPA.
Submitted 2 November, 2020;
originally announced November 2020.
-
FIPA agent based network distributed control system
Authors:
V. Gyurjyan,
D. Abbott,
G. Heyes,
E. Jastrzembski,
C. Timmer,
E. Wolin
Abstract:
A control system with the capabilities to combine heterogeneous control systems or processes into a uniform homogeneous environment is discussed. This dynamically extensible system is an example of the software system at the agent level of abstraction. This level of abstraction considers agents as atomic entities that communicate to implement the functionality of the control system. Agent engineering aspects are addressed by adopting the domain independent software standard, formulated by FIPA. Jade core Java classes are used as a FIPA specification implementation. A special, lightweight, XML RDFS based, control oriented, ontology markup language is developed to standardize the description of the arbitrary control system data processor. Control processes, described in this language, are integrated into the global system at runtime, without actual programming. Fault tolerance and recovery issues are also addressed.
Submitted 12 May, 2003;
originally announced May 2003.