-
LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward
Authors:
Nafis Tanveer Islam,
Joseph Khoury,
Andrew Seong,
Mohammad Bahrami Karkevandi,
Gonzalo De La Torre Parra,
Elias Bou-Harb,
Peyman Najafirad
Abstract:
In software development, the predominant emphasis on functionality often supersedes security concerns, a trend gaining momentum with AI-driven automation tools like GitHub Copilot. These tools significantly improve developers' efficiency in functional code development. Nevertheless, it remains a notable concern that such tools are also responsible for creating insecure code, predominantly because of pre-training on publicly available repositories with vulnerable code. Moreover, developers are called the "weakest link in the chain" since they have very minimal knowledge of code security. Although existing solutions provide a reasonable solution to vulnerable code, they must adequately describe and educate the developers on code security to ensure that the security issues are not repeated. Therefore, we introduce SecRepair, a multipurpose code vulnerability analysis system powered by a large language model, CodeGen2, that assists the developer in identifying vulnerabilities and generating fixed code, along with a complete description of the vulnerability with a code comment. Our innovative methodology uses a reinforcement learning paradigm to generate code comments augmented by a semantic reward mechanism. Inspired by how humans fix code issues, we propose an instruction-based dataset suitable for vulnerability analysis with LLMs. We further identify zero-day and N-day vulnerabilities in 6 open-source IoT operating systems on GitHub. Our findings underscore that incorporating reinforcement learning coupled with semantic reward augments our model's performance, thereby fortifying its capacity to address code vulnerabilities with improved efficacy.
Submitted 21 February, 2024; v1 submitted 6 January, 2024;
originally announced January 2024.
-
Ransomware Detection Using Federated Learning with Imbalanced Datasets
Authors:
Aldin Vehabovic,
Hadi Zanddizari,
Nasir Ghani,
G. Javidi,
S. Uluagac,
M. Rahouti,
E. Bou-Harb,
M. Safaei Pour
Abstract:
Ransomware is a type of malware which encrypts user data and extorts payments in return for the decryption keys. This cyberthreat is one of the most serious challenges facing organizations today and has already caused immense financial damage. As a result, many researchers have been developing techniques to counter ransomware. Recently, the federated learning (FL) approach has also been applied for ransomware analysis, allowing corporations to achieve scalable, effective detection and attribution without having to share their private data. However, in reality there is much variation in the quantity and composition of ransomware data collected across multiple FL client sites/regions. This imbalance will inevitably degrade the effectiveness of any defense mechanisms. To address this concern, a modified FL scheme is proposed using a weighted cross-entropy loss function approach to mitigate dataset imbalance. A detailed performance evaluation study is then presented for the case of static analysis using the latest Windows-based ransomware families. The findings confirm improved ML classifier performance for a highly imbalanced dataset.
Submitted 13 November, 2023;
originally announced November 2023.
-
Federated Learning Approach for Distributed Ransomware Analysis
Authors:
Aldin Vehabovic,
Hadi Zanddizari,
Farook Shaikh,
Nasir Ghani,
Morteza Safaei Pour,
Elias Bou-Harb,
Jorge Crichigno
Abstract:
Researchers have proposed a wide range of ransomware detection and analysis schemes. However, most of these efforts have focused on older families targeting Windows 7/8 systems. Hence there is a critical need to develop efficient solutions to tackle the latest threats, many of which may have relatively fewer samples to analyze. This paper presents a machine learning (ML) framework for early ransomware detection and attribution. The solution pursues a data-centric approach which uses a minimalist ransomware dataset and implements static analysis using portable executable (PE) files. Results for several ML classifiers confirm strong performance in terms of accuracy and zero-day threat detection.
Submitted 24 June, 2023;
originally announced June 2023.
-
IoT Threat Detection Testbed Using Generative Adversarial Networks
Authors:
Farooq Shaikh,
Elias Bou-Harb,
Aldin Vehabovic,
Jorge Crichigno,
Aysegul Yayimli,
Nasir Ghani
Abstract:
The Internet of Things (IoT) paradigm provides persistent sensing and data collection capabilities and is becoming increasingly prevalent across many market sectors. However, most IoT devices emphasize usability and function over security, making them very vulnerable to malicious exploits. This concern is evidenced by the increased use of compromised IoT devices in large-scale bot networks (botnets) to launch distributed denial of service (DDoS) attacks against high-value targets. Unsecured IoT systems can also provide entry points to private networks, allowing adversaries relatively easy access to valuable resources and services. Indeed, these evolving IoT threat vectors (ranging from brute force attacks to remote code execution exploits) are posing key challenges. Moreover, many traditional security mechanisms are not amenable for deployment on smaller resource-constrained IoT platforms. As a result, researchers have been developing a range of methods for IoT security, with many strategies using advanced machine learning (ML) techniques. Along these lines, this paper presents a novel generative adversarial network (GAN) solution to detect threats from malicious IoT devices both inside and outside a network. This model is trained using both benign IoT traffic and global darknet data and further evaluated in a testbed with real IoT devices and malware threats.
Submitted 24 May, 2023;
originally announced May 2023.
-
Data-Centric Machine Learning Approach for Early Ransomware Detection and Attribution
Authors:
Aldin Vehabovic,
Hadi Zanddizari,
Nasir Ghani,
Farooq Shaikh,
Elias Bou-Harb,
Morteza Safaei Pour,
Jorge Crichigno
Abstract:
Researchers have proposed a wide range of ransomware detection and analysis schemes. However, most of these efforts have focused on older families targeting Windows 7/8 systems. Hence there is a critical need to develop efficient solutions to tackle the latest threats, many of which may have relatively fewer samples to analyze. This paper presents a machine learning (ML) framework for early ransomware detection and attribution. The solution pursues a data-centric approach which uses a minimalist ransomware dataset and implements static analysis using portable executable (PE) files. Results for several ML classifiers confirm strong performance in terms of accuracy and zero-day threat detection.
Submitted 22 May, 2023;
originally announced May 2023.
-
An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph
Authors:
Nafis Tanveer Islam,
Gonzalo De La Torre Parra,
Dylan Manuel,
Elias Bou-Harb,
Peyman Najafirad
Abstract:
Over the years, open-source software systems have become prey to threat actors. Even as open-source communities act quickly to patch the breach, code vulnerability screening should be an integral part of agile software development from the beginning. Unfortunately, current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or providing developers with code vulnerability and classification. Furthermore, the datasets used for vulnerability learning often exhibit distribution shifts from the real-world testing distribution due to novel attack strategies deployed by adversaries and, as a result, the machine learning model's performance may be hindered or biased. To address these issues, we propose a joint interpolated multitasked unbiased vulnerability classifier comprising a transformer "RoBERTa" and graph convolution neural network (GCN). We present a training process utilizing a semantic vulnerability graph (SVG) representation from source code, created by integrating edges from a sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF). Poacher flow edges reduce the gap between dynamic and static program analysis and handle complex long-range dependencies. Moreover, our approach reduces biases of classifiers regarding unbalanced datasets by integrating the Focal Loss objective function along with SVG. Remarkably, experimental results show that our classifier outperforms state-of-the-art results on vulnerability detection with fewer false negatives and false positives. After testing our model across multiple datasets, it shows an improvement of at least 2.41% and 18.75% in the best-case scenario. Evaluations using N-day program samples demonstrate that our proposed approach achieves a 93% accuracy and was able to detect 4 zero-day vulnerabilities from popular GitHub repositories.
Submitted 17 April, 2023;
originally announced April 2023.
-
Ransomware Detection and Classification Strategies
Authors:
Aldin Vehabovic,
Nasir Ghani,
Elias Bou-Harb,
Jorge Crichigno,
Aysegul Yayimli
Abstract:
Ransomware uses encryption methods to make data inaccessible to legitimate users. To date, a wide range of ransomware families have been developed and deployed, causing immense damage to governments, corporations, and private users. As these cyberthreats multiply, researchers have proposed a range of ransomware detection and classification schemes. Most of these methods use advanced machine learning techniques to process and analyze real-world ransomware binaries and action sequences. Hence this paper presents a survey of this critical space and classifies existing solutions into several categories, i.e., network-based, host-based, forensic characterization, and authorship attribution. Key facilities and tools for ransomware analysis are also presented along with open challenges.
Submitted 10 April, 2023;
originally announced April 2023.
-
An Exhaustive Survey on P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends
Authors:
Elie F. Kfoury,
Jorge Crichigno,
Elias Bou-Harb
Abstract:
Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hardcoded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with the P4 language; surveying, classifying, and analyzing more than 150 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.
Submitted 7 June, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
Improving Borderline Adulthood Facial Age Estimation through Ensemble Learning
Authors:
Felix Anda,
David Lillis,
Aikaterini Kanta,
Brett A. Becker,
Elias Bou-Harb,
Nhien-An Le-Khac,
Mark Scanlon
Abstract:
Achieving high performance for facial age estimation with subjects in the borderline between adulthood and non-adulthood has always been a challenge. Several studies have used different approaches from the age of a baby to an elder adult and different datasets have been employed to measure the mean absolute error (MAE), ranging from 1.47 to 8 years. The weakness of the algorithms specifically in the borderline has been a motivation for this paper. In our approach, we have developed an ensemble technique that improves the accuracy of underage estimation in conjunction with our deep learning model (DS13K) that has been fine-tuned on the Deep Expectation (DEX) model. We have achieved an accuracy of 68% for the age group 16 to 17 years old, which is 4 times better than the DEX accuracy for such age range. We also present an evaluation of existing cloud-based and offline facial age prediction services, such as Amazon Rekognition, Microsoft Azure Cognitive Services, How-Old.net and DEX.
Submitted 2 July, 2019;
originally announced July 2019.
-
Cross-Layer Authentication Protocol Design for Ultra-Dense 5G HetNets
Authors:
Christian Miranda,
Georges Kaddoum,
Elias Bou-Harb
Abstract:
Creating a secure environment for communications is becoming a significantly challenging task in 5G Heterogeneous Networks (HetNets) given the stringent latency and high capacity requirements of 5G networks. This is particularly true given that the infrastructure tends to be highly diversified, especially with the continuous deployment of small cells. In fact, frequent handovers in these cells introduce unnecessarily recurring authentications leading to increased latency. In this paper, we propose a software-defined wireless network (SDWN)-enabled fast cross-authentication scheme which combines non-cryptographic and cryptographic algorithms to address the challenges of latency and weak security. Initially, the received radio signal strength vectors at the mobile terminal (MT) are used as a fingerprinting source to generate an unpredictable secret key. Subsequently, a cryptographic mechanism based upon the authentication and key agreement protocol and employing the generated secret key is performed in order to improve the confidentiality and integrity of the authentication handover. Further, we propose a radio trusted zone database aiming to enhance the frequent authentication of radio devices which are present in the network. In order to reduce recurring authentications, a given covered area is divided into trusted zones where each zone contains more than one small cell, thus permitting the MT to initiate a single authentication request per zone, even if it keeps roaming between different cells. The proposed scheme is analyzed under different attack scenarios and its complexity is compared with cryptographic and non-cryptographic approaches to demonstrate its security resilience and computational efficiency.
Submitted 5 February, 2018;
originally announced February 2018.
-
Towards the Leveraging of Data Deduplication to Break the Disk Acquisition Speed Limit
Authors:
Hannah Wolahan,
Claudio Chico Lorenzo,
Elias Bou-Harb,
Mark Scanlon
Abstract:
Digital forensic evidence acquisition speed is traditionally limited by two main factors: the read speed of the storage device being investigated (i.e., the read speed of the disk, memory, remote storage, mobile device, etc.), and the write speed of the system used for storing the acquired data. Digital forensic investigators can somewhat mitigate the latter issue through the use of high-speed storage options, such as networked RAID storage, in the controlled environment of the forensic laboratory. However, traditionally, little can be done to improve the acquisition speed past its physical read speed from the target device itself. The protracted time taken for data acquisition wastes digital forensic experts' time, contributes to digital forensic investigation backlogs worldwide, and delays pertinent information from potentially influencing the direction of an investigation. In a remote acquisition scenario, a third contributing factor can also become a detriment to the overall acquisition time: typically the Internet upload speed of the acquisition system. This paper explores an alternative to the traditional evidence acquisition model through the leveraging of a forensic data deduplication system. The advantages that a deduplicated approach can provide over the current digital forensic evidence acquisition process are outlined and some preliminary results of a prototype implementation are discussed.
Submitted 20 October, 2016; v1 submitted 18 October, 2016;
originally announced October 2016.
-
Fingerprinting Internet DNS Amplification DDoS Activities
Authors:
Claude Fachkha,
Elias Bou-Harb,
Mourad Debbabi
Abstract:
This work proposes a novel approach to infer and characterize Internet-scale DNS amplification DDoS attacks by leveraging the darknet space. Complementary to the pioneering work on inferring Distributed Denial of Service (DDoS) activities using darknet, this work shows that we can extract DDoS activities without relying on backscattered analysis. The aim of this work is to extract cyber security intelligence related to DNS amplification DDoS activities such as detection period, attack duration, intensity, packet size, rate and geo-location in addition to various network-layer and flow-based insights. To achieve this task, the proposed approach exploits certain DDoS parameters to detect the attacks. We empirically evaluate the proposed approach using 720 GB of real darknet data collected from a /13 address space during a recent three-month period. Our analysis reveals that the approach was successful in inferring significant DNS amplification DDoS activities including the recent prominent attack that targeted one of the largest anti-spam organizations. Moreover, the analysis disclosed the mechanism of such DNS amplification DDoS attacks. Further, the results uncover high-speed and stealthy attempts that were never previously documented. The case study of the largest DDoS attack in history led to a better understanding of the nature and scale of this threat and can generate inferences that could contribute to detecting, preventing, assessing, mitigating and even attributing DNS amplification DDoS activities.
Submitted 5 November, 2013; v1 submitted 15 October, 2013;
originally announced October 2013.