
doi 10.1016/j.eswa.2022.118590

Fast & Furious: Modelling Malware Detection as Evolving Data Streams

Authors: Fabrício Ceschin, Marcus Botacin, Heitor Murilo Gomes, Felipe Pinagé, Luiz S. Oliveira, André Grégio

Abstract: Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted machine learning (ML) pipelines. However, malware developers unceasingly change their samples' features to bypass detection. This constant evolution of malware samples causes changes to the data distribution (i.e., concept drifts) that directly affect ML model detection rates, something not considered in the majority of the literature. In this work, we evaluate the impact of concept drift on malware classifiers for two Android datasets: DREBIN (about 130K apps) and a subset of AndroZoo (about 285K apps). We used these datasets to train an Adaptive Random Forest (ARF) classifier, as well as a Stochastic Gradient Descent (SGD) classifier. We also ordered all dataset samples using their VirusTotal submission timestamp and then extracted features from their textual attributes using two algorithms (Word2Vec and TF-IDF). Then, we conducted experiments comparing both feature extractors, classifiers, as well as four drift detectors (DDM, EDDM, ADWIN, and KSWIN) to determine the best approach for real environments. Finally, we compare some possible approaches to mitigate concept drift and propose a novel data stream pipeline that updates both the classifier and the feature extractor.
To do so, we conducted a longitudinal evaluation by (i) classifying malware samples collected over nine years (2009-2018), (ii) reviewing concept drift detection algorithms to attest its pervasiveness, (iii) comparing distinct ML approaches to mitigate the issue, and (iv) proposing an ML data stream pipeline that outperformed literature approaches.

Submitted 15 August, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

arXiv:2109.06127 [pdf, ps, other]

Malware MultiVerse: From Automatic Logic Bomb Identification to Automatic Patching and Tracing

Authors: Marcus Botacin, André Grégio

Abstract: Malware and other suspicious software often hide behaviors and components behind logic bombs and context-sensitive execution paths. Uncovering these is essential to react against modern threats, but current solutions are not ready to detect these paths in a completely automated manner. To bridge this gap, we propose the Malware Multiverse (MalVerse), a solution able to inspect multiple execution paths via symbolic execution aiming to discover function inputs and returns that trigger malicious behaviors. MalVerse automatically patches the context-sensitive functions with the identified symbolic values to allow the software execution in a traditional sandbox. We implemented MalVerse on top of angr and evaluated it with a set of Linux and Windows evasive samples. We found that MalVerse was able to generate automatic patches for the most common evasion techniques (e.g., ptrace checks).

Submitted 13 September, 2021; originally announced September 2021.

arXiv:2109.06068 [pdf, other]

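A [in]Segurança dos Sistemas Governamentais Brasileiros: Um Estudo de Caso em Sistemas Web e Redes Abertas (English: The [In]Security of Brazilian Government Systems: A Case Study on Web Systems and Open Networks)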

Authors: Marcus Botacin, André Grégio

Abstract: Whereas the world relies on computer systems for providing public services, there is a lack of academic work that systematically assesses the security of government systems. To partially fill this gap, we conducted a security evaluation of publicly available systems from public institutions. We revisited OWASP top-10 and identified multiple vulnerabilities in deployed services by scanning public government networks. Overall, the unprotected services found have an inadequate security level, which must be properly discussed and addressed.

Submitted 13 September, 2021; originally announced September 2021.

Comments: in Portuguese

arXiv:2105.09900 [pdf, other]

Online Binary Models are Promising for Distinguishing Temporally Consistent Computer Usage Profiles

Authors: Luiz Giovanini, Fabrício Ceschin, Mirela Silva, Aokun Chen, Ramchandra Kulkarni, Sanjay Banda, Madison Lysaght, Heng Qiao, Nikolaos Sapountzis, Ruimin Sun, Brandon Matthews, Dapeng Oliver Wu, André Grégio, Daniela Oliveira

Abstract: This paper investigates whether computer usage profiles comprised of process-, network-, mouse-, and keystroke-related events are unique and consistent over time in a naturalistic setting, discussing challenges and opportunities of using such profiles in applications of continuous authentication. We collected ecologically-valid computer usage profiles from 31 MS Windows 10 computer users over 8 weeks and submitted this data to comprehensive machine learning analysis involving a diverse set of online and offline classifiers. We found that: (i) profiles were mostly consistent over the 8-week data collection period, with most (83.9%) repeating computer usage habits on a daily basis; (ii) computer usage profiling has the potential to uniquely characterize computer users (with a maximum F-score of 99.90%); (iii) network-related events were the most relevant features to accurately recognize profiles (95.69% of the top features distinguishing users were network-related); and (iv) binary models were the most well-suited for profile recognition, with better results achieved in the online setting compared to the offline setting (maximum F-score of 99.90% vs. 95.50%).

Submitted 2 September, 2021; v1 submitted 20 May, 2021; originally announced May 2021.

arXiv:2012.02164 [pdf, other]

People Still Care About Facts: Twitter Users Engage More with Factual Discourse than Misinformation--A Comparison Between COVID and General Narratives on Twitter

Authors: Mirela Silva, Fabrício Ceschin, Prakash Shrestha, Christopher Brant, Shlok Gilda, Juliana Fernandes, Catia S. Silva, André Grégio, Daniela Oliveira, Luiz Giovanini

Abstract: Misinformation entails the dissemination of falsehoods that leads to the slow fracturing of society via decreased trust in democratic processes, institutions, and science. The public has grown aware of the role of social media as a superspreader of untrustworthy information, where even pandemics have not been immune. In this paper, we focus on COVID-19 misinformation and examine a subset of 2.1M tweets to understand misinformation as a function of engagement, tweet content (COVID-19- vs. non-COVID-19-related), and veracity (misleading or factual). Using correlation analysis, we show the most relevant feature subsets among over 126 features that most heavily correlate with misinformation or facts. We found that (i) factual tweets, regardless of whether COVID-related, were more engaging than misinformation tweets; and (ii) features that most heavily correlated with engagement varied depending on the veracity and content of the tweet.

Submitted 9 September, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

Comments: 22 pages

arXiv:2010.16045 [pdf, other]

doi 10.1145/3617897

Machine Learning (In) Security: A Stream of Problems

Authors: Fabrício Ceschin, Marcus Botacin, Albert Bifet, Bernhard Pfahringer, Luiz S. Oliveira, Heitor Murilo Gomes, André Grégio

Abstract: Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is concept drift, which increases the existing arms race between attackers and defenders: malicious actors can always create novel threats to overcome the defense solutions, which may not consider them in some approaches. Due to this, it is essential to know how to properly build and evaluate an ML-based security solution. In this paper, we identify, detail, and discuss the main challenges in the correct application of ML techniques to cybersecurity data. We evaluate how concept drift, evolution, delayed labels, and adversarial ML impact the existing solutions. Moreover, we address how issues related to data collection affect the quality of the results presented in the security literature, showing that new strategies are needed to improve current solutions. Finally, we present how existing solutions may fail under certain circumstances, and propose mitigations to them, presenting a novel checklist to help the development of future ML solutions for cybersecurity.

Submitted 4 September, 2023; v1 submitted 29 October, 2020; originally announced October 2020.

Journal ref: Digital Threats 2023

arXiv:1802.02503 [pdf, other]

A Praise for Defensive Programming: Leveraging Uncertainty for Effective Malware Mitigation

Authors: Ruimin Sun, Marcus Botacin, Nikolaos Sapountzis, Xiaoyong Yuan, Matt Bishop, Donald E Porter, Xiaolin Li, Andre Gregio, Daniela Oliveira

Abstract: A promising avenue for improving the effectiveness of behavioral-based malware detectors would be to combine fast traditional machine learning detectors with high-accuracy, but time-consuming deep learning models. The main idea would be to place software receiving borderline classifications by traditional machine learning methods in an environment where uncertainty is added, while software is analyzed by more time-consuming deep learning models. The goal of uncertainty would be to rate-limit actions of potential malware during the time-consuming deep analysis. In this paper, we present a detailed description of the analysis and implementation of CHAMELEON, a framework for realizing this uncertain environment for Linux. CHAMELEON offers two environments for software: (i) standard - for any software identified as benign by conventional machine learning methods and (ii) uncertain - for software receiving borderline classifications when analyzed by these conventional machine learning methods. The uncertain environment adds obstacles to software execution through random perturbations applied probabilistically on selected system calls. We evaluated CHAMELEON with 113 applications and 100 malware samples for Linux. Our results showed that at threshold 10%, intrusive and non-intrusive strategies caused approximately 65% of malware to fail to accomplish their tasks, while approximately 30% of the analyzed benign software met with various levels of disruption.
With a dynamic, per-system call threshold, CHAMELEON caused 92% of the malware to fail, and only 10% of the benign software to be disrupted. We also found that I/O-bound software was three times more affected by uncertainty than CPU-bound software. Further, we analyzed the logs of software that crashed under non-intrusive strategies, and found that some crashes were due to software bugs.

Submitted 12 June, 2020; v1 submitted 7 February, 2018; originally announced February 2018.

Journal ref: IEEE Transactions on Dependable and Secure Computing 2020

arXiv:1712.01145 [pdf, other]

Learning Fast and Slow: PROPEDEUTICA for Real-time Malware Detection

Authors: Ruimin Sun, Xiaoyong Yuan, Pan He, Qile Zhu, Aokun Chen, Andre Gregio, Daniela Oliveira, Xiaolin Li

Abstract: Existing malware detectors on safety-critical devices have difficulties in runtime detection due to the performance overhead. In this paper, we introduce PROPEDEUTICA, a framework for efficient and effective real-time malware detection, leveraging the best of conventional machine learning (ML) and deep learning (DL) techniques. In PROPEDEUTICA, all software is considered benign when it starts execution and is monitored by a conventional ML classifier for fast detection. If the software receives a borderline classification from the ML detector (e.g., the software is 50% likely to be benign and 50% likely to be malicious), the software will be transferred to a more accurate, yet performance-demanding DL detector. To address spatial-temporal dynamics and software execution heterogeneity, we introduce a novel DL architecture (DEEPMALWARE) for PROPEDEUTICA with multi-stream inputs. We evaluated PROPEDEUTICA with 9,115 malware samples and 1,338 benign software samples from various categories for the Windows OS. With a borderline interval of [30%-70%], PROPEDEUTICA achieves an accuracy of 94.34% and a false-positive rate of 8.75%, with 41.45% of the samples moved for DEEPMALWARE analysis. Even using only CPU, PROPEDEUTICA can detect malware in less than 0.1 seconds.

Submitted 17 October, 2021; v1 submitted 4 December, 2017; originally announced December 2017.

Comments: 12 pages, 4 figures. This paper has been accepted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

Showing 1–8 of 8 results for author: Grégio, A