-
Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance
Authors:
Savino Dambra,
Yufei Han,
Simone Aonzo,
Platon Kotzias,
Antonino Vitale,
Juan Caballero,
Davide Balzarotti,
Leyla Bilge
Abstract:
Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influence performance, and how well static and dynamic features complement each other.
This work sheds light on those open questions by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features. We discover no correlation between packing and classification accuracy, and that missing behaviors in dynamically-extracted features highly penalize their performance. We also demonstrate how a larger number of families to classify makes the classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family better generalize on unseen data.
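As an illustration of what a balanced dataset with a fixed number of samples per family means in practice, the following is a minimal sketch and not the authors' pipeline. The function name `balanced_subset`, the `(sample_id, family)` input format, and the choice to drop families with too few samples are all assumptions made for this example.

```python
import random
from collections import defaultdict


def balanced_subset(samples, per_family, seed=0):
    """Pick exactly `per_family` samples for every family that has enough.

    `samples` is an iterable of (sample_id, family) pairs (hypothetical format).
    Families with fewer than `per_family` samples are dropped (an assumption).
    """
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    subset = {}
    for family, ids in sorted(by_family.items()):
        if len(ids) >= per_family:
            # Sample without replacement so no sample appears twice.
            subset[family] = rng.sample(ids, per_family)
    return subset


# Toy data: family "a" has 5 samples, "b" has 3, "c" has only 2.
toy = [(f"a{i}", "a") for i in range(5)]
toy += [(f"b{i}", "b") for i in range(3)]
toy += [(f"c{i}", "c") for i in range(2)]
balanced = balanced_subset(toy, per_family=3)
```

With `per_family=100` and enough labeled samples, the same idea would yield a uniform distribution of samples across families, which is the property the abstract associates with better generalization.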
Submitted 27 July, 2023;
originally announced July 2023.
-
Quantifying Carbon Emissions due to Online Third-Party Tracking
Authors:
Michalis Pachilakis,
Savino Dambra,
Iskander Sanchez-Rola,
Leyla Bilge
Abstract:
In the past decade, global warming made several headlines and drew the attention of the whole world. Carbon footprint is the main factor that drives greenhouse emissions up and results in the temperature increase of the planet with dire consequences. While the attention of the public is turned to reducing carbon emissions by transportation, food consumption and household activities, we ignore the contribution of CO2eq emissions produced by online activities. In the current information era, we spend a large part of our days browsing online. This activity consumes electricity, which in turn produces CO2eq. While website browsing contributes to the production of greenhouse gas emissions, the impact of the Internet on the environment is further exacerbated by the web-tracking practice. Indeed, most webpages are heavily loaded with tracking content used mostly for advertising, data analytics and usability improvements. This extra content implies large data transmissions, which result in higher electricity consumption and thus higher greenhouse gas emissions. In this work, we focus on the overhead caused by web tracking and analyse both its network and carbon footprint. By leveraging the browsing telemetry of 100k users and the results of a crawling experiment of 2.7M websites, we find that web tracking increases data transmissions by upwards of 21%, which in turn implies the additional emission of around 11 Mt of greenhouse gases in the atmosphere every year. We find such contribution to be far from negligible, and comparable to many activities of modern life, such as meat production, transportation, and even cryptocurrency mining. Our study also highlights that there exist significant inequalities when considering the footprint of different countries, website categories, and tracking organizations, with a few actors contributing to a much greater extent than the remaining ones.
Submitted 3 April, 2023;
originally announced April 2023.
-
One Size Does not Fit All: Quantifying the Risk of Malicious App Encounters for Different Android User Profiles
Authors:
Savino Dambra,
Leyla Bilge,
Platon Kotzias,
Yun Shen,
Juan Caballero
Abstract:
Previous work has investigated the particularities of security practices within specific user communities defined based on country of origin, age, prior tech abuse, and economic status. Their results highlight that current security solutions that adopt a one-size-fits-all-users approach ignore the differences and needs of particular user communities. However, those works focus on a single community or cluster users into hard-to-interpret sub-populations.
In this work, we perform a large-scale quantitative analysis of the risk of encountering malware and other potentially unwanted applications (PUA) across user communities. At the core of our study is a dataset of app installation logs collected from 12M Android mobile devices. Leveraging user-installed apps, we define intuitive profiles based on users' interests (e.g., gamers and investors), and fit a subset of 5.4M devices to those profiles. Our analysis is structured in three parts. First, we perform risk analysis on the whole population to measure how the risk of malicious app encounters is affected by different factors. Next, we create different profiles to investigate whether risk differences across users may be due to their interests. Finally, we compare a per-profile approach for classifying clean and infected devices with the classical approach that considers the whole population.
We observe that features such as the diversity of the app signers and the use of alternative markets highly correlate with the risk of malicious app encounters. We also discover that some profiles such as gamers and social-media users are exposed to more than twice the risk experienced by the average user. We also show that the classification outcome has a marked accuracy improvement when using a per-profile approach to train the prediction models. Overall, our results confirm the inadequacy of one-size-fits-all protection solutions.
Submitted 18 January, 2023;
originally announced January 2023.
-
Unsupervised Detection and Clustering of Malicious TLS Flows
Authors:
Gibran Gomez,
Platon Kotzias,
Matteo Dell'Amico,
Leyla Bilge,
Juan Caballero
Abstract:
Malware abuses TLS to encrypt its malicious traffic, preventing examination by content signatures and deep packet inspection. Network detection of malicious TLS flows is an important, but challenging, problem. Prior works have proposed supervised machine learning detectors using TLS features. However, by trying to represent all malicious traffic, supervised binary detectors produce models that are too loose, thus introducing errors. Furthermore, they do not distinguish flows generated by different malware. On the other hand, supervised multi-class detectors produce tighter models and can classify flows by malware family, but require family labels, which are not available for many samples.
To address these limitations, this work proposes a novel unsupervised approach to detect and cluster malicious TLS flows. Our approach takes as input network traces from sandboxes. It clusters similar TLS flows using 90 features that capture properties of the TLS client, TLS server, certificate, and encrypted payload; and uses the clusters to build an unsupervised detector that can assign a malicious flow to the cluster it belongs to, or determine it is benign. We evaluate our approach using 972K traces from a commercial sandbox and 35M TLS flows from a research network. Our clustering shows very high precision and recall with an F1 score of 0.993. We compare our unsupervised detector with two state-of-the-art approaches, showing that it outperforms both. The false detection rate of our detector is 0.032% measured over four months of traffic.
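For readers unfamiliar with the metric, the F1 score quoted above is the harmonic mean of precision and recall. The sketch below shows only that arithmetic from raw counts; the abstract does not say how precision and recall are computed over clusters, so the function `f1_score` and its true-positive/false-positive/false-negative inputs are assumptions for illustration.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts.

    tp, fp, fn: true positives, false positives, false negatives.
    Returns 0.0 when precision and recall are both undefined or zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: 8 correct detections, 2 false alarms, 2 misses.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```

Because the harmonic mean is dominated by the smaller of its two inputs, an F1 close to 1 requires both precision and recall to be high.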
Submitted 23 December, 2022; v1 submitted 8 September, 2021;
originally announced September 2021.
-
How Did That Get In My Phone? Unwanted App Distribution on Android Devices
Authors:
Platon Kotzias,
Juan Caballero,
Leyla Bilge
Abstract:
Android is the most popular operating system with billions of active devices. Unfortunately, its popularity and openness make it attractive for unwanted apps, i.e., malware and potentially unwanted programs (PUP). In Android, app installations typically happen via the official and alternative markets, but also via other smaller and less understood alternative distribution vectors such as Web downloads, pay-per-install (PPI) services, backup restoration, bloatware, and IM tools. This work performs a thorough investigation of unwanted app distribution by quantifying and comparing distribution through different vectors. At the core of our measurements are reputation logs of a large security vendor, which include 7.9M apps observed on 12M devices between June and September 2019. As a first step, we measure that between 10% and 24% of user devices encounter at least one unwanted app, and compare the prevalence of malware and PUP. An analysis of the who-installs-who relationships between installers and child apps reveals that the Play market is the main app distribution vector, responsible for 87% of all installs and 67% of unwanted app installs, but it also has the best defenses against unwanted apps. Alternative markets instead distribute 5.7% of all apps, but over 10% of unwanted apps. Bloatware is also a significant unwanted app distribution vector with 6% of those installs. In addition, backup restoration is an unintentional distribution vector that may even allow unwanted apps to survive users' phone replacement. We estimate unwanted app distribution via PPI to be smaller than on Windows. Finally, we observe that Web downloads are rare, but provide a riskier proposition even compared to alternative markets.
Submitted 20 October, 2020;
originally announced October 2020.