Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance
Authors:
Savino Dambra,
Yufei Han,
Simone Aonzo,
Platon Kotzias,
Antonino Vitale,
Juan Caballero,
Davide Balzarotti,
Leyla Bilge
Abstract:
Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware cl…
▽ More
Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influence performance, and how well static and dynamic features complement each other.
This work sheds light on those open questions. by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features. We discover no correlation between packing and classification accuracy, and that missing behaviors in dynamically-extracted features highly penalize their performance. We also demonstrate how a larger number of families to classify make the classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family better generalize on unseen data.
△ Less
Submitted 27 July, 2023;
originally announced July 2023.
Event-driven Vision and Control for UAVs on a Neuromorphic Chip
Authors:
Antonio Vitale,
Alpha Renner,
Celine Nauer,
Davide Scaramuzza,
Yulia Sandamirskaya
Abstract:
Event-based vision sensors achieve up to three orders of magnitude better speed vs. power consumption trade off in high-speed control of UAVs compared to conventional image sensors. Event-based cameras produce a sparse stream of events that can be processed more efficiently and with a lower latency than images, enabling ultra-fast vision-driven control. Here, we explore how an event-based vision a…
▽ More
Event-based vision sensors achieve up to three orders of magnitude better speed vs. power consumption trade off in high-speed control of UAVs compared to conventional image sensors. Event-based cameras produce a sparse stream of events that can be processed more efficiently and with a lower latency than images, enabling ultra-fast vision-driven control. Here, we explore how an event-based vision algorithm can be implemented as a spiking neuronal network on a neuromorphic chip and used in a drone controller. We show how seamless integration of event-based perception on chip leads to even faster control rates and lower latency. In addition, we demonstrate how online adaptation of the SNN controller can be realised using on-chip learning. Our spiking neuronal network on chip is the first example of a neuromorphic vision-based controller solving a high-speed UAV control task. The excellent scalability of processing in neuromorphic hardware opens the possibility to solve more challenging visual tasks in the future and integrate visual perception in fast control loops.
△ Less
Submitted 19 August, 2021; v1 submitted 8 August, 2021;
originally announced August 2021.
Similarità per la ricerca del dominio di una frase
Authors:
Massimiliano Morrelli,
Giacomo Pansini,
Massimiliano Polito,
Arturo Vitale
Abstract:
English. This document aims to study the best algorithms to verify the belonging of a specific document to a related domain by comparing different methods for calculating the distance between two vectors. This study has been made possible with the help of the structures made available by the Apache Spark framework. Starting from the study illustrated in the publication "New frontier of textual cla…
▽ More
English. This document aims to study the best algorithms to verify the belonging of a specific document to a related domain by comparing different methods for calculating the distance between two vectors. This study has been made possible with the help of the structures made available by the Apache Spark framework. Starting from the study illustrated in the publication "New frontier of textual classification: Big data and distributed calculus" by Massimiliano Morrelli et al., We wanted to carry out a study on the possible implementation of a solution capable of calculating the Similarity of a sentence using the distributed environment.
Italiano. Il presente documento persegue l'obiettivo di studiare gli algoritmi migliori per verificare l'appartenenza di un determinato documento a un relativo dominio tramite un confronto di diversi metodi per il calcolo della distanza fra due vettori. Tale studio è stato condotto con l'ausilio delle strutture messe a disposizione dal framework Apache Spark. Partendo dallo studio illustrato nella pubblicazione "Nuova frontiera della classificazione testuale: Big data e calcolo distribuito" di Massimiliano Morrelli et al., si è voluto realizzare uno studio sulla possibile implementazione di una soluzione in grado di calcolare la Similarità di una frase sfruttando l'ambiente distribuito.
△ Less
Submitted 31 January, 2020;
originally announced February 2020.