Search | arXiv e-print repository

A Survey of Machine Learning Algorithms for Detecting Malware in IoT Firmware

Authors: Erik Larsen, Korey MacVittie, John Lilly

Abstract: This work explores the use of machine learning techniques on an Internet-of-Things firmware dataset to detect malicious attempts to infect edge devices or subsequently corrupt an entire network. Firmware updates are uncommon in IoT devices; hence, they abound with vulnerabilities. Attacks against such devices can go unnoticed, and users can become a weak point in security. Malware can cause DDoS a… ▽ More This work explores the use of machine learning techniques on an Internet-of-Things firmware dataset to detect malicious attempts to infect edge devices or subsequently corrupt an entire network. Firmware updates are uncommon in IoT devices; hence, they abound with vulnerabilities. Attacks against such devices can go unnoticed, and users can become a weak point in security. Malware can cause DDoS attacks and even spy on sensitive areas like peoples' homes. To help mitigate this threat, this paper employs a number of machine learning algorithms to classify IoT firmware and the best performing models are reported. In a general comparison, the top three algorithms are Gradient Boosting, Logistic Regression, and Random Forest classifiers. Deep learning approaches including Convolutional and Fully Connected Neural Networks with both experimental and proven successful architectures are also explored. △ Less

Submitted 3 November, 2021; originally announced November 2021.

Comments: 9 pages, 8 figures, 2 tables

arXiv:2111.02378 [pdf]

Intrusion Detection: Machine Learning Baseline Calculations for Image Classification

Authors: Erik Larsen, Korey MacVittie, John Lilly

Abstract: Cyber security can be enhanced through application of machine learning by recasting network attack data into an image format, then applying supervised computer vision and other machine learning techniques to detect malicious specimens. Exploratory data analysis reveals little correlation and few distinguishing characteristics between the ten classes of malware used in this study. A general model c… ▽ More Cyber security can be enhanced through application of machine learning by recasting network attack data into an image format, then applying supervised computer vision and other machine learning techniques to detect malicious specimens. Exploratory data analysis reveals little correlation and few distinguishing characteristics between the ten classes of malware used in this study. A general model comparison demonstrates that the most promising candidates for consideration are Light Gradient Boosting Machine, Random Forest Classifier, and Extra Trees Classifier. Convolutional networks fail to deliver their outstanding classification ability, being surpassed by a simple, fully connected architecture. Most tests fail to break 80% categorical accuracy and present low F1 scores, indicating more sophisticated approaches (e.g., bootstrap**, random samples, and feature selection) may be required to maximize performance. △ Less

Submitted 3 November, 2021; originally announced November 2021.

Comments: 10 pages, 9 figures, 3 tables

arXiv:2111.02375 [pdf]

Virus-MNIST: Machine Learning Baseline Calculations for Image Classification

Authors: Erik Larsen, Korey MacVittie, John Lilly

Abstract: The Virus-MNIST data set is a collection of thumbnail images that is similar in style to the ubiquitous MNIST hand-written digits. These, however, are cast by resha** possible malware code into an image array. Naturally, it is poised to take on a role in benchmarking progress of virus classifier model training. Ten types are present: nine classified as malware and one benign. Cursory examination… ▽ More The Virus-MNIST data set is a collection of thumbnail images that is similar in style to the ubiquitous MNIST hand-written digits. These, however, are cast by resha** possible malware code into an image array. Naturally, it is poised to take on a role in benchmarking progress of virus classifier model training. Ten types are present: nine classified as malware and one benign. Cursory examination reveals unequal class populations and other key aspects that must be considered when selecting classification and pre-processing methods. Exploratory analyses show possible identifiable characteristics from aggregate metrics (e.g., the pixel median values), and ways to reduce the number of features by identifying strong correlations. A model comparison shows that Light Gradient Boosting Machine, Gradient Boosting Classifier, and Random Forest algorithms produced the highest accuracy scores, thus showing promise for deeper scrutiny. △ Less

Submitted 3 November, 2021; originally announced November 2021.

Comments: 11 pages, 13 figures, 2 tables

arXiv:2110.07636 [pdf]

A Survey of Machine Learning Algorithms for Detecting Ransomware Encryption Activity

Authors: Erik Larsen, David Noever, Korey MacVittie

Abstract: A survey of machine learning techniques trained to detect ransomware is presented. This work builds upon the efforts of Taylor et al. in using sensor-based methods that utilize data collected from built-in instruments like CPU power and temperature monitors to identify encryption activity. Exploratory data analysis (EDA) shows the features most useful from this simulated data are clock speed, temp… ▽ More A survey of machine learning techniques trained to detect ransomware is presented. This work builds upon the efforts of Taylor et al. in using sensor-based methods that utilize data collected from built-in instruments like CPU power and temperature monitors to identify encryption activity. Exploratory data analysis (EDA) shows the features most useful from this simulated data are clock speed, temperature, and CPU load. These features are used in training multiple algorithms to determine an optimal detection approach. Performance is evaluated with accuracy, F1 score, and false-negative rate metrics. The Multilayer Perceptron with three hidden layers achieves scores of 97% in accuracy and F1 and robust data preparation. A random forest model produces scores of 93% accuracy and 92% F1, showing that sensor-based detection is currently a viable option to detect even zero-day ransomware attacks before the code fully executes. △ Less

Submitted 14 October, 2021; originally announced October 2021.

Comments: 9 pages, 8 figures, 3 tables

arXiv:2107.00436 [pdf]

Overhead-MNIST: Machine Learning Baselines for Image Classification

Authors: Erik Larsen, David Noever, Korey MacVittie, John Lilly

Abstract: Twenty-three machine learning algorithms were trained then scored to establish baseline comparison metrics and to select an image classification algorithm worthy of embedding into mission-critical satellite imaging systems. The Overhead-MNIST dataset is a collection of satellite images similar in style to the ubiquitous MNIST hand-written digits found in the machine learning literature. The CatBoo… ▽ More Twenty-three machine learning algorithms were trained then scored to establish baseline comparison metrics and to select an image classification algorithm worthy of embedding into mission-critical satellite imaging systems. The Overhead-MNIST dataset is a collection of satellite images similar in style to the ubiquitous MNIST hand-written digits found in the machine learning literature. The CatBoost classifier, Light Gradient Boosting Machine, and Extreme Gradient Boosting models produced the highest accuracies, Areas Under the Curve (AUC), and F1 scores in a PyCaret general comparison. Separate evaluations showed that a deep convolutional architecture was the most promising. We present results for the overall best performing algorithm as a baseline for edge deployability and future performance improvement: a convolutional neural network (CNN) scoring 0.965 categorical accuracy on unseen test data. △ Less

Submitted 19 October, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

Comments: 6 pages; 8 figures, 2 tables

Showing 1–5 of 5 results for author: MacVittie, K