Enhancing IoT Malware Detection through Adaptive Model Parallelism and Resource Optimization

Sreenitha Kasarapu, Sanket Shukla, Sai Manoj Pudukotai Dinakarrao

S. Kasarapu, S. Shukla, and S. M. P. Dinakarrao are associated with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA, 22030, USA. Email: {skasarap,sshukla4,spudukot}@gmu.edu

This work was supported by the Commonwealth Cyber Initiative, an investment in the advancement of cyber R&D, innovation, and workforce development. For more information about CCI, visit www.cyberinitiative.org

Abstract

The widespread integration of IoT devices has greatly improved connectivity and computational capabilities, facilitating seamless communication across networks. Despite their global deployment, IoT devices are frequently targeted for security breaches due to inherent vulnerabilities. Among these threats, malware poses a significant risk to IoT devices. The lack of built-in security features and limited resources present challenges for implementing effective malware detection techniques on IoT devices. Moreover, existing methods assume access to all device resources for malware detection, which is often not feasible for IoT devices deployed in critical real-world scenarios. To overcome this challenge, this study introduces a novel approach to malware detection tailored for IoT devices, leveraging resource and workload awareness inspired by model parallelism. Initially, the device assesses available resources for malware detection using a lightweight regression model. Based on resource availability, ongoing workload, and communication costs, the malware detection task is dynamically allocated either on-device or offloaded to neighboring IoT nodes with sufficient resources. To uphold data integrity and user privacy, instead of transferring the entire malware detection task, the classifier is divided and distributed across multiple nodes, then integrated at the parent node for detection. Experimental results demonstrate that this proposed technique achieves a significant speedup of 9.8× compared to on-device inference, while maintaining a high malware detection accuracy of 96.7%.

Index Terms:
Hardware security, malware detection, deep learning models, image processing, model parallelism, distributed learning

I Introduction

Recent advancements and innovations in Internet-of-Things (IoT) devices have fueled the growth and extensive deployment of a network comprised of intelligent IoT devices [1]. These devices find application in various domains, including consumer electronics such as smart homes, smart cars, and smart grids, as well as in defense systems [1]. Despite offering numerous benefits, IoT devices and networks have become attractive targets for cyber attackers seeking unauthorized access to user information [2]. Notably, malicious applications, commonly referred to as malware, pose a significant threat to IoT devices, with cyber-attacks often executed through the deployment of such malware [3]. Malware, characterized as malicious software or applications, is designed to infiltrate devices, enabling unauthorized access to sensitive information such as passwords and financial data, and allowing manipulation of stored data without user consent.

Malware stands out as a significant threat, primarily due to its ease of creation and the limited capability of IoT devices to verify third-party applications before executing them [3]. The security risks for IoT networks escalate each year, with an exponential increase observed annually [3]. In 2021 alone, over 5.4 billion malware attacks were recorded, with the first half of 2022 already witnessing 2.8 billion attacks [3]. Adversaries leverage technological advancements to develop sophisticated malware, aiming to evade detection. Records indicate that, in recent years, an average of more than 8 million malware threats have been identified daily [4].

The significant surge in malware attacks and security threats has heightened concerns regarding the security of IoT devices, potentially hindering their deployability. This underscores the need for techniques capable of detecting malware in IoT devices and mitigating the exploitation of user data. Several studies have been put forth to address malware detection on IoT devices [5, 6]. However, the existing works primarily suffer from four challenges:

(1) Real-time Malware Detection: Detecting malware during runtime with minimal latency is crucial, as malware can have severe consequences and can be challenging to detect once its payload is activated. Two approaches are commonly used for malware detection: static analysis and dynamic analysis [7, 8]. Static analysis examines the internal structure of malware binaries in a non-runtime environment, without executing the binary files. Dynamic analysis, on the other hand, inspects binary applications for malware traces by executing them in a sandbox environment. Unlike static analysis, dynamic analysis is a functionality test, which makes it better at identifying the presence of malware in an application.

Recent works on malware detection (both static and dynamic analysis) utilize a variety of Machine Learning (ML) techniques to enhance performance [9]. Among the ML-based malware detection techniques, the CNN-based image classification technique [10] is observed to be efficient due to its ability to learn image features. Emerging trends indicate that malware developers create advanced malware by employing techniques such as code obfuscation, metamorphism, and polymorphism [11, 12, 13] to mutate malware binary executables, modify the static and dynamic application traces (signatures), and evade malware detection. This further increases the complexity of malware detection and causes it to incur large latency.

(2) Reliable Feature Extraction: Despite the abundance of research on malware detection [8, 10], there is a persistent challenge in reliably extracting input features that contribute to effective malware detection [14]. Regardless of the effectiveness of the underlying analysis technique, whether machine learning (ML) or non-ML, if the extracted features are not reliable, the malware detection task becomes unreliable. A popular technique to address this challenge is the utilization of hardware performance counters (HPC), device, and network features for node-level malware detection. This approach aims to minimize overheads and meet latency requirements [14]. HPCs can assist in distinguishing between malware and benign applications with low overheads. However, concerns have been raised regarding the reliability of using HPC information for security purposes in recent years [15]. For example, in Intel Pentium 4 processors, the ‘Instruction count’ is often over-counted [15]. Additionally, the coexistence of multiple applications can influence HPC values and trends, leading to non-determinism and unreliability. Therefore, there is a need for improved techniques that can efficiently analyze traits of benign and malware applications while addressing these reliability challenges.

(3) Manual Data Acquisition: Supervised learning models are commonly employed for malware detection, utilizing datasets comprising both malware and benign data. However, as the volume of malware data increases annually, there arises a necessity to regularly update these machine learning (ML) models. Yet, the process of collecting, cleaning, and labeling data is labor-intensive. Furthermore, adversaries employ various techniques such as code obfuscation, metamorphism, and polymorphism to enhance the complexity of malware binaries and evade detection [16, 11]. In such scenarios, manual data acquisition becomes increasingly challenging. For instance, morphism techniques alter malware binary files to mimic the functionality of standard applications, thereby deceiving the detection capabilities of various methodologies.

Techniques such as code obfuscation [16] involve encrypting specific sections of code within malware binary files while preserving its functionality. This tactic effectively conceals the presence of malware within embedded systems, exploiting their security vulnerabilities. Another strategy employed to obscure malware identity is stealthy malware [17], where malware is integrated into benign binaries using randomized obfuscation. Consequently, the benign application exhibits malware-like behavior only after a certain period, rendering it challenging to detect. These sophisticated techniques underscore the complexity of disguising malware and necessitate extensive training to enable machine learning (ML) models to discern hidden malware patterns. Consequently, acquiring the necessary data for training becomes more complex. This highlights the urgency to adopt efficient malware detection techniques that can operate effectively with limited data.

(4) Limited Resources on IoT devices: As previously mentioned, IoT devices are designed with constrained resources to prioritize portability and meet user demands [18]. Typically, the bulk of these resources are allocated to executing user applications, with only a limited portion reserved for on-device security measures. Consequently, it is impractical for IoT devices with limited resources to undertake computationally intensive malware detection tasks. Existing approaches either (1) prioritize malware detection at the expense of consuming all available application memory on IoT devices or (2) prioritize user applications, neglecting malware detection capabilities altogether. Both scenarios pose challenges for IoT devices: in the former case, the primary user application’s performance is restricted, while in the latter case, user security and privacy are compromised. Thus, there is a pressing need for a technique that can effectively perform malware detection without disrupting the workload of an IoT device.

To address the aforementioned limitations, this work introduces a novel resource-aware and workload-aware model-parallelism-based malware detection technique for IoT devices. This technique enables efficient malware classification without demanding excessive resources from any single IoT device. Instead, it distributes the ML model over neighboring IoT nodes to carry out malware detection. Application privacy is maintained despite the shared resources, as the model is distributed only onto nodes of the same IoT network. The ML model is trained using a few-shot technique to decrease its need for manually annotated image samples. The novel contributions of this work can be outlined as follows:

  • This work introduces a methodology for reliably extracting device and network characteristics, laying the foundation for efficient and effective malware detection.

  • This work implements an automatic assessment of available resources in IoT devices for malware detection. It provides an estimate of whether to offload the malware detection task or not. This analysis is conducted by training a lightweight regressor on the workload of the IoT device and ML model parameters.

  • The proposed approach involves distributing ML model resources to neighboring devices in a resource-aware manner, taking into account communication and computation overheads for effective malware detection.

  • We also introduce a code-aware data generation-based few-shot technique aimed at generating mutated training samples to capture the features of actual malware samples. These generated images mimic the complex functionality of malware, addressing the challenge of comprehensive data acquisition.

The experimental results show that the proposed resource-aware model parallelism technique can detect complex malware in IoT networks with an accuracy of over 90%. Experimental analysis shows that the proposed technique can achieve a speed-up of 9.8× compared to on-device inference while maintaining a malware detection accuracy of 96.7%.

The rest of the paper is organized as follows: Section II describes the related work, its shortcomings, and how it compares with the proposed model. Section III formulates the problem of malware detection in IoT devices. Section IV describes the proposed resource-aware model parallelism architecture, which assists with efficient malware detection in IoT devices using a distributed runtime model training methodology. The experimental evaluation of the proposed model and its comparison with various ML architectures are presented in Section V, followed by the conclusions drawn from the paper in Section VI.

II State-of-the-Art

In this section, we present some of the relevant works proposed in the recent past on malware detection, distributed learning, and few-shot learning.

II-A Malware Detection Techniques

Malware detection in recent years has gained a lot of interest. We broadly categorize malware detection into two categories.

II-A1 Static Analysis based Malware Detection

Traditionally, static and dynamic analyses are employed for malware detection. Static analysis [7] on malware data is performed by comparing the opcode sequences of binary executable files, control flow graphs, and code patterns. This technique is performed in a non-runtime environment, as it does not require any execution.

The work in [9] introduced a technique for malware detection using image processing, where binary applications are converted into grayscale images. The generated images have identical patterns because of the structural distribution of the executable files. The paper used the K-Nearest Neighbour ML algorithm for the classification of malware images. Similar approaches include image visualization and classification using machine learning algorithms such as SVM. However, these approaches do not address the problem of classifying newer, complex malware. Neural networks such as artificial neural networks (ANNs) are used extensively to solve problems of classification, prediction, filtering, optimization, pattern recognition, and function approximation [19], as neurons can capture the features of images more accurately than other machine learning algorithms. However, the fully connected layers of an ANN tend to exhaust the computational resources. In [10, 20, 21], the authors used convolutional neural networks (CNNs) due to their ability to handle image data efficiently, extracting features with Convolutional 2D layers and downsampling the input parameters with Maxpooling 2D layers. CNNs thus serve as an efficient image classification algorithm with lower resource consumption.
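The binary-to-image conversion described above can be illustrated with a short sketch. This is not the pipeline of [9]; the function name `bytes_to_grayscale`, the fixed row width, and the zero padding of the last row are assumptions chosen only to make the idea concrete.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 16) -> List[List[int]]:
    """Map each byte (0-255) of a binary to one grayscale pixel, row by row."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Assumption: pad the tail with zeros (black pixels) so every row is complete.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical 6-byte "binary" rendered as a 4-pixel-wide image.
image = bytes_to_grayscale(b"\x7fELF\x01\x02", width=4)
```

Each row of `image` is one scan line of the picture, so structurally similar binaries yield visually similar textures that an image classifier can learn.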

II-A2 Dynamic Analysis based Malware Detection

Dynamic analysis is a malware detection technique performed in a secured runtime environment, such as a sandbox. It is a functionality test in which the binary files are executed to detect malware functionality. Malware detection using dynamic analysis is based on detecting system calls or HPC values [8]. Dynamic analysis is much more effective than static analysis at detecting malware, but it needs a large amount of resources and is time-consuming, so it is hard to carry out on edge devices. Furthermore, malware developers implement code obfuscation, metamorphism, and polymorphism [11] to mutate malware binary executables. One such strategy for masking malware's identity is stealthy malware [17], where malware is incorporated into benign applications using random obfuscation techniques. In such cases, dynamic malware analysis produces poor estimations, so there is a need to train these dynamic models with reliable features.

In the past, many researchers have leveraged architectural and application features for malware analysis and detection [22]. In [23], Bilar et al. used the difference in opcodes between known malware and benign applications as a key to predicting malware. However, these techniques require a considerable amount of work to model each program based on its instructions. Since code size increases day by day, modeling programs based on opcodes becomes a time-consuming and computationally expensive task. Demme et al. [24] proposed the use of hardware performance counters (HPCs) to monitor lower-level micro-architectural parameters such as branch misses, instructions per cycle, and cache miss rate. HPCs can provide comprehensive access to internal performance information with much lower overhead than other methods. In works such as [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41], machine learning models like Random Forest, SVM, and Logistic Regressors are applied to HPC values obtained at execution to classify benign and malware classes. In [14], the authors introduce StealthMiner, a novel stealthy malware detection model using time series prediction. They build a Fully Convolutional Neural Network (FCN) on HPC run-time branch instruction features to detect stealthy malware traces.

II-B Distributed Learning

Deep learning has achieved important milestones and has valuable applications in cybersecurity, malware detection, and other domains. Training deep learning models requires a large amount of time, especially given the massive amount of data that needs to be processed. Moreover, scaling a neural network architecture may result in a network with a large number of parameters, leading to high execution time while training the model. Fortunately, these bottlenecks can be addressed through parallelization paradigms. Parallelizing tasks in deep learning models is one of the best approaches for accelerating implementation: it speeds up the algorithm by minimizing the execution time, allowing complex tasks to be processed with fewer computational resources [42].

Two types of distributed learning techniques are available: data parallelism and model parallelism. In data parallelism [43], each node has a copy of the whole ML model to be trained, but each node is given a different mini-batch of data for training the model. After training, the results are collected and combined into an updated model. Though it reduces complexity and inter-node communication, data parallelism suffers from huge memory utilization. Model parallelism [44] is a technique where each node has the same data but the ML model is divided. Each node contains only a single layer of the neural network to be trained. Node-to-node communication is used for weight sharing and back-propagation. Model parallelism is suitable for training a massive ML model when resources are limited.
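To make the idea of dividing a model across nodes concrete, the sketch below partitions a sequence of layers into contiguous chunks and runs a forward pass through them in order. It is an illustrative toy under stated assumptions, not the scheme of [44]: the helper names `partition_layers` and `forward` are hypothetical, the "nodes" are simulated in one process, only inference is shown (no back-propagation), and a node may hold more than one layer.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def partition_layers(layers: Sequence[Layer], num_nodes: int) -> List[List[Layer]]:
    """Split the layer sequence into contiguous chunks, one chunk per node."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    base, extra = divmod(len(layers), num_nodes)
    chunks, start = [], 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional layer each.
        size = base + (1 if node < extra else 0)
        chunks.append(list(layers[start:start + size]))
        start += size
    return chunks


def forward(chunks: List[List[Layer]], x: List[float]) -> List[float]:
    """Run inference; each hand-off between chunks stands in for a network transfer."""
    for node_layers in chunks:
        for layer in node_layers:
            x = layer(x)
    return x


# Toy "layers": scale, shift, ReLU, and a sum acting as the output layer.
layers = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a - 3.0 for a in v],
    lambda v: [max(0.0, a) for a in v],
    lambda v: [sum(v)],
]
chunks = partition_layers(layers, num_nodes=2)
output = forward(chunks, [1.0, 2.0, 3.0])
```

The split model produces the same output as the unsplit one; what changes is that only the intermediate activations, not the raw input or the full model, cross node boundaries.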

In [45], the authors propose linear-algebraic model parallelism for deep learning networks. This framework allows the parallel distribution of any tensor in the DNN. Model parallelism is also widely used in natural language processing. In [46], the authors train a multi-billion-parameter transformer language model. With the help of multiple GPU nodes and pipeline structures, they could train such a gigantic model while achieving state-of-the-art speedups. In [47], the authors build a 3-dimensional distributed model to accelerate the training of language models. They use a 3D model to complement matrix multiplication and vector operations in transformer models. To the best of our knowledge, ours is the first work that employs model parallelism for the purpose of malware detection.

II-C Few-Shot Learning

With a consistent increase in malware applications each year [3], there is a constant need to update the ML models involved in malware detection algorithms. However, obtaining complex data and continuously collecting data for different cases are difficult. Machine learning and deep learning models need to be updated with each new type of training sample to generalize well. Because of this, the efficiency of machine learning models for malware detection is often debated. There is therefore a need to build an efficient malware detection model that requires only a few samples and does not need constant updating. Few-shot learning is a supervised learning technique that aims to learn different class concepts using a few samples, and it could improve ML models for which complex data is scarce.

The important frameworks for few-shot learning are data augmentation techniques. These models improve the feature extraction capability of few-shot learning algorithms. Models such as Generative Adversarial Networks (GAN) [48], Variational Autoencoders (VAE) [49], and Mixture Density Networks (MDN) [50] can generate high-quality samples. GANs can produce new samples by minimizing the loss in the generated samples, and MDNs, with the help of Gaussian Mixture Models, can produce highly probable samples. VAEs, with their encoder-decoder architecture, are said to reconstruct input data efficiently. Works such as [51] and [52, 38] use techniques such as reflection, translation, and augmentation to generate new samples for training. The work in [53] used a memory augmentation technique for few-shot learning.

III Motivation and Problem Formulation

With technology advancements, attackers are introducing complex hidden malware, by sneaking them into general applications. This is mathematically represented in Equation (1). Even advanced anti-malware software fails to detect these advanced malware families [11].

$$\mathbb{IOT}_{devices} \leftarrow (B \oplus M) \tag{1}$$

As represented in Equation (1), $B$ represents benign and $M$ represents the malware executables for IoT devices. The target for the malware is the IoT devices, represented as $\mathbb{IOT}_{devices}$. One can represent the problem of malware detection on IoT devices as follows:

$$\mathbb{C}(D^{n}): X \rightarrow Y \quad \text{s.t.} \quad \mathfrak{M}[\mathbb{C}] < \mathfrak{M}[node] \tag{2}$$

Figure 1: (a) Distributed IoT device framework, (b) HPC and Binary data pre-processing to extract input image dataset and generating additional synthetic samples with Code-Aware Data Generation technique using GANs, (c) Framework to identify the resources in the malware detection model using a lightweight linear regressor

As shown in Equation (2), $\mathbb{C}$ is a pre-trained classifier trained with dataset $D^{n}$ to perform malware detection. The dataset $D^{n}$ contains a combination of malware $M$ and benign $B$ samples. As a pre-trained model, the classifier $\mathbb{C}$ will not incur any training overhead and can be used for inference. This model can detect whether there is malware in any sample $X$ and map it to either the malware class $M$ or the benign class $B$. The output class is represented as $Y$. The memory required to perform inference, represented as $\mathfrak{M}[\mathbb{C}]$, should be less than the available resources in an IoT node, represented as $\mathfrak{M}[node]$. If the constraint in Equation (2) is not met, then the inference task cannot be carried out by the device. Also, producing an effective ML model requires enough training samples $D$. Given the need for both enough training data and enough memory, the problem of implementing malware detection in IoT devices can be defined as a dual optimization problem.

$$\text{maximize} \quad \sum_{i}\sum_{j} D_{ij}\,\mathfrak{M}_{ij} \quad \text{s.t.} \quad \sum_{j\in\mathcal{P}} d_{xj} = 1 \quad \forall i = d, \mathfrak{m} \tag{3}$$

Equation (3) describes the problem of optimizing training data and the available resources such as memory. Our proposed technique solves this by introducing a novel resource-aware malware detection model that off-loads the inference workload to neighboring nodes. We also introduce a code-aware data generation technique to increase the number of training samples. Together, these address the problems of limited memory and limited training data in IoT devices.

IV Proposed Resource- and Workload-aware Malware Detection

IV-A Overview of the Proposed Technique

The overview of the proposed technique is shown in Figure 1, which presents the computations happening at the node level. Figure 1(a) represents the IoT devices present in a network. The proposed technique starts with data collection at the IoT device, in which popular malware and benign application files are collected. Figure 1(b) describes the data collection process. The HPC traces are considered as input for the proposed technique to improve the reliability of malware detection. Along with the HPC data, the benign binary samples used in IoT devices and the malware binary samples which affect IoT devices are collected. The HPC data and binary files are converted to grayscale images. To increase the training data for better training capabilities, synthetic data is generated using the code-aware data generation technique. These image samples are fed as input to machine learning algorithms such as CNNs for effective malware detection. As shown in Figure 1(c), an automatic estimation is done using a lightweight regression model to analyze the resources needed to perform the malware detection. Depending on the resource availability, the workload in an IoT node, and the communication overhead, the malware detection task is either performed on-device or off-loaded to neighbouring nodes with sufficient resources, as shown in Figure 1. The $MP$ block in Figure 1 represents the model parallelism task.
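The on-device versus off-load decision described above can be sketched as a simple placement rule. This is a simplified illustration, not the paper's scheduler: the `Node` fields, the `choose_placement` helper, and the use of a single scalar communication cost are all assumptions, and the sketch places the whole task on one node rather than splitting the classifier across several nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_memory_mb: float  # memory left after the node's own workload (hypothetical field)
    comm_cost_ms: float    # assumed cost of shipping data to this node


def choose_placement(required_mb: float, local: Node,
                     neighbours: List[Node]) -> Optional[str]:
    """Return the node that should run detection, or None if no node fits."""
    # Prefer on-device inference: it has no communication overhead.
    # The strict "<" mirrors the memory constraint of Equation (2).
    if required_mb < local.free_memory_mb:
        return local.name
    # Otherwise pick the cheapest neighbour that has enough free memory.
    candidates = [n for n in neighbours if required_mb < n.free_memory_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.comm_cost_ms).name


local = Node("parent", free_memory_mb=20.0, comm_cost_ms=0.0)
neighbours = [
    Node("n1", free_memory_mb=64.0, comm_cost_ms=12.0),
    Node("n2", free_memory_mb=128.0, comm_cost_ms=5.0),
    Node("n3", free_memory_mb=16.0, comm_cost_ms=1.0),
]
placement = choose_placement(48.0, local, neighbours)
```

In this toy setting the parent cannot host a 48 MB task, `n3` is cheap to reach but too small, and `n2` wins over `n1` on communication cost. In practice the required memory would come from the regression model's estimate rather than a constant.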

IV-B Pre-processing and Data Collection

IV-B1 Generation HPC-based Grayscale Images

To address the reliability concerns which are not addressed in the existing techniques, we propose fine-tuning the state-of-the-art model-specific registers (MSRs) available in modern computing system architectures, which are the source of the HPC information. Firstly, to solve the non-determinism challenge in HPCs, we redesign the HPC capturing protocols with proper context switching and handling of the performance monitoring interrupt (PMI) units in the system while collecting HPCs. To obtain the HPCs solely for a given application, context switching needs to be accommodated, thereby eliminating the contamination of the obtained HPCs. From our preliminary analysis, the overhead (in terms of latency) to perform context switching for MiBench applications is around 3% of an average application runtime, which is affordable for enhanced security. Further, PMIs help ensure proper context switching and reading of HPCs. It has been seen that configuring the PMI per process often leads to better capturing of the HPCs [15, 54, 55, 56]. Through this two-pronged utilization of context switching and PMI, we collect reliable HPCs. To address challenges such as over-counting [15], we perform calibration through testing.

We also require the microarchitectural event traces captured through HPCs for malware detection. One of the challenges is that only a limited number of on-chip HPCs can be read at a given time instance, whereas executing an application generates a few tens of microarchitectural events. Thus, to perform real-time malware detection, one needs to determine the non-trivial microarchitectural events that can be captured through the limited number of HPCs and still yield high detection performance. To achieve this, we use principal component analysis (PCA) for feature/event reduction on all the microarchitectural event traces captured offline by iteratively executing the application. Based on the PCA, we determine the most prominent events and monitor them during runtime. The ranking of the events is determined as follows:

$$\rho_{i} = \frac{cov(App_{i}, Z_{i})}{\sqrt{var(App_{i}) \times var(Z_{i})}} \tag{4}$$

where $\rho_{i}$ is the Pearson correlation coefficient of the $i^{th}$ application, $App_{i}$ is the $i^{th}$ incoming application, and $Z_{i}$ is the output data, which contains different classes (backdoor, rootkit, trojan, virus, and worm in our case). $cov(App_{i}, Z_{i})$ measures the covariance between input and output. $var(App_{i})$ and $var(Z_{i})$ measure the variance of the input and output data, respectively. Based on the ranking, we select the most prominent HPCs and monitor them during runtime for efficient malware detection. These reduced features collected at runtime are provided as input to ML classifiers, which determine the malware class label ($\widehat{Y} \Rightarrow$ Backdoor, Rootkit, Trojan, Virus, and Worm) with higher confidence.
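The ranking in Equation (4) can be sketched in a few lines of plain Python. The event names, the toy traces, and the numeric encoding of the class labels are assumptions made for illustration; the helper names `pearson` and `rank_events` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: cov(x, y) / sqrt(var(x) * var(y))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    # The 1/n factors of covariance and variance cancel, so plain sums suffice.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant trace carries no ranking information
    return cov / math.sqrt(var_x * var_y)


def rank_events(traces: Dict[str, List[float]], labels: List[float]) -> List[str]:
    """Order event names by |rho|, strongest correlation first."""
    return sorted(traces, key=lambda name: abs(pearson(traces[name], labels)),
                  reverse=True)


# Toy data: one value per executed sample; labels 0 = benign, 1 = malware.
traces = {
    "cycles": [5.0, 5.0, 5.0, 5.0],
    "cache_misses": [3.0, 1.0, 2.0, 4.0],
    "branch_misses": [1.0, 2.0, 3.0, 4.0],
}
labels = [0.0, 0.0, 1.0, 1.0]
ranking = rank_events(traces, labels)
```

The top-ranked names would then be the events programmed into the limited set of on-chip counters at runtime.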

To the best of our knowledge, this is the first work which captures the functionality of dynamic HPC attributes (values) and represents them as grayscale images. We execute malware and benign applications in a sandbox environment and capture a range of HPC values (e.g. for 20 ns, 40 ns) using the Quick HPC tool. Capturing the range of HPC values for a particular executable (benign or malware) illustrates the trend in the variation of HPC values for benign and malware samples. Hence, we obtain a unique grayscale image pattern for each executable file. However, it should be noted that the grayscale images of the same class of malware tend to show similar texture in some portions of the image. Moreover, the advantage of this technique is that a malware payload triggered by stealthy or code-obfuscated malware can be identified and classified based on HPC-based grayscale images, because the grayscale texture of the triggered malware tends to match one of the malicious patterns in the generated training data.
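One plausible way to turn a window of HPC readings into a grayscale image is min-max scaling to the 0-255 pixel range, sketched below. The text does not specify the exact mapping, so the global (rather than per-counter) normalization, the row-per-time-step layout, and the helper name `hpc_to_grayscale` are assumptions.

```python
from typing import List, Sequence


def hpc_to_grayscale(samples: Sequence[Sequence[float]]) -> List[List[int]]:
    """Min-max scale a (time x counter) matrix of HPC readings to 0-255 pixels."""
    flat = [v for row in samples for v in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # All readings identical: render a uniformly black image.
        return [[0 for _ in row] for row in samples]
    return [[round(255 * (v - lo) / span) for v in row] for row in samples]


# Hypothetical readings: two time steps (rows) of two counters (columns).
pixels = hpc_to_grayscale([[0.0, 40.0], [120.0, 200.0]])
```

Each row then corresponds to one sampling instant, so the way counter values evolve over time appears as vertical texture in the image.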

TABLE I: Parameter Estimations per Each Layer in a CNN Algorithm
| Layers | Description | Parameters |
| --- | --- | --- |
| Input | No learnable parameters | 0 |
| CONV | (width of filter * height of filter * No. of filters in previous layer + 1) * No. of filters in current layer | $f_{conv}=((w*h*p)+1)*c$ |
| POOL | No learnable parameters | 0 |
| FC | (current layer neurons * previous layer neurons) + 1 * current layer neurons | $f_{FC}=(n_{c}*n_{p})+1*n_{c}$ |
| Softmax | (current layer neurons * previous layer neurons) + 1 * current layer neurons | $f_{S}=(n_{c}*n_{p})+1*n_{c}$ |
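As a worked example of the formulas in Table I, the sketch below counts the learnable parameters of a small CNN. The layer list `toy_cnn` and the helper names are hypothetical and chosen only for illustration; they are not the architecture used in this work.

```python
def conv_params(w, h, p, c):
    """CONV: ((w * h * p) + 1) * c, i.e. filter weights plus one bias per filter."""
    return (w * h * p + 1) * c

def fc_params(n_c, n_p):
    """FC / Softmax: (n_c * n_p) + 1 * n_c, i.e. weights plus one bias per neuron."""
    return n_c * n_p + 1 * n_c

def layer_params(layer):
    """Dispatch on the layer type; Input and POOL have no learnable parameters."""
    kind = layer["type"]
    if kind == "CONV":
        return conv_params(layer["w"], layer["h"], layer["p"], layer["c"])
    if kind in ("FC", "Softmax"):
        return fc_params(layer["n_c"], layer["n_p"])
    return 0

# Hypothetical toy network (assumption, not the paper's model).
toy_cnn = [
    {"type": "Input"},
    {"type": "CONV", "w": 3, "h": 3, "p": 1, "c": 8},
    {"type": "POOL"},
    {"type": "CONV", "w": 3, "h": 3, "p": 8, "c": 16},
    {"type": "POOL"},
    {"type": "FC", "n_c": 64, "n_p": 1024},
    {"type": "Softmax", "n_c": 6, "n_p": 64},
]
per_layer = [layer_params(layer) for layer in toy_cnn]
total = sum(per_layer)
print(per_layer, total)  # [0, 80, 0, 1168, 0, 65600, 390] 67238
```

In this toy network the fully connected layer dominates the count, which is typical for CNNs and is why such layers weigh most heavily when a model is divided across nodes.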

IV-B2 Code-Aware Data Generation

Code-aware data generation is a novel data augmentation technique for generating reliable synthetic data. This synthetic data helps in feature extraction from limited samples. The data generation is done using generative adversarial networks (GANs). It is code-aware because the GANs are trained with images constructed from binary code files, so the feature extraction carried out in the GANs can be interpreted as capturing malware code patterns. As a result, when the test data varies, there is no need to train the ML models again. Obfuscated and morphic malware samples, which have hidden malware code blocks, can be detected easily because the GAN learns to capture these hidden patterns. This makes the data generation process code-aware. In the case of HPC samples, grayscale images are constructed based on the functional attributes. GANs are trained with dynamic HPC grayscale images to generate augmented HPC samples. The generated images are loss-controlled, which makes them effective in capturing the features of the limited available data. A GAN consists of two parts: a generator and a discriminator. The generator uses a random uniform distribution as a reference to generate new data points. Based on this uniform distribution and the input data, the generator tries to generate a correlated sample. This generated sample augments the real image with the help of the uniform distribution so that, when it is given to an ML model, the feature extraction rate improves. The discriminator block of a GAN tries to classify the generated image as real or fake.

The generator and discriminator are loss-controlled so that the generator can produce realistic images that are as close as possible to the real images, while the discriminator is trained to reject the fake images. This feedback helps the generator learn and improve its ability to generate data, and the discriminator is trained to classify the images better.
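The text does not state the exact loss used. For reference, the standard GAN minimax objective, which is one common way to realize this adversarial, loss-controlled training, can be written as follows. Here $p_{data}$ (the distribution of real grayscale images) and $p_{z}$ (the uniform noise distribution) are notational assumptions, and $G$ and $D$ denote the generator and discriminator.

```latex
\min_{G}\,\max_{D}\; V(D,G) =
  \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes $V$ by scoring real images close to 1 and generated images close to 0, while the generator minimizes $V$ by producing images that the discriminator scores as real.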

Algorithm 1 Code-Aware Data Generation Algorithm
1:  Input: $D_{l}$ (Dataset with limited data version), $B$ (Benign grayscale images), $M$ (Malware grayscale images), $M_{O}$ (Random obfuscated malware), $M_{ST}$ (Generated Stealthy malware), $D_{l}=\{B+M+M_{O}+M_{ST}\}$
2:     define CAD_generator(X):
3:        for $X\leftarrow D_{l}$: do
4:           for $epoch\leftarrow range(1000)$: do
5:              $G\_model=define\_generator()$
6:              $D\_model=define\_discriminator()$
7:              $noise\leftarrow vector(256,None)$
8:              $X\_fake\leftarrow G\_model(noise)$
9:              end for
10:           $D_{mu}\leftarrow G\_model\cdot predict(vector)$
11:           end for
12:           return $D_{mu}$
13:  Output: $D_{mu}$ (Generated dataset with mutated samples)

Algorithm 1 takes the limited dataset $D_{l}$ as input. For each class in the dataset, the CAD_generator(X) function trains a generator and a discriminator. We train our GAN for 1000 epochs (Line 4), which is enough to minimize the loss and generate images similar to the training data. As represented in the algorithm (Line 5 - Line 6), the generator model is denoted $G\_model$ and the discriminator model is denoted $D\_model$. Both are convolutional neural networks, where $G\_model$ is trained to generate an image when a latent vector is given as input. As represented in Algorithm 1 (Line 7 - Line 9), when latent noise of size $(256, None)$ generated by the function $vector()$ is given as input, $G\_model$ generates an image of size $(32,32)$. $D\_model$ tries to classify the generated fake image $X\_fake$. A loss is computed for both $D\_model$ and $G\_model$. To decrease this loss, the generator learns to generate better fake images $X\_fake$, and the discriminator keeps learning to classify them. After 1000 epochs, the generator has learned enough to generate realistic fake images. Vectors of latent points are then passed to the $model\cdot predict()$ function to generate mutated data, which are collected in the dataset $D_{mu}$ (Line 12).

$D_{mu}(X)\sim D_{w}(X)$ (5)

The samples in the generated synthetic dataset, represented as $D_{mu}(X)$, have a high correlation with the real samples $D_{w}(X)$. A few shots of real samples $D_{w}(X)$ are used along with the generated synthetic data $D_{mu}(X)$ to train a CNN classifier for malware detection.

After data generation, a dataset is built using a few shots of real data together with the generated data, and a CNN model is trained on it. The augmented data produced by the code-aware data generation technique supports few-shot training of the CNN model and helps improve its performance. The CNN model is trained offline, and for inference on IoT devices it is deployed as a pre-trained model. As the training happens offline, training the few-shot learning-based CNN model does not incur any memory overhead on the IoT devices.

IV-C Automatic Resource Estimation

Execution, inference, and training of CNNs and DNNs for malware detection and other applications often consume a significant amount of resources. Deploying them on a single IoT device is not always feasible due to the limited resources available. Furthermore, the ongoing parallel execution of other applications on IoT devices, such as sensing and other computations, reduces the resources available for CNN/DNN execution. Thus, the amount of resources available at each node changes with its workload. Instead of manually calculating the parameters of the CNN and checking each time whether the resources available on a node are sufficient, a regression model is developed in this work. The binary regression model is trained using data such as the CNN's parameters, the memory requirements of these parameters, and the memory available at each node. As output, the binary regression model estimates whether the CNN model inference can be performed on a single node or must be distributed across multiple nodes. The rationale for adopting binary regression is its low overhead and complexity along with its high efficiency.

The binary regression algorithm is constructed as illustrated in Algorithm 2. The training features of the regressor are the parameters of the CNN, so the parameters of each layer are calculated. For each layer of the CNN $\mathbb{C}$ (Line 3 - Line 5), the weight matrix $W$, bias $B$, and activation $A$ are collected and stored in the variable $var$. These variables contribute to the parameter calculations of the different layers in the CNN (Line 6 - Line 11). As shown in Table I, the input layer and the pooling layer, represented as $POOL$, do not have any learnable parameters, so the parameters $par$ are zero for these two layers. For the convolutional layer $CONV$, the fully connected layer $FC$, and the softmax layer $Softmax$, the parameters are calculated using the equations shown in Table I.

Algorithm 2 Lightweight Linear Regression Algorithm
1:  Require: $B_{exe}$ (Benign application files), $M_{exe}$ (Malware application files)
2:  Input: $\mathfrak{Mem}[node],\mathbb{C}$
3:     define $Regressor(\mathbb{C})$:
4:        for $layer\leftarrow\mathbb{C}$: do
5:           $var\leftarrow f(W,B,A)$
6:           if $layer\rightarrow CONV$
7:              $par=f_{conv}(var)$
8:           elif $layer\rightarrow(FC\vee Softmax)$
9:              $par=f_{FC}(var)$
10:           else
11:              $par=0$
12:           end if
13:           $\bar{P}.append(par)$
14:        end for
15:        $\mathfrak{Mem}[model]\leftarrow N*batch\_size*\bar{P}*1KB$
16:        $X_{R}.features\leftarrow\{W,A,B,\bar{P},\mathfrak{Mem}[model],$
17:                             $\mathfrak{Mem}[node]\}$
18:        $Res\leftarrow\mathbb{R}:(X_{R},\beta)$
19:        return $Res$

If there are multiple convolutional and pooling layers, the parameters are calculated multiple times with different activation functions $A$. Finally, the estimated parameters of each layer are appended to give $\bar{P}$ (Line 13 - Line 18). Then, the memory for the model, $\mathfrak{Mem}[model]$, is calculated. The memory is a function of the parameters $\bar{P}$ for each batch over $N$ batches. It is assumed that each parameter needs one kilobyte (1 KB) for inference, based on which the final memory required is on the order of megabytes ($\sim 5MB$). $X_{R}.features$ represents the features given as input to the regressor $\mathbb{R}$, which predicts the resource estimation $Res$. The features in $X_{R}.features$ include the weight matrix $W$, bias $B$, activation $A$, the parameters of the CNN at each layer $\bar{P}$, the memory estimate of the model $\mathfrak{Mem}[model]$, and the memory available at each node $\mathfrak{Mem}[node]$. This resource estimation predicts whether the inference can be performed on a single node or needs to be done using model parallelism.
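The memory estimate on Line 15 of Algorithm 2 can be illustrated with a short sketch. Several points are assumptions made for illustration: the per-layer parameter list $\bar{P}$ is summed before scaling, the parameter counts and node memories are hypothetical, and a direct threshold comparison stands in for the trained binary regressor, whose coefficients are not given in the text.

```python
KB = 1024          # bytes; the text assumes 1 KB per parameter
MB = 1024 * KB

def estimate_model_memory(params_per_layer, batch_size, n_batches, bytes_per_param=KB):
    """Mem[model] = N * batch_size * sum(P) * 1 KB (summing P is an assumption)."""
    return n_batches * batch_size * sum(params_per_layer) * bytes_per_param

def needs_distribution(mem_model, mem_node):
    """Stand-in for the regressor output: True if one node cannot hold the model."""
    return mem_model > mem_node

# Hypothetical per-layer parameter counts (Input, CONV, POOL, CONV, POOL).
params = [0, 80, 0, 1168, 0]
mem_model = estimate_model_memory(params, batch_size=1, n_batches=4)

print(mem_model / MB)                          # 4.875
print(needs_distribution(mem_model, 8 * MB))   # False: fits on a single node
print(needs_distribution(mem_model, 2 * MB))   # True: must be distributed
```

With these illustrative numbers the estimate lands near the $\sim 5MB$ figure quoted above, so a node with only 2 MB free would trigger model parallelism.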

IV-D Workload- and resource-aware malware detection

We develop a few-shot learning-based convolutional neural network trained on malware and benign samples. The inference task using this pre-trained malware detection model is partitioned and executed on different devices [44]. We also ensure that child devices have no access to the complete information. The partitioning is based on the independence of the nodes of the ML classifier, represented as a graph, and on the workload that can be accommodated on the parent and child devices [57]. We provide an upper bound on the number of devices to which the task can be distributed. As the ML architecture is defined at design time, the model parallelism and model splitting overheads are not incurred at runtime. The overhead of determining whether distributed ML is needed is minimal because the computations involved have low complexity.

Figure 2: Model Distribution Over n Nodes

Given that the model is distributed on multiple IoT devices as shown in Figure 2, accumulating the gradients from the child nodes is a challenging task [44]. Techniques such as DistBelief [58] are highly dependent on the partitioning of the model. Thus, they can lead to varied performance in our case and hence are not adaptable. We adapt the AllReduce [59] paradigm in this project, where the parent node accumulates the gradients from the child nodes. To update the gradients and perform other computations, including inference, a synchronous AllReduce approach is utilized for better scalability [59]. However, a direct adaptation of such a method is vulnerable to faults: the unavailability of one device, or garbage data from it, can stall or contaminate the whole process. To address such concerns, we deploy Downpour stochastic gradient descent (SGD) [60]. Downpour SGD is more resilient to machine failures and data manipulation, as it allows training and inference to continue even if some model replicas are offline. Note that training happens offline, while inference is performed in real time. To minimize the communication overhead, we let the parent device choose child devices within a threshold radius $R$, for which the communication costs are lower, and ensure that the chosen devices have a smaller workload to process. As frequent communication between parent and child nodes leads to large overheads, we let the system communicate only when a device's output is required as input for another device.
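The child-selection rule described above (devices within radius $R$ that carry a small workload, up to a bounded number of devices) might look like the following sketch. The `Device` record, the use of Euclidean distance as a proxy for communication cost, and the `max_workload` threshold are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    x: float          # position, used only as a proxy for communication cost
    y: float
    workload: float   # fraction of resources currently busy, in [0, 1]

def select_children(parent, candidates, radius, max_workload, max_children):
    """Keep lightly loaded devices within `radius` of the parent, least loaded first."""
    def distance(d):
        return math.dist((parent.x, parent.y), (d.x, d.y))

    eligible = [d for d in candidates
                if distance(d) <= radius and d.workload <= max_workload]
    # Prefer the smallest workload; break ties by proximity to the parent.
    eligible.sort(key=lambda d: (d.workload, distance(d)))
    return eligible[:max_children]   # upper bound on the number of child devices

parent = Device("P", 0.0, 0.0, 0.8)
candidates = [
    Device("C1", 1.0, 1.0, 0.2),
    Device("C2", 10.0, 0.0, 0.1),   # outside the radius
    Device("C3", 2.0, 0.0, 0.9),    # too busy
    Device("C4", 0.0, 3.0, 0.1),
]
children = select_children(parent, candidates, radius=5.0,
                           max_workload=0.5, max_children=3)
names = [d.name for d in children]
print(names)  # ['C4', 'C1']
```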

Algorithm 3 Pseudo-Code for Distributed Runtime Modelling of Malware Detection
1:  Require: $M$ (Malware grayscale images), $B$ (Generated Benign grayscale images)
2:  Input: $D^{n}=\{B+M\},\mathfrak{Mem}[model]$
3:     define $Distribute\_CNN\_model()$:
4:        for $n\leftarrow range(0,x)$: do
5:           if $\mathfrak{Mem}[model]\leq\mathfrak{Mem}[node]$
6:              $node.append(n)$
7:              $\mathfrak{Mem}[node]\leftarrow\mathfrak{Mem}[0]+\mathfrak{Mem}[1]+...+\mathfrak{Mem}[n]$
8:           end if
9:        end for
10:        $l_{1}=nn.layer1.cuda(0)$
11:        $l_{2}=nn.layer2.cuda(1)$
12:        $\cdots$
13:        $\cdots$
14:        $l_{n}=nn.layern.cuda(n)$
15:        $model=nn.Sequential(l_{1},l_{2},...,l_{n})$
16:        $input=D^{n}.cuda(0)$
17:        $output=model(input)$
18:        return $O_{m}$

Algorithm 3 presents the pseudo-code for the proposed distributed runtime modeling of malware detection. The function that distributes the CNN, $Distribute\_CNN\_model()$, is called based on the output of the regressor. It also takes the memory estimate $\mathfrak{Mem}[model]$ of the CNN model for malware detection as input. It compares the model memory $\mathfrak{Mem}[model]$ with the memory available at each node $\mathfrak{Mem}[node]$, and accumulates the memory of multiple nodes to find the number of nodes required to distribute the model. The $n$ selected nodes should have a combined memory greater than or equal to the model memory $\mathfrak{Mem}[model]$ (Line 3 - Line 5). If this condition is met, the CNN is distributed over $n$ nodes. The different layers of the malware detection model, $l_{1},l_{2},\cdots,l_{n}$ (Line 10 - Line 14), are divided across the $n$ nodes and trained. The input data is made available to the input layers by passing it to $node0$. Communication between the nodes is made possible by the interdependent variables and the backward-pass algorithm through the function $model=nn.Sequential(l_{1},l_{2},...,l_{n})$ (Line 15 - Line 17).
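The node-selection step described above can be sketched in plain Python. The sketch follows the prose (nodes are added until their combined memory covers the model) and then places consecutive layers on the selected nodes. The function names, the per-layer memory figures, and the greedy placement rule are assumptions made for illustration, not the paper's implementation.

```python
def select_nodes(mem_model, node_mem):
    """Add nodes in order until their combined memory is >= the model memory."""
    chosen, total = [], 0.0
    for idx, mem in enumerate(node_mem):
        if total >= mem_model:
            break
        chosen.append(idx)
        total += mem
    if total < mem_model:
        raise ValueError("combined node memory is smaller than the model memory")
    return chosen

def partition_layers(layer_mem, node_mem, chosen):
    """Greedily place consecutive layers on the chosen nodes, respecting each node's memory."""
    placement = {n: [] for n in chosen}
    nodes = iter(chosen)
    node = next(nodes)
    free = node_mem[node]
    for layer, need in enumerate(layer_mem):
        while need > free:            # layer does not fit: move on to the next node
            node = next(nodes, None)
            if node is None:
                raise ValueError("layers cannot be placed on the chosen nodes")
            free = node_mem[node]
        placement[node].append(layer)
        free -= need
    return placement

layer_mem = [1.0, 2.0, 1.5, 0.5]     # MB per layer (hypothetical)
node_mem = [2.0, 2.5, 3.0, 4.0]      # MB available per node (hypothetical)
chosen = select_nodes(sum(layer_mem), node_mem)
placement = partition_layers(layer_mem, node_mem, chosen)
print(chosen, placement)  # [0, 1, 2] {0: [0], 1: [1], 2: [2, 3]}
```

Sufficient combined memory does not guarantee a feasible layer-wise placement, since whole layers cannot be split in this sketch, which is why `partition_layers` can still raise an error.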

V Results

V-A Experimental Setup

For the IoT network setup, we deployed 20 IoT nodes built on Broadcom BCM2711 quad-core Cortex-A72 (ARM v8) 64-bit boards. These nodes are connected through a wireless interface (WiFi). For model parallelism, we deployed four Jetson Nano boards, each containing a 128-core NVIDIA Maxwell architecture-based GPU and a quad-core ARM® A57 CPU. These boards give the IoT nodes access to multiple CPU and GPU nodes, and each Jetson Nano board acts as a single entity. We obtained malware and benign applications from VirusTotal [61], with 12,500 benign samples and 70,000 malware samples that encompass 5 malware classes: backdoor, rootkit, trojan, virus, and worm. These files are executed in a sandbox to capture their HPC attributes. The HPC attributes of the benign and malware samples are converted to grayscale images of size 256 $\times$ 256. The benign and malware binary samples are also converted to grayscale images of size 256 $\times$ 256. In this image dataset, 70% of the data is used as the training set and the remaining 30%, which is unseen, is used as the test set. To further improve model training and make it resilient to malware evolution, synthetic data generated using the code-aware data generation technique based on few-shot learning is added to the training set. This synthetic data is also split 70%-30% and used to augment the training and test data during runtime. A CNN is trained offline on all of this data, and inference on the test data is performed on multiple CPU and GPU nodes based on resource availability.

V-B Simulation Results

The inference is performed using a pre-trained convolutional neural network. If the resources available for inference are not sufficient, the malware detection task is offloaded to multiple nodes. Table II reports the performance on different datasets when model parallelism is applied. These datasets have only a few real samples and are populated with synthetic data generated using GANs. Different models trained on the HPC, binary, and combined datasets are divided over multiple nodes. We compare the performance of the proposed technique in terms of accuracy, F1-score, precision, and recall. The highest accuracy of 98.62% is observed when the training data contains both HPC and binary image data and model parallelism is performed on two nodes. This case performs well because it has more training samples, which improves the model's learning capability. In addition, the inference model is divided across only two nodes, so the penalty is smaller than with three- or four-node model parallelism. The lowest accuracy of 89.45% is observed for the model trained on the binary dataset with model parallelism on four nodes. This is due to the limited features in the binary dataset that help in detecting complex malware, combined with the penalty of dividing the model across four nodes. As the number of nodes across which the model is divided increases, we observe a small penalty in performance.

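TABLE II: Performance comparison of the proposed model with different MP algorithms
| Model | Nodes | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- | --- | --- |
| HPC (with MP) | 2 | 97.64 | 97.62 | 97.65 | 97.63 |
| HPC (with MP) | 3 | 97.24 | 97.24 | 97.26 | 76.45 |
| HPC (with MP) | 4 | 96.73 | 96.73 | 97.21 | 96.72 |
| Binaries (with MP) | 2 | 91.62 | 91.63 | 91.62 | 91.70 |
| Binaries (with MP) | 3 | 91.18 | 91.16 | 91.15 | 91.18 |
| Binaries (with MP) | 4 | 89.45 | 89.43 | 89.42 | 89.45 |
| HPC + Binaries (with MP) | 2 | 98.62 | 98.63 | 98.62 | 98.70 |
| HPC + Binaries (with MP) | 3 | 98.14 | 98.12 | 97.86 | 98.12 |
| HPC + Binaries (with MP) | 4 | 97.12 | 97.13 | 97.12 | 97.12 |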

V-C Comparison with Previous Works

Table III compares the proposed technique with existing HPC-based malware detection techniques in terms of accuracy, F1-score, recall, latency, and area. All the models in Table III focus on malware detection based on HPC runtime features. Compared to the existing techniques, the proposed CNN-based distributed training on HPC-based image data achieves the highest accuracy. It maintains an accuracy of 96.7% while producing comparable latency and area values.

TABLE III: Comparison with existing HPC-based detection techniques
| Model | Accuracy (%) | F1-score (%) | Recall (%) | Latency (@ 10ns) | Area ($\mu m^{2}$) |
| --- | --- | --- | --- | --- | --- |
| OneR [62] | 81.00 | 81.00 | 82.00 | 1 | 1258 |
| JRIP [62] | 83.00 | 83.00 | 84.00 | 4 | 1504 |
| PART [62] | 81.00 | 81.50 | 83.10 | 6 | 2131 |
| J48 [62] | 82.00 | 82.00 | 82.00 | 9 | 1801 |
| Adaptive-HMD [63] | 85.30 | 85.30 | 85.80 | 4 | 876 |
| SVM [64] | 73.90 | 73.60 | 77.20 | - | - |
| RF [64] | 83.50 | 83.40 | 82.20 | - | - |
| NN [64] | 81.10 | 81.10 | 81.60 | - | - |
| SMO [65] | 93.20 | 93.30 | 93.10 | 22 | 2466 |
| Proposed | 96.70 | 96.70 | 97.20 | 10 | 1044 |

The F1-score and recall also support the performance of the proposed technique compared to other techniques. From these results, it is evident that the proposed technique achieves state-of-the-art HPC-based malware detection accuracy. The latency, obtained from the Synopsys DC tool, is reported in clock cycles of 10 ns as a measure of inference time. An inference time of a few tens of nanoseconds indicates real-time malware detection.

V-D Impact on Latency with Proposed Technique

Normalized inference execution time is analyzed for two cases: a) the parent node has sufficient resources; b) the parent node does not have enough resources and offloads to multiple nodes. Figure 3 shows the latency for these cases. In Figure 3, Node represents the 128-core NVIDIA Maxwell architecture-based GPU present in the Jetson Nano boards, and ARM represents the quad-core ARM® A57 CPU present in the Jetson boards. P represents the parent node, and C1, C2, and C3 represent the first, second, and third child nodes, respectively. We observe that the inference time decreases as the number of nodes increases: as the parameters are divided over more nodes, the execution time needed for inference decreases. As the executions run in parallel, the total latency of inference under model parallelism is the latency of the node that takes the longest to execute (usually the parent node P). In Figure 3, for the case of sufficient resources, it takes 98 seconds to perform the inference task. With model parallelism, we observe a speedup of 4$\times$ when the inference task is parallelized between two nodes. If we further offload the inference task to three nodes, an additional speedup of about 1.5$\times$ is observed. The ARM boards used as child nodes in model parallelism also produced notable speedups. We observed an overall speedup of 9.8$\times$ while using four Jetson Nano boards.
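The latency model stated above (the parallel latency equals that of the slowest node) and the reported speedups can be checked with a little arithmetic. In the sketch below, the per-node latencies are hypothetical and the "implied" latencies are derived from the reported 98-second baseline and speedup factors; they are not measurements from the paper.

```python
def parallel_latency(node_latencies):
    """Nodes run in parallel, so the slowest node determines the total latency."""
    return max(node_latencies)

def speedup(baseline, node_latencies):
    """Speedup of model parallelism over single-node inference."""
    return baseline / parallel_latency(node_latencies)

baseline_s = 98.0                       # single-node inference time from the text

# Latencies implied by the reported speedups (derived, not measured).
implied_two_nodes = baseline_s / 4.0    # 24.5 s for the 4x case
implied_four_nodes = baseline_s / 9.8   # about 10 s for the 9.8x case

# Hypothetical per-node latencies with the parent P as the slowest node.
two_node_run = [24.5, 20.0]             # [P, C1]
print(speedup(baseline_s, two_node_run))  # 4.0
```

Because the slowest node bounds the result, balancing the layer assignment so that no single node (typically the parent) dominates is what converts additional nodes into additional speedup.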

Figure 3: Latency of Distributed learning for Malware Detection
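
The latency rule described above (the parallel latency equals the latency of the slowest node) can be sketched in a few lines of Python. Only the 98-second single-node time comes from the text; the per-node times and the function names are hypothetical placeholders chosen for illustration, not measurements.

```python
BASELINE_S = 98.0  # single-node inference time reported in the text (seconds)


def parallel_latency(node_latencies):
    """Nodes run concurrently, so the slowest node sets the total latency."""
    return max(node_latencies)


def speedup(baseline_s, parallel_s):
    """Speedup of the parallel run relative to the single-node baseline."""
    return baseline_s / parallel_s


# Hypothetical per-node execution times in seconds (assumed, not measured).
# The parent P is taken to be the slowest node, as the text says is usual.
node_times = {"P": 10.0, "C1": 8.5, "C2": 9.0, "C3": 7.5}

total = parallel_latency(node_times.values())
print(f"parallel latency: {total:.1f} s")
print(f"speedup: {speedup(BASELINE_S, total):.1f}x")
```

With these placeholder values the parent's 10 s dominates, which corresponds to a 9.8× speedup over the 98 s baseline.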

V-E Impact of Proposed Technique on Resource Consumption

The resource consumption of the different worker nodes is shown in Figure 4, where P denotes the parent node and C1, C2, and C3 denote the first, second, and third child nodes, respectively. We observe the resource consumption for the following scenarios: a) the parent node has sufficient resources; b) the parent node does not have enough resources and outsources the task to multiple nodes. The inference task requires 4 MB of data to complete. In the first case, the single parent node P can provide all of this data. In the other cases, the inference task is divided among multiple nodes (model parallelism), so the required data is also divided among those nodes. The resources are not consumed equally across the parent and child nodes. This is because the parent node usually performs additional steps, such as collecting the gradients from the child nodes and adding them, so it consumes more resources. Compared to using a single node, model parallelism draws the required resources from several nodes, which helps to improve the processing speed.

Figure 4: Resource Consumption for Inference Over n Nodes
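
As a rough illustration of why the parent consumes more than the children, the sketch below divides the 4 MB inference footprint evenly across n nodes and then adds an aggregation overhead on the parent. The even split, the 10% overhead figure, and the function name are assumptions made for illustration; the text does not state how the data is partitioned.

```python
TOTAL_MB = 4.0  # data needed by the inference task, as stated in the text


def split_resources(n_nodes, parent_overhead=0.10):
    """Return per-node memory in MB: an even share for each node, plus an
    assumed aggregation overhead (a fraction of the share) on the parent P."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    share = TOTAL_MB / n_nodes
    usage = {"P": share}
    for i in range(1, n_nodes):
        usage[f"C{i}"] = share
    if n_nodes > 1:  # a lone parent has nothing to aggregate
        usage["P"] += share * parent_overhead
    return usage


for n in (1, 2, 4):
    print(n, split_resources(n))
```

Under these assumptions, each node's share shrinks as nodes are added, while the parent always carries slightly more than any child.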

V-F ASIC Implementation of Proposed Technique

We present the hardware implementation of the classifiers on an ASIC to estimate resource utilization. The power, area, and energy values are reported at 100 MHz. We used Synopsys Design Compiler Graphical to obtain the area of the models, and Synopsys PrimeTime PX to obtain the power consumption. The post-layout area, power, and energy are summarized in Table IV. The resource utilization of the binary regression model is significantly lower, whereas the CNN consumes high power, energy, and on-chip area (Table IV); hence, we split the CNN model across neighboring nodes that have available resources for inference computation. The post-layout energy numbers were approximately 1.8× higher than the post-synthesis results. This increase in energy is mainly due to metal routing, which introduces layout parasitics. Because the tool applies different routing optimizations, the power, area, and energy values vary with the classifiers' composition and architecture.

TABLE IV: Post synthesis hardware results of the classifiers (@100MHz) when deployed
Model Power (mW) Energy (mJ) Area (mm²)
CNN 82.45 5.12 4.42
Regressor 27.81 2.52 1.18
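
To quantify how much lighter the regressor is than the CNN, the ratios between the two rows of Table IV can be computed directly. The snippet uses only the values in the table; the dictionary layout and the names are illustrative.

```python
# Values copied from Table IV (reported at 100 MHz).
table_iv = {
    "CNN":       {"power_mw": 82.45, "energy_mj": 5.12, "area_mm2": 4.42},
    "Regressor": {"power_mw": 27.81, "energy_mj": 2.52, "area_mm2": 1.18},
}


def ratios(heavy, light):
    """Ratio heavy/light for every metric of the two models."""
    return {k: heavy[k] / light[k] for k in heavy}


cnn_vs_reg = ratios(table_iv["CNN"], table_iv["Regressor"])
for metric, r in cnn_vs_reg.items():
    print(f"{metric}: CNN is {r:.2f}x the regressor")
```

By these figures, the CNN draws roughly 3× the power, 2× the energy, and about 3.7× the area of the regressor.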

VI Conclusion

The proposed resource- and workload-aware, model parallelism-based malware detection technique employs distributed training to enable better security for resource-constrained IoT devices. The metrics of distributed training on multiple nodes are analyzed, and a speedup of 9.8× is observed compared to on-device training. The results presented also show that the proposed technique achieves a state-of-the-art malware detection accuracy of 96.7% among HPC-based detection techniques. We also presented the ASIC implementations of the CNN classifier trained using the proposed technique and of the lightweight logistic regressor trained to classify the availability of resources. Thus, the proposed technique is reliable and accurate for malware detection in IoT devices.

References

  • [1] T. Adiono, “Challenges and opportunities in designing internet of things,” 2014 The 1st International Conference on Information Technology, Computer, and Electrical Engineering, 2014.
  • [2] O. Abbas and et al., “Big data issues and challenges,” 2016.
  • [3] J. Johnson, “Number of malware attacks per year 2020,” Sep 2022. [Online]. Available: https://www.statista.com/statistics/873097/malware-attacks-per-year-worldwide/
  • [4] “Malware statistics trends report: Av-test,” 2021. [Online]. Available: https://www.av-test.org/en/statistics/malware/
  • [5] J. Wurm, K. Hoang, O. Arias, A. R. Sadeghi, and Y. **, “Security analysis on consumer and industrial iot devices,” in Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2016.
  • [6] E. Ronen and A. Shamir, “Extended functionality attacks on IoT devices: The case of smart lights,” in IEEE European Symposium on Security and Privacy, 2016.
  • [7] A. Damodaran et al., “A comparison of static, dynamic, and hybrid analysis for malware detection,” Journal of Computer Virology and Hacking Techniques, 2015.
  • [8] C. Rossow and et.al, “Prudent practices for designing malware experiments: Status quo and outlook,” Symposium on Security and Privacy, 2012.
  • [9] L. Nataraj and et al., “Malware images: Visualization and automatic classification,” in Int. Symposium on Visualization for Cyber Security, 2011.
  • [10] D. Gibert and et.al, “Using convolutional neural networks for classification of malware represented as images,” Journal of Computer Virology and Hacking Techniques, 2019.
  • [11] B. Bashari Rad and et.al, “Camouflage in malware: From encryption to metamorphism,” IJCSNS, 2012.
  • [12] S. Shukla, P. D. Sai Manoj, G. Kolhe, and S. Rafatirad, “On-device malware detection using performance-aware and robust collaborative learning,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021.
  • [13] S. Shukla, S. Rafatirad, H. Homayoun, and S. M. P. Dinakarrao, “Federated learning with heterogeneous models for on-device malware detection in iot networks,” in 2023 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2023.
  • [14] H. Sayadi, Y. Gao, H. Mohammadi Makrani, J. Lin, P. C. Costa, S. Rafatirad, and H. Homayoun, “Towards accurate run-time hardware-assisted stealthy malware detection: A lightweight, yet effective time series cnn-based approach,” Cryptography, vol. 5, no. 4, 2021. [Online]. Available: https://www.mdpi.com/2410-387X/5/4/28
  • [15] S. Das, J. Werner, M. Antonakakis, M. Polychronakis, and F. Monrose, “The challenges, pitfalls, and perils of using hardware performance counters for security,” in IEEE Security & Privacy, 2019.
  • [16] I. You and et.al, “Malware obfuscation techniques: A brief survey,” in Int. Conf. on Broadband, Wireless Comp., Comm. and Applications, 2010.
  • [17] S. J. Stolfo and et.al, “Towards stealthy malware detection,” in Malware Detection, 2007.
  • [18] S. Li, H. Song, and M. Iqbal, “Privacy and security for resource-constrained iot devices and networks: Research challenges and opportunities,” Sensors, vol. 19, no. 8, 2019. [Online]. Available: https://www.mdpi.com/1424-8220/19/8/1935
  • [19] A. Makandar and A. Patrot, “Malware class recognition using image processing techniques,” in Int. Conf. on Data Management, Analytics and Innovation (ICDMAI), 2017.
  • [20] A. Dhavlle and S. Shukla, “A novel malware detection mechanism based on features extracted from converted malware binary images,” 2021.
  • [21] S. Shukla, G. Kolhe, H. Homayoun, S. Rafatirad, and S. M. P D, “Rafel - robust and data-aware federated learning-inspired malware detection in internet-of-things (iot) networks,” in Proceedings of the Great Lakes Symposium on VLSI 2022, ser. GLSVLSI ’22.   New York, NY, USA: Association for Computing Machinery, 2022.
  • [22] Brasser and et al., “Advances and throwbacks in hardware-assisted security: Special session,” in International Conference on Compilers, Architecture and Synthesis for Embedded Systems, ser. CASES ’18, 2018.
  • [23] D. Bilar, “Opcodes as predictor for malware,” IJESDF, 2007.
  • [24] D. John, M. Matthew, S. Jared, T. Adrian, W. Adam, S. Simha, and S. Salvatore, “On the feasibility of online malware detection with performance counters,” in ISCA’13, 2013.
  • [25] A. P. Kuruvila, S. Kundu, and K. Basu, “Analyzing the efficiency of machine learning classifiers in hardware-based malware detectors,” in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020, pp. 452–457.
  • [26] A. Dhavlle, S. Shukla, S. Rafatirad, H. Homayoun, and S. M. Pudukotai Dinakarrao, “Hmd-hardener: Adversarially robust and efficient hardware-assisted runtime malware detection,” in 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021.
  • [27] S. Shukla, G. Kolhe, S. M. P. D, and S. Rafatirad, “Microarchitectural events and image processing-based hybrid approach for robust malware detection: Work-in-progress,” in Proceedings of the International Conference on Compliers, Architectures and Synthesis for Embedded Systems Companion, ser. CASES ’19.   New York, NY, USA: Association for Computing Machinery, 2019.
  • [28] S. Shukla and P. D. Sai Manoj, “Bring it on: Kinetic energy harvesting to spark machine learning computations in iots,” in 2024 International Symposium on Quality Electronic Design (ISQED), 2024.
  • [29] S. Barve, S. Shukla, S. M. P. Dinakarrao, and R. Jha, “Adversarial attack mitigation approaches using rram-neuromorphic architectures,” in Proceedings of the 2021 on Great Lakes Symposium on VLSI, ser. GLSVLSI ’21.   New York, NY, USA: Association for Computing Machinery, 2021.
  • [30] S. Kasarapu, R. Hassan, S. Rafatirad, H. Homayoun, and S. M. P. Dinakarrao, “Demography-aware covid-19 confinement with game theory,” in 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS).   IEEE, 2021, pp. 1–4.
  • [31] S. Kasarapu, R. Hassan, H. Homayoun, and S. M. Pudukotai Dinakarrao, “Scalable and demography-agnostic confinement strategies for covid-19 pandemic with game theory and graph algorithms,” COVID, vol. 2, no. 6, pp. 767–792, 2022.
  • [32] S. Kasarapu, S. Shukla, and S. M. P. Dinakarrao, “Resource- and workload-aware malware detection through distributed computing in iot networks,” in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 2024, pp. 368–373.
  • [33] S. Kasarapu, S. Shukla, and S. M. Pudukotai Dinakarrao, “Resource- and workload-aware model parallelism-inspired novel malware detection for iot devices,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 12, pp. 4618–4628, 2023.
  • [34] S. Kasarapu, S. Shukla, R. Hassan, A. Sasan, H. Homayoun, and S. M. PD, “Cad-fsl: Code-aware data generation based few-shot learning for efficient malware detection,” 2022.
  • [35] S. Shukla, S. Kasarapu, R. Hasan, S. M. P. D, and H. Shen, “Ubol: User-behavior-aware one-shot learning for safe autonomous driving,” in 2022 Fifth International Conference on Connected and Autonomous Driving (MetroCAD), 2022, pp. 7–12.
  • [36] S. Kasarapu, S. Bavikadi, and S. M. P. Dinakarrao, “Processing-in-memory architecture with precision-scaling for malware detection,” in 2024 37th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), 2024, pp. 529–534.
  • [37] G. P. C. R. S. Bharani Surya, S and N. Mohankumar, “Evolving reversible fault-tolerant adder architectures and their power estimation,” in International Conference on Communication, Computing and Electronics Systems: Proceedings of ICCCES 2019, 2020.
  • [38] R. Saravanan, S. Bavikadi, S. Rai, A. Kumar, and S. M. Pudukotai Dinakarrao, “Reconfigurable fet approximate computing-based accelerator for deep learning applications,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5.
  • [39] M. N. Raghul, S, “Microcontroller based ann for pick and place robot coordinate monitoring system,” in Proceedings of the 1st International Conference on Data Science, Machine Learning and Applications (ICDSMLA), 2020.
  • [40] R. S, Y. Akhileswar, and M. N, “N configuration trng based scrambler protocol for secured file transfer,” in Proceedings of the International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS), 2021.
  • [41] Y. Akhileswar, S. Raghul, C. Meghana, and N. Mohankumar, Hardware-Assisted QR Code Generation Using Fault-Tolerant TRNG, 2020.
  • [42] M. D. F. De Grazia, I. Stoianov, and M. Zorzi, “Parallelization of deep networks.” in ESANN.   Citeseer, 2012.
  • [43] M. Li, D. Andersen, J. Park, A. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” Proc. OSDI, pp. 583–598, 01 2014.
  • [44] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Comput. Surv., vol. 53, no. 2, Mar. 2020.
  • [45] R. Hewett and I. Grady, “A linear algebraic approach to model parallelism in deep learning,” 06 2020.
  • [46] M. Shoeybi, M. M. A. Patwary, R. Puri, P. Legresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using gpu model parallelism,” 09 2019.
  • [47] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism in distributed training for huge neural networks,” 05 2021.
  • [48] S. M. Grigorescu, “Generative one-shot learning (gol): A semi-parametric approach to one-shot learning in autonomous vision,” Int. Conf. on Robotics and Automation (ICRA), 2018.
  • [49] D. Bogdoll, J. Jestram, J. Rauch, C. Scheib, M. Wittig, and J. M. Zöllner, “Compressing sensor data for remote assistance of autonomous vehicles using deep generative models,” CoRR, 2021.
  • [50] S. Pillai and J. J. Leonard, “Towards visual ego-motion learning in robots,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  • [51] L. Taylor and G. Nitschke, “Improving deep learning with generic data augmentation,” in Symposium Series on Computational Intelligence (SSCI), 2018.
  • [52] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” Neural Information Processing Systems, 2012.
  • [53] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Int. conf. on machine learning.   PMLR, 2016.
  • [54] S. Shukla, A. Dhavlle, S. M. P. D, H. Homayoun, and S. Rafatirad, “Iron-dome: Securing iot networked systems at runtime by network and device characteristics to confine malware epidemics,” in 2022 IEEE 40th International Conference on Computer Design (ICCD), 2022.
  • [55] S. Shukla, G. Kolhe, S. M. PD, and S. Rafatirad, “Rnn-based classifier to detect stealthy malware using localized features and complex symbolic sequence,” in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019.
  • [56] S. Shukla, G. Kolhe, S. M. P D, and S. Rafatirad, “Stealthy malware detection using rnn-based automated localized feature extraction and classifier,” in 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019.
  • [57] J. Geng, D. Li, and S. Wang, “Elasticpipe: An efficient and dynamic model-parallel solution to dnn training,” in Workshop on Scientific Cloud Computing, 2019.
  • [58] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in International Conference on Neural Information Processing Systems, 2012.
  • [59] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69, no. 2, p. 117–124, Feb. 2009.
  • [60] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., vol. 12, no. null, p. 2121–2159, Jul. 2011.
  • [61] “Virustotal package,” 2021. [Online]. Available: https://www.rdocumentation.org/packages/virustotal/versions/0.2.1
  • [62] N. Patel, A. Sasan, and H. Homayoun, “Analyzing hardware based malware detectors,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
  • [63] Y. Gao, H. M. Makrani, M. Aliasgari, A. Rezaei, J. Lin, H. Homayoun, and H. Sayadi, “Adaptive-hmd: Accurate and cost-efficient machine learning-driven malware detection using microarchitectural events,” in 2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS), 2021, pp. 1–7.
  • [64] A. P. Kuruvila, S. Kundu, and K. Basu, “Analyzing the efficiency of machine learning classifiers in hardware-based malware detectors,” in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020, pp. 452–457.
  • [65] H. Sayadi, H. Mohammadi Makrani, O. Randive, S. M. P.D., S. Rafatirad, and H. Homayoun, “Customized machine learning-based hardware-assisted malware detection in embedded devices,” in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), 2018, pp. 1685–1688.
Sreenitha Kasarapu is a Ph.D. student, currently conducting her research under the supervision of Dr. Sai Manoj P D, an assistant professor in the Electrical and Computer Engineering Department, George Mason University, Fairfax, VA, USA. She previously worked as a research assistant at GMU. Sreenitha's current research interests include IoT network security, computer vision, image processing, and time series analysis. She published her work in AICAS'20 and actively participated in research projects on malware detection. She received her Bachelor of Technology degree in Electronics and Communication Engineering from Jawaharlal Nehru Technological University Hyderabad (JNTUH), Hyderabad, India, in 2019. At that time, she won second prize at the National Level Technical Symposium held by the Indian Society for Technical Education (ISTE) for her paper on sensor technology and participated in several peer-reviewed paper presentations.
Sanket Shukla received his bachelor's degree in Electronics Engineering from Mumbai University in 2015 and his master's degree (M.Sc.) in Computer Engineering from George Mason University in 2021. He is currently pursuing a Ph.D. degree in the Electrical and Computer Engineering Department, George Mason University, under the supervision of Dr. Sai Manoj PD. He conducts research in developing machine learning and deep learning-based solutions for IoT and cybersecurity. He has published research papers in the DATE, DAC, ICCD, GLSVLSI, ICMLA, and ICTAI conferences and has also reviewed several journals and research papers. His research work submitted to DATE 2023 was recognized and nominated for the best paper award. His research interests include malware detection, cybersecurity, federated learning, and energy- and computation-efficient machine learning and deep learning for security on IoT and computer systems.
Sai Manoj P D (S'13-M'15) is an assistant professor at George Mason University (GMU). Prior to joining GMU as an assistant professor, he served as a research assistant professor and post-doctoral research fellow at GMU and was a post-doctoral research scientist in the System-on-Chip group, Institute of Computer Technology, Vienna University of Technology (TU Wien), Austria. He received his Ph.D. in Electrical and Electronics Engineering from Nanyang Technological University, Singapore, in 2015. He received his Master's in Information Technology from the International Institute of Information Technology Bangalore (IIITB), Bangalore, India, in 2012. His research interests include on-chip hardware security, neuromorphic computing, adversarial machine learning, self-aware SoC design, image processing and time-series analysis, emerging memory devices, and heterogeneous integration techniques. He won the best paper award at the Int. Conf. on Data Mining 2019, his works were nominated for the best paper award at prestigious conferences such as Design Automation & Test in Europe (DATE) 2018 and the International Conference on Consumer Electronics 2020, and he won the Xilinx open hardware contest in 2017 (student category). He is the recipient of the "A. Richard Newton Young Research Fellow" award at the Design Automation Conference, 2013.