Enhancing IoT Malware Detection through Adaptive Model Parallelism and Resource Optimization

Sreenitha Kasarapu, Sanket Shukla, Sai Manoj Pudukotai Dinakarrao
S. Kasarapu, S. Shukla, and S. M. P. Dinakarrao are associated with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA, 22030, USA. Email: {skasarap,sshukla4,spudukot}@gmu.edu. This work was supported by the Commonwealth Cyber Initiative, an investment in the advancement of cyber R&D, innovation, and workforce development. For more information about CCI, visit www.cyberinitiative.org

Abstract

The widespread integration of IoT devices has greatly improved connectivity and computational capabilities, facilitating seamless communication across networks. Despite their global deployment, IoT devices are frequently targeted for security breaches due to inherent vulnerabilities. Among these threats, malware poses a significant risk to IoT devices. The lack of built-in security features and limited resources present challenges for implementing effective malware detection techniques on IoT devices. Moreover, existing methods assume access to all device resources for malware detection, which is often not feasible for IoT devices deployed in critical real-world scenarios. To overcome this challenge, this study introduces a novel approach to malware detection tailored for IoT devices, leveraging resource and workload awareness inspired by model parallelism. Initially, the device assesses available resources for malware detection using a lightweight regression model. Based on resource availability, ongoing workload, and communication costs, the malware detection task is dynamically allocated either on-device or offloaded to neighboring IoT nodes with sufficient resources. To uphold data integrity and user privacy, instead of transferring the entire malware detection task, the classifier is divided and distributed across multiple nodes, then integrated at the parent node for detection. Experimental results demonstrate that this proposed technique achieves a significant speedup of 9.8 $\times$ compared to on-device inference, while maintaining a high malware detection accuracy of 96.7%.

Index Terms:

Hardware security, malware detection, deep learning models, image processing, model parallelism, distributed learning

I Introduction

Recent advancements and innovations in Internet-of-Things (IoT) devices have fueled the growth and extensive deployment of a network comprised of intelligent IoT devices [1]. These devices find application in various domains, including consumer electronics such as smart homes, smart cars, and smart grids, as well as in defense systems [1]. Despite offering numerous benefits, IoT devices and networks have become attractive targets for cyber attackers seeking unauthorized access to user information [2]. Notably, malicious applications, commonly referred to as malware, pose a significant threat to IoT devices, with cyber-attacks often executed through the deployment of such malware [3]. Malware, characterized as malicious software or applications, is designed to infiltrate devices, enabling unauthorized access to sensitive information such as passwords and financial data, and allowing manipulation of stored data without user consent.

Malware stands out as a significant threat, primarily due to its ease of creation and the limited capability of IoT devices to verify third-party applications before executing them [3]. The security risks for IoT networks escalate every year, with the number of attacks reported to grow exponentially [3]. In 2021 alone, over 5.4 billion malware attacks were recorded, and the first half of 2022 had already witnessed 2.8 billion attacks [3]. Adversaries leverage technological advancements to develop sophisticated malware that evades detection. Records indicate that, on average, more than 8 million malware threats have been identified daily in recent years [4].

The significant surge in malware attacks and security threats has heightened concerns regarding the security of IoT devices, potentially hindering their deployability. This underscores the need for techniques capable of detecting malware in IoT devices and mitigating the exploitation of user data. Several studies have been put forth to address malware detection on IoT devices [5, 6]. However, the existing works primarily suffer from four challenges:

(1) Real-time Malware Detection: Detecting malware during runtime with minimal latency is crucial, as malware can have severe consequences and can be challenging to detect once its payload is activated. Two approaches are commonly used for malware detection: static analysis and dynamic analysis [7, 8]. Static analysis examines the internal structure of malware binaries in a non-runtime environment, without executing the binary files. Dynamic analysis, on the other hand, inspects binary applications for malware traces by executing them in a sandbox environment. Unlike static analysis, dynamic analysis is a functionality test, which makes it better at identifying the presence of malware in an application.

Recent malware detection techniques (both static and dynamic) utilize a variety of Machine Learning (ML) techniques to enhance performance [9]. Among the ML-based techniques, CNN-based image classification [10] is observed to be efficient because of its ability to learn image features. Emerging trends indicate that malware developers create advanced malware by employing techniques such as code obfuscation, metamorphism, and polymorphism [11, 12, 13], which mutate malware binary executables and modify the static and dynamic application traces (signatures) to evade detection. This further increases the complexity of malware detection and causes it to incur large latency.

(2) Reliable Feature Extraction: Despite the abundance of research on malware detection [8, 10], there is a persistent challenge in reliably extracting input features that contribute to effective malware detection [14]. Regardless of the effectiveness of the underlying analysis technique, whether machine learning (ML) or non-ML, if the extracted features are not reliable, the malware detection task becomes unreliable. A popular technique to address this challenge is the utilization of hardware performance counters (HPC), device, and network features for node-level malware detection. This approach aims to minimize overheads and meet latency requirements [14]. HPCs can assist in distinguishing between malware and benign applications with low overheads. However, concerns have been raised regarding the reliability of using HPC information for security purposes in recent years [15]. For example, in Intel Pentium 4 processors, the ‘Instruction count’ is often over-counted [15]. Additionally, the coexistence of multiple applications can influence HPC values and trends, leading to non-determinism and unreliability. Therefore, there is a need for improved techniques that can efficiently analyze traits of benign and malware applications while addressing these reliability challenges.

(3) Manual Data Acquisition: Supervised learning models are commonly employed for malware detection, utilizing datasets comprising both malware and benign data. However, as the volume of malware data increases annually, there arises a necessity to regularly update these machine learning (ML) models. Yet, the process of collecting, cleaning, and labeling data is labor-intensive. Furthermore, adversaries employ various techniques such as code obfuscation, metamorphism, and polymorphism to enhance the complexity of malware binaries and evade detection [16, 11]. In such scenarios, manual data acquisition becomes increasingly challenging. For instance, morphism techniques alter malware binary files to mimic the functionality of standard applications, thereby deceiving the detection capabilities of various methodologies.

Techniques such as code obfuscation [16] involve encrypting specific sections of code within malware binary files while preserving its functionality. This tactic effectively conceals the presence of malware within embedded systems, exploiting their security vulnerabilities. Another strategy employed to obscure malware identity is stealthy malware [17], where malware is integrated into benign binaries using randomized obfuscation. Consequently, the benign application exhibits malware-like behavior only after a certain period, rendering it challenging to detect. These sophisticated techniques underscore the complexity of disguising malware and necessitate extensive training to enable machine learning (ML) models to discern hidden malware patterns. Consequently, acquiring the necessary data for training becomes more complex. This highlights the urgency to adopt efficient malware detection techniques that can operate effectively with limited data.

(4) Limited Resources on IoT devices: As previously mentioned, IoT devices are designed with constrained resources to prioritize portability and meet user demands [18]. Typically, the bulk of these resources are allocated to executing user applications, with only a limited portion reserved for on-device security measures. Consequently, it is impractical for IoT devices with limited resources to undertake computationally intensive malware detection tasks. Existing approaches either (1) prioritize malware detection at the expense of consuming all available application memory on IoT devices or (2) prioritize user applications, neglecting malware detection capabilities altogether. Both scenarios pose challenges for IoT devices: in the former case, the primary user application’s performance is restricted, while in the latter case, user security and privacy are compromised. Thus, there is a pressing need for a technique that can effectively perform malware detection without disrupting the workload of an IoT device.

To address the aforementioned limitations, this work introduces a novel resource-aware and workload-aware model-parallelism-based malware detection technique for IoT devices. This technique enables efficient malware classification without demanding excessive resources from IoT devices. Instead, it distributes the ML model over neighboring IoT nodes to facilitate malware detection. Application privacy is maintained despite the shared resources, as the model is distributed only onto nodes of the same IoT network. The ML model is trained using a few-shot technique to decrease its need for manually annotated image samples. The novel contributions of this work can be outlined as follows:

• This work introduces a methodology for reliably extracting device and network characteristics, laying the foundation for efficient and effective malware detection.
• This work implements an automatic assessment of available resources in IoT devices for malware detection. It provides an estimate of whether to offload the malware detection task or not. This analysis is conducted by training a lightweight regressor on the workload of the IoT device and ML model parameters.
• The proposed approach involves distributing ML model resources to neighboring devices in a resource-aware manner, taking into account communication and computation overheads for effective malware detection.
• We also introduce a code-aware data generation-based few-shot technique aimed at generating mutated training samples to capture the features of actual malware samples. These generated images mimic the complex functionality of malware, addressing the challenge of comprehensive data acquisition.

The experimental results show that the proposed resource-aware model parallelism technique detects complex malware in IoT networks with an accuracy of over 90%. In particular, the technique achieves a speed-up of 9.8 $\times$ compared to on-device inference while maintaining a malware detection accuracy of 96.7%.

The rest of the paper is organized as follows: Section II describes the related work, its shortcomings, and a comparison with the proposed model. Section III formulates the problem of malware detection in IoT devices. Section IV describes the proposed resource-aware model parallelism architecture, which assists with efficient malware detection in IoT devices using a distributed runtime model training methodology. Section V presents the experimental evaluation of the proposed model and its comparison with various ML architectures, and Section VI furnishes the conclusions drawn from the paper.

II State-of-the-Art

In this section, we present some of the relevant works proposed in the recent past on malware detection, distributed learning, and few-shot learning.

II-A Malware Detection Techniques

Malware detection in recent years has gained a lot of interest. We broadly categorize malware detection into two categories.

II-A1 Static Analysis based Malware Detection

Traditionally, malware detection relies on static and dynamic analysis. Static analysis [7] is performed by comparing the opcode sequences of binary executable files, control flow graphs, and code patterns. It is performed in a non-runtime environment, as it does not require any execution.

The work in [9] introduced a malware detection technique based on image processing, in which binary applications are converted into grayscale images. The generated images share identical patterns because of the structural distribution of the executable files. The paper used the K-Nearest Neighbour ML algorithm to classify the malware images. Similar approaches include image visualization and classification using machine learning algorithms such as SVM. However, these approaches do not address the problem of classifying newer, more complex malware. Neural networks such as artificial neural networks (ANNs) are used extensively for classification, prediction, filtering, optimization, pattern recognition, and function approximation [19], as neurons can capture the features of the images more accurately than other machine learning algorithms. However, the fully connected layers of an ANN tend to exhaust computational resources. In [10, 20, 21], the authors used convolutional neural networks (CNNs) because of their ability to handle image data efficiently: convolutional 2D layers extract features and max-pooling 2D layers downsample the input parameters. CNNs thus serve as an efficient image classification algorithm with lower resource consumption.
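As an illustration of the binary-to-image conversion used by the image-based approaches above, the following minimal sketch maps each byte of an executable to one pixel intensity. The function name `bytes_to_grayscale`, the fixed row width, and the zero padding of the last row are assumptions made for illustration; they are not taken from [9].

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 32) -> list[list[int]]:
    """Map each byte (0-255) to one pixel intensity, row by row.

    Hypothetical helper: the row width and zero padding are illustrative choices.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so that the last row is complete.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Ten bytes laid out in rows of four pixels give a 3 x 4 image.
image = bytes_to_grayscale(bytes(range(10)), width=4)
print(image)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```

Because a byte already lies in the range 0 to 255, no rescaling is needed; only the image width has to be chosen.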

II-A2 Dynamic Analysis based Malware Detection

Dynamic analysis is a malware detection technique performed in a secured runtime environment, such as a sandbox. It is a functionality test in which the binary files are executed to detect malware functionality. Malware detection using dynamic analysis is based on detecting system calls or HPCs [8]. Dynamic analysis is much more effective than static analysis at detecting malware. However, it needs a large amount of resources and is time-consuming, so it is hard to carry out on edge devices. Furthermore, malware developers implement code obfuscation, metamorphism, and polymorphism [11] to mutate malware binary executables. A newer strategy for masking malware's identity is stealthy malware [17], where malware is incorporated into benign applications using random obfuscation techniques. In such cases, dynamic malware analysis produces poor estimates, so there is a need to train these dynamic models with reliable features.

In the past, many researchers have leveraged architectural and application features for malware analysis and detection [22]. In [23], Bilar et al. used the difference in opcodes between known malware and benign applications as a key to predicting malware. However, these techniques require a considerable amount of work to model each program based on its instructions. Since code size keeps increasing, modeling programs based on opcodes becomes a time-consuming and computationally expensive task. Demme et al. [24] proposed the use of hardware performance counters (HPCs) to monitor lower-level micro-architectural parameters such as branch misses, instructions per cycle, and cache miss rate. HPCs can provide comprehensive access to interior performance information with much lower overhead than other methods. In works such as [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41], machine learning models like Random Forest, SVM, and Logistic Regression are applied to HPC values obtained at execution to classify benign and malware classes. In [14], the authors introduce StealthMiner, a novel stealthy malware detection model using time series prediction. They build a Fully Convolutional Neural Network (FCN) on HPC run-time branch instruction features to detect stealthy malware traces.

II-B Distributed Learning

Deep learning has achieved milestones and has valuable applications in cybersecurity, malware detection, and other domains. Training deep learning models requires a huge amount of time, especially given the massive amount of data that needs to be processed. In addition, scaling up a neural network architecture may result in a network with many parameters, leading to high execution time while training the model. Fortunately, these bottlenecks can be addressed through parallelization paradigms. Parallelizing tasks in deep learning models is one of the best approaches for acceleration: it speeds up the algorithm by minimizing the execution time, allowing complex tasks to be processed with fewer computational resources [42].

Two types of distributed learning techniques are available: data parallelism and model parallelism. In data parallelism [43], each node holds a copy of the whole ML model to be trained, but each node is given a different mini-batch of data. After training, the results are collected and combined into an updated model. Though it reduces complexity and inter-node communication, data parallelism suffers from high memory utilization. Model parallelism [44] is a technique in which each node has the same data but the ML model is divided. Each node contains only a single layer of the neural network to be trained, and node-to-node communication is used for weight sharing and back-propagation. Model parallelism is suitable for training a massive ML model when resources are limited.
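The layer-per-node idea behind model parallelism can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [44]: the `Node` class, the tiny dense layers, and their weights are hypothetical, and only the forward pass is shown (back-propagation would send gradients along the same path in reverse).

```python
def dense(weights, bias, x):
    """One fully connected layer with ReLU: y = max(0, W x + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


class Node:
    """Hypothetical IoT node that stores and executes only its own layer."""

    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def forward(self, activation):
        return dense(self.weights, self.bias, activation)


def pipeline_forward(nodes, x):
    """Pass the activation from node to node; no node holds the whole model."""
    for node in nodes:
        x = node.forward(x)
    return x


# Illustrative two-layer model, one layer per node.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),  # layer 1: 2 inputs -> 2 outputs
    ([[2.0, 1.0]], [-1.0]),                   # layer 2: 2 inputs -> 1 output
]
nodes = [Node(w, b) for w, b in layers]
output = pipeline_forward(nodes, [3.0, 1.0])
print(output)  # [6.0]
```

Each node needs memory only for its own weights, at the cost of one activation transfer per layer boundary.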

In [45], the authors propose linear-algebra-based model parallelism for deep learning networks. This framework allows the parallel distribution of any tensor in the DNN. Model parallelism is also widely used in natural language processing. In [46], the authors train a multi-billion-parameter transformer language model; with the help of multiple GPU nodes and pipeline structures, they could train such a gigantic model and achieve state-of-the-art speedups. In [47], the authors build a 3-dimensional distributed model to accelerate the training of language models. They use a 3D model to complement matrix multiplication and vector operations in transformer models. To the best of our knowledge, this is the first work that employs model parallelism for the purpose of malware detection.

II-C Few-Shot Learning

With a consistent increase in malware applications each year [3], there is a constant need to update the ML models involved in malware detection algorithms. However, complex data is hard to obtain, and continuous data collection for different cases is difficult. Machine learning and deep learning models need to be updated with each new type of training sample to generalize well. Because of this, the efficiency of machine learning models for malware detection is often debated. There is therefore a need to build an efficient malware detection model that learns from only a few samples and does not need constant updating. Few-shot learning is a supervised learning technique that aims to learn different class concepts using a few samples, and it could improve ML models for which complex data is scarce.

Data augmentation techniques are important frameworks for few-shot learning, as they improve the feature extraction capability of few-shot learning algorithms. Models such as Generative Adversarial Networks (GAN) [48], Variational Autoencoders (VAE) [49], and Mixture Density Networks (MDN) [50] can generate high-quality samples. GANs produce new samples by minimizing the loss on the generated samples, and MDNs produce highly probable samples with the help of Gaussian mixture models. VAEs, with their encoder-decoder architecture, are said to reconstruct input data efficiently. Works such as [51, 52, 38] use techniques such as reflection, translation, and augmentation to generate new samples for training. The work in [53] used a memory augmentation technique for few-shot learning.

III Motivation and Problem Formulation

With technology advancements, attackers are introducing complex hidden malware, by sneaking them into general applications. This is mathematically represented in Equation (1). Even advanced anti-malware software fails to detect these advanced malware families [11].

	$\displaystyle\mathbb{IOT}_{devices}\leftarrow(B\oplus M)$		(1)

As represented in Equation (1), B represents benign and M represents the malware executables for IoT devices. The target for the malware is the IoT devices, represented as $\mathbb{IOT}_{devices}$ . One can represent the problem of malware detection on IoT devices as follows:

	$\displaystyle\mathbb{C}(D^{n}):{X}\rightarrow{Y}$
	s.t.	$\displaystyle\mathfrak{M}[\mathbb{C}]<\mathfrak{M}[node]$		(2)

Figure 1: (a) Distributed IoT device framework, (b) HPC and Binary data pre-processing to extract input image dataset and generating additional synthetic samples with Code-Aware Data Generation technique using GANs, (c) Framework to identify the resources in the malware detection model using a lightweight linear regressor

As shown in Equation (2), $\mathbb{C}$ is a pre-trained classifier trained with dataset $D^{n}$ to perform malware detection. The dataset $D^{n}$ contains a combination of malware $M$ and benign $B$ samples. Because it is pre-trained, the classifier $\mathbb{C}$ incurs no training overhead and can be used directly for inference. The model detects whether there is malware in any sample $X$ and maps it to either the malware class $M$ or the benign class $B$. The output class is represented as $Y$. The memory required to perform inference, represented as $\mathfrak{M}[\mathbb{C}]$, should be less than the available resources in an IoT node, represented as $\mathfrak{M}[node]$. If the constraint in Equation (2) is not met, the inference task cannot be carried out by the device. In addition, producing an effective ML model requires enough training samples $D$. Given the need for both sufficient training data and memory, the problem of implementing malware detection in IoT devices can be defined as a dual optimization problem.

	maximize		$\displaystyle\sum_{i}\sum_{j}D_{ij}\mathfrak{M}_{ij}$
	s.t.		$\displaystyle\sum_{{j\in\mathcal{P}}}d_{xj}=1\quad\forall i=d,\mathfrak{m}$		(3)

Equation (3) describes the problem of optimizing training data and the available resources such as memory. Our proposed technique solves this by introducing a novel resource-aware malware detection model that off-loads the inference workload to neighboring nodes. We also introduce a code-aware data generation technique to increase the number of training samples, thus addressing the problems of limited memory and limited training data in IoT devices.

IV Proposed Resource- and Workload-aware Malware Detection

IV-A Overview of the Proposed Technique

An overview of the proposed technique is shown in Figure 1, which presents the computations happening at the node level. Figure 1(a) represents the IoT devices present in a network. The proposed technique starts with data collection at the IoT device, in which popular malware and benign application files are collected. Figure 1(b) describes the data collection process. HPC traces are considered as input for the proposed technique to improve the reliability of malware detection. Along with the HPC data, the benign binary samples used in IoT devices and the malware binary samples that affect IoT devices are collected. The HPC data and binary files are converted to grayscale images. To increase the amount of training data, synthetic data is generated using the code-aware data generation technique. These image samples are fed as input to machine learning algorithms such as CNNs for effective malware detection. As shown in Figure 1(c), a lightweight regression model automatically estimates the resources needed to perform malware detection. Depending on the resource availability, the workload of an IoT node, and the communication overhead, the malware detection task is either performed on-device or off-loaded to neighboring nodes with sufficient resources, as shown in Figure 1. The $MP$ block in Figure 1 represents the model parallelism task.
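The on-device versus off-load decision described above can be sketched as a simple cost comparison. This is a simplified illustration under stated assumptions, not the paper's algorithm: the `Neighbor` record, the `place_detection` function, and all numbers are hypothetical; the memory and time estimates are taken as given (in the proposed technique they would come from the lightweight regressor); and the task is placed on a single node, whereas the proposed technique divides the classifier across several nodes.

```python
from dataclasses import dataclass


@dataclass
class Neighbor:
    """Hypothetical description of a neighboring IoT node."""
    name: str
    free_memory: int      # bytes currently available for detection
    compute_time: float   # estimated seconds to run the detection task
    transfer_time: float  # estimated seconds of communication overhead


def place_detection(required_memory, local_free_memory, local_time, neighbors):
    """Return 'on-device', the name of the cheapest feasible neighbor, or None."""
    options = []
    # The memory constraint of Equation (2) decides feasibility on each node.
    if required_memory < local_free_memory:
        options.append((local_time, "on-device"))
    for n in neighbors:
        if required_memory < n.free_memory:
            options.append((n.compute_time + n.transfer_time, n.name))
    if not options:
        return None  # no single node can host the task
    return min(options)[1]


neighbors = [
    Neighbor("node-a", free_memory=64, compute_time=0.2, transfer_time=0.5),
    Neighbor("node-b", free_memory=16, compute_time=0.1, transfer_time=0.1),
]
print(place_detection(32, 16, 1.0, neighbors))   # node-a
print(place_detection(8, 16, 0.3, neighbors))    # node-b
print(place_detection(128, 16, 1.0, neighbors))  # None
```

Adding the transfer time to the remote compute time is what keeps a fast but poorly connected neighbor from being chosen over local execution.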

IV-B Pre-processing and Data Collection

IV-B1 Generation of HPC-based Grayscale Images

To address the reliability concerns that existing techniques leave open, we propose fine-tuning the state-of-the-art model-specific registers (MSRs) available in modern computing system architectures, which are the source of the HPC information. First, to solve the non-determinism challenge in HPCs, we redesign the HPC capturing protocols with proper context switching and handling of the performance monitoring interrupt (PMI) units in the system while collecting HPCs. To obtain the HPCs solely for a given application, context switching needs to be accommodated, thereby eliminating contamination of the obtained HPCs. From our preliminary analysis, the latency overhead of context switching for MiBench applications is around 3% of the average application runtime, which is affordable for enhanced security. Further, PMIs help ensure proper context switching and reading of HPCs: configuring a PMI per process often leads to better capturing of the HPCs [15, 54, 55, 56]. Through this two-pronged use of context switching and PMI, we collect reliable HPCs. To address challenges such as over-counting [15], we perform calibration through testing.

We also require the microarchitectural event traces captured through HPCs for malware detection. One challenge is that only a limited number of on-chip HPCs can be read at a given time instance, whereas executing an application generates a few tens of microarchitectural events. Thus, to perform real-time malware detection, one needs to determine the non-trivial microarchitectural events that can be captured through the limited number of HPCs and still yield high detection performance. To achieve this, we use principal component analysis (PCA) for feature/event reduction on all the microarchitectural event traces captured offline by iteratively executing the application. Based on the PCA, we determine the most prominent events and monitor them during runtime. The ranking of the events is determined as follows:

	$\displaystyle\rho_{i}=\frac{cov(App_{i},Z_{i})}{\sqrt{var(App_{i})\times var(Z_{i})}}$		(4)

where $\rho_{i}$ is the Pearson correlation coefficient of the $i^{th}$ application, $App_{i}$ is the $i^{th}$ incoming application, and $Z_{i}$ is the output data, which contains the different classes (backdoor, rootkit, trojan, virus, and worm in our case). $cov(App_{i},Z_{i})$ measures the covariance between input and output, while $var(App_{i})$ and $var(Z_{i})$ measure the variance of the input and output data, respectively. Based on the ranking, we select the most prominent HPCs and monitor them during runtime for efficient malware detection. These reduced features collected at runtime are provided as input to ML classifiers, which determine the malware class label (${\widehat{Y}}$ $\Rightarrow$ Backdoor, Rootkit, Trojan, Virus, and Worm) with higher confidence.
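The ranking step of Equation (4) can be sketched as follows. This is a minimal illustration under stated assumptions: the event names, trace values, and numeric labels are made up, each event trace plays the role of the input in Equation (4), and events are ranked by the absolute value of the coefficient so that strong negative correlations also count.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient, cov(x, y) / sqrt(var(x) * var(y))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # The 1/n factors of covariance and variance cancel, so sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no ranking information
    return cov / math.sqrt(var_x * var_y)


def rank_events(event_traces, labels, top_k):
    """Return the top_k event names ordered by |correlation| with the labels."""
    scores = {name: abs(pearson(values, labels))
              for name, values in event_traces.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical traces: one value per executed sample, label 1 = malware.
labels = [0, 0, 1, 1]
event_traces = {
    "branch_misses": [1, 2, 8, 9],
    "cache_misses": [5, 5, 5, 5],
    "instructions": [4, 3, 2, 1],
}
selected = rank_events(event_traces, labels, top_k=2)
print(selected)  # ['branch_misses', 'instructions']
```

Only the `top_k` selected events would then be programmed into the limited set of on-chip counters at runtime.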

To the best of our knowledge, this is the first work that captures the behavior of dynamic HPC attributes (values) and represents them as grayscale images. We execute malware and benign applications in a sandbox environment and capture a range of HPC values (e.g., for 20 ns, 40 ns) using the Quick HPC tool. Capturing the range of HPC values for a particular executable (benign or malware) reveals the trend of variation in the HPC values for benign and malware samples. Hence, each executable file yields a unique pattern in its grayscale image. It should be noted, however, that the grayscale images of the same malware class tend to show similar texture in some portions of the image. An advantage of this technique is that the malware payload triggered by stealthy and code-obfuscated malware can be identified and classified from HPC-based grayscale images, because the grayscale texture of the triggered malware tends to match one of the malicious patterns in the generated training data.
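One plausible way to turn a captured HPC trace into a grayscale image is sketched below. The text does not specify the exact mapping, so the min-max scaling to the 0-255 range, the row-by-row layout, the zero padding, and the function name `hpc_to_grayscale` are all assumptions made for illustration.

```python
def hpc_to_grayscale(samples, width):
    """Min-max scale HPC readings to 0-255 and lay them out row by row.

    Hypothetical mapping: the scaling and layout are illustrative choices.
    """
    if not samples:
        return []
    lo, hi = min(samples), max(samples)
    span = hi - lo
    # A constant trace has no contrast, so it maps to black pixels.
    pixels = [0 if span == 0 else round(255 * (v - lo) / span) for v in samples]
    # Zero-pad so that the last row is complete.
    pixels += [0] * (-len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Four counter readings sampled over time, arranged three pixels per row.
hpc_image = hpc_to_grayscale([10, 20, 30, 40], width=3)
print(hpc_image)  # [[0, 85, 170], [255, 0, 0]]
```

With such a mapping, the trend of a counter over time appears as a brightness gradient, which is the kind of texture a CNN can learn.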

TABLE I: Parameter Estimation for Each Layer of a CNN

Layer	Description	Parameters
Input	No learnable parameters	0
CONV	((width of filter * height of filter * No. of filters in previous layer) + 1) * No. of filters in current layer	$f_{conv}=((w\cdot h\cdot p)+1)\cdot c$
POOL	No learnable parameters	0
FC	(current layer neurons * previous layer neurons) + (1 * current layer neurons)	$f_{FC}=(n_{c}\cdot n_{p})+(1\cdot n_{c})$
Softmax	(current layer neurons * previous layer neurons) + (1 * current layer neurons)	$f_{S}=(n_{c}\cdot n_{p})+(1\cdot n_{c})$
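The formulas in Table I can be applied layer by layer to estimate the size of a model. The sketch below does this for a small CNN whose architecture (a 32 x 32 x 1 input, two 3 x 3 convolutions with 16 and 32 filters, two 2 x 2 poolings with size-preserving convolutions, a 64-neuron FC layer, and a 2-class softmax) is hypothetical and chosen only to make the arithmetic concrete.

```python
def conv_params(filter_width, filter_height, prev_filters, cur_filters):
    """f_conv = ((w * h * p) + 1) * c, the +1 being the bias of each filter."""
    return (filter_width * filter_height * prev_filters + 1) * cur_filters


def fc_params(cur_neurons, prev_neurons):
    """f_FC = (n_c * n_p) + (1 * n_c); the softmax layer uses the same formula."""
    return cur_neurons * prev_neurons + cur_neurons


# Hypothetical CNN; input and pooling layers contribute no parameters.
layer_params = {
    "conv1": conv_params(3, 3, 1, 16),      # 160
    "conv2": conv_params(3, 3, 16, 32),     # 4640
    "fc": fc_params(64, 8 * 8 * 32),        # 131136 (8 x 8 x 32 flattened input)
    "softmax": fc_params(2, 64),            # 130
}
total_params = sum(layer_params.values())
print(total_params)  # 136066
```

In this example the fully connected layer dominates the total, which is consistent with the earlier observation that fully connected layers tend to exhaust computational resources.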

IV-B2 Code-Aware Data Generation

Code-aware data generation is a novel data augmentation technique for generating reliable synthetic data. This synthetic data helps in feature extraction from limited samples. The data generation is done using generative adversarial networks (GANs). It is code-aware because the GANs are trained with images constructed from binary code files, so the feature extraction carried out in the GANs can be interpreted as capturing malware code patterns. As a result, the ML models do not need to be retrained when the test data varies. Obfuscated and morphic malware samples, which contain hidden malware code blocks, can be detected easily because the GAN is trained to capture these hidden patterns. In the case of HPC samples, grayscale images are constructed based on the functional attributes, and GANs are trained with the dynamic HPC grayscale images to generate augmented HPC samples. The generated images are loss-controlled, which makes them effective in capturing the features of the limited available data. A GAN consists of two parts: a generator and a discriminator. The generator takes a random uniform distribution as a reference to generate new data points. Based on this uniform distribution and the input data, the generator tries to generate a correlated sample. This generated sample augments the real image with the help of the uniform distribution, so that the feature extraction rate improves when it is given to an ML model. The discriminator block of the GAN tries to classify the generated image as real or fake.

The generator and discriminator are loss-controlled, so that the generator can generate realistic images that are as close as possible to the real images, while the discriminator is trained to reject the fake images. This feedback helps the generator learn and improve its ability to generate data, and the discriminator learns to classify the images better.
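The text does not state which loss functions control the two networks. A common choice, assumed here purely for illustration, is binary cross-entropy with the non-saturating generator loss; the sketch below computes both losses from the discriminator's output probabilities, and all function names are hypothetical.

```python
import math


def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one probability, clipped for numerical safety."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(d_real, d_fake):
    """Real images should score 1 and generated images should score 0."""
    real = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real + fake


def generator_loss(d_fake):
    """Non-saturating loss: the generator wants its images to score 1."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)


# Discriminator probabilities for a batch of real and generated images.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident < undecided)                          # True
print(generator_loss([0.9]) < generator_loss([0.1]))  # True
```

The two losses pull in opposite directions on the same discriminator outputs, which is what drives the generator toward images the discriminator cannot tell apart from real ones.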

Algorithm 1 Code-Aware Data Generation Algorithm

1: Input: $D_{l}$ (Dataset with limited data version), $B$ (Benign grayscale images), $M$ (Malware grayscale images), $M_{O}$ (Random obfuscated malware), $M_{ST}$ (Generated Stealthy malware), $D_{l}=\{B+M+M_{O}+M_{ST}\}$
2: define CAD_generator(X):
3: for $X\leftarrow D_{l}$ do
4:     for $epoch\leftarrow range(1000)$ do
5:         $G\_model=define\_generator()$
6:         $D\_model=define\_discriminator()$
7:         $noise\leftarrow vector(256,None)$
8:         $X\_{fake}\leftarrow G\_model(noise)$
9:     end for
10:     $D_{mu}\leftarrow G\_model\cdot predict(vector)$
11: end for
12: return $D_{mu}$
13: Output: $D_{mu}$ (Generated dataset with mutated samples)

Algorithm 1 takes the limited dataset $D_{l}$ as input. For each class in the dataset, the CAD_generator(X) function trains a generator and a discriminator. We train the GAN for 1000 epochs (Line 4), which is enough to minimize the loss and generate images similar to the training data. As shown in the algorithm (Line 5 - Line 6), the generator model is denoted $G\_model$ and the discriminator model is denoted $D\_model$. Both are convolutional neural networks, where $G\_model$ is trained to generate an image from a latent-space input. As shown in Algorithm 1 (Line 7 - Line 9), when latent noise of size $(256,None)$ produced by the function $vector()$ is given as input, $G\_model$ generates an image of size $(32,32)$. The $D\_model$ tries to classify the generated fake image $X\_{fake}$. A loss function is computed for both $D\_model$ and $G\_model$. To decrease the loss, the generator learns to generate better fake images $X\_{fake}$, and the discriminator keeps learning to classify them. After 1000 epochs, the generator has learned enough to generate realistic fake images. Vectors of latent spaces are then created and passed to the $model\cdot predict()$ function to generate mutated data, which forms the dataset $D_{mu}$ (Line 12).
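The training loop of Algorithm 1 can be illustrated with a deliberately tiny stand-in. The sketch below is not the convolutional GAN used in this work: it assumes one-dimensional "samples", a linear generator $G(z)=az+b$, a logistic discriminator, and the common non-saturating generator loss. The function name `cad_generator_toy` and all hyperparameters are hypothetical. Unlike the literal pseudocode, the two models are created once before the epoch loop so that learning accumulates across epochs.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def cad_generator_toy(real_samples, epochs=1000, lr=0.01, n_out=16, seed=0):
    """Toy 1-D GAN: returns n_out synthetic values after adversarial training."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters, G(z) = a*z + b
    w, c = 0.0, 0.0  # discriminator parameters, D(x) = sigmoid(w*x + c)
    for _ in range(epochs):
        for x_real in real_samples:
            z = rng.uniform(-1.0, 1.0)      # latent noise from a uniform distribution
            x_fake = a * z + b              # generated (fake) sample
            # Discriminator step on loss -log D(x_real) - log(1 - D(x_fake)).
            s_real = sigmoid(w * x_real + c)
            s_fake = sigmoid(w * x_fake + c)
            w -= lr * (-(1.0 - s_real) * x_real + s_fake * x_fake)
            c -= lr * (-(1.0 - s_real) + s_fake)
            # Generator step on the non-saturating loss -log D(G(z)).
            s_fake = sigmoid(w * x_fake + c)
            grad_x = -(1.0 - s_fake) * w    # d(loss)/d(x_fake)
            a -= lr * grad_x * z
            b -= lr * grad_x
    # After training, fresh latent vectors are mapped to synthetic samples.
    return [a * rng.uniform(-1.0, 1.0) + b for _ in range(n_out)]


if __name__ == "__main__":
    synthetic = cad_generator_toy([3.8, 4.1, 4.0, 4.3], epochs=200)
    print(len(synthetic), "synthetic samples generated")
```

The structure mirrors the algorithm: sample noise, generate a fake sample, update the discriminator, update the generator, and finally call the trained generator on new latent vectors. Toy GANs of this kind can oscillate, so the sketch makes no claim about how closely the outputs match the real samples.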

$$D_{mu}(X)\sim D_{w}(X)\tag{5}$$

The samples in the generated synthetic dataset $D_{mu}(X)$ are highly correlated with the real samples $D_{w}(X)$. A few shots of real samples $D_{w}(X)$ are used together with the generated synthetic data $D_{mu}(X)$ to train a CNN classifier for malware detection.

After data generation, a dataset is built from a few shots of real data, and a CNN model is trained on it. The augmented data produced by the code-aware data generation technique supports few-shot training of the CNN model and improves its performance. The CNN model is trained offline, and the IoT devices use it as a pre-trained model for inference. Because training happens offline, training the few-shot learning-based CNN model does not incur any memory overhead on the IoT devices.

IV-C Automatic Resource Estimation

Execution, inference, and training of CNNs and DNNs for malware detection and other applications often require a significant amount of resources. Deploying them on a single IoT device is not always feasible because of the limited resources available. Furthermore, applications that run in parallel on IoT devices, such as sensing and other computations, reduce the resources available for CNN/DNN execution. Thus, the amount of resources available on each node changes with its workload. Instead of manually calculating the parameters of the CNN and estimating each time whether the available resources on a node are sufficient, a regression model is developed in this work. The binary regression model is trained using data such as the CNN's parameters, the memory requirements of these parameters, and the available memory at each node. As output, the binary regression model estimates whether the CNN model inference can be performed on a single node or must be distributed across multiple nodes. The rationale for adopting binary regression is its low overhead and complexity along with its higher efficiency.
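A minimal sketch of such a binary decision model is given below. It is a simplification under stated assumptions: the regressor is taken to be a logistic model, it uses a single derived feature (node memory minus model memory, in MB) instead of the full feature set, and the training points are made up for illustration. The function names are hypothetical.

```python
import math


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_binary_regressor(headroom_mb, needs_split, lr=0.1, epochs=500):
    """Fit p(split) = sigmoid(w * headroom + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(headroom_mb)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(headroom_mb, needs_split):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x / n   # gradient of the mean log-loss w.r.t. w
            grad_b += (p - y) / n       # gradient of the mean log-loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def must_distribute(mem_model_mb, mem_node_mb, w, b):
    """True if the model should be split across nodes, False if one node suffices."""
    return sigmoid(w * (mem_node_mb - mem_model_mb) + b) >= 0.5


if __name__ == "__main__":
    # Hypothetical training data: negative headroom means the node is too small.
    headroom = [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    w, b = train_binary_regressor(headroom, labels)
    print(must_distribute(5.0, 8.0, w, b))  # node has 3 MB of headroom
    print(must_distribute(5.0, 2.0, w, b))  # node is 3 MB short
```

Once trained, the decision is a single multiply-add and one comparison per query, which is the kind of low runtime overhead the approach relies on.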

Algorithm 2 illustrates how the binary regression model is constructed. The training features of the regressor are the parameters of the CNN, so the parameters of each layer are calculated first. For each layer of the CNN $\mathbb{C}$ (Line 3 - Line 5), the weight matrix $W$, bias $B$, and activation $A$ are collected and stored in the variable $var$. These variables contribute to the parameter calculations for the different layers of the CNN (Line 6 - Line 11). As shown in Table I, the input layer and the pooling layer, represented as $POOL$, do not have any learnable parameters, so the parameters $par$ are zero for these two layers. For the convolutional layer $CONV$, the fully connected layer $FC$, and the softmax layer $Softmax$, the parameters are calculated using the equations shown in Table I.

Algorithm 2 Lightweight Linear Regression Algorithm

1: Require: $B_{exe}$ (Benign application files), $M_{exe}$ (Malware application files)
2: Input: $\mathfrak{Mem}[node],\mathbb{C}$
3: define $Regressor(\mathbb{C})$
4: for $layer\leftarrow\mathbb{C}$ do
5:   $var\leftarrow f(W,B,A)$
6:   if $layer\rightarrow CONV$
7:     $par=f_{conv}(var)$
8:   elif $layer\rightarrow(FC\vee Softmax)$
9:     $par=f_{FC}(var)$
10:  else
11:    $par=0$
12:  end if
13:  $\bar{P}.append(par)$
14: end for
15: $\mathfrak{Mem}[model]\leftarrow N*batch\_size*\bar{P}*1KB$
16: $X_{R}.features\leftarrow\{W,A,B,\bar{P},\mathfrak{Mem}[model],$
17:   $\mathfrak{Mem}[node]\}$
18: $Res\leftarrow\mathbb{R}:(X_{R},\beta)$
19: return $Res$

If there are multiple convolutional and pooling layers, the parameters are calculated multiple times with different activation functions $A$. Finally, the estimated parameters of each layer are appended to give $\bar{P}$ (Line 13 - Line 18). Then, the memory for the model, $\mathfrak{Mem}[model]$, is calculated as the product of the parameters $\bar{P}$, the batch size, and the number of batches $N$. It is assumed that each parameter needs one kilobyte (1KB) for inference, so the final memory required is in the range of megabytes ( $\sim 5MB$ ). $X_{R}.features$ represents the features given as input to the regressor $\mathbb{R}$, which predicts the resource estimation $Res$. The features in $X_{R}.features$ include the weight matrix $W$, bias $B$, activation $A$, the parameters of the CNN at each layer $\bar{P}$, the memory estimate of the model $\mathfrak{Mem}[model]$, and the memory available at each node $\mathfrak{Mem}[node]$. The resource estimation predicts whether the inference can be performed on a single node or must be performed using model parallelism.
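The per-layer counting and the memory estimate can be sketched as follows. Because Table I is not reproduced in this passage, the sketch assumes the standard parameter counts, $(k_{h}k_{w}c_{in}+1)\,c_{out}$ for a convolutional layer and $(n_{in}+1)\,n_{out}$ for a fully connected or softmax layer, and it interprets $\bar{P}$ in Line 15 as the sum of the per-layer counts. The layer dictionary format and the example network are hypothetical.

```python
def layer_params(layer):
    """Learnable parameters of one layer (weights plus biases)."""
    kind = layer["kind"]
    if kind == "CONV":
        return (layer["kh"] * layer["kw"] * layer["c_in"] + 1) * layer["c_out"]
    if kind in ("FC", "Softmax"):
        return (layer["n_in"] + 1) * layer["n_out"]
    return 0  # INPUT and POOL layers have no learnable parameters


def model_memory_kb(layers, n_batches, batch_size, kb_per_param=1):
    """Return the per-layer counts and Mem[model] = N * batch_size * P * 1KB."""
    per_layer = [layer_params(layer) for layer in layers]
    total_kb = n_batches * batch_size * sum(per_layer) * kb_per_param
    return per_layer, total_kb


if __name__ == "__main__":
    # Hypothetical small CNN, chosen only to land near the MB range quoted in the text.
    cnn = [
        {"kind": "INPUT"},
        {"kind": "CONV", "kh": 3, "kw": 3, "c_in": 1, "c_out": 8},
        {"kind": "POOL"},
        {"kind": "FC", "n_in": 128, "n_out": 2},
    ]
    counts, mem_kb = model_memory_kb(cnn, n_batches=3, batch_size=5)
    print(counts, mem_kb, "KB")
```

For this made-up network the counts are 0, 80, 0, and 258 parameters, and with three batches of five samples the estimate is 5070 KB.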

IV-D Workload- and Resource-Aware Malware Detection

We develop a few-shot learning-based convolutional neural network trained on malware and benign samples. The inference task using this pre-trained malware detection model is partitioned and executed on different devices [44]. We also ensure that child devices have no access to the complete information. The partitioning is based on the independence of the nodes of the ML classifier, represented as a graph, and on the workload that can be accommodated on the parent and child devices [57]. We provide an upper bound on the number of devices to which the task can be distributed. Because the ML architecture is defined at design time, the model parallelism and model splitting do not add overhead at runtime. The overhead of determining whether distributed ML is needed is minimal because the computations involved have low complexity.

Given that the model is distributed on multiple IoT devices as shown in Figure 2, accumulating the gradients from the child nodes is a challenging task [44]. Techniques such as DistBelief [58] are highly dependent on the partitioning of the model. They can therefore lead to varied performance in our case and are not adaptable. We adopt the AllReduce [59] paradigm in this work, where the parent node accumulates the gradients from the child nodes. To update the gradients and perform other computations, including inference, a synchronous AllReduce approach is used for better scalability [59]. However, a direct adaptation of such a method is vulnerable to faults: the unavailability of one device, or garbage data from it, can stall or contaminate the whole process. To address such concerns, we deploy Downpour stochastic gradient descent (SGD) [60]. Downpour SGD is more resilient to machine failures and data manipulation, as it allows the training and inference to continue even if some model replicas are offline. Note that the training happens offline and the inference is performed in real time. To minimize the communication overhead, the parent device chooses child devices within a threshold radius $R$, for which the communication costs are lower, and ensures that the chosen devices have a smaller workload to process. Because frequent communication between parent and child nodes leads to large overheads, the system communicates only when one device's output is required as input for another device.
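The child-selection rule can be sketched as a filter followed by a sort. The sketch assumes that Euclidean distance to the parent is an acceptable proxy for communication cost and that each candidate reports a scalar workload. The record fields (`name`, `x`, `y`, `workload`) and the function name are hypothetical.

```python
import math


def select_children(parent_xy, candidates, radius, max_children):
    """Pick up to max_children nodes within radius R, least-loaded first."""
    px, py = parent_xy
    # Keep only candidates inside the threshold radius R around the parent.
    in_range = [
        c for c in candidates
        if math.hypot(c["x"] - px, c["y"] - py) <= radius
    ]
    # Prefer the nodes with the smallest current workload.
    in_range.sort(key=lambda c: c["workload"])
    return [c["name"] for c in in_range[:max_children]]


if __name__ == "__main__":
    nodes = [
        {"name": "C1", "x": 1.0, "y": 0.0, "workload": 0.2},
        {"name": "C2", "x": 0.0, "y": 2.0, "workload": 0.1},
        {"name": "C3", "x": 5.0, "y": 5.0, "workload": 0.0},  # idle but out of range
        {"name": "C4", "x": 1.0, "y": 1.0, "workload": 0.9},
    ]
    print(select_children((0.0, 0.0), nodes, radius=3.0, max_children=2))
```

The `max_children` argument plays the role of the upper bound on the number of devices mentioned earlier in this section.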

Algorithm 3 Pseudo-Code for Distributed Runtime Modelling of Malware Detection

1: Require: $M$ (Malware grayscale images), $B$ (Generated Benign grayscale images)
2: Input: $D^{n}=\{B+M\},\mathfrak{Mem}[model]$
3: define $Distribute\_CNN\_model()$
4: for $n\leftarrow range(0,x)$ do
5:   if $\mathfrak{Mem}[model]\leq\mathfrak{Mem}[node]$
6:     $node.append(n)$
7:     $\mathfrak{Mem}[node]\leftarrow\mathfrak{Mem}[0]+\mathfrak{Mem}[1]+...+\mathfrak{Mem}[n]$
8:   end if
9: end for
10: $l_{1}=nn.layer1.cuda(0)$
11: $l_{2}=nn.layer2.cuda(1)$
12: $\cdots$
13: $\cdots$
14: $l_{n}=nn.layern.cuda(n)$
15: $model=nn.Sequential(l_{1},l_{2},...,l_{n})$
16: $input=D^{n}.cuda(0)$
17: $output=model(input)$
18: return $O_{m}$

Algorithm 3 presents the pseudo-code for the proposed distributed runtime modeling of malware detection. The function that distributes the CNN, $Distribute\_CNN\_model()$, is called based on the output of the regressor. It takes the memory estimate $\mathfrak{Mem}[model]$ of the malware detection CNN as input and compares it with the memory available at each node, $\mathfrak{Mem}[node]$. It accumulates the memory of successive nodes to find the number of nodes required to distribute the model. The $n$ selected nodes must have a combined memory greater than or equal to the model memory $\mathfrak{Mem}[model]$ (Line 3 - Line 5). If this condition is met, the CNN is distributed over $n$ nodes. The layers of the malware detection model, $l_{1},l_{2},\cdots,l_{n}$ (Line 10 - Line 14), are divided across the $n$ nodes and trained. The input data is made available to the input layers by passing it to $node0$. Communication between the nodes is handled through the interdependent variables and the backward-pass algorithm by the function $model=nn.Sequential(l_{1},l_{2},...,l_{n})$ (Line 15 - Line 17).
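The node-counting step and the layer placement can be sketched without any deep learning framework. The sketch below accumulates node memory until the combined total covers the model, then assigns contiguous groups of layers to the chosen nodes. The equal-sized contiguous split is an assumption made for brevity (the partitioning in this work also accounts for layer independence and workload), and the function names are hypothetical.

```python
def select_nodes(mem_model, node_mem, max_nodes=None):
    """Return indices of the first nodes whose combined memory covers mem_model."""
    chosen, combined = [], 0.0
    for idx, mem in enumerate(node_mem):
        if combined >= mem_model:
            break  # enough memory has been collected
        if max_nodes is not None and len(chosen) == max_nodes:
            break  # respect the upper bound on the number of devices
        chosen.append(idx)
        combined += mem
    if combined < mem_model:
        raise ValueError("combined node memory is smaller than the model memory")
    return chosen


def partition_layers(layers, n_nodes):
    """Split the layer list into n_nodes contiguous, near-equal groups."""
    base, extra = divmod(len(layers), n_nodes)
    parts, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        parts.append(layers[start:start + size])
        start += size
    return parts


if __name__ == "__main__":
    nodes = select_nodes(mem_model=5.0, node_mem=[2.0, 2.0, 2.0, 2.0])
    stages = partition_layers(["conv1", "pool1", "conv2", "pool2", "fc"], len(nodes))
    for node, stage in zip(nodes, stages):
        print("node", node, "runs", stage)
```

Each node only receives the activations of the stage before it, which is consistent with the requirement that child devices do not see the complete model or the raw input.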

V Results

V-A Experimental Setup

For the IoT network setup, we deployed 20 IoT nodes based on Broadcom BCM2711 boards with a quad-core Cortex-A72 (ARM v8) 64-bit processor. These nodes are connected through a wireless interface (WiFi). For model parallelism, we deployed four Jetson Nano boards, each containing a 128-core NVIDIA Maxwell architecture-based GPU and a quad-core ARM® A57 CPU. These boards give the IoT nodes access to multiple CPU and GPU nodes, and each Jetson Nano board acts as a single entity. We obtained malware and benign applications from VirusTotal [61], with 12,500 benign samples and 70,000 malware samples that span five malware classes: backdoor, rootkit, trojan, virus, and worm. These files are executed in a sandbox to capture their HPC attributes. The HPC attributes of the benign and malware samples are converted to grayscale images of size 256 $\times$ 256. The benign and malware binary samples are also converted to grayscale images of size 256 $\times$ 256. In this image dataset, 70% of the data forms the training set and 30% of unseen data forms the test set. To further improve the model training and make it resilient to malware evolution, synthetic data generated using the code-aware data generation technique based on few-shot learning is added to the training set. This synthetic data is also split 70%-30% and used to augment the training and test data during runtime. A CNN is trained offline on all this data, and the inference on the test data is performed on multiple CPU and GPU nodes based on resource availability.
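One simple way to map a binary file to a 256 $\times$ 256 grayscale image is to treat each byte as one pixel intensity. The sketch below assumes that files longer than 65,536 bytes are truncated and shorter ones are zero-padded. This padding rule is an assumption for illustration and is not taken from the experimental setup, and the function name is hypothetical.

```python
def bytes_to_grayscale(data, side=256):
    """Map raw bytes to a side x side image (list of rows of 0-255 intensities)."""
    n = side * side
    # Truncate long inputs and zero-pad short ones so that exactly n pixels remain.
    padded = data[:n] + bytes(max(0, n - len(data)))
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


if __name__ == "__main__":
    image = bytes_to_grayscale(bytes([0, 255, 7]))
    print(len(image), len(image[0]), image[0][:4])
```

The resulting rows can be handed to any image library or converted to a tensor before being fed to the CNN.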

V-B Simulation Results

The inference is performed using a pre-trained convolutional neural network. If the resources available for inference are not sufficient, the malware detection task is offloaded to multiple nodes. Table II reports the performance on different datasets when model parallelism is applied. These datasets contain only a few real samples and are populated with synthetic data generated using GANs. Different models, trained on the HPC, binary, and combined datasets, are divided over multiple nodes. We compare the performance of the proposed technique in terms of accuracy, F1-score, precision, and recall. The highest accuracy, 98.62%, is observed when the training data contains both HPC and binary image data and model parallelism is performed on two nodes. This case performs well because it has more training samples, which improves the learning capability of the model. In addition, the inference model is divided over only two nodes, so the penalty is smaller than with three- or four-node model parallelism. The lowest accuracy, 89.45%, is observed for the model trained on the binary dataset with model parallelism on four nodes. This is due to the limited features in the binary dataset that help in detecting complex malware, combined with the penalty of dividing the model over four nodes. As the number of nodes over which the model is divided increases, we observe a small penalty in performance.

TABLE II: Performance comparison of the proposed model with different MP algorithms

Model	Nodes	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
HPC (with MP)	2	97.64	97.62	97.65	97.63
HPC (with MP)	3	97.24	97.24	97.26	76.45
HPC (with MP)	4	96.73	96.73	97.21	96.72
Binaries (with MP)	2	91.62	91.63	91.62	91.70
Binaries (with MP)	3	91.18	91.16	91.15	91.18
Binaries (with MP)	4	89.45	89.43	89.42	89.45
HPC + Binaries (with MP)	2	98.62	98.63	98.62	98.70
HPC + Binaries (with MP)	3	98.14	98.12	97.86	98.12
HPC + Binaries (with MP)	4	97.12	97.13	97.12	97.12
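As a reading aid for Table II, the F1-score in its usual single-class form is the harmonic mean of precision and recall. The helper below is a generic sketch and not the evaluation code used for the table; how the reported scores were averaged across malware classes is not stated, so the relation may hold only approximately for the tabulated values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall of the two-node and three-node HPC rows of Table II.
    print(round(f1_score(97.62, 97.65), 2))
    print(round(f1_score(97.24, 97.26), 2))
```

For the three-node HPC row, the reported precision and recall give about 97.25 under this definition, so the 76.45 listed in that row is not reproduced by the formula.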

V-C Comparison with Previous Works

Table III compares the proposed technique with existing HPC-based malware detection techniques in terms of accuracy, F1-score, recall, latency, and area. All the models in Table III perform malware detection based on HPC runtime features. Compared to the existing techniques, the proposed CNN-based distributed inference on HPC-based image data achieves the highest accuracy. It maintains an accuracy of 96.7% while producing comparable latency and area values.

TABLE III: Comparison with existing HPC-based detection techniques

Model	Accuracy (%)	F1-score (%)	Recall (%)	Latency (cycles @ 10 ns)	Area ($\mu m^{2}$)
OneR [62]	81.00	81.00	82.00	1	1258
JRIP [62]	83.00	83.00	84.00	4	1504
PART [62]	81.00	81.50	83.10	6	2131
J48 [62]	82.00	82.00	82.00	9	1801
Adaptive-HMD [63]	85.30	85.30	85.80	4	876
SVM [64]	73.90	73.60	77.20	-	-
RF [64]	83.50	83.40	82.20	-	-
NN [64]	81.10	81.10	81.60	-	-
SMO [65]	93.20	93.30	93.10	22	2466
Proposed	96.70	96.70	97.20	10	1044

The F1-score and recall also support the performance of the proposed technique relative to the other techniques. From these results, it is evident that the proposed technique achieves state-of-the-art HPC-based malware detection accuracy. The latency is expressed in clock cycles of 10 ns and measures the inference time, obtained from the Synopsys DC tool. For the proposed technique, 10 cycles correspond to an inference time of about 100 ns, which indicates real-time malware detection.

V-D Impact on Latency with Proposed Technique

Normalized inference execution time is analyzed for two cases: a) the parent node has sufficient resources; b) the parent node does not have enough resources and outsources to multiple nodes. Figure 3 shows the latency for these cases. In Figure 3, Node represents the 128-core NVIDIA Maxwell architecture-based GPU in the Jetson Nano boards, and ARM represents the quad-core ARM® A57 CPU in the Jetson boards. P represents the parent node, and C1, C2, and C3 represent the first, second, and third child nodes, respectively. We observe that the inference time decreases as the number of nodes increases, because the parameters are divided over several nodes and each node has less to execute. As the executions run in parallel, the total latency of inference under model parallelism is the latency of the node that takes the longest to execute (usually the parent node P). In Figure 3, for the case of sufficient resources, the inference task takes 98 seconds. With model parallelism, we observe a speedup of 4 $\times$ when the inference task is parallelized between two nodes. If we further offload the inference task to three nodes, an additional speedup of about 1.5 $\times$ is observed. The ARM boards used as child nodes in model parallelism also produced notable speedups. We observed an overall speedup of 9.8 $\times$ when using four Jetson Nano boards.
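The latency rule above, where the slowest node determines the total, and the resulting speedup can be written out directly. In the sketch below the 98-second single-node time comes from the text, while the per-node latencies are hypothetical values chosen only so that the slowest node matches the reported 4 $\times$ and 9.8 $\times$ speedups.

```python
def parallel_latency(node_latencies):
    """Total latency of a parallel run is set by the slowest node."""
    return max(node_latencies)


def speedup(single_node_latency, node_latencies):
    """Speedup of the parallel run over single-node execution."""
    return single_node_latency / parallel_latency(node_latencies)


if __name__ == "__main__":
    single = 98.0  # seconds, single node with sufficient resources
    # Hypothetical per-node latencies in seconds, parent P listed first.
    print(speedup(single, [24.5, 20.0]))            # two nodes
    print(speedup(single, [10.0, 8.5, 9.0, 7.5]))   # four nodes
```

With these illustrative numbers the two-node run takes 24.5 seconds and the four-node run takes 10 seconds, which correspond to speedups of 4 and 9.8.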

V-E Impact of Proposed Technique on Resource Consumption

The resource consumption of the different worker nodes is shown in Figure 4, where P represents the parent node and C1, C2, and C3 represent the first, second, and third child nodes, respectively. We observe the resource consumption for the following scenarios: a) the parent node has sufficient resources; b) the parent node does not have enough resources and outsources to multiple nodes. The inference task requires 4 MB of data to complete. In the first case, the single parent node P can provide this data on its own. In the other cases, the inference task is divided between multiple nodes (model parallelism), so the required data is also divided across multiple nodes. The resources are not consumed equally in the parent and child nodes. This is because the parent node usually has additional steps to perform, such as collecting the gradients from the child nodes and adding them, so it consumes more resources. Compared to using a single node, model parallelism draws the required resources from several nodes, which helps to improve the processing speed.

V-F ASIC Implementation of Proposed Technique

We present the hardware implementation of the classifiers on ASIC to estimate resource utilization. The power, area, and energy values are reported at 100MHz. We used Design Compiler Graphical by Synopsys to obtain the area of the models. Power consumption is obtained using Synopsys PrimeTime PX. The post-layout area, power, and energy are summarized in Table IV. The resource utilization of the binary regression model is significantly lower, whereas the CNN consumes more power, energy, and on-chip area (Table IV). Hence, we split the CNN model across the neighboring nodes that have available resources for the inference computation. The post-layout energy numbers were approximately 1.8 $\times$ higher than the post-synthesis results. This increase in energy is mainly due to metal routing, which results in layout parasitics. As the tool uses different routing optimizations, the power, area, and energy values change with the composition and architecture of the classifiers.

TABLE IV: Post synthesis hardware results of the classifiers (@100MHz) when deployed

Model	Power ( $mW$ )	Energy ( $mJ$ )	Area ( $mm^{2}$ )
CNN	82.45	5.12	4.42
Regressor	27.81	2.52	1.18

VI Conclusion

The proposed resource- and workload-aware, model parallelism-based malware detection technique employs distributed inference to enable better security for resource-constrained IoT devices. The metrics of distributed execution on multiple nodes are analyzed, and a speedup of 9.8 $\times$ is observed compared to on-device inference. From the results presented, it is also evident that the proposed technique achieves a state-of-the-art malware detection accuracy of 96.7% among HPC-based detection techniques. We also presented the ASIC implementations of the CNN classifier trained using the proposed technique and of the lightweight logistic regressor trained to classify the availability of resources. Thus, the proposed technique is reliable and accurate for malware detection in IoT devices.

References

[1] T. Adiono, “Challenges and opportunities in designing internet of things,” 2014 The 1st International Conference on Information Technology, Computer, and Electrical Engineering, 2014.
[2] O. Abbas et al., “Big data issues and challenges,” 2016.
[3] J. Johnson, “Number of malware attacks per year 2020,” Sep 2022. [Online]. Available: https://www.statista.com/statistics/873097/malware-attacks-per-year-worldwide/
[4] “Malware statistics trends report: Av-test,” 2021. [Online]. Available: https://www.av-test.org/en/statistics/malware/
[5] J. Wurm, K. Hoang, O. Arias, A. R. Sadeghi, and Y. Jin, “Security analysis on consumer and industrial iot devices,” in Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2016.
[6] E. Ronen and A. Shamir, “Extended functionality attacks on IoT devices: The case of smart lights,” in IEEE European Symposium on Security and Privacy, 2016.
[7] A. Damodaran et al., “A comparison of static, dynamic, and hybrid analysis for malware detection,” Journal of Computer Virology and Hacking Techniques, 2015.
[8] C. Rossow et al., “Prudent practices for designing malware experiments: Status quo and outlook,” Symposium on Security and Privacy, 2012.
[9] L. Nataraj et al., “Malware images: Visualization and automatic classification,” in Int. Symposium on Visualization for Cyber Security, 2011.
[10] D. Gibert et al., “Using convolutional neural networks for classification of malware represented as images,” Journal of Computer Virology and Hacking Techniques, 2019.
[11] B. Bashari Rad et al., “Camouflage in malware: From encryption to metamorphism,” IJCSNS, 2012.
[12] S. Shukla, P. D. Sai Manoj, G. Kolhe, and S. Rafatirad, “On-device malware detection using performance-aware and robust collaborative learning,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021.
[13] S. Shukla, S. Rafatirad, H. Homayoun, and S. M. P. Dinakarrao, “Federated learning with heterogeneous models for on-device malware detection in iot networks,” in 2023 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2023.
[14] H. Sayadi, Y. Gao, H. Mohammadi Makrani, J. Lin, P. C. Costa, S. Rafatirad, and H. Homayoun, “Towards accurate run-time hardware-assisted stealthy malware detection: A lightweight, yet effective time series cnn-based approach,” Cryptography, vol. 5, no. 4, 2021. [Online]. Available: https://www.mdpi.com/2410-387X/5/4/28
[15] S. Das, J. Werner, M. Antonakakis, M. Polychronakis, and F. Monrose, “The challenges, pitfalls, and perils of using hardware performance counters for security,” in IEEE Security & Privacy, 2019.
[16] I. You et al., “Malware obfuscation techniques: A brief survey,” in Int. Conf. on Broadband, Wireless Comp., Comm. and Applications, 2010.
[17] S. J. Stolfo et al., “Towards stealthy malware detection,” in Malware Detection, 2007.
[18] S. Li, H. Song, and M. Iqbal, “Privacy and security for resource-constrained iot devices and networks: Research challenges and opportunities,” Sensors, vol. 19, no. 8, 2019. [Online]. Available: https://www.mdpi.com/1424-8220/19/8/1935
[19] A. Makandar and A. Patrot, “Malware class recognition using image processing techniques,” in Int. Conf. on Data Management, Analytics and Innovation (ICDMAI), 2017.
[20] A. Dhavlle and S. Shukla, “A novel malware detection mechanism based on features extracted from converted malware binary images,” 2021.
[21] S. Shukla, G. Kolhe, H. Homayoun, S. Rafatirad, and S. M. P D, “Rafel - robust and data-aware federated learning-inspired malware detection in internet-of-things (iot) networks,” in Proceedings of the Great Lakes Symposium on VLSI 2022, ser. GLSVLSI ’22. New York, NY, USA: Association for Computing Machinery, 2022.
[22] Brasser et al., “Advances and throwbacks in hardware-assisted security: Special session,” in International Conference on Compilers, Architecture and Synthesis for Embedded Systems, ser. CASES ’18, 2018.
[23] D. Bilar, “Opcodes as predictor for malware,” IJESDF, 2007.
[24] J. Demme, M. Maycock, J. Schmitz, A. Tang, A. Waksman, S. Sethumadhavan, and S. Stolfo, “On the feasibility of online malware detection with performance counters,” in ISCA’13, 2013.
[25] A. P. Kuruvila, S. Kundu, and K. Basu, “Analyzing the efficiency of machine learning classifiers in hardware-based malware detectors,” in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020, pp. 452–457.
[26] A. Dhavlle, S. Shukla, S. Rafatirad, H. Homayoun, and S. M. Pudukotai Dinakarrao, “Hmd-hardener: Adversarially robust and efficient hardware-assisted runtime malware detection,” in 2021 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2021.
[27] S. Shukla, G. Kolhe, S. M. P. D, and S. Rafatirad, “Microarchitectural events and image processing-based hybrid approach for robust malware detection: Work-in-progress,” in Proceedings of the International Conference on Compliers, Architectures and Synthesis for Embedded Systems Companion, ser. CASES ’19. New York, NY, USA: Association for Computing Machinery, 2019.
[28] S. Shukla and P. D. Sai Manoj, “Bring it on: Kinetic energy harvesting to spark machine learning computations in iots,” in 2024 International Symposium on Quality Electronic Design (ISQED), 2024.
[29] S. Barve, S. Shukla, S. M. P. Dinakarrao, and R. Jha, “Adversarial attack mitigation approaches using rram-neuromorphic architectures,” in Proceedings of the 2021 on Great Lakes Symposium on VLSI, ser. GLSVLSI ’21. New York, NY, USA: Association for Computing Machinery, 2021.
[30] S. Kasarapu, R. Hassan, S. Rafatirad, H. Homayoun, and S. M. P. Dinakarrao, “Demography-aware covid-19 confinement with game theory,” in 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2021, pp. 1–4.
[31] S. Kasarapu, R. Hassan, H. Homayoun, and S. M. Pudukotai Dinakarrao, “Scalable and demography-agnostic confinement strategies for covid-19 pandemic with game theory and graph algorithms,” COVID, vol. 2, no. 6, pp. 767–792, 2022.
[32] S. Kasarapu, S. Shukla, and S. M. P. Dinakarrao, “Resource- and workload-aware malware detection through distributed computing in iot networks,” in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), 2024, pp. 368–373.
[33] S. Kasarapu, S. Shukla, and S. M. Pudukotai Dinakarrao, “Resource- and workload-aware model parallelism-inspired novel malware detection for iot devices,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 12, pp. 4618–4628, 2023.
[34] S. Kasarapu, S. Shukla, R. Hassan, A. Sasan, H. Homayoun, and S. M. PD, “Cad-fsl: Code-aware data generation based few-shot learning for efficient malware detection,” 2022.
[35] S. Shukla, S. Kasarapu, R. Hasan, S. M. P. D, and H. Shen, “Ubol: User-behavior-aware one-shot learning for safe autonomous driving,” in 2022 Fifth International Conference on Connected and Autonomous Driving (MetroCAD), 2022, pp. 7–12.
[36] S. Kasarapu, S. Bavikadi, and S. M. P. Dinakarrao, “Processing-in-memory architecture with precision-scaling for malware detection,” in 2024 37th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID), 2024, pp. 529–534.
[37] G. P. C. R. S. Bharani Surya, S and N. Mohankumar, “Evolving reversible fault-tolerant adder architectures and their power estimation,” in International Conference on Communication, Computing and Electronics Systems: Proceedings of ICCCES 2019, 2020.
[38] R. Saravanan, S. Bavikadi, S. Rai, A. Kumar, and S. M. Pudukotai Dinakarrao, “Reconfigurable fet approximate computing-based accelerator for deep learning applications,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5.
[39] M. N. Raghul, S, “Microcontroller based ann for pick and place robot coordinate monitoring system,” in Proceedings of the 1st International Conference on Data Science, Machine Learning and Applications (ICDSMLA), 2020.
[40] R. S, Y. Akhileswar, and M. N, “N configuration trng based scrambler protocol for secured file transfer,” in Proceedings of the International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS), 2021.
[41] Y. Akhileswar, S. Raghul, C. Meghana, and N. Mohankumar, Hardware-Assisted QR Code Generation Using Fault-Tolerant TRNG, 2020.
[42] M. D. F. De Grazia, I. Stoianov, and M. Zorzi, “Parallelization of deep networks.” in ESANN. Citeseer, 2012.
[43] M. Li, D. Andersen, J. Park, A. Smola, A. Ahmed, V. Josifovski, J. Long, E. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” Proc. OSDI, pp. 583–598, 01 2014.
[44] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Comput. Surv., vol. 53, no. 2, Mar. 2020.
[45] R. Hewett and I. Grady, “A linear algebraic approach to model parallelism in deep learning,” 06 2020.
[46] M. Shoeybi, M. M. A. Patwary, R. Puri, P. Legresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using gpu model parallelism,” 09 2019.
[47] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism in distributed training for huge neural networks,” 05 2021.
[48] S. M. Grigorescu, “Generative one-shot learning (gol): A semi-parametric approach to one-shot learning in autonomous vision,” Int. Conf. on Robotics and Automation (ICRA), 2018.
[49] D. Bogdoll, J. Jestram, J. Rauch, C. Scheib, M. Wittig, and J. M. Zöllner, “Compressing sensor data for remote assistance of autonomous vehicles using deep generative models,” CoRR, 2021.
[50] S. Pillai and J. J. Leonard, “Towards visual ego-motion learning in robots,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[51] L. Taylor and G. Nitschke, “Improving deep learning with generic data augmentation,” in Symposium Series on Computational Intelligence (SSCI), 2018.
[52] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” Neural Information Processing Systems, 2012.
[53] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Int. conf. on machine learning. PMLR, 2016.
[54] S. Shukla, A. Dhavlle, S. M. P. D, H. Homayoun, and S. Rafatirad, “Iron-dome: Securing iot networked systems at runtime by network and device characteristics to confine malware epidemics,” in 2022 IEEE 40th International Conference on Computer Design (ICCD), 2022.
[55] S. Shukla, G. Kolhe, S. M. PD, and S. Rafatirad, “Rnn-based classifier to detect stealthy malware using localized features and complex symbolic sequence,” in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019.
[56] S. Shukla, G. Kolhe, S. M. P D, and S. Rafatirad, “Stealthy malware detection using rnn-based automated localized feature extraction and classifier,” in 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019.
[57] J. Geng, D. Li, and S. Wang, “Elasticpipe: An efficient and dynamic model-parallel solution to dnn training,” in Workshop on Scientific Cloud Computing, 2019.
[58] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in International Conference on Neural Information Processing Systems, 2012.
[59] P. Patarasuk and X. Yuan, “Bandwidth optimal all-reduce algorithms for clusters of workstations,” J. Parallel Distrib. Comput., vol. 69, no. 2, pp. 117–124, Feb. 2009.
[60] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011.
[61] “Virustotal package,” 2021. [Online]. Available: https://www.rdocumentation.org/packages/virustotal/versions/0.2.1
[62] N. Patel, A. Sasan, and H. Homayoun, “Analyzing hardware based malware detectors,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.
[63] Y. Gao, H. M. Makrani, M. Aliasgari, A. Rezaei, J. Lin, H. Homayoun, and H. Sayadi, “Adaptive-hmd: Accurate and cost-efficient machine learning-driven malware detection using microarchitectural events,” in 2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS), 2021, pp. 1–7.
[64] A. P. Kuruvila, S. Kundu, and K. Basu, “Analyzing the efficiency of machine learning classifiers in hardware-based malware detectors,” in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2020, pp. 452–457.
[65] H. Sayadi, H. Mohammadi Makrani, O. Randive, S. M. P.D., S. Rafatirad, and H. Homayoun, “Customized machine learning-based hardware-assisted malware detection in embedded devices,” in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), 2018, pp. 1685–1688.