ZEST: Attention-based Zero-Shot Learning for Unseen IoT Device Classification

Binghui Wu, Philipp Gysel, Dinil Mon Divakaran, and Mohan Gurusamy

Binghui Wu and Mohan Gurusamy (Senior Member, IEEE) are with National University of Singapore (NUS). Philipp Gysel and Dinil Mon Divakaran (Senior Member, IEEE) are with Acronis Research.

Abstract

Recent research works have proposed machine learning models for classifying IoT devices connected to a network. However, there is still a practical challenge of not having all devices (and hence their traffic) available during the training of a model. This essentially means that, during the operational phase, we need to classify new devices not seen in the training phase. To address this challenge, we propose ZEST, a ZSL (zero-shot learning) framework based on self-attention for classifying both seen and unseen devices. ZEST consists of i) a self-attention based network feature extractor, termed SANE, for extracting latent space representations of IoT traffic, ii) a generative model that trains a decoder using latent features to generate pseudo data, and iii) a supervised model that is trained on the generated pseudo data for classifying devices. We carry out extensive experiments on real IoT traffic data; our experiments demonstrate that i) ZEST achieves significant improvement (in terms of accuracy) over the baselines, and ii) SANE extracts more meaningful representations than LSTM, which has been commonly used for modeling network traffic.

Index Terms:

IoT, fingerprinting, zero-shot learning (ZSL), network traffic, attention, security, transformer

I Introduction

Offices, homes, and enterprises in various industry verticals have numerous IoT devices connected to their networks, including smart thermostats, hubs, lighting systems, alarms, TVs, and wearable devices. While IoT devices offer new and efficient services, they also present security threats. Currently, manufacturers do not follow a standard framework to announce the device identities and their functionalities. The lack of standardization often results in vulnerabilities left open for different kinds of attacks [1, 2]. An important first step in securing IoT devices is to identify the different types of devices operating in a home/office environment. The difficulty in doing so arises from the constantly evolving landscape of IoT devices and their network behaviors. Traditional static methods struggle to adapt to changing device behaviors, recognize unknown devices, and capture complex communication patterns [3]. Consequently, monitoring network traffic dynamically is now the most practical method for identifying devices and ensuring their security.

IoT fingerprinting is a well-studied problem [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], but there are still open challenges in practical settings. Many existing works take a supervised approach, thereby dealing only with known devices [4, 5, 6, 7, 13]. With the number of IoT devices expected to grow to tens of billions, new device types will continue to enter the market, making it impractical to assume that traffic of all devices will be available in advance to train a machine learning model. The challenge remains to identify devices not present in the training set, which we refer to as “unseen” devices. Conversely, “seen” devices refer to those devices that have labeled examples available during the model’s training phase. We need a system that can classify unseen IoT devices, in addition to seen devices.

Zero-shot learning (ZSL) could help to classify unseen devices. ZSL is known to work well for image classification; it leverages textual descriptions as attributes to relate unseen classes to seen classes [17, 18, 19]. ZSL on images involves the use of specific attributes obtained through manual or automatic annotation. For instance, datasets such as Caltech-UCSD Birds-200-2011 (CUB-200) [20] and Animals with Attributes (AwA) [21] provide pre-extracted feature representations for image descriptions. Given the information that a giraffe is a herbivore, has a long neck, has brown spots, and has ossicones on its head, one can easily distinguish it from other animals such as pigs or cows, even without having seen a giraffe before. By leveraging the semantic relationships between different classes, the ZSL approach enables the model to recognize unseen classes by inferring their attributes in association with other classes. The question we ask in this research is, can ZSL be leveraged to carry out classification of both seen and unseen IoT devices, by mapping network traffic data to an attribute space? Figure 1 illustrates this concept.

For IoT devices, we have textual descriptions, e.g., information on the product webpages or user manuals; but they do not directly translate into the domain of network traffic data. Defining attributes in the IoT domain is non-trivial, and it plays a critical role in model performance. The primary challenge in ZSL for IoT devices is the definition of suitable attributes. These attributes should possess sufficient information density to accurately represent and distinguish the traffic patterns of individual devices. In the context of network traffic analysis, one must examine the intricacies associated with packets. Network packets are characterized by a multitude of information, including IP addresses, service ports, transport protocols, inter-arrival time, etc. As data flows through the network, the accumulation of the packets results in extremely long sequences; packet sequences in hundreds or thousands or even more are common with different applications (such as browsing, email, SSH, etc.). Furthermore, the challenge posed by high dimension becomes evident when considering the diverse set of features and their potential combinations with packet sequences. This results in the creation of a high-dimensional feature space, in which each unique feature introduces an additional dimension.

While traditional statistical models and ML algorithms, such as Support Vector Machine (SVM), Linear Regression, k-NN, and Decision Trees, have been proposed for learning network traffic behaviors (e.g., see [22, 23, 24]), such models encounter significant difficulties when tasked with effectively processing and extracting meaningful patterns from such intricate and high-dimensional data. To address the computational challenges inherent in dealing with data of long sequences and high dimensions, specialized sequence models have emerged as an effective approach. In particular, transformer models [25] have shown excellent performance in language modeling [26], image classification [27], DNA analysis [28], resource allocation [29], etc. The self-attention mechanism in transformers enables parallel evaluation of all tokens of the input sequence, thus eliminating the sequential dependency present in prior sequence models such as recurrent neural networks (RNNs). Transformers consider the entire context, rather than relying solely on local information, enabling a deeper understanding of context and dependencies. Building upon the encoder of the transformer architecture, we develop SANE, a self-attention mechanism designed to comprehend traffic patterns and autonomously generate concise attributes for IoT devices (Section III-C).

In this work, we propose ZEST, a zero-shot learning framework based on the self-attention mechanism for IoT fingerprinting. ZEST involves i) training a self-attention based network feature extractor, i.e., SANE, to extract features and attribute vectors of devices, subsequently ii) training a generative model to map attribute vectors to traffic data and generate pseudo data for unseen devices, and finally iii) training a supervised classifier with the generated data (as illustrated in Figure 3). The main contributions of this work are:

1. We introduce a ZSL framework for IoT fingerprinting, ZEST. To the best of our knowledge, it is the first generative ZSL framework for IoT fingerprinting. Our work here is also the first to leverage a transformer model for learning network traffic characteristics. Based on the self-attention mechanism, ZEST achieves state-of-the-art performance when compared with other semi-supervised and unsupervised learning methods.

2. As the use of the attention mechanism for network traffic understanding is new, we study its effects in classifying IoT devices and compare it with existing solutions that use LSTMs for the same purpose. We find that an attention-based mechanism improves the performance of even existing baseline models, making it a better choice for IoT fingerprinting.

3. We propose a new approach for generating attribute vectors for IoT devices. Unlike image classification tasks, network traffic data of IoT devices do not come with apparent class descriptions. Hence, we leverage pre-trained models to extract attribute vectors of unseen devices, providing a viable and attractive technique to overcome this challenge. We conduct experiments with varying attribute vector dimensions and identify the most suitable dimension for optimal performance.

4. We make the source code implementations of the models openly available to facilitate research (code is available at: https://github.com/Binghui99/ZEST). This includes the implementation of the ZEST framework, LSTM-based baselines, as well as the data processing methods.

In the following, we first present the related literature for IoT fingerprinting and the background of ZSL. In Section III, we present our ZSL framework for identifying seen and unseen devices. Performance evaluations are carried out in Section IV.

II Related Works

II-A IoT fingerprinting

With the rapid growth of the IoT ecosystem, there has been an increasing interest in characterizing and fingerprinting (i.e., classifying) IoT devices. In recent years, several works have proposed IoT traffic analysis methods that use supervised learning approaches to perform device classification [4, 5, 6, 7]. An interesting work [7] is the application of a sequence model, specifically Bi-LSTM (bi-directional long short-term memory), for modeling traffic of IoT devices. This deep learning model shows a good ability to learn sequence information and achieves high accuracy in IoT device classification.

However, a supervised approach is limited in practice, since we have numerous new unseen devices entering the market regularly. Therefore, researchers proposed unsupervised methods [8, 30, 9] for IoT fingerprinting. Sivanathan et al. [30] extract key features from flow-level network traffic and use PCA (principal component analysis) to project data into lower dimensional space. As a complementary approach, the authors in [9] train a VAE (variational autoencoder) with an encoder and a decoder in an unsupervised way, subsequently leverage the encoder to compress raw data, and finally use k-means for clustering. However, unsupervised learning methods do not use all available information, such as labels of seen devices and their semantic descriptions, thus achieving only modest results (see Section IV-C).

Semi-supervised methods [10, 14, 31, 16], on the other hand, utilize the available information, and they can also deal with unseen classes. The authors in [14] propose a semi-supervised method based on a CNN (convolutional neural network) model and multi-task learning. Given a few labeled data, they train a CNN to transform raw features into dense high-level features, and thus achieve a dimension reduction. An alternative semi-supervised approach called DEFT [10] extracts traffic features and utilizes seeded k-means [31] to conduct unsupervised clustering. Based on the clustered data, DEFT trains a supervised random forest to perform the final classification, using the cluster numbers as labels.

The semi-supervised solutions are often based on clustering [10, 14, 31]. However, clustering has limitations, including the requirement to specify the number of clusters in advance, sensitivity to initial centroids and outliers, and unsuitability for high-dimensional and non-linear data [32, 31, 33, 34]. In order to overcome these shortcomings, we choose to avoid clustering for our solution. Instead, our ZEST framework uses a generative model for generating unseen data, based on high-level attribute definitions. Then we use the generated data to train a supervised model. To deal with high-dimensional and non-linear data, we use a deep sequence model and extract small-dimensional latent features.

II-B Zero-shot learning (ZSL)

Broadly, there are two approaches for ZSL: embedding-based methods [19, 18] and generative-based methods [17, 35, 36]. Embedding-based methods learn a high-dimensional embedding space that maps the low-level features of seen classes to their corresponding semantic vectors. This approach recognizes new classes by comparing prototypes and predicted representations of data samples in the embedding space. On the other hand, generative-based methods use samples of seen classes and semantic representations of both seen and unseen classes to generate pseudo data for the unseen classes, thus converting a ZSL problem into a supervised learning problem.

In this work, we focus on the generative method. The authors in [17] propose a CVAE-based approach using a generative model to learn the probability distribution of the input space conditioned on the attribute representation of the classes. The generative model is then used to generate samples for the unseen classes based on their attribute vectors. Gao et al. [36] propose Zero-VAE-GAN, which combines VAE and GAN to generate features for novel classes. This approach uses a dual encoder-decoder structure to map data samples into a joint feature space, improving the quality of generated samples. Our proposal ZEST is inspired by [17]; but there are some significant differences. Firstly, the proposal in [17] is for the image domain; therefore, the training data utilized is accompanied by well-defined semantic information. The authors apply word2vec to text descriptions (e.g., from Wikipedia) for different image classes. However, such a method is unsuitable for our problem, since the text description from device manuals cannot be transferred to the network traffic domain. Therefore, we propose a different approach to extract attributes (see Section III-C). Moreover, in the image classification domain, there are highly proficient pre-trained models trained on large datasets, like ResNet50 [37]. In the absence of such pre-trained models, we train our attention-based model from scratch, and employ it to extract features from network traffic. The details of our ZSL pipeline are explained in Section III-D.

II-C Transformer

Transformers, introduced by Vaswani et al. [25], are used for sequence-to-sequence learning tasks. The self-attention mechanism is good at capturing contextual relationships within a sequence, empowering models to extract rich and informative features. Furthermore, it can be efficiently parallelized, making it suitable for modern hardware accelerators like GPUs and TPUs [38]. We leverage the advancements made in transformers to enhance the performance of IoT fingerprinting. The Vision Transformer (ViT) proposed in [27] is a BERT-like [26] model for image classification that achieves superior performance on multiple benchmarks with fewer parameters than competing models, by efficiently modeling long-range dependencies between image patches and global image information. Our attention-based model is inspired by ViT.

However, unlike the input images to ViT, network traffic comes as a data sequence. For our data pre-processing pipeline, we split network traffic into packet sequences of pre-defined length, where packets are represented by a small number of raw features (Section III-B). For the final classification, we use an average pooling layer to get the whole sequence information. Besides, we define a special token to summarize the sequence-level network traffic information. Our design is driven by the need to understand network traffic at both the sequence level and packet level.

III ZEST: model architecture

III-A System definition

We consider a network connecting a set, $\Gamma$, of IoT devices (such as smart cameras, hubs, alarms, etc.). We use $\mathcal{S}$ to denote the set of devices already seen, and $\mathcal{U}$ for the set of unseen devices; both are mutually exclusive, i.e., $\mathcal{S}\cap\mathcal{U}=\emptyset$, $\mathcal{S}\cup\mathcal{U}=\Gamma$. For our classification system, a single data point $\mathbf{x}$ is defined as a sequence of network packets with length $n$, and each packet has $f$ features, i.e., $\mathbf{x}\in\mathbb{R}^{n\times f}$. For each seen device $d\in\mathcal{S}$, we have a set of $m$ data points $\mathcal{X}^{d}=\{\mathbf{x}_{1}^{d},\mathbf{x}_{2}^{d},\cdots,\mathbf{x}_{m}^{d}\}$, and their corresponding labels $\mathcal{Y}^{d}=\{y_{1}^{d},y_{2}^{d},\cdots,y_{m}^{d}\}$. However, for $d\in\mathcal{U}$, we have data points $\mathcal{X}^{d}$ without the labels. Note that a data point representing a sequence of (say) $n=200$ packets, each with $f=8$ features, results in a feature space of $200\times 8=1{,}600$ dimensions, which is very high.
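As a small illustration of these definitions, the sketch below builds one placeholder data point with the example values $n=200$ and $f=8$ and checks the seen/unseen partition. The device names and the zero-filled array are hypothetical and are used only to make the shapes concrete.

```python
import numpy as np

n, f = 200, 8                      # packets per sequence and features per packet (example values from the text)
x = np.zeros((n, f))               # one data point x in R^{n x f}; zeros are placeholders, not real traffic
flat_dim = x.reshape(-1).shape[0]  # dimension of the data point when the sequence is flattened

# Hypothetical device names, used only to illustrate the partition of Gamma.
gamma = {"camera", "hub", "alarm", "plug"}
seen = {"camera", "hub", "alarm"}   # S: labelled traffic is available
unseen = gamma - seen               # U: traffic is available, labels are not

print(flat_dim, sorted(unseen))
```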

III-B Traffic representation

From the traffic of an IoT device, we extract what we refer to as raw features for each packet in the traffic. Flow-level features capture the packet statistics but in a lossy way. In comparison, per-packet features provide the finest granularity of information in traffic. The raw packet features that are beneficial for device classification are: packet size, time since the last packet, the direction of the packet, transport protocol (TCP/UDP), the application protocol (HTTP/S, DNS, NTP, etc.), TLS version, src/dst IP address category, and src/dst port category. Intuitively, a single packet in itself might not present sufficient information for device classification, e.g., a TCP SYN or SYN+ACK is present in all TCP flows. Therefore, these per-packet features are extracted from non-overlapping fixed-length sequences of packets in a network trace of an IoT device and provided as input to the model, both in the training and inference phases. In our work, we limit the features to:

• Source and destination IP addresses: Using raw IP addresses would overfit the model to the dataset used, besides creating a large latent space for representation. Instead, we use a binary value indicating whether the source/destination IP address is internal or external to the network of IoT devices.
• Port representation: Between the source and destination ports, we assume the lower one is the service port, and represent only the service port. The other port is typically a random port, and therefore set to a constant number. The idea is to minimize the influence of ephemeral ports.
• Transport layer protocol (e.g., UDP or TCP).
• The time since the previous network packet.
• The size of the packet.
• The direction of the packet (inbound/outbound).
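The feature list above can be sketched as a small encoder that turns each packet into eight numbers and cuts a trace into non-overlapping sequences. This is only an illustration: the `Packet` type, the `LOCAL_NET` subnet, the `EPHEMERAL_PORT` constant, the feature order, and deriving direction from the source address are all assumptions, not details taken from the paper.

```python
import ipaddress
from dataclasses import dataclass

LOCAL_NET = ipaddress.ip_network("192.168.1.0/24")  # assumed subnet of the IoT devices
EPHEMERAL_PORT = 0                                  # assumed constant replacing the non-service port

@dataclass
class Packet:                                       # hypothetical parsed-packet record
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str        # "TCP" or "UDP"
    timestamp: float  # seconds
    size: int         # bytes

def encode(pkt, prev_ts):
    """Map one packet to 8 raw features (the order is an assumption)."""
    src_in = int(ipaddress.ip_address(pkt.src_ip) in LOCAL_NET)
    dst_in = int(ipaddress.ip_address(pkt.dst_ip) in LOCAL_NET)
    # The lower port is treated as the service port; the other is set to a constant.
    if pkt.src_port <= pkt.dst_port:
        src_port, dst_port = pkt.src_port, EPHEMERAL_PORT
    else:
        src_port, dst_port = EPHEMERAL_PORT, pkt.dst_port
    proto = 1 if pkt.proto == "TCP" else 0
    delta = pkt.timestamp - prev_ts                 # time since the previous packet
    direction = src_in                              # 1 = outbound, 0 = inbound (assumed definition)
    return [src_in, dst_in, src_port, dst_port, proto, delta, pkt.size, direction]

def to_sequences(packets, n):
    """Split a trace into non-overlapping sequences of exactly n encoded packets."""
    rows, prev = [], packets[0].timestamp if packets else 0.0
    for p in packets:
        rows.append(encode(p, prev))
        prev = p.timestamp
    usable = len(rows) - len(rows) % n              # drop the incomplete tail
    return [rows[i:i + n] for i in range(0, usable, n)]

trace = [
    Packet("192.168.1.10", "8.8.8.8", 51000, 53, "UDP", 10.0, 80),
    Packet("8.8.8.8", "192.168.1.10", 53, 51000, "UDP", 10.5, 120),
    Packet("192.168.1.10", "93.184.216.34", 40000, 443, "TCP", 11.0, 60),
    Packet("93.184.216.34", "192.168.1.10", 443, 40000, "TCP", 11.2, 60),
    Packet("192.168.1.10", "93.184.216.34", 40000, 443, "TCP", 11.3, 52),
]
sequences = to_sequences(trace, n=2)
```

With `n=2` the five-packet toy trace yields two complete sequences and the last packet is dropped; the paper's setting uses much longer sequences.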

III-C Attributes

To overcome the lack of meaningful textual descriptions for IoT device traffic data, we adopt a novel approach inspired by attribute-based image classification. As an analogy, attributes of a giraffe are given by the wise people who see the giraffe and describe it based on their experience of describing other animals. Specifically, we train a self-attention model on the traffic data of seen devices to act as the “wise people” and learn how to describe traffic patterns. Subsequently, when presented with traffic sequences of unseen devices, the model generates a description based on its learned knowledge, even though it has no prior knowledge of the unseen devices. The average description generated by the model is considered as the general attributes of the unseen devices. In this process, the unseen devices come with no label information. We develop and employ a powerful self-attention model based on the transformer encoder [26] to extract latent features. In Figure 2, we present the architecture of SANE, a self-attention based feature extractor.

The input data consists of sequences of packets with high-dimensional features, which are noisy and sparse, making it challenging for a model to recover meaningful attributes. To facilitate the learning process, we require data with high information density enabling the model to effectively map the data space to the attribute space. Therefore, we define two latent space representations to extract features at different levels for each device $d\in\Gamma$. The first one is $\mathbf{L}\in\mathbb{R}^{M\times 1}$, where $\mathbf{L}^{d}=\{l_{1}^{d},l_{2}^{d},\cdots,l_{m}^{d}\}$, such that $l^{d}_{i}\in\mathbb{R}^{M\times 1}$ is the latent space representation of a single data point $\mathbf{x}_{i}^{d}$ and $m\in\mathbb{N}$ is the number of traffic sessions corresponding to device $d$. The second latent space is defined as $\mathbf{\Lambda}\in\mathbb{R}^{N\times 1}$, where $\mathbf{\Lambda}^{d}=\{\mathbf{\lambda}_{1}^{d},\mathbf{\lambda}_{2}^{d},\cdots,\mathbf{\lambda}_{m}^{d}\}$, such that $\mathbf{\lambda}_{i}^{d}\in\mathbb{R}^{N\times 1}$ is another latent feature corresponding to data point $\mathbf{x}_{i}^{d}$. Based on the second latent feature, we define the attribute vector, $\mathcal{A}^{d}\in\mathbb{R}^{N\times 1}$, for both seen and unseen devices, which are embedded into a vector space representing the semantic relationship of different devices. We use the average value of $\mathbf{\Lambda}^{d}$ as the attribute vector $\mathcal{A}^{d}$ of the corresponding device $d$, i.e., $\mathcal{A}^{d}=\frac{1}{m}\sum_{i=1}^{m}\mathbf{\lambda}_{i}^{d}$.
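The averaging step that defines the attribute vector can be written in a few lines. The sketch below assumes the $m$ latent features $\mathbf{\lambda}_{i}^{d}$ of one device are stacked as the rows of an array; the function name and the toy values are hypothetical.

```python
import numpy as np

def attribute_vector(lambdas):
    """Average the m latent features (rows, each of dimension N) of one device."""
    lambdas = np.asarray(lambdas, dtype=float)
    return lambdas.mean(axis=0)

# Toy example with m = 2 data points and N = 3 latent dimensions.
lam_d = np.array([[1.0, 2.0, 3.0],
                  [3.0, 4.0, 5.0]])
a_d = attribute_vector(lam_d)
```

Because the average needs no labels, the same computation applies to seen and unseen devices alike.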

III-D Architecture design

The architecture of ZEST is depicted in Figure 3. It consists of four phases: feature extractor training, feature and attribute extraction, generative model training, and training of a final supervised classifier.

SANE training (Algorithm 1). SANE is used to transform the input data into a latent feature space, as shown in Figure 3a. The deep sequence classifier is pre-trained on a significant amount of seen device data. After comparing well-known sequence models (namely, LSTM and transformer), we find that the transformer model has a better ability to extract traffic patterns (see Section III-C) as well as the lowest inference time. Therefore, we design SANE, based on the transformer encoder, as the feature extractor for ZEST. The raw data for seen devices are transformed into packet sequences with a sequence length of $n=200$ packets. Recall from Section III-B that each packet is represented using $f=8$ features. The steps of SANE training are listed in Algorithm 1.

First, we project packet sequences using a normal linear layer to create an embedding. Then, we add the special SLA (Sequence Level Aggregation) token and the positional embedding. The learnable SLA token is specifically designed for summarizing sequence-level features, facilitating the integration of both packet-level and sequence-level information. It is randomly initialized and added before the packet embedding. This integration results in powerful representations of network traffic. Additionally, we employ a positional embedding, denoted as $\mathbf{P}$ , to provide the model with information about the original packet positions. This embedding is initialized as a random vector, learned by the model, and subsequently added to the packet embedding.

Algorithm 1 SANE training

1: Input: $\mathcal{X}^{d}$; $\mathcal{Y}^{d}$, $\forall\>d\in\mathcal{S}$
2: Output: $\mathcal{\xi}_{\text{{SANE}}}$
3: Randomly initialize parameters in SANE
4: for $(\mathbf{b}_{x},\mathbf{b}_{y})$: sample a batch from $\mathcal{X}$ and $\mathcal{Y}$ do
5:   $\mathbf{E}=\texttt{SLA}\oplus\text{Embedding}(\mathbf{b}_{x})$ $\triangleright$ Concatenate SLA
6:   $\mathcal{E}\leftarrow\mathbf{P}+\mathbf{E}$ $\triangleright$ Add positional embedding
7:   for each stacked encoder $\nu$ do
8:     $R^{1}\leftarrow\text{Multi\_head\_attention}(\text{Norm}(\mathcal{E}))$
9:     $R^{2}\leftarrow\text{MLP}(\text{Norm}(\mathcal{E}+R^{1}))$
10:     $\mathcal{E}\leftarrow R^{2}+(\mathcal{E}+R^{1})$
11:   end for
12:   $l_{i}=\mathbf{NL}_{l}(\text{Average\_pooling}(\mathcal{E}))$; $\mathbf{\lambda}_{i}=\mathbf{NL}_{\lambda}(l_{i})$
13:   $\text{Predictions}\leftarrow\mathbf{Softmax}(\text{Classifier\_head}(\mathbf{\lambda}_{i}))$
14:   $\text{Loss}\leftarrow\mathbf{Cross\_entropy}(\text{Predictions},\mathbf{b}_{y})$
15:   $\theta\leftarrow\theta+\mathbf{Compute\_update}(\text{Loss})$
16: end for
17: $\mathcal{\xi}_{\text{{SANE}}}\leftarrow\xi(\theta)$ $\triangleright$ Trained SANE
18: return $\mathcal{\xi}_{\text{{SANE}}}$

As a next step in our SANE architecture, we employ a stack of encoders [27] to learn the mapping from the raw data space to a latent space by classifying seen devices; see lines 7-11 in Algorithm 1. After the encoder stack, we adopt an average pooling layer to aggregate the features from both packet-level and sequence-level. Next we add two different dense layers $\mathbf{NL}_{l}$ and $\mathbf{NL}_{\lambda}$ to $\mathcal{C}$ for deriving latent features $\mathbf{L}$ and $\mathbf{\Lambda}$, respectively. Finally, we pass the latent features to a classifier head and use $\mathbf{Softmax}$ to get the prediction probabilities per device class. The final step of the forward path is the calculation of the loss. Then we perform back-propagation to compute the weight updates. This procedure is repeated for several epochs until we get our final trained model, $\mathcal{\xi}_{\text{{SANE}}}$.
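To make the forward path of Algorithm 1 concrete, here is a minimal NumPy sketch of a single-sequence forward pass (no training, no batching). It is not the authors' implementation: the embedding width `D`, head count `H`, encoder depth, latent sizes `M` and `N`, the ReLU in the MLP, and the purely linear dense layers are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f = 200, 8                        # sequence length and packet features (from the text)
D, H, DEPTH = 32, 4, 2               # assumed embedding width, attention heads, encoder count
M, N, CLASSES = 20, 10, 10           # assumed latent sizes and number of seen classes

def dense(d_in, d_out):
    return rng.normal(0, 0.02, (d_in, d_out)), np.zeros(d_out)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    T, dh = x.shape[0], D // H
    split = lambda W: (x @ W).reshape(T, H, dh).transpose(1, 0, 2)   # (H, T, dh)
    q, k, v = split(Wq), split(Wk), split(Wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))        # (H, T, T)
    return (weights @ v).transpose(1, 0, 2).reshape(T, D) @ Wo

params = {
    "embed": dense(f, D),
    "sla": rng.normal(0, 0.02, (1, D)),          # learnable SLA token, randomly initialized
    "pos": rng.normal(0, 0.02, (n + 1, D)),      # learnable positional embedding P
    "encoders": [{"attn": [rng.normal(0, 0.02, (D, D)) for _ in range(4)],
                  "mlp1": dense(D, 2 * D), "mlp2": dense(2 * D, D)}
                 for _ in range(DEPTH)],
    "nl_l": dense(D, M), "nl_lambda": dense(M, N), "head": dense(N, CLASSES),
}

def sane_forward(x, p):
    W, b = p["embed"]
    e = np.vstack([p["sla"], x @ W + b])         # line 5: prepend SLA to the packet embedding
    e = e + p["pos"]                             # line 6: add positional embedding
    for enc in p["encoders"]:                    # lines 7-11: pre-norm encoder blocks
        r1 = multi_head_attention(layer_norm(e), *enc["attn"])
        h = e + r1
        (W1, b1), (W2, b2) = enc["mlp1"], enc["mlp2"]
        r2 = np.maximum(layer_norm(h) @ W1 + b1, 0.0) @ W2 + b2
        e = r2 + h
    pooled = e.mean(axis=0)                      # average pooling over SLA and packet tokens
    l = pooled @ p["nl_l"][0] + p["nl_l"][1]     # line 12: latent feature l_i
    lam = l @ p["nl_lambda"][0] + p["nl_lambda"][1]   # line 12: latent feature lambda_i
    probs = softmax(lam @ p["head"][0] + p["head"][1])  # line 13: class probabilities
    return l, lam, probs

l, lam, probs = sane_forward(rng.normal(size=(n, f)), params)
```

Training (lines 14-15 of Algorithm 1) would add a cross-entropy loss and gradient updates, which in practice is left to an automatic-differentiation framework.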

Algorithm 2 SANE feature and attribute extraction

1: Input: $\mathcal{X}$; $\mathcal{Y}$; $\mathcal{\xi}_{\text{{SANE}}}$
2: Output: $\mathcal{C}$: feature extraction model; $\mathcal{A}$: attribute vector; $\mathbf{L}$: latent features
3: $\mathcal{C}\leftarrow\text{Remove\_classifier\_head}(\mathcal{\xi}_{\text{{SANE}}})$
4: $\mathcal{C}_{l}\leftarrow\text{Remove}\_\mathbf{NL}_{\lambda}(\mathcal{C})$ $\triangleright$ Remove layer $\mathbf{NL}_{\lambda}$
5: $\mathcal{C}_{\lambda}\leftarrow\text{Remove}\_\mathbf{NL}_{l}(\mathcal{C}_{l})$ $\triangleright$ Remove layer $\mathbf{NL}_{l}$
6: for $\mathbf{x}_{i}\in\mathcal{X}$, $\forall\>i=1,2,\cdots,m$ do
7:   $l_{i}=\mathcal{C}_{l}(\mathbf{x}_{i})$ $\triangleright$ Extract feature $l\in\mathbb{R}^{M\times 1}$
8:   $\mathbf{\lambda}_{i}=\mathcal{C}_{\lambda}(\mathbf{x}_{i})$ $\triangleright$ Extract feature $\mathbf{\lambda}\in\mathbb{R}^{N\times 1}$
9: end for
10: $\mathbf{L}\leftarrow\{l_{1},l_{2},\cdots,l_{m}\}$
11: $\mathbf{\Lambda}\leftarrow\{\mathbf{\lambda}_{1},\mathbf{\lambda}_{2},\cdots,\mathbf{\lambda}_{m}\}$
12: $\mathcal{A}\leftarrow\frac{\sum_{i=1}^{m}\mathbf{\lambda}_{i}}{m}$ $\triangleright$ Take average of $\mathbf{\Lambda}$
13: return $\mathcal{C}$, $\mathcal{A}$, $\mathbf{L}$

Feature and attribute extraction (Algorithm 2). In this stage of ZEST, we feed $\mathcal{X}^{d},\forall\>d\in\mathcal{U}$ to $\mathcal{\xi}_{\text{{SANE}}}$, an attention-based encoder trained only on seen classes (the “wise people”), to derive the attribute vectors of the unseen devices (without labels); see Figure 3a. To achieve this, we first remove the classification head of the trained SANE $\mathcal{\xi}_{\text{{SANE}}}$ to get the feature extraction model $\mathcal{C}$. For a device $d\in\Gamma$, we remove the layer $\mathbf{NL}_{\lambda}$ from $\mathcal{C}$ to get $\mathcal{C}_{l}$ for $\mathbf{L}^{d}$ extraction. Then we remove layer $\mathbf{NL}_{l}$ from $\mathcal{C}_{l}$ to get $\mathcal{C}_{\lambda}$ for $\mathbf{\Lambda}^{d}$. The attribute extractor takes the average value of $\mathbf{\Lambda}^{d}$ to compute the attribute vector $\mathcal{A}^{d}$ for device $d$. Lines 6-12 in Algorithm 2 illustrate the process of extracting attribute vectors for both seen and unseen devices.

Algorithm 3 CVAE and SVM training

1: Input: $\mathcal{A}^{s}$; $\mathbf{L}^{s},\forall\>s\in\mathcal{S}$; $\mathcal{A}^{u},\forall\>u\in\mathcal{U}$
2: Output: $\mathcal{F}_{\text{SVM}}$: trained SVM classifier
3: Randomly initialize $\text{Encoder}_{\text{cvae}}$, $\text{Decoder}_{\text{cvae}}$
4: for $\mathbf{b}$: sample a batch from $\mathbf{L}^{s}$ do
5:   $\mu,\sigma\leftarrow\text{Encoder}_{\text{cvae}}(\mathbf{b},\mathcal{A}^{s})$
6:   Sample $\mathbf{z}$ from $\mathcal{N}(\mu,\sigma)$
7:   $\mathbf{\hat{b}}\leftarrow\text{Decoder}_{\text{cvae}}(\mathbf{z},\mathcal{A}^{s})$ $\triangleright$ Reconstruction
8:   Loss $\leftarrow|\mathbf{\hat{b}}-\mathbf{b}|+\text{KL}(\mathcal{N}(\mu,\sigma),\mathcal{N}(0,1))$
9:   $\theta\leftarrow\theta+\mathbf{Compute\_update}(\text{Loss})$
10: end for
11: $\mathcal{D}\leftarrow\text{Decoder}_{\text{cvae}}(\theta)$ $\triangleright$ Trained decoder
12: Sample random noise $\tau$ from $\mathcal{N}(0,1)$
13: $\mathcal{P}_{u}\leftarrow\mathcal{D}(\tau,\mathcal{A}^{u})$; $\mathcal{P}_{s}\leftarrow\mathcal{D}(\tau,\mathcal{A}^{s})$ $\triangleright$ Pseudo data generation
14: $\mathcal{P}=\mathcal{P}_{u}\cup\mathcal{P}_{s}$
15: $\mathcal{F}_{\text{SVM}}\leftarrow\text{SVM\_fit}(\mathcal{P})$ $\triangleright$ SVM training
16: return $\mathcal{F}_{\text{SVM}}$

Generative model training (Algorithm 3). The goal of this stage is to generate pseudo data for unseen IoT devices. For this purpose, we leverage a well-known generative model, namely a CVAE (conditional variational autoencoder). The CVAE contains two parts: an encoder and a decoder. The encoder compresses the latent features conditioned on the attributes of devices. Next, the decoder reconstructs the latent features, based on IoT device attributes. CVAE training is based only on seen devices’ latent features $\mathbf{L}^{s},\forall\>s\in\mathcal{S}$ and attribute vectors $\mathcal{A}^{s},\forall\>s\in\mathcal{S}$; see lines 3-11 in Algorithm 3. For training the CVAE, we use two loss functions, namely the reconstruction loss and the KL divergence [17] between the latent distribution and the standard Gaussian distribution. During the training phase, the decoder $\mathcal{D}$ of the CVAE learns to map attribute vectors to the initial latent features. We use this ability of $\mathcal{D}$ to generate pseudo data $\mathcal{P}_{u}$ of unseen device classes based on attributes $\mathcal{A}^{u},\forall\>u\in\mathcal{U}$, by feeding it random noise $\tau$ sampled from the standard Gaussian distribution. This unseen pseudo data is then used in the next stage to train a supervised classifier. During our experiments, we found that the final classifier is biased towards seen classes when we use real seen device data, an observation that has already been made in [17]. Therefore, we generate an equal number of pseudo data $\mathcal{P}=\mathcal{P}_{u}\cup\mathcal{P}_{s}$ for devices from both seen and unseen classes. These steps are illustrated in Figure 3b.
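The two loss terms and the pseudo-data step can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the encoder is assumed to output a mean and a log-variance of a diagonal Gaussian, the reconstruction term is taken as an L1 distance (following line 8 of Algorithm 3), conditioning is done by concatenating the attribute vector to the noise, and `decoder` is a hypothetical callable standing in for the trained $\mathcal{D}$.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def cvae_loss(b, b_hat, mu, log_var):
    """Batch mean of L1 reconstruction error plus KL term."""
    recon = np.sum(np.abs(b_hat - b), axis=-1)
    return float(np.mean(recon + kl_to_standard_normal(mu, log_var)))

def generate_pseudo(decoder, attribute, count, z_dim, rng):
    """Generate `count` pseudo latent features for one device from its attribute vector."""
    tau = rng.standard_normal((count, z_dim))        # noise from the standard Gaussian
    cond = np.tile(attribute, (count, 1))            # same attribute for every sample
    return decoder(np.hstack([tau, cond]))           # conditioning by concatenation (assumed)

rng = np.random.default_rng(0)
z = reparameterize(np.zeros((4, 3)), np.zeros((4, 3)), rng)
toy_decoder = lambda h: h[:, :2]                     # hypothetical stand-in for a trained decoder
pseudo = generate_pseudo(toy_decoder, np.array([0.1, 0.2, 0.3]), count=5, z_dim=4, rng=rng)
```

Calling `generate_pseudo` the same number of times for every seen and unseen device gives the balanced pseudo data set described above.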

Supervised classifier training. After generating labeled data for both seen and unseen devices, we employ a final supervised classifier, support vector machine (SVM), to predict the IoT device label; see Figure 3c. Finally, we test the performance of the trained SVM model in classifying real traffic data from both seen and unseen devices.

IV Performance evaluation

In this section, we evaluate ZEST and compare it with existing semi-supervised learning approaches. Typically, the performance evaluation happens in two different settings: one is ZSL, and another one is GZSL (generalized zero-shot learning). In both settings, the training set contains only the labeled data of seen classes, $\{\mathcal{X}^{d},\mathcal{Y}^{d},\forall\>d\in\mathcal{S}\}$ . The objective of ZSL is to classify the unseen devices $\mathcal{U}$ , which are not seen during the training stage. Different from ZSL, the test data of GZSL come from both seen $\mathcal{S}$ and unseen devices $\mathcal{U}$ . To sum up, the goals for ZSL and GZSL are to learn the classifications: $f_{\text{ZSL}}:\mathcal{X}\rightarrow\mathcal{U}$ and $f_{\text{GZSL}}:\mathcal{X}\to\mathcal{U}\cup\mathcal{S}$ , respectively.
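The difference between the two settings can be illustrated with a short sketch. It assumes each test sample comes with a per-class score from some classifier and uses plain accuracy as the metric; the function names, the toy scores, and the choice of metric are illustrative assumptions, not the evaluation protocol of the paper.

```python
def predict(scores, candidates):
    """Pick the highest-scoring label among the allowed candidate classes."""
    return max(sorted(candidates), key=lambda label: scores[label])

def evaluate(samples, seen, unseen):
    """samples: list of (true_label, {label: score}) pairs. Returns (ZSL, GZSL) accuracy."""
    # ZSL: test data and label space are restricted to the unseen classes U.
    zsl = [(y, predict(s, unseen)) for y, s in samples if y in unseen]
    # GZSL: test data and label space cover both seen and unseen classes.
    gzsl = [(y, predict(s, seen | unseen)) for y, s in samples]
    acc = lambda pairs: sum(y == p for y, p in pairs) / len(pairs)
    return acc(zsl), acc(gzsl)

seen, unseen = {"a"}, {"b", "c"}                     # hypothetical class names
samples = [
    ("b", {"a": 0.9, "b": 0.5, "c": 0.1}),           # unseen sample pulled towards seen class "a"
    ("a", {"a": 0.8, "b": 0.1, "c": 0.1}),
    ("c", {"a": 0.1, "b": 0.2, "c": 0.7}),
]
zsl_acc, gzsl_acc = evaluate(samples, seen, unseen)
```

In this toy case the first sample is classified correctly under ZSL but is misassigned to a seen class under GZSL, which is why GZSL is usually the harder setting.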

IV-A Dataset for evaluation

We carry out evaluations on the publicly available UNSW 2018 dataset [39]. The appeal of this dataset stems from the fact that it is relatively large (27 GB of data) and contains a good number of IoT devices (28 device types). The UNSW dataset is composed of raw pcap data collected over a period of 61 days. We ignore the gateway device, as it only acts as an intermediary between IoT devices and the Internet. Due to the imbalance in data per device type, for both ZSL and GZSL, we use only the 12 device classes with the largest number of data points; of these, 10 randomly chosen devices form the seen category, and the rest form the unseen category. The dataset contains more than a million data points across the 12 devices, with a minimum of 14,034 data points and a maximum of 197,876 data points per device. We do not include the unseen devices in the first supervised training step. The seen and unseen devices are selected randomly five times, and we report the average result over all five experimental runs.
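
The repeated random split can be sketched as below. This is a minimal illustration of the protocol described above; the `evaluate` callback, the seed, and the placeholder device names are assumptions and do not come from the released code.

```python
import random
from statistics import mean


def split_seen_unseen(classes, n_unseen, rng):
    """Randomly pick `n_unseen` classes as unseen; the rest are seen."""
    unseen = set(rng.sample(sorted(classes), n_unseen))
    seen = set(classes) - unseen
    return seen, unseen


def average_over_runs(classes, n_unseen, evaluate, n_runs=5, seed=0):
    """Average a metric over `n_runs` random seen/unseen splits.

    evaluate: hypothetical callable (seen, unseen) -> accuracy for that split
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        seen, unseen = split_seen_unseen(classes, n_unseen, rng)
        scores.append(evaluate(seen, unseen))
    return mean(scores)


# Toy usage with 12 placeholder device names and a dummy metric.
devices = [f"device_{i}" for i in range(12)]
seen, unseen = split_seen_unseen(devices, 2, random.Random(0))
dummy = average_over_runs(devices, 2, lambda s, u: len(u) / (len(s) + len(u)))
```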

IV-B Baselines for comparisons

We compare ZEST with the baselines on the benchmark dataset. All training and testing of the baselines and ZEST are performed on an Nvidia Tesla T4 GPU. As Bi-LSTM has been shown to have excellent performance in supervised IoT fingerprinting [7], we use Bi-LSTM as the feature extractor for all the baselines. Specifically, sequences of packets are transformed to a lower-dimensional feature vector $\mathbf{L}\in\mathbb{R}^{20\times 1}$ using a Bi-LSTM model. These feature vectors are then provided as input to the respective models of the baselines. Therefore, just like ZEST, all baselines take the transformed and reduced feature vectors as input, which ensures that the comparisons are fair and made at the algorithmic level. Note that the Bi-LSTM model, as well as the SANE model for ZEST, is trained on the seen classes only. We now describe the baselines:

1. VAE-K: This was originally proposed as an unsupervised learning method in [9], employing a VAE to extract features and subsequently applying k-means to perform clustering. In our implementation, the VAE encoder compresses the features $\mathbf{L}$ generated by the Bi-LSTM model to the same dimension as $\mathbf{\Lambda}$, another latent feature with a lower dimension. It then employs k-means to cluster the data into 12 groups, the same as the total number of seen and unseen devices. Since the Bi-LSTM model is trained on seen classes, the model is now semi-supervised, and therefore also more capable than its unsupervised counterpart.

2. SeqCR: This is a semi-supervised sequence clustering model. Rather than using the feature vectors $\mathbf{L}$ generated by Bi-LSTM directly, it adds one more layer to extract $\mathbf{\Lambda}$ from the initial packet sequences. Thus, without a VAE, we still obtain feature vectors of a much lower dimension, $\mathbf{\Lambda}$. Subsequently, we apply k-means clustering to assign the latent features $\mathbf{\Lambda}$ to 12 groups. The cluster centers are randomly initialized.

3. SeqCS: This is an extension of SeqCR and a semi-supervised sequence-based clustering model. Unlike SeqCR, it initializes the k-means cluster centers using the attribute vector $\mathcal{A}$, which is the average value of $\mathbf{\Lambda}$ (from both seen and unseen devices), and uses seeded k-means [31] for clustering. This helps improve the classification performance since the attribute vectors contain auxiliary information for each cluster.

4. DEFT: This is a semi-supervised learning method proposed in [10], and can be seen as an extension of SeqCS. After obtaining initial clustering results from SeqCS, DEFT trains a random forest classifier on the clustered dataset, using $\mathbf{\Lambda}$ together with the cluster labels. However, if the initial clustering done by SeqCS is of poor quality, it can negatively impact the final performance.
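
To illustrate the difference between random and seeded initialization in the clustering baselines, the following is a minimal sketch of k-means with externally supplied initial centers. It assumes the seeds are used only to initialize the centers, after which standard Lloyd iterations run; the function names and toy points are hypothetical, and the baselines' actual implementations may differ.

```python
import math


def mean_vector(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def nearest(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def seeded_kmeans(points, seed_centers, n_iter=20):
    """k-means whose initial centers are given (e.g. per-class attribute vectors)."""
    centers = [list(c) for c in seed_centers]
    assignment = []
    for _ in range(n_iter):
        new_assignment = [nearest(p, centers) for p in points]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are too
        assignment = new_assignment
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean_vector(members)
    return assignment, centers


# Toy usage: two well-separated groups, one seed near each.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centers = seeded_kmeans(points, [[1.0, 1.0], [9.0, 9.0]])
```

Because each cluster index is tied to the seed that initialized it, the cluster assignment can be read directly as a device label, which is what lets a seeded variant act as a classifier rather than a pure clustering method.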

We make the code of ZEST and all the baselines public [40].

IV-C Results

To the best of our knowledge, our work is the first to apply the generative ZSL framework on network traffic. Therefore, besides studying the efficacy of ZEST for IoT device classification, we conduct various experiments: we carry out ablation experiments to investigate the efficiency of the SANE model, we perform SANE hyper-parameter tuning, and we analyze the ability of ZEST to handle varying numbers of unseen devices.

IV-C1 Comparison of baselines vs. ZEST

We evaluate the performance using two evaluation metrics: ZSL accuracy and GZSL accuracy. We consistently use a 5:1 ratio of seen to unseen devices (10 seen, 2 unseen) in our experiments. As depicted in Figure 4, in the ZSL setting, SeqCS achieves the best accuracy of $\sim 64\%$ among the baselines. ZEST achieves superior accuracy of $\sim 93\%$ , meaning we get nearly 30% absolute improvement with ZEST over the best performing baseline. As for GZSL, the accuracy of ZEST is $\sim 92\%$ , which is an absolute 10% improvement when compared to the next best-performing baseline DEFT.

IV-C2 Analysis of SANE architecture

We investigate the impact of the number of stacked encoders, denoted as $e$, and the number of attention heads, denoted as $h$, in our SANE model. Using a stack of encoders helps the SANE model learn more complex relationships, similar to a CNN with many convolution layers. Within one encoder, the attention mechanism is applied multiple times in parallel, giving a so-called multi-head attention layer. This mechanism enables the model to jointly attend to different patterns, while also offering ample opportunity for parallelization during training [25].
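
The idea of applying attention several times in parallel can be sketched as follows. This is a deliberately simplified illustration and not the SANE architecture: the learned query, key, value, and output projections of [25] are omitted, so each head simply runs scaled dot-product self-attention on its own slice of the feature dimension. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of row vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out


def multi_head_self_attention(x, n_heads):
    """Slice the feature dimension into heads, attend per head, concatenate.

    Learned projections are omitted here (a simplifying assumption).
    """
    d_model = len(x[0])
    if d_model % n_heads != 0:
        raise ValueError("feature dimension must be divisible by n_heads")
    d_head = d_model // n_heads
    heads = []
    for h in range(n_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention(part, part, part))  # heads are independent of each other
    return [sum((head[t] for head in heads), []) for t in range(len(x))]


# Toy usage: a sequence of two "packets" with four features each, split into two heads.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_self_attention(x, n_heads=2)
```

Since no head reads another head's output, the loop over heads can run in parallel, which is the property referred to above.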

Initially, we set the number of attention heads in SANE to $8$ and vary the encoder stack size over $e\in\{1,2,4,6,8\}$. The batch size is set to 64, and we train for 20 epochs with a learning rate of 0.0005. Our analysis considers resource consumption (parameter size and training time) as well as the supervised training accuracy of the different settings. The results presented in Table I show that increasing the number of encoders leads to a roughly proportional increase in training time and parameter size. However, the prediction accuracy remains around $98.8\%$ when the number of encoders ranges from $2$ to $8$. Hence, after considering the trade-off between resources and accuracy, we set $e=2$.

TABLE I: Comparison of different encoder stack sizes

Number of encoders	Parameter size (MB)	Training time per epoch (min)	Accuracy (%)
1	2.81	2.10	98.23
2	5.56	3.90	98.84
4	11.09	7.85	98.76
6	16.61	11.10	98.81
8	22.14	14.90	98.86
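
The near-linear growth in resources can be checked directly from the numbers in Table I. The snippet below is a small worked calculation on the table values; the helper name is hypothetical.

```python
# Values copied from Table I.
encoders = [1, 2, 4, 6, 8]
param_mb = [2.81, 5.56, 11.09, 16.61, 22.14]
minutes_per_epoch = [2.10, 3.90, 7.85, 11.10, 14.90]
accuracy = [98.23, 98.84, 98.76, 98.81, 98.86]


def per_encoder_increment(xs, ys):
    """Slope between consecutive table rows: change in y per added encoder."""
    return [(y1 - y0) / (x1 - x0) for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])]


param_steps = per_encoder_increment(encoders, param_mb)
time_steps = per_encoder_increment(encoders, minutes_per_epoch)
accuracy_spread = max(accuracy[1:]) - min(accuracy[1:])  # spread for e >= 2
```

Each added encoder costs roughly 2.75 to 2.77 MB of parameters and between about 1.6 and 2.0 minutes per epoch, while the accuracy for $e\geq 2$ varies by only about 0.1 percentage points.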

Having fixed the encoder stack size, we now vary the number of attention heads over $h\in\{1,2,4,8\}$. The results presented in Table II show that increasing the number of attention heads leads to higher accuracy. This is because more attention heads enable the model to capture a broader range of aspects in the traffic pattern. Therefore, we choose $h=8$. To conclude our hyperparameter analysis, we find that the optimal encoder stack size is relatively small at $e=2$. On the other hand, the multi-head attention layers should be relatively large, each containing $h=8$ parallel attention heads.

TABLE II: Comparison of different number of attention heads in SANE

Number of attention heads	Parameter size (MB)	Training time per epoch (min)	Accuracy (%)
1	1.24	1.42	97.81
2	1.86	1.68	98.36
4	3.09	2.37	98.45
8	5.56	3.90	98.84

IV-C3 Comparison of SANE vs. LSTM

Here, we first investigate the impact of using SANE in comparison to a Bi-LSTM for feature extraction in supervised learning. We consider all 28 classes and create a train/val/test data split in the ratio 60:20:20, via random shuffling. Table III presents the prediction accuracy and inference time of these two models. From the results, we observe that the SANE model achieves a slightly higher accuracy with a much lower inference time than the Bi-LSTM model. To further study the effectiveness of using SANE, we conduct another experiment on the baseline models, where we replace the underlying Bi-LSTM sequence models with the SANE architecture.

TABLE III: Comparison of SANE and Bi-LSTM

Method	Bi-LSTM	SANE
Classification Accuracy (%)	97.46	98.84
Inference Time (ms)	0.59	0.12

Figure 5 illustrates the ZSL and GZSL accuracy when feature extraction is done by Bi-LSTM and SANE. In the ZSL setting, SANE-based models yield an average absolute improvement of approximately 20% over various baselines, with a maximum improvement of about 40% observed for DEFT. In the GZSL setting, SANE-based models achieve an average absolute accuracy gain of about 5%, compared to Bi-LSTM. These results suggest that SANE is capable of extracting more informative features, leading us to select it as our preferred feature extraction method in ZEST. Additionally, as is evident from Figure 4 and Figure 5, ZEST also outperforms SANE-based baselines. ZEST brings an average absolute improvement of about 10% and 5% for ZSL and GZSL settings, respectively, when compared with the best SANE-based baseline, DEFT.

IV-C4 Comparison of different attribute dimensions

Here we search for the optimal attribute dimension for ZEST. To assess the quality of the reconstructed data from the CVAE, we vary the dimension of the attribute vector $\mathcal{A}$; specifically, we use $|\mathcal{A}|\in\{2,3,4,5\}$. As shown in Figure 6a, the best performance is achieved when the dimension is 3 in both ZSL and GZSL settings. This trend of increasing accuracy followed by decreasing accuracy is reasonable: while a higher-dimensional attribute vector represents more traffic information, it can also make the mapping between attributes and traffic data more challenging.

IV-C5 Varying the number of unseen classes

We investigate ZEST’s ability to handle multiple unseen classes. In Figure 6b, we compare its performance when varying the number of unseen devices from 1 to 4, for a total of 12 classes. As the number of unseen classes increases, both ZSL and GZSL accuracy values tend to decrease. When the ratio of unseen to seen classes changes from $2/10$ to $3/9$, there is a sharp decrease for both ZSL and GZSL. Lower ratios of unseen classes to seen classes facilitate a more effective mapping of attribute vectors to traffic data. In other words, ZEST requires a significant number of devices in the seen category to classify unseen devices. A rigorous validation of this observation using 50-100 devices is left for a separate study. In practice, however, given that MAC addresses of devices remain static over a duration, the model can effectively classify traffic from a small number of unseen IoT devices.

V Conclusions

In this work, we develop a novel zero-shot learning (ZSL) framework, called ZEST, for IoT fingerprinting. As the first attempt to apply ZSL to network traffic modeling, we analyze the effectiveness of a self-attention based model for extracting traffic features within this framework. Our experiments show that SANE yields higher accuracy, lower inference time, and better feature extraction ability in comparison to Bi-LSTM. Since ZSL relies on class-specific attributes, which are typically not present for traffic-based IoT device classification, we propose a novel attention-based approach, i.e., SANE, to automatically compute attributes for any IoT device. Finally, we compare our ZEST framework with four baselines from both the unsupervised and semi-supervised domains. Our results demonstrate that ZEST achieves state-of-the-art performance, with an absolute accuracy improvement of about 30% and 10% for ZSL and GZSL, respectively.

References

[1] M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis et al., “Understanding the Mirai Botnet,” in Proc. USENIX Security, 2017, pp. 1093–1110.
[2] S. Herwig, K. Harvey, G. Hughey, R. Roberts, and D. Levin, “Measurement and Analysis of Hajime: a Peer-to-peer IoT Botnet,” in Proc. NDSS, 2019.
[3] B. Bezawada, M. Bachani, J. Peterson, H. Shirazi, I. Ray, and I. Ray, “Behavioral fingerprinting of IoT devices,” in Proc. of the workshop on attacks and solutions in hardware security, 2018, pp. 41–50.
[4] N. J. Apthorpe, D. Reisman, and N. Feamster, “A Smart Home is No Castle: Privacy Vulnerabilities of Encrypted IoT Traffic,” CoRR, vol. abs/1705.06805, 2017.
[5] M. Miettinen, S. Marchal, I. Hafeez, N. Asokan, A.-R. Sadeghi, and S. Tarkoma, “IoT SENTINEL: Automated Device-Type Identification for Security Enforcement in IoT,” in 37th IEEE International Conference on Distributed Computing Systems, ICDCS, 2017, pp. 2177–2184.
[6] A. Sivanathan, D. Sherratt, H. H. Gharakheili, A. Radford, C. Wijenayake, A. Vishwanath, and V. Sivaraman, “Characterizing and classifying IoT traffic in smart cities and campuses,” in IEEE International Conference on Computer Communications Workshops, INFOCOM, 2017, pp. 559–564.
[7] S. Dong, Z. Li, D. Tang, J. Chen, M. Sun, and K. Zhang, “Your Smart Home Can’t Keep a Secret: Towards Automated Fingerprinting of IoT Traffic,” in Proc. AsiaCCS, 2020, pp. 47–59.
[8] F. Sawadogo, J. Violos, A. Hameed, and A. Leivadeas, “An Unsupervised Machine Learning Approach for IoT Device Categorization,” in IEEE International Mediterranean Conference on Communications and Networking (MeditCom), 2022, pp. 25–30.
[9] S. Zhang, Z. Wang, J. Yang, D. Bai, F. Li, Z. Li, J. Wu, and X. Liu, “Unsupervised IoT Fingerprinting Method via Variational Auto-encoder and K-means,” in IEEE ICC, 2021.
[10] V. Thangavelu, D. M. Divakaran, R. Sairam, S. S. Bhunia, and M. Gurusamy, “DEFT: A Distributed IoT Fingerprinting Technique,” IEEE Internet of Things Journal, vol. 6, no. 1, pp. 940–952, 2019.
[11] B. Atul Desai, D. M. Divakaran, I. Nevat, G. W. Peters, and M. Gurusamy, “A feature-ranking framework for IoT device classification,” in 11th Int’l Conf. on Communication Systems & Networks (COMSNETS 2019), Jan. 2019.
[12] R. Trimananda, J. Varmarken, A. Markopoulou, and B. Demsky, “Packet-Level Signatures for Smart Home Devices,” in Proc. NDSS, 2020.
[13] B. Chakraborty, D. M. Divakaran, I. Nevat, G. W. Peters, and M. Gurusamy, “Cost-aware Feature Selection for IoT Device Classification,” IEEE Internet of Things Journal, 2021.
[14] L. Fan, S. Zhang, Y. Wu, Z. Wang, C. Duan, J. Li, and J. Yang, “An IoT Device Identification Method based on Semi-supervised Learning,” in 16th International Conference on Network and Service Management (CNSM), 2020, pp. 1–7.
[15] A. Shenoi, P. K. Vairam, K. Sabharwal, J. Li, and D. M. Divakaran, “iPET: Privacy Enhancing Traffic Perturbations for Secure IoT Communications,” Proc. Privacy Enhancing Technologies Symposium (PETS), 2023.
[16] S. Liu, X. Zhu, H. Chen, and Z. Han, “Secure Communication for Integrated Satellite-Terrestrial Backhaul Networks: Focus on Up-link Secrecy Capacity based on Artificial Noise,” IEEE Wireless Communications Letters, pp. 1–1, 2023.
[17] A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy, “A Generative Model for Zero Shot Learning Using Conditional Variational Autoencoders,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops, 2018, pp. 2188–2196.
[18] V. K. Verma, D. Brahma, and P. Rai, “Meta-Learning for Generalized Zero-Shot Learning,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6062–6069.
[19] B. Zhao, X. Sun, Y. Yao, and Y. Wang, “Zero-shot Learning via Shared-Reconstruction-Graph Pursuit,” arXiv preprint arXiv:1711.07302, 2017.
[20] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD birds 200,” 2010.
[21] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2013.
[22] D. M. Divakaran, S. Le, Y. S. Liau, and V. L. L. Thing, “SLIC: Self-Learning Intelligent Classifier for Network Traffic,” Computer Networks, 2015.
[23] I. Nevat, D. M. Divakaran, S. G. Nagarajan, P. Zhang, L. Su, L. L. Ko, and V. L. L. Thing, “Anomaly Detection and Attribution in Networks With Temporally Correlated Traffic,” IEEE/ACM Transactions on Networking, vol. 26, no. 1, pp. 131–144, Feb 2018.
[24] K. L. K. Sudheera, D. M. Divakaran, R. P. Singh, and M. Gurusamy, “ADEPT: Detection and Identification of Correlated Attack Stages in IoT Networks,” IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6591–6607, 2021.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Proc. NIPS, 2017.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019, pp. 4171–4186.
[27] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in 9th International Conference on Learning Representations, ICLR, 2021.
[28] N. Q. K. Le, Q.-T. Ho, T.-T.-D. Nguyen, and Y.-Y. Ou, “A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information,” Briefings in Bioinformatics, vol. 22, no. 5, p. bbab005, 2021.
[29] B. Wu, D. Chen, N. V. Abhishek, and M. Gurusamy, “D3T: Double Deep Q-Network Decision Transformer for Service Function Chain Placement,” in IEEE 24th International Conference on High Performance Switching and Routing (HPSR), 2023, pp. 167–172.
[30] A. Sivanathan, H. H. Gharakheili, and V. Sivaraman, “Inferring IoT Device Types from Network Behavior Using Unsupervised Clustering,” in 2019 IEEE 44th Conference on Local Computer Networks (LCN), 2019, pp. 230–233.
[31] S. Basu, A. Banerjee, and R. J. Mooney, “Semi-supervised Clustering by Seeding,” in Machine Learning, Proceedings of the Nineteenth International Conference (ICML), 2002, pp. 27–34.
[32] D. Xu and Y. Tian, “A comprehensive survey of clustering algorithms,” Annals of Data Science, vol. 2, pp. 165–193, 2015.
[33] H. He, Y. He, F. Wang, and W. Zhu, “Improved k-means algorithm for clustering non-spherical data,” Expert Systems, vol. 39, no. 9, p. e13062, 2022.
[34] C. Yuan and H. Yang, “Research on K-value selection method of K-means clustering algorithm,” J, vol. 2, no. 2, pp. 226–235, 2019.
[35] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata, “Feature Generating Networks for Zero-Shot Learning,” in Proc. CVPR, 2018, pp. 5542–5551.
[36] R. Gao, X. Hou, J. Qin, J. Chen, L. Liu, F. Zhu, Z. Zhang, and L. Shao, “Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning,” IEEE Transactions on Image Processing, vol. 29, pp. 3665–3680, 2020.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big Bird: Transformers for Longer Sequences,” Advances in Neural Information Processing Systems, vol. 33, pp. 17283–17297, 2020.
[39] A. Sivanathan, H. H. Gharakheili, F. Loi, A. Radford, C. Wijenayake, A. Vishwanath, and V. Sivaraman, “Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics,” IEEE Transactions on Mobile Computing, vol. 18, no. 8, pp. 1745–1759, 2018.
[40] “ZEST Source Code,” https://github.com/Binghui99/ZEST, 2023.