License: arXiv.org perpetual non-exclusive license
arXiv:2310.08036v2 [cs.NI] 12 Jan 2024

ZEST: Attention-based Zero-Shot Learning for Unseen IoT Device Classification

Binghui Wu, Philipp Gysel, Dinil Mon Divakaran, and Mohan Gurusamy

Binghui Wu and Mohan Gurusamy (Senior Member, IEEE) are with National University of Singapore (NUS); e-mail-id: [email protected], [email protected]. Philipp Gysel and Dinil Mon Divakaran (Senior Member, IEEE) are with Acronis Research; e-mail-id: [email protected], [email protected].

Abstract

Recent research works have proposed machine learning models for classifying IoT devices connected to a network. However, there is still a practical challenge of not having all devices (and hence their traffic) available during the training of a model. This essentially means, during the operational phase, we need to classify new devices not seen in the training phase. To address this challenge, we propose ZEST—a ZSL (zero-shot learning) framework based on self-attention for classifying both seen and unseen devices. ZEST consists of i) a self-attention based network feature extractor, termed SANE, for extracting latent space representations of IoT traffic, ii) a generative model that trains a decoder using latent features to generate pseudo data, and iii) a supervised model that is trained on the generated pseudo data for classifying devices. We carry out extensive experiments on real IoT traffic data; our experiments demonstrate i) ZEST achieves significant improvement (in terms of accuracy) over the baselines; ii) SANE is able to better extract meaningful representations than LSTM which has been commonly used for modeling network traffic.

Index Terms:
IoT, fingerprinting, zero-shot learning (ZSL), network traffic, attention, security, transformer

I Introduction

Offices, homes, and enterprises in various industry verticals have numerous IoT devices connected to their networks, including smart thermostats, hubs, lighting systems, alarms, TVs, and wearable devices. While IoT devices offer new and efficient services, they also present security threats. Currently, manufacturers do not follow a standard framework to announce the device identities and their functionalities. The lack of standardization often results in vulnerabilities left open for different kinds of attacks [1, 2]. An important first step in securing IoT devices is to identify the different types of devices operating in a home/office environment. The difficulty arises from the constantly evolving landscape of IoT devices and their network behaviors. Traditional static methods struggle to adapt to changing device behaviors, recognize unknown devices, and capture complex communication patterns [3]. Consequently, monitoring network traffic dynamically is now the most practical method for identifying devices and ensuring their security.

IoT fingerprinting is a well-studied problem [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], but there are still open challenges in practical settings. Many existing works take a supervised approach, thereby dealing only with known devices [4, 5, 6, 7, 13]. With the number of IoT devices expected to grow to tens of billions, new device types will continue to enter the market, making it impractical to assume that traffic of all devices will be available in advance to train a machine learning model. The challenge remains to identify devices not present in the training set, which we refer to as “unseen” devices. Conversely, “seen” devices refer to those devices that have labeled examples available during the model’s training phase. We need a system that can classify unseen IoT devices, in addition to seen devices.

Zero-shot learning (ZSL) could help to classify unseen devices. ZSL is known to work well for image classification: it leverages textual descriptions as attributes to relate unseen classes to seen classes [17, 18, 19]. ZSL on images involves the use of specific attributes obtained through manual or automatic annotation. For instance, datasets such as Caltech-UCSD Birds-200-2011 (CUB-200) [20] and Animals with Attributes (AwA) [21] provide pre-extracted feature representations for image descriptions. Given the information that a giraffe is a herbivore, has a long neck, has brown spots, and has ossicones on its head, one can easily distinguish it from other animals such as pigs or cows, even without having seen a giraffe before. By leveraging the semantic relationships between different classes, the ZSL approach enables the model to recognize unseen classes by inferring their attributes in association with other classes. The question we ask in this research is: can ZSL be leveraged to classify both seen and unseen IoT devices by mapping network traffic data to an attribute space? Figure 1 illustrates this concept.


Figure 1: Idea of ZSL in IoT fingerprinting

For IoT devices, we have textual descriptions, e.g., information on the product webpages or user manuals; but they do not directly translate into the domain of network traffic data. Defining attributes in the IoT domain is non-trivial, and it plays a critical role in model performance. The primary challenge in ZSL for IoT devices is the definition of suitable attributes. These attributes should possess sufficient information density to accurately represent and distinguish the traffic patterns of individual devices. In the context of network traffic analysis, one must examine the intricacies associated with packets. Network packets are characterized by a multitude of information, including IP addresses, service ports, transport protocols, inter-arrival time, etc. As data flows through the network, the accumulation of the packets results in extremely long sequences; packet sequences in hundreds or thousands or even more are common with different applications (such as browsing, email, SSH, etc.). Furthermore, the challenge posed by high dimension becomes evident when considering the diverse set of features and their potential combinations with packet sequences. This results in the creation of a high-dimensional feature space, in which each unique feature introduces an additional dimension.

Traditional statistical models and ML algorithms, such as Support Vector Machine (SVM), Linear Regression, k-NN, and Decision Trees, have been proposed for learning network traffic behaviors (e.g., see [22, 23, 24]). However, such models encounter significant difficulties in effectively processing and extracting meaningful patterns from such intricate and high-dimensional data. To address the computational challenges inherent in dealing with data of long sequences and high dimensions, specialized sequence models have emerged as an effective approach. In particular, transformer models [25] have shown excellent performance in language modeling [26], image classification [27], DNA analysis [28], resource allocation [29], etc. The self-attention mechanism in transformers enables parallel evaluation with each token of the input sequence, thus eliminating the sequential dependency present in prior sequence models such as recurrent neural networks (RNNs). Transformers consider the entire context, rather than relying solely on local information, enabling a deeper understanding of context and dependencies. Building upon the encoder of the transformer architecture, we develop SANE, a self-attention mechanism designed to comprehend traffic patterns and autonomously generate concise attributes for IoT devices (Section III-C).

In this work, we propose ZEST, a zero-shot learning framework based on the self-attention mechanism for IoT fingerprinting. ZEST involves i) training a self-attention based network feature extractor, i.e., SANE, to extract features and attribute vectors of devices, subsequently ii) training a generative model to map attribute vectors to traffic data and generate pseudo data for unseen devices, and finally iii) training a supervised classifier with the generated data (as illustrated in Figure 3). The main contributions of this work are:

  1. We introduce ZEST, a ZSL framework for IoT fingerprinting. To the best of our knowledge, it is the first generative ZSL framework for IoT fingerprinting. Our work here is also the first to leverage a transformer model for learning network traffic characteristics. Based on the self-attention mechanism, ZEST achieves state-of-the-art performance when compared with other semi-supervised and unsupervised learning methods.

  2. As the use of an attention mechanism for network traffic understanding is new, we study its effects in classifying IoT devices and compare it with existing solutions that use LSTMs for the same purpose. We find that an attention-based mechanism improves the performance of even existing baseline models, making it a better choice for IoT fingerprinting.

  3. We propose a new approach for generating attribute vectors for IoT devices. Unlike image classification tasks, network traffic data of IoT devices do not come with apparent class descriptions. Hence, we leverage pre-trained models to extract attribute vectors of unseen devices, providing a viable and attractive technique to overcome this challenge. We conduct experiments with varying attribute vector dimensions and identify the most suitable dimension for optimal performance.

  4. We make the source code implementations of the models openly available to facilitate research (code is available at https://github.com/Binghui99/ZEST). This includes the implementation of the ZEST framework, LSTM-based baselines, as well as the data processing methods.

In the following, we first present the related literature for IoT fingerprinting and the background of ZSL. In Section III, we present our ZSL framework for identifying seen and unseen devices. Performance evaluations are carried out in Section IV.

II Related Works

II-A IoT fingerprinting

With the rapid growth of the IoT ecosystem, there has been an increasing interest in characterizing and fingerprinting (i.e., classifying) IoT devices. In recent years, several works have proposed IoT traffic analysis methods, and use supervised learning approaches to perform device classification [4, 5, 6, 7]. An interesting work [7] is the application of a sequence model, specifically Bi-LSTM (bi-directional long short-term memory), for modeling traffic of IoT devices. This deep learning model shows a good ability to learn sequence information and achieves high accuracy in IoT device classification.

However, a supervised approach is limited in practice, since we have numerous new unseen devices entering the market regularly. Therefore, researchers proposed unsupervised methods [8, 30, 9] for IoT fingerprinting. Sivanathan et al. [30] extract key features from flow-level network traffic and use PCA (principal component analysis) to project data into lower dimensional space. As a complementary approach, the authors in [9] train a VAE (variational autoencoder) with an encoder and a decoder in an unsupervised way, subsequently leverage the encoder to compress raw data, and finally use k-means for clustering. However, unsupervised learning methods do not use all available information, such as labels of seen devices and their semantic descriptions, thus achieving only modest results (see Section IV-C).

Semi-supervised methods [10, 14, 31, 16], on the other hand, utilize the available information, and they can also deal with unseen classes. The authors in [14] propose a semi-supervised method based on a CNN (convolutional neural network) model and multi-task learning. Given a few labeled data, they train a CNN to transform raw features into dense high-level features, and thus achieve a dimension reduction. An alternative semi-supervised approach called DEFT [10] extracts traffic features and utilizes seeded k-means [31] to conduct unsupervised clustering. Based on the clustered data, DEFT trains a supervised random forest to perform the final classification, using the cluster numbers as labels.

The semi-supervised solutions are often based on clustering [10, 14, 31]. However, clustering has limitations, including the requirement to specify the number of clusters in advance, sensitivity to initial centroids and outliers, and unsuitability for high-dimensional and non-linear data [32, 31, 33, 34]. To overcome these shortcomings, we choose to avoid clustering for our solution. Instead, our ZEST framework uses a generative model for generating unseen data, based on high-level attribute definitions. Then we use the generated data to train a supervised model. To deal with high-dimensional and non-linear data, we use a deep sequence model and extract small-dimensional latent features.

II-B Zero-shot learning (ZSL)

Broadly, there are two approaches for ZSL: embedding-based methods [19, 18] and generative-based methods [17, 35, 36]. Embedding-based methods learn a high-dimensional embedding space that maps the low-level features of seen classes to their corresponding semantic vectors. This approach recognizes new classes by comparing prototypes and predicted representations of data samples in the embedding space. On the other hand, generative-based methods use samples of seen classes and semantic representations of both seen and unseen classes to generate pseudo data for the unseen classes, thus converting a ZSL problem into a supervised learning problem.
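To make the embedding-based idea concrete, the following is a minimal sketch of nearest-prototype classification in a semantic space. It is not taken from any of the cited works: the function names, the use of cosine similarity, and the three-dimensional semantic vectors are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_prototype(embedding, prototypes):
    """Return the class whose semantic prototype is most similar to the embedding."""
    return max(prototypes, key=lambda name: cosine(embedding, prototypes[name]))


# Made-up semantic vectors (hypothetical attributes: long neck, spots, curly tail).
prototypes = {"giraffe": [1.0, 1.0, 0.0], "pig": [0.0, 0.0, 1.0]}

# Predicted representation of a sample from a class never seen in training.
predicted = nearest_prototype([0.9, 0.8, 0.1], prototypes)
```

A generative method, by contrast, would use such class vectors to synthesize training samples and then fit an ordinary supervised classifier.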

In this work, we focus on the generative method. The authors in [17] propose a CVAE-based approach using a generative model to learn the probability distribution of the input space conditioned on the attribute representation of the classes. The generative model is then used to generate samples for the unseen classes based on their attribute vectors. Gao et al. [36] propose Zero-VAE-GAN, which combines VAE and GAN to generate features for novel classes. This approach uses a dual encoder-decoder structure to map data samples into a joint feature space, improving the quality of generated samples. Our proposal ZEST is inspired by [17]; but there are some significant differences. Firstly, the proposal in [17] is for the image domain; therefore, the training data utilized is accompanied by well-defined semantic information. The authors apply word2vec to text descriptions (e.g., from Wikipedia) for different image classes. However, such a method is unsuitable for our problem, since the text description from device manuals cannot be transferred to the network traffic domain. Therefore, we propose a different approach to extract attributes (see Section III-C). Moreover, in the image classification domain, there are highly proficient pre-trained models trained on large datasets, like ResNet50 [37]. In the absence of such pre-trained models, we train our attention-based model from scratch, and employ it to extract features from network traffic. The details of our ZSL pipeline are explained in Section III-D.

II-C Transformer

Transformers, introduced by Vaswani et al. [25], are used for sequence-to-sequence learning tasks. The self-attention mechanism is good at capturing contextual relationships within a sequence, empowering models to extract rich and informative features. Furthermore, it can be efficiently parallelized, making it suitable for modern hardware accelerators like GPUs and TPUs [38]. We leverage the advancements made in transformers to enhance the performance of IoT fingerprinting. The Vision Transformer (ViT) proposed in [27] is a BERT-like [26] model for image classification that achieves superior performance on multiple benchmarks with fewer parameters than competing models, by efficiently modeling long-range dependencies between image patches and global image information. Our attention-based model is inspired by ViT.

However, unlike the input images to ViT, network traffic comes as a data sequence. For our data pre-processing pipeline, we split network traffic into packet sequences of pre-defined length, where packets are represented by a small number of raw features (Section III-B). For the final classification, we use an average pooling layer to get the whole sequence information. Besides, we define a special token to summarize the sequence-level network traffic information. Our design is driven by the need to understand network traffic at both the sequence level and packet level.

III ZEST: model architecture

III-A System definition

We consider a network connecting a set, $\Gamma$, of IoT devices (such as smart cameras, hubs, alarms, etc.). We use $\mathcal{S}$ to denote the set of devices already seen, and $\mathcal{U}$ for the set of unseen devices; the two sets are mutually exclusive, i.e., $\mathcal{S}\cap\mathcal{U}=\emptyset$ and $\mathcal{S}\cup\mathcal{U}=\Gamma$. For our classification system, a single data point $\mathbf{x}$ is defined as a sequence of network packets with length $n$, and each packet has $f$ features, i.e., $\mathbf{x}\in\mathbb{R}^{n\times f}$. For each seen device $d\in\mathcal{S}$, we have a set of $m$ data points $\mathcal{X}^{d}=\{\mathbf{x}_{1}^{d},\mathbf{x}_{2}^{d},\cdots,\mathbf{x}_{m}^{d}\}$, and their corresponding labels $\mathcal{Y}^{d}=\{y_{1}^{d},y_{2}^{d},\cdots,y_{m}^{d}\}$. However, for $d\in\mathcal{U}$, we have data points $\mathcal{X}^{d}$ without the labels. Note that a data point representing a sequence of (say) $n=200$ packets, each with $f=8$ features, results in a feature space of 1,600 dimensions, which is very high.
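As a small worked example of this notation, the sketch below builds one placeholder data point and checks the dimensionality quoted above. The device names and the helper function are hypothetical; only $n=200$ and $f=8$ come from the text.

```python
SEQ_LEN, NUM_FEATURES = 200, 8   # n and f from the example in the text

seen = {"camera", "hub"}         # S: hypothetical seen devices
unseen = {"doorbell"}            # U: hypothetical unseen device
all_devices = seen | unseen      # Gamma is the union of S and U


def empty_data_point(n=SEQ_LEN, f=NUM_FEATURES):
    """Return one data point x as an n-by-f nested list of zeros."""
    return [[0.0] * f for _ in range(n)]


x = empty_data_point()
flattened_dim = len(x) * len(x[0])   # size of x if flattened into one vector
```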

III-B Traffic representation

From the traffic of an IoT device, we extract what we refer to as raw features for each packet in the traffic. Flow-level features capture the packet statistics but in a lossy way. In comparison, per-packet features provide the finest granularity of information in traffic. The raw packet features that are beneficial for device classification are: packet size, time since the last packet, the direction of the packet, transport protocol (TCP/UDP), the application protocol (HTTP/S, DNS, NTP, etc.), TLS version, src/dst IP address category, and src/dst port category. Intuitively, a single packet in itself might not present sufficient information for device classification, e.g., a TCP SYN or SYN+ACK is present in all TCP flows. Therefore, these per-packet features are extracted from non-overlapping fixed-length sequences of packets in a network trace of an IoT device and provided as input to the model, both in the training and inference phases. In our work, we limit the features to:

  • Source and destination IP addresses: Using raw IP addresses would overfit the model to the dataset used, besides creating a large latent space for representation. Instead, we use a binary value indicating whether the source/destination IP address is internal or external to the network of IoT devices.

  • Port representation: Between the source and destination ports, we assume the lower one is the service port, and represent only the service port. The other port is typically a random port, and therefore set to a constant number. The idea is to minimize the influence of ephemeral ports.

  • Transport layer protocol (e.g., UDP or TCP).

  • The time since the previous network packet.

  • The size of the packet.

  • The direction of the packet (inbound/outbound).
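The feature list above can be sketched in code as follows. This is an illustrative reading, not the authors' released pre-processing: the `Packet` fields, the LAN prefix, the constant used for the non-service port, the direction rule, and the ordering of the eight values are all assumptions made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

LOCAL_NET = ip_network("192.168.0.0/16")  # assumption: prefix of the IoT LAN
OTHER_PORT = 0                            # assumption: constant for the non-service port


@dataclass
class Packet:
    ts: float        # capture time in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str       # "TCP" or "UDP"
    size: int        # packet size in bytes


def packet_features(pkt, delta):
    """Map one packet to 8 raw features; delta is the time since the previous packet."""
    src_internal = ip_address(pkt.src_ip) in LOCAL_NET
    dst_internal = ip_address(pkt.dst_ip) in LOCAL_NET
    service = min(pkt.src_port, pkt.dst_port)  # lower port is taken as the service port
    return [
        int(src_internal),
        int(dst_internal),
        pkt.src_port if pkt.src_port == service else OTHER_PORT,
        pkt.dst_port if pkt.dst_port == service else OTHER_PORT,
        1 if pkt.proto == "TCP" else 0,
        delta,
        pkt.size,
        int(src_internal and not dst_internal),  # 1 = outbound (assumed definition)
    ]


def trace_to_rows(packets):
    """Convert a time-ordered packet trace into per-packet feature rows."""
    rows, prev_ts = [], None
    for pkt in packets:
        delta = 0.0 if prev_ts is None else pkt.ts - prev_ts
        rows.append(packet_features(pkt, delta))
        prev_ts = pkt.ts
    return rows


def split_sequences(rows, n):
    """Cut rows into non-overlapping length-n sequences, dropping any remainder."""
    return [rows[i:i + n] for i in range(0, len(rows) - n + 1, n)]


trace = [
    Packet(0.0, "192.168.1.10", "8.8.8.8", 50000, 53, "UDP", 80),
    Packet(0.5, "8.8.8.8", "192.168.1.10", 53, 50000, "UDP", 120),
    Packet(1.0, "192.168.1.10", "8.8.8.8", 50001, 443, "TCP", 60),
]
rows = trace_to_rows(trace)
sequences = split_sequences(rows, 2)   # toy length; the paper uses n = 200
```

Replacing raw addresses and ephemeral ports with coarse categories keeps the model from memorizing values that are specific to one capture environment.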

III-C Attributes

To overcome the lack of meaningful textual descriptions for IoT device traffic data, we adopt a novel approach inspired by attribute-based image classification. As an analogy, attributes of a giraffe are given by the wise people who see the giraffe and describe it based on their experience of describing other animals. Specifically, we train a self-attention model on the traffic data of seen devices as “wise people” to learn the knowledge of describing traffic patterns. Subsequently, when presented with traffic sequences of unseen devices, the model generates a description based on its learned knowledge, even though it has no prior knowledge of the unseen devices. The average description generated by the model is considered as the general attributes of the unseen devices. In this process, the unseen devices come with no label information. We develop and employ a powerful self-attention model based on the encoder of transformer [26] to extract latent features. In Figure 2, we present the architecture of SANE—a self-attention based feature extractor.


Figure 2: Illustration of SANE architecture

The input data consists of sequences of packets with high-dimensional features, which are noisy and sparse, making it challenging for a model to recover meaningful attributes. To facilitate the learning process, we require data with high information density enabling the model to effectively map the data space to the attribute space. Therefore, we define two latent space representations to extract features at different levels for each device $d\in\Gamma$. The first one is $\mathbf{L}\in\mathbb{R}^{M\times 1}$, where $\mathbf{L}^{d}=\{l_{1}^{d},l_{2}^{d},\cdots,l_{m}^{d}\}$, such that $l_{i}^{d}\in\mathbb{R}^{M\times 1}$ is the latent space representation of a single data point $\mathbf{x}_{i}^{d}$ and $m\in\mathbb{N}$ is the number of traffic sessions corresponding to device $d$. The second latent space is defined as $\mathbf{\Lambda}\in\mathbb{R}^{N\times 1}$, where $\mathbf{\Lambda}^{d}=\{\lambda_{1}^{d},\lambda_{2}^{d},\cdots,\lambda_{m}^{d}\}$, such that $\lambda_{i}^{d}\in\mathbb{R}^{N\times 1}$ is another latent feature corresponding to data point $\mathbf{x}_{i}^{d}$. Based on the second latent feature, we define the attribute vector, $\mathcal{A}^{d}\in\mathbb{R}^{N\times 1}$, for both seen and unseen devices, which are embedded into a vector space representing the semantic relationship of different devices. We use the average value of $\mathbf{\Lambda}^{d}$ as the attribute vector $\mathcal{A}^{d}$ of the corresponding device $d$, i.e., $\mathcal{A}^{d}=\frac{\sum_{i=1}^{m}\lambda_{i}^{d}}{m}$.
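The averaging step can be illustrated with a few lines of code. This is a minimal sketch: the device names and the latent values are made up, and in ZEST the vectors $\lambda_{i}^{d}$ would come from the trained feature extractor rather than being written by hand.

```python
def attribute_vector(lambdas):
    """Element-wise mean of a device's latent vectors lambda_1 .. lambda_m."""
    m = len(lambdas)
    if m == 0:
        raise ValueError("need at least one latent vector")
    return [sum(column) / m for column in zip(*lambdas)]


# Hypothetical latent vectors with N = 3; the values are illustrative only.
latents = {
    "camera": [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]],   # m = 2 data points
    "doorbell": [[0.0, 4.0, 1.0]],                  # m = 1 data point
}
attributes = {device: attribute_vector(vecs) for device, vecs in latents.items()}
```

Because only the mean over a device's data points is needed, no per-sample labels are required to form the attribute vector.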

III-D Architecture design

The architecture of ZEST is depicted in Figure 3. It consists of four phases: feature extractor training, feature and attribute extraction, generative model training, and training of a final supervised classifier.

(a) SANE training and attribute extraction
(b) Generative model training (CVAE)
(c) Pseudo data generation (CVAE) & supervised classifier training
Figure 3: ZEST architecture

SANE training (Algorithm 1). SANE is used to transform the input data into a latent feature space, as shown in Figure 3a. The deep sequence classifier is pre-trained on a significant amount of seen device data. After comparing well-known sequence models (namely, LSTM and transformer), we find that the transformer model has a better ability to extract traffic patterns (see Section III-C) as well as the lowest inference time. Therefore, we design SANE, based on the transformer encoder, as the feature extractor for ZEST. The raw data for seen devices are transformed into packet sequences with a sequence length of $n = 200$ packets. Recall from Section III-B that each packet is represented using $f = 8$ features. The steps of SANE training are listed in Algorithm 1.

First, we project packet sequences using a normal linear layer to create an embedding. Then, we add the special SLA (Sequence Level Aggregation) token and the positional embedding. The learnable SLA token is specifically designed for summarizing sequence-level features, facilitating the integration of both packet-level and sequence-level information. It is randomly initialized and added before the packet embedding. This integration results in powerful representations of network traffic. Additionally, we employ a positional embedding, denoted as $\mathbf{P}$, to provide the model with information about the original packet positions. This embedding is initialized as a random vector, learned by the model, and subsequently added to the packet embedding.
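The input stage described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the released implementation: the toy sizes, the initialization scale, and all function names are hypothetical, and the SLA token and positional embedding are only initialized here, whereas in SANE they would be updated during training.

```python
import random

rng = random.Random(0)


def rand_matrix(rows, cols, scale=0.02):
    """Random initialization for a learnable parameter (scale is an assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight, bias):
    """Apply y = W x + b to one packet feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def embed_sequence(packets, weight, bias, sla_token, positions):
    """Project packets, put the SLA token in front, then add positional embeddings."""
    tokens = [sla_token] + [linear(p, weight, bias) for p in packets]
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, positions)]


n, f, dim = 4, 8, 6                    # toy sizes; the paper uses n = 200, f = 8
weight, bias = rand_matrix(dim, f), [0.0] * dim
sla_token = rand_matrix(1, dim)[0]     # one extra token summarizing the sequence
positions = rand_matrix(n + 1, dim)    # one position per token, SLA included
packets = rand_matrix(n, f, scale=1.0) # stand-in for real per-packet features

embedded = embed_sequence(packets, weight, bias, sla_token, positions)
```

Note that the embedded sequence has $n+1$ tokens, since the SLA token occupies the first position.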

Algorithm 1 SANE training
1: Input: $\mathcal{X}^{d}$; $\mathcal{Y}^{d}$, $\forall\, d\in\mathcal{S}$
2: Output: $\xi_{\text{SANE}}$
3: Randomly initialize parameters in SANE
4: for $(\mathbf{b}_{x},\mathbf{b}_{y})$ : sample a batch from $\mathcal{X}$ and $\mathcal{Y}$ do
5:     $\mathbf{E}=\texttt{SLA}\oplus\text{Embedding}(\mathbf{b}_{x})$  ▷ Concatenate SLA
6:     $\mathcal{E}\leftarrow\mathbf{P}+\mathbf{E}$  ▷ Add positional embedding
7:     for each stacked encoder $\nu$ do
8:         $R^{1}\leftarrow\text{Multi\_head\_attention}(\text{Norm}(\mathcal{E}))$
9:         $R^{2}\leftarrow\text{MLP}(\text{Norm}(\mathcal{E}+R^{1}))$
10:        $\mathcal{E}\leftarrow R^{2}+(\mathcal{E}+R^{1})$
11:    end for
12:    $l_{i}=\mathbf{NL}_{l}(\text{Average\_pooling}(\mathcal{E}))$;  $\lambda_{i}=\mathbf{NL}_{\lambda}(l_{i})$
13:    $\text{Predictions}\leftarrow\mathbf{Softmax}(\text{Classifier\_head}(\lambda_{i}))$
14:    $\text{Loss}\leftarrow\mathbf{Cross\_entropy}(\text{Predictions},\mathbf{b}_{y})$
15:    $\theta\leftarrow\theta+\mathbf{Compute\_update}(\text{Loss})$
16: end for
17: $\xi_{\text{SANE}}=\xi(\theta)$  ▷ Trained SANE
18: return $\xi_{\text{SANE}}$

As a next step in our SANE architecture, we employ a stack of encoders [27] to learn the mapping from the raw data space to a latent space by classifying seen devices (see lines 7-11 in Algorithm 1). After the encoder stack, we adopt an average pooling layer to aggregate the features from both packet-level and sequence-level. Next, we add two different dense layers $\mathbf{NL}_{l}$ and $\mathbf{NL}_{\lambda}$ to $\mathcal{C}$ for deriving latent features $\mathbf{L}$ and $\mathbf{\Lambda}$, respectively. Finally, we pass the latent features to a classifier head and use $\mathbf{Softmax}$ to get the prediction probabilities per device class. The final step of the forward path is the calculation of the loss. Then we perform back-propagation to compute the weight updates. This procedure is repeated for several epochs until we get our final trained model, $\xi_{\text{SANE}}$.
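The forward path of Algorithm 1 (lines 7-13) can be sketched without any deep learning library, as below. This is a deliberately simplified illustration and not the authors' model: it uses a single attention head instead of multi-head attention, ReLU in the MLP, bias-free dense layers, random untrained weights, and toy dimensions, all of which are assumptions. Loss computation and back-propagation are omitted.

```python
import math
import random

rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x, eps=1e-5):
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention (simplification of multi-head)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])       # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def mlp(x, w1, w2):
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]  # ReLU (assumed)
    return matmul(hidden, w2)


def encoder_block(e, p):
    r1 = self_attention(layer_norm(e), p["wq"], p["wk"], p["wv"])  # Algorithm 1, line 8
    skip = add(e, r1)
    r2 = mlp(layer_norm(skip), p["w1"], p["w2"])                    # line 9
    return add(r2, skip)                                            # line 10


def sane_forward(e, blocks, nl_l, nl_lambda, head):
    for p in blocks:
        e = encoder_block(e, p)
    pooled = [[sum(col) / len(e) for col in zip(*e)]]   # average pooling over tokens
    latent_l = matmul(pooled, nl_l)                     # line 12: l
    latent_lambda = matmul(latent_l, nl_lambda)         # line 12: lambda
    probs = softmax(matmul(latent_lambda, head)[0])     # line 13
    return latent_l[0], latent_lambda[0], probs


dim, M, N, classes = 6, 4, 3, 2   # toy sizes, chosen only for the example
blocks = [{"wq": rand_matrix(dim, dim), "wk": rand_matrix(dim, dim),
           "wv": rand_matrix(dim, dim), "w1": rand_matrix(dim, 2 * dim),
           "w2": rand_matrix(2 * dim, dim)} for _ in range(2)]
tokens = rand_matrix(5, dim)      # stand-in for the embedded sequence with SLA token
latent_l, latent_lambda, probs = sane_forward(
    tokens, blocks, rand_matrix(dim, M), rand_matrix(M, N), rand_matrix(N, classes))
```

Each block normalizes before the attention and MLP sub-layers and adds the residual afterwards, matching the order of operations in lines 8-10.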

Algorithm 2 SANE feature and attribute extraction
1: $\mathcal{X}$; $\mathcal{Y}$; $\xi_{\text{SANE}}$
2: $\mathcal{C}$: feature extraction model; $\mathcal{A}$: attribute vector; $\mathbf{L}$: latent features
3: $\mathcal{C} \leftarrow \text{Remove\_classifier\_head}(\xi_{\text{SANE}})$
4: $\mathcal{C}_l \leftarrow \text{Remove\_}\mathbf{NL}_\lambda(\mathcal{C})$ $\triangleright$ Remove layer $\mathbf{NL}_\lambda$
5: $\mathcal{C}_\lambda \leftarrow \text{Remove\_}\mathbf{NL}_l(\mathcal{C}_l)$ $\triangleright$ Remove layer $\mathbf{NL}_l$
6: for $\mathbf{x}_i$ in $\mathcal{X}$, $\forall\, i = 1, 2, \cdots, m$ do
7:    $l_i = \mathcal{C}_l(\mathbf{x}_i)$ $\triangleright$ Extract feature $l \in \mathbb{R}^{M \times 1}$
8:    $\lambda_i = \mathcal{C}_\lambda(\mathbf{x}_i)$ $\triangleright$ Extract feature $\lambda \in \mathbb{R}^{N \times 1}$
9: end for
10: $\mathbf{L} = \{l_1, l_2, \cdots, l_m\}$
11: $\mathbf{\Lambda} = \{\lambda_1, \lambda_2, \cdots, \lambda_m\}$
12: $\mathcal{A} = \frac{\sum_{i=1}^{m} \lambda_i}{m}$ $\triangleright$ Take average of $\mathbf{\Lambda}$
13: return $\mathcal{C}$, $\mathcal{A}$, $\mathbf{L}$

Feature and attribute extraction (Algorithm 2). In this stage of ZEST, we feed $\mathcal{X}^d, \forall\, d \in \mathcal{U}$, to $\xi_{\text{SANE}}$, an attention-based encoder trained only on seen classes (the “wise people”), to derive the attribute vectors of the unseen devices (without labels); see Figure 3a. To achieve this, we first remove the classification head of the trained SANE $\xi_{\text{SANE}}$ to get the feature extraction model $\mathcal{C}$. For a device $d \in \Gamma$, we remove the layer $\mathbf{NL}_\lambda$ from $\mathcal{C}$ to get $\mathcal{C}_l$, which extracts $\mathbf{L}^d$. Then we remove layer $\mathbf{NL}_l$ from $\mathcal{C}_l$ to get $\mathcal{C}_\lambda$, which extracts $\mathbf{\Lambda}^d$. The attribute extractor takes the average of $\mathbf{\Lambda}^d$ to compute the attribute vector $\mathcal{A}^d$ for device $d$. Lines 6-12 in Algorithm 2 illustrate the process of extracting attribute vectors for both seen and unseen devices.
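As an illustration of the averaging step (line 12 of Algorithm 2), the sketch below computes one attribute vector per device as the element-wise mean of its $\lambda_i$ vectors. The device names and the 3-dimensional values are hypothetical; in ZEST these vectors would be produced by $\mathcal{C}_\lambda$.

```python
def attribute_vector(lambdas):
    """Element-wise mean of the per-sample latent vectors lambda_i."""
    if not lambdas:
        raise ValueError("need at least one latent vector")
    m = len(lambdas)
    dim = len(lambdas[0])
    return [sum(vec[k] for vec in lambdas) / m for k in range(dim)]


# Hypothetical lambda vectors grouped by device (illustrative values only).
latents_by_device = {
    "camera": [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]],
    "plug": [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0], [0.0, 2.0, 0.0]],
}
attributes = {d: attribute_vector(v) for d, v in latents_by_device.items()}
print(attributes)
```

Because the mean needs no labels beyond the grouping of samples by device, the same routine applies to seen and unseen devices alike.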

Algorithm 3 CVAE and SVM training
1: $\mathcal{A}^s$; $\mathbf{L}^s, \forall\, s \in \mathcal{S}$; $\mathcal{A}^u, \forall\, u \in \mathcal{U}$
2: $\mathcal{F}_{\text{SVM}}$: trained SVM classifier
3: Randomly initialize $\text{Encoder}_{\text{cvae}}$, $\text{Decoder}_{\text{cvae}}$
4: for $\mathbf{b}$: sample a batch from $\mathbf{L}^s$ do
5:    $\mu, \sigma \leftarrow \text{Encoder}_{\text{cvae}}(\mathbf{b}, \mathcal{A}^s)$
6:    Sample $\mathbf{z}$ from $\mathcal{N}(\mu, \sigma)$
7:    $\hat{\mathbf{b}} \leftarrow \text{Decoder}_{\text{cvae}}(\mathbf{z}, \mathcal{A}^s)$ $\triangleright$ Reconstruction
8:    $\text{Loss} \leftarrow |\hat{\mathbf{b}} - \mathbf{b}| + \text{KL}(\mathcal{N}(\mu, \sigma), \mathcal{N}(0, 1))$
9:    $\theta \leftarrow \theta + \mathbf{Compute\_update}(\text{Loss})$
10: end for
11: $\mathcal{D} = \text{Decoder}_{\text{cvae}}(\theta)$ $\triangleright$ Trained decoder
12: Sample random noise $\tau$ from $\mathcal{N}(0, 1)$
13: $\mathcal{P}_u \leftarrow \mathcal{D}(\tau, \mathcal{A}^u)$; $\mathcal{P}_s \leftarrow \mathcal{D}(\tau, \mathcal{A}^s)$ $\triangleright$ Pseudo data generation
14: $\mathcal{P} = \mathcal{P}_u \cup \mathcal{P}_s$
15: $\mathcal{F}_{\text{SVM}} \leftarrow \text{SVM\_fit}(\mathcal{P})$ $\triangleright$ SVM training
16: return $\mathcal{F}_{\text{SVM}}$

Generative model training (Algorithm 3). The goal of this stage is to generate pseudo data for unseen IoT devices. For this purpose, we leverage a well-known generative model, namely a CVAE (conditional variational autoencoder). The CVAE contains two parts: an encoder and a decoder. The encoder compresses the latent features conditioned on the attributes of devices. Next, the decoder reconstructs the latent features based on the IoT device attributes. CVAE training is based only on the seen devices’ latent features $\mathbf{L}^s, \forall\, s \in \mathcal{S}$, and attribute vectors $\mathcal{A}^s, \forall\, s \in \mathcal{S}$; see lines 3-11 in Algorithm 3. To train the CVAE, we use two loss functions, namely the reconstruction loss and the KL divergence [17] between the latent distribution and the standard normal distribution. During the training phase, the decoder $\mathcal{D}$ of the CVAE learns to map attribute vectors to the initial latent features. We use this ability of $\mathcal{D}$ to generate pseudo data $\mathcal{P}_u$ of unseen device classes based on the attributes $\mathcal{A}^u, \forall\, u \in \mathcal{U}$, by feeding it random noise $\tau$ sampled from the standard normal distribution. This unseen pseudo data is then used in the next stage to train a supervised classifier. During our experiments, we found that the final classifier is biased towards seen classes when we use real seen device data, an observation that has already been made in [17]. Therefore, we generate an equal number of pseudo data points, $\mathcal{P} = \mathcal{P}_u \cup \mathcal{P}_s$, for devices from both seen and unseen classes. These steps are illustrated in Figure 3b.
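The training loss in line 8 of Algorithm 3 can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: $\sigma$ is taken to be a per-dimension standard deviation of a diagonal Gaussian, the reconstruction term is the summed absolute error, and the two terms are added without weighting. All function names are chosen here for illustration.

```python
import math


def l1_reconstruction(b_hat, b):
    """Summed absolute error between reconstruction and input."""
    return sum(abs(x_hat - x) for x_hat, x in zip(b_hat, b))


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def cvae_loss(b_hat, b, mu, sigma):
    """Reconstruction term plus KL term (unweighted sum, an assumption)."""
    return l1_reconstruction(b_hat, b) + kl_to_standard_normal(mu, sigma)


# Made-up numbers: a 2-dimensional feature and a 1-dimensional latent code.
print(cvae_loss([0.9, 2.1], [1.0, 2.0], mu=[1.0], sigma=[1.0]))
```

The KL term vanishes only when the encoder outputs $\mu = 0$ and $\sigma = 1$, which is what allows the decoder to be driven by noise drawn from $\mathcal{N}(0, 1)$ after training.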

Supervised classifier training. After generating labeled data for both seen and unseen devices, we employ a final supervised classifier, a support vector machine (SVM), to predict the IoT device label; see Figure 3c. Finally, we test the performance of the trained SVM model in classifying real traffic data from both seen and unseen devices.
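How the labeled pseudo data set for the classifier might be assembled is sketched below. The function `toy_decoder` is a hypothetical stand-in for the trained CVAE decoder $\mathcal{D}(\tau, \mathcal{A})$ (here it merely perturbs the attribute vector with scaled noise), and the device names and attribute values are invented. The point of the sketch is only the balanced sampling: every class, seen or unseen, contributes the same number of pseudo samples.

```python
import random


def toy_decoder(noise, attribute):
    """Hypothetical stand-in for the trained decoder D(tau, A)."""
    return [a + 0.1 * n for a, n in zip(attribute, noise)]


def generate_pseudo_data(attributes, per_class, rng):
    """Draw the same number of pseudo samples for every device class."""
    features, labels = [], []
    for device, attribute in attributes.items():
        for _ in range(per_class):
            tau = [rng.gauss(0.0, 1.0) for _ in attribute]  # noise ~ N(0, 1)
            features.append(toy_decoder(tau, attribute))
            labels.append(device)
    return features, labels


rng = random.Random(0)
attributes = {"seen_hub": [0.0, 1.0, 2.0], "unseen_cam": [3.0, 1.0, 0.0]}
features, labels = generate_pseudo_data(attributes, per_class=5, rng=rng)
print(len(features), labels.count("seen_hub"), labels.count("unseen_cam"))
```

The resulting `(features, labels)` pairs are what a supervised classifier such as an SVM would be fitted on.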

IV Performance evaluation

In this section, we evaluate ZEST and compare it with existing semi-supervised learning approaches. Typically, performance is evaluated in two different settings: ZSL and GZSL (generalized zero-shot learning). In both settings, the training set contains only the labeled data of seen classes, $\{\mathcal{X}^d, \mathcal{Y}^d, \forall\, d \in \mathcal{S}\}$. The objective of ZSL is to classify the unseen devices $\mathcal{U}$, which are not seen during the training stage. Different from ZSL, the test data of GZSL come from both seen devices $\mathcal{S}$ and unseen devices $\mathcal{U}$. To sum up, the goals of ZSL and GZSL are to learn the classification functions $f_{\text{ZSL}}: \mathcal{X} \rightarrow \mathcal{U}$ and $f_{\text{GZSL}}: \mathcal{X} \rightarrow \mathcal{U} \cup \mathcal{S}$, respectively.
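The difference between the two settings can be made concrete with a small sketch that follows the definitions of $f_{\text{ZSL}}$ and $f_{\text{GZSL}}$ above. It assumes each test sample comes with a per-class score; the device names and scores are invented, and this is not necessarily how the reported numbers were computed.

```python
def predict(scores, label_space):
    """Pick the highest-scoring label within the allowed label space."""
    return max(sorted(label_space), key=lambda label: scores[label])


def evaluate(samples, label_space, test_classes):
    """Accuracy on samples from test_classes, predicting within label_space."""
    chosen = [(true, scores) for true, scores in samples if true in test_classes]
    hits = sum(predict(scores, label_space) == true for true, scores in chosen)
    return hits / len(chosen)


seen, unseen = {"hub", "plug"}, {"cam", "bulb"}
# Hypothetical (true label, class scores) pairs.
samples = [
    ("hub", {"hub": 0.7, "plug": 0.1, "cam": 0.1, "bulb": 0.1}),
    ("plug", {"hub": 0.5, "plug": 0.3, "cam": 0.1, "bulb": 0.1}),
    ("cam", {"hub": 0.5, "plug": 0.1, "cam": 0.3, "bulb": 0.1}),
    ("bulb", {"hub": 0.1, "plug": 0.1, "cam": 0.2, "bulb": 0.6}),
]
zsl_acc = evaluate(samples, label_space=unseen, test_classes=unseen)
gzsl_acc = evaluate(samples, label_space=seen | unseen, test_classes=seen | unseen)
print(zsl_acc, gzsl_acc)
```

In this toy data the "cam" sample is classified correctly when only unseen labels are allowed but is pulled towards a seen class in the generalized setting, which is why GZSL is usually the harder of the two.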

IV-A Dataset for evaluation

We carry out evaluations on the publicly available UNSW 2018 dataset [39]. The appeal of this dataset stems from the fact that it is relatively large (27 GB of data) and contains a good number of IoT devices (28 device types). The UNSW dataset is composed of raw pcap data collected over a period of 61 days. We ignore the gateway device, as it only acts as an intermediary between IoT devices and the Internet. Due to the imbalance in data per device type, for both ZSL and GZSL, we use only the 12 device classes with the most data points; of these, 10 randomly chosen devices form the seen category, and the rest form the unseen category. The dataset contains more than a million data points across the 12 devices, with a minimum of 14,034 data points for a device and a maximum of 197,876 data points. We do not include the unseen devices in the first supervised training step. The seen and unseen devices are selected randomly five times, and we report the average result over all five experimental runs.
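The repeated random split can be sketched as below. The device identifiers, the seed, and the `evaluate` callback are placeholders introduced for illustration; in an actual experiment the callback would train ZEST on the seen devices and return an accuracy.

```python
import random


def split_seen_unseen(devices, n_seen, rng):
    """Shuffle the devices and cut them into seen and unseen groups."""
    shuffled = list(devices)
    rng.shuffle(shuffled)
    return shuffled[:n_seen], shuffled[n_seen:]


def average_over_runs(devices, n_seen, n_runs, evaluate, seed=0):
    """Average a metric over several independent random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        seen, unseen = split_seen_unseen(devices, n_seen, rng)
        scores.append(evaluate(seen, unseen))
    return sum(scores) / n_runs


devices = [f"device_{k:02d}" for k in range(12)]  # placeholder names


def placeholder_metric(seen, unseen):
    # Stand-in for "train on seen, test, return accuracy".
    return len(unseen) / (len(seen) + len(unseen))


mean_score = average_over_runs(devices, n_seen=10, n_runs=5,
                               evaluate=placeholder_metric)
print(mean_score)
```

Averaging over several splits reduces the chance that a reported number depends on one particularly easy or hard choice of unseen devices.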

IV-B Baselines for comparisons

We compare ZEST with the baselines on the benchmark dataset. All training and testing of the baselines and ZEST are performed on an Nvidia Tesla T4 GPU. As Bi-LSTM has been shown to have excellent performance in supervised IoT fingerprinting [7], we use Bi-LSTM as the feature extractor for all the baselines. Specifically, sequences of packets are transformed into a lower-dimensional feature vector $\mathbf{L} \in \mathbb{R}^{20 \times 1}$ using a Bi-LSTM model. These feature vectors produced by the Bi-LSTM model are then provided as input to the respective models of the baselines. Therefore, just like ZEST, all baselines take the transformed and reduced feature vectors as input, which ensures that the comparisons are fair and made at the algorithmic level. Note that the Bi-LSTM model, as well as the SANE model for ZEST, is trained on the seen classes. We now describe the baselines:

  1. VAE-K: This was originally proposed as an unsupervised learning method in [9], employing a VAE to extract features and subsequently applying k-means to perform clustering. In our implementation, the VAE encoder compresses the features $\mathbf{L}$ generated by the Bi-LSTM model to the same dimension as $\mathbf{\Lambda}$, another latent feature with a lower dimension. It then employs k-means to cluster data into 12 groups, the same as the total number of seen and unseen devices. Since the Bi-LSTM model is trained on seen classes, the model is now semi-supervised, and therefore also more capable than its unsupervised counterpart.

  2. SeqCR: This is a semi-supervised sequence clustering model. Rather than using the feature vectors $\mathbf{L}$ generated by Bi-LSTM directly, it adds one more layer to extract $\mathbf{\Lambda}$ from the initial packet sequences. Thus, without a VAE, we still obtain feature vectors of a much lower dimension, $\mathbf{\Lambda}$. Subsequently, we apply k-means clustering to assign the latent features $\mathbf{\Lambda}$ to 12 groups. The cluster centers are randomly initialized.

  3. SeqCS: This is an extension of SeqCR and a semi-supervised sequence-based clustering model. Unlike SeqCR, it initializes the k-means cluster centers using the attribute vector $\mathcal{A}$, which is the average value of $\mathbf{\Lambda}$ (from both seen and unseen devices), and uses seeded k-means [31] for clustering. This helps improve the classification performance since the attribute vectors contain auxiliary information for each cluster.

  4. DEFT: This is a semi-supervised learning method proposed in [10], and can be seen as an extension of SeqCS. After getting initial clustering results based on SeqCS, DEFT trains a random forest for classification on the clustered dataset of $\mathbf{\Lambda}$ together with the cluster labels. However, if the initial clustering done by SeqCS is of poor quality, it can negatively impact the final performance.
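The difference between random and seeded initialization, as used by SeqCR and SeqCS respectively, can be sketched with a small k-means routine whose initial centers are supplied by the caller. This is an illustrative simplification of seeded k-means with toy 2-dimensional points; in the baselines the seeds would be the per-device attribute vectors and the points would be the $\mathbf{\Lambda}$ features.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(point, centers[k]))


def seeded_kmeans(points, seeds, n_iter=10):
    """Lloyd iterations whose initial centers are the given seed vectors."""
    centers = [list(s) for s in seeds]
    for _ in range(n_iter):
        assignment = [nearest(p, centers) for p in points]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(p, centers) for p in points]


# Toy data: two well-separated groups and one seed per group.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
seeds = [[0.0, 0.5], [5.0, 5.0]]
centers, assignment = seeded_kmeans(points, seeds)
print(centers, assignment)
```

Because cluster `k` starts from the seed of device `k`, the cluster index can be read directly as a device label, which is not possible with randomly initialized centers.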

We make the code of ZEST and all the baselines public [40].

IV-C Results

To the best of our knowledge, our work is the first to apply the generative ZSL framework on network traffic. Therefore, besides studying the efficacy of ZEST for IoT device classification, we conduct various experiments: we carry out ablation experiments to investigate the efficiency of the SANE model, we perform SANE hyper-parameter tuning, and we analyze the ability of ZEST to handle varying numbers of unseen devices.

IV-C1 Comparison of baselines vs. ZEST

We evaluate the performance using two evaluation metrics: ZSL accuracy and GZSL accuracy. We consistently use a 5:1 ratio of seen to unseen devices (10 seen, 2 unseen) in our experiments. As depicted in Figure 4, in the ZSL setting, SeqCS achieves the best accuracy of $\sim 64\%$ among the baselines. ZEST achieves a superior accuracy of $\sim 93\%$, meaning we get a nearly 30% absolute improvement with ZEST over the best-performing baseline. As for GZSL, the accuracy of ZEST is $\sim 92\%$, which is an absolute 10% improvement when compared to the next best-performing baseline, DEFT.

Figure 4: Comparison of baselines and ZEST

IV-C2 Analysis of SANE architecture

We investigate the impact of the number of stacked encoders, denoted as $e$, and the number of attention heads, denoted as $h$, in our SANE model. Using a stack of encoders helps the SANE model to learn more complex relationships, similar to a CNN with many convolution layers. Within one encoder, the attention mechanism is applied multiple times in parallel, giving us a so-called multi-head attention layer. This mechanism enables the model to jointly attend to different patterns, while also offering ample opportunity for parallelization during training [25].
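To make the role of $h$ concrete, the following sketch applies scaled dot-product self-attention independently to $h$ slices of the feature dimension and concatenates the results. It is deliberately simplified and is not the SANE implementation: queries, keys, and values are all set to the input itself, whereas a transformer encoder as in [25] uses learned projections per head and an output projection. The input values are made up.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


def multi_head_self_attention(x, h):
    """Split features into h equal slices, attend per slice, concatenate."""
    d = len(x[0])
    if d % h != 0:
        raise ValueError("feature dimension must be divisible by h")
    width = d // h
    heads = [
        self_attention([row[i * width:(i + 1) * width] for row in x])
        for i in range(h)
    ]
    return [sum((head[t] for head in heads), []) for t in range(len(x))]


# A toy sequence of three "packets" with four features each.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, h=2)
print(out)
```

Each head sees a different slice of the features and therefore computes its own attention weights, and the heads are independent of one another, which is what makes them easy to evaluate in parallel.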

Initially, we set the number of attention heads in SANE to $8$ and vary the encoder stack size for $e = \{1, 2, 4, 6, 8\}$. The batch size is set to 64, and we train for 20 epochs with a learning rate of 0.0005. Our analysis considers the resource consumption, which takes into account the parameter size and training time, as well as the supervised training accuracy of the different settings. The results presented in Table I demonstrate that increasing the number of encoders leads to a proportional increase in training time and parameter size. However, the prediction accuracy remains around $98.8\%$ when the number of encoders ranges from $2$ to $8$. Hence, after considering the trade-off between resources and accuracy, we set $e = 2$.

TABLE I: Comparison of different encoder stack sizes
| Number of encoders | Parameter size (MB) | Training time/epoch (min) | Accuracy (%) |
|---|---|---|---|
| 1 | 2.81 | 2.10 | 98.23 |
| 2 | 5.56 | 3.90 | 98.84 |
| 4 | 11.09 | 7.85 | 98.76 |
| 6 | 16.61 | 11.10 | 98.81 |
| 8 | 22.14 | 14.90 | 98.86 |
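The proportionality claim can be checked directly from the numbers in Table I by dividing each cost by the number of encoders; the short script below does only that arithmetic, and the variable names are chosen here for illustration.

```python
# Rows of Table I: (encoders, parameter size in MB, training minutes per epoch).
table_i = [
    (1, 2.81, 2.10),
    (2, 5.56, 3.90),
    (4, 11.09, 7.85),
    (6, 16.61, 11.10),
    (8, 22.14, 14.90),
]

per_encoder = [(e, size_mb / e, minutes / e) for e, size_mb, minutes in table_i]
for e, mb_each, min_each in per_encoder:
    print(f"e={e}: {mb_each:.2f} MB and {min_each:.2f} min per encoder")
```

Both ratios stay nearly constant across rows (about 2.8 MB and roughly 2 minutes per encoder), consistent with a cost that grows linearly in $e$ while accuracy stays flat beyond $e = 2$.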

Having fixed the encoder stack size, we now vary the number of attention heads for $h = \{1, 2, 4, 8\}$. The results presented in Table II show that increasing the number of attention heads leads to higher accuracy. This is because more attention heads enable the model to capture a broader range of aspects in the traffic pattern. Therefore, we choose $h = 8$. To conclude our hyperparameter analysis, we find that the optimal encoder stack size is relatively small at $e = 2$. On the other hand, the multi-head attention layers should be relatively large, each containing $h = 8$ parallel attention heads.

TABLE II: Comparison of different number of attention heads in SANE
| Number of attention heads | Parameter size (MB) | Training time/epoch (min) | Accuracy (%) |
|---|---|---|---|
| 1 | 1.24 | 1.42 | 97.81 |
| 2 | 1.86 | 1.68 | 98.36 |
| 4 | 3.09 | 2.37 | 98.45 |
| 8 | 5.56 | 3.90 | 98.84 |

IV-C3 Comparison of SANE vs. LSTM

In this study, we first investigate the impact of using SANE in comparison to a Bi-LSTM for feature extraction in supervised learning. We consider all 28 classes and create a train/val/test data split in the ratio 60:20:20, via random shuffling. Table III presents the prediction accuracy and inference time of these two models. From the results, we observe that the SANE model achieves slightly higher accuracy with a much lower inference time than the Bi-LSTM model. To further study the effectiveness of using SANE, we conduct another experiment on the baseline models, where we replace the underlying Bi-LSTM sequence models with the SANE architecture.

TABLE III: Comparison of SANE and Bi-LSTM
| Method | Bi-LSTM | SANE |
|---|---|---|
| Classification Accuracy (%) | 97.46 | 98.84 |
| Inference Time (ms) | 0.59 | 0.12 |

Figure 5 illustrates the ZSL and GZSL accuracy when feature extraction is done by Bi-LSTM and SANE. In the ZSL setting, SANE-based models yield an average absolute improvement of approximately 20% over various baselines, with a maximum improvement of about 40% observed for DEFT. In the GZSL setting, SANE-based models achieve an average absolute accuracy gain of about 5%, compared to Bi-LSTM. These results suggest that SANE is capable of extracting more informative features, leading us to select it as our preferred feature extraction method in ZEST. Additionally, as is evident from Figure 4 and Figure 5, ZEST also outperforms SANE-based baselines. ZEST brings an average absolute improvement of about 10% and 5% for ZSL and GZSL settings, respectively, when compared with the best SANE-based baseline, DEFT.

Figure 5: SANE vs. Bi-LSTM for baselines

IV-C4 Comparison of different attribute dimensions

Here we search for the optimal attribute dimension for ZEST. To assess the quality of the reconstructed data from the CVAE, we vary the dimension of the attribute vector $\mathcal{A}$; specifically, we use $|\mathcal{A}| = \{2, 3, 4, 5\}$. As shown in Figure 6a, the best performance is achieved when the dimension is 3 in both ZSL and GZSL settings. This trend of increasing accuracy followed by decreasing accuracy is reasonable: while a higher-dimensional attribute vector represents more traffic information, it can also make the mapping between attributes and traffic data more challenging.

(a) Comparison of different attribute dimensions
(b) Comparison of varying number of unseen classes
Figure 6: Analysis for attribute vector dimension and number of unseen classes

IV-C5 Varying the number of unseen classes

We investigate ZEST’s ability to handle multiple unseen classes. In Figure 6b, we compare its performance when varying the number of unseen devices from 1 to 4, for a total of 12 classes. As the number of unseen classes increases, both ZSL and GZSL accuracy values tend to decrease. When the ratio of unseen to seen classes changes from $2/10$ to $3/9$, there is a sharp decrease for both ZSL and GZSL. Lower ratios of unseen classes to seen classes facilitate a more effective mapping of attribute vectors to traffic data. In other words, ZEST requires a significant number of devices in the seen category to classify unseen devices. A rigorous validation of this observation using 50-100 devices merits a separate study. In practice, however, given that MAC addresses of devices remain static over a duration, the model can effectively classify traffic from a small number of unseen IoT devices.

V Conclusions

In this work, we develop a novel zero-shot learning (ZSL) framework, called ZEST, for IoT fingerprinting. As the first attempt to apply ZSL to network traffic modeling, we analyze the effectiveness of the self-attention based model for extracting traffic features in this framework. Our experiments show that SANE yields higher accuracy, lower inference time, and better feature extraction ability in comparison to Bi-LSTM. Since ZSL relies on class-specific attributes, which are typically not present for traffic-based IoT device classification, we propose a novel attention-based approach, i.e., SANE, to automatically compute attributes for any IoT device. Finally, we compare our ZEST framework with four baselines from both the unsupervised and semi-supervised domains. Our results demonstrate that ZEST achieves state-of-the-art performance, with an absolute accuracy improvement of about 30% and 10% for ZSL and GZSL, respectively.

References

  • [1] M. Antonakakis, T. April, M. Bailey, M. Bernhard, E. Bursztein, J. Cochran, Z. Durumeric, J. A. Halderman, L. Invernizzi, M. Kallitsis et al., “Understanding the Mirai Botnet,” in Proc. USENIX Security, 2017, pp. 1093–1110.
  • [2] S. Herwig, K. Harvey, G. Hughey, R. Roberts, and D. Levin, “Measurement and Analysis of Hajime: a Peer-to-peer IoT Botnet,” in Proc. NDSS, 2019.
  • [3] B. Bezawada, M. Bachani, J. Peterson, H. Shirazi, I. Ray, and I. Ray, “Behavioral fingerprinting of IoT devices,” in Proc. of the workshop on attacks and solutions in hardware security, 2018, pp. 41–50.
  • [4] N. J. Apthorpe, D. Reisman, and N. Feamster, “A Smart Home is No Castle: Privacy Vulnerabilities of Encrypted IoT Traffic,” CoRR, vol. abs/1705.06805, 2017.
  • [5] M. Miettinen, S. Marchal, I. Hafeez, N. Asokan, A.-R. Sadeghi, and S. Tarkoma, “IoT SENTINEL: Automated Device-Type Identification for Security Enforcement in IoT,” in 37th IEEE International Conference on Distributed Computing Systems, ICDCS, 2017, pp. 2177–2184.
  • [6] A. Sivanathan, D. Sherratt, H. H. Gharakheili, A. Radford, C. Wijenayake, A. Vishwanath, and V. Sivaraman, “Characterizing and classifying IoT traffic in smart cities and campuses,” in IEEE International Conference on Computer Communications Workshops, INFOCOM, 2017, pp. 559–564.
  • [7] S. Dong, Z. Li, D. Tang, J. Chen, M. Sun, and K. Zhang, “Your Smart Home Can’t Keep a Secret: Towards Automated Fingerprinting of IoT Traffic,” in Proc. AsiaCCS, 2020, pp. 47–59.
  • [8] F. Sawadogo, J. Violos, A. Hameed, and A. Leivadeas, “An Unsupervised Machine Learning Approach for IoT Device Categorization,” in IEEE International Mediterranean Conference on Communications and Networking (MeditCom), 2022, pp. 25–30.
  • [9] S. Zhang, Z. Wang, J. Yang, D. Bai, F. Li, Z. Li, J. Wu, and X. Liu, “Unsupervised IoT Fingerprinting Method via Variational Auto-encoder and K-means,” in IEEE ICC, 2021.
  • [10] V. Thangavelu, D. M. Divakaran, R. Sairam, S. S. Bhunia, and M. Gurusamy, “DEFT: A Distributed IoT Fingerprinting Technique,” IEEE Internet of Things Journal, vol. 6, no. 1, pp. 940–952, 2019.
  • [11] B. Atul Desai, D. M. Divakaran, I. Nevat, G. W. Peters, and M. Gurusamy, “A feature-ranking framework for IoT device classification,” in 11th Int’l Conf. on Communication Systems & Networks (COMSNETS 2019), Jan. 2019.
  • [12] R. Trimananda, J. Varmarken, A. Markopoulou, and B. Demsky, “Packet-Level Signatures for Smart Home Devices,” in Proc. NDSS, 2020.
  • [13] B. Chakraborty, D. M. Divakaran, I. Nevat, G. W. Peters, and M. Gurusamy, “Cost-aware Feature Selection for IoT Device Classification,” IEEE Internet of Things Journal, 2021.
  • [14] L. Fan, S. Zhang, Y. Wu, Z. Wang, C. Duan, J. Li, and J. Yang, “An IoT Device Identification Method based on Semi-supervised Learning,” in 16th International Conference on Network and Service Management (CNSM), 2020, pp. 1–7.
  • [15] A. Shenoi, P. K. Vairam, K. Sabharwal, J. Li, and D. M. Divakaran, “iPET: Privacy Enhancing Traffic Perturbations for Secure IoT Communications,” Proc. Privacy Enhancing Technologies Symposium (PETS), 2023.
  • [16] S. Liu, X. Zhu, H. Chen, and Z. Han, “Secure Communication for Integrated Satellite-Terrestrial Backhaul Networks: Focus on Up-link Secrecy Capacity based on Artificial Noise,” IEEE Wireless Communications Letters, pp. 1–1, 2023.
  • [17] A. Mishra, S. Krishna Reddy, A. Mittal, and H. A. Murthy, “A Generative Model for Zero Shot Learning Using Conditional Variational Autoencoders,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops, 2018, pp. 2188–2196.
  • [18] V. K. Verma, D. Brahma, and P. Rai, “Meta-Learning for Generalized Zero-Shot Learning,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6062–6069.
  • [19] B. Zhao, X. Sun, Y. Yao, and Y. Wang, “Zero-shot Learning via Shared-Reconstruction-Graph Pursuit,” arXiv preprint arXiv:1711.07302, 2017.
  • [20] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD birds 200,” 2010.
  • [21] C. H. Lampert, H. Nickisch, and S. Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2013.
  • [22] D. M. Divakaran, S. Le, Y. S. Liau, and V. L. L. Thing, “SLIC: Self-Learning Intelligent Classifier for Network Traffic,” Computer Networks, 2015.
  • [23] I. Nevat, D. M. Divakaran, S. G. Nagarajan, P. Zhang, L. Su, L. L. Ko, and V. L. L. Thing, “Anomaly Detection and Attribution in Networks With Temporally Correlated Traffic,” IEEE/ACM Transactions on Networking, vol. 26, no. 1, pp. 131–144, Feb 2018.
  • [24] K. L. K. Sudheera, D. M. Divakaran, R. P. Singh, and M. Gurusamy, “ADEPT: Detection and Identification of Correlated Attack Stages in IoT Networks,” IEEE Internet of Things Journal, vol. 8, no. 8, pp. 6591–6607, 2021.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Proc. NIPS, 2017.
  • [26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019, pp. 4171–4186.
  • [27] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in 9th International Conference on Learning Representations, ICLR, 2021.
  • [28] N. Q. K. Le, Q.-T. Ho, T.-T.-D. Nguyen, and Y.-Y. Ou, “A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information,” Briefings in Bioinformatics, vol. 22, no. 5, p. bbab005, 2021.
  • [29] B. Wu, D. Chen, N. V. Abhishek, and M. Gurusamy, “D3T: Double Deep Q-Network Decision Transformer for Service Function Chain Placement,” in 2023 IEEE 24th International Conference on High Performance Switching and Routing (HPSR), pp. 167–172.
  • [30] A. Sivanathan, H. H. Gharakheili, and V. Sivaraman, “Inferring IoT Device Types from Network Behavior Using Unsupervised Clustering,” in 2019 IEEE 44th Conference on Local Computer Networks (LCN), 2019, pp. 230–233.
  • [31] S. Basu, A. Banerjee, and R. J. Mooney, “Semi-supervised Clustering by Seeding,” in Machine Learning, Proceedings of the Nineteenth International Conference (ICML), 2002, pp. 27–34.
  • [32] D. Xu and Y. Tian, “A comprehensive survey of clustering algorithms,” Annals of Data Science, vol. 2, pp. 165–193, 2015.
  • [33] H. He, Y. He, F. Wang, and W. Zhu, “Improved k-means algorithm for clustering non-spherical data,” Expert Systems, vol. 39, no. 9, p. e13062, 2022.
  • [34] C. Yuan and H. Yang, “Research on K-value selection method of K-means clustering algorithm,” J, vol. 2, no. 2, pp. 226–235, 2019.
  • [35] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata, “Feature Generating Networks for Zero-Shot Learning,” in Proc. CVPR, 2018, pp. 5542–5551.
  • [36] R. Gao, X. Hou, J. Qin, J. Chen, L. Liu, F. Zhu, Z. Zhang, and L. Shao, “Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning,” IEEE Transactions on Image Processing, vol. 29, pp. 3665–3680, 2020.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [38] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big bird: Transformers for Longer Sequences,” Advances in neural information processing systems, vol. 33, pp. 17283–17297, 2020.
  • [39] A. Sivanathan, H. H. Gharakheili, F. Loi, A. Radford, C. Wijenayake, A. Vishwanath, and V. Sivaraman, “Classifying IoT Devices in Smart Environments Using Network Traffic Characteristics,” IEEE Transactions on Mobile Computing, vol. 18, no. 8, pp. 1745–1759, 2018.
  • [40] “ZEST Source Code,” https://github.com/Binghui99/ZEST, 2023.