License: CC BY 4.0
arXiv:2403.10550v1 [cs.LG] 13 Mar 2024

Semi-Supervised Learning for Anomaly Traffic Detection via Bidirectional Normalizing Flows

Zhangxuan Dang¹, Yu Zheng¹, Xinglin Lin¹, Chunlei Peng¹, Qiuyu Chen², Xinbo Gao³
¹ Xidian University
² Amazon
³ Chongqing University of Posts and Telecommunications
Abstract

With the rapid development of the Internet, various types of anomaly traffic are threatening network security. We consider the problem of anomaly network traffic detection and propose a three-stage anomaly detection framework using only normal traffic. Our framework can generate pseudo anomaly samples without prior knowledge of anomalies to achieve the detection of anomaly data. Firstly, we employ a reconstruction method to learn the deep representation of normal samples. Secondly, these representations are normalized to a standard normal distribution using a bidirectional flow module. To simulate anomaly samples, we add noise to the normalized representations, which are then passed through the generation direction of the bidirectional flow module. Finally, a simple classifier is trained to differentiate the normal samples and pseudo anomaly samples in the latent space. During inference, our framework requires only two modules to detect anomalous samples, leading to a considerable reduction in model size. According to the experiments, our method achieves state-of-the-art results on the common benchmarking datasets of anomaly network traffic detection. The code is available at https://github.com/ZxuanDang/ATD-via-Flows.git

1 Introduction

With the development of the Internet, the proliferation of devices has led to explosive growth in the Internet traffic, which poses significant challenges to the management of network resources and the assurance of network security. In particular, the increasing complexity and diversity of network attacks require systems to enhance their ability to detect anomaly traffic. Anomaly network traffic detection is a vital component in ensuring network security by detecting anomaly traffic passing through computer network nodes. Such network traffic may include malicious activity that is not in alignment with normal behavior. It is critical to maintaining the security of the network infrastructure and reduces the likelihood of network intrusions.

Supervised methods are used to detect anomaly traffic [1, 2, 3, 4, 5, 6]. For example, a machine learning classification model, trained on appropriately labelled manual features, will declare anomaly traffic when the data does not follow the normal distribution. However, the main drawbacks of supervised anomaly detection are [7, 8, 9, 10]: (1) Collecting anomaly traffic would be a time-consuming and labor-intensive task due to the nature of the anomaly traffic; (2) It can be challenging to obtain accurate and representative labels for normal and abnormal traffic. Due to limited access to a large amount of anomaly data, semi-supervised methods are often adopted for detecting anomalies by training only on normal traffic.

Figure 1: (a) Anomalies in images comprise both colour and shape. Based on prior knowledge of anomaly patterns, anomalies can be simulated in images by introducing "noise" [11, 12]. (b) Network traffic anomaly patterns are difficult to generalise. Simulating abnormal network traffic packets by directly introducing "noise" may destroy the semantic information of the data packets and produce meaningless pseudo anomalies, as shown in Section 4.5. Our framework is able to simulate anomaly samples without prior knowledge of anomaly patterns.

Alternative methods generate network traffic to address data labeling and scaling issues. For example, Ring et al. propose three different preprocessing methods for GANs that generate flow-based data and evaluate the quality of the generated flows [13]. In data imputation, SS-GACN and GACN allow for missing values in data labels and features [14]. The methods impute the missing data features based on classification accuracy. Both of these methods demonstrate that GANs can successfully generate real network traffic. However, the generation of network traffic by GANs necessitates large-scale data. Moreover, such techniques can only generate network traffic with a distribution similar to the collected data, making it challenging to simulate anomaly traffic from diverse distributions [15].

In this work, we propose a novel method for simulating anomalies that uses only normal traffic during training. Our method can generate anomaly samples of network traffic without any prior knowledge of the anomalies, thereby improving anomaly detection. Anomaly simulation-based methods are often used for anomaly detection in computer vision [16, 17, 18]. As shown in Figure 1, they generate new data outside the distribution of normal data by applying transformations such as rotation, CutPaste [11], flipping, and Cutout [19] to the original normal images, and then classify the data using a classifier. It has been shown that this approach can successfully distinguish between normal and anomaly samples [11].

Using prior knowledge of anomaly patterns, geometric transformations enable the generation of anomaly samples by altering the colour and shape of normal images. For network traffic packets, it is difficult to simulate anomaly patterns with such transformations. We cannot directly use geometric transformations such as Cutout to obtain anomaly samples: packets are one-dimensional data structures with precise semantics and no spatial semantics, so these transformations produce meaningless pseudo anomalies, as shown in Section 4.5. Therefore, a bidirectional flow module is proposed in our method. This module normalizes the features of normal packets to a specific tractable distribution. Unknown anomaly samples fall outside this distribution after normalization, as the experiments show. By manipulating the vectors in the distribution, we are able to change the properties of the samples, enabling the simulation of anomalies [20]. By introducing noise to the normalized features of normal samples, we can make them deviate from the distribution of normal samples. Then, through the generation direction of the flow, we can map them back to the original space to generate anomaly samples. Our framework introduces random noise to simulate anomaly samples even when the anomaly pattern is unknown. By conducting a proxy classification task between the normal samples and the synthetic ones, we help the model identify normal samples accurately. As shown in Section 4, pseudo anomaly samples help the model to better detect normal representations, even if they barely overlap with real anomaly traffic. To the best of our knowledge, this is the first time that a normalizing flow module has been used to generate anomaly network traffic samples with no prior knowledge of anomaly patterns and has led to good results in anomaly detection.

In summary, the main contributions of this paper are threefold:

  • This paper introduces normalizing flows to formulate a three-stage framework for anomaly traffic detection that uses only normal traffic data. The normalizing flows are used to process the packet features obtained from the feature extractor. The strong performance observed in downstream anomaly detection demonstrates the potential of normalizing flows for anomaly traffic detection.

  • This paper embeds the normalizing flows into the process of generating anomaly traffic samples with no prior knowledge of anomaly patterns. The proposed bidirectional flow module effectively utilizes both normalization and generation directions to simulate anomaly samples by manipulating the normalized vector without prior knowledge of the anomaly patterns. The detection results demonstrate a significant improvement in performance achieved through the simulated anomaly samples.

  • Our method outperforms other popular anomaly detection methods on three common benchmarking datasets for anomaly network traffic detection and is efficient in computation.

2 Related Work

Anomaly Network Traffic Detection Methods

In the current research on anomaly network traffic detection, deep learning-based methods are widely used. Some researchers combine traditional feature extraction with the classification ability of neural network models. Cao et al. [21] utilize the RFP algorithm for traffic feature extraction and incorporate convolutional neural networks (CNNs) and Gated Recurrent Units (GRU) for classification. Saba et al. [22] utilize a CNN-based classification approach to predict anomaly traffic based on the features of traffic datasets. Liu et al. [23] utilize a BP neural network model to detect manually extracted flow features.

Other researchers leverage the powerful feature extraction capability of neural network models to extract features from traffic data, thereby enhancing the performance of classifiers. Shone et al. [24] utilize a non-symmetric autoencoder for feature extraction and subsequently integrate it with random forests for intrusion detection. Javaid et al. [25] propose a sparse autoencoder to learn feature extraction from unlabeled data, effectively leveraging the available data to enhance the feature extraction capability. Subsequently, they apply a classification task for detection purposes. Wang et al. [26] employ CNNs and long short-term memory (LSTM) networks to effectively learn spatial and temporal features for classification. These methods are all fully supervised learning methods, which require collecting a large amount of labeled anomaly traffic. In contrast, our method achieves effective anomaly detection by leveraging easily collectible normal data without the need for labeled anomaly traffic.

Anomaly Detection Methods

The research on anomaly detection encompasses various methods, including reconstruction-based, feature matching-based, and anomaly simulation-based approaches. Reconstruction-based methods aim to detect anomaly samples through the analysis of reconstruction errors. Akcay et al. [27] detect anomaly images by reconstructing the latent vectors using encoders. Feature matching-based methods calculate the difference between test samples and stored embeddings to detect anomaly samples. Roth et al. [28] obtain anomaly scores by calculating the distance between test samples and normal embeddings stored in a memory bank. Anomaly simulation-based methods use synthetic anomaly samples to enhance feature extraction or to help the models clearly distinguish between normal and anomaly samples. Some studies use geometric transformations to simulate anomaly samples [17, 18, 11, 29], while others simulate anomaly samples by adding noise [12, 30, 31]. The experimental results in these papers show that simulated anomaly samples enhance the overall detection performance. However, these methods directly process images to obtain synthetic anomaly samples, which cannot be applied to network traffic. Our method can simulate network traffic anomaly samples and demonstrates excellent performance in anomaly traffic detection.

Traffic Generation Methods

In the current work on network traffic analysis, there are many studies that use traffic generation to improve the performance of detection systems.

On the one hand, in research on preventing adversarial sample attacks, traffic generation techniques are used to generate adversarial samples to improve the robustness and accuracy of the model. Elie et al. [32] focus on the attack perspective and investigate techniques to generate adversarial examples that can evade machine learning models. They specifically explore the use of evolutionary computation and deep learning as tools for adversarial example generation. Ye et al. [33] propose a defense algorithm using a bidirectional generative adversarial network (GAN) to improve the robustness and accuracy of NIDS in the adversarial environment. The algorithm involves training the generator to learn the data distribution of normal samples and using the discriminator to detect adversarial samples based on reconstruction and matching errors. Wang et al. [34] propose the Def-IDS mechanism, a two-module training framework that integrates multi-class generative adversarial networks and multi-source adversarial retraining to improve model robustness while maintaining detection accuracy on unperturbed samples. Zolbayar et al. [35] develop a GAN-based attack algorithm called NIDSGAN to generate realistic adversarial network traffic flows that can evade ML-based NIDS. GADoT [36] is an adversarial training framework that leverages GANs to generate fake-benign samples for perturbing DDoS samples; it is evaluated using network traffic traces capturing adversarially perturbed SYN and HTTP DDoS flood attacks.

On the other hand, in the work of network traffic classification, traffic generation techniques are used to perform data augmentation on the traffic data to synthesise network packets that are as realistic as possible. Shahid et al. [37] propose combining an autoencoder with a Generative Adversarial Network (GAN) to generate sequences of packet sizes that mimic the behavior of real bidirectional flows. The autoencoder is trained to learn a latent representation of real packet size sequences, and the GAN is trained on the latent space to generate realistic sequences. Nukavarapu et al. [38] introduce MirageNet, a GAN-based framework for synthetic network traffic generation. The first component of MirageNet, MiragePkt, validates the performance of their framework using synthesized DNS packets. Yin et al. [39] propose an end-to-end framework called NetShare to explore the feasibility of using Generative Adversarial Networks (GANs) to generate synthetic packet and flow header traces for networking tasks. Hui et al. [40] propose a knowledge-enhanced generative adversarial network (GAN) framework to generate realistic IoT traffic. The framework incorporates semantic knowledge and network structure knowledge of various IoT devices through a knowledge graph.

Generative Adversarial Networks (GANs) are frequently utilised in all of these approaches to generate network traffic. However, generating network traffic through GANs demands a vast amount of training samples, and the procurement of malicious traffic can be challenging. In addition, GANs can only generate samples that are similar to the training set, making it difficult to generate data outside of the training distribution [15].

Figure 2: An overview of our framework for anomaly detection. $c$ corresponds to the representation of normal packets in the standard normal space, and $\eta$ corresponds to the noise vector sampled from a Gaussian distribution. The Feature Extractor is trained to perform deep feature extraction on one-dimensional normal packets. The Bidirectional Flow Module is trained to normalize the representation of normal packets to a standard normal distribution. During the training of the Classifier, the representation of normal packets is normalized to a standard normal distribution. In the standard normal space, we introduce noise sampled from a Gaussian distribution to the normalized representation, and then simulate the representation of anomaly traffic through the generation direction. The Classifier is trained to distinguish the representation of normal packets $z$ and the simulated representation of anomaly packets $\hat{z}$, enabling efficient anomaly detection. In the inference phase, our method can achieve good anomaly detection by maintaining only two modules, which greatly reduces the size of the model.

3 Proposed Method

As shown in Figure 2, our framework is developed in three stages: feature extractor, bidirectional flow module, and classifier. In the following, we will introduce each stage.

3.1 Feature Extractor

A traffic packet is a one-dimensional structure with different protocol layers. Following the pre-processing steps shown in Section 4.2, the headers of the Network Layer and Transport Layer, together with the payloads, form the one-dimensional vector input for the model. Intuitively, an extracted feature vector that effectively represents the traffic packet is crucial for downstream tasks, because the original packet often contains a significant amount of redundant information that can confuse the model. Related experiments are described in Section 4.5. In the field of computer vision, pre-trained models such as ResNet18 and ResNet50 are often used when feature extraction is required. To perform feature extraction on original traffic packets, we pre-train the feature extractor using our datasets of normal samples.

As we only have normal samples, we use a reconstruction method to extract features. Our feature extractor is composed of a generator $G$ and a discriminator $D$, as illustrated in Figure 2. The generator $G$ is further divided into an encoder $G_{E}$ and a decoder $G_{D}$.

Similar to [27], we adopt both reconstruction loss and adversarial loss for better reconstruction training. The training objectives of the model are as follows:

$$\mathcal{L}_{G} = \omega_{adv}\,\mathbb{E}_{x\sim p_{\mathbf{X}}}\|D(x)-D(G(x))\|_{2}^{2} + \omega_{rec}\,\mathbb{E}_{x\sim p_{\mathbf{X}}}\|x-G(x)\|_{2}^{2} \tag{1}$$

$$\mathcal{L}_{D} = \mathbb{E}_{x\sim p_{\mathbf{X}}}[1-D(x)] + \mathbb{E}_{x\sim p_{\mathbf{X}}}[D(G(x))] \tag{2}$$

where $\hat{x} = G(x)$.

We use one-dimensional traffic packets $x \in \mathbb{R}^{n}$ as the input for our model. The encoder $G_{E}$ compresses the input $x$ into a hidden vector $z$, while the decoder $G_{D}$ reconstructs the input packet from the hidden vector $z$ and outputs $\hat{x}$. We believe that if the reconstruction $\hat{x}$ produced by the model is very close to the input $x$, then the hidden vector $z$ can effectively represent the features of the network packet. After pre-training, we retain only the encoder $G_{E}$ of the feature extractor for extracting features from the traffic packets, resulting in a significant reduction in the model size.
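As a rough illustration of Eqs. (1) and (2), the sketch below evaluates both objectives on a small batch using plain Python lists. It assumes that the discriminator outputs entering Eq. (1) are feature vectors and that those entering Eq. (2) are scalar scores in [0, 1]; the function names and the default weights `w_adv` and `w_rec` are placeholders for illustration, not values taken from the paper.

```python
from typing import Iterable, Sequence

Vector = Sequence[float]


def sq_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance ||a - b||_2^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values)


def generator_loss(x, x_hat, d_x, d_x_hat, w_adv=1.0, w_rec=1.0):
    """Eq. (1): weighted adversarial term plus reconstruction term.

    x, x_hat     : batch of inputs and their reconstructions G(x)
    d_x, d_x_hat : discriminator outputs D(x) and D(G(x)), assumed vectors
    w_adv, w_rec : loss weights (placeholder defaults)
    """
    adv = mean(sq_l2(a, b) for a, b in zip(d_x, d_x_hat))
    rec = mean(sq_l2(a, b) for a, b in zip(x, x_hat))
    return w_adv * adv + w_rec * rec


def discriminator_loss(d_real, d_fake):
    """Eq. (2): E[1 - D(x)] + E[D(G(x))], with scalar scores assumed."""
    return mean(1.0 - p for p in d_real) + mean(d_fake)
```

For example, `generator_loss([[0, 0], [1, 1]], [[0, 1], [1, 1]], [[1.0], [2.0]], [[1.0], [0.0]], w_adv=1.0, w_rec=2.0)` combines an adversarial term of 2.0 with a reconstruction term of 0.5.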

3.2 Bidirectional Flow Module

The normalizing flows contain both normalization and generation directions. Generative flow models leverage a sequence of invertible and differentiable operations to transform a simple and tractable distribution into a complex distribution [41]. Generally, it can be described by the following formula:

$$c \sim P_{\theta}(c) \tag{3}$$

$$z = g_{\theta}(c) \tag{4}$$

where $c$ is a random vector that follows a distribution $P_{\theta}$; typically $P_{\theta}$ is a simple distribution such as a standard normal distribution. $z$ is an extracted vector that follows the unknown true data distribution $P^{*}(z)$. The function $g_{\theta}$ is a reversible and differentiable function that can generate real samples from the complex distribution by utilizing samples from a simple distribution. This direction is often called the generation direction. The function $f_{\theta}$ is the inverse function of $g_{\theta}$, $f_{\theta} = g_{\theta}^{-1}$, which normalizes the real samples from the complex distribution into the space of a simple distribution. This direction is often called the normalization direction.

According to the change of variables, the probability density of the vector $z$ can be written in the following form:

$$\log p_{\theta}(\mathbf{z}) = \log p_{\theta}(\mathbf{c}) + \log|\det(d\mathbf{c}/d\mathbf{z})| \tag{5}$$

where $\log|\det(d\mathbf{c}/d\mathbf{z})|$ refers to the logarithm of the absolute value of the determinant of the Jacobian matrix ($d\mathbf{c}/d\mathbf{z}$), which can be easily calculated using matrix transformations [42], such as triangular matrix transformations.
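Eq. (5) can be checked on a one-dimensional toy case. The sketch below assumes a hypothetical flow $z = g(c) = ac + b$ with $c \sim \mathcal{N}(0, 1)$, so that $c = (z - b)/a$ and $\log|dc/dz| = -\log|a|$. Under that assumption $z$ is distributed as $\mathcal{N}(b, a^{2})$, and the two log-densities should agree. The function names are illustrative only.

```python
import math


def std_normal_logpdf(c: float) -> float:
    """Log-density of the standard normal base distribution."""
    return -0.5 * (c * c + math.log(2.0 * math.pi))


def flow_logpdf(z: float, a: float, b: float) -> float:
    """log p(z) for the toy flow z = a * c + b, computed via Eq. (5)."""
    c = (z - b) / a                # normalization direction f = g^{-1}
    log_det = -math.log(abs(a))    # log|dc/dz| = log|1/a|
    return std_normal_logpdf(c) + log_det


def normal_logpdf(z: float, mean: float, std: float) -> float:
    """Closed-form log-density of N(mean, std^2), used as a reference."""
    return (-0.5 * ((z - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))
```

In higher dimensions the same identity holds with the scalar derivative replaced by the Jacobian determinant, which is why flows favour layers whose Jacobian is triangular.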

In fact, it is difficult to construct a single function that is powerful, reversible, and differentiable and that also has an easy-to-calculate Jacobian [43]. So in normalizing flows, it is common to combine a sequence of reversible and differentiable functions to achieve the desired transformation. This is the reason why this approach is referred to as a "flow". Therefore, the transform between $c$ and $z$ can be written as:

$$\mathbf{z} \stackrel{\mathbf{g}_{1}}{\leftarrow} \mathbf{h}_{1} \stackrel{\mathbf{g}_{2}}{\leftarrow} \mathbf{h}_{2} \cdots \stackrel{\mathbf{g}_{K}}{\leftarrow} \mathbf{c} \tag{6}$$

$$\mathbf{z} \stackrel{\mathbf{f}_{1}}{\rightarrow} \mathbf{h}_{1} \stackrel{\mathbf{f}_{2}}{\rightarrow} \mathbf{h}_{2} \cdots \stackrel{\mathbf{f}_{K}}{\rightarrow} \mathbf{c} \tag{7}$$
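The chains in Eqs. (6) and (7) can be sketched with affine coupling layers, one common choice of invertible step. In the minimal pure-Python sketch below, the scale and shift functions are fixed toy functions standing in for learned networks, the input dimension is assumed to be even, and all class and function names are hypothetical. `normalize` applies $f_{1}, \dots, f_{K}$ and accumulates the log-determinant of Eq. (5), while `generate` applies the inverses in reverse order.

```python
import math
from typing import Callable, List, Tuple

Vec = List[float]


class AffineCoupling:
    """One invertible step: one half is kept, the other is scaled and shifted."""

    def __init__(self, scale_fn: Callable[[Vec], Vec],
                 shift_fn: Callable[[Vec], Vec], swap: bool = False):
        self.scale_fn, self.shift_fn, self.swap = scale_fn, shift_fn, swap

    def _split(self, v: Vec) -> Tuple[Vec, Vec]:
        if len(v) % 2:
            raise ValueError("this sketch assumes an even dimension")
        half = len(v) // 2
        first, second = v[:half], v[half:]
        # Returns (kept, transformed); swap alternates which half changes.
        return (second, first) if self.swap else (first, second)

    def _merge(self, kept: Vec, changed: Vec) -> Vec:
        return changed + kept if self.swap else kept + changed

    def forward(self, v: Vec) -> Tuple[Vec, float]:
        """Normalization step f_k; also returns log|det Jacobian|."""
        kept, x2 = self._split(v)
        s, t = self.scale_fn(kept), self.shift_fn(kept)
        y2 = [x * math.exp(si) + ti for x, si, ti in zip(x2, s, t)]
        return self._merge(kept, y2), sum(s)

    def inverse(self, v: Vec) -> Vec:
        """Generation step g_k = f_k^{-1}."""
        kept, y2 = self._split(v)
        s, t = self.scale_fn(kept), self.shift_fn(kept)
        x2 = [(y - ti) * math.exp(-si) for y, si, ti in zip(y2, s, t)]
        return self._merge(kept, x2)


class Flow:
    def __init__(self, layers: List[AffineCoupling]):
        self.layers = layers

    def normalize(self, z: Vec) -> Tuple[Vec, float]:
        """Eq. (7): z -> c, with the accumulated log|det(dc/dz)|."""
        h, log_det = list(z), 0.0
        for layer in self.layers:
            h, ld = layer.forward(h)
            log_det += ld
        return h, log_det

    def generate(self, c: Vec) -> Vec:
        """Eq. (6): c -> z, applying the inverse steps in reverse order."""
        h = list(c)
        for layer in reversed(self.layers):
            h = layer.inverse(h)
        return h

    def log_prob(self, z: Vec) -> float:
        """Eq. (5) with a standard normal base distribution."""
        c, log_det = self.normalize(z)
        log_base = -0.5 * sum(ci * ci + math.log(2.0 * math.pi) for ci in c)
        return log_base + log_det


def toy_scale(h: Vec) -> Vec:   # stand-in for a learned scale network
    return [0.5 * math.tanh(x) for x in h]


def toy_shift(h: Vec) -> Vec:   # stand-in for a learned shift network
    return [0.3 * x + 0.1 for x in h]


flow = Flow([AffineCoupling(toy_scale, toy_shift, swap=(k % 2 == 1))
             for k in range(4)])
```

Because each step leaves one half untouched, its Jacobian is triangular and the log-determinant reduces to the sum of the scale outputs; training would maximize `log_prob` over normal features.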

We follow the work in [41] and employ affine coupling layers in each block of the bidirectional flow module, as shown in Figure 2. Our bidirectional flow module takes the features extracted by the feature extractor as input, and we maximize Eq. 5 to train this module through the normalization direction. After training, the bidirectional flow module maps the features of normal samples to the standard normal distribution in the space $V$. The situation is different for anomaly samples. During the feature extraction phase, the pre-trained feature extractor has only been trained on normal samples. As a result, the features extracted from anomaly samples are likely to deviate from the distribution of normal features. The bidirectional flow module is also trained only on normal samples, which means that anomaly features, deviating from the normal sample distribution, may fall outside the standard normal distribution in the space $V$, as shown in Figure 4. To simulate the distribution of anomaly samples, we introduce noise into normal samples in the standard normal space $V$. For the sake of simplicity, we choose to randomly sample from Gaussian distributions to generate noise. We employ a reparameterization trick to represent the noise:

$$\eta = \mu + \sigma \odot \varepsilon \tag{8}$$

where $\mu$ represents the mean vector, $\sigma$ represents the standard deviation vector, and $\varepsilon$ is the random noise sampled from the standard normal distribution. By adjusting $\mu$ and $\sigma$, we can generate noise samples from different Gaussian distributions.

Then we use the generation direction of the bidirectional flow module to transform the simulated anomaly samples in the standard normal space $V$, resulting in vectors in the latent space. In this way, we obtain the representation vectors of both normal samples and anomaly samples in the latent space. In the field of computer vision, operations such as CutPaste [11] are applied to images of normal samples to obtain synthetic anomaly samples. However, when it comes to traffic packets, it is challenging to obtain anomaly samples that closely resemble the real network environment through image processing techniques. This is because traffic packets lack spatial semantic information. Therefore, it is necessary to use the bidirectional flow module to map the feature vectors to the standard normal space $V$ to construct anomaly samples. In the standard normal space $V$, we can manipulate the attributes of the vectors to bring them closer to real anomaly traffic samples in the latent space. This capability allows for the generation of synthetic anomaly samples that exhibit similarity to real-world anomaly traffic patterns.
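The simulation procedure described in this section can be sketched as follows: normalize a latent feature, perturb it with noise drawn as in Eq. (8), and map the result back through the generation direction. The sketch assumes that the noise is added to the normalized vector and uses a toy elementwise affine flow in place of the trained module; `simulate_anomaly`, `toy_normalize`, and `toy_generate` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence

Vec = List[float]


def sample_noise(mu: Sequence[float], sigma: Sequence[float],
                 rng: random.Random) -> Vec:
    """Eq. (8): eta = mu + sigma * eps (elementwise), eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def simulate_anomaly(z: Sequence[float],
                     normalize: Callable[[Sequence[float]], Vec],
                     generate: Callable[[Sequence[float]], Vec],
                     mu: Sequence[float], sigma: Sequence[float],
                     rng: random.Random) -> Vec:
    """Map a normal feature z to a pseudo anomaly feature z_hat."""
    c = normalize(z)                              # latent space -> space V
    eta = sample_noise(mu, sigma, rng)            # Gaussian noise, Eq. (8)
    c_noisy = [ci + ei for ci, ei in zip(c, eta)]  # assumed additive noise
    return generate(c_noisy)                      # space V -> latent space


# Toy invertible flow (assumption): c = (z - shift) / scale, elementwise.
SCALE, SHIFT = [2.0, 0.5], [1.0, -1.0]


def toy_normalize(z: Sequence[float]) -> Vec:
    return [(zi - b) / a for zi, a, b in zip(z, SCALE, SHIFT)]


def toy_generate(c: Sequence[float]) -> Vec:
    return [ci * a + b for ci, a, b in zip(c, SCALE, SHIFT)]
```

With `sigma` set to zero the perturbation is a deterministic shift by `mu`, which makes the round trip easy to verify by hand; larger `mu` or `sigma` pushes the generated feature further from the normal distribution.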

3.3 Classifier

After obtaining normal and pseudo anomaly sample features, it is natural and straightforward to consider using a classifier for anomaly detection. We employ only a simple classifier to classify the obtained normal and anomaly feature vectors. We aim to improve the detection performance on real anomaly samples by encouraging the classifier to focus more on real normal samples. To achieve this, we reduce the number of anomaly samples to half that of the normal samples. This prevents the model from overemphasising the features of the pseudo anomaly samples. In the testing phase, we only need to combine the classifier and the encoder $G_{E}$ for anomaly detection, which significantly reduces the model's parameters and enhances the model's deployment flexibility. We detect anomaly traffic based on the classification results.
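A minimal sketch of this stage is given below. It assumes a logistic-regression classifier trained by full-batch gradient descent as a stand-in for the paper's "simple classifier", whose architecture is not specified here; the 2:1 ratio of normal to pseudo anomaly features follows the text. All function names and hyperparameters are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def subsample_pseudo_anomalies(normal: List[Vec], pseudo: List[Vec],
                               rng: random.Random) -> List[Vec]:
    """Keep half as many pseudo anomalies as there are normal samples."""
    k = min(len(pseudo), len(normal) // 2)
    return rng.sample(pseudo, k)


def sigmoid(u: float) -> float:
    # Numerically stable logistic function.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def anomaly_score(x: Vec, w: Sequence[float], b: float) -> float:
    """Probability that feature x is an anomaly (label 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(features: List[Vec], labels: List[int],
                   epochs: int = 500, lr: float = 0.1) -> Tuple[List[float], float]:
    """Full-batch gradient descent on the binary cross-entropy loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = anomaly_score(x, w, b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b
```

At inference time only the encoder and this classifier are needed: a packet is encoded to a feature and `anomaly_score` is thresholded, so the flow module and decoder can be dropped.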

Figure 3: Histogram with density curve. We plot the detection result histogram of the samples in the testing sets of the three datasets. The curve represents the kernel density estimation of the results. Our method is more effective in distinguishing between anomaly and normal traffic on the "UNB-CIC Tor and non-Tor" and the "ISCX VPN and non-VPN" datasets. Although the results of distinguishing normal and anomaly traffic on the "DataCon2020" dataset are not satisfactory, they are still better than those achieved by other methods.
Methods                      VPN      TOR      DataCon
Reverse Distillation [44]    0.6116   0.7450   0.6762
DFKDE [45]                   0.5907   0.7356   0.3969
DFM [46]                     0.7156   0.7514   0.6744
DRAEM [12]                   0.5698   0.7028   0.6479
FastFlow [47]                0.6195   0.6689   0.6571
PADIM [48]                   0.6726   0.7516   0.6768
PatchCore [28]               0.7058   0.7434   0.4605
STFPM [49]                   0.5657   0.7371   0.6292
CFlow [50]                   0.5433   0.7025   0.5850
GANomaly [27]                0.6239   0.7823   0.6871
GANomaly_1d                  0.5913   0.7166   0.6884
Ours                         0.8658   0.8458   0.7292
Table 1: Anomaly detection AUROC of state-of-the-art methods on DataCon2020, ISCX VPN and non-VPN, UNB-CIC Tor and non-Tor datasets. We set up different random seeds for three experiments to obtain the average results. Our method achieves the best detection performance on each dataset.
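For reference, the AUROC reported in Table 1 can be computed from anomaly scores and ground-truth labels as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample. The sketch below is a direct pairwise implementation of that definition and is not the evaluation code used in the paper; the function name is illustrative.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for anomaly, 0 for normal. Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for test sets of a few thousand samples per class; a rank-based formulation would be preferable for larger sets.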

4 Experimental Results

4.1 Datasets

We have selected three widely used network traffic datasets for our experiments.

The "UNB-CIC Tor and non-Tor" dataset, captured by Arash et al. [51], is collected using Wireshark and tcpdump. The dataset includes both regular and Tor traffic captured from the workstation and gateway, encompassing 14 categories of traffic such as Chat, Streaming, Email, and others.

The "ISCX VPN and non-VPN" dataset, captured by Gerard et al. [52], is collected using Wireshark and tcpdump. During the capturing process, only the packets with the target IP were captured. The dataset comprises a total of 14 categories for regular and VPN traffic, including File Transfer, P2P, and more.

Model                        Params (M)   FLOPs (G)
PADIM [48]                   2.78         0.05
GANomaly [27]                10.73        0.65
GANomaly_1d                  45.89        11.94
FastFlow [47]                7.46         0.13
DRAEM [12]                   97.43        3.11
Reverse Distillation [44]    80.61        0.61
CFlow [50]                   6.45         0.15
STFPM [49]                   5.57         0.10
Ours                         3.91         0.02
Table 2: Comparison of model parameters and FLOPs in the inference phase. Our approach has the lowest FLOPs and the best detection performance, while also having small model parameter sizes. The effectiveness of the PADIM method depends on the performance of the pre-trained model employed. The size of the model will increase as the capability of the pre-trained model increases.

The “DataCon2020” dataset [53] is derived from malicious and benign software collected between February and June 2020. The traffic was generated by sandbox collection from Qi An Xin. The dataset defines malicious traffic as encrypted traffic generated by malware, while benign traffic is encrypted traffic generated by benign software.

Our datasets cover encrypted and unencrypted traffic as well as benign and malicious traffic. Because encrypted traffic is not permitted in some scenarios, we define encrypted traffic and malicious traffic as anomaly traffic, and unencrypted traffic and benign traffic as normal traffic. Our training set consists only of normal samples. For each dataset, we randomly select 10,000 normal samples to form the training set. We then build the testing set by randomly selecting 5,000 samples from the remaining normal samples and another 5,000 samples from the abnormal samples.
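The split described above can be sketched as follows. This is a minimal illustration rather than the released code: the function name `split_dataset`, the label convention (0 = normal, 1 = anomaly), and the fixed seed are assumptions made for the example.

```python
import random


def split_dataset(normal, anomalous, n_train=10_000, n_test_each=5_000, seed=0):
    """Return (train, test) lists of (sample, label) pairs.

    The training set contains normal samples only; the test set mixes
    held-out normal samples (label 0) with anomalous samples (label 1).
    """
    if len(normal) < n_train + n_test_each:
        raise ValueError("not enough normal samples for a disjoint split")
    if len(anomalous) < n_test_each:
        raise ValueError("not enough anomalous samples")

    rng = random.Random(seed)
    order = list(range(len(normal)))
    rng.shuffle(order)

    # Disjoint index ranges keep test normals out of the training set.
    train_idx = order[:n_train]
    test_normal_idx = order[n_train:n_train + n_test_each]
    test_anom_idx = rng.sample(range(len(anomalous)), n_test_each)

    train = [(normal[i], 0) for i in train_idx]
    test = [(normal[i], 0) for i in test_normal_idx]
    test += [(anomalous[i], 1) for i in test_anom_idx]
    return train, test
```

Drawing the test normals from the part of the shuffled pool that was not used for training guarantees that no packet appears in both sets.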

4.2 Pre-processing

Traffic cleaning

We first remove packets that may introduce confusion in anomaly detection. The DNS protocol maps domain names to IP addresses. Both normal and anomaly traffic obtain IP addresses through DNS before carrying out their activities, so these packets are not directly associated with normal or anomaly characteristics and do not contribute to anomaly detection. The TCP protocol has a series of connection-related stages, including connection establishment, termination, and acknowledgment. The corresponding packets often carry no actual payload; they relate only to the connection and have no direct relevance to specific activities. We therefore remove both types of packets [54]. In addition, the ARP protocol is responsible for the mapping between IP addresses and MAC addresses, but it is usually not directly associated with the activities of upper-layer users. Therefore, we also remove ARP packets.

Then we process the structure information in the packet. The packet header of the data link layer is often responsible for managing the transmission of specific physical links and cannot provide sufficient information for anomaly detection. Therefore, we remove the data link layer header. In the header of the IP protocol, there are two fields, the destination IP address and the source IP address. These two fields can summarize a series of data communications between hosts. Detecting anomaly traffic only based on IP addresses is considered a shortcut rather than true learning. Therefore, we anonymize the IP addresses of all packets.
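As an illustration of the header processing described above, the sketch below drops the link-layer header and blanks the two IPv4 address fields. It rests on several assumptions that the text does not state: frames are Ethernet II without VLAN tags (14-byte header), packets are IPv4, and anonymization is done by zeroing the addresses. The function name `strip_and_anonymize` is hypothetical.

```python
ETH_HEADER_LEN = 14   # Ethernet II header without VLAN tag (assumption)
ETHERTYPE_IPV4 = 0x0800
IPV4_SRC_OFFSET = 12  # source address: bytes 12..15 of the IPv4 header
IPV4_DST_OFFSET = 16  # destination address: bytes 16..19 of the IPv4 header


def strip_and_anonymize(frame: bytes) -> bytes:
    """Remove the Ethernet header and zero the IPv4 source/destination."""
    if len(frame) < ETH_HEADER_LEN + 20:
        raise ValueError("frame too short for Ethernet + IPv4 headers")
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype != ETHERTYPE_IPV4:
        raise ValueError("only IPv4 frames are handled in this sketch")

    packet = bytearray(frame[ETH_HEADER_LEN:])
    # Zeroing is one simple anonymization choice; it leaves the IPv4
    # header checksum stale, which this sketch does not recompute.
    packet[IPV4_SRC_OFFSET:IPV4_SRC_OFFSET + 4] = b"\x00" * 4
    packet[IPV4_DST_OFFSET:IPV4_DST_OFFSET + 4] = b"\x00" * 4
    return bytes(packet)
```

Any scheme that removes host identity from these eight bytes would serve the same purpose of preventing the model from using addresses as a shortcut.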

Traffic encoding

We use byte encoding to process the packets, converting each byte of a packet into its decimal value. This yields a one-dimensional array per packet. To ensure uniformity and make the input easier for the model to learn from, we pad the optional fields of the IP and TCP headers. The transport layer contains two protocols, TCP and UDP, so we also pad the UDP header to match the length of the TCP header, ensuring consistency between the two protocols. The neural network requires all packets to have the same length. Taking into account our statistical analysis of packet lengths and the preprocessing requirements of the comparative experiments, we set the length of the one-dimensional packet array to 1600 bytes. We then normalize the packets to the range 0 to 1. Finally, we combine the processed packets with their labels and store them in CSV format.
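The encoding step can be sketched in a few lines. The sketch assumes that short packets are zero-padded at the end, that long packets are truncated, and that normalization divides each byte by 255; the text fixes only the target length of 1600 and the range 0 to 1. Header padding for IP/TCP options and UDP is not shown.

```python
PACKET_LEN = 1600  # fixed input length stated in the text


def encode_packet(packet: bytes, length: int = PACKET_LEN) -> list:
    """Map a raw packet to a fixed-length list of floats in [0, 1]."""
    values = list(packet[:length])            # bytes -> integers in 0..255
    values += [0] * (length - len(values))    # zero-pad short packets
    return [v / 255.0 for v in values]        # scale to [0, 1]
```

For example, `encode_packet(bytes([0, 255, 51]))` starts with `0.0, 1.0, 0.2` and is followed by zeros up to position 1600.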

4.3 Implementation Details

Our model is trained for 100 epochs with early stopping. For the generator $G$ and discriminator $D$ in the feature extractor, we use two Adam optimizers, each with a learning rate of 0.001 and betas of 0.5 and 0.999. The weight $w_{adv}$ in the feature extractor loss is set to 1, while $w_{rec}$ is set to 50. We set the size of the hidden vector extracted by the feature extractor to 70. The bidirectional flow module consists of 8 coupling blocks and is trained with the Adam optimizer with a learning rate of 0.001 and betas of 0.5 and 0.999. The two parameters $\mu$ and $\sigma$ used for generating noise are determined experimentally. The classifier is likewise trained with the Adam optimizer with a learning rate of 0.001 and betas of 0.5 and 0.999. These hyperparameters were determined through grid search and experiments.
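For reference, the settings listed above can be collected in one configuration object. The class and field names below are illustrative assumptions and are not taken from the released code; the per-dataset noise parameters are read from the "Ours" rows of Tables 4 and 5.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AdamConfig:
    lr: float = 1e-3
    betas: tuple = (0.5, 0.999)


def _default_noise():
    # (mu, sigma) per dataset, from the "Ours" rows of Tables 4 and 5.
    return {"VPN": (-9.0, 5.0), "TOR": (-25.0, 5.0), "DataCon": (-100.0, 5.0)}


@dataclass(frozen=True)
class TrainConfig:
    max_epochs: int = 100        # early stopping may end training sooner
    w_adv: float = 1.0           # adversarial weight in the extractor loss
    w_rec: float = 50.0          # reconstruction weight in the extractor loss
    latent_dim: int = 70         # size of the extracted hidden vector
    coupling_blocks: int = 8     # depth of the bidirectional flow module
    generator_opt: AdamConfig = field(default_factory=AdamConfig)
    discriminator_opt: AdamConfig = field(default_factory=AdamConfig)
    flow_opt: AdamConfig = field(default_factory=AdamConfig)
    classifier_opt: AdamConfig = field(default_factory=AdamConfig)
    noise: dict = field(default_factory=_default_noise)


cfg = TrainConfig()
print(cfg.flow_opt, cfg.noise["TOR"])
```

The early-stopping criterion is not specified in the text, so it is deliberately left out of the sketch.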

Noise distribution (μ, σ) and detection results per dataset
μ      σ      TOR      VPN      DataCon
-100   0.1    0.6344   0.4598   0.6850
-100   0.5    0.5888   0.5771   0.6906
-100   5      0.6766   0.7582   0.7292
-100   15     0.6051   0.7221   0.7137
-25    0.1    0.6482   0.7609   0.4488
-25    0.5    0.7189   0.8191   0.5752
-25    5      0.8458   0.7732   0.7146
-25    15     0.8161   0.7891   0.7092
-10    0.1    0.6553   0.8201   0.6893
-10    0.5    0.6138   0.8414   0.6023
-10    5      0.5583   0.8602   0.6494
-10    15     0.7696   0.7892   0.5721
-9     0.1    0.6526   0.8118   0.6971
-9     0.5    0.7909   0.8155   0.6710
-9     5      0.5763   0.8658   0.3903
-9     15     0.6485   0.7780   0.7062
-5     0.1    0.6523   0.7979   0.6531
-5     0.5    0.5938   0.7042   0.5486
-5     5      0.6146   0.4137   0.6834
-5     15     0.7199   0.7368   0.6655
5      0.1    0.7695   0.7910   0.7058
5      0.5    0.5749   0.8082   0.6985
5      5      0.7004   0.7707   0.6828
5      15     0.7418   0.8076   0.6926
9      0.1    0.6364   0.6703   0.6740
9      0.5    0.6331   0.8160   0.7020
9      5      0.6545   0.7989   0.6924
9      15     0.6471   0.7427   0.6697
10     0.1    0.4361   0.8370   0.7207
10     0.5    0.6484   0.4467   0.7026
10     5      0.7855   0.7529   0.6966
10     15     0.6094   0.8002   0.6836
25     0.1    0.5958   0.6327   0.6486
25     0.5    0.4326   0.7602   0.7101
25     5      0.8331   0.5499   0.6550
25     15     0.5875   0.8542   0.6755
100    0.1    0.7961   0.1944   0.6966
100    0.5    0.6590   0.7584   0.6836
100    5      0.5877   0.8143   0.6998
100    15     0.5606   0.8347   0.6805
Table 3: Detection performance on different noise distributions. The noise distribution determines the generated pseudo anomaly samples. Using the reparameterization trick, we can easily obtain different noise distributions. We explore the effects of different noise distributions on the three datasets.

4.4 Comparison with Existing Methods

We analyze anomaly network traffic with our anomaly detection framework. To demonstrate its effectiveness, we compare it extensively with state-of-the-art anomaly detection methods, including knowledge distillation based [44, 49], reconstruction based [27, 12], normalizing flow based [47, 50], memory matching based [28], and distribution learning based methods [46, 12, 48]. The code and settings of the compared methods can be found in [45]. For fairness, we evaluate some methods on the three datasets by transforming the preprocessed packets from a one-dimensional structure into a two-dimensional grayscale image, since these methods employ pre-trained feature extractors designed for two-dimensional images. In addition, we have modified the model in [27] to obtain a variant that processes one-dimensional packets directly, which we refer to as GANomaly_1d and provide with the code. As stated in the caption of Table 1, we run three experiments with different random seeds and report the average AUROC.
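The evaluation metric can be made concrete with a small sketch. AUROC equals the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal sample, with ties counted as one half. The helper names `auroc` and `mean_auroc` are assumptions for illustration, and the pairwise loop is written for clarity rather than speed.

```python
from statistics import mean


def auroc(scores, labels):
    """AUROC for anomaly scores; label 1 = anomaly, 0 = normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # anomaly ranked above normal
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def mean_auroc(runs):
    """Average AUROC over runs; each run is a (scores, labels) pair."""
    return mean([auroc(scores, labels) for scores, labels in runs])
```

With scores `[0.1, 0.4, 0.35, 0.8]` and labels `[0, 0, 1, 1]`, three of the four anomaly/normal pairs are ordered correctly, so `auroc` returns 0.75. For test sets of the size used here, a sort-based implementation would be faster than the quadratic loop.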

As shown in Table 1, on the “UNB-CIC Tor and non-Tor” and “ISCX VPN and non-VPN” datasets our method achieves the best results, leading the second-ranked methods by about 6 and 15 percentage points of AUROC, respectively. In the knowledge distillation based anomaly detection methods [44, 49], the difference between the teacher model's and the student model's representations of a sample is used to detect anomalies. However, detection is not effective because the teacher model, pre-trained on ImageNet [55], has never seen network traffic samples. Additionally, converting packets from their original one-dimensional semantic structure to a two-dimensional format disrupts their inherent structure, and the added dimension contributes no additional information. Feature extraction is commonly employed in anomaly detection [56]. Our method employs a feature extractor that is pre-trained specifically on normal one-dimensional network traffic packets. We design this unsupervised feature extractor to represent network traffic packets effectively. This approach avoids directly generating raw packets, which would destroy the data structure and produce meaningless samples, as demonstrated in Section 4.5.3. The extracted features are more beneficial for the downstream anomaly detection task.

A traffic packet is a one-dimensional data structure that contains specific semantic fields but lacks spatial semantics [6]. For methods that use generated anomaly samples for anomaly detection, such as DRAEM [12], introducing noise into normal packet images is not meaningful: it destroys the semantic fields of the original packet and does not effectively generate anomaly samples, which leads to poor results. Current network traffic generation research [57, 39] also faces problems such as the difficulty of training GANs and the need to collect a large number of samples. Our method is not affected by these issues. It maps the normal samples to the standard normal distribution space and simulates anomaly samples by manipulating them in this space. This avoids disrupting the specific semantic fields of the original data packet and allows for a closer semantic alignment with real anomaly samples. At the same time, we need neither anomaly samples nor prior knowledge of anomaly patterns; we simulate anomaly samples using only normal samples and randomly sampled noise. By adjusting the distribution of the random noise, we are able to simulate different anomaly samples. On the “DataCon2020” dataset, our method also achieves the best results, but the overall performance of all methods is lower. We attribute this to the diversity of categories and the insufficient number of samples per category in this dataset, which pose challenges for every detection method. In addition, as shown in Figure 4, the similarity between normal and abnormal packet samples in the DataCon dataset is relatively high, which makes abnormal samples harder to detect.

As shown in Figure 3, we plot histograms of the detection results on the test sets of the three datasets to visualize the detection performance of our model. Our model distinguishes well between normal and anomaly traffic on both the “UNB-CIC Tor and non-Tor” and the “ISCX VPN and non-VPN” datasets. On the “DataCon2020” dataset, our model achieves the best detection performance among the compared methods, although there is still room for improvement.

We also compare the model sizes and FLOPs of the different methods during inference, as shown in Table 2. Our approach is well suited to detecting network traffic anomalies in deployment environments with limited computing power. After training, anomaly detection requires only the encoder part of the feature extractor and the classifier; the remaining modules are used only to support training. Unlike other normalizing flow based methods [47, 50], our approach does not use the flow module to compute anomaly scores. Instead, the flow module serves to synthesize anomaly samples during training.

[Figure 4: panels (a), (b), (c)]
Figure 4: t-SNE visualization of representations in latent space. We plot the features of the normal, anomaly, and synthetic anomaly samples. The synthetic anomaly samples do not overlap well with the real anomaly samples, but they differ significantly from the normal samples. The model learns to identify normal traffic accurately by distinguishing between normal and synthetic anomaly samples.

We employ t-SNE to project the features of the normal, anomaly, and synthetic anomaly samples into a 2D space, as shown in Figure 4. Ideally, the simulated anomaly samples would overlap well with the real anomaly samples. However, this would require more time spent on carefully designing the noise distribution. For simplicity, we sample the noise from a Gaussian distribution and use it to change the properties of the normal samples. While the synthetic anomaly features may not perfectly simulate real anomaly samples, the model improves its discrimination of normal samples by distinguishing between normal and synthetic anomaly samples.

w/o Bidirectional Flow Module
μ             σ       VPN      TOR      DataCon
-100          5       0.7557   0.5573   0.6768
-25           5       0.5324   0.6042   0.6705
-20           10      0.7736   0.4758   0.3844
-10           5       0.3972   0.5509   0.6723
-10           1       0.3746   0.4381   0.3440
-9            5       0.6804   0.6120   0.3334
-9            1       0.4185   0.6638   0.4544
-5            5       0.5244   0.7459   0.4250
5             1.5     0.3090   0.5880   0.6427
5             15      0.5873   0.6091   0.6594
9             0.1     0.7900   0.6344   0.6705
9             1       0.7767   0.5591   0.6760
10            5       0.4010   0.6177   0.3828
20            1       0.2760   0.5943   0.6770
25            5       0.6552   0.4819   0.6297

w/o Feature Extractor
μ             σ       VPN      TOR      DataCon
-100          5       0.7325   0.5711   0.7169
-25           5       0.5696   0.4260   0.3805
-20           10      0.6315   0.5567   0.6728
-10           5       0.3749   0.4729   0.6181
-10           1       0.6870   0.5819   0.5431
-9            5       0.6680   0.4549   0.6620
-9            1       0.6383   0.4909   0.5789
-5            5       0.6515   0.4481   0.6768
5             1.5     0.6810   0.5010   0.6783
5             15      0.7976   0.4838   0.4110
9             0.1     0.2565   0.5016   0.6624
9             1       0.3284   0.5169   0.6785
10            5       0.3127   0.4883   0.6603
20            1       0.5016   0.4624   0.6800
25            5       0.5243   0.6912   0.6745

Ours (both modules)
μ             σ       VPN      TOR      DataCon
-9/-25/-100   5/5/5   0.8658   0.8458   0.7292
Table 4: Detection performance when ablating different modules of our method. We conduct ablation experiments on the bidirectional flow module and the feature extractor to demonstrate the effectiveness of the proposed modules. The bottom row gives the results of our method with both modules, where the three values of $\mu$ and $\sigma$ correspond to the VPN, TOR, and DataCon datasets, respectively.
anomaly:normal = 1:1
μ             σ       VPN      TOR      DataCon
-100          5       0.3333   0.6416   0.6771
-25           5       0.8419   0.5984   0.6985
-20           10      0.7559   0.6317   0.6945
-10           5       0.7898   0.6274   0.6837
-10           1       0.6979   0.6394   0.7031
-9            5       0.8019   0.6781   0.6895
-9            1       0.7897   0.7820   0.6981
-5            5       0.7233   0.4607   0.6775
5             1.5     0.7899   0.7173   0.7137
5             15      0.5487   0.5469   0.6912
9             0.1     0.7639   0.5671   0.7001
9             1       0.3962   0.7811   0.6576
10            5       0.8308   0.6906   0.6855
20            1       0.7555   0.7493   0.6429
25            5       0.8197   0.6163   0.5783

anomaly:normal = 2:1
μ             σ       VPN      TOR      DataCon
-100          5       0.2876   0.3372   0.6693
-25           5       0.6676   0.6493   0.6827
-20           10      0.7154   0.4850   0.7059
-10           5       0.5435   0.5279   0.6670
-10           1       0.3853   0.7602   0.6894
-9            5       0.8445   0.5985   0.6731
-9            1       0.8083   0.6944   0.6915
-5            5       0.7969   0.6685   0.6545
5             1.5     0.8115   0.5829   0.6892
5             15      0.6553   0.5507   0.5893
9             0.1     0.7880   0.6879   0.6967
9             1       0.8460   0.6487   0.6999
10            5       0.7889   0.5456   0.6831
20            1       0.6377   0.7505   0.7047
25            5       0.7645   0.6256   0.6793

Ours (anomaly:normal = 1:2)
μ             σ       VPN      TOR      DataCon
-9/-25/-100   5/5/5   0.8658   0.8458   0.7292
Table 5: Detection performance for different ratios of abnormal samples to normal samples. We obtain the different ratios by altering the quantity of pseudo anomaly samples during training. The bottom row shows our method, in which the number of pseudo abnormal samples is half the number of normal samples; the three values of $\mu$ and $\sigma$ correspond to the VPN, TOR, and DataCon datasets, respectively.

4.5 Ablation Study

4.5.1 Adopting Different Noise Distributions

For our method, the noise distribution plays a particularly critical role, as it directly determines the quality of the generated samples. In [20], the authors determine the direction of an attribute change from the difference between two sample vectors with specific attributes. Because our model is trained only on normal samples, no such pair is available, and we can only guide the generation of simulated abnormal samples by trying random noise. This aspect is crucial for training the classifier effectively.

We have tried different combinations of $\mu$ and $\sigma$, and the experimental results are shown in Table 3. They show that our model is sensitive to the parameters of the noise distribution, which is in line with our expectations. The distribution of the simulated anomaly samples in the standard normal distribution space is difficult to determine, since it depends on the noise distribution from which the noise is sampled. We simulate the anomaly samples by manipulating the normal samples, thereby enabling the classifier to show excellent recognition capability for normal samples. In addition, we can easily generate various types of anomaly samples simply by changing the noise distribution.
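The generation of a pseudo anomaly from a normal feature vector can be illustrated with a toy invertible flow. The sketch below is not the trained module: the fixed functions `_scale` and `_shift` stand in for learned sub-networks, the flip between blocks is one simple way to let both halves be transformed, and all names are hypothetical. It only shows the mechanics: normalize with the forward direction, add noise drawn as $\mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ (the reparameterization trick), and map back with the generation direction.

```python
import math
import random


def _scale(v):
    # Toy stand-in for a learned scale network (bounded for stability).
    return [math.tanh(0.5 * a) for a in v]


def _shift(v):
    # Toy stand-in for a learned shift network.
    return [0.1 * a for a in v]


def coupling_forward(x):
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = _scale(x1), _shift(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2                      # first half passes through unchanged


def coupling_inverse(y):
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = _scale(y1), _shift(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


def flow_forward(x, blocks=8):
    if len(x) % 2:
        raise ValueError("this sketch needs an even-length vector")
    for _ in range(blocks):
        x = coupling_forward(x)[::-1]   # flip so both halves get updated
    return x


def flow_inverse(z, blocks=8):
    for _ in range(blocks):
        z = coupling_inverse(z[::-1])   # undo the flip, then the coupling
    return z


def make_pseudo_anomaly(h, mu, sigma, rng):
    """Perturb a normal feature vector h in the normalized space."""
    z = flow_forward(h)
    z_noisy = [zi + mu + sigma * rng.gauss(0.0, 1.0) for zi in z]
    return flow_inverse(z_noisy)
```

Because every block is exactly invertible, `flow_inverse(flow_forward(h))` recovers `h` up to floating-point error, so any difference between `h` and the pseudo anomaly comes solely from the injected noise.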

4.5.2 Ablating Bidirectional Flow Module

Generating anomalous samples is crucial to our approach. By composing multiple invertible transformations, the bidirectional flow module can fit complex distributions, resulting in high-quality synthetic samples. In addition, training this module maps normal samples to a specified distribution, while anomaly samples not seen by the module are mapped outside of that distribution. We can therefore simulate abnormal samples by deviating from the normal samples in this space.

We remove the bidirectional flow module and directly add noise to the latent vectors extracted by the feature extractor to simulate anomaly samples. The simulated samples are then fed into the classifier for detection. We explore different combinations of $\mu$ and $\sigma$, and the experimental results are shown in Table 4. Directly adding noise to the latent vectors does not achieve satisfactory detection results. The latent vectors do affect the properties of the samples, but random noise in this space rarely achieves the semantic conversion from normal to abnormal samples, which may require careful design of the vectors. Manipulating vectors in the standard normal space proves to be more effective for altering the properties of network traffic.

4.5.3 Ablating Feature Extractor

We conducted training using normal samples to acquire a feature extractor capable of effective feature extraction on packets of network traffic. Autoencoders are often used for extracting features from network traffic [58, 13]. However, these features are typically deep representations of manually extracted features. This approach may overlook potential data connections within the packets. Our feature extractor introduces a discriminator to improve the capability of the autoencoder, which is more helpful in generating dense sample features.

We remove the feature extractor in our feature extractor ablation experiments and directly feed the preprocessed one-dimensional packets into the bidirectional flow module. The bidirectional flow module is trained to map the one-dimensional packet vectors to the standard normal distribution space. In the standard normal distribution space, noise is introduced to the vectors, and through the generation direction, simulated anomaly samples are obtained. Both normal samples and simulated anomaly samples are then fed into the classifier for detection. The results obtained by retraining the parameters are shown in Table 4. The detection performance directly using one-dimensional packet vectors is poor on these three datasets. We believe that the packet vectors without feature extraction may contain a significant amount of redundant information, which hampers model learning by lacking concentrated information. Consequently, this leads to poor performance.

4.5.4 Adjusting the Ratio of Samples

The aim of our model is for the classifier to learn to recognize normal samples, with the learning focus placed on these samples. We expect that increasing the number of synthetic abnormal samples may shift this learning focus.

We adjust the number of generated anomaly samples to demonstrate the impact of different ratios of normal samples to anomaly samples on the model. In our method, the number of synthetic anomaly samples is half the number of normal samples. The other two comparative methods involve maintaining an equal number of normal and anomaly samples, and having twice the number of anomaly samples compared to normal samples. The experimental results are shown in Table 5. Compared to our method, when the number of anomaly samples increases to be equal to the number of normal samples, there is a slight decrease in performance on all three datasets. As the number of anomaly samples continues to increase to twice the number of normal samples, the performance still decreases. Changing the ratio of samples will have an impact on the classifier. We expect the classifier to distinguish between normal and anomaly samples. However, when the number of anomaly samples increases, the classifier tends to pay excessive attention to the anomaly samples, resulting in a decrease in detection performance.
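The ratio experiment amounts to changing how many pseudo anomalies are generated per normal sample before training the classifier. A minimal sketch follows, assuming that source vectors for the pseudo anomalies are drawn with replacement from the normal features and that `make_pseudo` is any user-supplied generator; both the function name `build_classifier_set` and this sampling scheme are assumptions.

```python
import random


def build_classifier_set(normal_feats, make_pseudo, ratio=0.5, seed=0):
    """Return shuffled (feature, label) pairs; label 1 = pseudo anomaly.

    ratio is the number of pseudo anomalies per normal sample:
    0.5 matches the default setting in the text, 1.0 and 2.0 the
    two comparison settings.
    """
    rng = random.Random(seed)
    n_pseudo = int(round(ratio * len(normal_feats)))
    sources = [rng.choice(normal_feats) for _ in range(n_pseudo)]
    pseudo = [make_pseudo(h, rng) for h in sources]
    data = [(h, 0) for h in normal_feats] + [(p, 1) for p in pseudo]
    rng.shuffle(data)
    return data
```

With 10 normal vectors, `ratio=0.5` yields 5 pseudo anomalies and `ratio=2.0` yields 20, while the number of normal samples stays fixed.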

Train      Test       Result
DataCon    DataCon    0.7292
TOR        DataCon    0.7058
TOR        TOR        0.8458
DataCon    TOR        0.8060
Table 6: Detection performance across datasets. “UNB-CIC Tor and non-Tor” is the encrypted and unencrypted traffic dataset, and “DataCon2020” is the benign and malicious traffic dataset. Compared with training and testing on the same dataset, the detection performance across datasets decreases slightly.

4.5.5 Generalizing across the Datasets

In a real network traffic detection scenario, the trained model will be faced with a large number of unknown network traffic packets, both normal and anomaly traffic. We will explore the detection ability of the model when faced with the test samples from an unknown distribution, so we further investigate the generalization ability across datasets.

While the anomaly samples in the three datasets differ, the normal samples are all captured from normal activities and share similarities. We therefore study the generalization of our method across datasets. Both the “UNB-CIC Tor and non-Tor” and “ISCX VPN and non-VPN” datasets consist of encrypted and unencrypted traffic, so we run the cross-dataset experiments on one of them, “UNB-CIC Tor and non-Tor”, together with the “DataCon2020” dataset. We train our model on one dataset and test it on the other to assess its ability to generalize to unseen anomaly samples. As shown in Table 6, the results drop slightly in this setting but remain within acceptable limits. This indicates that our method can achieve relatively good detection results even in the presence of unknown anomaly traffic in real detection environments. Our approach improves the model's ability to identify normal network traffic by classifying pseudo anomaly traffic, and it remains effective across different data distributions.

5 Conclusion

In this paper, we propose a three-stage framework for anomaly traffic detection that involves generating simulated anomaly samples. Our approach is able to generate anomaly samples with unknown patterns, without prior knowledge of the anomalies, and use them to improve anomaly detection. The key lies in the feature extractor and bidirectional flow module designed specifically for traffic. These modules enable us to transform the packets into the standard normal distribution space, where we manipulate the vectors to alter the properties of the traffic packets. This technique allows us to simulate the anomaly traffic. Our method demonstrates excellent performance in anomaly detection across three real network traffic datasets. We envision that our method of constructing anomaly samples can be widely applied in many fields, serving as a reliable technique for generating simulated anomaly samples.

REFERENCES

  • [1] O. Salman, I. H. Elhajj, A. Chehab, and A. Kayssi, “A machine learning based framework for iot device identification and abnormal traffic detection,” Transactions on Emerging Telecommunications Technologies, vol. 33, no. 3, p. e3743, 2022.
  • [2] J. Niu, Y. Zhang, D. Liu, D. Guo, and Y. Teng, “Abnormal network traffic detection based on transfer component analysis,” in IEEE International Conference on Communications Workshops, pp. 1–6, 2019.
  • [3] M. Gao, L. Ma, H. Liu, Z. Zhang, Z. Ning, and J. Xu, “Malicious network traffic detection based on deep neural networks and association analysis,” Sensors, vol. 20, no. 5, p. 1452, 2020.
  • [4] Z. Li, Z. Qin, K. Huang, X. Yang, and S. Ye, “Intrusion detection using convolutional neural networks for representation learning,” in International Conference on Neural Information Processing, pp. 858–866, 2017.
  • [5] L. Yang, Y. Song, S. Gao, B. Xiao, and A. Hu, “Griffin: an ensemble of autoencoders for anomaly traffic detection in sdn,” in IEEE Global Communications Conference, pp. 1–6, 2020.
  • [6] Y. Zheng, Z. Dang, C. Peng, C. Yang, and X. Gao, “Multi-view multi-label anomaly network traffic classification based on mlp-mixer neural network,” arXiv preprint arXiv:2210.16719, 2022.
  • [7] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.
  • [8] Z. Jadidi, V. Muthukkumarasamy, E. Sithirasenan, and K. Singh, “Flow-based anomaly detection using semisupervised learning,” in 2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS), pp. 1–5, IEEE, 2015.
  • [9] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Towards an unsupervised method for network anomaly detection in large datasets,” Computing and informatics, vol. 33, no. 1, pp. 1–34, 2014.
  • [10] Y. Shi and H. Shen, “Unsupervised anomaly detection for network traffic using artificial immune network,” Neural Computing and Applications, vol. 34, no. 15, pp. 13007–13027, 2022.
  • [11] C.-L. Li, K. Sohn, J. Yoon, and T. Pfister, “Cutpaste: Self-supervised learning for anomaly detection and localization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9664–9674, 2021.
  • [12] V. Zavrtanik, M. Kristan, and D. Skočaj, “Draem-a discriminatively trained reconstruction embedding for surface anomaly detection,” in IEEE/CVF International Conference on Computer Vision, pp. 8330–8339, 2021.
  • [13] M. Ring, D. Schlör, D. Landes, and A. Hotho, “Flow-based network traffic generation using generative adversarial networks,” Computers & Security, vol. 82, pp. 156–172, 2019.
  • [14] R. Ghanavi, B. Liang, and A. Tizghadam, “Generative adversarial classification network with application to network traffic classification,” in 2021 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, IEEE, 2021.
  • [15] A. Cheng, “Pac-gan: Packet generation of network traffic using generative adversarial networks,” in 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0728–0734, IEEE, 2019.
  • [16] L. Bergman and Y. Hoshen, “Classification-based anomaly detection for general data,” arXiv preprint arXiv:2005.02359, 2020.
  • [17] I. Golan and R. El-Yaniv, “Deep anomaly detection using geometric transformations,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [18] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using self-supervised learning can improve model robustness and uncertainty,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [19] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  • [20] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [21] B. Cao, C. Li, Y. Song, Y. Qin, and C. Chen, “Network intrusion detection model based on cnn and gru,” Applied Sciences, vol. 12, no. 9, p. 4184, 2022.
  • [22] T. Saba, A. Rehman, T. Sadad, H. Kolivand, and S. A. Bahaj, “Anomaly-based intrusion detection system for iot networks through deep learning model,” Computers and Electrical Engineering, vol. 99, p. 107810, 2022.
  • [23] Z. Liu, Y. He, W. Wang, and B. Zhang, “Ddos attack detection scheme based on entropy and pso-bp neural network in sdn,” China Communications, vol. 16, no. 7, pp. 144–155, 2019.
  • [24] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A deep learning approach to network intrusion detection,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 1, pp. 41–50, 2018.
  • [25] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach for network intrusion detection system,” in International Conference on Bio-inspired Information and Communications Technologies, pp. 21–26, 2016.
  • [26] W. Wang, Y. Sheng, J. Wang, X. Zeng, X. Ye, Y. Huang, and M. Zhu, “Hast-ids: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection,” IEEE Access, vol. 6, pp. 1792–1806, 2017.
  • [27] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon, “Ganomaly: Semi-supervised anomaly detection via adversarial training,” in Asian Conference on Computer Vision, pp. 622–637, 2019.
  • [28] K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards total recall in industrial anomaly detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14318–14328, 2022.
  • [29] J. Song, K. Kong, Y.-I. Park, S.-G. Kim, and S.-J. Kang, “Anoseg: anomaly segmentation network using self-supervised learning,” arXiv preprint arXiv:2110.03396, 2021.
  • [30] M. Yang, P. Wu, and H. Feng, “Memseg: A semi-supervised method for image surface defect detection using differences and commonalities,” Engineering Applications of Artificial Intelligence, vol. 119, p. 105835, 2023.
  • [31] A.-S. Collin and C. De Vleeschouwer, “Improved anomaly detection by training an autoencoder with skip connections on images corrupted with stain-shaped noise,” in International Conference on Pattern Recognition, pp. 7915–7922, 2021.
  • [32] E. Alhajjar, P. Maxwell, and N. Bastian, “Adversarial machine learning in network intrusion detection systems,” Expert Systems with Applications, vol. 186, p. 115782, 2021.
  • [33] Y. Peng, G. Fu, Y. Luo, J. Hu, B. Li, and Q. Yan, “Detecting adversarial examples for network intrusion detection system with gan,” in 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS), pp. 6–10, IEEE, 2020.
  • [34] J. Wang, J. Pan, I. AlQerm, and Y. Liu, “Def-ids: An ensemble defense mechanism against adversarial attacks for deep learning-based network intrusion detection,” in 2021 International Conference on Computer Communications and Networks (ICCCN), pp. 1–9, IEEE, 2021.
  • [35] B.-E. Zolbayar, R. Sheatsley, P. McDaniel, M. J. Weisman, S. Zhu, S. Zhu, and S. Krishnamurthy, “Generating practical adversarial network traffic flows using nidsgan,” arXiv preprint arXiv:2203.06694, 2022.
  • [36] M. Abdelaty, S. Scott-Hayward, R. Doriguzzi-Corin, and D. Siracusa, “Gadot: Gan-based adversarial training for robust ddos attack detection,” in 2021 IEEE Conference on Communications and Network Security (CNS), pp. 119–127, IEEE, 2021.
  • [37] M. R. Shahid, G. Blanc, H. Jmila, Z. Zhang, and H. Debar, “Generative deep learning for internet of things network traffic generation,” in 2020 IEEE 25th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 70–79, IEEE, 2020.
  • [38] S. K. Nukavarapu, M. Ayyat, and T. Nadeem, “Miragenet-towards a gan-based framework for synthetic network traffic generation,” in GLOBECOM 2022-2022 IEEE Global Communications Conference, pp. 3089–3095, IEEE, 2022.
  • [39] Y. Yin, Z. Lin, M. Jin, G. Fanti, and V. Sekar, “Practical gan-based synthetic ip header trace generation using netshare,” in Proceedings of the ACM SIGCOMM 2022 Conference, pp. 458–472, 2022.
  • [40] S. Hui, H. Wang, Z. Wang, X. Yang, Z. Liu, D. Jin, and Y. Li, “Knowledge enhanced gan for iot traffic generation,” in Proceedings of the ACM Web Conference 2022, pp. 3336–3346, 2022.
  • [41] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  • [42] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” Advances in Neural Information Processing Systems, vol. 29, 2016.
  • [43] I. Kobyzev, S. J. Prince, and M. A. Brubaker, “Normalizing flows: An introduction and review of current methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 3964–3979, 2020.
  • [44] H. Deng and X. Li, “Anomaly detection via reverse distillation from one-class embedding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9737–9746, 2022.
  • [45] S. Akcay, D. Ameln, A. Vaidya, B. Lakshmanan, N. Ahuja, and U. Genc, “Anomalib: A deep learning library for anomaly detection,” 2022.
  • [46] N. A. Ahuja, I. Ndiour, T. Kalyanpur, and O. Tickoo, “Probabilistic modeling of deep features for out-of-distribution and adversarial detection,” arXiv preprint arXiv:1909.11786, 2019.
  • [47] J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, and L. Wu, “Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows,” arXiv preprint arXiv:2111.07677, 2021.
  • [48] T. Defard, A. Setkov, A. Loesch, and R. Audigier, “Padim: a patch distribution modeling framework for anomaly detection and localization,” in International Conference on Pattern Recognition Workshops, pp. 475–489, 2021.
  • [49] G. Wang, S. Han, E. Ding, and D. Huang, “Student-teacher feature pyramid matching for unsupervised anomaly detection,” arXiv preprint arXiv:2103.04257, 2021.
  • [50] D. Gudovskiy, S. Ishizaka, and K. Kozuka, “Cflow-ad: Real-time unsupervised anomaly detection with localization via conditional normalizing flows,” in IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 98–107, 2022.
  • [51] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, “Characterization of tor traffic using time based features,” in International Conference on Information Systems Security and Privacy, pp. 253–262, 2017.
  • [52] G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, and A. A. Ghorbani, “Characterization of encrypted and vpn traffic using time-related features,” in International Conference on Information Systems Security and Privacy, pp. 407–414, 2016.
  • [53] D. Community, “Datacon open dataset-datacon2020-encrypted malicious traffic dataset direction open dataset.” https://datacon.qianxin.com/opendata/openpage?resourcesId=6, 2021-11-11.
  • [54] M. Lotfollahi, M. Jafari Siavoshani, R. Shirali Hossein Zade, and M. Saberian, “Deep packet: A novel approach for encrypted traffic classification using deep learning,” Soft Computing, vol. 24, no. 3, pp. 1999–2012, 2020.
  • [55] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
  • [56] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, “Deep learning for anomaly detection: A review,” ACM computing surveys (CSUR), vol. 54, no. 2, pp. 1–38, 2021.
  • [57] S. Xu, M. Marwah, M. Arlitt, and N. Ramakrishnan, “Stan: Synthetic network traffic generation with generative neural models,” in Deployable Machine Learning for Security Defense: Second International Workshop, MLHat 2021, Virtual Event, August 15, 2021, Proceedings 2, pp. 3–29, Springer, 2021.
  • [58] G. Aceto, D. Ciuonzo, A. Montieri, and A. Pescapé, “Mobile encrypted traffic classification using deep learning: Experimental evaluation, lessons learned, and challenges,” IEEE Transactions on Network and Service Management, vol. 16, no. 2, pp. 445–458, 2019.