UNSUPERVISED DISTANCE METRIC LEARNING FOR ANOMALY DETECTION OVER MULTIVARIATE TIME SERIES
Abstract
Distance-based time series anomaly detection methods are prevalent due to their relatively non-parametric nature and interpretability. However, the commonly used Euclidean distance is sensitive to noise. While existing works have explored dynamic time warping (DTW) for its robustness, they only support supervised tasks over multivariate time series (MTS), leaving a scarcity of unsupervised methods. In this work, we propose FCM-wDTW, an unsupervised distance metric learning method for anomaly detection over MTS, which encodes raw data into a latent space and reveals normal dimension relationships through cluster centers. FCM-wDTW introduces locally weighted DTW into fuzzy C-means clustering and learns the optimal latent space efficiently, enabling anomaly identification via data reconstruction. Experiments with 11 benchmarks of different types demonstrate our method’s competitive accuracy and efficiency.
Index Terms: Anomaly Detection, Multivariate Time Series, Unsupervised Learning, Dynamic Time Warping
1 Introduction
Anomaly detection over multivariate time series (MTS) is challenging because of the intricate correlations and dependencies among dimensions. It has been studied in various application fields, e.g., financial fraud detection [1], network intrusion detection [2], and industrial control fault diagnosis [3]. An anomaly is classically defined as an observation that deviates significantly from other observations, raising suspicions that it was generated by a different mechanism [4]. Generally, MTS-generating systems remain in a stable normal status, characterized by consistent relationships among dimensions. The goal of multivariate anomaly detection is to identify abnormal relationships that disrupt the normal status.
Most existing anomaly detection methods are semi-supervised or unsupervised [5, 6], as anomaly labels are typically rare and costly. The former train models on normal samples and mark the points or subsequences that deviate from the model as anomalous, whereas the latter separate anomalies from the normal part by frequency, shape, or distribution. The state-of-the-art deep learning methods for multivariate anomaly detection are mostly semi-supervised [5]. They therefore need prior knowledge of the data for training, which limits their applicability in practice. Furthermore, according to recent surveys [5, 7, 8], including [5], which evaluates 71 anomaly detection methods on 976 time series datasets, deep learning methods have yet to show a clear performance advantage, while they have relatively high complexity and poor interpretability. In contrast, conventional statistical analysis and machine learning methods are simpler, lighter, and easier to interpret.
Among existing works, distance-based anomaly detection methods are particularly prevalent due to their non-parametric property and strong interpretability [5]. They are mostly based on nearest neighbors or clustering [9, 10, 11], where the choice of distance metric plays a crucial part. However, the commonly used Euclidean distance is noise-sensitive and may degrade performance. To address this issue, DTW has been successfully applied to anomaly detection on univariate time series [12]. Nevertheless, extending DTW to MTS remains a challenge due to the complex dependency among dimensions, which affects the local pointwise measure and, in turn, the overall precision. Previous attempts use the PCA similarity factor [13] or a parameterized distance metric [14, 15, 16, 11] as the pointwise measure of DTW. However, they only support supervised tasks, and no sound unsupervised distance metric learning method has been proposed so far.
To bridge this gap, we propose an unsupervised distance metric learning method for DTW in multivariate anomaly detection. Specifically, we enhance fuzzy C-means clustering (FCM) by introducing a locally weighted DTW (wDTW) as the distance metric. FCM-wDTW encodes raw data into a latent space, where cluster centers represent the normal relationships among dimensions of MTS. The optimal latent space and wDTW are then learned by an efficient closed-form optimization algorithm. Finally, anomalies are identified by reconstructing data from the latent space. Samples far away from every cluster have large reconstruction errors, which indicate anomalies. Our contributions are summarized as follows:
- We improve FCM clustering by introducing wDTW, which enhances the clustering accuracy over MTS, and build an efficient optimization algorithm for it.
- Based on the learned latent space and wDTW, we propose a multivariate anomaly detection method with data reconstruction and anomaly scoring computation.
- Extensive experiments with 11 unsupervised multivariate anomaly detection benchmarks demonstrate the favorable performance of FCM-wDTW.
2 Problem Statement
MTS can be seen as a sequence of sampling vectors over correlated variables. Specifically, a MTS consists of a sequence of observations, i.e., , where and . The anomaly detection over MTS is formally defined as follows.
Definition 1 (Multivariate Anomaly Detection).
Given a MTS , multivariate anomaly detection aims at computing an anomaly score for each observation , such that higher is , more anomalous is .
Note that in Definition 1, we make no assumptions about whether anomalies are point or subsequence anomalies. If the anomaly scores of consecutive observations are high, these observations can be detected as a subsequence anomaly.
3 Method
In this section, we first introduce the objective and optimization algorithm of FCM-wDTW, providing a comprehensive analysis of the algorithm. Then, we describe the anomaly detection method based on the learned latent space and wDTW.
3.1 Objective of FCM-wDTW
The computation of DTW over MTS contains two layers: the pointwise measure on sampling vectors and the dynamic programming (DP) algorithm over the pointwise cost matrix (PCM). To regulate MTS dimensions, we parameterize DTW by employing the weighted Euclidean distance (WED) as the pointwise measure. Given two -dimensional MTS and , where and are sampling vectors, and the optimal warping path (OWP) , where is the PCM and , , by introducing WED, we can get wDTW formulated as
(1) |
where denotes the WED between and is the weight of the -th dimension that satisfies , and is a hyper-parameter.
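As an illustration, the two layers of wDTW can be sketched in Python. The sketch assumes that the WED takes the form $\sum_k w_k^{\beta}(x_{ik}-y_{jk})^2$, where the symbol $\beta$ stands for the weight exponent, and that the DP layer uses the standard three-neighbor DTW recursion. The function names `weighted_sq_dist` and `wdtw` are hypothetical and not taken from the original implementation.

```python
from math import inf
from typing import Sequence


def weighted_sq_dist(x: Sequence[float], y: Sequence[float],
                     w: Sequence[float], beta: float) -> float:
    """Pointwise measure between two sampling vectors.

    Assumed form: sum_k w[k]**beta * (x[k] - y[k])**2.
    """
    return sum((wk ** beta) * (xk - yk) ** 2 for xk, yk, wk in zip(x, y, w))


def wdtw(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]],
         w: Sequence[float], beta: float = 2.0) -> float:
    """Accumulated cost of the optimal warping path between MTS a and b."""
    n, m = len(a), len(b)
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = weighted_sq_dist(a[i - 1], b[j - 1], w, beta)
            acc[i][j] = cost + min(acc[i - 1][j],      # advance in a only
                                   acc[i][j - 1],      # advance in b only
                                   acc[i - 1][j - 1])  # advance in both
    return acc[n][m]
```

For example, `wdtw([[0.0, 0.0]], [[1.0, 2.0]], [0.5, 0.5], beta=2.0)` weights both squared differences by 0.25, while `beta=0.0` removes the weighting and reduces the pointwise measure to the plain squared Euclidean distance.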
To learn an adaptive wDTW for the unsupervised tasks, we introduce it into FCM as the kernel distance metric. In terms of (1), the objective function of FCM-wDTW can be formulated as:
(2) |
where denotes the set of weight parameters. denotes cluster center set. denotes the membership matrix, where .
3.2 Optimizing FCM-wDTW
With the nondifferentiable wDTW, it is difficult to optimize the objective function (2) directly. An alternating scheme instead solves the following four partial optimization sub-problems iteratively:
- Problem 1 Keeping , and fixed, update OWPs between and ;
- Problem 2 Keeping OWPs, , and fixed, update ;
- Problem 3 Keeping OWPs, , and fixed, update ;
- Problem 4 Keeping OWPs, , and fixed, update .
The four problems iteratively optimize the four factors determining the loss of FCM-wDTW, each of which has an explicit meaning. contains all cluster centers that reveal the normal patterns of MTS and construct the latent space. contains the encoding feature of samples that can be seen as the proportion of all normal patterns in composing a sample (where and ). models the correlation among dimensions of MTS while OWPs regulate the temporal relationship between samples and normal patterns. Together, they make up the distance metric in the latent space. Problem 1 can be solved by the DP algorithm and Problems 2–4 can be solved by the Lagrange multiplier method, where (3.1) is reformulated as:
(3) |
Theorem 1.
As fixing the OWPs, , and can be updated by:
(4) |
Proof.
Theorem 2.
As fixing the OWPs, , and , can be updated by
(8) |
where denotes the intra-cluster distance on the -th dimension.
The proof of Theorem 2 follows a similar procedure to that of Theorem 1, and will not be reiterated here.
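To make the weight update concrete, the following sketch assumes that each dimension weight enters the objective as $w_k^{\beta}$ under the constraint $\sum_k w_k = 1$, and that $D_k$ names the intra-cluster distance on the $k$-th dimension. Under these assumptions, the Lagrange multiplier condition yields weights proportional to $D_k^{-1/(\beta-1)}$. The sketch covers only $\beta > 1$, and the function name `update_weights` is hypothetical.

```python
from typing import List, Sequence


def update_weights(intra: Sequence[float], beta: float) -> List[float]:
    """Dimension weights from per-dimension intra-cluster distances.

    Assumed closed form (beta > 1): w_k proportional to D_k ** (-1 / (beta - 1)),
    normalized so that the weights sum to one.
    """
    if beta <= 1.0:
        raise ValueError("this sketch only covers beta > 1")
    power = 1.0 / (beta - 1.0)
    eps = 1e-12  # guards against a zero intra-cluster distance
    inv = [(1.0 / max(d, eps)) ** power for d in intra]
    total = sum(inv)
    return [v / total for v in inv]
```

With `intra = [1.0, 4.0]` and `beta = 2.0`, the dimension with the smaller intra-cluster distance receives the larger weight, which matches the principle that compact dimensions should contribute more to the distance.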
Theorem 3.
As fixing the OWPs, , and can be updated by
(9) |
where are the lengths of and respectively. is an indicator function, if otherwise , where is the OWP between and .
Proof.
Based on the solutions above, FCM-wDTW can be optimized by iterating Problems 1–4, as shown in Algorithm 1. Step 1 initializes cluster centers and the weights of dimensions . One feasible initialization is to randomly choose samples from the dataset as the initial . The weights of dimensions can be randomly initialized within the range of [0, 1], while ensuring that they satisfy the condition . Steps 3–9 update the corresponding variables, respectively. Step 10 calculates the loss of the objective function and determines whether the algorithm has converged.
3.3 Algorithm Analysis
Complexity of FCM-wDTW. Given the average length of samples , the dimensionality , and the number of clusters , the computational complexity of finding the OWPs in Steps 3–6 is , updating the membership matrix in Step 7 is , updating weight parameters in Step 8 is , and updating cluster centers in Step 9 is . Supposing the algorithm iterates times, the complexity of the whole algorithm is .
Hyperparameter analysis. FCM-wDTW has two hyperparameters, i.e., the fuzzy coefficient and the exponent of the weight coefficient. By (4), as , the membership matrix tends to be sparse and the algorithm tends to hard clustering, while as tends to be average and the algorithm tends to uniform fuzzy clustering. A previous study [17] has shown experimentally that the optimal values of are typically lower than the commonly used value of 2.0. Thus, we set to be within the range of . On the other hand, by (8), a larger weight can strengthen the contribution of the MTS dimension with a smaller intra-cluster distance , and vice versa. Following this principle, we can investigate the values of within different ranges. Firstly, as , and the WED becomes the original Euclidean distance, which defeats our aim of discriminating the MTS dimensions. As , the larger , the larger , violating the principle above. As , the weight coefficient of the dimension with the smallest is one and the others are zero, meaning only a single dimension plays a role in clustering and the information loss would seriously degrade the clustering accuracy. As or , the larger , the smaller , satisfying the expected principle. Thus, should be within the range of , .
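The effect of the fuzzy coefficient can be checked numerically. The sketch below assumes the standard FCM membership update for an objective of the form $\sum u^{m} d$, namely $u_i \propto d_i^{-1/(m-1)}$, where $d_i$ is the distance of one sample to the $i$-th cluster center and $m$ names the fuzzy coefficient. The function name `update_memberships` is hypothetical.

```python
from typing import List, Sequence


def update_memberships(dists: Sequence[float], m: float) -> List[float]:
    """Memberships of one sample to all clusters (assumed FCM closed form)."""
    if m <= 1.0:
        raise ValueError("the fuzzy coefficient must be greater than 1")
    # A sample that coincides with one or more centers belongs only to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(dists))]
    power = 1.0 / (m - 1.0)
    inv = [d ** (-power) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]
```

For distances `[1.0, 4.0]`, a coefficient close to 1 (e.g., `m = 1.1`) assigns almost all membership to the nearer cluster, whereas a large coefficient (e.g., `m = 50.0`) yields nearly equal memberships, in line with the two limiting cases discussed above.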
3.4 Anomaly Detection
In FCM-wDTW, data is projected into the latent space constructed by the cluster centers, so that each sample is represented only by normal patterns, in proportions given by its membership degrees. Intuitively, if we reconstruct data from this space, the reconstructions are expected to be as close to the cluster centers as possible, and the samples with abnormal components will have large reconstruction errors. Thus, the anomaly score can be computed based on the difference between the sample and its reconstruction.
Based on the optimal cluster centers, partition matrix, and wDTW, the reconstructed samples can be obtained by solving the objective function of FCM-wDTW directly. Let denote the reconstruction of , the objective function of clustering with FCM-wDTW can be obtained as follows
(10) |
By zeroing the gradient of with respect to , we have
(11) |
Then the anomaly score of each sample can be computed as (12), by the wDTW distance between the raw data and its reconstruction.
(12) |
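The reconstruction idea can be illustrated with a deliberately simplified sketch. It assumes equal-length windows and an identity alignment (no warping), so the reconstruction reduces to the membership-weighted average of the cluster centers with coefficients $u_i^{m}$, and the score is a weighted squared Euclidean distance rather than the full wDTW. The names `reconstruct` and `anomaly_score` are hypothetical; this is not the closed form (11) itself.

```python
from typing import List, Sequence

Window = Sequence[Sequence[float]]  # T sampling vectors of d values each


def reconstruct(centers: Sequence[Window], u: Sequence[float],
                m: float) -> List[List[float]]:
    """Membership-weighted average of cluster centers (identity alignment)."""
    coef = [ui ** m for ui in u]
    total = sum(coef)
    length, dims = len(centers[0]), len(centers[0][0])
    return [[sum(c * ctr[t][k] for c, ctr in zip(coef, centers)) / total
             for k in range(dims)]
            for t in range(length)]


def anomaly_score(x: Window, x_hat: Window,
                  w: Sequence[float], beta: float) -> float:
    """Weighted squared distance between a window and its reconstruction."""
    return sum((wk ** beta) * (a - b) ** 2
               for xt, rt in zip(x, x_hat)
               for a, b, wk in zip(xt, rt, w))
```

A window that lies between two centers is reconstructed almost exactly and scores near zero, while a window with a component that no combination of centers can reproduce keeps a large residual and therefore a high score.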
4 Experiments
In this section, we first evaluate the performance of FCM-wDTW on multivariate anomaly detection, using 4 datasets and 11 benchmarks. We then examine the runtime efficiency of the optimization algorithm for FCM-wDTW.
Table 1: Anomaly detection accuracy (ROC-AUC and PR-AUC) of the baselines and FCM-wDTW.
Method | Type | CalIt2 ROC-AUC | CalIt2 PR-AUC | PCSO5 ROC-AUC | PCSO5 PR-AUC | PCSO10 ROC-AUC | PCSO10 PR-AUC | PCSO20 ROC-AUC | PCSO20 PR-AUC |
---|---|---|---|---|---|---|---|---|---|
LOF [9] | Distance | 0.727 | 0.119 | 0.445 | 0.008 | 0.440 | 0.009 | 0.447 | 0.009 |
KNN [10] | Distance | 0.883 | 0.264 | 0.681 | 0.014 | 0.752 | 0.019 | 0.599 | 0.011 |
CBLOF [18] | Distance | 0.871 | 0.256 | 0.271 | 0.006 | 0.355 | 0.007 | 0.225 | 0.006 |
HBOS [19] | Distance | 0.873 | 0.250 | 0.207 | 0.006 | 0.297 | 0.007 | 0.185 | 0.006 |
COF [20] | Distance | 0.839 | 0.161 | 0.463 | 0.009 | 0.432 | 0.008 | 0.471 | 0.009 |
EIF [21] | Trees | 0.885 | 0.259 | 0.948 | 0.083 | 0.847 | 0.032 | 0.856 | 0.033 |
IF-LOF [22] | Trees | 0.795 | 0.122 | 0.610 | 0.012 | 0.603 | 0.012 | 0.619 | 0.013 |
iForest [23] | Trees | 0.881 | 0.259 | 0.859 | 0.033 | 0.838 | 0.030 | 0.836 | 0.029 |
COPOD [24] | Distribution | 0.884 | 0.269 | 0.809 | 0.024 | 0.465 | 0.009 | 0.808 | 0.024 |
PCC [25] | Reconstruction | 0.757 | 0.240 | 0.874 | 0.038 | 0.600 | 0.012 | 0.870 | 0.037 |
Torsk [26] | Forecasting | 0.585 | 0.054 | 0.909 | 0.100 | - | - | 0.919 | 0.065 |
FCM-wDTW | Reconstruction | 0.904 | 0.465 | 0.993 | 0.818 | 0.974 | 0.423 | 0.952 | 0.345 |
4.1 Setup
Environment. The configuration is Intel(R) Core(TM) i9-12900k CPU @3.2GHz, 32GB memory, Ubuntu 20.04 OS. The programming language is Python 3.8.
Datasets.
For evaluating the anomaly detection accuracy, we consider four datasets, namely the real-world CalIt2 [27] and PCSO5, PCSO10, and PCSO20 generated by GutenTAG [5]. CalIt2 comes from the data streams of people flowing in and out of a building at the University of California, Irvine over 15 weeks, with 48 time slices per day (half-hour count aggregates). The purpose is to detect the presence of an event such as a conference in the building, which is reflected by unusually high people counts for that day/time period. GutenTAG is a time series anomaly generator that produces single or multiple channels containing a base oscillation with a large variety of well-labeled anomalies at different positions. It generates time series of five base types (sine, ECG, random walk, cylinder bell funnel, and polynomial) with different lengths, variances, amplitudes, frequencies, and dimensions. The injected anomalies cover nine different types. Fig. 1 exhibits samples from both data collections. In the CalIt2 sample, the relationship between the two dimensions reverses with large amplitude in the shaded area, indicating a conference as an anomaly event. In the GutenTAG sample, the first dimension is injected with polynomial amplitude anomalies within the shaded area.
Parameters. To guarantee clustering robustness, the cluster centers of FCM-wDTW are initialized with density peak clustering (DPC) [28]. Following the analysis above, the value ranges of the two hyper-parameters and are and , with adjustment steps of 0.3 and 2, respectively. In addition, the sliding window size in anomaly detection is 16.
Metrics. We utilize two threshold-agnostic evaluation metrics. 1) AUC-ROC contrasts the TP rate (recall) with the FP rate and focuses on an algorithm’s sensitivity. 2) AUC-PR contrasts precision with recall and focuses on an algorithm’s preciseness.
Baselines. We adopt the 11 unsupervised multivariate benchmarks published by the state-of-the-art survey work [5] (all except DBStream, for which no results are published). Table 1 summarizes the name and type of each baseline. To avoid implementation bias, we use the results from [5] directly. All baseline results are obtained by optimizing the parameters globally for the best average AUC-ROC score.
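Both metrics can be computed directly from binary labels and anomaly scores. The sketch below is a minimal pure-Python illustration: ROC-AUC is computed as the fraction of (anomalous, normal) pairs ranked correctly, and PR-AUC is approximated by average precision, which is one common estimator and not necessarily the one used in [5]. The function names are hypothetical, and ties between scores are not treated specially in `average_precision`.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random anomaly outscores a random normal point."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each true anomaly."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

Because anomalies are rare, a method can rank most anomalies above most normal points (high ROC-AUC) and still place many normal points at the very top of the ranking (low PR-AUC), which is why both metrics are reported.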
4.2 Accuracy
We test the number of clusters with values {10, 20, 30, 40, 50}. The best anomaly detection results of FCM-wDTW and the baseline results are reported in Table 1. From the results, we note that distance-based methods such as LOF, KNN, and CBLOF perform relatively poorly, especially on the PCSO datasets. The reason may be the sensitivity of the Euclidean distance to noise. Tree-based methods achieve a relatively high ROC-AUC on most datasets. However, they tend to have a low PR-AUC, indicating that they struggle to achieve a good trade-off between precision and recall. In contrast, FCM-wDTW achieves the best ROC-AUC on every dataset together with a PR-AUC between 0.345 and 0.818, whereas no baseline exceeds a PR-AUC of 0.269. This indicates that FCM-wDTW achieves a favorable trade-off between the TP rate and the FP rate, as well as between precision and recall. Overall, FCM-wDTW outperforms all other methods on all four datasets in terms of both AUC-ROC and AUC-PR, signifying its effectiveness and robustness in detecting anomalies.
4.3 Runtime
We compare the actual runtime of FCM-wDTW against seven clustering benchmarks [29], including CD, PDC, FCFW, FCMD-DTW, PAM-DTW, GAK-DBA and soft-DTW, on the datasets CMUsubject16 and ECG [30]. The results are shown in Fig. 2. Each method is run 100 times to calculate the average runtime. On CMUsubject16, the runtimes of all methods except GAK-DBA are of the same order of magnitude. On ECG, the runtimes of all methods except soft-DTW are of the same order of magnitude. In addition, the runtime of FCM-wDTW is relatively low on ECG but high on CMUsubject16. We note that FCM-wDTW converges in fewer than 10 iterations on both datasets, but the sample lengths and the numbers of dimensions differ greatly. By (8), the extra cost of FCM-wDTW on CMUsubject16 is caused by the procedure of initializing cluster centers and updating weight coefficients. Overall, although a more complex distance metric is introduced, the runtime of FCM-wDTW remains comparable to that of other clustering methods.
5 Conclusion
In this work, we propose an unsupervised distance metric learning method based on FCM and wDTW for multivariate anomaly detection. Our method solves the reformulated optimization problem in closed form and builds an efficient optimization algorithm on top of it. Anomalies are identified by reconstructing data from the learned optimal latent space. Experiments show that FCM-wDTW outperforms 11 baselines on four datasets in both ROC-AUC and PR-AUC, with a runtime comparable to that of other clustering methods. Future work will focus on examining its effectiveness on more practical problems, e.g., network performance monitoring, abnormal account detection, and attack behavior identification.
References
- [1] Dongxu Huang, Dejun Mu, Libin Yang, and Xiaoyan Cai, “Codetect: Financial fraud detection with anomaly feature detection,” IEEE Access, vol. 6, pp. 19161–19174, 2018.
- [2] Neeraj Kumar and Upendra Kumar, “Anomaly-based network intrusion detection: An outlier detection techniques,” in Proceedings of the Eighth International Conference on Soft Computing and Pattern Recognition (SoCPaR 2016). Springer, 2018, pp. 262–269.
- [3] Astha Garg, Wenyu Zhang, Jules Samaran, Ramasamy Savitha, and Chuan-Sheng Foo, “An evaluation of anomaly detection and diagnosis in multivariate time series,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2508–2517, 2021.
- [4] Douglas M Hawkins, Identification of outliers, vol. 11, Springer, 1980.
- [5] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock, “Anomaly detection in time series: a comprehensive evaluation,” Proceedings of the VLDB Endowment, vol. 15, no. 9, pp. 1779–1797, 2022.
- [6] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021.
- [7] Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A Zuluaga, “Do deep neural networks contribute to multivariate time series anomaly detection?,” Pattern Recognition, vol. 132, pp. 108945, 2022.
- [8] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon, “Towards a rigorous evaluation of time-series anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 7194–7201.
- [9] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander, “LOF: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp. 93–104.
- [10] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp. 427–438.
- [11] Jinbo Li, Hesam Izakian, Witold Pedrycz, and Iqbal Jamal, “Clustering-based anomaly detection in multivariate time series data,” Applied Soft Computing, vol. 100, pp. 106919, 2021.
- [12] Seif-Eddine Benkabou, Khalid Benabdeslem, and Bruno Canitia, “Unsupervised outlier detection for time series by entropy and dynamic time warping,” Knowledge and Information Systems, vol. 54, no. 2, pp. 463–486, 2018.
- [13] Zoltán Bankó and János Abonyi, “Correlation based dynamic time warping of multivariate time series,” Expert Systems with Applications, vol. 39, no. 17, pp. 12814–12823, 2012.
- [14] Qinglin Cai, Ling Chen, and Jianling Sun, “Piecewise statistic approximation based similarity measure for time series,” Knowledge-Based Systems, vol. 85, pp. 181–195, 2015.
- [15] Jiangyuan Mei, Meizhu Liu, Yuan-Fang Wang, and Huijun Gao, “Learning a Mahalanobis distance-based dynamic time warping measure for multivariate time series classification,” IEEE Transactions on Cybernetics, vol. 46, no. 6, pp. 1363–1374, 2015.
- [16] Jingyi Shen, Weiping Huang, Dongyang Zhu, and Jun Liang, “A novel similarity measure model for multivariate time series based on LMNN and DTW,” Neural Processing Letters, vol. 45, pp. 925–937, 2017.
- [17] Witold Pedrycz and José Valente de Oliveira, “A development of fuzzy encoding and decoding through fuzzy clustering,” IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 4, pp. 829–837, 2008.
- [18] Zengyou He, Xiaofei Xu, and Shengchun Deng, “Discovering cluster-based local outliers,” Pattern recognition letters, vol. 24, no. 9-10, pp. 1641–1650, 2003.
- [19] Markus Goldstein and Andreas Dengel, “Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm,” KI-2012: poster and demo track, vol. 1, pp. 59–63, 2012.
- [20] Jian Tang, Zhixiang Chen, Ada Wai-Chee Fu, and David W Cheung, “Enhancing effectiveness of outlier detections for low density patterns,” in Advances in Knowledge Discovery and Data Mining: 6th Pacific-Asia Conference, PAKDD 2002 Taipei, Taiwan, May 6–8, 2002 Proceedings 6. Springer, 2002, pp. 535–548.
- [21] Sahand Hariri, Matias Carrasco Kind, and Robert J Brunner, “Extended isolation forest,” IEEE transactions on knowledge and data engineering, vol. 33, no. 4, pp. 1479–1489, 2019.
- [22] Zhangyu Cheng, Chengming Zou, and Jianwei Dong, “Outlier detection using isolation forest and local outlier factor,” in Proceedings of the conference on research in adaptive and convergent systems, 2019, pp. 161–168.
- [23] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, “Isolation forest,” in 2008 eighth ieee international conference on data mining. IEEE, 2008, pp. 413–422.
- [24] Zheng Li, Yue Zhao, Nicola Botta, Cezar Ionescu, and Xiyang Hu, “Copod: copula-based outlier detection,” in 2020 IEEE international conference on data mining (ICDM). IEEE, 2020, pp. 1118–1123.
- [25] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, “A novel anomaly detection scheme based on principal component classifier,” in Proceedings of the IEEE foundations and new directions of data mining workshop. IEEE Press, 2003, pp. 172–179.
- [26] Niklas Heim and James E Avery, “Adaptive anomaly detection in chaotic time series with a spatially aware echo state network,” arXiv preprint arXiv:1909.01709, 2019.
- [27] Arthur Asuncion and David Newman, “UCI machine learning repository,” 2007.
- [28] Alex Rodriguez and Alessandro Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
- [29] Hailin Li and Miao Wei, “Fuzzy clustering based on feature weights for multivariate time series,” Knowledge-Based Systems, vol. 197, pp. 105907, 2020.
- [30] Mustafa Gokce Baydogan, “Multivariate time series classification datasets,” 2015.