11institutetext: Chongqing Institute of Green and Intelligent Technology,
Chinese Academy of Sciences, Chongqing 400714, China
22institutetext: Chongqing School, University of Chinese Academy of Sciences,
Chongqing 400714, China
22email: {chenyutong,msshang}@cigit.ac.cn
33institutetext: Intelligent Game and Decision Lab (IGDL), Bei**g 100091, China
33email: [email protected], [email protected]
44institutetext: Singapore Management University, 178902, Singapore
44email: [email protected], [email protected]

Self-Supervised Spatial-Temporal Normality Learning for Time Series Anomaly Detection

Yutong Chen Equal contribution1122    Hongzuo Xu* 33    Guansong Pang (✉) 44    Hezhe Qiao 44   
Yuan Zhou
33
   Mingsheng Shang (✉) 1122
Abstract

Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at https://github.com/mala-lab/STEN.

Keywords:
Anomaly Detection Time Series Self-supervised Learning Normality Learning

1 Introduction

Time series data are pervasive in many real-world application domains, including finance, industrial production, network traffic, and health monitoring [2, 26, 25]. Within these time series data, there can exist exceptional data observations, a.k.a., anomalies, that deviate significantly from the majority of data, which often indicates an abnormal change of the data-generating mechanism [10]. These anomalies are of great interest to the analysts since accurately detecting them is significant in alarming faults in target systems and preventing potential losses.

Anomaly detection aims to learn data normality during training and identify those exceptional data during inference [23]. It is a non-trivial task to precisely learn the normality of time series data. The challenge is primarily due to the unsupervised nature of anomaly detection. Models like neural networks are actuated by supervisory signals to learn high-level patterns from given training data. However, it is prohibitively costly to collect large-scale labeled data for anomaly detection. Thus, learning from unlabeled data is generally more preferable yet it is difficult due to the lack of labeled training data. The challenge is aggravated by the inherent complexity of time series data. Time series data is characterized by multiple components from temporal and spatial perspectives, including the trending and seasonal changes along the time dimension and the spatial proximity relation of temporal sequences in the feature space.

Current time series anomaly detection (TSAD) methods typically leverage an encoder-decoder architecture and assess data abnormality according to reconstruction/prediction errors of testing data [34, 33, 35, 30, 26, 12, 7]. Despite their general effectiveness on various datasets, these methods often tend to overfit the training data and fail to distinguish anomalies from the normal sequences, since both types of sequences have similarly small reconstruction/prediction errors. Self-supervised learning has been emerging as a promising technique in anomaly detection due to its capability of deriving supervisory signals from the data itself. Existing self-supervised methods, e.g., via pretext tasks like association learning [34] and dual attention contrastive representation learning [35], successfully learn discriminative models to capture normal semantics of the temporal continuity into the representation space. However, apart from the data regularities reflected in the temporal dimension, spatial normality is also a desideratum of comprehensive normality learning in time series data.

Refer to caption
Figure 1: Our key insights. (a) A sequence containing abnormal data is divided into equal-length sub-sequences by dashed lines, with anomalies presented in the first three sub-sequences. (b) Sub-sequences arranged according to the predicted order distribution, showing differences between the predicted order distribution and the original data due to the presence of the anomaly. (c) The distribution of distances between sequence pairs in a random projection space, which can well preserve the spatial information of the sequences within the feature space. (d) The distribution of distances between sequence pairs learned by a trainable network, effectively resembling the distance distribution. (e) The distribution of anomaly scores obtained by considering the prediction discrepancies in both temporal and spatial dimensions for normal and abnormal data. The results are based on the Epilepsy dataset [33].

In light of this limitation, this paper investigates an intriguing question: Can we simultaneously learn spatial-temporal normality of time series data in the self-supervised paradigm? As shown in Fig. 1, the input data sequences are the training data of the Epilepsy dataset [33], which illustrates the normal patterns. Time series continuously change along the time dimension, and this continuity can be fully leveraged in temporal normality learning. More specifically, after splitting the raw time sequences into several sub-sequences, the normality can be modeled by learning the sub-sequence order. During inference, giving a testing sequence with anomalies (Fig. 1(a)), the learning model ranks the randomly shuffled sub-sequences and finally yields a sequence that resembles the training data (Fig. 1(b)). This predicted order depicts a large difference compared to the original distribution for the anomalous sequence, whereas the difference is small for normal sequences.

In terms of spatial normality, we investigate the pairwise distance of time series data sequences in a projection space. Generally, it is challenging to obtain the spatial relation of data samples evolving along the time, especially under the unsupervised learning paradigm. Motivated by the effective spatial information preservation by random projection [4, 16, 29], we leverage the pairwise distance in a randomly projected feature space. Since normal sequences often locate in a dense region, as shown in Fig. 1(c), the distance distribution of a normal sequence typically follows a Gaussian-like distribution with the expected value approximating zero. On the contrary, that of anomalous sequences tends to have a markedly different distance-based spatial distribution (e.g., uniform-like distribution). Neural networks are capable of predicting those distance-based spatial information, as shown in Fig. 1(d). By learning to accurately predict these distances, the networks learn the spatial distribution of the sequence normality. During inference, testing anomalous sequences manifest significant deviation from the normal data distribution according to the anomaly score (see Fig. 1(e)).

Based on this insight, we propose a novel approach for TSAD, called Spatial-Temporal Normality learning (STEN for short), in which two pretext tasks are designed to construct a self-supervised learning pipeline for the joint spatial-temporal normality modeling. Specifically, we first devise an Order prediction-based Temporal Normality learning module (OTN). The inputting time sequences are split into several sub-sequences. Supervised by their genuine order, the neural network learns how to sort shuffled sub-sequences, during which the temporal continuity can be captured and represented. On the other hand, we further introduce a Distance prediction-based Spatial Normality learning module (DSN), which models the spatial proximity of sequences in the representation space.

The main contributions are summarized as follows.

  • We introduce the concept of spatial-temporal normality of time series data and propose a novel anomaly detection approach STEN. STEN achieves spatial-temporal modeling of the data normality, contrasting to the existing methods that are focused on the temporal perspective only.

  • To learn the temporal normality, we propose the OTN module. This module captures temporal associations and contextual information in time series by learning the distribution of sub-sequence orders. Different from many Transformer-based methods that calculate associations via heavy attention mechanisms, our method successfully models temporal normality by fully harnessing the unique continuity of time series data.

  • To learn the spatial normality, we propose the DSN module. By wielding an informative random projection space, our approach further restrains spatial proximity in the representation space. Compared to mainstream reconstruction-based models that focus on individual normality, our approach investigates data affinity among the sequences, thereby capturing the normality beyond the temporal perspective.

  • STEN outperforms state-of-the-art methods on five popular benchmark datasets for time series anomaly detection.

2 Related Work

Time Series Anomaly Detection. Time series anomaly detection is an old discipline, which has received increasing attention in recent years. Early traditional methods focus on statistical approaches such as moving averages, AutoRegressive Integrated Moving Average (ARIMA) models [5], and their multiple variants [20]. With the emergence of machine learning technology, techniques including classification [19], clustering [15], ensemble learning [32],and time series forecasting [14] are applied to time series anomaly detection. Besides, Tsfresh has inspired the window-to-feature approach, enhancing the efficiency of feature extraction in time series analysis [8]. ROCKET’s focus on sub-sequence patterns through random convolutional kernels has inspired advancements in capturing local temporal patterns [11]. However, traditional methods are often constrained by the learning capability and the quantity of labeled data, making it challenging to achieve satisfactory performance.

With the burgeoning of deep learning techniques, many deep anomaly detection methods have been introduced in the literature. Owing to deep learning’s powerful capability in modeling intricate data patterns and distributions, deep anomaly detection methods dramatically improve the detection performance over conventional methods [23]. Mainstream deep time-series anomaly detection methods are based on generative models, in which the learning models are trained to predict or reconstruct original raw time series data [10]. Prediction-based methods train models to forecast the value of the next timestamp, using the discrepancy between predicted and actual values to indicate the abnormal degree of the current timestamp. Reconstruction-based methods compare the error between reconstructed and actual values. These methods employ various neural network architectures such as internal memory, dilated convolutions, and graph structure learning, intending to capture the temporal characteristic of time series, yielding impressive results across numerous benchmarks [30, 26, 12, 7]. Additionally, some studies have coupled them with adversarial training to amplify the discriminability of anomalies [3, 22]. A recent work, named Anomaly Transformer [34], utilizes the Transformer to model the associations between sequences and their neighboring priors, identifying anomalies through association differences.

Self-supervised Learning on Time Series. Self-supervised learning generates supervision signals from the data itself. Via various proxy learning tasks, the learning models embed data patterns into the projection space, offering semantic-rich representations to downstream tasks. Self-supervised time series analysis can be categorized into generative-, contrastive-, and adversarial-based methods [38]. As introduced above, generative methods rely on prediction/reconstruction learning objectives to capture the characteristics of time series data. These methods often overfit the training data, and anomalies may have similarly small reconstruction errors. Adversarial-based techniques, noted in [39, 21], might be hampered by complex and unstable training regimes.

Contrastive learning defines positive and negative pairs, and the learning process minimizes the distance between positive pairs and maximizes the distance between negative pairs in the feature space. This approach can encourage the model to learn meaningful and discriminative features. This concept has been applied in studies such as [27, 36, 9]. DCdetector [35] is one of the latest frameworks devised for anomaly detection tasks, which employs a dual-attention contrastive representation mechanism to differentiate between normal and abnormal samples effectively. However, this approach still overlooks the importance of spatial normality learning in comprehending the normal patterns of time series data.

3 STEN: Spatial-Temporal Normality Learning

3.1 Problem Statement

Let 𝒳=𝐱1,𝐱2,,𝐱N𝒳subscript𝐱1subscript𝐱2subscript𝐱𝑁\mathcal{X}=\left\langle\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\right\ranglecaligraphic_X = ⟨ bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , bold_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ⟩ be a sequence of multivariate time series with N𝑁Nitalic_N observations, with each observation 𝐱tDsubscript𝐱𝑡superscript𝐷\mathbf{x}_{t}\in\mathbb{R}^{D}bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT denotes the data values of D𝐷Ditalic_D variants at a certain timestamp t𝑡titalic_t, where 1tN1𝑡𝑁1\leq t\leq N1 ≤ italic_t ≤ italic_N, then unsupervised time series anomaly detection (TSAD) aims to learn an anomaly scoring function f𝑓fitalic_f without reliance on any labels of the training data 𝒳𝒳\mathcal{X}caligraphic_X. f𝑓fitalic_f is applied to measure the abnormality of observations in testing sequences 𝒳testsubscript𝒳test\mathcal{X}_{\text{test}}caligraphic_X start_POSTSUBSCRIPT test end_POSTSUBSCRIPT. Higher anomaly scores indicate a higher likelihood to be anomalies. In STEN, we design a particular f𝑓fitalic_f that can effectively capture spatial-temporal normality in training data 𝒳𝒳\mathcal{X}caligraphic_X.

3.2 Overview of The Proposed Approach

This paper proposes a novel unsupervised TSAD method STEN, in which data normality is learned from both temporal and spatial perspectives. As shown in Fig. 2, STEN consists of an Order prediction-based Temporal Normality learning module (OTN) and a Distance prediction-based Spatial Normality learning module (DSN). OTN learns the temporal normality information of the time series by modeling the sequential distribution of sub-sequences. On the other hand, DSN is designed to utilize the distance prediction between the sequence pairs for spatial normality learning. Finally, the anomaly score which reflects both temporal and spatial normality can be used to distinguish normal and abnormal sequences. The details of OTN and DSN are discussed in the following two sections.

Refer to caption
Figure 2: Overview of STEN. STEN consists of two self-supervised components: DSN and OTN. In DSN, the distances between sequence pairs after being projected by a random network η𝜂\etaitalic_η form a compact distribution, from which we distill the spatial normality patterns using a MSE loss function dsnsubscript𝑑𝑠𝑛\mathcal{L}_{dsn}caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT. OTN captures temporal normality by predicting the order among sub-sequences after a random shuffling using a distribution similarity-based loss function, Jensen-Shannon divergence.

3.3 OTN: Order Prediction-based Temporal Normality Learning

Time series datasets are essentially organized into specific sequences, where connectivity and sequential information of sub-sequences illuminate time-related patterns and contextual relationships within the data. The temporality of time series data reflects the dynamic changes of data in the time dimension. Analyzing temporality can help understand the dynamic changes and time correlation of data. Therefore, the OTN module is designed to extract temporal normality in time series data by predicting the temporal order between their sub-sequences.

To this end, OTN models the temporal patterns by predicting the primary temporal order of randomly shuffled sub-sequences. Specifically, a sliding window of length L𝐿Litalic_L and stride R𝑅Ritalic_R are first used to divide 𝒳𝒳\mathcal{X}caligraphic_X into a collection of n𝑛nitalic_n sequences 𝒮={S1,S1+R,,Sn}𝒮subscript𝑆1subscript𝑆1𝑅subscript𝑆𝑛{\cal S}=\left\{S_{1},S_{1+R},...,S_{n}\right\}caligraphic_S = { italic_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT 1 + italic_R end_POSTSUBSCRIPT , … , italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } , where each St=𝐱t,𝐱t+1,,𝐱t+L1subscript𝑆𝑡subscript𝐱𝑡subscript𝐱𝑡1subscript𝐱𝑡𝐿1S_{t}=\left\langle\mathbf{x}_{t},\mathbf{x}_{t+1},...,\mathbf{x}_{t+L-1}\right\rangleitalic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ⟨ bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , … , bold_x start_POSTSUBSCRIPT italic_t + italic_L - 1 end_POSTSUBSCRIPT ⟩. For each sequence S𝒮𝒾𝑆subscript𝒮𝒾S\in{\cal S_{i}}italic_S ∈ caligraphic_S start_POSTSUBSCRIPT caligraphic_i end_POSTSUBSCRIPT, we generate a fixed number of equal-length short sequences with a fixed length l𝑙litalic_l and stride r𝑟ritalic_r. One sequence Stsubscript𝑆𝑡S_{t}italic_S start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT will result in m𝑚mitalic_m sub-sequences t={R1,R2,,Rm}subscript𝑡subscript𝑅1subscript𝑅2subscript𝑅𝑚{\cal R}_{t}=\left\{R_{1},R_{2},...,R_{m}\right\}caligraphic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_R start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_R start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_R start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT }. The position orders of the sub-sequences are used as a set of self-supervised class labels to train the neural network ϕitalic-ϕ\phiitalic_ϕ. The Jensen-Shannon (JS) divergence, a popular method for measuring the difference between two different distributions, is then used to define our self-supervised loss otnsubscript𝑜𝑡𝑛\mathcal{L}_{otn}caligraphic_L start_POSTSUBSCRIPT italic_o italic_t italic_n end_POSTSUBSCRIPT in predicting the order distributions of these sub-sequences, which is formulated as follows:

otn=i=1cY^Rilog(Y^Ri12(Y^Ri+YRi))+i=1cYRilog(YRi12(Y^Ri+YRi)),subscriptotnsuperscriptsubscript𝑖1𝑐subscript^𝑌subscript𝑅𝑖subscript^𝑌subscript𝑅𝑖12subscript^𝑌subscript𝑅𝑖subscript𝑌subscript𝑅𝑖superscriptsubscript𝑖1𝑐subscript𝑌subscript𝑅𝑖subscript𝑌subscript𝑅𝑖12subscript^𝑌subscript𝑅𝑖subscript𝑌subscript𝑅𝑖{{\cal L}_{{\rm{otn}}}}=\sum\limits_{i=1}^{c}{{{\hat{Y}}_{{R_{i}}}}}\log\left(% {\frac{{{{{\hat{Y}}_{{R_{i}}}}}}}{{\frac{1}{2}\left({{{{\hat{Y}}_{{R_{i}}}}}+{% Y_{R_{i}}}}\right)}}}\right)+\sum\limits_{i=1}^{c}{{Y_{R_{i}}}}\log\left({% \frac{{{Y_{{R_{i}}}}}}{{\frac{1}{2}\left({{{{\hat{Y}}_{{R_{i}}}}}+{Y_{{R_{i}}}% }}\right)}}}\right),caligraphic_L start_POSTSUBSCRIPT roman_otn end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log ( divide start_ARG over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_Y start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_ARG ) + ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log ( divide start_ARG italic_Y start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_Y start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_ARG ) , (1)

where Y^Rsubscript^𝑌𝑅\hat{Y}_{R}over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT represents the softmax results of the temporal order prediction for the sub-sequence set \mathcal{R}caligraphic_R and YRsubscript𝑌𝑅Y_{{R}}italic_Y start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT is the ground truth order distribution. Minimizing this loss enables the learning of temporal normal patterns based on the local sub-sequence context.

3.4 DSN: Distance Prediction-based Spatial Normality Learning

The spatial patterns of time series data we aim to capture are the spatial distribution of the data that changes over time in a feature space. It describes the spatial distribution patterns, trends, and changes of the data over a period of time, which cannot be captured by the temporal modeling. To complement the temporal normality learning, we propose the DSN module to model the spatial normality of the time series data in a feature space. Different from OTN that is designed to learn the temporal dynamics within the individual sequences, DSN is designed to model the spatial relation between the sequences, and thus, it is performed on the sequences set 𝒮𝒮\cal Scaligraphic_S rather than the sub-sequences.

One challenge here is the lack of supervision signals for learning the spatial patterns. Inspired by the solid theoretical and empirical results in [4, 16, 29], we propose to use the distance of the sequences in a feature space spanned by random projection as our supervision source for learning the spatial patterns among the sequences. It has been theoretically justified and empirically shown in these prior studies that although it is in a randomly projected feature space, the distance information in the original space can be well preserved. To utilize these random distances through deep neural networks, we design the self-supervised distance prediction method. More specifically, for each sequence Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we randomly select one sequence from the remaining sequences and construct a sequence pair (Si,Sj)subscript𝑆𝑖subscript𝑆𝑗({S}_{i},{S}_{j})( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ). They are then respectively fed to a trainable network ϕ:DM:italic-ϕmaps-tosuperscript𝐷superscript𝑀\phi:\mathbb{R}^{D}\mapsto\mathbb{R}^{M}italic_ϕ : blackboard_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT with parameters ΘΘ\Thetaroman_Θ and a random network η:DM:𝜂maps-tosuperscript𝐷superscript𝑀\eta:\mathbb{R}^{D}\mapsto\mathbb{R}^{M}italic_η : blackboard_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT, which is a representation learner with parameters ΦΦ\Phiroman_Φ. The parameters ΦΦ\Phiroman_Φ in the network η𝜂\etaitalic_η is initialized with random weights and frozen, which serves as the random project model. It is used to project the original sequences and obtain the distance-based relation via:

dη(Si,Sj)=η(Si;Φ)Tη(Sj;Φ),subscript𝑑𝜂subscript𝑆𝑖subscript𝑆𝑗𝜂superscriptsubscript𝑆𝑖Φ𝑇𝜂subscript𝑆𝑗Φ{d_{\eta}}(S_{i},S_{j})=\eta(S_{i};\Phi)^{T}\eta(S_{j};\Phi),italic_d start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_η ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; roman_Φ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_η ( italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ; roman_Φ ) , (2)

where dη(Si,Sj)subscript𝑑𝜂subscript𝑆𝑖subscript𝑆𝑗{d_{\eta}}(S_{i},S_{j})italic_d start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) represents a spatial relation of the two sequences in the projected feature space. To learn these spatial relations, we train the network ϕitalic-ϕ\phiitalic_ϕ to predict/distill these distances via:

dsn=1ni=1n(dϕ(Si,Sj)dη(Si,Sj))2,subscript𝑑𝑠𝑛1𝑛superscriptsubscript𝑖1𝑛superscriptsubscript𝑑italic-ϕsubscript𝑆𝑖subscript𝑆𝑗subscript𝑑𝜂subscript𝑆𝑖subscript𝑆𝑗2\mathcal{L}_{dsn}=\frac{1}{n}\sum_{i=1}^{n}({d_{\phi}}(S_{i},S_{j})-{d_{\eta}}% (S_{i},S_{j}))^{2},caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_d start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - italic_d start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (3)

where n𝑛nitalic_n is the number of sequence pairs and dϕ(Si,Sj)subscript𝑑italic-ϕsubscript𝑆𝑖subscript𝑆𝑗{d_{\phi}}(S_{i},S_{j})italic_d start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) is defined as follows:

dϕ(Si,Sj)=ϕ(Si;Θ)Tϕ(Sj;Θ).subscript𝑑italic-ϕsubscript𝑆𝑖subscript𝑆𝑗italic-ϕsuperscriptsubscript𝑆𝑖Θ𝑇italic-ϕsubscript𝑆𝑗Θ{d_{\phi}}(S_{i},S_{j})=\phi(S_{i};\Theta)^{T}\phi(S_{j};\Theta).italic_d start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = italic_ϕ ( italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ; roman_Θ ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_ϕ ( italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ; roman_Θ ) . (4)

The training of the network ϕitalic-ϕ\phiitalic_ϕ can be seen as distilling semantic information from the distance of sequence pairs, learning the relative spatial normality of time series data. A normal sequence typically include a dense set of normal sequences in their local neighborhood while being distant from the other normal sequences. Thus, its distance-based spatial relation to the other sequences forms a Gaussian-like distribution. By contrast, the anomalous sequences are assumed to be distant from most sequences, so its distribution of the distances to other sequences tends to be a uniform one. DSN is designed to learn such discriminative patterns by minimizing the loss dsnsubscript𝑑𝑠𝑛\mathcal{L}_{dsn}caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT.

3.5 Training and Inference

Training. The OTN and DSN are synthesized in our approach STEN to capture the spatial-temporal dependencies of the time series data, offering a comprehensive modeling of the normal patterns. Thus, the overall loss function in STEN is composed of the loss functions from the two above two self-supervised tasks:

STEN=otn+αdsn,subscript𝑆𝑇𝐸𝑁subscript𝑜𝑡𝑛𝛼subscript𝑑𝑠𝑛\mathcal{L}_{STEN}=\mathcal{L}_{otn}+\alpha\mathcal{L}_{dsn},caligraphic_L start_POSTSUBSCRIPT italic_S italic_T italic_E italic_N end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT italic_o italic_t italic_n end_POSTSUBSCRIPT + italic_α caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT , (5)

where α𝛼\alphaitalic_α is a hyperparameter used to modulate the two modules, and the learnable parameters in network ϕitalic-ϕ\phiitalic_ϕ are jointly learned by the loss function STENsubscript𝑆𝑇𝐸𝑁\mathcal{L}_{STEN}caligraphic_L start_POSTSUBSCRIPT italic_S italic_T italic_E italic_N end_POSTSUBSCRIPT.

Inference. In the OTN module, because of the continuity of the normal patterns, the order distribution reflects the overall order prediction of a sub-sequence collection 𝓉subscript𝓉\cal{R}_{t}caligraphic_R start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT, and the discrete order of a single sub-sequence Risubscript𝑅𝑖R_{i}italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT reflects the prediction of the current sub-sequence. Thus, the discrepancies in both types of predictions for normal data will be less than those for anomalies. Motivated by this, we use both of these differences in our anomaly scoring, which is defined as follows:

Scoreotn(Ri)=|Y^RiYRi|otni,𝑆𝑐𝑜𝑟subscript𝑒𝑜𝑡𝑛subscript𝑅𝑖subscript^𝑌subscript𝑅𝑖subscript𝑌subscript𝑅𝑖subscript𝑜𝑡subscript𝑛𝑖{{Score}_{otn}}(R_{i})=\frac{\left|{{{{\hat{Y}}_{{R_{i}}}}}-{Y_{{R_{i}}}}}% \right|}{{\mathcal{L}}_{{otn}_{i}}},italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_o italic_t italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = divide start_ARG | over^ start_ARG italic_Y end_ARG start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_Y start_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT | end_ARG start_ARG caligraphic_L start_POSTSUBSCRIPT italic_o italic_t italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG , (6)

where the numerator is the difference in a single subspace while the denominator summarizes the differences across all sub-sequences. Note that isubscript𝑖{\cal L}_{i}caligraphic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT has a different scale from otnsubscript𝑜𝑡𝑛\mathcal{L}_{otn}caligraphic_L start_POSTSUBSCRIPT italic_o italic_t italic_n end_POSTSUBSCRIPT, so we perform an upsampling of this score in Eq. (6) by replicating it from a scalar to a vector of length of m𝑚mitalic_m.

The spatial normality of the sequences is captured in dsnsubscript𝑑𝑠𝑛\mathcal{L}_{dsn}caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT, i.e., for a normal sequence Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, it is expected to have substantially smaller dsnsubscript𝑑𝑠𝑛\mathcal{L}_{dsn}caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT than anomalous sequences when paired with other sequences Sjsubscript𝑆𝑗S_{j}italic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. To utilize spatial-temporal normality for TSAD, STEN defines an overall anomaly score as:

Score(Ri)=Scoreotn(Ri)+βScoredsn(Ri),𝑆𝑐𝑜𝑟𝑒subscript𝑅𝑖𝑆𝑐𝑜𝑟subscript𝑒𝑜𝑡𝑛subscript𝑅𝑖𝛽𝑆𝑐𝑜𝑟subscript𝑒𝑑𝑠𝑛subscript𝑅𝑖Score(R_{i})={{Score}_{otn}}(R_{i})+\beta{Score}_{{dsn}}(R_{i}),italic_S italic_c italic_o italic_r italic_e ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_o italic_t italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_β italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (7)

where β𝛽\betaitalic_β is a hyperparameter to control the importance of the spatial normality term Scoredsn(Ri)=dsn(Ri)𝑆𝑐𝑜𝑟subscript𝑒𝑑𝑠𝑛subscript𝑅𝑖subscript𝑑𝑠𝑛subscript𝑅𝑖{Score}_{{dsn}}(R_{i})=\mathcal{L}_{dsn}(R_{i})italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Note that dsnsubscript𝑑𝑠𝑛\mathcal{L}_{dsn}caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT measures the normality at the sequence level, and we assign the same anomaly score for the sub-subsequences within a sequence.

4 Experiments

4.1 Experimental Setup

Benchmark Datasets. Five publicly available multivariate time series datasets are used in our experiments, with the relevant statistics shown in Table 1. PSM (Pooled Server Metrics Dataset)  [1] is a dataset of IT system monitoring signals from eBay server machines with 25 dimensions. Both MSL (Mars Science Laboratory Dataset) and SMAP (Soil Moisture Active Passive Dataset) [18] are public datasets collected by NASA with 27 and 55 entities respectively, which contain sensor and actuator data from the Mars Rover, as well as soil samples and telemetry information from a satellite. Epilepsy (Epilepsy seizure dataset) is an activity dataset collected from a triaxial accelerometer on the wrist of a human subject’s hand, and we treat data during walking, running, and sawing as normal data and seizures as anomalies according to [33]. DSADS (Daily and Sports Activities Dataset) collects motion sensors from eight subjects, including 19 daily and physical activities. Following [33], we use intense activities as the anomaly class, with the data collected from other activities used as normal.

Table 1: Key dataset statistics.
Ntrainsubscript𝑁𝑡𝑟𝑎𝑖𝑛N_{train}italic_N start_POSTSUBSCRIPT italic_t italic_r italic_a italic_i italic_n end_POSTSUBSCRIPT Ntestsubscript𝑁𝑡𝑒𝑠𝑡N_{test}italic_N start_POSTSUBSCRIPT italic_t italic_e italic_s italic_t end_POSTSUBSCRIPT #Dimensions #Entities AnomalyRatio(%)
PSM 132,481 87,841 25 1 27.8
MSL 2,160 2,731 55 27 10.7
SMAP 2,556 8,071 25 55 13.1
Epilepsy 33,784 22,866 5 1 10.2
DSADS 85,500 57,000 47 1 6.3

Competing Methods. STEN is compared with the following eight state-of-the-art (SotA) anomaly detectors specifically designed for time series data. These competing methods can be generally categorized into four types, i.e., (1) reconstruction-based models: MSCRED (MSC for short) [37], TranAD (Tran for short) [28], and AnomalyTransformer (AT for short) [34]; (2) forecasting-based models: GDN [12]; (3) one-class classification models: TcnED (TED for short) [13], COUTA (COU for short) [33], and NCAD (NCA for short) [6]; (4) contrastive learning-based models: DCdetector (DC for short) [35].

Table 2: AUC-ROC, AUC-PR, and F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT results on five TSAD datasets.
Dataset Ours DC AT NCA Tran COU TED GDN MSC
AUC-ROC PSM 0.998 0.967 0.993 0.973 0.972 0.975 0.970 0.968 0.963
MSL 0.996 0.961 0.979 0.983 0.986 0.989 0.983 0.976 0.899
SMAP 0.999 0.991 0.993 0.970 0.921 0.919 0.926 0.978 0.821
Epilepsy 0.998 0.886 0.996 0.984 0.935 0.951 0.933 0.990 0.832
DSADS 0.994 0.983 0.962 0.985 0.984 0.992 0.985 0.974 0.905
AUC-PR PSM 0.995 0.946 0.989 0.942 0.955 0.955 0.948 0.940 0.932
MSL 0.941 0.915 0.933 0.868 0.905 0.886 0.897 0.849 0.483
SMAP 0.991 0.961 0.971 0.884 0.753 0.758 0.717 0.869 0.644
Epilepsy 0.991 0.837 0.980 0.943 0.790 0.760 0.773 0.967 0.565
DSADS 0.948 0.876 0.926 0.880 0.917 0.947 0.913 0.785 0.659
BEST F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT PSM 0.986 0.959 0.982 0.908 0.914 0.925 0.881 0.870 0.889
MSL 0.944 0.932 0.942 0.809 0.888 0.867 0.890 0.852 0.605
SMAP 0.974 0.963 0.967 0.845 0.699 0.701 0.711 0.781 0.668
Epilepsy 0.991 0.875 0.987 0.904 0.803 0.793 0.777 0.918 0.640
DSADS 0.934 0.908 0.932 0.875 0.846 0.901 0.844 0.825 0.657

Evaluation Metrics. The performance of STEN is measured according to a wide range of evaluation metrics. Following the mainstream studies in this research line [3, 28, 33], we adopt three popular metrics including the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), the Area Under the Precision-Recall Curve (AUC-PR), and the best F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score. Note that these three metrics are calculated upon anomaly scores processed by the point-adjust strategy [31]. It is a commonly used strategy in time series anomaly detection [6, 33, 35, 34], which adjusts the anomaly score of each anomaly segment to the highest score within this segment. To circumvent biases introduced by point adjustment, we further employ five recently proposed evaluation measures. Considering the temporal relationship/distance between ground truths and predictions, [17] calculates the affiliation precision and recall. We utilize the harmonic mean of these two measures, i.e., Affiliation F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (denoted by Aff-F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT). [24] introduces a novel metric called Range-AUC, which extends the AUC measurement to account for range-based anomalies. Additionally, the paper introduces volume under the ROC surface (VUS-ROC) and volume under the PR surface (VUS-PR) as new metrics for computing the volume under ROC and PR curves, respectively.

Implementation Details. We summarize the default implementation settings of our method STEN as follows. Both the number of generated sub-sequences m𝑚mitalic_m and the sub-sequence length l𝑙litalic_l are set to 10 across all datasets. The temporal modeling network ϕitalic-ϕ\phiitalic_ϕ is composed of a single-layer GRU network with shared parameters across 10 units, and the dimension of the hidden state dmodelsubscript𝑑𝑚𝑜𝑑𝑒𝑙d_{model}italic_d start_POSTSUBSCRIPT italic_m italic_o italic_d italic_e italic_l end_POSTSUBSCRIPT is set to 256. To balance the influence of OTN and DSN, the hyperparameters α𝛼\alphaitalic_α and β𝛽\betaitalic_β are set to 1111 by default. For calculating the affiliation metric, an anomaly is defined as any timestamp whose anomaly score exceeds the (100δ)100𝛿(100-\delta)( 100 - italic_δ ) percentile, with δ𝛿\deltaitalic_δ setting to 0.6 by default. The weight parameters are optimized using Adam optimizer with a learning rate of 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT for a total of 5 epochs. For the PSM, SMAP, and DSADS datasets, the batch size is set to 256, while for the MSL and Epilepsy datasets, each mini-batch contains 64 training samples.

4.2 Main Results

To verify the detection performance of our method, we conduct experiments on five real-world TSAD datasets and compare it with eight SotA methods.

The results for the methods in AUC-ROC, AUC-PR, and best F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT metrics are shown in Table 2. Our method STEN is the best performer on all five datasets across the three metrics, indicating the importance and effectiveness of joint modeling of spatial-temporal normality. On average, STEN obtains an AUC-ROC of 0.997, an AUC-PR of 0.973, and an F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT of 0.966. Specifically, in terms of the AUC-PR metric, STEN achieves an average improvement ranging from approximately 1.3% to 31.6% over eight SotA methods, which highlights the effectiveness of our method in the precision and recall rates of detecting anomalies. Impressively, our method marks a significant improvement on the SMAP dataset, achieving an improvement of 2% to 34.7% in the AUC-PR metric. Compared to other methods, AT exhibits relatively better detection performance on these datasets, utilizing advanced Transformer structures and concepts like association differences for effective normality modeling. Nevertheless, our method STEN still consistently outperforms AT, e.g., by a large margin on some challenging datasets like MSL and DSADS. This is mainly due to the additional spatial normality modeling in STEN, besides the temporal normality learning.

Table 3: Aff-F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, RAUC-ROCsubscript𝑅AUC-ROCR_{\text{AUC-ROC}}italic_R start_POSTSUBSCRIPT AUC-ROC end_POSTSUBSCRIPT, RAUC-PRsubscript𝑅AUC-PRR_{\text{AUC-PR}}italic_R start_POSTSUBSCRIPT AUC-PR end_POSTSUBSCRIPT, VUS-ROC, and VUS-PR results.
Dataset Method Aff-F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT RAUC-ROCsubscript𝑅AUC-ROCR_{\text{AUC-ROC}}italic_R start_POSTSUBSCRIPT AUC-ROC end_POSTSUBSCRIPT RAUC-PRsubscript𝑅AUC-PRR_{\text{AUC-PR}}italic_R start_POSTSUBSCRIPT AUC-PR end_POSTSUBSCRIPT VUS-ROC VUS-PR
PSM NCA 0.394 0.862 0.826 0.868 0.828
DC 0.583 0.817 0.839 0.805 0.826
AT 0.507 0.952 0.959 0.883 0.905
Ours 0.636 0.973 0.957 0.966 0.945
MSL NCA 0.586 0.947 0.755 0.949 0.760
DC 0.641 0.844 0.791 0.841 0.786
AT 0.659 0.915 0.844 0.917 0.804
Ours 0.687 0.955 0.853 0.957 0.858
SMAP NCA 0.662 0.958 0.838 0.957 0.837
DC 0.662 0.944 0.914 0.939 0.908
AT 0.675 0.968 0.938 0.958 0.930
Ours 0.677 0.985 0.954 0.985 0.952
Epilepsy NCA 0.585 0.965 0.917 0.965 0.917
DC 0.655 0.770 0.808 0.773 0.811
AT 0.691 0.935 0.936 0.939 0.940
Ours 0.746 0.964 0.943 0.966 0.947
DSADS NCA 0.650 0.835 0.779 0.839 0.785
DC 0.617 0.942 0.829 0.943 0.834
AT 0.682 0.866 0.860 0.870 0.864
Ours 0.657 0.940 0.856 0.944 0.865

It is important to note that recent studies have had vigorous discussions on how to fairly evaluate the performance of TSAD methods. Although the results in Table 2 have already demonstrated that our method outperforms the SotA methods under the three traditional metrics, to have a more comprehensive and fair evaluation, we also use several recently proposed evaluation metrics. The evaluation results of our model under the new metrics are shown in Table 3, with the best competing models NCA, DC, and AT as our baselines. It is evident that on the PSM, MSL and SMAP datasets, our method outperforms all others across almost all new metrics. For the remaining two datasets, we still achieve superior results, with a minimal margin of less than 0.005 in all cases except Aff-F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT on DSADS. These results re-affirm the improvement of our method over SotA methods.

4.3 Ablation Study

Significance of the OTN and DSN Modules. We then investigate the importance of the two components DSN and OTN. We remove each of them from our method STEN individually, resulting in two ablation variants: one with only DSN and another with only OTN. The results are presented in Table 4, which demonstrate that both of the DSN and OTN modules have some major contributions to the model’s superior detection performance. This is particularly true for the OTN module, which significantly increases the model’s performance in terms of the F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score by an average of approximately 16.9% over the five datasets.

OTN vs Error Prediction-based Approach in Modeling Temporal Normality. Furthermore, to illustrate the effectiveness of OTN in capturing temporal patterns, we derive a variant of STEN that combines the DSN with a popular temporal pattern modeling module based on error prediction (EP) of the current timestamp. The experiment results reveal that the OTN module achieves improvements of approximately 7.3% in AUC-PR and 10.1% in the F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score, respectively. These results demonstrate that our OTN module is significantly more effective compared to the popular error prediction methods, which is mainly due to the fact that the OTN module allows the modeling of normal patterns in multiple diverse temporal contexts, contrasting to the single temporal context in the existing error prediction method.

Table 4: Ablation study results of STEN. EP represents the popular error prediction-based approach for temporal pattern modeling.
PSM MSL SMAP Epilepsy DSADS
DSN OTN AUC-PR F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT AUC-PR F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT AUC-PR F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT AUC-PR F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT AUC-PR F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
0.913 0.902 0.763 0.752 0.847 0.794 0.794 0.808 0.790 0.728
0.995 0.975 0.946 0.919 0.990 0.963 0.979 0.978 0.929 0.906
0.995 0.986 0.941 0.944 0.991 0.974 0.991 0.991 0.948 0.934
DSN + EP 0.984 0.952 0.930 0.897 0.921 0.856 0.957 0.939 0.707 0.681

4.4 Qualitative Analysis

Refer to caption
Figure 3: (Top to Bottom) Training/testing segments of the PSM dataset, and the anomalous timestamps (marked by small dots) predicted by STEN and the five best-performing competing methods. The ground truth anomalous segments are highlighted in gold. The most prominent false-positive data segments are encircled in red dashed lines.

We visualize the results to further showcase the anomaly detection capabilities of STEN. Fig. 3 presents the data from the PSM dataset (feature id 22), along with the anomaly timestamps detected by STEN and its competing methods. This dataset exhibits complex temporal patterns, which may lead to false positives by many algorithms during detection. Compared to the competing methods, our method STEN can detect all three anomalies while at the same time having the least false positives. For example, COU and TED incorrectly label test data segments that are similar to the normal patterns in the training data as anomalies. Meanwhile, both Tran and TED miss an anomaly, i.e., the second anomalous segment. DC and AT illustrate comparable performance in detecting all three anomalies, but they still produce more false positives than our method. This advatange is due to the spatial normality modeling of STEN, which is often ignored in existing methods like DC and AT.

4.5 Sensitivity Analysis

We also conduct sensitivity analysis on four key hyperparameters. The F1subscript𝐹1F_{1}italic_F start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT score and AUC-PR results have a similar trend and we report the AUC-PR results in Fig. 4 due to page limitation. Fig. 4(a) illustrates the sensitivity of the hyperparameter regulating the loss dsnsubscript𝑑𝑠𝑛\mathcal{L}_{dsn}caligraphic_L start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT, i.e., α𝛼\alphaitalic_α, in the model’s overall loss function. This analysis aims to investigate the model’s capability to capture the normal patterns of data when applying different weights to our two modules. The results indicate that the model’s performance remains stable across multiple benchmark datasets within a relatively large value range. Fig. 4(b) focuses on the weight β𝛽\betaitalic_β of Scoredsn(Ri)𝑆𝑐𝑜𝑟subscript𝑒𝑑𝑠𝑛subscript𝑅𝑖{Score}_{{dsn}}(R_{i})italic_S italic_c italic_o italic_r italic_e start_POSTSUBSCRIPT italic_d italic_s italic_n end_POSTSUBSCRIPT ( italic_R start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), examining the contribution of the two modules to the anomaly scoring. The results suggest that varying β𝛽\betaitalic_β does not significantly affect the model’s detection performance, showcasing the model’s robustness. We also experiment with different sub-sequence lengths l𝑙litalic_l. The results are shown in Fig. 4(c), from which it is clear that our method achieves stable detection performance with sub-sequences of various lengths ranging from 10 to 100. Furthermore, the anomaly threshold δ𝛿\deltaitalic_δ serves as a hyperparameter that distinguishes anomalies from normal fluctuations in the data. Fig. 4(d) illustrates the experimental results when δ𝛿\deltaitalic_δ varies between 0.5 and 1. It can be observed that our method performs stably w.r.t. changes in δ𝛿\deltaitalic_δ.

Refer to caption
(a) α𝛼\alphaitalic_α
Refer to caption
(b) β𝛽\betaitalic_β
Refer to caption
(c) l𝑙litalic_l
Refer to caption
(d) δ𝛿\deltaitalic_δ
Figure 4: AUC-PR results of STEN w.r.t different hyperparameters.

4.6 Time Efficiency

We compare training time of neural network-based models across various datasets to investigate their time efficiency. Due to the obviously high algorithmic complexity of GDN and MSCRED, we select six methods with comparable training duration for comparison, as shown in Table 5. Unlike previous methods, our model employs two modules (DSN and OTN) to train and combine sub-sequences of different length, potentially increasing the computational complexity. Although not showing the most efficient performance, our method illustrates significantly better detection accuracy than the competing methods.

Table 5: Run times (in seconds) in multiple datasets. The best results are indicated in bold, while the worst results are underlined.
Methods PSM MSL SMAP Epilepsy DSADS
TED 308.2 140.0 317.3 76.0 1646.3
Tran 144.5 70.3 152.8 35.2 755.1
NCA 711.3 330.3 746.8 179.0 421.5
COU 282.5 127.0 286.6 71.5 1440.2
DC 508.6 380.8 967.1 98.3 1536.5
AT 11.7 11.1 12.6 10.8 15.4
Ours 214.9 137.3 477.8 44.4 483.3

5 Conclusion

This article introduces a novel TSAD method named STEN. Unlike existing methods that solely focus on the temporal normality of time series data, STEN incorporates the spatial normality of the data as well, enabling a more comprehensive normality learning of the time series data. This enables STEN to achieve greater discriminative feature representations in distinguishing abnormal sequences from the normal ones. In STEN, we design two pretext tasks to extract spatial-temporal features of time series data. The order prediction-based temporal normality learning (OTN) models the temporal dependencies of the data by modeling the distribution of sub-sequence order, while the distance prediction-based spatial normality learning (DSN) learns the spatial relation of sequence pairs in the projected space in the form of distance prediction. Extensive experiments demonstrate that STEN exhibits superior TSAD performance, outperforming eight SotA algorithms on five benchmark datasets, with the effectiveness of each of the proposed two modules justified in our ablation study.

{credits}

5.0.1 Acknowledgements

This work was supported in part by the Key Cooperation Project of the Chongqing Municipal Education Commission under Grants HZ2021017 and HZ2021008, in part by the National Natural Science Foundation of China under Grant 62102442.

5.0.2 \discintname

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • [1] Abdulaal, A., Liu, Z., Lancewicki, T.: Practical approach to asynchronous multivariate time series anomaly detection and localization. In: Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. pp. 2485–2494 (2021)
  • [2] Anandakrishnan, A., Kumar, S., Statnikov, A., Faruquie, T., Di, X.: Anomaly detection in finance: Editors’ introduction. Knowledge Discovery and Data Mining (2017)
  • [3] Audibert, J., Michiardi, P., Guyard, F., Marti, S., Zuluaga, M.A.: Usad: Unsupervised anomaly detection on multivariate time series. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020)
  • [4] Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 245–250 (2001)
  • [5] Box, G.E.P., Pierce, D.A.: Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association p. 1509 (1970)
  • [6] Carmona, C.U., Aubet, F.X., Flunkert, V., Gasthaus, J.: Neural contextual anomaly detection for time series. arXiv preprint arXiv:2107.07702 (2021)
  • [7] Chen, F., Qin, Z., Zhou, M., Zhang, Y., Deng, S., Fan, L., Pang, G., Wen, Q.: LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Time Series Anomaly Detection. In: Proceedings of the ACM on Web Conference. pp. 4138-4149 (2024).
  • [8] Christ, M., Braun, N., Neuffer, J., Kempa-Liehr, A.W.: Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package). Neurocomputing 307, 72–77 (2018)
  • [9] Sun, Y., Pang, G., Ye, G., Chen, T., Hu, X., Yin, H.: Unraveling the ‘Anomaly’ in Time Series Anomaly Detection: A Self-supervised Tri-domain Solution. arXiv preprint arXiv:2311.11235 (2023)
  • [10] Darban, Z.Z., Webb, G.I., Pan, S., Aggarwal, C.C., Salehi, M.: Deep learning for time series anomaly detection: A survey. arXiv preprint arXiv:2211.05244 (2022)
  • [11] Dempster, A., Petitjean, F., Webb, G.I.: Rocket: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery 34(5), 1454–1495 (2020)
  • [12] Deng, A., Hooi, B.: Graph neural network-based anomaly detection in multivariate time series. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 4027–4035 (2021)
  • [13] Garg, A., Zhang, W., Samaran, J., Savitha, R., Foo, C.S.: An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Transactions on Neural Networks and Learning Systems p. 2508–2517 (2022)
  • [14] Günnemann, N., Günnemann, S., Faloutsos, C.: Robust multivariate autoregression for anomaly detection in dynamic product ratings. In: Proceedings of the 23rd international conference on World wide web. pp. 361–372 (2014)
  • [15] Hautamaki, V., Karkkainen, I., Franti, P.: Outlier detection using k-nearest neighbour graph. In: Proceedings of the 17th International Conference on Pattern Recognition. vol. 3, pp. 430–433. IEEE (2004)
  • [16] Hegde, C., Wakin, M., Baraniuk, R.: Random projections for manifold learning. Advances in neural information processing systems 20 (2007)
  • [17] Huet, A., Navarro, J.M., Rossi, D.: Local evaluation of time series anomaly detection algorithms. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 635–645 (2022)
  • [18] Hundman, K., Constantinou, V., Laporte, C., Colwell, I., Soderstrom, T.: Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Jul 2018)
  • [19] Karczmarek, P., Kiersztyn, A., Pedrycz, W., Al, E.: K-means-based isolation forest. Knowledge-Based Systems p. 105659 (2020)
  • [20] Li, J., Di, S., Shen, Y., Chen, L.: Fluxev: A fast and effective unsupervised framework for time-series anomaly detection. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining (2021)
  • [21] Li, X., Metsis, V., Wang, H., Ngu, A.H.H.: Tts-gan: A transformer-based time-series generative adversarial network. In: International Conference on Artificial Intelligence in Medicine. pp. 133–143. Springer (2022)
  • [22] Liu, S., Zhou, B., Ding, Q., Hooi, B., Zhang, Z., Shen, H., Cheng, X.: Time series anomaly detection with adversarial reconstruction networks. IEEE Transactions on Knowledge and Data Engineering p. 4293–4306 (2023)
  • [23] Pang, G., Shen, C., Cao, L., Hengel, A.V.D.: Deep learning for anomaly detection: A review. ACM computing surveys (CSUR) 54(2), 1–38 (2021)
  • [24] Paparrizos, J., Boniol, P., Palpanas, T., Tsay, R.S., Elmore, A., Franklin, M.J.: Volume under the surface: a new accuracy evaluation measure for time-series anomaly detection. Proceedings of the VLDB Endowment 15(11), 2774–2787 (2022)
  • [25] Pereira, J., Silveira, M.: Learning representations from healthcare time series data for unsupervised anomaly detection. In: 2019 IEEE international conference on big data and smart computing (BigComp). pp. 1–7. IEEE (2019)
  • [26] Ren, H., Xu, B., Wang, Y., Yi, C., Huang, C., Kou, X., Xing, T., Yang, M., Tong, J., Zhang, Q.: Time-series anomaly detection service at microsoft. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019)
  • [27] Schneider, T., Qiu, C., Kloft, M., Latif, D.A., Staab, S., Mandt, S., Rudolph, M.: Detecting anomalies within time series using local neural transformations. arXiv preprint arXiv:2202.03944 (2022)
  • [28] Tuli, S., Casale, G., Jennings, N.R.: Tranad: Deep transformer networks for anomaly detection in multivariate time series data. arXiv preprint arXiv:2201.07284 (2022)
  • [29] Wang, H., Pang, G., Shen, C., Ma, C.: Unsupervised representation learning by predicting random distances. arXiv preprint arXiv:1912.12186 (2019)
  • [30] Wu, W., He, L., Lin, W., Su, Y., Cui, Y., Maple, C., Jarvis, S.: Develo** an unsupervised real-time anomaly detection scheme for time series with multi-seasonality. IEEE Transactions on Knowledge and Data Engineering 34(9), 4147–4160 (2020)
  • [31] Xu, H., Feng, Y., Chen, J., Wang, Z., Qiao, H., Chen, W., Zhao, N., Li, Z., Bu, J., Li, Z., Liu, Y., Zhao, Y., Pei, D.: Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In: Proceedings of theConference on World Wide Web (2018)
  • [32] Xu, H., Pang, G., Wang, Y., Wang, Y.: Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering 35(12), 12591-12604 (2024)
  • [33] Xu, H., Wang, Y., Jian, S., Liao, Q., Wang, Y., Pang, G.: Calibrated one-class classification for unsupervised time series anomaly detection. IEEE Transactions on Knowledge and Data Engineering (2024)
  • [34] Xu, J., Wu, H., Wang, J., Long, M.: Anomaly transformer: Time series anomaly detection with association discrepancy. In: International Conference on Learning Representations (2022)
  • [35] Yang, Y., Zhang, C., Zhou, T., Wen, Q., Sun, L.: Dcdetector: Dual attention contrastive representation learning for time series anomaly detection. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 3033–3045 (2023)
  • [36] Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., Xu, B.: Ts2vec: Towards universal representation of time series. Proceedings of the AAAI Conference on Artificial Intelligence (2022)
  • [37] Zhang, C., Song, D., Chen, Y., Feng, X., Lumezanu, C., Cheng, W., Ni, J., Zong, B., Chen, H., Chawla, N.V.: A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 1409–1416 (2019)
  • [38] Zhang, K., Wen, Q., Zhang, C., Cai, R., **, M., Liu, Y., Zhang, J., Liang, Y., Pang, G., Song, D., et al.: Self-supervised learning for time series analysis: Taxonomy, progress, and prospects. arXiv preprint arXiv:2306.10125 (2023)
  • [39] Zhou, B., Liu, S., Hooi, B., Cheng, X., Ye, J.: Beatgan: Anomalous rhythm detection using adversarially generated time series. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (2019)