License: arXiv.org perpetual non-exclusive license
arXiv:2403.01895v1 [cs.LG] 04 Mar 2024

UNSUPERVISED DISTANCE METRIC LEARNING FOR ANOMALY DETECTION OVER MULTIVARIATE TIME SERIES

Abstract

Distance-based time series anomaly detection methods are prevalent due to their relatively non-parametric nature and interpretability. However, the commonly used Euclidean distance is sensitive to noise. While existing works have explored dynamic time warping (DTW) for its robustness, they only support supervised tasks over multivariate time series (MTS), leaving a scarcity of unsupervised methods. In this work, we propose FCM-wDTW, an unsupervised distance metric learning method for anomaly detection over MTS, which encodes raw data into latent space and reveals normal dimension relationships through cluster centers. FCM-wDTW introduces locally weighted DTW into fuzzy C-means clustering and learns the optimal latent space efficiently, enabling anomaly identification via data reconstruction. Experiments with 11 different types of benchmarks demonstrate our method’s competitive accuracy and efficiency.

Index Terms: Anomaly Detection, Multivariate Time Series, Unsupervised Learning, Dynamic Time Warping

1 Introduction

Anomaly detection for multivariate time series is challenging because of the intricate correlations and dependencies among dimensions. It has been studied in various application fields, e.g., financial fraud detection [1], network intrusion detection [2], and industrial control fault diagnosis [3]. An anomaly is classically defined as an observation that deviates significantly from other observations, raising suspicions that it was generated by a different mechanism [4]. Generally, MTS-generating systems remain in a stable normal status, characterized by consistent relationships among dimensions. The goal of multivariate anomaly detection is to identify abnormal relationships that disrupt this normal status.

Most existing anomaly detection methods are semi-supervised or unsupervised [5, 6], as anomaly labels are typically rare and costly. The former train models on normal samples and mark the points or subsequences that go against the model as anomalous, whereas the latter separate anomalies from the normal part by frequency, shape, or distribution. The state-of-the-art deep learning methods for multivariate anomaly detection are mostly semi-supervised [5]. They therefore need prior knowledge of the data for training, which limits their applicability in practice. Furthermore, according to recent surveys [5, 7, 8], including [5], which evaluates 71 anomaly detection methods on 976 time series datasets, deep learning methods have yet to demonstrate a clear performance advantage, while they have relatively high complexity and poor interpretability. In contrast, conventional statistical analysis and machine learning methods are simpler, lighter, and easier to interpret.

Among existing works, distance-based anomaly detection methods are particularly prevalent due to their non-parametric property and strong interpretability [5]. They are mostly based on nearest neighbors or clustering [9, 10, 11], where the choice of distance metric plays a crucial part. However, the commonly used Euclidean distance is noise-sensitive and may degrade performance. To address this issue, DTW has been successfully applied to anomaly detection on univariate time series [12]. Nevertheless, extending DTW to MTS remains a challenge due to the complex dependency among dimensions, which affects the local pointwise measure and, in turn, the overall precision. Previous attempts use the PCA similarity factor [13] or a parameterized distance metric [14, 15, 16, 11] as the pointwise measure of DTW. However, they only support supervised tasks, and no sound unsupervised distance metric learning method has been proposed so far.

To bridge this gap, we propose an unsupervised distance metric learning method for DTW in multivariate anomaly detection. Specifically, we enhance fuzzy C-means clustering (FCM) by introducing a locally weighted DTW (wDTW) as the distance metric. FCM-wDTW encodes raw data into a latent space, where cluster centers represent the normal relationships among dimensions of MTS. The optimal latent space and wDTW are then learned by an efficient closed-form optimization algorithm. Finally, anomalies are identified by reconstructing data from the latent space: samples far away from every cluster have large reconstruction errors, which indicate anomalies. Our contributions are summarized as follows:

  • We improve FCM clustering by introducing wDTW, which enhances the clustering accuracy over MTS, and build an efficient optimization algorithm for it.

  • Based on the learned latent space and wDTW, we propose a multivariate anomaly detection method with data reconstruction and anomaly scoring computation.

  • Extensive experiments with 11 unsupervised multivariate anomaly detection benchmarks demonstrate the favorable performance of FCM-wDTW.

2 Problem statement

MTS can be seen as a sequence of sampling vectors over correlated variables. Specifically, an MTS $T$ consists of a sequence of $n$ observations, i.e., $T=\{\boldsymbol{t}_{1},\boldsymbol{t}_{2},\ldots,\boldsymbol{t}_{n}\}$, where $\boldsymbol{t}_{i}\in\mathbb{R}^{d}$ and $d>1$. Anomaly detection over MTS is formally defined as follows.

Definition 1 (Multivariate Anomaly Detection).

Given an MTS $T$, multivariate anomaly detection aims at computing an anomaly score $s_{i}$ for each observation $\boldsymbol{t}_{i}$, such that the higher $s_{i}$ is, the more anomalous $\boldsymbol{t}_{i}$ is.

Note that Definition 1 makes no assumption about whether anomalies are point or subsequence anomalies. If the anomaly scores of consecutive observations are high, these observations can be detected as a subsequence anomaly.
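The remark on subsequence anomalies can be illustrated with a small sketch. Definition 1 only requires one score per observation; the fixed threshold and the helper name `anomalous_runs` below are illustrative assumptions and are not part of the method described in this paper.

```python
from typing import List, Tuple


def anomalous_runs(scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Group consecutive indices whose score exceeds `threshold` into (start, end) runs.

    A run of length one corresponds to a point anomaly; a longer run can be
    reported as a subsequence anomaly. The threshold is a hypothetical choice.
    """
    runs = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - 1))  # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(scores) - 1))  # run reaching the end of the series
    return runs


scores = [0.1, 0.2, 0.9, 0.8, 0.95, 0.1, 0.7, 0.2]
print(anomalous_runs(scores, threshold=0.5))  # [(2, 4), (6, 6)]
```

Here indices 2 to 4 form one subsequence anomaly, while index 6 is an isolated point anomaly.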

3 Method

In this section, we first introduce the objective and optimization algorithm of FCM-wDTW, providing a comprehensive analysis of the algorithm. Then, we describe the anomaly detection method based on the learned latent space and wDTW.

3.1 Objective of FCM-wDTW

The computation of DTW over MTS contains two layers: the pointwise measure on sampling vectors and the dynamic programming (DP) algorithm over the pointwise cost matrix (PCM). To regulate MTS dimensions, we parameterize DTW by employing the weighted Euclidean distance (WED) as the pointwise measure. Given two $w$-dimensional MTS $X=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{i},\ldots,\boldsymbol{x}_{m}\}$ and $Y=\{\boldsymbol{y}_{1},\ldots,\boldsymbol{y}_{j},\ldots,\boldsymbol{y}_{n}\}$, where $\boldsymbol{x}_{i}=[x_{i1},x_{i2},\ldots,x_{iw}]$ and $\boldsymbol{y}_{j}=[y_{j1},y_{j2},\ldots,y_{jw}]$ are sampling vectors, and the optimal warping path (OWP) $\boldsymbol{p}=\{p_{k}\in M\mid p_{k}=(i,j),\ 1\leq i\leq m,\ 1\leq j\leq n,\ 1\leq k\leq l\}$, where $M\in\mathbb{R}^{m\times n}$ is the PCM and $p_{1}=(1,1)$, $p_{l}=(m,n)$, $p_{k+1}-p_{k}\in\{(1,0),(0,1),(1,1)\}$, by introducing WED we obtain wDTW, formulated as

$$w\mathrm{DTW}(X,Y)=\sum_{p}\mathrm{wed}(\boldsymbol{x}_{i},\boldsymbol{y}_{j})=\sum_{p}\sum_{d=1}^{w}\lambda_{d}^{q}\left(x_{id}-y_{jd}\right)^{2} \tag{1}$$

where $\mathrm{wed}(\boldsymbol{x}_{i},\boldsymbol{y}_{j})$ denotes the WED between $\boldsymbol{x}_{i}$ and $\boldsymbol{y}_{j}$, $\lambda_{d}\in[0,1]$ is the weight of the $d$-th dimension that satisfies $\sum_{d=1}^{w}\lambda_{d}=1$, and $q>1$ is a hyper-parameter.
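The two layers of (1) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `wdtw` is hypothetical, the DP is the plain quadratic recursion without any window constraint, ties during backtracking are broken in favor of the diagonal step, and path indices are 1-based to match the notation above.

```python
import numpy as np


def wdtw(x, y, lam, q=2.0):
    """Weighted DTW between x (m, w) and y (n, w); returns (distance, warping path)."""
    x, y, lam = np.asarray(x, float), np.asarray(y, float), np.asarray(lam, float)
    # Layer 1: pointwise cost matrix, the weighted squared Euclidean distance per (i, j).
    pcm = (((x[:, None, :] - y[None, :, :]) ** 2) * lam ** q).sum(axis=2)
    m, n = pcm.shape
    # Layer 2: dynamic programming over the PCM (row 0 and column 0 are padding).
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = pcm[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from (m, n) to (1, 1) using the allowed steps (1,1), (1,0), (0,1).
    i, j = m, n
    path = [(i, j)]
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda t: acc[t])
        path.append((i, j))
    return float(acc[m, n]), path[::-1]


x = [[0, 0], [1, 1], [2, 2]]
y = [[0, 0], [1, 1], [1, 1], [2, 2]]
dist, path = wdtw(x, y, lam=[0.5, 0.5])
print(dist, path)  # 0.0 [(1, 1), (2, 2), (2, 3), (3, 4)]
```

In this toy case `y` repeats one observation of `x`, so the warping path absorbs the repetition and the distance is zero. Setting a weight to zero removes the corresponding dimension from the pointwise measure.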

To learn an adaptive wDTW for unsupervised tasks, we introduce it into FCM as the kernel distance metric. Based on (1), the objective function of FCM-wDTW can be formulated as:

$$\begin{aligned}
J(U,V,\Lambda) &= \min\sum_{i=1}^{c}\sum_{j=1}^{n}u_{ij}^{m}\cdot w\mathrm{DTW}\left(v_{i},x_{j}\right) \\
&= \min\sum_{i=1}^{c}\sum_{j=1}^{n}\sum_{p}\sum_{d=1}^{w}u_{ij}^{m}\lambda_{d}^{q}\left(v_{id}-x_{jd}\right)^{2} \\
\mathrm{s.t.}\quad & 1)\ \sum_{i=1}^{c}u_{ij}=1,\ \forall j=1,2,\ldots,n, \qquad 2)\ \sum_{d=1}^{w}\lambda_{d}=1
\end{aligned} \tag{2}$$

where $\Lambda=\{\lambda_{d}\mid d\in[1,w]\}$ denotes the set of weight parameters, $V=\{v_{i}\mid i\in[1,c]\}$ denotes the set of cluster centers, and $U=[u_{ij}]$ denotes the membership matrix, where $i\in[1,c]$, $j\in[1,n]$, and $u_{ij}\in[0,1]$.

3.2 Optimizing FCM-wDTW

Because wDTW is nondifferentiable, it is difficult to optimize the objective function (2) directly. An alternating scheme instead solves the following four partial optimization sub-problems iteratively:

  • Problem 1: Keeping $U$, $V$, and $\Lambda$ fixed, update the OWPs between $X$ and $V$;

  • Problem 2: Keeping the OWPs, $V$, and $\Lambda$ fixed, update $U$;

  • Problem 3: Keeping the OWPs, $U$, and $V$ fixed, update $\Lambda$;

  • Problem 4: Keeping the OWPs, $U$, and $\Lambda$ fixed, update $V$.

The four problems iteratively optimize the four factors determining the loss of FCM-wDTW, each of which has an explicit meaning. $V$ contains all cluster centers, which reveal the normal patterns of MTS and construct the latent space. $U$ contains the encoding features of samples, which can be seen as the proportions of the normal patterns composing a sample (where $u_{ij}\in[0,1]$ and $\sum_{i=1}^{c}u_{ij}=1$). $\Lambda$ models the correlation among dimensions of MTS, while the OWPs regulate the temporal relationship between samples and normal patterns. Together they make up the distance metric in the latent space. Problem 1 can be solved by the DP algorithm, and Problems 2∼4 can be solved by the Lagrange multiplier method, where (2) is reformulated as:

$$\begin{aligned}
L=\sum_{i=1}^{c}\sum_{j=1}^{n}&u_{ij}^{m}\cdot w\mathrm{DTW}(v_{i},x_{j}) \\
&+\sum_{j=1}^{n}\delta_{j}\left(\sum_{i=1}^{c}u_{ij}-1\right)+\mu\left(\sum_{d=1}^{w}\lambda_{d}-1\right)
\end{aligned} \tag{3}$$
Theorem 1.

With the OWPs, $V$, and $\Lambda$ fixed, $U$ can be updated by:

$$u_{ij}=\frac{1}{\sum_{s=1}^{c}\left[\frac{w\mathrm{DTW}\left(v_{i},x_{j}\right)}{w\mathrm{DTW}\left(v_{s},x_{j}\right)}\right]^{\frac{1}{m-1}}} \tag{4}$$
Proof.

By (3), we have

$$\frac{\partial L}{\partial u_{ij}}=m u_{ij}^{m-1}\cdot w\mathrm{DTW}\left(v_{i},x_{j}\right)+\delta_{j} \tag{5}$$

As the OWPs, $V$, and $\Lambda$ are fixed, the wDTW distance is constant. By setting $\frac{\partial L}{\partial u_{ij}}=0$, we have

$$u_{ij}=\left(\frac{-\delta_{j}}{m}\right)^{\frac{1}{m-1}}\left[\frac{1}{w\mathrm{DTW}\left(v_{i},x_{j}\right)}\right]^{\frac{1}{m-1}} \tag{6}$$

By substituting (6) into the first constraint of (2), we have

$$\left(\frac{-\delta_{j}}{m}\right)^{\frac{1}{m-1}}=\frac{1}{\sum_{i=1}^{c}\left[\frac{1}{w\mathrm{DTW}\left(v_{i},x_{j}\right)}\right]^{\frac{1}{m-1}}} \tag{7}$$

By substituting (7) into (6), the proof is complete. ∎
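The update in (4) can be written compactly once the wDTW distances between all centers and samples are available. The sketch below assumes they are given as a $c\times n$ array; the function name `update_membership` and the small floor `eps`, which avoids division by zero when a sample coincides with a center, are illustrative assumptions rather than part of the theorem.

```python
import numpy as np


def update_membership(dist, m=2.0, eps=1e-12):
    """Eq. (4): dist[i, j] = wDTW(v_i, x_j) with shape (c, n); returns U with shape (c, n)."""
    dist = np.maximum(np.asarray(dist, float), eps)  # assumption: floor zero distances
    inv = dist ** (-1.0 / (m - 1.0))                 # [1 / wDTW(v_i, x_j)]^(1/(m-1))
    return inv / inv.sum(axis=0, keepdims=True)      # normalize over the c clusters


dist = np.array([[1.0, 4.0],
                 [4.0, 1.0]])
U = update_membership(dist, m=2.0)
print(U)              # [[0.8 0.2] [0.2 0.8]]
print(U.sum(axis=0))  # each column sums to 1, as required by the first constraint of (2)
```

Dividing the numerator and denominator of each entry by $[1/w\mathrm{DTW}(v_i,x_j)]^{1/(m-1)}$ recovers the ratio form of (4), so the two expressions agree.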

Theorem 2.

With the OWPs, $U$, and $V$ fixed, $\Lambda$ can be updated by

$$\lambda_{d}=\frac{1}{\sum_{s=1}^{w}\left(\frac{A_{d}}{A_{s}}\right)^{\frac{1}{q-1}}} \tag{8}$$

where $A_{d}=\sum_{i=1}^{c}\sum_{j=1}^{n}\sum_{p}u_{ij}^{m}\left(v_{id}-x_{jd}\right)^{2}$ denotes the intra-cluster distance on the $d$-th dimension.

The proof of Theorem 2 follows a similar procedure to that of Theorem 1 and is omitted here.
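For reference, the analogous steps can be sketched as follows. This is a reconstruction from (3) and the second constraint of (2), mirroring (5) to (7), and not text from the original proof. With the OWPs, $U$, and $V$ fixed, each $A_d$ is a constant and the first term of (3) equals $\sum_{d=1}^{w}\lambda_d^{q}A_d$.

```latex
\begin{aligned}
\frac{\partial L}{\partial \lambda_{d}} &= q\,\lambda_{d}^{\,q-1}A_{d}+\mu = 0
\;\Longrightarrow\;
\lambda_{d}=\left(\frac{-\mu}{q}\right)^{\frac{1}{q-1}}\left[\frac{1}{A_{d}}\right]^{\frac{1}{q-1}}, \\
\sum_{d=1}^{w}\lambda_{d}=1
&\;\Longrightarrow\;
\left(\frac{-\mu}{q}\right)^{\frac{1}{q-1}}
=\frac{1}{\sum_{s=1}^{w}\left[\frac{1}{A_{s}}\right]^{\frac{1}{q-1}}}, \\
\lambda_{d}&=\frac{\left[1/A_{d}\right]^{\frac{1}{q-1}}}{\sum_{s=1}^{w}\left[1/A_{s}\right]^{\frac{1}{q-1}}}
=\frac{1}{\sum_{s=1}^{w}\left(\frac{A_{d}}{A_{s}}\right)^{\frac{1}{q-1}}}.
\end{aligned}
```

The last line is exactly (8). The derivation assumes $q\neq 1$ and $A_d>0$ for every dimension.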

Theorem 3.

With the OWPs, $U$, and $\Lambda$ fixed, $V$ can be updated by

$$v_{idr}=\frac{\sum_{j=1}^{n}u_{ij}^{m}\sum_{s=1}^{a}x_{jds}\cdot I(r,s)}{\sum_{j=1}^{n}u_{ij}^{m}\sum_{s=1}^{a}I(r,s)},\quad r=1,\ldots,b \tag{9}$$

where $a$ and $b$ are the lengths of $x_{j}$ and $v_{i}$, respectively, and $I(r,s)$ is an indicator function: $I(r,s)=1$ if $(r,s)\in\boldsymbol{p}$ and $0$ otherwise, where $\boldsymbol{p}$ is the OWP between $v_{i}$ and $x_{j}$.

Proof.

By (3), we have

$$\frac{\partial L}{\partial v_{idr}}=\sum_{j=1}^{n}u_{ij}^{m}\lambda_{d}^{q}\sum_{s=1}^{a}2\left(v_{idr}-x_{jds}\right)\cdot I(r,s)$$

By setting $\frac{\partial L}{\partial v_{idr}}=0$, we can get (9), and the proof is complete. ∎
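Equation (9) is a membership-weighted average of all sample observations that the OWPs align to position $r$ of the center; the factor $\lambda_d^q$ is common to every term of the derivative and cancels when $\lambda_d\neq 0$. A minimal sketch for a single cluster center is given below. The function name `update_center`, the argument layout, and the use of 0-based path indices are illustrative assumptions.

```python
import numpy as np


def update_center(samples, paths, u_m, length):
    """Eq. (9) for one cluster center v_i.

    samples: list of arrays x_j, each with shape (a_j, w)
    paths:   list of OWPs, one per sample, as 0-based (r, s) pairs
             (r indexes the center, s indexes the sample)
    u_m:     sequence of u_ij ** m for this center, one value per sample
    length:  number of time steps of the center
    """
    w = samples[0].shape[1]
    num = np.zeros((length, w))
    den = np.zeros(length)
    for x, path, weight in zip(samples, paths, u_m):
        for r, s in path:            # I(r, s) = 1 exactly for pairs on the path
            num[r] += weight * x[s]
            den[r] += weight
    return num / den[:, None]


x1 = np.array([[0.0], [2.0], [4.0]])
x2 = np.array([[1.0], [3.0]])
paths = [[(0, 0), (0, 1), (1, 2)], [(0, 0), (1, 1)]]
print(update_center([x1, x2], paths, u_m=[1.0, 1.0], length=2))  # [[1. ] [3.5]]
```

Position 0 of the center averages the aligned values 0, 2, and 1, and position 1 averages 4 and 3. Every center position appears on each warping path at least once, so the denominator is positive whenever at least one membership is positive.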

Based on the solutions above, FCM-wDTW can be optimized by iterating Problems 1∼4, as shown in Algorithm 1. Step 1 initializes the cluster centers $V$ and the dimension weights $\Lambda$. One feasible initialization is to randomly choose $c$ samples from the dataset as the initial $V$. The weights $\Lambda$ can be randomly initialized within the range $[0,1]$, subject to the condition $\sum_{d=1}^{w}\lambda_{d}=1$. Steps 2∼9 update the corresponding variables in turn. Step 10 calculates the loss of the objective function and determines whether the algorithm has converged.

Algorithm 1 FCM-wDTW
Input:  MTS dataset $D=\{x_{j}\mid j\in[1,n]\}$, cluster number $c$, fuzzy coefficient $m$, exponent $q$, and loss threshold $\varepsilon$
Output:  Membership matrix $U$, cluster centers $V$, weight parameters $\Lambda$
1:  initialize $V$, $\Lambda$
2:  repeat
3:     for each pair $(v_{i}, x_{j})$ do
4:        Calculate the PCM $M_{ij}$
5:        Seek $\mathrm{OWP}_{ij}$ by the DP algorithm
6:     end for
7:     Update $U$ by (4)
8:     Update $\Lambda$ by (8)
9:     Update $V$ by (9)
10:  until $J(U,V,\Lambda)<\varepsilon$

3.3 Algorithm Analysis

Complexity of FCM-wDTW. Given the average length of samples $a$, the dimensionality $w$ ($\ll a$), and the number of clusters $c$ ($\ll a$), the computational complexity of seeking the OWPs in Steps 3∼6 is $O\left(ncwa^{2}\right)$, updating the membership matrix in Step 7 is $O\left(nc^{2}\right)$, updating the weight parameters in Step 8 is $O\left(ncwa+w^{2}\right)$, and updating the cluster centers in Step 9 is $O(ncwa)$. If the algorithm iterates $e$ times, the complexity of the whole algorithm is $O\left(encwa^{2}\right)$.

Hyperparameter analysis. FCM-wDTW has two hyperparameters: the fuzzy coefficient $m$ and the exponent $q$ of the weight coefficient. By (4), as $m \rightarrow 1$, the membership matrix $U$ becomes sparse and the algorithm approaches hard clustering, while as $m \rightarrow +\infty$, $U$ becomes uniform and the algorithm approaches uniform fuzzy clustering. A previous study [17] showed experimentally that the optimal values of $m$ are typically lower than the commonly used value of 2.0. Thus, we restrict $m$ to the range $(1, 2]$. On the other hand, by (8), a larger weight should strengthen the contribution of an MTS dimension with a smaller intra-cluster distance $A_d$, and vice versa. Following this principle, we examine the values of $q$ over different ranges. When $q = 0$, $\lambda_d^q \equiv 1$ and the WED reduces to the original Euclidean distance, which defeats our aim of discriminating among the MTS dimensions. When $0 < q < 1$, a larger $A_d$ yields a larger $\lambda_d^q$, violating the principle above. When $q = 1$, the weight coefficient of the dimension with the smallest $A_d$ is one and the others are zero, so only a single dimension contributes to clustering and the resulting information loss would seriously degrade clustering accuracy. When $q > 1$ or $q < 0$, a larger $A_d$ yields a smaller $\lambda_d^q$, satisfying the expected principle. Thus, $q$ should lie within the range $(-\infty, 0) \cup (1, +\infty)$.
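The effect of $m$ on the membership matrix can be illustrated numerically. The sketch below assumes the textbook FCM membership update, $u_i = 1 / \sum_k (D_i / D_k)^{1/(m-1)}$, where $D_i$ is the distance from one sample to center $i$; the paper's update (4) may differ in detail, and the function name and distances are hypothetical.

```python
def fcm_memberships(distances, m):
    """Textbook FCM memberships of one sample, given distances to c centers.

    Assumes m > 1 and strictly positive distances (an assumption of this
    sketch, not a statement about the paper's update rule).
    """
    exponent = 1.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


dists = [1.0, 2.0, 4.0]  # hypothetical distances to three cluster centers
near_hard = fcm_memberships(dists, m=1.05)     # almost one-hot
classic = fcm_memberships(dists, m=2.0)        # commonly used value
near_uniform = fcm_memberships(dists, m=50.0)  # almost 1/c each

for name, u in [("m=1.05", near_hard), ("m=2", classic), ("m=50", near_uniform)]:
    print(name, [round(v, 3) for v in u])
```

With these inputs the memberships move from nearly one-hot to nearly uniform as $m$ grows, which matches the two limiting cases described above.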

3.4 Anomaly Detection

In FCM-wDTW, data are projected into the latent space constructed from the cluster centers, which, in terms of the proportion of membership degree, is composed of normal patterns only. Intuitively, if we reconstruct data from this space, the reconstructions are expected to lie as close to the cluster centers as possible, so samples with abnormal components will have large reconstruction errors. Thus, the anomaly score can be computed from the difference between a sample and its reconstruction.

Based on the optimal cluster centers, partition matrix, and wDTW, the reconstructed samples can be obtained by solving the objective function of FCM-wDTW directly. Let $\ddot{x}_j$ denote the reconstruction of $x_j$. The objective function of clustering $\ddot{x}_j$ with FCM-wDTW is then

$$
\begin{aligned}
J(U, V, \Lambda) &= \min \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \cdot \mathrm{wDTW}\left(v_i, \ddot{x}_j\right) \\
&= \min \sum_{i=1}^{c} \sum_{j=1}^{n} \sum_{p} \sum_{d=1}^{w} u_{ij}^{m} \lambda_d^{q} \left(v_{id} - \ddot{x}_{jd}\right)^2
\end{aligned}
\tag{10}
$$

By setting the gradient of $J$ with respect to $\ddot{x}_j$ to zero, we have

$$
\ddot{x}_{jds} = \frac{\sum_{i=1}^{c} u_{ij}^{m} \sum_{r=1}^{b} v_{idr} \cdot I(r, s)}{\sum_{i=1}^{c} u_{ij}^{m} \sum_{r=1}^{b} I(r, s)}, \quad s = 1, \ldots, a \tag{11}
$$

The anomaly score of each sample is then computed by (12) as the wDTW distance between the raw data and its reconstruction.

$$
s_j = \mathrm{wDTW}\left(x_j, \ddot{x}_j\right) \tag{12}
$$
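The reconstruction in (11) and the score in (12) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `wdtw`, `reconstruct`, and `anomaly_score` are hypothetical names, `weights[d]` stands in for $\lambda_d^q$, the indicator $I(r, s)$ is taken to be 1 when $(r, s)$ lies on the optimal warping path between a center and the raw sample, and the centers, memberships, and samples are toy values.

```python
def wdtw(x, y, weights):
    """Weighted DTW between two series, each a list of w-dimensional points.

    Returns (distance, path); path lists 0-based (r, s) index pairs and
    weights[d] stands in for lambda_d^q.
    """
    def cost(p, q):
        return sum(wt * (pd - qd) ** 2 for wt, pd, qd in zip(weights, p, q))

    b, a = len(x), len(y)
    inf = float("inf")
    acc = [[inf] * (a + 1) for _ in range(b + 1)]
    acc[0][0] = 0.0
    for r in range(1, b + 1):
        for s in range(1, a + 1):
            best_prev = min(acc[r - 1][s - 1], acc[r - 1][s], acc[r][s - 1])
            acc[r][s] = cost(x[r - 1], y[s - 1]) + best_prev

    # Backtrack from the end to recover one optimal warping path.
    path, r, s = [], b, a
    while r > 0 and s > 0:
        path.append((r - 1, s - 1))
        _, r, s = min(
            (acc[r - 1][s - 1], r - 1, s - 1),
            (acc[r - 1][s], r - 1, s),
            (acc[r][s - 1], r, s - 1),
        )
    path.reverse()
    return acc[b][a], path


def reconstruct(sample, centers, memberships, weights, m):
    """Membership-weighted average of aligned center points, as in (11)."""
    a, w = len(sample), len(sample[0])
    num = [[0.0] * w for _ in range(a)]
    den = [0.0] * a
    for center, u in zip(centers, memberships):
        _, path = wdtw(center, sample, weights)
        for r, s in path:  # I(r, s) = 1 exactly for the pairs on the path
            den[s] += u ** m
            for d in range(w):
                num[s][d] += u ** m * center[r][d]
    return [[num[s][d] / den[s] for d in range(w)] for s in range(a)]


def anomaly_score(sample, centers, memberships, weights, m):
    """wDTW distance between a sample and its reconstruction, as in (12)."""
    recon = reconstruct(sample, centers, memberships, weights, m)
    score, _ = wdtw(sample, recon, weights)
    return score


# Toy data: two 2-dimensional cluster centers of length 5 (hypothetical).
center_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 1.0], [0.0, 0.0]]
center_b = [[0.0, 0.0] for _ in range(5)]
centers, memberships = [center_a, center_b], [0.9, 0.1]
weights, m = [1.0, 1.0], 2.0

normal = [row[:] for row in center_a]
anomalous = [row[:] for row in center_a]
anomalous[2][1] = 8.0  # break the relationship between the two dimensions

normal_score = anomaly_score(normal, centers, memberships, weights, m)
anomalous_score = anomaly_score(anomalous, centers, memberships, weights, m)
print(normal_score < anomalous_score)
```

Because every reconstructed value is a convex combination of center values, a sample whose dimensions deviate from all centers cannot be reproduced and therefore receives a large score.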

4 EXPERIMENTS

In this section, we first evaluate the performance of FCM-wDTW on multivariate anomaly detection, using 4 datasets and 11 baselines. We then examine the runtime efficiency of the optimization algorithm for FCM-wDTW.

Table 1: Anomaly detection accuracy comparison. The optimal results are highlighted in bold.
Method Type CalIt2 (ROC-AUC, PR-AUC) PCSO5 (ROC-AUC, PR-AUC) PCSO10 (ROC-AUC, PR-AUC) PCSO20 (ROC-AUC, PR-AUC)
LOF [9] Distance 0.727 0.119 0.445 0.008 0.440 0.009 0.447 0.009
KNN [10] Distance 0.883 0.264 0.681 0.014 0.752 0.019 0.599 0.011
CBLOF [18] Distance 0.871 0.256 0.271 0.006 0.355 0.007 0.225 0.006
HBOS [19] Distance 0.873 0.250 0.207 0.006 0.297 0.007 0.185 0.006
COF [20] Distance 0.839 0.161 0.463 0.009 0.432 0.008 0.471 0.009
EIF [21] Trees 0.885 0.259 0.948 0.083 0.847 0.032 0.856 0.033
IF-LOF [22] Trees 0.795 0.122 0.610 0.012 0.603 0.012 0.619 0.013
iForest [23] Trees 0.881 0.259 0.859 0.033 0.838 0.030 0.836 0.029
COPOD [24] Distribution 0.884 0.269 0.809 0.024 0.465 0.009 0.808 0.024
PCC [25] Reconstruction 0.757 0.240 0.874 0.038 0.600 0.012 0.870 0.037
Torsk [26] Forecasting 0.585 0.054 0.909 0.100 0.919 0.065
FCM-wDTW Reconstruction 0.904 0.465 0.993 0.818 0.974 0.423 0.952 0.345

4.1 Setup

Environment. All experiments run on an Intel(R) Core(TM) i9-12900K CPU @ 3.2GHz with 32GB memory under Ubuntu 20.04. The programming language is Python 3.8.

Datasets.

Fig. 1: CalIt2 and GutenTAG samples.

To evaluate the anomaly detection accuracy, we consider four datasets, namely CalIt2 [27] and PCSO5, PCSO10, and PCSO20 from GutenTAG [5]. CalIt2 comes from the data streams of people flowing in and out of a building at the University of California, Irvine, over 15 weeks, with 48 time slices per day (half-hour count aggregates). The purpose is to predict the presence of an event, such as a conference in the building, that is reflected by unusually high people counts for that day or time period. GutenTAG is a time series anomaly generator that produces series with single or multiple channels, each containing a base oscillation with a large variety of well-labeled anomalies at different positions. It generates time series of five base types (sine, ECG, random walk, cylinder bell funnel, and polynomial) with different lengths, variances, amplitudes, frequencies, and dimensions. The injected anomalies cover nine different types. Fig. 1 shows samples from both data collections. In the CalIt2 sample, the relationship between the two dimensions reverses with large amplitude in the shaded area, indicating a conference as an anomaly event. In the GutenTAG sample, the first dimension is injected with polynomial amplitude anomalies within the shaded area.

Parameters. To make clustering robust, the cluster centers of FCM-wDTW are initialized with density peak clustering (DPC) [28]. Following the hyperparameter analysis above, the value ranges of the two hyperparameters $m$ and $q$ are $(1.0, 2.0]$ and $[-10, 0) \cup (1, 10]$, with adjustment steps of 0.3 and 2, respectively. In addition, the sliding window size in anomaly detection is 16.
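As a minimal sketch of the windowing step, the snippet below splits a multivariate series into subsequences of length 16. The stride of 1, the function name, and the toy series are assumptions, since the text specifies only the window size.

```python
def sliding_windows(series, size=16, stride=1):
    """Split a series (list of w-dimensional points) into overlapping windows.

    The window size of 16 follows the text; stride=1 is an assumption.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [
        series[start:start + size]
        for start in range(0, len(series) - size + 1, stride)
    ]


# Hypothetical 2-dimensional series with 48 time steps.
series = [[float(t), float(t % 4)] for t in range(48)]
windows = sliding_windows(series, size=16, stride=1)
print(len(windows), len(windows[0]))
```

Each window can then be treated as one sample $x_j$ for clustering and scoring; a series shorter than the window yields no samples.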

Metrics. We use two threshold-agnostic evaluation metrics. 1) AUC-ROC contrasts the TP rate (recall) with the FP rate and focuses on an algorithm's sensitivity. 2) AUC-PR contrasts precision with recall and focuses on an algorithm's preciseness. Table 1 reports them as ROC-AUC and PR-AUC.
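Both metrics can be computed directly from anomaly scores and binary labels. The sketch below is illustrative and rests on assumptions: ROC-AUC is computed through its rank interpretation (the probability that an anomaly scores higher than a normal point), and PR-AUC is approximated by average precision, which is one common estimator and may differ from the interpolation used in [5]. The labels and scores are made up.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (anomaly, normal) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between an anomaly and a normal point count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean precision at the rank of each true anomaly.

    Tied scores are not treated specially in this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one anomaly is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


labels = [0, 0, 1, 0, 1, 0]               # 1 marks an anomaly
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]  # hypothetical anomaly scores
auc_roc = roc_auc(labels, scores)
auc_pr = average_precision(labels, scores)
print(auc_roc, auc_pr)
```

Neither function needs a decision threshold, which is what makes the two metrics threshold-agnostic.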

Baselines. We adopt the 11 unsupervised multivariate baselines published in the state-of-the-art survey [5], that is, all of them except DBStream, for which no results were published. Table 1 summarizes the name and type of each baseline. To avoid implementation bias, we use the results from [5] directly. All baseline results were obtained by optimizing the parameters globally for the best average AUC-ROC score.

4.2 Accuracy

We test the cluster number $c$ with values {10, 20, 30, 40, 50}. The best anomaly detection results of FCM-wDTW and the baseline results are reported in Table 1. Distance-based methods such as LOF, KNN, and CBLOF perform relatively poorly, which may be due to the sensitivity of the Euclidean distance to noise. Tree-based methods achieve a relatively high ROC-AUC on most datasets, but they tend to have a low PR-AUC, indicating that they struggle to achieve a good trade-off between precision and recall. In contrast, FCM-wDTW achieves the best ROC-AUC on every dataset together with a PR-AUC between 0.345 and 0.818. This indicates that FCM-wDTW achieves a favorable trade-off between the TP rate and the FP rate, as well as between precision and recall. Overall, FCM-wDTW outperforms all other methods on the four datasets in terms of both AUC-ROC and AUC-PR, indicating its effectiveness and robustness in detecting anomalies.

4.3 Runtime

We compare the wall-clock runtime of FCM-wDTW against seven clustering baselines [29], namely CD, PDC, FCFW, FCMD-DTW, PAM-DTW, GAK-DBA, and soft-DTW, on the datasets CMUsubject16 and ECG [30]. The results are shown in Fig. 2. Each method is run 100 times and the average runtime is reported. On CMUsubject16, the runtimes of all methods except GAK-DBA are of the same order of magnitude. On ECG, the same holds for all methods except soft-DTW. In addition, the runtime of FCM-wDTW is relatively low on ECG but high on CMUsubject16. FCM-wDTW needs fewer than 10 iterations on both datasets, but the sample lengths and the numbers of dimensions differ greatly between them. By (8), the extra cost of FCM-wDTW on CMUsubject16 is caused by initializing the cluster centers and updating the weight coefficients. Overall, although a more complex distance metric is introduced, the runtime of FCM-wDTW remains comparable to that of other clustering methods.
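The averaging protocol can be sketched with a small timing harness. The helper name and the stand-in workload are hypothetical; in an actual comparison the callable would run one clustering method on one dataset.

```python
import time


def average_runtime(fn, repeats=100):
    """Average wall-clock seconds per call of fn over `repeats` runs."""
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


calls = []


def toy_workload():
    # Stand-in for a single clustering run (hypothetical).
    calls.append(sum(i * i for i in range(1000)))


mean_seconds = average_runtime(toy_workload, repeats=100)
print(f"{mean_seconds:.6f} s per run")
```

Averaging over many repetitions reduces the influence of scheduling noise on the reported runtime.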

Fig. 2: Runtime comparison of different clustering methods on CMUsubject16 and ECG.

5 CONCLUSION

In this work, we propose an unsupervised distance metric learning method based on FCM and wDTW for multivariate anomaly detection. Our method solves the reformulated optimization problem in closed form and builds an efficient optimization algorithm on these solutions. Anomalies are identified by reconstructing data from the learned optimal latent space. Experiments on four datasets against 11 baselines show that our method achieves the best AUC-ROC and AUC-PR, with a runtime comparable to that of other clustering methods. Future work will focus on examining its effectiveness on more practical problems, e.g., network performance monitoring, abnormal account detection, and attack behavior identification.

References

  • [1] Dongxu Huang, Dejun Mu, Libin Yang, and Xiaoyan Cai, “Codetect: Financial fraud detection with anomaly feature detection,” IEEE Access, vol. 6, pp. 19161–19174, 2018.
  • [2] Neeraj Kumar and Upendra Kumar, “Anomaly-based network intrusion detection: An outlier detection techniques,” in Proceedings of the Eighth International Conference on Soft Computing and Pattern Recognition (SoCPaR 2016). Springer, 2018, pp. 262–269.
  • [3] Astha Garg, Wenyu Zhang, Jules Samaran, Ramasamy Savitha, and Chuan-Sheng Foo, “An evaluation of anomaly detection and diagnosis in multivariate time series,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2508–2517, 2021.
  • [4] Douglas M Hawkins, Identification of outliers, vol. 11, Springer, 1980.
  • [5] Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock, “Anomaly detection in time series: a comprehensive evaluation,” Proceedings of the VLDB Endowment, vol. 15, no. 9, pp. 1779–1797, 2022.
  • [6] Ane Blázquez-García, Angel Conde, Usue Mori, and Jose A Lozano, “A review on outlier/anomaly detection in time series data,” ACM Computing Surveys (CSUR), vol. 54, no. 3, pp. 1–33, 2021.
  • [7] Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A Zuluaga, “Do deep neural networks contribute to multivariate time series anomaly detection?,” Pattern Recognition, vol. 132, pp. 108945, 2022.
  • [8] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon, “Towards a rigorous evaluation of time-series anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 7194–7201.
  • [9] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander, “Lof: identifying density-based local outliers,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp. 93–104.
  • [10] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets,” in Proceedings of the 2000 ACM SIGMOD international conference on Management of data, 2000, pp. 427–438.
  • [11] Jinbo Li, Hesam Izakian, Witold Pedrycz, and Iqbal Jamal, “Clustering-based anomaly detection in multivariate time series data,” Applied Soft Computing, vol. 100, pp. 106919, 2021.
  • [12] Seif-Eddine Benkabou, Khalid Benabdeslem, and Bruno Canitia, “Unsupervised outlier detection for time series by entropy and dynamic time warping,” Knowledge and Information Systems, vol. 54, no. 2, pp. 463–486, 2018.
  • [13] Zoltán Bankó and János Abonyi, “Correlation based dynamic time warping of multivariate time series,” Expert Systems with Applications, vol. 39, no. 17, pp. 12814–12823, 2012.
  • [14] Qinglin Cai, Ling Chen, and Jianling Sun, “Piecewise statistic approximation based similarity measure for time series,” Knowledge-Based Systems, vol. 85, pp. 181–195, 2015.
  • [15] Jiangyuan Mei, Meizhu Liu, Yuan-Fang Wang, and Huijun Gao, “Learning a mahalanobis distance-based dynamic time warping measure for multivariate time series classification,” IEEE transactions on Cybernetics, vol. 46, no. 6, pp. 1363–1374, 2015.
  • [16] Jingyi Shen, Weiping Huang, Dongyang Zhu, and Jun Liang, “A novel similarity measure model for multivariate time series based on lmnn and dtw,” Neural Processing Letters, vol. 45, pp. 925–937, 2017.
  • [17] Witold Pedrycz and José Valente de Oliveira, “A development of fuzzy encoding and decoding through fuzzy clustering,” IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 4, pp. 829–837, 2008.
  • [18] Zengyou He, Xiaofei Xu, and Shengchun Deng, “Discovering cluster-based local outliers,” Pattern recognition letters, vol. 24, no. 9-10, pp. 1641–1650, 2003.
  • [19] Markus Goldstein and Andreas Dengel, “Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm,” KI-2012: poster and demo track, vol. 1, pp. 59–63, 2012.
  • [20] Jian Tang, Zhixiang Chen, Ada Wai-Chee Fu, and David W Cheung, “Enhancing effectiveness of outlier detections for low density patterns,” in Advances in Knowledge Discovery and Data Mining: 6th Pacific-Asia Conference, PAKDD 2002 Taipei, Taiwan, May 6–8, 2002 Proceedings 6. Springer, 2002, pp. 535–548.
  • [21] Sahand Hariri, Matias Carrasco Kind, and Robert J Brunner, “Extended isolation forest,” IEEE transactions on knowledge and data engineering, vol. 33, no. 4, pp. 1479–1489, 2019.
  • [22] Zhangyu Cheng, Chengming Zou, and Jianwei Dong, “Outlier detection using isolation forest and local outlier factor,” in Proceedings of the conference on research in adaptive and convergent systems, 2019, pp. 161–168.
  • [23] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, “Isolation forest,” in 2008 eighth ieee international conference on data mining. IEEE, 2008, pp. 413–422.
  • [24] Zheng Li, Yue Zhao, Nicola Botta, Cezar Ionescu, and Xiyang Hu, “Copod: copula-based outlier detection,” in 2020 IEEE international conference on data mining (ICDM). IEEE, 2020, pp. 1118–1123.
  • [25] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, “A novel anomaly detection scheme based on principal component classifier,” in Proceedings of the IEEE foundations and new directions of data mining workshop. IEEE Press, 2003, pp. 172–179.
  • [26] Niklas Heim and James E Avery, “Adaptive anomaly detection in chaotic time series with a spatially aware echo state network,” arXiv preprint arXiv:1909.01709, 2019.
  • [27] Arthur Asuncion and David Newman, “Uci machine learning repository,” 2007.
  • [28] Alex Rodriguez and Alessandro Laio, “Clustering by fast search and find of density peaks,” science, vol. 344, no. 6191, pp. 1492–1496, 2014.
  • [29] Hailin Li and Miao Wei, “Fuzzy clustering based on feature weights for multivariate time series,” Knowledge-Based Systems, vol. 197, pp. 105907, 2020.
  • [30] Mustafa Gokce Baydogan, “Multivariate time series classification datasets,” 2015.