MELODY: Robust Semi-Supervised Hybrid Model for Entity-Level Online Anomaly Detection with Multivariate Time Series

**gchao Ni, Gauthier Guinet, Peihong Jiang^∗, Laurent Callot, Andrey Kan AWS AI Labs
{**gchni, guinetgg, lcallot, avkan}@amazon.com, [email protected]

Abstract.

In large IT systems, software deployment is a crucial process in online services as their code is regularly updated. However, a faulty code change may degrade the target service’s performance and cause cascading outages in downstream services. Thus, software deployments should be comprehensively monitored, and their anomalies should be detected timely. In this paper, we study the problem of anomaly detection for deployments. We begin by identifying the challenges unique to this anomaly detection problem, which is at entity-level (e.g., deployments), relative to the more typical problem of anomaly detection in multivariate time series (MTS). The unique challenges include the heterogeneity of deployments, the low latency tolerance, the ambiguous anomaly definition, and the limited supervision. To address them, we propose a novel framework, semi-supervised hybrid Model for Entity-Level Online Detection of anomalY (MELODY). MELODY first transforms the MTS of different entities to the same feature space by an online feature extractor, then uses a newly proposed semi-supervised deep one-class model for detecting anomalous entities. We evaluated MELODY on real data of cloud services with 1.2M+ time series. The relative F1 score improvement of MELODY over the state-of-the-art methods ranges from 7.6% to 56.5%. The user evaluation suggests MELODY is suitable for monitoring deployments in large online systems.

Time series, Anomaly detection, Deep learning


1. Introduction

As cloud native systems become prevalent in modern IT industry, most applications “born on cloud” are composed of a multitude of interconnected services (or microservices), each specialized in accomplishing a narrow range of tasks (Balalaie et al., 2016; Dragoni et al., 2017). This composability of the services facilitates independent deployment, rapid delivery, and flexible expansion of applications in many cloud architectures (Wang et al., 2018). It is particularly useful in large cloud platforms such as Amazon AWS, Google Cloud, and Microsoft Azure due to their culture of team-level service ownership.

Figure 1. An illustration of (a) point-level anomaly detection, and (b) entity-level anomaly detection.

As the code implementing these services is regularly updated, whether to add new functionality or improve performance, faults may be introduced. Faulty changes can degrade the target service’s performance or cause externally-facing outages, which directly impact the customer’s experience and the company’s reputation. To prevent faulty changes, code deployments should be monitored comprehensively for rapid detection of anomalous behaviors so that faulty deployments can be stopped in time and the software can be returned to its previous, safe status. This process is called rollback, which helps avoid cascading damages to a cloud application.

The existing approaches mainly detect anomalies in online systems from four perspectives, namely KPI-level (key performance indicator) (Liu et al., 2015; Xu et al., 2018), log-level (Du et al., 2017; Meng et al., 2019), trace-level (Gan et al., 2019; Nedelkoski et al., 2019), and entity-level (Su et al., 2019; Huang et al., 2022). In this paper, we focus on anomaly detection at the entity level, where an entity could be a cloud server, a container, or a deployment of services. For example, for a deployment of certain services at Amazon, a large number of metrics, such as CPU usage, memory usage, threads usage, etc., are continuously monitored. Each metric emits time series (TS) and all of the metrics are aggregated into multivariate time series (MTS) as illustrated in Fig. 1(b). As such, it is intuitive to resort to MTS anomaly detection approaches for entity-level anomaly detection.

Existing anomaly detection techniques on MTS are mostly designed for point-level anomaly detection, which identifies time points whose observations are anomalous relative to their contextual time points in a single entity (Schmidl et al., 2022), such as a server, as Fig. 1(a) presents. Entity-level anomaly detection is remarkably different. It aims to detect anomalous entities in a stream of entities (e.g., deployments), where each entity emits MTS, as Fig. 1(b) presents. In particular, entity-level anomaly detection poses four unique challenges.

1. Multiple Heterogeneous Entities. To detect anomalous entities, a model should be trained on the MTS data across different entities to capture the behavioral patterns of the entities. However, a model trained using the metrics in one entity is hard to apply to other entities: (1) for the same metric, the time series values are not comparable across entities. For example, the normal CPU usage of deploying a computational service is higher than that of a notification service. This leads to different MTS spaces in different entities. (2) Different entities may have different durations, thus their MTSs may have different lengths. (3) Different entities may have a different (sub)set of metrics, so their MTSs have different numbers of variates (or dimensions). As such, we seek a robust model that can be shared across heterogeneous entities with MTSs that have varying scales, lengths, and variates.

2. Low Latency Tolerance. In an online system, the constantly emerging entities form a stream, where a single entity may only live for a short time. For example, a deployment of a service at Amazon may only last for a few minutes, and the duration is unknown at the onset. Therefore, it is infeasible to train a model per entity online. This reinforces the need to share a pre-trained model across entities. It also requires the model to adapt to all the historical MTS of a new entity and to perform inference with low latency.

3. Ambiguous Anomaly Definition. Unlike the existing anomaly detection methods that detect unexpected changes from their contextual time points, the definition of anomalies in an industrial environment is more ambiguous. A change in a time series may indicate not an anomaly but the normal launch of a new deployment. Instead, an anomaly is defined as a service that cannot work normally, which necessitates supervised signals from domain experts.

4. Limited Supervision. The entity-level labels are not point-wise, i.e., they do not locate the time points when anomalies occur, but only indicate whether the entity is anomalous by the end of its duration. Because the existing (semi-)supervised methods require point-level labels, they cannot be trained to solve our problem. Moreover, even for the entity-level labels, domain experts may make mistakes. Given the high human cost of labeling, the challenge is then to build models with few, noisy entity labels.

To address these challenges, we propose a Semi-supervised Hybrid Model for Entity-Level Online Detection of anomalY (MELODY). MELODY is a system used in production, monitoring several million deployments every month. MELODY consists of two major components, namely an online feature extractor (OFE) and a semi-supervised anomaly detection (SemiAD) module. OFE embeds the MTSs of different entities into the same, comparable feature space of fixed dimension, where the features are computed dynamically and incrementally for varying MTS lengths (Challenge 1). Because feature extraction adds to inference time, we design OFE with efficient initialization and updating capability (Challenge 2). Based on the extracted features, SemiAD leverages supervised signals for learning anomalous patterns (Challenge 3). SemiAD is a hybrid model with two sub-modules: a supervised ensemble model that is robust to noisy features and labels (Challenges 1 and 4), and a semi-supervised deep one-class model that can leverage the large amount of unlabeled data to complement the limited supervision (Challenge 4). We introduce two strategies to combine the outputs of the two sub-modules, namely an ensemble strategy and a sequential strategy. Our contributions can be summarized as follows.

• We investigate a new entity-level anomaly detection problem, which is motivated by real applications in cloud native systems. Its unique challenges cannot be directly addressed by existing MTS anomaly detection approaches.

• We propose MELODY, a novel robust semi-supervised framework for online anomaly detection in streaming entities. It resolves the ambiguous definition of anomalies via limited labels, and leverages the vast amount of unlabeled data to enhance its robustness and performance.

• We evaluate MELODY using the data of 30K+ deployments with 1.2M+ time series from Amazon, and compare it with the state-of-the-art (SOTA) approaches. The results demonstrate that MELODY significantly outperforms the baseline methods, with up to 56.5% relative improvement in F1 score.

• We deploy MELODY as a core component of an AutoRollback system on the deployments of services at Amazon, and evaluate the customer experience of the enhanced system.

2. Preliminary

In this work, we consider real-time monitoring of deployment entities. A deployment is a process of updating software packages, configuration, environment variables, etc. Each deployment affects one or more services, each service has multiple metrics, and each metric has a univariate time series. Therefore, each deployment has multiple univariate time series, each of which has a unique label (service, metric). The set of possible metrics is fixed, but there could be an unlimited number of possible services. Moreover, a service may be monitored with only a subset of the metrics. Therefore, each deployment is associated with a variable number of univariate time series. Fig. 1(b) illustrates the time series of three deployments.

2.1. Problem Statement

Suppose the collection of entities is $\mathcal{X}=\{\mathcal{X}_{i}\}_{i=1}^{N}$. Each entity (e.g., deployment) $\mathcal{X}_{i}$ has a historical multivariate time series ${\textbf{X}}_{i}=[{\textbf{x}}_{i,1},...,{\textbf{x}}_{i,n_{i}}]$, where ${\textbf{x}}_{i,j}=[x_{i,j}^{1},...,x_{i,j}^{T}]\in\mathbb{R}^{T}$ ($1\leq j\leq n_{i}$) is the $j$-th univariate time series associated with a unique label ($\texttt{service}_{j}$, $\texttt{metrics}_{j}$) in an observation window of size $T$. Due to the variability of services and the different subsets of metrics, the number $n_{i}$ could be different for different $\mathcal{X}_{i}$. In this paper, we also use ${\textbf{x}}_{i}^{t}=[x_{i,1}^{t},...,x_{i,n_{i}}^{t}]$ to denote an observation of the $n_{i}$ variates of $\mathcal{X}_{i}$ at time step $t$, and use ${\textbf{X}}_{i}^{t-w:t}=[{\textbf{x}}_{i,1}^{t-w:t},...,{\textbf{x}}_{i,n_{i}}^{t-w:t}]$ to denote a sequence of observations from time $t-w$ to $t$.

To resolve the ambiguous anomaly definition, partial labels are available. Formally, let $\mathcal{X}=\{\mathcal{X}^{u},\mathcal{X}^{l}\}$ , $\mathcal{X}^{u}$ be the subset of unlabeled entities, and $\mathcal{X}^{l}$ be the labeled subset with binary labels ${\textbf{y}}\in\{0,1\}^{N_{l}}$ , where $N_{l}=|\mathcal{X}^{l}|$ and $\mathcal{X}^{u}\cap\mathcal{X}^{l}=\varnothing$ . Label ${\textbf{y}}_{i}=1$ indicates the $i$ -th entity in $\mathcal{X}^{l}$ is anomalous; ${\textbf{y}}_{i}=0$ otherwise.

The entity-level anomaly detection problem is to train a model $f:\mathcal{X}\rightarrow\mathbb{R}$, such that given a new entity $\mathcal{X}_{i}$ with its historical MTS ${\textbf{X}}_{i}\in\mathbb{R}^{T\times n_{i}}$, the output $f({\textbf{x}}_{i}^{t}|{\textbf{X}}_{i})$ represents the anomaly score of the observation ${\textbf{x}}_{i}^{t}\in\mathbb{R}^{n_{i}}$ at time step $t$ ($t>T$).

Remark. It is noteworthy that although the model $f$ should check for anomalies at new time points for timely detection at inference time, this problem is different from point-level anomaly detection because (1) the model $f$ is shared across entities with different MTSs; and (2) at training time, the label y only marks anomalous entities, without marking the anomalous time points that are required by the existing (semi-)supervised methods (Jiang et al., 2021; Schmidl et al., 2022; Huang et al., 2022; Chen et al., 2023).

3. Proposed System Overview

Fig. 2(a) illustrates the system architecture for running our MELODY model for detecting anomalous deployments of services.

3.1. The System Architecture

The offline system in Fig. 2(a, left) consists of three key components: (1) the data labeling service for importing data from products to the experimental environment, (2) the machine learning platform for developing and training the MELODY model, and (3) the model deployment service for deploying the OFE and SemiAD artifacts of the MELODY model to the online system.

In particular, the labeling service imports the MTS and meta-data (e.g., config profiles) of each deployment to the datalake. It provides two ways of labeling the deployments. The first is a labeling UI which visualizes the MTS of the deployment to be labeled. The labelers (human domain experts) use the UI to assign a score on a scale of -1 to 3 to deployments in $\mathcal{X}^{l}$: -1 means the labeler is unsure, 0 (or 1) indicates a normal (or likely normal) deployment, and 3 (or 2) indicates an abnormal (or likely abnormal) deployment.

Because human labeling is costly, the second way is to automatically label normal deployments, which is referred to as Bot in Fig. 2(a). Bot applies a set of expert-defined rules (e.g., whether a deployment has passed some safety checks) to the unlabeled deployments in $\mathcal{X}^{u}$. The deployments that pass are automatically labeled as “normal”. These auto-labeled “normal” deployments may include noise (i.e., anomalies), but the amount should be small because anomalies are usually rare, as is the case in most anomaly detection tasks (Schmidl et al., 2022; Han et al., 2022). This auto-labeled normal set is valuable for unsupervised or semi-supervised methods (e.g., one-class models) to learn normal patterns.

Fig. 2(a, right) is the online system. Once users launch a new deployment, the historical MTS preceding the deployment are loaded into a data cache for initializing the OFE module (as detailed in Sec. 4.1). Then, the data cache sends the most recent $w$ time steps of the real-time observations to the OFE service, where $w$ is a window size. The OFE service transforms the data to features and sends them to the anomaly detection service, which calculates the anomaly score at the current time step. The rollback decision service uses this score, together with a threshold and some safety checking rules, to determine whether to roll back the deployment. If the deployment is rolled back, the deployment troubleshooting service displays the time-wise anomaly scores and the details of relevant metrics to the users through a web interface for debugging.

In summary, the architecture enables continuous labeling of the data and a scheduled training cycle for automatic model updates with new data. Next, we focus on the design of the MELODY framework (the blue dashed box in Fig. 2(a)).

4. The Proposed MELODY framework

Fig. 2(b) illustrates the MELODY framework for online anomaly detection. It consists of OFE and SemiAD modules. The OFE transforms different MTSs of different entities to a comparable feature space. The SemiAD is a hybrid model that consumes the features for generating an anomaly score with a combining strategy.

4.1. Online Feature Extraction (OFE)

As described in Challenge 1 (Sec. 1), the MTS of heterogeneous entities pose three issues that prevent a model from being trained across entities. The OFE aims to address them.

4.1.1. Featurizer

First, to address the non-comparable time series values of different entities, we transform the raw values of individual univariate time series to their anomalous degrees. Specifically, for the $i$ -th deployment $\mathcal{X}_{i}$ , its $j$ -th variate’s observation at time step $t$ , i.e., $x_{i,j}^{t}$ ( $1\leq j\leq n_{i}$ ), is transformed to a score $s_{i,j}^{t}$ representing the deviation of $x_{i,j}^{t}$ from the normal history ${\textbf{x}}_{i,j}=[x_{i,j}^{1},...,x_{i,j}^{T}]$ for any $t>T$ . The more $x_{i,j}^{t}$ deviates from ${\textbf{x}}_{i,j}$ , i.e., the higher the score $s_{i,j}^{t}$ is, the more anomalous $x_{i,j}^{t}$ is. It is noteworthy that $s_{i,j}^{t}$ defines the change of $x_{i,j}^{t}$ relative to its history ${\textbf{x}}_{i,j}$ for any univariate time series consistently, thus it is a comparable score across entities.

As illustrated in Fig. 2(b), the OFE has three Featurizers to generate the score $s_{i,j}^{t}$. Each Featurizer is a general class. Given a concrete deployment $\mathcal{X}_{i}$, each Featurizer first initializes a set of instances for some (or all) of the variates, one instance per variate, using that variate's history; e.g., an instance is initialized using the history ${\textbf{x}}_{i,j}$ for variate $j$.

Then the $j$ -th instance is used to transform $x_{i,j}^{t}$ to $s_{i,j}^{t}$ in an online manner. Because of the Low Latency Tolerance (Challenge 2 in Sec. 1), we design three Featurizers that initialize instances efficiently:

Rule-based Featurizer. We integrate the prior knowledge of domain experts into our model by three types of rule-based features: (1) statistics based features (SbF), (2) threshold based features (TbF), and (3) count based features (CbF).

At the initialization stage, SbF learns the mean $\mu$ and standard deviation $\sigma$ of all $T$ values in the history ${\textbf{x}}_{i,j}$ , and defines a threshold $\tau=\mu+\alpha*\sigma$ , where $\alpha$ is a multiplier. At the inference stage, it sets $s_{i,j}^{t}=1$ at time step $t$ if it observes $w_{s}$ continuous values $x_{i,j}^{t-w_{s}+1}$ , …, $x_{i,j}^{t}$ above $\tau$ ; and sets $s_{i,j}^{t}=0$ otherwise. By defining different $\alpha$ and $w_{s}$ for the metrics of interest, SbF has a total of 11 instances.

TbF is similar to SbF except that TbF uses a predefined threshold $\tau$ directly without resorting to $\mu$ and $\sigma$ . By defining $\tau$ and $w_{s}$ , TbF has 7 possible instances. CbF is used to count the number of continuous missing observations, which can indicate anomalies on certain metrics. CbF’s initialization sets up a sliding window of size $w_{c}$ . At the inference stage, CbF sets $s_{i,j}^{t}=1$ if the window ending at time step $t$ is full of missing observations; and sets $s_{i,j}^{t}=0$ otherwise. CbF has 1 possible instance.

It is noteworthy that the SbF, TbF, and CbF Featurizers all have efficient initialization. In total, the rule-based featurizer has 19 possible instances (11 SbF, 7 TbF, and 1 CbF) for the metrics that the rules monitor.
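The three rule-based featurizers can be sketched as small stateful classes, each with a cheap initialization and a constant-time update. This is an illustrative sketch rather than the production implementation: the class and method names are hypothetical, "above $\tau$" is read as a strict inequality, the population standard deviation is assumed for $\sigma$, and a missing observation is represented by `None`.

```python
import math


class StatisticsBasedFeaturizer:
    """SbF sketch: emit 1 after w_s consecutive values above mu + alpha * sigma."""

    def __init__(self, history, alpha, w_s):
        n = len(history)
        mu = sum(history) / n
        # Population standard deviation (assumption: the variant is not specified).
        sigma = math.sqrt(sum((x - mu) ** 2 for x in history) / n)
        self.tau = mu + alpha * sigma
        self.w_s = w_s
        self.run = 0  # length of the current run of values above tau

    def update(self, x):
        # Extend the run if x exceeds tau, otherwise reset it.
        self.run = self.run + 1 if x > self.tau else 0
        return 1 if self.run >= self.w_s else 0


class ThresholdBasedFeaturizer(StatisticsBasedFeaturizer):
    """TbF sketch: same run logic, but tau is predefined instead of learned."""

    def __init__(self, tau, w_s):
        self.tau = tau
        self.w_s = w_s
        self.run = 0


class CountBasedFeaturizer:
    """CbF sketch: emit 1 when the last w_c observations are all missing."""

    def __init__(self, w_c):
        self.w_c = w_c
        self.missing_run = 0

    def update(self, x):
        # None stands in for a missing observation (assumption).
        self.missing_run = self.missing_run + 1 if x is None else 0
        return 1 if self.missing_run >= self.w_c else 0
```

Tracking the length of the current run, rather than storing the last $w_{s}$ values, gives the same output as checking the whole window and keeps each update at constant cost.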

Algorithm-based Featurizer. We also integrate two efficient point-level anomaly detection algorithms on univariate time series for scoring $s_{i,j}^{t}$ : subsequence-based nearest neighbor (SubNN) (Schmidl et al., 2022) and median forecast (MD) (Basu and Meckesheimer, 2007).

SubNN is a distance-based scorer. At the initialization stage, SubNN segments the history ${\textbf{x}}_{i,j}$ into subsequences with window size $w_{n}$ and stride 1, forming a set $\mathcal{S}$ of length-$w_{n}$ subsequences, and sets up a sliding window of size $w_{n}$ for the inference stage. At the inference stage, SubNN sets $s_{i,j}^{t}$ as the distance between the sliding window ending at time step $t$ and its nearest neighbor in $\mathcal{S}$. MD is an efficient forecasting approach. At the initialization stage, it only sets up a sliding window of size $w_{m}$. At the inference stage, based on the sliding window at $t-1$, it calculates $m_{\text{obs}}=\text{median}([x_{i,j}^{t-w_{m}},...,x_{i,j}^{t-1}])$ and $m_{\text{dif}}=\text{median}([x_{i,j}^{t-w_{m}+1}-x_{i,j}^{t-w_{m}},...,x_{i,j}^{t-1}-x_{i,j}^{t-2}])$, and forecasts the subsequent value $\hat{x}_{i,j}^{t}=m_{\text{obs}}+(w_{m}/2)*m_{\text{dif}}$. We adapted it for scoring anomalies by setting $s_{i,j}^{t}$ as the deviation of $x_{i,j}^{t}$ from $\hat{x}_{i,j}^{t}$, i.e., $s_{i,j}^{t}=|x_{i,j}^{t}-\hat{x}_{i,j}^{t}|$.

The algorithm-based featurizers are applicable to any metric. There are 22 possible metrics in total, thus the SubNN and MD featurizers have 44 instances.
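The following sketch illustrates how the SubNN and MD scorers could be realized for a single variate. It rests on assumptions that the text does not fix: the class names are hypothetical, Euclidean distance is used for SubNN, and both sliding windows are seeded with the tail of the history so that a score is available from the first new observation (the history must therefore contain at least one full window).

```python
import math
from collections import deque
from statistics import median


class SubNNFeaturizer:
    """Distance from the latest window to its nearest historical subsequence."""

    def __init__(self, history, w_n):
        # All length-w_n subsequences of the history with stride 1 (the set S).
        self.subsequences = [history[k:k + w_n]
                             for k in range(len(history) - w_n + 1)]
        # Seeding the window from the history is an assumption.
        self.window = deque(history[-w_n:], maxlen=w_n)

    def update(self, x):
        self.window.append(x)  # window now ends at the current time step
        current = list(self.window)
        # Euclidean distance is an assumption; the text only says "distance".
        return min(math.dist(current, s) for s in self.subsequences)


class MedianForecastFeaturizer:
    """Absolute deviation of x from the median-based forecast."""

    def __init__(self, history, w_m):
        self.w_m = w_m
        # Seeding the window from the history is an assumption.
        self.window = deque(history[-w_m:], maxlen=w_m)

    def update(self, x):
        values = list(self.window)  # the window ending at t-1
        m_obs = median(values)
        diffs = [b - a for a, b in zip(values, values[1:])]
        m_dif = median(diffs) if diffs else 0.0
        forecast = m_obs + (self.w_m / 2) * m_dif
        self.window.append(x)  # slide the window after scoring
        return abs(x - forecast)
```

SubNN compares the new window against every stored subsequence, so its per-step cost grows with the history length $T$, whereas MD depends only on $w_{m}$.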

Meta-Data Featurizer. For entity-level anomaly detection, we append some static meta-data features of a deployment to the time-wise features emitted by the rule-based and algorithm-based featurizers, as illustrated by Fig. 2(b). There are 8 meta-data features pertaining to the configurations of each deployment, which are useful because anomalous patterns may vary with configurations.

4.1.2. Time Pooling Layer

The second issue in Challenge 1 (Sec. 1) is the variable duration of different deployments. Using any instance of the rule-based and algorithm-based featurizers, we can obtain a feature $s_{i,j}^{t}$ at time step $t$. However, as described in Sec. 2, the labels y of the training deployments in $\mathcal{X}^{l}$ are at the entity level, where ${\textbf{y}}_{i}=1$ only indicates that the deployment $\mathcal{X}_{i}$ became anomalous before it ended, without marking any anomalous time points. Thus we cannot train a model on the feature $s_{i,j}^{t}$ at a specific time point $t$ using ${\textbf{y}}_{i}$. To address this, we seek an entity-level method that (1) can be updated efficiently to track dynamic features, and (2) is invariant to the variable duration of deployments.

To this end, we designed a time pooling layer with two pooling methods up to the current time point $t$ :

(1)			$\displaystyle\hat{s}_{i,j}^{t}=\text{MaxPool}([s_{i,j}^{1},...,s_{i,j}^{t}]),\quad\bar{s}_{i,j}^{t}=\text{MeanPool}([s_{i,j}^{1},...,s_{i,j}^{t}])$

where the $\hat{s}_{i,j}^{t}$ records the most anomalous status up to time point $t$ , and $\bar{s}_{i,j}^{t}$ records the accumulative anomalous status.

The pooling methods are invariant to variable sequence lengths. For model training, Eq. (1) generates a feature per training deployment by pooling up to the end of its length. Also, because both MaxPool and MeanPool can be updated incrementally in constant time, we can use Eq. (1) to dynamically track salient features of new deployments up to any time point for online inference.
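The constant-time update mentioned above can be illustrated with a running maximum and a running mean. This is a minimal sketch; the class name `TimePooling` is hypothetical and not taken from the system.

```python
class TimePooling:
    """Running MaxPool and MeanPool over a stream of scores s^1, ..., s^t."""

    def __init__(self):
        self.count = 0
        self.max_score = float("-inf")
        self.mean_score = 0.0

    def update(self, s):
        """Fold in the score at the next time step; O(1) time and memory."""
        self.count += 1
        self.max_score = max(self.max_score, s)
        # Incremental mean: mean_t = mean_{t-1} + (s_t - mean_{t-1}) / t
        self.mean_score += (s - self.mean_score) / self.count
        return self.max_score, self.mean_score
```

Because only the count, the maximum, and the mean are stored, the same object serves both training (pool to the end of a deployment) and online inference (read the pair at any intermediate time step).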

4.1.3. Feature Aggregator

So far, using any of the featurizer instances, we can generate features $[\hat{s}_{i,j}^{t},\bar{s}_{i,j}^{t}]$ for each univariate time series of $\mathcal{X}_{i}$ , which has a unique label ( $\texttt{service}_{j}$ , $\texttt{metrics}_{j}$ ) (Sec. 2.1). The third issue in Challenge 1 (Sec. 1) implies the number of variates $n_{i}$ is different for different $\mathcal{X}_{i}$ , leading to different feature dimensions for different $\mathcal{X}_{i}$ . There are two reasons for the different $n_{i}$ : (1) the deployments can have different subsets of metrics, and (2) multiple services may be monitored for the same metric, generating multiple univariate time series on the same metric.

To address this challenge and embed different deployments in the same feature space, we propose a feature aggregator. After we obtain the feature $[\hat{s}_{i,j}^{t},\bar{s}_{i,j}^{t}]$ from Eq. (1) for each univariate time series, we aggregate the features over different services for the same metric. Suppose $m_{k}$ is the $k$-th metric. Taking $\hat{s}_{i,j}^{t}$ as an example, we perform an aggregation for $m_{k}$

(2)			$\displaystyle\hat{z}_{i,k}^{t}=\text{MaxPool}(\{\hat{s}_{i,j}^{t}\mid\texttt{metrics}_{j}=m_{k},1\leq j\leq n_{i}\})$

where MaxPool is used because we want to keep the most salient anomalous feature from different services.

Similarly, we can obtain $\bar{z}_{i,k}^{t}$ from $\bar{s}_{i,j}^{t}$, and this step addresses issue (2). To address issue (1), if a deployment lacks a specific metric $m_{k}$ so that Eq. (2) cannot be applied for $m_{k}$, we perform mean-based imputation on $\hat{z}_{i,k}^{t}$ using the training deployments in $\mathcal{X}$. Thus we align all deployments to the same set of 134 features (63 $\hat{z}_{i,k}^{t}$ and 63 $\bar{z}_{i,k}^{t}$ from the 63 featurizer instances, plus the 8 meta-data features in Sec. 4.1.1). Then each deployment $\mathcal{X}_{i}$ is represented by a vector ${\textbf{z}}_{i}\in\mathbb{R}^{d}$ of fixed dimension $d=134$.
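A simplified sketch of the aggregation and imputation steps is given below. It assumes one pooled score per (service, metric) pair, whereas the full system keeps one feature per featurizer instance; the function name, the dictionary layout, and the example metric names are illustrative assumptions.

```python
def aggregate_by_metric(pooled, metric_order, train_means):
    """Map one deployment's pooled scores to a fixed-length feature vector.

    pooled:       dict {(service, metric): pooled score} for the deployment
    metric_order: list fixing the position of each metric in the output
    train_means:  dict {metric: mean aggregated score over training deployments}
    """
    per_metric = {}
    for (_service, metric), score in pooled.items():
        # MaxPool over the services that report the same metric, as in Eq. (2).
        previous = per_metric.get(metric, float("-inf"))
        per_metric[metric] = max(previous, score)
    # Mean-based imputation for metrics that the deployment does not report.
    return [per_metric.get(m, train_means[m]) for m in metric_order]


features = aggregate_by_metric(
    pooled={("svcA", "cpu"): 0.2, ("svcB", "cpu"): 0.9, ("svcA", "mem"): 0.1},
    metric_order=["cpu", "mem", "threads"],
    train_means={"cpu": 0.3, "mem": 0.2, "threads": 0.05},
)
```

In this example the two `cpu` series collapse to their maximum, and the unreported `threads` metric falls back to its training mean, so every deployment yields a vector of the same length.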

To address noisy imputation and model the correlation of variates in MTS, next, we propose a semi-supervised model on $\{{\textbf{z}}_{i}\}_{i=1}^{N}$ for learning meaningful embeddings using supervision signals.

4.2. Semi-Supervised Anomaly Detection

Taking features ${\textbf{z}}_{i}\in\mathbb{R}^{d}$ , SemiAD aims to train a detector $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , and uses it to emit the final anomaly score $\hat{y}_{i}$ for every deployment $\mathcal{X}_{i}$ . To be robust to the imputed values in ${\textbf{z}}_{i}$ , and to address Challenge 3 (ambiguous anomaly definition) and 4 (limited supervision) in Sec. 1, we propose a hybrid model comprising a semi-supervised one-class model and a supervised ensemble model.

4.2.1. Semi-Supervised Deep One-Class Model

A one-class model is an unsupervised anomaly detector that is trained on samples of a single, typically normal, class. It is used to predict whether a test sample belongs to this class or not. Prominent examples are kernel-based methods, such as One-Class SVM (OC-SVM) (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax and Duin, 2004), which aim to maximize the margin between normal samples and others. For example, SVDD aims to find the smallest hypersphere with a center vector c and a radius $R>0$ that encloses the majority of the (normal) samples in the feature space:

(3)			$\displaystyle\min_{{\textbf{c}},R,\boldsymbol{\xi}}~~R^{2}+\frac{1}{\nu N}\sum_{i=1}^{N}\xi_{i}\quad\text{s.t.}~~\|\phi({\textbf{z}}_{i})-{\textbf{c}}\|^{2}\leq R^{2}+\xi_{i},~~\xi_{i}\geq 0,~~\forall i=1,...,N$

where $\phi({\textbf{z}}_{i})$ is a kernel function on input feature ${\textbf{z}}_{i}$ , $\xi_{i}$ is a slack variable to allow soft boundary, and $\nu\in(0,1]$ is a hyperparameter to control the trade-off between the volume of the sphere and the penalties on $\xi_{i}$ . Once c and $R$ are determined by solving the dual form of the primal problem in Eq. (3), samples that are outside the sphere, i.e., $\|\phi({\textbf{z}}_{i})-{\textbf{c}}\|^{2}>R^{2}$ , are deemed anomalies.

Recently, DeepSVDD was introduced in (Ruff et al., 2018). Compared to the kernel-based methods, it is more robust to the noises from feature engineering, and more scalable to large datasets. It replaces $\phi(\cdot)$ by a neural network $\phi(\cdot;\boldsymbol{\theta}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{e}}$ with parameter $\boldsymbol{\theta}$ , where $d_{e}$ is the dimension of the embedding space. The objective of DeepSVDD is to train $\boldsymbol{\theta}$ for learning a transformation $\phi(\cdot;\boldsymbol{\theta})$ that minimizes the volume of a hypersphere centered on a predetermined c.

(4)			$\displaystyle\min_{\boldsymbol{\theta}}\frac{1}{N}\sum_{i=1}^{N}D\big(\phi({\textbf{z}}_{i};\boldsymbol{\theta}),{\textbf{c}}\big)+\lambda\|\boldsymbol{\theta}\|_{F}^{2}$

where $D(\cdot,\cdot)$ is a distance function, such as Euclidean or Hamming distance, and $\lambda$ is the hyperparameter for weight decay.

Although DeepSVDD can learn embeddings that are less sensitive to the noises in ${\textbf{z}}_{i}$ , it can not harness the supervised signals for learning embeddings that resolve the ambiguous anomaly definition (Challenge 3 in Sec. 1). Recent works also indicated even with a small amount of labels, semi-supervised methods could outperform unsupervised methods significantly on anomaly detection (Han et al., 2022).

Therefore, we introduce our Semi-Supervised Deep One-Class Model (SemiDOC). Our goal is to use the small number of labeled anomalies to tighten the boundary of the hypersphere. To this end, we propose a negative-sampling-based model that is trained batch-wise. In each batch, we randomly sample $B$ normal samples from the labeled set $\mathcal{X}^{l}$ or the unlabeled set $\mathcal{X}^{u}$ (which is auto-labeled as normal as described in Sec. 3.1) as the queries $\{{\textbf{z}}_{i}^{q}\}_{i=1}^{B}$. For each query ${\textbf{z}}_{i}^{q}$, we sample an anomaly ${\textbf{z}}_{i}^{n}$ from $\mathcal{X}^{l}$ as its negative sample, and form a tuple $({\textbf{z}}_{i}^{q},{\textbf{z}}_{i}^{n})$. Then each batch is a set of $B$ tuples $\{({\textbf{z}}_{i}^{q},{\textbf{z}}_{i}^{n})\}_{i=1}^{B}$, and the learning objective is

(5)			$\displaystyle\min_{\boldsymbol{\theta}}\frac{1}{B}\sum_{i=1}^{B}\Big(D\big(\phi({\textbf{z}}_{i}^{q};\boldsymbol{\theta}),{\textbf{c}}\big)+\ell({\textbf{z}}_{i}^{q},{\textbf{z}}_{i}^{n};\boldsymbol{\theta})\Big)+\lambda\|\boldsymbol{\theta}\|_{F}^{2}$

where

(6)			$\displaystyle\ell({\textbf{z}}_{i}^{q},{\textbf{z}}_{i}^{n};\boldsymbol{\theta})=\max\Big(\delta-D\big(\phi({\textbf{z}}_{i}^{q};\boldsymbol{\theta}),\phi({\textbf{z}}_{i}^{n};\boldsymbol{\theta})\big),0\Big)$

is a hinge loss to maximize the distance between the embeddings of ${\textbf{z}}_{i}^{q}$ and ${\textbf{z}}_{i}^{n}$ , and $\delta$ is a threshold to avoid arbitrarily large distance values in the loss function, for training stability.
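To make the objective in Eqs. (5) and (6) concrete, the sketch below evaluates the batch loss on embeddings that have already been produced by $\phi(\cdot;\boldsymbol{\theta})$. It is only a numerical illustration under stated assumptions: $D$ is taken to be the Euclidean distance, the network and its optimizer are omitted, and the function and argument names are hypothetical.

```python
import math


def semidoc_batch_loss(query_emb, negative_emb, center, delta,
                       weight_decay=0.0, params=()):
    """Evaluate Eq. (5) on one batch of (query, negative) embedding pairs.

    query_emb / negative_emb: lists of embedding vectors phi(z^q), phi(z^n)
    center:                   the predetermined hypersphere center c
    delta:                    hinge margin from Eq. (6)
    params:                   flat iterable of network weights for the
                              weight-decay term (empty if not needed)
    """
    total = 0.0
    for q, n in zip(query_emb, negative_emb):
        compactness = math.dist(q, center)         # D(phi(z^q), c)
        hinge = max(delta - math.dist(q, n), 0.0)  # Eq. (6)
        total += compactness + hinge
    regularizer = weight_decay * sum(p * p for p in params)
    return total / len(query_emb) + regularizer


loss = semidoc_batch_loss(
    query_emb=[[0.0, 0.0], [3.0, 4.0]],
    negative_emb=[[0.0, 1.0], [3.0, 14.0]],
    center=[0.0, 0.0],
    delta=2.0,
)
```

In this example the first pair contributes only a hinge penalty, because its negative lies within $\delta$ of the query, while the second pair contributes only the compactness term, because its negative is already farther than $\delta$ away.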

After training the model, SemiDOC uses the following function to infer anomaly score of a new sample ${\textbf{z}}_{\text{new}}$ .

(7)			$\displaystyle\text{AnomalyScore}({\textbf{z}}_{\text{new}};\boldsymbol{\theta})=\text{Clip}\bigg(\frac{D\big(\phi({\textbf{z}}_{\text{new}};\boldsymbol{\theta}),{\textbf{c}}\big)}{R},0,1\bigg)$

where $R=\max_{{\textbf{z}}_{i}\in\mathcal{Z}_{\text{normal}}}D\big(\phi({\textbf{z}}_{i};\boldsymbol{\theta}),{\textbf{c}}\big)$ is the maximal radius of the normal embeddings in the set $\mathcal{Z}_{\text{normal}}$ of the training set, i.e., the radius of the learned hypersphere, and the Clip function is used to prevent extreme values from dominating the anomaly scores.
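The scoring rule in Eq. (7) can be sketched as follows, again operating on precomputed embeddings and assuming the Euclidean distance for $D$. The function names are hypothetical.

```python
import math


def fit_radius(normal_embeddings, center):
    """R: the largest distance from the center among normal training embeddings."""
    return max(math.dist(e, center) for e in normal_embeddings)


def anomaly_score(embedding, center, radius):
    """Eq. (7): distance to the center relative to R, clipped to [0, 1]."""
    ratio = math.dist(embedding, center) / radius
    return min(max(ratio, 0.0), 1.0)


center = [0.0, 0.0]
radius = fit_radius([[1.0, 0.0], [0.0, 2.0]], center)
inside = anomaly_score([0.0, 1.0], center, radius)
outside = anomaly_score([3.0, 4.0], center, radius)
```

One consequence of the clipping is that every embedding on or beyond the learned hypersphere receives the same maximal score of 1, so the score only ranks samples that fall inside the sphere.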

Remark. Our method is different from DeepSAD (Ruff et al., 2019), which only regularizes the distance between the labeled samples and the center c, without explicit manipulation on the boundary of normal data. In contrast, the proposed SemiDOC uses negative sampling in Eq. (6) to explicitly tighten the boundary of normal embeddings, facilitating the detection of hard anomalies that are close to the boundary. We empirically demonstrate the superiority of SemiDOC in Sec. 5.4.3.

4.2.2. Supervised Anomaly Detector

Given the intricate anomalous patterns and the potential presence of noise in the unlabeled set $\mathcal{X}^{u}$, a semi-supervised model may be biased away from learning an accurate boundary. To address this, we add a robust supervised model that provides another view of the class boundary, and combine it with SemiDOC in an ensemble in Sec. 4.2.3. Our empirical findings in Sec. 5 confirm the superiority of such a hybrid model.

We employ LightGBM (Ke et al., 2017) as the supervised detector. LightGBM is a boosting tree-based ensemble method that has high accuracy and efficiency. As a binary classifier, it has been demonstrated to be useful in anomaly detection tasks (Vargaftik et al., 2021; Han et al., 2022), and has been deployed in many production pipelines of fraud prevention systems. Similar methods such as XGBoost (Chen and Guestrin, 2016) and CatBoost (Prokhorenkova et al., 2018) are also compatible with our framework. We selected LightGBM for its better efficiency and superior accuracy. By feeding a feature ${\textbf{z}}_{i}$ to LightGBM, we use the predicted probability of the anomalous class as the anomaly score.

4.2.3. Hybrid Model

Ensembling has been found to be a powerful regularization technique for performance improvement (Oreshkin et al., 2019). We ensemble SemiDOC and LightGBM to form a hybrid model for anomaly detection. The core property of an ensemble is diversity. Thus we perform a bagging procedure (Breiman, 1996) by including models (SemiDOC or LightGBM) trained with different random initializations.

As for the ensemble aggregation function, the most widely used approach is to take the mean of the anomaly scores from the different models in the hybrid. However, because SemiDOC is a one-class model, a high score from Eq. (7) does not necessarily indicate an anomaly; it only indicates that an unknown entity differs from the normal majority of the training set. Thus taking the mean of the scores of SemiDOC and LightGBM may generate a high score for unknown but normal entities, leading to more false positives.

To alleviate this, we propose a sequential approach with two steps. First, SemiDOC is used to filter out normal entities, i.e., those with scores lower than a threshold. Second, the entities about which SemiDOC is less confident (i.e., those with high scores) are sent to LightGBM for anomaly detection. If there are multiple SemiDOCs (or LightGBMs) in the hybrid, the mean score of the SemiDOCs (or LightGBMs) is used in the corresponding step.
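
The difference between the two aggregation schemes can be sketched as follows. This is a minimal illustration, not the production implementation: the function names, the `filter_threshold` parameter, and the convention of assigning a score of 0.0 to entities filtered out by SemiDOC are assumptions made for the example.

```python
from statistics import mean


def mean_aggregate(semidoc_scores, lgbm_scores):
    """Mean-based scheme: average the scores of all models for one entity."""
    return mean(list(semidoc_scores) + list(lgbm_scores))


def sequential_aggregate(semidoc_scores, lgbm_scores, filter_threshold):
    """Sequential scheme: SemiDOC filters first, LightGBM scores the rest.

    Returns (filtered_as_normal, final_score).
    """
    semidoc_mean = mean(semidoc_scores)
    if semidoc_mean < filter_threshold:
        # Assumption: a filtered entity simply receives the lowest score.
        return True, 0.0
    # SemiDOC is not confident the entity is normal; defer to LightGBM.
    return False, mean(lgbm_scores)


# An unknown-but-normal entity: far from the training majority (high SemiDOC
# scores) but judged unlikely to be anomalous by LightGBM (toy values).
semidoc = [1.5, 2.0, 1.0]
lgbm = [0.1, 0.2, 0.3]
score_mean = mean_aggregate(semidoc, lgbm)
filtered, score_seq = sequential_aggregate(semidoc, lgbm, filter_threshold=0.5)
```

In this toy case the mean-based score is pulled up by the SemiDOC scores, whereas the sequential score equals the LightGBM mean, which is the behavior the sequential design aims for.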

In our experiments, we evaluated both the mean-based and the sequential models, which are named MELODY-M and MELODY-S, respectively.

4.3. Time Complexity Analysis

We analyzed the time complexity of OFE and SemiAD in detail in Appendix A.1. In summary, the time complexity of MELODY for anomaly inference is approximately $O(T)$ , which is efficient as $T$ can be fixed as a constant length of historical time series.

5. Experiments

5.1. Datasets

Table 1. Statistics of the datasets.

Dataset	# entities	# time series	# anomalies
Hard-Labeled set	4,966	234,508	288 (5.8%)
Soft-Labeled set	4,966	234,508	544 (11.0%)
Naive-Labeled set	4,688	220,320	544 (11.6%)
Unlabeled set	27,590	1,034,711	NA

Table 2. The performance on anomaly detection of the compared methods. $\uparrow$ means higher is better. $\downarrow$ means lower is better. The best and second-best results in the F1 column (i.e., the overall performance metric) are in bold and underlined, respectively.

Method		Hard Labels				Soft Labels				Naive Labels
Method		F1 $\uparrow$	Prec. $\uparrow$	Recall $\uparrow$	FPR $\downarrow$	F1 $\uparrow$	Prec. $\uparrow$	Recall $\uparrow$	FPR $\downarrow$	F1 $\uparrow$	Prec. $\uparrow$	Recall $\uparrow$	FPR $\downarrow$
MTS	OmniAnomaly	0.206	0.126	0.573	0.241	0.294	0.193	0.622	0.317	0.311	0.209	0.608	0.296
MTS	AnomalyTrans	0.217	0.132	0.616	0.246	0.298	0.210	0.514	0.236	0.316	0.227	0.521	0.229
	TranAD	0.207	0.126	0.573	0.240	0.297	0.194	0.629	0.318	0.309	0.209	0.594	0.289
General	OFE+DeepSVDD	0.188	0.168	0.357	0.174	0.286	0.203	0.666	0.422	0.357	0.274	0.533	0.190
	OFE+DeepSVDD-B	0.221	0.176	0.348	0.107	0.316	0.214	0.717	0.379	0.332	0.231	0.631	0.283
	OFE+DeepSAD	0.337	0.244	0.555	0.105	0.422	0.377	0.497	0.104	0.439	0.398	0.494	0.096
	OFE+DeepSAD-B	0.355	0.274	0.508	0.082	0.437	0.399	0.485	0.089	0.449	0.390	0.532	0.107
	OFE+RF	0.382	0.300	0.535	0.077	0.455	0.397	0.533	0.098	0.466	0.475	0.458	0.065
	OFE+RF-B	0.362	0.279	0.521	0.082	0.456	0.395	0.540	0.101	0.470	0.411	0.552	0.103
	OFE+LGBM	0.397	0.331	0.502	0.062	0.491	0.442	0.553	0.085	0.518	0.481	0.564	0.078
	OFE+LGBM-B	0.399	0.325	0.522	0.066	0.490	0.434	0.563	0.091	0.515	0.466	0.576	0.086
Ours	MELODY-M	0.411	0.333	0.540	0.066	0.499	0.436	0.584	0.092	0.514	0.464	0.577	0.086
Ours	MELODY-S	0.432	0.393	0.485	0.045	0.493	0.452	0.546	0.081	0.544	0.482	0.625	0.086

We sampled real data from the deployments of a variety of services at Amazon AWS between April 7, 2022 and Dec. 29, 2022. This dataset contains 4,966 labeled deployments in $\mathcal{X}^{l}$ and 27,590 unlabeled deployments in $\mathcal{X}^{u}$ . The deployments in $\mathcal{X}^{l}$ were labeled by the labeling service as described in Sec. 3.1. Because one deployment may be scored by multiple human judges on a scale of -1 to 3, we used three approaches to aggregate and binarize the labels: (1) Hard Labels: ${\textbf{y}}_{i}=1$ if all labelers scored $\mathcal{X}_{i}$ as 3; ${\textbf{y}}_{i}=0$ otherwise, (2) Soft Labels: ${\textbf{y}}_{i}=1$ if the scores of $\mathcal{X}_{i}$ are either 2 or 3; ${\textbf{y}}_{i}=0$ otherwise, (3) Naive Labels: the same as Soft Labels except that ${\textbf{y}}_{i}=0$ if the scores of $\mathcal{X}_{i}$ are either 0 or 1 (scores of -1 were excluded). For each deployment, there are 22 monitored metrics, such as Threads, CPU usage, and Memory usage, and 8 meta-data features, such as the number of services and the number of hosts. For each metric, we used the 2 days of observations prior to the launch of deployment $\mathcal{X}_{i}$ as its history ${\textbf{x}}_{i,j}\in\mathbb{R}^{T}$ for the OFE module. Because the observations were collected every minute, $T=2880$ . The durations of deployments after launch vary, and the average duration is 16.1 minutes. Table 1 summarizes the statistics of the dataset.
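
The three binarization rules and the history length can be sketched as below. This is one possible reading rather than the exact labeling code: it assumes every labeled deployment has at least one score, that "the scores are either 2 or 3" means all scores fall in that set, and that a deployment with any score of -1 is dropped from the Naive-Labeled set. The function names are hypothetical.

```python
def hard_label(scores):
    """Hard Labels: anomalous only if every labeler gave a score of 3."""
    return 1 if all(s == 3 for s in scores) else 0


def soft_label(scores):
    """Soft Labels: anomalous if every score is 2 or 3 (assumed reading)."""
    return 1 if all(s in (2, 3) for s in scores) else 0


def naive_label(scores):
    """Naive Labels: like Soft Labels, but deployments with a -1 score are
    excluded (returned as None) instead of being labeled normal (assumed)."""
    if any(s == -1 for s in scores):
        return None
    return soft_label(scores)


# History length: 2 days of minute-level observations.
MINUTES_PER_DAY = 24 * 60
T = 2 * MINUTES_PER_DAY

examples = {
    "all_three": [3, 3],
    "mixed_high": [3, 2],
    "low": [0, 1],
    "has_minus_one": [-1, 2],
}
labels = {
    name: (hard_label(s), soft_label(s), naive_label(s))
    for name, s in examples.items()
}
```

The exclusion rule is consistent with the Naive-Labeled set in Table 1 containing fewer entities than the other two labeled sets.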

5.2. Experimental Setup

5.2.1. Baselines

We compare MELODY with a variety of anomaly detection (AD) methods from two groups: MTS-based methods and general AD methods. For the first group, because our entity-level AD problem does not assume the availability of point-level labels on MTS, we only evaluated unsupervised SOTA methods: (1) OmniAnomaly (Su et al., 2019), (2) AnomalyTransformer (Xu et al., 2021), and (3) TranAD (Tuli et al., 2022). We applied these methods to the MTS of each deployment individually. Following (Xu et al., 2021), once an anomalous time point is detected, the deployment is considered anomalous. For the second group, we included an unsupervised method, (4) DeepSVDD (Ruff et al., 2018); a semi-supervised method, (5) DeepSAD (Ruff et al., 2019); and supervised methods, (6) RandomForest (RF) (Breiman, 2001) and (7) LightGBM (Ke et al., 2017). These methods can use the entity-level labels (where applicable) when applied to the features generated by our OFE module, so they are named with the prefix “OFE+”, e.g., OFE+LGBM. For fair comparison, the bootstrap ensemble versions of the methods in the second group were included and named with the suffix “-B”, e.g., OFE+LGBM-B.

For our method, we considered two variants, MELODY-M and MELODY-S, as discussed in Sec. 4.2.3. We used the Hamming distance as $D(\cdot,\cdot)$ in Eq. (5). The hyperparameter $\delta$ , i.e., the threshold in Eq. (6), was grid searched in {1, 10, 100} using the validation set. By default, we used 3 LightGBMs and 3 SemiDOCs in our method, for a total of 6 models. For fair comparison, each ensemble baseline also used 6 models. In Sec. 5.4.1 and Sec. 5.4.3, we performed a hyperparameter study and an ablation analysis to evaluate the design choices of MELODY. Its implementation details are in Appendix A.2.

5.2.2. Evaluation

We randomly split the labeled set $\mathcal{X}^{l}$ into 60%/20%/20% train/validation/test sets 5 times, and ran each method on these 5 splits to evaluate the average performance. The unlabeled set $\mathcal{X}^{u}$ was only used for training semi-supervised methods, as it does not have valid labels for validation and testing.
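
The repeated random splitting can be sketched as follows. This is a minimal sketch using only the standard library; the function name, the fixed seed, and the choice to round the train and validation sizes down (leaving the remainder to the test set) are assumptions of the example.

```python
import random


def random_splits(n_items, n_repeats=5, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return n_repeats independent (train, val, test) index splits."""
    rng = random.Random(seed)
    n_train = int(fractions[0] * n_items)
    n_val = int(fractions[1] * n_items)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)  # a fresh permutation for every repeat
        train = idx[:n_train]
        val = idx[n_train:n_train + n_val]
        test = idx[n_train + n_val:]  # remainder goes to the test set
        splits.append((train, val, test))
    return splits


# 4,966 labeled deployments, as in Table 1.
splits = random_splits(4966)
train, val, test = splits[0]
```

Each method would then be trained on `train`, tuned on `val`, and scored on `test` for every split, with the reported numbers averaged over the 5 repeats.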

Following (Xu et al., 2021; Huang et al., 2022), we use Precision, Recall, and F1-score to evaluate the AD performance of the compared methods. Additionally, we added False Positive Rate (FPR) because in a rollback system, it is important to avoid false positives, i.e., unnecessary rollbacks, that lead to slow deployments and poor user experience. Following (Huang et al., 2022), for each method, we select the threshold with the best F1-score.
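
The evaluation protocol above (best-F1 threshold plus FPR) can be sketched in plain Python. This is an illustrative sketch: the helper names are hypothetical, candidate thresholds are taken from the observed scores, and the text does not specify here on which split the threshold is selected.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting anomalous for score >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and false positive rate (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def best_f1_threshold(scores, labels):
    """Scan every observed score as a threshold; keep the first best F1."""
    best_t, best_m = None, None
    for t in sorted(set(scores)):
        m = metrics(*confusion(scores, labels, t))
        if best_m is None or m["f1"] > best_m["f1"]:
            best_t, best_m = t, m
    return best_t, best_m


# Toy scores and labels (assumed values, for illustration only).
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
threshold, result = best_f1_threshold(scores, labels)
```

On this toy input the selected threshold catches all three anomalies at the cost of one false positive, which is exactly the kind of trade-off that the FPR column in Table 2 makes visible.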

5.3. Experimental Results

Table 2 summarizes the average results of the compared methods on the test sets, from which we make several observations. First, the methods using OFE generally outperform MTS-based methods, indicating that the unsupervised MTS-based AD methods are not well suited to entity-level anomaly detection, and that it is important to align different entities to the same feature space. Our OFE module provides a foundation for this task. Second, for the general AD methods, the ensemble versions (with “-B”) are overall better than the original methods (e.g., in terms of F1), indicating the usefulness of ensembling in entity-level anomaly detection. Third, OFE+DeepSAD outperforms OFE+DeepSVDD, indicating that a semi-supervised method is better than an unsupervised one. Fourth, both OFE+RF and OFE+LGBM outperform OFE+DeepSAD, indicating that the labeled set may be more important than the unlabeled set. Finally, both MELODY-M and MELODY-S outperform the baselines in most cases in terms of F1, e.g., MELODY-S has a 7.6% to 56.5% relative improvement on F1 on the Hard Labels dataset. This demonstrates the effective use of both labeled and unlabeled sets by the proposed methods. In particular, MELODY-S is superior in precision and FPR in most cases. This is attributed to the sequential design of its ensembling (Sec. 4.2.3), where SemiDOC can effectively filter out true normal entities, resulting in fewer false positives and higher precision while maintaining a reasonable recall.

5.4. More Details on Effectiveness

5.4.1. Parameter Analysis

There are two major hyperparameters in MELODY: the number of model instances in the hybrid and the margin threshold $\delta$ in the hinge loss in Eq. (6). In this section, we use MELODY-S to evaluate the influence of these parameters (MELODY-M behaves similarly and is thus omitted for brevity). Fig. 3 presents the change of F1 scores w.r.t. the two parameters on the three labeled sets. In Fig. 3(a), the number represents the number of either SemiDOC or LightGBM instances, e.g., 3 indicates 3 SemiDOC + 3 LightGBM. We can see that using either 3 or 5 is better than 1, indicating that MELODY-S also benefits from ensembling. Also, the performance only marginally improves or degrades once the number is larger than 3, validating our choice in Sec. 5.2.1. From Fig. 3(b), we can see that MELODY-S is not very sensitive to $\delta$ . A small margin (e.g., 0.1) is insufficient for distinguishing normal and abnormal entities, and an overly large margin (e.g., 1000) may lead to overfitting. Thus a proper choice is $\delta=100$ , which is our setup for MELODY.

5.4.2. Visualization of Embeddings

As discussed in Sec. 4.2.1, SemiDOC uses a small amount of labeled anomalies to tighten the boundary of the hypersphere of normal entities. To understand how it works, we uniformly sampled a batch of training data and took all test data for visualization (the full training set is too large to visualize). The embeddings of SemiDOC were visualized using t-SNE (Van der Maaten and Hinton, 2008) in a 2D space. Fig. 4 presents the distribution of the training and test embeddings. From Fig. 4(a), SemiDOC effectively pushed anomalies away from the boundary of normal entities. As designed by the negative sampling in Eq. (5), the anomalies do not need to form a cluster but are used to shape the boundary of the normal class. From Fig. 4(b), SemiDOC generalizes well on the test set. Although the anomalies overlap with a small number of normal entities, the majority of the normal class is distinguishable. This explains the design of MELODY-S, where SemiDOC first filters out most of the normal entities (with a few false negatives), and LightGBM classifies the remaining normal and abnormal entities.

5.4.3. Ablation Analysis

Table 3. The ablation analysis in terms of F1 score (relative change w.r.t. MELODY-S). H-Dist. is Hamming distance. E-Dist. is Euclidean distance.

Model	Hard	Soft	Naive
MELODY-S	0.432	0.493	0.544
(a) - Rule features	0.366 (-15.3%)	0.456 (-7.5%)	0.497 (-8.6%)
(b) - Alg. features	0.384 (-11.1%)	0.490 (-0.6%)	0.521 (-4.2%)
(c) - Meta features	0.406 (-6.0%)	0.485 (-1.6%)	0.525 (-3.5%)
(d) SemiOC $\rightarrow$ DSVDD	0.406 (-6.0%)	0.480 (-2.6%)	0.515 (-5.3%)
(e) SemiOC $\rightarrow$ DSAD	0.403 (-6.7%)	0.490 (-0.6%)	0.524 (-3.7%)
(f) H-Dist. $\rightarrow$ E-Dist.	0.404 (-6.5%)	0.200 (-59.4%)	0.437 (-19.7%)

In this section, we evaluate several variants of MELODY-S to validate the design choices of its OFE and SemiAD modules. Table 3 summarizes the testing results of our ablation analysis. In (a)-(c), we removed the three featurizers in OFE (Sec. 4.1.1) one at a time. In (d) and (e), we replaced SemiDOC in the SemiAD module by DeepSVDD and DeepSAD, respectively. In (f), we replaced the Hamming distance with the Euclidean distance in Eq. (5) and Eq. (6) to evaluate the choice of distance function. First, in (a)-(c), we observe that removing any of the featurizers degrades the performance of MELODY-S. Removing the Rule or Algorithmic featurizer has more impact than removing the Meta featurizer because they are more fine-grained and dynamic at the time series level. In (d) and (e), we observe that SemiDOC is better than DeepSVDD and DeepSAD for anomaly detection, validating the design of the negative sampling based regularization in Eq. (6). Finally, (f) suggests that the Hamming distance, which is commonly used in embedding retrieval (Song et al., 2018), is more suitable than the Euclidean distance in MELODY-S for entity-level anomaly detection. Therefore, the results in Table 3 justify the design choices of our method.

5.5. Human Evaluation

Table 4. Online evaluation of the proposed model.

Model	# Deployments	Precision
The Existing Model	64	29.7%
Ours	64	35.9%

To evaluate the impact of the proposed MELODY model in real applications, we ran an online A/B experiment, where the control group is a LightGBM based model running in the existing rollback system, and the treatment group is our MELODY based model. For fair comparison, both the existing model and the proposed model were configured to take 50% of the deployment traffic between Jun. 8, 2023, and Jul. 7, 2023. The user of each deployment can score a rollback using the labeling service described in Sec. 3.1. In this period, 128 scored rollbacks were collected, with 64 for each model. We binarized the scores in the same way as Naive Labels, and report the precision of each model in Table 4. Recall (and F1) cannot be computed because there is no ground truth for the total number of true positives. Because the scores were provided voluntarily, and users tend to provide feedback for wrong decisions, the precisions in Table 4 are generally lower than those in Table 2. However, the higher precision of the proposed model indicates that it agrees with the users more often than the existing model in the system, suggesting its effectiveness in online systems.

6. Related Work

To the best of our knowledge, this is the first work on online detection of anomalies in streaming entities where each entity owns its specific MTS. Several works claim entity-level anomaly detection (AD) (Su et al., 2019; Huang et al., 2022), but their methods aim to detect anomalous time points of the MTS emitted by a few entities such as cloud servers. So they are still point-level methods and cannot be applied to streaming entities due to the non-shareable, entity-specific model in (Su et al., 2019) or the need for point-level supervision in (Huang et al., 2022).

Typical AD methods include TS-based methods and general methods that can be applied to non-TS vectorial data such as images, as summarized in surveys (Schmidl et al., 2022) and (Han et al., 2022), respectively. Among them, MTS-based AD methods are most relevant, most of which were designed to be unsupervised. Traditional MTS methods apply general AD techniques such as LOF and iForest on subsequences of MTS. Recent works use RNNs or Transformers to encode MTS, and have developed reconstruction based methods (Park et al., 2018; Su et al., 2019; Xu et al., 2021), forecasting based methods (Malhotra et al., 2015; Munir et al., 2018; Hundman et al., 2018), and generative methods (Li et al., 2023). Moreover, graph-based AD methods have been proposed on MTS by encoding and tracking the change in the correlations between variates using GNN (Zhao et al., 2020; Deng and Hooi, 2021) or CNN (Zhang et al., 2019). However, these methods cannot readily take advantage of labels, if available. Several recent TS methods were proposed for semi-supervised AD (Jiang et al., 2021; Chen et al., 2023), but they assume point-level labels are available for locating anomalous time points or segments. This is not the case for entity-level AD as described in Challenge 4 in Sec. 1, where labels only mark anomalous entities. Because none of the aforementioned methods can be used to address the challenges of entity-level AD as discussed in Sec. 1, a proper solution such as the proposed MELODY is in demand.

As suggested by (Han et al., 2022), leveraging a small amount of labels for semi-supervised AD has great potential. Recently, general semi-supervised AD methods have been proposed for images (Ruff et al., 2019; Huang et al., 2021), texts (Zhou et al., 2023), and tabular data (Yoon et al., 2022). They cannot be used to solve our problem unless they are embedded into our MELODY framework, as was done for DeepSAD (Ruff et al., 2019) in Table 3. This suggests the importance and flexibility of the entire MELODY framework for entity-level AD.

7. Conclusion

In this paper, we introduced MELODY, a semi-supervised framework for online entity-level anomaly detection. MELODY uses an online feature extractor to align the MTS of different entities to the same feature space, and a hybrid model SemiAD for detecting anomalous entities. In SemiAD, SemiDOC was proposed for tightening the boundary of normal entities by negative sampling. It is combined with a supervised detector for robust detection. The comprehensive experiments on large-scale datasets indicate that MELODY outperforms the SOTA methods, and the human evaluation further suggests its effectiveness in large online systems.

References

Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
Balalaie et al. (2016) Armin Balalaie, Abbas Heydarnoori, and Pooyan Jamshidi. 2016. Microservices architecture enables devops: Migration to a cloud-native architecture. Ieee Software 33, 3 (2016), 42–52.
Basu and Meckesheimer (2007) Sabyasachi Basu and Martin Meckesheimer. 2007. Automatic outlier detection for time series: an application to sensor data. Knowledge and Information Systems 11 (2007), 137–154.
Breiman (1996) Leo Breiman. 1996. Bagging predictors. Machine learning 24 (1996), 123–140.
Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45 (2001), 5–32.
Chen et al. (2023) Ningjiang Chen, Huan Tu, Xiaoyan Duan, Liangqing Hu, and Chengxiang Guo. 2023. Semisupervised anomaly detection of multivariate time series based on a variational autoencoder. Applied Intelligence 53, 5 (2023), 6074–6098.
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
Deng and Hooi (2021) Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4027–4035.
Dragoni et al. (2017) Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. 2017. Microservices: yesterday, today, and tomorrow. Present and ulterior software engineering (2017), 195–216.
Du et al. (2017) Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. 1285–1298.
Gan et al. (2019) Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices. In Proceedings of the twenty-fourth international conference on architectural support for programming languages and operating systems. 19–33.
Han et al. (2022) Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang, and Yue Zhao. 2022. Adbench: Anomaly detection benchmark. Advances in Neural Information Processing Systems 35 (2022), 32142–32159.
Huang et al. (2021) Chaoqin Huang, Fei Ye, Peisen Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. 2021. ESAD: end-to-end semi-supervised anomaly detection. Restoration 69, 70 (2021), 71.
Huang et al. (2022) Tao Huang, Pengfei Chen, and Ruipeng Li. 2022. A semi-supervised vae based active anomaly detection framework in multivariate time series for online systems. In Proceedings of the ACM Web Conference 2022. 1797–1806.
Hundman et al. (2018) Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In KDD. 387–395.
Jiang et al. (2021) Jehn-Ruey Jiang, Jian-Bin Kao, and Yu-Lin Li. 2021. Semi-supervised time series anomaly detection based on statistics and deep learning. Applied Sciences 11, 15 (2021), 6698.
Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Li et al. (2023) Yuxin Li, Wenchao Chen, Bo Chen, Dongsheng Wang, Long Tian, and Mingyuan Zhou. 2023. Prototype-oriented unsupervised anomaly detection for multivariate time series. In International conference on machine learning. PMLR.
Liu et al. (2015) Dapeng Liu, Youjian Zhao, Haowen Xu, Yongqian Sun, Dan Pei, Jiao Luo, Xiaowei Jing, and Mei Feng. 2015. Opprentice: Towards practical and automatic anomaly detection through machine learning. In Proceedings of the 2015 internet measurement conference. 211–224.
Malhotra et al. (2015) Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, Puneet Agarwal, et al. 2015. Long Short Term Memory Networks for Anomaly Detection in Time Series.. In ESANN, Vol. 2015. 89.
Meng et al. (2019) Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. 2019. Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs.. In IJCAI, Vol. 19. 4739–4745.
Munir et al. (2018) Mohsin Munir, Shoaib Ahmed Siddiqui, Andreas Dengel, and Sheraz Ahmed. 2018. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. Ieee Access 7 (2018), 1991–2005.
Nedelkoski et al. (2019) Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. 2019. Anomaly detection from system tracing data using multimodal deep learning. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD). IEEE, 179–186.
Oreshkin et al. (2019) Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
Park et al. (2018) Daehyung Park, Yuuna Hoshi, and Charles C Kemp. 2018. A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder. IEEE Robotics and Automation Letters 3, 3 (2018), 1544–1551.
Prokhorenkova et al. (2018) Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems 31 (2018).
Ruff et al. (2018) Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep one-class classification. In International conference on machine learning. PMLR, 4393–4402.
Ruff et al. (2019) Lukas Ruff, Robert A Vandermeulen, Nico Görnitz, Alexander Binder, Emmanuel Müller, Klaus-Robert Müller, and Marius Kloft. 2019. Deep Semi-Supervised Anomaly Detection. In International Conference on Learning Representations.
Schmidl et al. (2022) Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly detection in time series: a comprehensive evaluation. Proceedings of the VLDB Endowment 15, 9 (2022), 1779–1797.
Schölkopf et al. (2001) Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural computation 13, 7 (2001), 1443–1471.
Song et al. (2018) Dongjin Song, Ning Xia, Wei Cheng, Haifeng Chen, and Dacheng Tao. 2018. Deep r-th root of rank supervised joint binary embedding for multivariate time series retrieval. In KDD. 2229–2238.
Su et al. (2019) Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2828–2837.
Tax and Duin (2004) David MJ Tax and Robert PW Duin. 2004. Support vector data description. Machine learning 54 (2004), 45–66.
Tuli et al. (2022) Shreshth Tuli, Giuliano Casale, and Nicholas R Jennings. 2022. TranAD: deep transformer networks for anomaly detection in multivariate time series data. Proceedings of the VLDB Endowment 15, 6 (2022), 1201–1214.
Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
Vargaftik et al. (2021) Shay Vargaftik, Isaac Keslassy, Ariel Orda, and Yaniv Ben-Itzhak. 2021. RADE: resource-efficient supervised anomaly detection using decision tree-based ensemble methods. Machine Learning 110, 10 (2021), 2835–2866.
Wang et al. (2018) Ping Wang, Jingmin Xu, Meng Ma, Weilan Lin, Disheng Pan, Yuan Wang, and Pengfei Chen. 2018. Cloudranger: Root cause identification for cloud native systems. In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 492–502.
Xu et al. (2018) Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 world wide web conference. 187–196.
Xu et al. (2021) Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2021. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. In International Conference on Learning Representations.
Yoon et al. (2022) Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O Arik, and Tomas Pfister. 2022. SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch. Transactions on Machine Learning Research (2022).
Zhang et al. (2019) Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 1409–1416.
Zhao et al. (2020) Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Multivariate time-series anomaly detection via graph attention network. In ICDM. IEEE, 841–850.
Zhou et al. (2023) Yixuan Zhou, Peiyu Yang, Yi Qu, Xing Xu, Fumin Shen, and Heng Tao Shen. 2023. AnoOnly: Semi-Supervised Anomaly Detection without Loss on Normal Data. arXiv preprint arXiv:2305.18798 (2023).

Appendix A Appendix

A.1. Time Complexity Analysis

As an online anomaly detection system, the efficiency for online inference is important. For a new deployment, the computational load of MELODY for online anomaly detection includes the computation for featurizer instances (Sec. 4.1.1) and the inference using SemiAD model. In this section, we first analyze the time complexity for initializing each of the featurizers in Sec. 4.1.1 and their score computation, then analyze the time complexity for SemiAD model inference.

Rule-based Featurizer. At the initialization stage, SbF learns the mean $\mu$ and standard deviation $\sigma$ of all $T$ values in the history ${\textbf{x}}_{i,j}$ , which is a one-time pass with complexity $O(T)$ . TbF sets a predefined threshold $\tau$ with complexity $O(1)$ . CbF sets up a sliding window of size $w_{c}$ with complexity $O(1)$ . At the inference stage, SbF and TbF check whether $x_{i,j}^{t}$ ( $t>T$ ) is above their thresholds, so the complexity is $O(1)$ . Similarly, CbF checks whether $x_{i,j}^{t}$ is a missing value with complexity $O(1)$ . Therefore, the online complexity of the Rule-based Featurizer is $O(1)$ per time step.
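
The split between a one-time initialization and a constant-time online check can be sketched for SbF as follows. This is a sketch under assumptions: the class name, the multiplier `k`, and the threshold form $\mu+k\sigma$ are hypothetical, since the text only states that SbF learns $\mu$ and $\sigma$ and then compares new values against a threshold.

```python
import math


class SbFSketch:
    """Illustrative rule-based featurizer built on history statistics."""

    def __init__(self, history, k=3.0):
        # One pass over the length-T history, done once before deployment.
        n = len(history)
        self.mu = sum(history) / n
        variance = sum((x - self.mu) ** 2 for x in history) / n
        self.sigma = math.sqrt(variance)
        self.k = k  # hypothetical multiplier, not a value from the paper

    def score(self, x_t):
        # Constant-time check for each new observation x_t (t > T).
        return 1.0 if x_t > self.mu + self.k * self.sigma else 0.0


featurizer = SbFSketch([1.0, 2.0, 3.0, 4.0, 5.0])
flag_high = featurizer.score(8.0)    # above mu + 3 * sigma
flag_normal = featurizer.score(5.0)  # within the usual range
```

Because only `mu`, `sigma`, and `k` are kept after initialization, the per-observation cost does not depend on the history length.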

Algorithm-based Featurizer. At the initialization stage, SubNN segments the length- $T$ history ${\textbf{x}}_{i,j}$ into subsequences with window size $w_{n}$ and stride 1 to form a set $\mathcal{S}$ of length- $w_{n}$ subsequences, so its complexity is $O(T-w_{n})$ . MD only sets up a sliding window of size $w_{m}$ , so its complexity is $O(1)$ . At the inference stage, SubNN calculates the distance between a length- $w_{n}$ sliding window ending at time step $t$ and its nearest neighbor in $\mathcal{S}$ , so the complexity is $O(Tw_{n})$ . MD calculates $m_{\text{obs}}=\text{median}([x_{i,j}^{t-w_{m}},...,x_{i,j}^{t-1}])$ and $m_{\text{dif}}=\text{median}([x_{i,j}^{t-w_{m}+1}-x_{i,j}^{t-w_{m}},...,x_{i,j}^{t-1}-x_{i,j}^{t-2}])$ with complexity $O(w_{m})$ . Therefore, the complexity of the Algorithm-based Featurizer is $O(Tw_{n}+w_{m})$ .
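
The two algorithm-based computations can be sketched as below. This is a brute-force illustration, not the deployed code: the Euclidean distance used for the nearest-neighbor search is an assumption (the distance is not specified in this appendix), the function names are hypothetical, and the series is indexed from 0 so that `series[t]` plays the role of $x_{i,j}^{t}$.

```python
import math
from statistics import median


def build_subsequences(history, w_n):
    """All length-w_n windows of the history with stride 1 (the set S)."""
    return [history[s:s + w_n] for s in range(len(history) - w_n + 1)]


def subnn_score(window, subsequences):
    """Distance from the current window to its nearest neighbor in S.

    Compares against about T windows of w_n values each, i.e., O(T * w_n).
    """
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(window, sub)))
        for sub in subsequences
    )


def md_features(series, t, w_m):
    """m_obs and m_dif over the w_m observations preceding time step t."""
    obs = series[t - w_m:t]                        # x^{t-w_m}, ..., x^{t-1}
    diffs = [b - a for a, b in zip(obs, obs[1:])]  # consecutive differences
    return median(obs), median(diffs)


history = [0, 1, 2, 3, 4, 5]
S = build_subsequences(history, w_n=3)
seen = subnn_score([1, 2, 3], S)        # an exact match exists in S
unseen = subnn_score([10, 11, 12], S)   # far from every stored window
m_obs, m_dif = md_features([1, 3, 2, 5, 4, 6], t=5, w_m=4)
```

With stride 1 the set contains $T-w_{n}+1$ windows, which is where the $O(Tw_{n})$ inference cost of SubNN comes from.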

Meta-Data Featurizer. The deployment-level meta-data features are static and can be retrieved and appended to rule-based and algorithm-based features with $O(1)$ time complexity.

On top of the featurizers, there is a time pooling layer (Sec. 4.1.2) for addressing the variable duration of different deployments and a feature aggregator (Sec. 4.1.3) for aligning different deployments into the same feature space. For the time pooling layer, both MaxPool and MeanPool in Eq. (1) were updated incrementally in constant time during online inference, and in parallel for different features, so its complexity is $O(1)$ . For feature aggregator, the key computational step in Eq. (2) was implemented in parallel for all metrics, and its time complexity for each metric is $O(n_{i})$ for $n_{i}$ variates in the MTS. Additionally, the imputation step in the feature aggregator takes $O(n_{i})$ time for filling missing values at a time step. Therefore, by parallelizing all featurizer instances, the overall time complexity of the OFE module during online inference is $O(Tw_{n}+w_{m}+n_{i})$ .
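
The constant-time incremental update of the pooled features can be sketched as follows. This sketch assumes that MaxPool and MeanPool are the running maximum and running mean of a featurizer's outputs since the launch of a deployment; the class name is hypothetical and Eq. (1) is not reproduced here.

```python
class RunningPool:
    """Running max and mean of one feature stream, updated in O(1) per step."""

    def __init__(self):
        self.count = 0
        self.max_value = float("-inf")
        self.mean_value = 0.0

    def update(self, x):
        self.count += 1
        self.max_value = max(self.max_value, x)
        # Incremental mean: no need to store or revisit earlier values.
        self.mean_value += (x - self.mean_value) / self.count
        return self.max_value, self.mean_value


pool = RunningPool()
for value in [1.0, 3.0, 2.0]:
    pooled_max, pooled_mean = pool.update(value)
```

Since the state has a fixed size, the pooled features are available at every time step regardless of how long the deployment has been running, which is what allows deployments of different durations to share one feature space.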

SemiAD Model Inference. The SemiDOC module in SemiAD uses Eq. (7) for inferring anomaly scores, where c and $R$ were pre-computed during training time and stored for model inference. So the key computation is the embedding $\phi({\textbf{z}}_{\text{new}};\boldsymbol{\theta})$ , which takes $O(dd_{\text{max}})$ for an MLP encoder, where $d$ is the dimension of ${\textbf{z}}_{\text{new}}$ and $d_{\text{max}}$ is the maximal dimension of all layers. In addition, the LightGBM in SemiAD uses $O(d)$ time for inference with a constant number of estimators (Ke et al., 2017). Therefore, the time complexity of both ensemble methods, MELODY-M and MELODY-S, is $O(dd_{\text{max}})$ at online inference time.

Summary. Integrating the time complexity of the processes for initializing featurizers, online computation of features, and SemiAD model inference, the time complexity of MELODY for online anomaly detection is $O(Tw_{n}+w_{m}+dd_{\text{max}})$ , where $T$ is the length of historical time series, $w_{n}$ is the window size used by SubNN, $w_{m}$ is the window size used by MD, $d$ is the dimension of features input to SemiDOC, and $d_{\text{max}}$ is the maximal dimension of all layers in the neural network $\phi(\cdot;\boldsymbol{\theta})$ . In practice, $w_{n}$ , $w_{m}$ , and $d_{\text{max}}$ are set as small constants. According to Sec. 4.1.3, $d=134$ is also a small constant. Therefore, the time complexity is approximately $O(T)$ . In our experiments, we used 2-day historical time series at minute granularity, so $T=2880$ . We set $w_{n}=100$ , $w_{m}=100$ , and $d_{\text{max}}=128$ . With this setup, the model is efficient and can perform online anomaly detection.
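
As a rough worked example, plugging the reported settings into the bound above gives the following count. Constant factors hidden by the $O(\cdot)$ notation are ignored, so the result is only an order-of-magnitude indication and not a measured cost.

```latex
\begin{aligned}
T w_{n} + w_{m} + d\, d_{\text{max}}
  &= 2880 \cdot 100 + 100 + 134 \cdot 128 \\
  &= 288000 + 100 + 17152 \\
  &= 305252 \approx 3 \times 10^{5}
\end{aligned}
```

The SubNN term $Tw_{n}$ accounts for about 94% of this total, which is consistent with summarizing the overall cost as approximately $O(T)$ once $w_{n}$ is treated as a constant.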

A.2. Implementation Details

For the baseline methods, we employed their official code when available. In the group of MTS-based methods, OmniAnomaly used a 2-layer GRU, a 3-layer encoder, and a 3-layer decoder, with PReLU as the internal activation and Sigmoid as the output activation. Its embedding dimension was set to 8, and the other hidden layers had a dimension of 32. Its hyperparameter for the KL divergence was set to $\beta=0.01$. It was trained with the Adam optimizer (Kingma and Ba, 2014) with learning rate $2\times 10^{-3}$ and weight decay $1\times 10^{-5}$. AnomalyTransformer used a 3-layer transformer with 8 heads as the encoder. The embedding dimension was set to 512, and the activation function was GELU. Its hyperparameter was set to $\lambda=3$. It was trained with the Adam optimizer with learning rate $1\times 10^{-4}$. TranAD used a transformer encoder-decoder architecture with an embedding dimension of 16, followed by a fully connected output layer. It was trained with the Adam optimizer with learning rate $1\times 10^{-3}$ and weight decay $1\times 10^{-5}$.

In the group of general methods, both DeepSVDD and DeepSAD used a 2-layer neural network encoder with embedding dimension 128. LeakyReLU (slope 0.1) was used as the activation function in the input layers, and Sigmoid was used as the output activation. LayerNorm (Ba et al., 2016) was added in each layer. The hyperparameter of DeepSAD was set to $\eta=1$. Both DeepSVDD and DeepSAD were optimized by Adam with learning rate $1\times 10^{-3}$ and weight decay $1\times 10^{-4}$. RandomForest used 100 estimators with a maximum depth of 2. LightGBM used 100 estimators with a maximum depth of 5 and a learning rate of 0.1.

For the proposed MELODY method, the neural network $\phi(\cdot;\boldsymbol{\theta})$ in SemiDOC in Eq. (5) was implemented as a two-layer encoder with embedding dimension 128. LeakyReLU (slope 0.1) was used as the activation function in the input layers, and Sigmoid was used as the output activation. LayerNorm was added in each layer. The architecture was the same as that of DeepSVDD and DeepSAD for a fair comparison. The SemiDOC model was optimized by Adam with learning rate $1\times 10^{-3}$, weight decay $1\times 10^{-4}$, batch size 256, and a maximum of 500 epochs. Early stopping was employed using the validation set.
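The training schedule described above (at most 500 epochs, with early stopping on the validation set) can be sketched as follows. The stopping criterion and the `patience` value are assumptions, since the text does not state them, and `train_epoch` and `validate` are hypothetical callbacks standing in for one pass over the training data and one evaluation of the validation loss.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=500, patience=10):
    """Run up to max_epochs; stop once validation loss stalls for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch()          # one pass over the training set
        val_loss = validate()  # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break              # no improvement for `patience` epochs
    return best_epoch, best_loss


# Toy usage with a pre-recorded validation curve instead of a real model.
curve = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8])
best_epoch, best_loss = train_with_early_stopping(
    train_epoch=lambda: None,
    validate=lambda: next(curve),
    patience=2,
)
print(best_epoch, best_loss)
```

In a real training loop, the model parameters from `best_epoch` would typically be checkpointed and restored after stopping; that step is left out of the sketch.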