MELODY: Robust Semi-Supervised Hybrid Model for Entity-Level Online Anomaly Detection with Multivariate Time Series

Jingchao Ni, Gauthier Guinet, Peihong Jiang, Laurent Callot, Andrey Kan AWS AI Labs
{**gchni, guinetgg, lcallot, avkan}@amazon.com, [email protected]
Abstract.

In large IT systems, software deployment is a crucial process in online services because their code is regularly updated. However, a faulty code change may degrade the target service’s performance and cause cascading outages in downstream services. Thus, software deployments should be comprehensively monitored, and their anomalies should be detected in a timely manner. In this paper, we study the problem of anomaly detection for deployments. We begin by identifying the challenges unique to this anomaly detection problem, which is at the entity level (e.g., deployments), relative to the more typical problem of anomaly detection in multivariate time series (MTS). The unique challenges include the heterogeneity of deployments, the low latency tolerance, the ambiguous anomaly definition, and the limited supervision. To address them, we propose a novel framework, semi-supervised hybrid Model for Entity-Level Online Detection of anomalY (MELODY). MELODY first transforms the MTS of different entities to the same feature space with an online feature extractor, then uses a newly proposed semi-supervised deep one-class model to detect anomalous entities. We evaluated MELODY on real data of cloud services with 1.2M+ time series. The relative F1 score improvement of MELODY over the state-of-the-art methods ranges from 7.6% to 56.5%. The user evaluation suggests MELODY is suitable for monitoring deployments in large online systems.

Time series, Anomaly detection, Deep learning

1. Introduction

As cloud native systems become prevalent in the modern IT industry, most applications “born on cloud” are composed of a multitude of interconnected services (or microservices), each specialized in accomplishing a narrow range of tasks (Balalaie et al., 2016; Dragoni et al., 2017). This composability of the services facilitates independent deployment, rapid delivery, and flexible expansion of applications in many cloud architectures (Wang et al., 2018). It is particularly useful in large cloud platforms such as Amazon AWS, Google Cloud, and Microsoft Azure due to their culture of team-level service ownership.

Figure 1. An illustration of (a) point-level anomaly detection, and (b) entity-level anomaly detection.

As the code implementing these services is regularly updated, whether to add new functionality or improve performance, faults may be introduced. Faulty changes can degrade the target service’s performance or cause externally-facing outages, which directly impact the customer’s experience and the company’s reputation. To prevent faulty changes, code deployments should be monitored comprehensively for rapid detection of anomalous behaviors so that faulty deployments can be stopped in time and the software can be returned to its previous, safe status. This process is called rollback, which helps avoid cascading damages to a cloud application.

The existing approaches mainly detect anomalies in online systems from four perspectives, namely KPI-level (key performance indicator) (Liu et al., 2015; Xu et al., 2018), log-level (Du et al., 2017; Meng et al., 2019), trace-level (Gan et al., 2019; Nedelkoski et al., 2019), and entity-level (Su et al., 2019; Huang et al., 2022). In this paper, we focus on anomaly detection at the entity level, where an entity could be a cloud server, a container, or a deployment of services. For example, for a deployment of certain services at Amazon, a large number of metrics, such as CPU usage, memory usage, and thread usage, are continuously monitored. Each metric emits a time series (TS), and all of the metrics are aggregated into a multivariate time series (MTS), as illustrated in Fig. 1(b). As such, it is intuitive to resort to MTS anomaly detection approaches for entity-level anomaly detection.

Existing anomaly detection techniques on MTS are mostly designed for point-level anomaly detection, which identifies time points whose observations are anomalous relative to their contextual time points in a single entity (Schmidl et al., 2022), such as a server, as Fig. 1(a) presents. Entity-level anomaly detection is remarkably different. It aims to detect anomalous entities in a stream of entities (e.g., deployments), where each entity emits MTS, as Fig. 1(b) presents. In particular, entity-level anomaly detection poses four unique challenges.

1. Multiple Heterogeneous Entities. To detect anomalous entities, a model should be trained on the MTS data across different entities to capture the behavioral patterns of the entities. However, a model trained on the metrics of one entity is hard to apply to other entities: (1) For the same metric, the time series values are not comparable across entities. For example, the normal CPU usage of deploying a computational service is higher than that of a notification service. This leads to different MTS spaces in different entities. (2) Different entities may have different durations, so their MTSs may have different lengths. (3) Different entities may have a different (sub)set of metrics, so their MTSs may have different numbers of variates (or dimensions). As such, we seek a robust model that can be shared across heterogeneous entities with MTSs that have varying scales, lengths, and variates.

2. Low Latency Tolerance. In an online system, the constantly emerging entities form a stream, where a single entity may only live for a short time. For example, a deployment of a service at Amazon may only last for a few minutes, and the duration is unknown at the onset. Therefore, it is infeasible to train a model per entity online. This reinforces the need to share a pre-trained model across entities. It also requires the model to adapt to all historical MTS of a new entity and perform inference with low latency.

3. Ambiguous Anomaly Definition. Unlike the existing anomaly detection methods that detect unexpected changes from their contextual time points, the definition of anomalies in an industrial environment is more ambiguous. A change in a time series may indicate not an anomaly but the normal launch of a new deployment. Instead, an anomaly is defined as a service that cannot work normally, which necessitates supervised signals from domain experts.

4. Limited Supervision. The entity-level labels are not point-wise, i.e., they do not locate the time points when anomalies occur, but only indicate whether the entity is anomalous by the end of its duration. Because the existing (semi-)supervised methods require point-level labels, they cannot be trained to solve our problem. Moreover, even for the entity-level labels, domain experts may make mistakes. Given the high human cost of labeling, the challenge is then to build models with few, noisy entity labels.

To address these challenges, we propose a Semi-supervised Hybrid Model for Entity-Level Online Detection of anomalY (MELODY). MELODY is used in production, where it monitors several million deployments every month. MELODY consists of two major components, namely an online feature extractor (OFE) and a semi-supervised anomaly detection (SemiAD) module. OFE embeds the MTSs of different entities into the same, comparable feature space of fixed dimension, where the features are computed dynamically and incrementally for varying MTS lengths (Challenge 1). Because feature extraction adds to inference time, we design OFE with efficient initialization and updating capability (Challenge 2). Based on the extracted features, SemiAD leverages supervised signals for learning anomalous patterns (Challenge 3). SemiAD is a hybrid model with two sub-modules: a supervised ensemble model that is robust to noisy features and labels (Challenges 1 and 4), and a semi-supervised deep one-class model that can leverage the large amount of unlabeled data to complement the limited supervision (Challenge 4). We introduce two strategies to combine the outputs of the two sub-modules, i.e., an ensemble strategy and a sequential strategy. Our contributions can be summarized as follows.

  • We investigate a new entity-level anomaly detection problem, which is motivated by real applications in cloud native systems. Its unique challenges cannot be directly addressed by existing MTS anomaly detection approaches.

  • We propose MELODY, a novel robust semi-supervised framework for online anomaly detection in streaming entities. It resolves the ambiguous definition of anomalies via limited labels, and leverages the vast amount of unlabeled data for enhancing its robustness and performance.

  • We evaluate MELODY using the data of 30K+ deployments with 1.2M+ time series from Amazon, and compare it with the state-of-the-art (SOTA) approaches. The results demonstrate MELODY significantly outperforms the baseline methods, with up to 56.5% relative improvement on F1 score.

  • We deploy MELODY as a core component of an AutoRollback system on the deployments of services at Amazon, and evaluate customer experience of the enhanced system.

2. Preliminary

In this work, we consider real-time monitoring of deployment entities. A deployment is a process of updating software packages, configuration, environment variables, etc. Each deployment affects one or more services, each service has multiple metrics, and each metric has a univariate time series. Therefore, each deployment has multiple univariate time series, each of which has a unique label (service, metric). The set of possible metrics is fixed, but there could be an unlimited number of possible services. Moreover, a service may be monitored with only a subset of the metrics. Therefore, each deployment is associated with a variable number of univariate time series. Fig. 1(b) illustrates the time series of three deployments.

2.1. Problem Statement

Suppose the collection of entities is $\mathcal{X}=\{\mathcal{X}_{i}\}_{i=1}^{N}$. Each entity (e.g., deployment) $\mathcal{X}_{i}$ has a historical multivariate time series $\mathbf{X}_{i}=[\mathbf{x}_{i,1},...,\mathbf{x}_{i,n_{i}}]$, where $\mathbf{x}_{i,j}=[x_{i,j}^{1},...,x_{i,j}^{T}]\in\mathbb{R}^{T}$ ($1\leq j\leq n_{i}$) is the $j$-th univariate time series associated with a unique label ($\texttt{service}_{j}$, $\texttt{metrics}_{j}$) in an observation window of size $T$. Due to the variability of service and the different subsets of metrics, the number $n_{i}$ could be different for different $\mathcal{X}_{i}$. In this paper, we also use $\mathbf{x}_{i}^{t}=[x_{i,1}^{t},...,x_{i,n_{i}}^{t}]$ to denote an observation of the $n_{i}$ variates of $\mathcal{X}_{i}$ at time step $t$, and use $\mathbf{X}_{i}^{t-w:t}=[\mathbf{x}_{i,1}^{t-w:t},...,\mathbf{x}_{i,n_{i}}^{t-w:t}]$ to denote a sequence of observations from time $t-w$ to $t$.

To resolve the ambiguous anomaly definition, partial labels are available. Formally, let $\mathcal{X}=\{\mathcal{X}^{u},\mathcal{X}^{l}\}$, $\mathcal{X}^{u}$ be the subset of unlabeled entities, and $\mathcal{X}^{l}$ be the labeled subset with binary labels $\mathbf{y}\in\{0,1\}^{N_{l}}$, where $N_{l}=|\mathcal{X}^{l}|$ and $\mathcal{X}^{u}\cap\mathcal{X}^{l}=\varnothing$. Label $\mathbf{y}_{i}=1$ indicates the $i$-th entity in $\mathcal{X}^{l}$ is anomalous; $\mathbf{y}_{i}=0$ otherwise.

The entity-level anomaly detection problem is to train a model $f:\mathcal{X}\rightarrow\mathbb{R}$, such that given a new entity $\mathcal{X}_{i}$ with its historical MTS $\mathbf{X}_{i}\in\mathbb{R}^{T\times n_{i}}$, the output $f(\mathbf{x}_{i}^{t}|\mathbf{X}_{i})$ represents the anomaly score of the observation $\mathbf{x}_{i}^{t}\in\mathbb{R}^{n_{i}}$ at time step $t$ ($t>T$).
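To make the notation concrete, the following is a minimal sketch of how the entities and the labeled/unlabeled split could be represented in code. The names `Entity`, `SeriesKey`, and `split_by_label` are hypothetical and are not taken from the MELODY implementation; the sketch only mirrors the definitions above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A univariate series is identified by a (service, metric) pair, as in Sec. 2.
SeriesKey = Tuple[str, str]


@dataclass
class Entity:
    """One deployment X_i (hypothetical container, for illustration only)."""
    # Historical MTS X_i: one list of T values per (service, metric) label.
    history: Dict[SeriesKey, List[float]]
    # Entity-level label y_i in {0, 1}; None means the entity is unlabeled.
    label: Optional[int] = None

    @property
    def n_variates(self) -> int:
        # n_i, which may differ from one entity to another.
        return len(self.history)


def split_by_label(entities: List[Entity]) -> Tuple[List[Entity], List[Entity]]:
    """Partition the collection into the disjoint sets X^u and X^l."""
    unlabeled = [e for e in entities if e.label is None]
    labeled = [e for e in entities if e.label is not None]
    return unlabeled, labeled
```

Keying each series by its (service, metric) label, rather than by position, is one way to accommodate entities whose sets of variates differ.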

Remark. It is noteworthy that although the model $f$ should check for anomalies at new time points for timely detection at inference time, this problem is different from point-level anomaly detection because (1) the model $f$ is shared across entities with different MTSs; and (2) at training time, the label $\mathbf{y}$ only marks anomalous entities, without marking anomalous time points, which are required by the existing (semi-)supervised methods (Jiang et al., 2021; Schmidl et al., 2022; Huang et al., 2022; Chen et al., 2023).

3. Proposed System Overview

Figure 2. An illustration of (a) the system architecture, and (b) the inference process of the MELODY framework.

Fig. 2(a) illustrates the system architecture for running our MELODY model for detecting anomalous deployments of services.

3.1. The System Architecture

The offline system in Fig. 2(a, left) consists of three key components: (1) the data labeling service for importing data from products to the experimental environment, (2) the machine learning platform for developing and training the MELODY model, and (3) the model deployment service for deploying the OFE and SemiAD artifacts of the MELODY model to the online system.

In particular, the labeling service imports the MTS and meta-data (e.g., config profiles) of each deployment to the datalake. It provides two ways of labeling the deployments. The first is a labeling UI, which visualizes the MTS of the deployment to be labeled. The labelers (human domain experts) use the UI to assign a score on a scale of -1 to 3 to deployments in $\mathcal{X}^{l}$: -1 means the labeler is unsure, 0 (or 1) indicates a normal (or likely normal) deployment, and 3 (or 2) indicates an abnormal (or likely abnormal) deployment.
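The text does not state how these five-level scores are converted to the binary labels of Sec. 2.1. As an illustration only, the sketch below assumes that scores 0 and 1 map to normal (0), scores 2 and 3 map to anomalous (1), and the unsure score -1 is treated as unlabeled. The function name `to_binary_label` is hypothetical.

```python
from typing import Optional


def to_binary_label(score: int) -> Optional[int]:
    """Map a labeler score in {-1, 0, 1, 2, 3} to a binary label.

    Assumed mapping (not confirmed by the source):
      -1 (unsure)                    -> None (treated as unlabeled)
       0 / 1 (normal / likely so)    -> 0
       2 / 3 (likely abnormal / so)  -> 1
    """
    if score not in (-1, 0, 1, 2, 3):
        raise ValueError(f"score must be in -1..3, got {score}")
    if score == -1:
        return None
    return 1 if score >= 2 else 0
```

Under this assumption, "likely" scores are folded into their nearest definite class, which is one source of the label noise that the model is designed to tolerate.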

Because human labeling is costly, the second way is to automatically label normal deployments, which is referred to as Bot in Fig. 2(a). Bot applies a set of expert-defined rules (e.g., whether a deployment has passed some safety checks) to the unlabeled deployments in $\mathcal{X}^{u}$. The deployments that pass are automatically labeled as “normal”. These auto-labeled “normal” deployments may include noise (i.e., anomalies), but the amount should be small because anomalies are usually rare, as is the case in most anomaly detection tasks (Schmidl et al., 2022; Han et al., 2022). This auto-labeled normal set is valuable for unsupervised or semi-supervised methods (e.g., one-class models) to learn normal patterns.

Fig. 2(a, right) is the online system. Once users launch a new deployment, the historical MTS preceding the deployment is loaded into a data cache for initializing the OFE module (as detailed in Sec. 4.1). Then, the data cache sends the most recent $w$ time steps of the real-time observations to the OFE service, where $w$ is a window size. The OFE service transforms the data to features and sends them to the anomaly detection service, which calculates the anomaly score at the current time step. The rollback decision service uses this score, together with a threshold and some safety checking rules, to determine whether to roll back the deployment. If the deployment is rolled back, the deployment troubleshooting service displays the time-wise anomaly scores and the details of relevant metrics to the users for debugging through a web interface.

In summary, the architecture enables continuous labeling of the data and a scheduled training cycle for automatic model updates with new data. Next, we focus on the design of the MELODY framework (the blue dashed box in Fig. 2(a)).

4. The Proposed MELODY framework

Fig. 2(b) illustrates the MELODY framework for online anomaly detection. It consists of OFE and SemiAD modules. The OFE transforms different MTSs of different entities to a comparable feature space. The SemiAD is a hybrid model that consumes the features for generating an anomaly score with a combining strategy.

4.1. Online Feature Extraction (OFE)

As described in Challenge 1 (Sec. 1), the MTSs of heterogeneous entities pose three issues that prevent a model from being trained across entities. The OFE aims to address them.

4.1.1. Featurizer

First, to address the non-comparable time series values of different entities, we transform the raw values of individual univariate time series to their anomalous degrees. Specifically, for the $i$-th deployment $\mathcal{X}_{i}$, its $j$-th variate’s observation at time step $t$, i.e., $x_{i,j}^{t}$ ($1\leq j\leq n_{i}$), is transformed to a score $s_{i,j}^{t}$ representing the deviation of $x_{i,j}^{t}$ from the normal history $\mathbf{x}_{i,j}=[x_{i,j}^{1},...,x_{i,j}^{T}]$ for any $t>T$. The more $x_{i,j}^{t}$ deviates from $\mathbf{x}_{i,j}$, i.e., the higher the score $s_{i,j}^{t}$ is, the more anomalous $x_{i,j}^{t}$ is. It is noteworthy that $s_{i,j}^{t}$ defines the change of $x_{i,j}^{t}$ relative to its history $\mathbf{x}_{i,j}$ for any univariate time series consistently, thus it is a comparable score across entities.

As illustrated in Fig. 2(b), the OFE has three Featurizers to generate score $s_{i,j}^{t}$. Each Featurizer is a general class. Given a concrete deployment $\mathcal{X}_{i}$, each Featurizer first initializes a set of instances, one per variate using its history, for some (or all) of the variates, e.g., an instance is initialized using the history $\mathbf{x}_{i,j}$ for variate $j$.

Then the $j$-th instance is used to transform $x_{i,j}^{t}$ to $s_{i,j}^{t}$ in an online manner. Because of the Low Latency Tolerance (Challenge 2 in Sec. 1), we design three Featurizers that initialize instances efficiently:

Rule-based Featurizer. We integrate the prior knowledge of domain experts into our model by three types of rule-based features: (1) statistics based features (SbF), (2) threshold based features (TbF), and (3) count based features (CbF).

At the initialization stage, SbF learns the mean $\mu$ and standard deviation $\sigma$ of all $T$ values in the history $\mathbf{x}_{i,j}$, and defines a threshold $\tau=\mu+\alpha*\sigma$, where $\alpha$ is a multiplier. At the inference stage, it sets $s_{i,j}^{t}=1$ at time step $t$ if it observes $w_{s}$ consecutive values $x_{i,j}^{t-w_{s}+1}$, …, $x_{i,j}^{t}$ above $\tau$; and sets $s_{i,j}^{t}=0$ otherwise. By defining different $\alpha$ and $w_{s}$ for the metrics of interest, SbF has a total of 11 instances.
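The SbF rule can be sketched as a small stateful class with a one-pass initialization and a constant-time update. This is an illustrative sketch, not the production code: the class name `StatisticsBasedFeature`, the method name `update`, the use of the population standard deviation, and the strict comparison for "above" are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Sequence


class StatisticsBasedFeature:
    """Hypothetical SbF instance for one univariate series."""

    def __init__(self, history: Sequence[float], alpha: float, w_s: int) -> None:
        # Initialization: tau = mu + alpha * sigma over the T historical values.
        # Population std is an assumption; the source does not specify which.
        self.tau = mean(history) + alpha * pstdev(history)
        self.w_s = w_s
        # Flags for the most recent w_s observations (True if above tau).
        self._recent: Deque[bool] = deque(maxlen=w_s)

    def update(self, x_t: float) -> int:
        """Consume x_t and return s_t: 1 if the last w_s values all exceed tau."""
        self._recent.append(x_t > self.tau)
        return int(len(self._recent) == self.w_s and all(self._recent))
```

For example, with history `[1, 1, 1, 1, 3, 3, 3, 3]`, `alpha=2`, and `w_s=2`, the threshold is 4.0, and the feature fires only once two successive observations exceed it.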

TbF is similar to SbF except that TbF uses a predefined threshold $\tau$ directly without resorting to $\mu$ and $\sigma$. By defining $\tau$ and $w_{s}$, TbF has 7 possible instances. CbF is used to count the number of consecutive missing observations, which can indicate anomalies on certain metrics. CbF’s initialization sets up a sliding window of size $w_{c}$. At the inference stage, CbF sets $s_{i,j}^{t}=1$ if the window ending at time step $t$ is full of missing observations; and sets $s_{i,j}^{t}=0$ otherwise. CbF has 1 possible instance.
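A minimal sketch of the TbF and CbF rules follows, under stated assumptions: the class names are hypothetical, "above" is taken as a strict comparison, and a missing observation is assumed to be represented as `None` or NaN (the source does not say how missing values are encoded).

```python
import math
from collections import deque
from typing import Deque, Optional


class ThresholdBasedFeature:
    """Hypothetical TbF instance: like SbF, but tau is predefined."""

    def __init__(self, tau: float, w_s: int) -> None:
        self.tau = tau
        self.w_s = w_s
        self._recent: Deque[bool] = deque(maxlen=w_s)

    def update(self, x_t: float) -> int:
        # s_t = 1 if the last w_s observations are all above tau.
        self._recent.append(x_t > self.tau)
        return int(len(self._recent) == self.w_s and all(self._recent))


class CountBasedFeature:
    """Hypothetical CbF instance: flags a window full of missing observations."""

    def __init__(self, w_c: int) -> None:
        self.w_c = w_c
        self._missing: Deque[bool] = deque(maxlen=w_c)

    def update(self, x_t: Optional[float]) -> int:
        # Assumption: a missing observation arrives as None or NaN.
        is_missing = x_t is None or math.isnan(x_t)
        self._missing.append(is_missing)
        return int(len(self._missing) == self.w_c and all(self._missing))
```

Neither class needs the history at initialization, which is consistent with the efficient initialization the text asks of rule-based features.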

It is noteworthy that the SbF, TbF, and CbF Featurizers all have efficient initialization. In total, the rule-based featurizer has 19 possible instances for the metrics that the rules monitor.

Algorithm-based Featurizer. We also integrate two efficient point-level anomaly detection algorithms on univariate time series for scoring $s_{i,j}^{t}$: subsequence-based nearest neighbor (SubNN) (Schmidl et al., 2022) and median forecast (MD) (Basu and Meckesheimer, 2007).

SubNN is a distance-based scorer. At the initialization stage, SubNN segments the history $\mathbf{x}_{i,j}$ into subsequences with window size $w_{n}$ and stride 1, forming a set $\mathcal{S}$ of length-$w_{n}$ subsequences, and sets up a sliding window of size $w_{n}$ for the inference stage. At the inference stage, SubNN sets $s_{i,j}^{t}$ as the distance between the sliding window ending at time step $t$ and its nearest neighbor in $\mathcal{S}$. MD is an efficient forecasting approach. At the initialization stage, it only sets up a sliding window of size $w_{m}$. At the inference stage, based on the sliding window at $t-1$, it calculates $m_{\text{obs}}=\text{median}([x_{i,j}^{t-w_{m}},...,x_{i,j}^{t-1}])$ and $m_{\text{dif}}=\text{median}([x_{i,j}^{t-w_{m}+1}-x_{i,j}^{t-w_{m}},...,x_{i,j}^{t-1}-x_{i,j}^{t-2}])$, and forecasts the subsequent value $\hat{x}_{i,j}^{t}=m_{\text{obs}}+(w_{m}/2)*m_{\text{dif}}$. We adapted it for scoring anomalies by setting $s_{i,j}^{t}$ as the deviation of $x_{i,j}^{t}$ from $\hat{x}_{i,j}^{t}$, i.e., $s_{i,j}^{t}=|x_{i,j}^{t}-\hat{x}_{i,j}^{t}|$.
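The two scorers described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the class names are hypothetical, Euclidean distance is assumed for SubNN (the text only says "distance"), the nearest neighbor is found by brute force, and both sliding windows are assumed to be seeded with the tail of the history so that a score is available from the first new observation.

```python
import math
from collections import deque
from statistics import median
from typing import Deque, List, Sequence, Tuple


class SubNN:
    """Hypothetical subsequence nearest-neighbor scorer for one series."""

    def __init__(self, history: Sequence[float], w_n: int) -> None:
        if w_n < 1 or len(history) < w_n:
            raise ValueError("history must contain at least w_n >= 1 values")
        # The set S: all length-w_n subsequences of the history, stride 1.
        self.subsequences: List[Tuple[float, ...]] = [
            tuple(history[k:k + w_n]) for k in range(len(history) - w_n + 1)
        ]
        # Assumption: seed the sliding window with the last w_n history values.
        self._window: Deque[float] = deque(history[-w_n:], maxlen=w_n)

    def update(self, x_t: float) -> float:
        self._window.append(x_t)  # window now ends at time step t
        query = tuple(self._window)
        # Euclidean distance to the nearest historical subsequence (assumed metric).
        return min(math.dist(query, s) for s in self.subsequences)


class MedianForecast:
    """Hypothetical MD scorer: s_t = |x_t - (m_obs + (w_m / 2) * m_dif)|."""

    def __init__(self, history: Sequence[float], w_m: int) -> None:
        if w_m < 2 or len(history) < w_m:
            raise ValueError("history must contain at least w_m >= 2 values")
        self.w_m = w_m
        # Assumption: the window at t-1 is seeded from the history tail.
        self._window: Deque[float] = deque(history[-w_m:], maxlen=w_m)

    def update(self, x_t: float) -> float:
        values = list(self._window)                     # x^{t-w_m}, ..., x^{t-1}
        m_obs = median(values)
        m_dif = median([b - a for a, b in zip(values, values[1:])])
        x_hat = m_obs + (self.w_m / 2) * m_dif          # forecast for time t
        self._window.append(x_t)
        return abs(x_t - x_hat)
```

The brute-force search costs time proportional to the number of stored subsequences per update; the text does not say whether an index structure is used to speed this up.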

The algorithm-based featurizers are applicable to any metric. There are 22 possible metrics in total, thus the SubNN and MD featurizers have 44 instances.

Meta-Data Featurizer. For entity-level anomaly detection, we append some static meta-data features of a deployment to the time-wise features emitted by the rule-based and algorithm-based featurizers, as illustrated by Fig. 2(b). There are 8 meta-data features pertaining to the configuration of each deployment, which are useful because anomalous patterns may vary with configuration.

4.1.2. Time Pooling Layer

The second issue in Challenge 1 (Sec. 1) is the variable duration of different deployments. Using any instance of the rule-based and algorithm-based featurizers, we can obtain a feature $s_{i,j}^{t}$ at time step $t$. However, as described in Sec. 2, the labels $\mathbf{y}$ of the training deployments in $\mathcal{X}^{l}$ are at the entity level, where $\mathbf{y}_{i}=1$ only indicates that the deployment $\mathcal{X}_{i}$ was anomalous at some point before it ended, without marking any anomalous time points. Thus we cannot train a model on the feature $s_{i,j}^{t}$ at a specific time point $t$ using $\mathbf{y}_{i}$. To address this, we seek an entity-level method that (1) can be updated efficiently to track dynamic features, and (2) is invariant to the variable duration of deployments.

To this end, we designed a time pooling layer with two pooling methods up to the current time point $t$:

$$\hat{s}_{i,j}^{t}=\text{MaxPool}([s_{i,j}^{1},...,s_{i,j}^{t}]),\quad \bar{s}_{i,j}^{t}=\text{MeanPool}([s_{i,j}^{1},...,s_{i,j}^{t}]) \tag{1}$$

where $\hat{s}_{i,j}^{t}$ records the most anomalous status up to time point $t$, and $\bar{s}_{i,j}^{t}$ records the accumulative anomalous status.

The pooling methods are invariant to variable sequence lengths. For model training, Eq. (1) generates a feature per training deployment by pooling up to the end of its length. Also, because both MaxPool and MeanPool can be updated incrementally in constant time, we can use Eq. (1) to dynamically track salient features of new deployments up to any time point for online inference.
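The constant-time update claimed above can be illustrated with a small running-statistics class. This is a hedged sketch: the class name `TimePool` and its attributes are hypothetical, and the incremental mean uses the standard recurrence $\bar{s}^{t}=\bar{s}^{t-1}+(s^{t}-\bar{s}^{t-1})/t$.

```python
class TimePool:
    """Running MaxPool and MeanPool over a stream of scores, O(1) per update."""

    def __init__(self):
        self.count = 0
        self.max_score = float("-inf")
        self.mean_score = 0.0

    def update(self, s_t):
        """Fold the score at the current time step into both pooled features."""
        self.count += 1
        self.max_score = max(self.max_score, s_t)
        # Incremental mean: no need to store or re-scan earlier scores.
        self.mean_score += (s_t - self.mean_score) / self.count
        return self.max_score, self.mean_score
```

Because only three numbers are stored per time series, the same object serves both training (pool to the end of a deployment) and online inference (read the pooled values at any intermediate step).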

4.1.3. Feature Aggregator

So far, using any of the featurizer instances, we can generate features $[\hat{s}_{i,j}^{t},\bar{s}_{i,j}^{t}]$ for each univariate time series of $\mathcal{X}_{i}$, which has a unique label ($\texttt{service}_{j}$, $\texttt{metrics}_{j}$) (Sec. 2.1). The third issue in Challenge 1 (Sec. 1) implies that the number of variates $n_{i}$ differs across deployments $\mathcal{X}_{i}$, leading to different feature dimensions for different $\mathcal{X}_{i}$. There are two reasons for the different $n_{i}$: (1) the deployments can have different subsets of metrics, and (2) multiple services may be monitored for the same metric, generating multiple univariate time series on the same metric.

To address this challenge and embed different deployments in the same feature space, we propose a feature aggregator. After we obtain the feature $[\hat{s}_{i,j}^{t},\bar{s}_{i,j}^{t}]$ from Eq. (1) for each univariate time series, we aggregate the features over different services for the same metric. Suppose $m_{k}$ is the $k$-th metric. Taking $\hat{s}_{i,j}^{t}$ as an example, we perform an aggregation for $m_{k}$:

$$\hat{z}_{i,k}^{t}=\text{MaxPool}(\{\hat{s}_{i,j}^{t}\,|\,\texttt{metrics}_{j}=m_{k},\ 1\leq j\leq n_{i}\}) \tag{2}$$

where MaxPool is used because we want to keep the most salient anomalous feature from different services.

Similarly, we can obtain $\bar{z}_{i,k}^{t}$ from $\bar{s}_{i,j}^{t}$, and this step addresses issue (2). To address issue (1), if a deployment lacks a specific metric $m_{k}$, so that Eq. (2) cannot be applied for $m_{k}$, we perform mean-based imputation on $\hat{z}_{i,k}^{t}$ using the training deployments in $\mathcal{X}$. Thus we align all deployments to the same set of 134 features (63 $\hat{z}_{i,k}^{t}$ and 63 $\bar{z}_{i,k}^{t}$ from the 63 featurizer instances, plus the 8 meta-data features in Sec. 4.1.1). Then each deployment $\mathcal{X}_{i}$ is represented by a vector $\mathbf{z}_{i}\in\mathbb{R}^{d}$ of fixed dimension $d=134$.
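The two aggregation steps (max over services per metric, then mean imputation for absent metrics) might look as follows. This is only a sketch: the function `aggregate`, the dictionary layout, and the precomputed `train_means` lookup are assumptions made for illustration, and the example covers a single featurizer output rather than all 134 features.

```python
def aggregate(pooled, metric_names, train_means):
    """Build a fixed-length per-metric feature vector for one deployment.

    pooled: dict mapping (service, metric) -> pooled score of that series.
    metric_names: ordered list of all metrics m_k shared by every deployment.
    train_means: dict metric -> mean aggregated feature over the training
        deployments (hypothetical lookup, used when a metric is absent).
    """
    by_metric = {}
    for (_service, metric), score in pooled.items():
        # MaxPool over services keeps the most salient score per metric.
        previous = by_metric.get(metric, float("-inf"))
        by_metric[metric] = max(previous, score)
    # Mean-based imputation for metrics this deployment does not report.
    return [
        by_metric[m] if m in by_metric else train_means[m]
        for m in metric_names
    ]
```

Every deployment then yields a vector of the same length and ordering, regardless of how many services or metrics it actually reports.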

To address noisy imputation and model the correlation of variates in MTS, we next propose a semi-supervised model on $\{\mathbf{z}_{i}\}_{i=1}^{N}$ for learning meaningful embeddings using supervision signals.

4.2. Semi-Supervised Anomaly Detection

Taking features $\mathbf{z}_{i}\in\mathbb{R}^{d}$, SemiAD aims to train a detector $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, and uses it to emit the final anomaly score $\hat{y}_{i}$ for every deployment $\mathcal{X}_{i}$. To be robust to the imputed values in $\mathbf{z}_{i}$, and to address Challenges 3 (ambiguous anomaly definition) and 4 (limited supervision) in Sec. 1, we propose a hybrid model comprising a semi-supervised one-class model and a supervised ensemble model.

4.2.1. Semi-Supervised Deep One-Class Model

A one-class model is an unsupervised anomaly detector that is trained on samples of a single, typically normal, class. It is used to predict whether a test sample belongs to this class or not. One prominent family is kernel-based methods, such as One-Class SVM (OC-SVM) (Schölkopf et al., 2001) and Support Vector Data Description (SVDD) (Tax and Duin, 2004), which aim to maximize the margin between normal samples and others. For example, SVDD aims to find the smallest hypersphere with a center vector $\mathbf{c}$ and a radius $R>0$ that encloses the majority of the (normal) samples in the feature space:

$$\begin{aligned}\min_{\mathbf{c},R,\boldsymbol{\xi}}\quad & R^{2}+\frac{1}{\nu N}\sum_{i=1}^{N}\xi_{i}\\ \text{s.t.}\quad & \|\phi(\mathbf{z}_{i})-\mathbf{c}\|^{2}\leq R^{2}+\xi_{i},\quad \xi_{i}\geq 0,\quad \forall i=1,...,N\end{aligned} \tag{3}$$

where $\phi(\mathbf{z}_{i})$ is a kernel function on input feature $\mathbf{z}_{i}$, $\xi_{i}$ is a slack variable to allow a soft boundary, and $\nu\in(0,1]$ is a hyperparameter to control the trade-off between the volume of the sphere and the penalties on $\xi_{i}$. Once $\mathbf{c}$ and $R$ are determined by solving the dual form of the primal problem in Eq. (3), samples that are outside the sphere, i.e., $\|\phi(\mathbf{z}_{i})-\mathbf{c}\|^{2}>R^{2}$, are deemed anomalies.

Recently, DeepSVDD was introduced in (Ruff et al., 2018). Compared to the kernel-based methods, it is more robust to the noise from feature engineering, and more scalable to large datasets. It replaces $\phi(\cdot)$ by a neural network $\phi(\cdot;\boldsymbol{\theta}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d_{e}}$ with parameter $\boldsymbol{\theta}$, where $d_{e}$ is the dimension of the embedding space. The objective of DeepSVDD is to train $\boldsymbol{\theta}$ for learning a transformation $\phi(\cdot;\boldsymbol{\theta})$ that minimizes the volume of a hypersphere centered on a predetermined $\mathbf{c}$:

$$\min_{\boldsymbol{\theta}}\ \frac{1}{N}\sum_{i=1}^{N}D\big(\phi(\mathbf{z}_{i};\boldsymbol{\theta}),\mathbf{c}\big)+\lambda\|\boldsymbol{\theta}\|_{F}^{2} \tag{4}$$

where $D(\cdot,\cdot)$ is a distance function, such as Euclidean or Hamming distance, and $\lambda$ is the hyperparameter for weight decay.

Although DeepSVDD can learn embeddings that are less sensitive to the noise in $\mathbf{z}_{i}$, it cannot harness the supervised signals for learning embeddings that resolve the ambiguous anomaly definition (Challenge 3 in Sec. 1). Recent work also indicates that, even with a small amount of labels, semi-supervised methods can significantly outperform unsupervised methods on anomaly detection (Han et al., 2022).

Therefore, we introduce our Semi-Supervised Deep One-Class Model (SemiDOC). Our goal is to use the small amount of labeled anomalies to tighten the boundary of the hypersphere. To this end, we propose a negative-sampling-based model trained batch-wise. In each batch, we randomly sample $B$ normal samples from the labeled set $\mathcal{X}^{l}$ or the unlabeled set $\mathcal{X}^{u}$ (which is auto-labeled as normal as described in Sec. 3.1) as the queries $\{\mathbf{z}_{i}^{q}\}_{i=1}^{B}$. For each query $\mathbf{z}_{i}^{q}$, we sample an anomaly from $\mathcal{X}^{l}$ as its negative sample, and form a tuple $(\mathbf{z}_{i}^{q},\mathbf{z}_{i}^{n})$. Then each batch is a set of $B$ tuples $\{(\mathbf{z}_{i}^{q},\mathbf{z}_{i}^{n})\}_{i=1}^{B}$, and the learning objective is

$$\min_{\boldsymbol{\theta}}\ \frac{1}{B}\sum_{i=1}^{B}\Big(D\big(\phi(\mathbf{z}_{i}^{q};\boldsymbol{\theta}),\mathbf{c}\big)+\ell(\mathbf{z}_{i}^{q},\mathbf{z}_{i}^{n};\boldsymbol{\theta})\Big)+\lambda\|\boldsymbol{\theta}\|_{F}^{2} \tag{5}$$

where

$$\ell(\mathbf{z}_{i}^{q},\mathbf{z}_{i}^{n};\boldsymbol{\theta})=\max\Big(\delta-D\big(\phi(\mathbf{z}_{i}^{q};\boldsymbol{\theta}),\phi(\mathbf{z}_{i}^{n};\boldsymbol{\theta})\big),0\Big) \tag{6}$$

is a hinge loss to maximize the distance between the embeddings of $\mathbf{z}_{i}^{q}$ and $\mathbf{z}_{i}^{n}$, and $\delta$ is a threshold to avoid arbitrarily large distance values in the loss function, for training stability.
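To make the batch objective concrete, the sketch below evaluates the data-dependent part of Eqs. (5) and (6) on embeddings that have already been produced by $\phi(\cdot;\boldsymbol{\theta})$. Several things are assumptions for illustration: the function names, the use of Euclidean distance for $D$ (the experiments later report Hamming distance), and the omission of the weight-decay term, which depends on the network parameters rather than on the embeddings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semidoc_batch_loss(query_emb, negative_emb, center, delta, dist=euclidean):
    """Mean over the batch of D(q, c) + max(delta - D(q, n), 0).

    query_emb: embeddings of the B normal queries.
    negative_emb: embeddings of the B sampled anomalies, paired by position.
    center: the fixed hypersphere center c.
    delta: hinge threshold from Eq. (6).
    """
    total = 0.0
    for q, n in zip(query_emb, negative_emb):
        pull = dist(q, center)               # draw the normal sample toward c
        push = max(delta - dist(q, n), 0.0)  # penalize anomalies closer than delta
        total += pull + push
    return total / len(query_emb)
```

Once a negative sample is at least `delta` away from its query, the hinge term is zero and only the pull toward the center remains, which is the stabilizing role the text attributes to $\delta$.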

After training the model, SemiDOC uses the following function to infer the anomaly score of a new sample $\mathbf{z}_{\text{new}}$:

$$\text{AnomalyScore}(\mathbf{z}_{\text{new}};\boldsymbol{\theta})=\text{Clip}\bigg(\frac{D\big(\phi(\mathbf{z}_{\text{new}};\boldsymbol{\theta}),\mathbf{c}\big)}{R},0,1\bigg) \tag{7}$$

where $R=\max_{\mathbf{z}_{i}\in\mathcal{Z}_{\text{normal}}}D\big(\phi(\mathbf{z}_{i};\boldsymbol{\theta}),\mathbf{c}\big)$ is the maximal radius of the normal embeddings in the set $\mathcal{Z}_{\text{normal}}$ of the training set, i.e., the radius of the learned hypersphere, and the Clip function is used to prevent extreme values from dominating the anomaly scores.
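The scoring rule in Eq. (7) reduces to a ratio followed by a clip. The following sketch operates on precomputed embeddings; the helper names and the choice of Euclidean distance for $D$ are assumptions for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fit_radius(normal_embeddings, center, dist=euclidean):
    """R: largest distance from the center among normal training embeddings."""
    return max(dist(z, center) for z in normal_embeddings)


def anomaly_score(embedding, center, radius, dist=euclidean):
    """Distance to the center relative to R, clipped to the range [0, 1]."""
    ratio = dist(embedding, center) / radius
    return min(max(ratio, 0.0), 1.0)
```

With this definition, any embedding on or outside the learned hypersphere saturates at a score of 1, while points inside it are ranked by their relative distance from the center.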

Remark. Our method is different from DeepSAD (Ruff et al., 2019), which only regularizes the distance between the labeled samples and the center c, without explicit manipulation on the boundary of normal data. In contrast, the proposed SemiDOC uses negative sampling in Eq. (6) to explicitly tighten the boundary of normal embeddings, facilitating the detection of hard anomalies that are close to the boundary. We empirically demonstrate the superiority of SemiDOC in Sec. 5.4.3.

4.2.2. Supervised Anomaly Detector

Given the intricate anomalous patterns and the potential presence of noise in the unlabeled set $\mathcal{X}^{u}$, a semi-supervised model may be biased and fail to learn an accurate boundary. To address this, we add a robust supervised model that provides another view of the class boundary, and combine it with SemiDOC in an ensemble in Sec. 4.2.3. Our empirical findings in Sec. 5 support the superiority of such a hybrid model.

We employ LightGBM (Ke et al., 2017) as the supervised detector. LightGBM is a boosting tree-based ensemble method that has high accuracy and efficiency. As a binary classifier, it has been demonstrated to be useful in anomaly detection tasks (Vargaftik et al., 2021; Han et al., 2022), and has been deployed in many production pipelines of fraud prevention systems. Similar methods such as XGBoost (Chen and Guestrin, 2016) and CatBoost (Prokhorenkova et al., 2018) are also compatible with our framework. We selected LightGBM for its better efficiency and superior accuracy. By feeding a feature $\mathbf{z}_{i}$ to LightGBM, we use the predicted probability of the anomalous class as the anomaly score.

4.2.3. Hybrid Model

Ensembling has been found to be a powerful regularization technique for performance improvement (Oreshkin et al., 2019). We ensemble SemiDOC and LightGBM to form a hybrid model for anomaly detection. The core property of an ensemble is diversity. Thus we perform a bagging procedure (Breiman, 1996) by including models (SemiDOC or LightGBM) trained with different random initializations.

As for the ensemble aggregation function, the widely used approach is to take the mean of the anomaly scores from the different models in the hybrid. However, because SemiDOC is a one-class model, a high score from Eq. (7) does not necessarily indicate an anomaly; it only indicates that an unknown entity differs from the normal majority of the training set. Thus taking the mean of the scores of SemiDOC and LightGBM may generate a high score for unknown but normal entities, leading to more false positives.

To alleviate this, we propose a sequential approach with two steps. First, SemiDOC is used to filter out normal entities, i.e., those with scores lower than a threshold. Second, the entities about which SemiDOC is less confident (i.e., those with high scores) are sent to LightGBM for anomaly detection. If there are multiple SemiDOCs (or LightGBMs) in the hybrid, the mean score of the SemiDOCs (or LightGBMs) is used in the two steps.
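The difference between the two aggregation schemes can be sketched as below. The function names and the `filter_threshold` parameter are hypothetical, and one detail is an assumption not stated in the text: entities filtered out by SemiDOC are given a final score of 0.0 here.

```python
def mean_hybrid(semidoc_scores, lgbm_scores):
    """Mean-based aggregation: average the scores of all models."""
    all_scores = list(semidoc_scores) + list(lgbm_scores)
    return sum(all_scores) / len(all_scores)


def sequential_hybrid(semidoc_scores, lgbm_scores, filter_threshold):
    """Sequential aggregation: SemiDOC filters, LightGBM decides the rest.

    semidoc_scores / lgbm_scores: per-model scores for a single entity.
    filter_threshold: SemiDOC score below which the entity is treated as
        normal (hypothetical parameter name).
    """
    semidoc_mean = sum(semidoc_scores) / len(semidoc_scores)
    if semidoc_mean < filter_threshold:
        return 0.0  # assumption: filtered entities receive the lowest score
    return sum(lgbm_scores) / len(lgbm_scores)
```

For an entity that looks unfamiliar to SemiDOC but normal to LightGBM, the mean-based score is pulled upward by the SemiDOC scores, whereas the sequential score follows LightGBM alone, which is the false-positive argument made above.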

In our experiments, we evaluated both mean-based and sequential models, which are named as MELODY-M and MELODY-S.

4.3. Time Complexity Analysis

We analyzed the time complexity of OFE and SemiAD in detail in Appendix A.1. In summary, the time complexity of MELODY for anomaly inference is approximately $O(T)$, which is efficient as $T$ can be fixed as a constant length of historical time series.

5. Experiments

5.1. Datasets

Table 1. Statistics of the dataset.

| Dataset | # entities | # time series | # anomalies |
|---|---|---|---|
| Hard-Labeled set | 4,966 | 234,508 | 288 (5.8%) |
| Soft-Labeled set | 4,966 | 234,508 | 544 (11.0%) |
| Naive-Labeled set | 4,688 | 220,320 | 544 (11.6%) |
| Unlabeled set | 27,590 | 1,034,711 | NA |

Table 2. The performance on anomaly detection of the compared methods. ↑ means higher is better. ↓ means lower is better. The best and second-best results in the F1 columns (i.e., the overall performance metric) are in bold and underlined, respectively.

| Group | Method | Hard F1 ↑ | Hard Prec. ↑ | Hard Recall ↑ | Hard FPR ↓ | Soft F1 ↑ | Soft Prec. ↑ | Soft Recall ↑ | Soft FPR ↓ | Naive F1 ↑ | Naive Prec. ↑ | Naive Recall ↑ | Naive FPR ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MTS | OmniAnomaly | 0.206 | 0.126 | 0.573 | 0.241 | 0.294 | 0.193 | 0.622 | 0.317 | 0.311 | 0.209 | 0.608 | 0.296 |
| MTS | AnomalyTrans | 0.217 | 0.132 | 0.616 | 0.246 | 0.298 | 0.210 | 0.514 | 0.236 | 0.316 | 0.227 | 0.521 | 0.229 |
| MTS | TranAD | 0.207 | 0.126 | 0.573 | 0.240 | 0.297 | 0.194 | 0.629 | 0.318 | 0.309 | 0.209 | 0.594 | 0.289 |
| General | OFE+DeepSVDD | 0.188 | 0.168 | 0.357 | 0.174 | 0.286 | 0.203 | 0.666 | 0.422 | 0.357 | 0.274 | 0.533 | 0.190 |
| General | OFE+DeepSVDD-B | 0.221 | 0.176 | 0.348 | 0.107 | 0.316 | 0.214 | 0.717 | 0.379 | 0.332 | 0.231 | 0.631 | 0.283 |
| General | OFE+DeepSAD | 0.337 | 0.244 | 0.555 | 0.105 | 0.422 | 0.377 | 0.497 | 0.104 | 0.439 | 0.398 | 0.494 | 0.096 |
| General | OFE+DeepSAD-B | 0.355 | 0.274 | 0.508 | 0.082 | 0.437 | 0.399 | 0.485 | 0.089 | 0.449 | 0.390 | 0.532 | 0.107 |
| General | OFE+RF | 0.382 | 0.300 | 0.535 | 0.077 | 0.455 | 0.397 | 0.533 | 0.098 | 0.466 | 0.475 | 0.458 | 0.065 |
| General | OFE+RF-B | 0.362 | 0.279 | 0.521 | 0.082 | 0.456 | 0.395 | 0.540 | 0.101 | 0.470 | 0.411 | 0.552 | 0.103 |
| General | OFE+LGBM | 0.397 | 0.331 | 0.502 | 0.062 | 0.491 | 0.442 | 0.553 | 0.085 | <u>0.518</u> | 0.481 | 0.564 | 0.078 |
| General | OFE+LGBM-B | 0.399 | 0.325 | 0.522 | 0.066 | 0.490 | 0.434 | 0.563 | 0.091 | 0.515 | 0.466 | 0.576 | 0.086 |
| Ours | MELODY-M | <u>0.411</u> | 0.333 | 0.540 | 0.066 | **0.499** | 0.436 | 0.584 | 0.092 | 0.514 | 0.464 | 0.577 | 0.086 |
| Ours | MELODY-S | **0.432** | 0.393 | 0.485 | 0.045 | <u>0.493</u> | 0.452 | 0.546 | 0.081 | **0.544** | 0.482 | 0.625 | 0.086 |

We sampled real data of the deployments of a variety of services from Amazon AWS between April 7, 2022 and Dec. 29, 2022. This dataset contains 4,966 labeled deployments in $\mathcal{X}^{l}$ and 27,590 unlabeled deployments in $\mathcal{X}^{u}$. The deployments in $\mathcal{X}^{l}$ were labeled by the labeling service as described in Sec. 3.1. Because one deployment may be scored by multiple human judges on a scale of -1 to 3, we used three approaches to aggregate and binarize the labels: (1) Hard Labels: $\mathbf{y}_{i}=1$ if all labelers scored $\mathcal{X}_{i}$ as 3; $\mathbf{y}_{i}=0$ otherwise; (2) Soft Labels: $\mathbf{y}_{i}=1$ if the scores of $\mathcal{X}_{i}$ are either 2 or 3; $\mathbf{y}_{i}=0$ otherwise; (3) Naive Labels: the same as Soft Labels except that $\mathbf{y}_{i}=0$ if the scores of $\mathcal{X}_{i}$ are either 0 or 1 (scores of -1 were excluded). For each deployment, there are 22 monitored metrics, such as Threads, CPU usage, and Memory usage, and 8 meta-data features, such as number of services and number of hosts. For each metric, we used its two days of observations prior to the launch of deployment $\mathcal{X}_{i}$ as its history $\mathbf{x}_{i,j}\in\mathbb{R}^{T}$ for the OFE module. Because the observations were collected every minute, $T=2\times 24\times 60=2880$. The lengths of different deployments after launch could be different, and the average length is 16.1 minutes. Table 1 summarizes the statistics of the dataset.
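The three binarization schemes can be written down as a small function. This is one possible reading rather than the authors' code: the function name `binarize` is hypothetical, "the scores are either 2 or 3" is interpreted as every judge giving 2 or 3, a deployment with any score of -1 is assumed to be dropped under Naive Labels, and mixed cases not covered by the text default to 0.

```python
def binarize(scores, scheme):
    """Return 1 (anomalous), 0 (normal), or None (deployment excluded).

    scores: non-empty list of integer judge scores in {-1, 0, 1, 2, 3}.
    scheme: "hard", "soft", or "naive".
    """
    if scheme == "hard":
        return int(all(s == 3 for s in scores))
    if scheme == "soft":
        return int(all(s in (2, 3) for s in scores))
    if scheme == "naive":
        # Assumption: any -1 score removes the deployment from the set.
        if any(s == -1 for s in scores):
            return None
        return int(all(s in (2, 3) for s in scores))
    raise ValueError(f"unknown scheme: {scheme}")
```

Under this reading the Soft and Naive schemes share the same positives and differ only in which deployments remain as negatives, which is consistent with the equal anomaly counts and the smaller Naive-Labeled set in Table 1.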

5.2. Experimental Setup

5.2.1. Baselines

We compare MELODY with a variety of anomaly detection (AD) methods from two groups: MTS-based methods and General AD methods. For the first group, because our entity-level AD problem does not assume the availability of point-level labels on MTS, we only evaluated unsupervised SOTA methods: (1) OmniAnomaly (Su et al., 2019), (2) AnomalyTransformer (Xu et al., 2021), (3) TranAD (Tuli et al., 2022). We applied these methods on the MTS of each deployment individually. Following (Xu et al., 2021), once an anomalous time point is detected, the deployment is considered anomalous. For the second group, we included an unsupervised method: (4) DeepSVDD (Ruff et al., 2018), a semi-supervised method: (5) DeepSAD (Ruff et al., 2019), and supervised methods: (6) RandomForest (RF) (Breiman, 2001), (7) LightGBM (Ke et al., 2017), which can use the entity-level labels by applying them on the features generated by our OFE module. Thus they were named with the prefix “OFE+”, e.g., OFE+LGBM. For fair comparison, the bootstrap ensemble versions of the methods in the second group were included and named with the suffix “-B”, e.g., OFE+LGBM-B.

For our method, we considered two variants: MELODY-M and MELODY-S, as discussed in Sec. 4.2.3. We used Hamming distance for $D(\cdot,\cdot)$ in Eq. (5). The hyperparameter $\delta$, i.e., the threshold in Eq. (6), was grid searched in {1, 10, 100} using the validation set. By default, we used 3 LightGBMs and 3 SemiDOCs in our method, for a total of 6 models. For fair comparison, each ensemble baseline used 6 models. In Sec. 5.4.1 and Sec. 5.4.3, we performed a hyperparameter study and an ablation analysis to evaluate the design choices of MELODY. Its implementation details are in Appendix A.2.

5.2.2. Evaluation

We randomly split the labeled set $\mathcal{X}^{l}$ into 60%/20%/20% train/validation/test sets 5 times, and ran each method on these 5 splits to evaluate the average performance. The unlabeled set $\mathcal{X}^{u}$ was only used for training semi-supervised methods, as it does not have valid labels for validation and testing.
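The repeated random split can be sketched as follows. The function name, the use of integer seeds 0 to 4, and the rounding rule (floor for train and validation, remainder to test) are assumptions for illustration; the text does not specify them.

```python
import random


def split_indices(n, seed):
    """Shuffle n item indices and cut them 60% / 20% / 20%."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Five independent splits of the 4,966 labeled deployments.
splits = [split_indices(4966, seed) for seed in range(5)]
```

Each method would then be trained and evaluated once per split, and the reported numbers averaged over the five runs.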

Following (Xu et al., 2021; Huang et al., 2022), we use Precision, Recall, and F1-score to evaluate the AD performance of the compared methods. Additionally, we added False Positive Rate (FPR) because in a rollback system, it is important to avoid false positives, i.e., unnecessary rollbacks, that lead to slow deployments and poor user experience. Following (Huang et al., 2022), for each method, we select the threshold with the best F1-score.
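The four metrics and the best-F1 threshold selection can be sketched as follows. This is an illustrative implementation, assuming that higher scores mean more anomalous and that every distinct score is tried as a candidate threshold; the function names and the toy data are hypothetical.

```python
def metrics(labels, scores, threshold):
    """Precision, recall, F1, and FPR for predictions score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def best_f1_threshold(labels, scores):
    """Pick the candidate threshold (a distinct score) with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: metrics(labels, scores, t)["f1"])


# Toy example: 1 marks an anomalous deployment.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]
best = best_f1_threshold(labels, scores)
best_metrics = metrics(labels, scores, best)
```

In this toy example the selected threshold keeps all three anomalies at the cost of one false positive, which is exactly the kind of trade-off the FPR column is meant to expose.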

5.3. Experimental Results

Table 2 summarizes the average results of the compared methods on the test sets, from which we make several observations. First, the methods using OFE generally outperform MTS-based methods, indicating that the unsupervised MTS-based AD methods are not applicable to entity-level anomaly detection, and that it is important to align different entities to the same feature space. Our OFE module provides a foundation to accomplish this task. Second, for the general AD methods, their ensemble versions (with “-B”) are overall better than the original methods (e.g., in terms of F1), indicating the usefulness of ensembling in entity-level anomaly detection. Third, OFE+DeepSAD outperforms OFE+DeepSVDD, indicating that a semi-supervised method is better than an unsupervised method. Fourth, both OFE+RF and OFE+LGBM outperform OFE+DeepSAD, indicating that the labeled set may be more important than the unlabeled set. Finally, both MELODY-M and MELODY-S outperform the baselines in most cases in terms of F1, e.g., MELODY-S has a 7.6% to 56.5% relative improvement in F1 on the Hard Labels dataset. This demonstrates the effective use of both labeled and unlabeled sets by the proposed methods. In particular, MELODY-S is superior in precision and FPR in most cases. This is attributed to the sequential design of its ensembling (Sec. 4.2.3), where SemiDOC can effectively filter out true normal entities, resulting in fewer false positives and higher precision while maintaining a reasonable recall.

5.4. More Details on Effectiveness

5.4.1. Parameter Analysis

Figure 3. The performance of MELODY-S w.r.t. (a) the ensemble size, and (b) the margin $\delta$ in Eq. (6).

There are two major hyperparameters in MELODY: the number of model instances in the hybrid and the margin threshold $\delta$ in the hinge loss in Eq. (6). In this section, we use MELODY-S to evaluate the influence of these parameters (MELODY-M behaves similarly and is omitted for brevity). Fig. 3 presents the change of F1 scores w.r.t. the two parameters on the three labeled sets. In Fig. 3(a), the number represents the number of either SemiDOC or LightGBM instances, e.g., 3 indicates 3 SemiDOC + 3 LightGBM. We can see that using either 3 or 5 is better than 1, indicating that MELODY-S also benefits from ensembling. Also, the performance only marginally improves or degrades once the number is larger than 3, validating our choice in Sec. 5.2.1. From Fig. 3(b), we can see that MELODY-S is not very sensitive to $\delta$. A small margin (e.g., 0.1) is insufficient for distinguishing normal and abnormal entities, and a margin that is too large (e.g., 1000) may lead to overfitting. Thus a proper choice is $\delta = 100$, which is our setup for MELODY.

5.4.2. Visualization of Embeddings

Figure 4. The tSNE visualization of the embeddings of SemiDOC using Hard Labels on (a) the train set, (b) the test set.

As discussed in Sec. 4.2.1, SemiDOC uses a small number of labeled anomalies to tighten the boundary of the hypersphere of normal entities. To understand how it works, we uniformly sampled a batch of training data and took all test data for visualization (the full training set is too large to visualize). The embeddings of SemiDOC were visualized using tSNE (Van der Maaten and Hinton, 2008) in a 2D space. Fig. 4 presents the distribution of the training and test embeddings. From Fig. 4(a), SemiDOC effectively pushed anomalies away from the boundary of normal entities. As designed by the negative sampling in Eq. (5), the anomalies do not need to form a cluster but are used to shape the boundary of the normal class. From Fig. 4(b), SemiDOC generalizes well on the test set. Although the anomalies overlap with a small number of normal entities, the majority of the normal class is distinguishable. This explains the design of MELODY-S, where SemiDOC first filters out most of the normal entities (with a few false negatives), and LightGBM classifies the remaining normal and abnormal entities.

5.4.3. Ablation Analysis

Table 3. The ablation analysis in terms of F1 score (relative change w.r.t. MELODY-S). H-Dist. is Hamming distance. E-Dist. is Euclidean distance.

| Model | Hard | Soft | Naive |
| --- | --- | --- | --- |
| MELODY-S | 0.432 | 0.493 | 0.544 |
| (a) - Rule features | 0.366 (-15.3%) | 0.456 (-7.5%) | 0.497 (-8.6%) |
| (b) - Alg. features | 0.384 (-11.1%) | 0.490 (-0.6%) | 0.521 (-4.2%) |
| (c) - Meta features | 0.406 (-6.0%) | 0.485 (-1.6%) | 0.525 (-3.5%) |
| (d) SemiOC → DSVDD | 0.406 (-6.0%) | 0.480 (-2.6%) | 0.515 (-5.3%) |
| (e) SemiOC → DSAD | 0.403 (-6.7%) | 0.490 (-0.6%) | 0.524 (-3.7%) |
| (f) H-Dist. → E-Dist. | 0.404 (-6.5%) | 0.200 (-59.4%) | 0.437 (-19.7%) |

In this section, we evaluate several variants of MELODY-S to validate the design choices of its OFE and SemiAD modules. Table 3 summarizes the test results of our ablation analysis. In (a)-(c), we removed each of the three featurizers in OFE (Sec. 4.1.1) in turn. In (d) and (e), we replaced SemiDOC in the SemiAD module with DeepSVDD and DeepSAD, respectively. In (f), we replaced the Hamming distance with Euclidean distance in Eq. (5) and Eq. (6) to evaluate the choice of distance function. First, in (a)-(c), we observe that removing any of the featurizers degrades the performance of MELODY-S. Removing the Rule/Algorithmic featurizers generally has more impact than removing the Meta featurizer because they are more fine-grained and dynamic at the time series level. In (d) and (e), we observe that SemiDOC is better than DeepSVDD and DeepSAD for anomaly detection, validating the design of the negative sampling based regularization in Eq. (6). Finally, (f) suggests that Hamming distance, which is commonly used in embedding retrieval (Song et al., 2018), is more suitable than Euclidean distance in MELODY-S for entity-level anomaly detection. Therefore, the results in Table 3 justify the design choices of our method.
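As a worked check of how the relative changes in Table 3 are obtained, the percentage in parentheses is the F1 difference to MELODY-S divided by the F1 of MELODY-S. The symbol $\Delta_{\text{rel}}$ is introduced here only for illustration; the two numerical examples use rows (a) and (f) of the table.

```latex
\[
\Delta_{\text{rel}} = \frac{F1_{\text{variant}} - F1_{\text{MELODY-S}}}{F1_{\text{MELODY-S}}},
\qquad
\frac{0.366 - 0.432}{0.432} \approx -15.3\% \ \text{(row (a), Hard)},
\qquad
\frac{0.200 - 0.493}{0.493} \approx -59.4\% \ \text{(row (f), Soft)}.
\]
```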

5.5. Human Evaluation

Table 4. Online evaluation of the proposed model.

| Model | # Deployments | Precision |
| --- | --- | --- |
| The Existing Model | 64 | 29.7% |
| Ours | 64 | 35.9% |

To evaluate the impact of the proposed MELODY model in real applications, we ran an online A/B experiment, where the control group is a LightGBM-based model running in the existing rollback system, and the treatment group is our MELODY-based model. For fair comparison, the existing model and the proposed model were each configured to take 50% of the deployment traffic between Jun. 8, 2023, and Jul. 7, 2023. The user of each deployment can score a rollback using the labeling service described in Sec. 3.1. In this period, 128 scored rollbacks were collected, with 64 for each model. We binarized the scores in the same way as Naive Labels, and calculated the precision of each model in Table 4. Recall (and F1) cannot be computed because there is no ground truth for the total number of true positives. Because the scores were provided voluntarily, and users tend to provide feedback for wrong decisions, the precisions in Table 4 are generally lower than those in Table 2. However, the higher precision of the proposed model indicates that it agrees with the users more often than the existing model in the system, suggesting its effectiveness in online systems.
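The precisions in Table 4 can be related back to counts of correct rollbacks. The counts below are not reported in the paper; they are inferred here as the only integers out of 64 scored rollbacks that round to the reported percentages, so they should be read as an assumption.

```python
n_scored = 64  # scored rollbacks per model, as reported
reported = {"existing": 0.297, "melody": 0.359}

# Inferred (not reported) number of rollbacks that users judged correct.
inferred_correct = {name: round(p * n_scored) for name, p in reported.items()}

# Recompute the precision from the inferred counts as a consistency check.
recomputed = {name: f"{count / n_scored:.1%}"
              for name, count in inferred_correct.items()}
```

Under this reading, the two models differ by four correct rollbacks out of 64, which is worth keeping in mind when interpreting the gap on such a small sample.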

6. Related Work

To the best of our knowledge, this is the first work on online detection of anomalies in streaming entities where each entity owns its specific MTS. Several works claim entity-level anomaly detection (AD) (Su et al., 2019; Huang et al., 2022), but their methods aim to detect anomalous time points of the MTS emitted by a few entities such as cloud servers. They are therefore still point-level methods and cannot be applied to streaming entities, due to the non-shareable, entity-specific model in (Su et al., 2019) or the need for point-level supervision in (Huang et al., 2022).

Typical AD methods include TS-based methods and general methods that can be applied to non-TS vectorial data such as images, as summarized in surveys (Schmidl et al., 2022) and (Han et al., 2022), respectively. Among them, MTS-based AD methods are most relevant, and most of them were designed to be unsupervised. Traditional MTS methods apply general AD techniques such as LOF and iForest on subsequences of MTS. Recent works use RNNs or Transformers to encode MTS, and have developed reconstruction-based methods (Park et al., 2018; Su et al., 2019; Xu et al., 2021), forecasting-based methods (Malhotra et al., 2015; Munir et al., 2018; Hundman et al., 2018), and generative methods (Li et al., 2023). Moreover, graph-based AD methods have been proposed for MTS by encoding and tracking the change in the correlations between variates using GNN (Zhao et al., 2020; Deng and Hooi, 2021) or CNN (Zhang et al., 2019). However, these methods cannot readily take advantage of labels, if available. Several recent TS methods were proposed for semi-supervised AD (Jiang et al., 2021; Chen et al., 2023), but they assume point-level labels are available for locating anomalous time points or segments. This is not the case for entity-level AD as described in Challenge 4 in Sec. 1, where labels only mark anomalous entities. Because none of the aforementioned methods can be used to address the challenges of entity-level AD as discussed in Sec. 1, a proper solution such as the proposed MELODY is in demand.

As suggested by (Han et al., 2022), leveraging a small number of labels for semi-supervised AD has great potential. Recently, general semi-supervised AD methods have been proposed for images (Ruff et al., 2019; Huang et al., 2021), texts (Zhou et al., 2023), and tabular data (Yoon et al., 2022). They cannot be used to solve our problem unless they are embedded into our MELODY framework, as was done for DeepSAD (Ruff et al., 2019) in Table 3. This suggests the importance and flexibility of the entire MELODY framework for entity-level AD.

7. Conclusion

In this paper, we introduced MELODY, a semi-supervised framework for online entity-level anomaly detection. MELODY uses an online feature extractor to align the MTS of different entities to the same feature space, and a hybrid model, SemiAD, for detecting anomalous entities. In SemiAD, SemiDOC was proposed for tightening the boundary of normal entities by negative sampling. It is combined with a supervised detector for robust detection. Comprehensive experiments on large-scale datasets indicate that MELODY outperforms the SOTA methods, and the human evaluation further suggests its effectiveness in large online systems.

References

  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Balalaie et al. (2016) Armin Balalaie, Abbas Heydarnoori, and Pooyan Jamshidi. 2016. Microservices architecture enables devops: Migration to a cloud-native architecture. Ieee Software 33, 3 (2016), 42–52.
  • Basu and Meckesheimer (2007) Sabyasachi Basu and Martin Meckesheimer. 2007. Automatic outlier detection for time series: an application to sensor data. Knowledge and Information Systems 11 (2007), 137–154.
  • Breiman (1996) Leo Breiman. 1996. Bagging predictors. Machine learning 24 (1996), 123–140.
  • Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45 (2001), 5–32.
  • Chen et al. (2023) Ningjiang Chen, Huan Tu, Xiaoyan Duan, Liangqing Hu, and Chengxiang Guo. 2023. Semisupervised anomaly detection of multivariate time series based on a variational autoencoder. Applied Intelligence 53, 5 (2023), 6074–6098.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
  • Deng and Hooi (2021) Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4027–4035.
  • Dragoni et al. (2017) Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara, Fabrizio Montesi, Ruslan Mustafin, and Larisa Safina. 2017. Microservices: yesterday, today, and tomorrow. Present and ulterior software engineering (2017), 195–216.
  • Du et al. (2017) Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security. 1285–1298.
  • Gan et al. (2019) Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices. In Proceedings of the twenty-fourth international conference on architectural support for programming languages and operating systems. 19–33.
  • Han et al. (2022) Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang, and Yue Zhao. 2022. Adbench: Anomaly detection benchmark. Advances in Neural Information Processing Systems 35 (2022), 32142–32159.
  • Huang et al. (2021) Chaoqin Huang, Fei Ye, Peisen Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. 2021. ESAD: end-to-end semi-supervised anomaly detection. Restoration 69, 70 (2021), 71.
  • Huang et al. (2022) Tao Huang, Pengfei Chen, and Ruipeng Li. 2022. A semi-supervised vae based active anomaly detection framework in multivariate time series for online systems. In Proceedings of the ACM Web Conference 2022. 1797–1806.
  • Hundman et al. (2018) Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In KDD. 387–395.
  • Jiang et al. (2021) Jehn-Ruey Jiang, Jian-Bin Kao, and Yu-Lin Li. 2021. Semi-supervised time series anomaly detection based on statistics and deep learning. Applied Sciences 11, 15 (2021), 6698.
  • Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Li et al. (2023) Yuxin Li, Wenchao Chen, Bo Chen, Dongsheng Wang, Long Tian, and Mingyuan Zhou. 2023. Prototype-oriented unsupervised anomaly detection for multivariate time series. In International conference on machine learning. PMLR.
  • Liu et al. (2015) Dapeng Liu, Youjian Zhao, Haowen Xu, Yongqian Sun, Dan Pei, Jiao Luo, Xiaowei Jing, and Mei Feng. 2015. Opprentice: Towards practical and automatic anomaly detection through machine learning. In Proceedings of the 2015 internet measurement conference. 211–224.
  • Malhotra et al. (2015) Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, Puneet Agarwal, et al. 2015. Long Short Term Memory Networks for Anomaly Detection in Time Series.. In ESANN, Vol. 2015. 89.
  • Meng et al. (2019) Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. 2019. Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs.. In IJCAI, Vol. 19. 4739–4745.
  • Munir et al. (2018) Mohsin Munir, Shoaib Ahmed Siddiqui, Andreas Dengel, and Sheraz Ahmed. 2018. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. Ieee Access 7 (2018), 1991–2005.
  • Nedelkoski et al. (2019) Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. 2019. Anomaly detection from system tracing data using multimodal deep learning. In 2019 IEEE 12th International Conference on Cloud Computing (CLOUD). IEEE, 179–186.
  • Oreshkin et al. (2019) Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
  • Park et al. (2018) Daehyung Park, Yuuna Hoshi, and Charles C Kemp. 2018. A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder. IEEE Robotics and Automation Letters 3, 3 (2018), 1544–1551.
  • Prokhorenkova et al. (2018) Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems 31 (2018).
  • Ruff et al. (2018) Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep one-class classification. In International conference on machine learning. PMLR, 4393–4402.
  • Ruff et al. (2019) Lukas Ruff, Robert A Vandermeulen, Nico Görnitz, Alexander Binder, Emmanuel Müller, Klaus-Robert Müller, and Marius Kloft. 2019. Deep Semi-Supervised Anomaly Detection. In International Conference on Learning Representations.
  • Schmidl et al. (2022) Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly detection in time series: a comprehensive evaluation. Proceedings of the VLDB Endowment 15, 9 (2022), 1779–1797.
  • Schölkopf et al. (2001) Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural computation 13, 7 (2001), 1443–1471.
  • Song et al. (2018) Dongjin Song, Ning Xia, Wei Cheng, Haifeng Chen, and Dacheng Tao. 2018. Deep r-th root of rank supervised joint binary embedding for multivariate time series retrieval. In KDD. 2229–2238.
  • Su et al. (2019) Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2828–2837.
  • Tax and Duin (2004) David MJ Tax and Robert PW Duin. 2004. Support vector data description. Machine learning 54 (2004), 45–66.
  • Tuli et al. (2022) Shreshth Tuli, Giuliano Casale, and Nicholas R Jennings. 2022. TranAD: deep transformer networks for anomaly detection in multivariate time series data. Proceedings of the VLDB Endowment 15, 6 (2022), 1201–1214.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
  • Vargaftik et al. (2021) Shay Vargaftik, Isaac Keslassy, Ariel Orda, and Yaniv Ben-Itzhak. 2021. RADE: resource-efficient supervised anomaly detection using decision tree-based ensemble methods. Machine Learning 110, 10 (2021), 2835–2866.
  • Wang et al. (2018) Ping Wang, Jingmin Xu, Meng Ma, Weilan Lin, Disheng Pan, Yuan Wang, and Pengfei Chen. 2018. Cloudranger: Root cause identification for cloud native systems. In 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE, 492–502.
  • Xu et al. (2018) Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 world wide web conference. 187–196.
  • Xu et al. (2021) Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2021. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. In International Conference on Learning Representations.
  • Yoon et al. (2022) Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O Arik, and Tomas Pfister. 2022. SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch. Transactions on Machine Learning Research (2022).
  • Zhang et al. (2019) Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 1409–1416.
  • Zhao et al. (2020) Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Multivariate time-series anomaly detection via graph attention network. In ICDM. IEEE, 841–850.
  • Zhou et al. (2023) Yixuan Zhou, Peiyu Yang, Yi Qu, Xing Xu, Fumin Shen, and Heng Tao Shen. 2023. AnoOnly: Semi-Supervised Anomaly Detection without Loss on Normal Data. arXiv preprint arXiv:2305.18798 (2023).

Appendix A Appendix

A.1. Time Complexity Analysis

As an online anomaly detection system, the efficiency for online inference is important. For a new deployment, the computational load of MELODY for online anomaly detection includes the computation for featurizer instances (Sec. 4.1.1) and the inference using SemiAD model. In this section, we first analyze the time complexity for initializing each of the featurizers in Sec. 4.1.1 and their score computation, then analyze the time complexity for SemiAD model inference.

Rule-based Featurizer. At the initialization stage, SbF learns the mean $\mu$ and standard deviation $\sigma$ of all $T$ values in the history $\mathbf{x}_{i,j}$, so its complexity is $O(1)$. TbF sets a predefined threshold $\tau$ with complexity $O(1)$. CbF sets up a sliding window of size $w_c$ with complexity $O(1)$. At the inference stage, SbF and TbF check whether $x_{i,j}^{t}$ ($t > T$) is above their thresholds, so the complexity is $O(1)$. Similarly, CbF checks whether $x_{i,j}^{t}$ is a missing value with complexity $O(1)$. Therefore, the complexity of the Rule-based Featurizer is $O(1)$.
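The constant-time per-point checks of the three rule-based featurizers can be sketched as follows. This is a sketch under stated assumptions, since Sec. 4.1.1 defines the actual rules: the upper bound $\mu + k\sigma$ with a hypothetical multiplier `k`, the binary 0/1 outputs, and the interpretation of CbF as a running count of missing values in its window are all illustrative choices.

```python
import math
from collections import deque
from statistics import mean, pstdev


def is_missing(x):
    return x is None or (isinstance(x, float) and math.isnan(x))


class SbF:
    """Statistics-based rule: flag points above mu + k * sigma (k is assumed)."""
    def __init__(self, history, k=3.0):
        self.upper = mean(history) + k * pstdev(history)

    def score(self, x):
        return 0 if is_missing(x) else int(x > self.upper)


class TbF:
    """Threshold-based rule: flag points above a predefined tau."""
    def __init__(self, tau):
        self.tau = tau

    def score(self, x):
        return 0 if is_missing(x) else int(x > self.tau)


class CbF:
    """Count missing values in the last w_c points with O(1) updates."""
    def __init__(self, w_c):
        self.window = deque(maxlen=w_c)
        self.n_missing = 0

    def score(self, x):
        if len(self.window) == self.window.maxlen:
            self.n_missing -= self.window[0]  # oldest flag is about to drop out
        flag = int(is_missing(x))
        self.window.append(flag)
        self.n_missing += flag
        return self.n_missing


history = [10, 12, 11, 9, 10, 11, 12, 10]
sbf, tbf, cbf = SbF(history), TbF(tau=15), CbF(w_c=3)
sbf_scores = [sbf.score(x) for x in (11, 20)]
tbf_scores = [tbf.score(x) for x in (11, 20)]
cbf_scores = [cbf.score(x) for x in (1.0, None, None, 2.0, 3.0, 4.0)]
```

Each `score` call touches a fixed number of values regardless of the history length, which is what makes the per-point inference cost constant.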

Algorithm-based Featurizer. At the initialization stage, SubNN segments the length-$T$ history $\mathbf{x}_{i,j}$ into subsequences with window size $w_n$ and stride 1 to form a set $\mathcal{S}$ of length-$w_n$ subsequences, so its complexity is $O(T - w_n)$. MD only sets up a sliding window of size $w_m$, so its complexity is $O(1)$. At the inference stage, SubNN calculates the distance between a length-$w_n$ sliding window ending at time step $t$ and its nearest neighbor in $\mathcal{S}$, so the complexity is $O(T w_n)$. MD calculates $m_{\text{obs}} = \text{median}([x_{i,j}^{t-w_m}, \ldots, x_{i,j}^{t-1}])$ and $m_{\text{dif}} = \text{median}([x_{i,j}^{t-w_m+1} - x_{i,j}^{t-w_m}, \ldots, x_{i,j}^{t-1} - x_{i,j}^{t-2}])$ with complexity $O(w_m)$. Therefore, the complexity of the Algorithm-based Featurizer is $O(T w_n + w_m)$.
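The two algorithm-based computations can be sketched as follows. This is a minimal sketch, assuming Euclidean distance for the nearest-neighbor search (the distance used by SubNN is not restated in this appendix) and returning only the two medians for MD, since how they are turned into a feature is defined in Sec. 4.1.1. The function names are hypothetical.

```python
import math
from statistics import median


def subnn_init(history, w_n):
    """Cut the history into all length-w_n subsequences with stride 1."""
    return [history[s:s + w_n] for s in range(len(history) - w_n + 1)]


def subnn_score(window, subsequences):
    """Distance from the current window to its nearest historical subsequence.

    Scans about T subsequences of length w_n, hence O(T * w_n) per call.
    """
    return min(math.dist(window, s) for s in subsequences)


def md_medians(series, t, w_m):
    """Return (m_obs, m_dif) over the w_m points preceding time step t."""
    obs = series[t - w_m:t]                             # x^{t-w_m}, ..., x^{t-1}
    diffs = [b - a for a, b in zip(obs, obs[1:])]       # consecutive differences
    return median(obs), median(diffs)


history = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]
subsequences = subnn_init(history, w_n=3)
seen_score = subnn_score([1, 2, 3], subsequences)    # pattern exists in history
novel_score = subnn_score([1, 2, 9], subsequences)   # last point is unusual
m_obs, m_dif = md_medians(history, t=10, w_m=5)
```

A window that already occurred in the history has distance zero, whereas a window with an unusual value is far from every stored subsequence.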

Meta-Data Featurizer. The deployment-level meta-data features are static and can be retrieved and appended to rule-based and algorithm-based features with $O(1)$ time complexity.

On top of the featurizers, there is a time pooling layer (Sec. 4.1.2) for addressing the variable duration of different deployments and a feature aggregator (Sec. 4.1.3) for aligning different deployments into the same feature space. For the time pooling layer, both MaxPool and MeanPool in Eq. (1) were updated incrementally in constant time during online inference, and in parallel for different features, so its complexity is $O(1)$. For the feature aggregator, the key computational step in Eq. (2) was implemented in parallel for all metrics, and its time complexity for each metric is $O(n_i)$ for $n_i$ variates in the MTS. Additionally, the imputation step in the feature aggregator takes $O(n_i)$ time for filling missing values at a time step. Therefore, by parallelizing all featurizer instances, the overall time complexity of the OFE module during online inference is $O(T w_n + w_m + n_i)$.
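The constant-time incremental update of the two pooling operators can be sketched as follows. This is an illustrative sketch, assuming one pooling state per feature and a standard running-mean update; the class name `RunningPool` is hypothetical and Eq. (1) remains the authoritative definition.

```python
class RunningPool:
    """Maintain MaxPool and MeanPool over a stream with O(1) work per point."""

    def __init__(self):
        self.count = 0
        self.max = float("-inf")
        self.mean = 0.0

    def update(self, score):
        self.count += 1
        self.max = max(self.max, score)
        # Running mean: no need to store or re-scan earlier scores.
        self.mean += (score - self.mean) / self.count
        return self.max, self.mean


pool = RunningPool()
for featurizer_score in [0.0, 1.0, 0.0, 3.0]:
    pooled_max, pooled_mean = pool.update(featurizer_score)
```

Because the state is only a count, a maximum, and a mean, the pooled features stay well defined however long a deployment runs.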

SemiAD Model Inference. The SemiDOC module in SemiAD uses Eq. (7) for inferring anomaly scores, where $\mathbf{c}$ and $R$ were pre-computed during training and stored for model inference. So the key computation is the embedding $\phi(\mathbf{z}_{\text{new}}; \boldsymbol{\theta})$, which takes $O(d\,d_{\text{max}})$ for an MLP encoder with $d_{\text{max}}$ as the maximal dimension of all layers, where $d$ is the dimension of $\mathbf{z}_{\text{new}}$. In addition, the LightGBM in SemiAD uses $O(d)$ time for inference with a constant number of estimators (Ke et al., 2017). Therefore, the time complexity of both ensemble methods MELODY-M and MELODY-S is $O(d\,d_{\text{max}})$ at online inference time.
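To make the $O(d\,d_{\text{max}})$ cost of the embedding concrete, the sketch below runs a forward pass of a small MLP encoder and counts its multiplications. It is only a sketch: the weights are random, LayerNorm is omitted for brevity, and the layer sizes (134 inputs, two layers of width 128) follow the values quoted in this appendix; it does not reproduce the scoring rule of Eq. (7).

```python
import math
import random


def leaky_relu(v, slope=0.1):
    return v if v > 0 else slope * v


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(x, weights, bias, activation):
    """One fully connected layer: len(weights) * len(x) multiplications."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def encoder_forward(z, layers):
    h = z
    for i, (weights, bias) in enumerate(layers):
        last = i == len(layers) - 1
        h = dense(h, weights, bias, sigmoid if last else leaky_relu)
    return h


def multiply_count(dims):
    """Multiplications in a forward pass through layers of the given sizes."""
    return sum(n_in * n_out for n_in, n_out in zip(dims, dims[1:]))


rng = random.Random(0)
dims = [134, 128, 128]  # d = 134 input features, d_max = 128
layers = [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
           [0.0] * n_out)
          for n_in, n_out in zip(dims, dims[1:])]
z_new = [rng.random() for _ in range(dims[0])]
embedding = encoder_forward(z_new, layers)
n_mults = multiply_count(dims)
```

With a fixed number of layers, the multiplication count is a constant multiple of $d\,d_{\text{max}}$, independent of the history length $T$.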

Summary. Integrating the time complexity of initializing the featurizers, online computation of features, and SemiAD model inference, the time complexity of MELODY for online anomaly detection is $O(T w_n + w_m + d\,d_{\text{max}})$, where $T$ is the length of the historical time series, $w_n$ is the window size used by SubNN, $w_m$ is the window size used by MD, $d$ is the dimension of the features input to SemiDOC, and $d_{\text{max}}$ is the maximal dimension of all layers in the neural network $\phi(\cdot; \boldsymbol{\theta})$. In practice, $w_n$, $w_m$, and $d_{\text{max}}$ are set as small constants. According to Sec. 4.1.3, $d = 134$ is also a small constant. Therefore, the time complexity is approximately $O(T)$. In our experiments, we used 2-day historical time series at one-minute granularity, so $T = 2880$. We set $w_n = 100$, $w_m = 100$, and $d_{\text{max}} = 128$. With this setup, the model is efficient and can perform online anomaly detection.
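As a worked instance of the summary bound, the dominant terms can be evaluated with the values stated above. This is only a count of the leading terms; the constants hidden by the $O(\cdot)$ notation are ignored, so the result indicates an order of magnitude rather than a measured runtime.

```latex
\[
T = 2 \times 24 \times 60 = 2880,
\qquad
T w_n + w_m + d\,d_{\text{max}}
= 2880 \times 100 + 100 + 134 \times 128
= 288000 + 100 + 17152
= 305252 .
\]
```

The SubNN term $T w_n$ accounts for almost all of this count, which is consistent with the statement that the cost grows approximately linearly in $T$.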

A.2. Implementation Details

For the baseline methods, we employed their official code when available. In the group of MTS-based methods, OmniAnomaly used a 2-layer GRU, 3-layer encoder, and 3-layer decoder, with PReLU as the internal activation and Sigmoid as the output activation. Its embedding dimension was set as 8, and other hidden layers had a dimension of 32. Its hyperparameter for the KL divergence was $\beta = 0.01$. It was trained with the Adam optimizer (Kingma and Ba, 2014) with learning rate $2 \times 10^{-3}$ and weight decay $1 \times 10^{-5}$. AnomalyTransformer used a 3-layer transformer with 8 heads as the encoder. The embedding dimension was set as 512, and the activation function was GELU. Its hyperparameter was $\lambda = 3$. It was trained with the Adam optimizer with learning rate $1 \times 10^{-4}$. TranAD used a transformer encoder-decoder architecture with an embedding dimension of 16, appended with a fully connected output layer. It was trained with the Adam optimizer with learning rate $1 \times 10^{-3}$ and weight decay $1 \times 10^{-5}$.

In the group of general methods, both DeepSVDD and DeepSAD used a 2-layer neural network encoder with embedding dimension 128. LeakyReLU (slope 0.1) was used as the activation function in the input layers, and Sigmoid was used as the output activation. LayerNorm (Ba et al., 2016) was added in each layer. The hyperparameter of DeepSAD was set as $\eta = 1$. Both DeepSVDD and DeepSAD were optimized by Adam with learning rate $1 \times 10^{-3}$ and weight decay $1 \times 10^{-4}$. RandomForest used 100 estimators with maximum depth 2. LightGBM used 100 estimators with maximum depth 5 and a learning rate of 0.1.

For the proposed MELODY method, the neural network $\phi(\cdot; \boldsymbol{\theta})$ in SemiDOC in Eq. (5) was implemented with a two-layer encoder with embedding dimension 128. LeakyReLU (slope 0.1) was used as the activation function in the input layers, and Sigmoid was used as the output activation. LayerNorm was added in each layer. The architecture was the same as DeepSVDD and DeepSAD for fair comparison. The SemiDOC model was optimized by Adam with learning rate $1 \times 10^{-3}$, weight decay $1 \times 10^{-4}$, batch size 256, and a maximum of 500 epochs. Early stopping was employed using the validation set.
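The early-stopping procedure can be sketched as a generic training loop. This is a sketch under stated assumptions: the `patience` parameter and its value are hypothetical (the paper only states that early stopping used the validation set), and `train_epoch` and `validate` are placeholder callables standing in for the real SemiDOC training and validation steps.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=500, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch()
        loss = validate()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_loss


# Simulated validation losses standing in for a real model.
simulated_losses = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8])
best_epoch, best_loss = train_with_early_stopping(
    train_epoch=lambda: None,
    validate=lambda: next(simulated_losses),
    max_epochs=500,
    patience=2,
)
```

In a real run, the model weights from the best epoch would also be saved and restored before evaluating on the test set.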