CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems
Abstract
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies in enterprise-level microservice systems, it is challenging to locate anomalies promptly during service invocations, which causes intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph, whose hyperedges represent the flow of causality, and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare it with state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1), respectively, over its best counterpart.
Index Terms:
Microservice systems, root cause localization, multimodal data, graph neural network
I Introduction
Microservice architecture has recently surged in popularity due to its significant advantages for contemporary industrial service-oriented systems. This architectural design involves a collection of small, independent, and loosely coupled applications, where each application delivers a specific business function [1]. Specifically, each microservice is independently developed, deployed, and managed by a dedicated team, with communication facilitated through APIs and protocols such as REST, HTTP, and messaging systems. It is widely acknowledged that microservice architecture enhances flexibility, scalability, and fault tolerance [2, 3, 4].
However, fault localization and analysis in microservice systems often pose significant challenges due to high complexity [5]. In practice, operation engineers and developers are frequently required to use data such as distributed invocation histories and logging tools to quickly identify and isolate issues. In complex systems where microservices are interconnected, each method or function call might trigger multiple distributed service invocations, either synchronously or asynchronously. This complexity can overwhelm operation engineers, especially when they need to manually gather data from various sources during system failures, thereby complicating issue identification and resolution. Implementing trace, metric, and log functionalities, and understanding how to use these data collectively, are essential for effective management of complex microservice systems. Tracing allows for the tracking of a request’s path across various microservices, offering valuable insights into performance and the identification of bottlenecks [2]. Metrics, sometimes referred to as system monitoring metrics, provide quantitative data related to system performance, such as response times and error rates, which are crucial for performance optimization and capacity planning [6]. System logs, on the other hand, capture detailed information about system events and transactions, facilitating debugging and troubleshooting [7]. Together, traces, metrics, and logs form a comprehensive monitoring system covering the multimodal information needed to enhance visibility into the system’s behavior, thereby supporting performance monitoring, debugging, and overall system reliability.
Fig. 1 illustrates the execution process of service invocations within an enterprise-level microservice system, with trace, metric, and log data being monitored and recorded. The microservice trace data is visualized as a directed graph where each node represents an instance of a service such as a network gateway, message queue, database, or other business-specific service. Directed edges show the flow of requests between service and application instances. Metric data, such as response times or error rates, are captured as time series that illustrate changes over time. Logs are typically formatted according to the configuration of the log system and provide textual insights into instance activities, which can be analyzed using natural language processing techniques to extract semantic information.
As the complexity of systems continues to grow, there is an increasing emphasis on integrating Root Cause Analysis (RCA) methods to timely uncover the underlying reasons behind issues or failures [8, 9, 10]. RCA is particularly crucial in microservice systems due to their intricate and dynamic interactions between various service instances. When failures occur in large-scale microservice systems, they can spread rapidly, leading to significant disruptions, as evidenced by the 12-hour outage experienced by Didi, a major Chinese ride-hailing platform, in November 2023, resulting in a loss of 56.43 million [11]. RCA in microservice systems involves not just identifying general error types for further investigation but also pinpointing specific instance nodes along with their problematic logs and metrics [12, 13]. For instance, as illustrated in Fig. 1, a typical RCA framework collects the runtime information from the microservice system and outputs its error diagnosis, indicating the DB instance node circled in red as the root cause of the system failure. Additionally, the RCA framework outputs several logs from the error stack and detected anomaly metrics to facilitate further manual checks. However, traditional RCA methods only offer suggestive clues, necessitating detailed manual analysis to determine the exact root cause [14]. Furthermore, with the ongoing evolution of monolithic microservice architectures, new challenges that require handling multimodal information at runtime and the demands of robust anomaly diagnosis arise, underscoring the critical importance of effective RCA as a pressing field.
![Refer to caption](extracted/5697553/figures/arc.png)
To date, numerous models have been proposed for root cause analysis. One class of approaches explores the feasibility of building causality graphs using merely trace information, which consists of service invocation chains from various service instances. Based on this intuition, causal analysis algorithms such as PC [15] and GES [16] are employed to infer the causal relationships between service instances in the graph. Following that, a line of research [17, 18, 19, 20, 21, 22] integrates root cause analysis with machine learning models and graph neural networks (GNNs), in which graph learning techniques are leveraged to model the pairwise bivariate correlation of instances connected by directed invocation edges. In these works, multimodal data of service instances are used as supplementary features to enrich node embeddings and edge weights, while uncovering the bivariate correlation between pairs of instance nodes is treated as a homogeneous graph learning task on the trace topology, leading to the following challenges:
1. Information heterogeneity of microservice systems. The microservice system information composed of logs, metrics, and traces should be considered multimodal data. The log information contains system-defined natural language. The metric information collects time series of numerical values. The trace information is the topology of the invocation graph. With multimodal information involved, it is both reasonable and natural to explore the heterogeneous attributes of the service invocation graph.
2. Multivariate causal correlation between instances. Since each instance node affects all its downstream invocations, the causal information of an instance node can propagate to nodes that lie multiple hops away. It is challenging to capture this causality flow using graph convolution networks, which primarily model the pairwise local correlation of directed edges. Instead, more sophisticated graph structures that can model multivariate correlations among subsets of nodes with varying cardinality should be considered.
To tackle these issues, we propose a novel causal heterogeneous graph based framework named CHASE, which inductively learns to localize the root cause anomaly of microservice systems when trace, log, and metric data are all present. Based on the multimodal data, a heterogeneous invocation graph is constructed using the trace topology, with additional metric and log nodes connected to the source instance nodes where the information is gathered. Representative node embeddings are then generated using designated encoders for each multimodal data type. Instance-level anomaly detection is carried out using heterogeneous attentive message passing. CHASE then locates root cause instances by learning from the constructed hypergraph, whose hyperedges represent causality propagation in the trace. We evaluate CHASE on two datasets with different attributes: one with static trace topology and one with dynamic trace topology. Our experimental results show that CHASE significantly improves fault localization accuracy compared to several state-of-the-art methods.
The main contributions of this paper are as follows:
1. We propose a novel graph learning framework named CHASE to handle multimodal log, metric, and trace data for root cause analysis, in which trace, metric, and log information is encoded along with the construction of an invocation graph based on the trace topology. We accomplish multimodal feature fusion by solving instance-level anomaly detection with heterogeneous message passing. In this way, we model both the multimodal data and the information heterogeneity of microservice systems.
2. We achieve the root cause analysis task with hypergraph learning. CHASE captures the causality flow of anomalies in microservice systems by constructing hypergraphs on the basis of the trace topology, in which each hyperedge represents causality propagation along the invocation path. The multivariate causal correlation between a set of instances is modeled with hypergraph convolution.
3. We conduct extensive experiments using two public datasets from microservice systems and compare the performance with a number of traditional and GNN-based baselines, demonstrating that our proposal outperforms the compared methods in the root cause analysis task.
The rest of the paper is organized as follows: Section II introduces the related work including non-GNN based and GNN-based root cause analysis. Section III explains the overall framework of CHASE and the implementation details including multimodal invocation graph construction, instance level anomaly detection with heterogeneous message passing and hypergraph causality learning for root cause analysis. In Section IV, we evaluate the performance of CHASE and extensively compare it with the state-of-the-art baselines on two distinct datasets. Section V concludes the paper.
II Related Work
II-A Non-GNN based Root Cause Analysis
The complexity of microservice systems has intensified the need for effective root cause analysis techniques to diagnose and resolve issues. In recent years, a line of research has focused on developing RCA methods for microservice systems [23]. This section gives an overview of some of the key studies in this area.
Existing work mainly uses three categories of data sources: log-based [7], trace-based [2], and metric-based [6]. Log-based RCA analyzes service logs from different instances in the microservice system to identify potential root causes of issues; it naturally relies on accurate log parsing and is often hard to run in real time. For instance, Zhang et al. [7] propose a method for localizing operational faults that involves two steps: it first preprocesses system logs to generate high-quality features, and then applies a machine learning model to these features to identify the root cause of operational faults. LogFlash [24] also integrates anomaly detection on logs as the main part of root cause analysis, based on calculating the deviation from the normal log status. A common issue for log-based RCA is that these works often require offline efforts to extract key information from the logs, and their performance depends largely on the overall quality of the system logs. Trace-based research uses the complete tracing of execution paths to identify root causes that occur along the way. TraceRCA [2] uses a tracing tool to collect trace data among service invocations in the microservice system and employs a decision tree to detect the root cause. However, using trace data alone is insufficient, as trace data only presents information at the service invocation level [10]. Apart from these two categories, metric-based RCA is now widely studied in the research community. Most metric-based RCA research [17] employs monitoring data (e.g., CPU and memory usage, network latency, etc.) gathered from different service instances to establish causal graphs and deduce the underlying root causes. Examples include MicroRCA [20] and CloudRanger [19]: the former correlates application performance symptoms with the root cause, while the latter conducts a second-order random walk on an impact graph to identify the problematic services. However, metric-based root cause analysis does not consider other types of information from microservice systems. TrinityRCL [25] harnesses application-level, service-level, and host-level telemetry data, combined with metric-level data, to construct a heterogeneous causal graph.
II-B GNN based Root Cause Analysis
Graph Neural Networks have demonstrated prowess in capturing intricate relationships within graph-structured data, i.e., data from non-Euclidean spaces. From the spectral perspective, GNNs can be explained as low-pass filters under the graph Fourier transform; from the spatial perspective, they can be regarded as a message passing mechanism over neighbor nodes. By leveraging popular GNN architectures such as Graph Convolution Network (GCN) [26], GraphSAGE [27], and Graph Attention Network (GAT) [28], tasks related to graph structural learning have been solved in fields such as social networks, biology, and recommendation systems [29, 30]. Because individual services in microservice systems are pairwise correlated, it is intuitive to model the dependencies between microservices with a graph representation. GNNs can thus be employed to learn the patterns of microservice systems and facilitate root cause analysis.
Recent advances in GNNs have unleashed great potential for root cause analysis, especially for cases relying on graph-structured data. Owing to their natural compatibility with generated causal graphs, researchers have started to leverage GNNs to learn the failure propagation patterns in the graph, giving stronger generalization ability than random walk based RCA frameworks on causal graphs generated with the Peter-Clark (PC) algorithm [22]. CausalRCA [21] designs a gradient-based causal structure learning method to capture linear and non-linear causal relations in monitoring metrics and outputs a weighted causal directed acyclic graph (DAG). DiagFusion [22] reports state-of-the-art RCA results using GNNs. It combines deployment data and traces to build a dependency graph and uses a GNN to generate embeddings for each service instance, which are then used for two-fold failure diagnosis, i.e., root cause fault localization and failure type determination. A hierarchical causal network framework named REASON [31] models the fault propagation of the microservice trace with both within-network and across-network causal relationships for root cause localization; a random walk on the learnt causal network is then applied to locate a system fault. It is worth noting that CausalRCA uses an encoder-decoder structure for unsupervised learning to derive the edge weights, which is considered insufficient to represent the significance between causes and effects in the causal graph. DiagFusion, on the other hand, applies a graph convolutional network for representation learning on invocation graphs in an end-to-end fashion, requiring a certain amount of labelled data.
In all, non-GNN based frameworks focus more on mining causal correlations from multimodal information, such as metrics, to build causal structure graphs representing causality flow, while GNN-based frameworks apply graph learning methodologies and turn the RCA task into a trainable optimization problem solved by minimizing the training loss. However, none of the aforementioned non-GNN based or GNN-based frameworks can tackle the RCA task by capturing multimodal information, invocation topology, and causality propagation simultaneously in an end-to-end trainable manner. Our proposed framework, CHASE, exploits logs and metrics to generate pertinent and comprehensive embeddings, applies message passing on the graph to learn the non-Euclidean data of invocation traces, models causality flow with hyperedges, and presents an overall training loss that makes the RCA task solvable end to end.
III CHASE Framework
![Refer to caption](extracted/5697553/figures/framework.jpg)
In this section, we elaborate the CHASE framework with three stages, including invocation graph generation, instance anomaly detection with heterogeneous message passing and root cause analysis with hypergraph learning. Fig. 2 depicts the overall architecture. We list out all necessary notations used in this paper in Table I.
| Notation | Description |
|---|---|
| | Concatenation |
| | Set size |
| | Activation function |
| | Invocation graph modelling a microservice trace |
| | Neighbors of a node in |
| | Set of instance nodes |
| | Set of metric nodes |
| | Set of log nodes |
| | Set of edges |
| | Encoded node embedding |
| | One hot embedding for instance category |
| | Number of attention heads |
| | Key embedding of attention head with index |
| | Query embedding of attention head with index |
| | Anomaly information of attention head with index |
| | Anomaly score between a pair of instance and log |
| | Anomaly score between a pair of instance and metric |
| | Nodewise normalized anomaly score of instance |
| | Hypergraph incidence matrix |
| | Diagonal matrix of node degree |
| | Diagonal matrix of hyperedge degree |
| | Detection head for root cause analysis |
III-A Multimodal Invocation Graph Construction
To begin with, we formally define the concept of an invocation graph within the context of microservice systems, which is derived from a directed acyclic graph (DAG). A DAG is a directed graph that contains no cycles, meaning there is no path that starts at a given node and loops back to the same node after traversing multiple edges. An invocation graph is a special form of a DAG, where the edges represent the meta-relationships between instances, logs, and metrics.
In a microservice system, as shown in Fig. 1, the process of handling a request is usually reflected by a microservice trace, which can be modeled by a DAG with instances as nodes, where each directed edge represents one instance being called by another. There are multiple types of service instances, including application instances, message instances, gateway instances, and database instances. The instances and invocation edges on a trace are extracted and assembled to obtain the complete microservice invocation graph representing business processes. In general, each instance continuously outputs logs during runtime, while metrics of the instance’s container and physical server are recorded by the DevOps platform at the same time.
This multimodal trace information is modeled by the invocation graph, whose nodes comprise the set of all instances in the trace, the set of monitored metrics of all instances, and the set of all printed logs of the trace, including warnings and errors. We then apply directed edges between each metric or log node and its instance node to represent the runtime metric and log records of that instance. Note that each instance is monitored with multiple types of metrics, so there may exist multiple metric-instance edges for a single instance, whereas we encode all the log information into one node per instance for computational efficiency on the multimodal invocation graph.
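As a rough illustration of the graph described above, the following Python sketch stores instance, metric, and log nodes together with typed edges. The class name `InvocationGraph`, its methods, the node-id scheme, and the edge directions (caller to callee, metric or log node to instance) are all assumptions made for this example and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class InvocationGraph:
    """Hypothetical container for a heterogeneous invocation graph."""
    instances: dict = field(default_factory=dict)    # instance id -> category
    metrics: dict = field(default_factory=dict)      # metric node id -> time series
    logs: dict = field(default_factory=dict)         # log node id -> merged log text
    calls: set = field(default_factory=set)          # (caller, callee) invocation edges
    metric_edges: set = field(default_factory=set)   # (metric node id, instance id)
    log_edges: set = field(default_factory=set)      # (log node id, instance id)

    def add_instance(self, inst, category):
        self.instances[inst] = category

    def add_call(self, caller, callee):
        self.calls.add((caller, callee))

    def add_metric(self, inst, name, series):
        # One metric node per monitored metric type of an instance.
        node = f"{inst}/metric/{name}"
        self.metrics[node] = list(series)
        self.metric_edges.add((node, inst))

    def add_log_line(self, inst, line):
        # All log lines of an instance are merged into a single log node.
        node = f"{inst}/log"
        self.logs[node] = (self.logs.get(node, "") + "\n" + line).strip()
        self.log_edges.add((node, inst))

    def neighbors(self, inst):
        """Metric and log nodes attached to one instance."""
        return sorted(n for n, i in self.metric_edges | self.log_edges if i == inst)


g = InvocationGraph()
g.add_instance("gateway", "gateway")
g.add_instance("db", "database")
g.add_call("gateway", "db")
g.add_metric("db", "cpu", [0.2, 0.9, 0.95])
g.add_metric("db", "mem", [0.4, 0.4, 0.5])
g.add_log_line("db", "WARN slow query")
g.add_log_line("db", "ERROR connection timeout")
```

In this sketch the `db` instance ends up with two metric neighbors and one log neighbor, mirroring the one-log-node-per-instance choice described in the text.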
The raw features of log nodes are natural language texts with semantic meaning, the features of metric nodes are real-valued time series, and the instance nodes have categorical features. As a result, the invocation graph with log, metric, and instance nodes can be considered a graph with multimodal information. Hence, we apply a separate encoder to each node type to encode the features of log, metric, and instance nodes into embeddings:
(1) |
For log nodes, we follow [22], which applies FastText [32] to encode the extracted log templates into text embeddings. For metric nodes, we apply the standard time series transformer provided by Hugging Face (https://huggingface.co/blog/time-series-transformers) and take the output embedding of the last timestamp as the encoded metric embedding. For instance nodes, the one-hot embedding representing the category of the specific instance node is projected with a learnable matrix. We also incorporate the temporal invocation order of each instance in the trace by applying positional encoding. The positional encoding is a function of the index of a single feature among all feature dimensions, the temporal order of the instance in the invocation, and a hyperparameter with a preset default value, as follows:
(2) |
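A minimal sketch of this instance encoding is given below. It assumes the standard sinusoidal positional encoding of the Transformer with a base of 10000, and it assumes that the encoding is added to the projected one-hot vector. Both choices, as well as the function names, are assumptions for illustration only.

```python
import math


def positional_encoding(order, dim, base=10000.0):
    """Sinusoidal encoding of one invocation order into a vector of length dim."""
    enc = []
    for i in range(dim):
        # Feature pairs (2k, 2k+1) share one frequency.
        angle = order / (base ** ((2 * (i // 2)) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def encode_instance(one_hot, weight, order):
    """Project a one-hot category with a weight matrix, then add the positional encoding."""
    dim = len(weight[0])
    projected = [
        sum(one_hot[c] * weight[c][j] for c in range(len(one_hot)))
        for j in range(dim)
    ]
    pe = positional_encoding(order, dim)
    return [p + e for p, e in zip(projected, pe)]


# Two instance categories, embedding dimension 4 (hypothetical values).
weight = [[1, 1, 1, 1], [2, 2, 2, 2]]
first_call = encode_instance([0, 1], weight, order=0)
```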
III-B Instance Level Anomaly Detection with Heterogeneous Message Passing
After the whole encoding process where all the log node embeddings, metric node embeddings and instance node embeddings are encoded into , , , we construct the multimodal invocation graph. The task of instance level anomaly detection is a conventional procedure in root cause analysis. In this section, we handle this task based on the intrinsic heterogeneity of the invocation graph, which is inspired by the architecture design of heterogeneous graph transformer.
III-B1 Attentive Anomaly Score
Starting with the subgraph composed of a single instance node and its related log and metric nodes, we model all the output logs with a single log node , while we model different types of monitored metrics with several metric nodes , denoted as . We first project all the encoded embeddings into a specific vector space for anomaly detection. For log and metric nodes, we perform the following linear transformation.
(3) |
where , , are learnable weight matrices of dimension , where is the initial embedding dimension from our multimodal encoders, is the total number of attention heads, and denotes the index of a specific attention head.
For each instance node in the microservice system, multiple types of runtime metrics and logs are continuously produced and monitored by the DevOps platform. It is crucial to identify which of these monitored records are the most representative for the occurrence of the detected instance-level anomaly. We quantify this anomaly score, denoted as , with the heterogeneous mutual attention weight between the instance node and all its neighbors in the invocation graph.
(4) |
Both , are of dimension , which represents the projection matrix for calculating the attention weight between the instance node and its multimodal neighbors. This matrix is shared across all attention heads. and are the prior significance granted to different types of metrics and logs, which can be learnable and are initialized to ones, reflecting our prior assumption that all kinds of logs and metrics are equally important to the instance-level anomaly.
The anomaly score is normalized for each instance node with all its neighbors.
(5) |
The instance anomaly score for attention head is of dimension , where denotes the set of all neighbors of node and in accordance with the normalized attention weight.
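The scoring and normalization steps can be sketched as follows, assuming a scaled bilinear attention in the style of the heterogeneous graph transformer and a softmax over the neighbors of one instance node. The function names, the scaling by the square root of the key dimension, and the example vectors are assumptions; the exact form of Equations 4 and 5 is not reproduced here.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def raw_score(key, w_att, query, prior=1.0):
    """Scaled bilinear score between one neighbor key and the instance query."""
    projected = matvec(w_att, query)
    return prior * sum(k * p for k, p in zip(key, projected)) / math.sqrt(len(key))


def normalize_scores(scores):
    """Softmax over all metric and log neighbors of one instance node."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


identity = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]                             # instance node query (hypothetical)
keys = {"log": [2.0, 0.0], "cpu": [0.0, 1.0]}  # neighbor keys (hypothetical)
scores = {name: raw_score(k, identity, query) for name, k in keys.items()}
weights = normalize_scores(scores)             # "log" receives the larger weight
```

Setting `prior` to 1.0 for every neighbor corresponds to the stated assumption that all kinds of logs and metrics are initially equally important.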
III-B2 Anomaly Information Passing
CHASE further extracts the anomaly information from the metric nodes and log nodes by applying another linear transformation to their encoded embeddings.
(6) |
In order to extract the anomaly detection information from all the attention heads, we concatenate all the embeddings to form:
(7) |
where represents the anomaly information of instance node from all its neighbors and with the dimension of . Since each attention head will extract the anomaly detection information from a separate projection field, will be stacked up to the size of , which will be weighted summed by so as to gather all the anomaly information from the neighbors of instance node based on the significance of the attentively learnt anomaly score.
III-B3 Instance Level Anomaly Embedding Update
We aggregate all the anomaly information based on its attention weight calculated from Equation 5 by:
(8) |
denotes the instance level anomaly embedding learnt from the heterogeneous invocation graph. Each instance node will gather anomaly information from its monitored log and metric neighbor nodes, and the instance node embedding is updated with this learnt anomaly embedding, which reflects the procedure of anomaly detection.
(9) |
is applied to project the anomaly information embedding back to the same vector space with the instance node features after being activated. is the hyperparameter representing our prior knowledge on how significantly an instance-level anomaly will result in the root cause of a trace-level anomaly. The whole framework of instance level anomaly detection is delineated in Fig. 2(b).
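The aggregation and update steps described in this subsection might look like the sketch below. It assumes that each head aggregates neighbor messages with its normalized anomaly scores, that head outputs are concatenated, and that the instance embedding is updated in a residual form scaled by a hyperparameter called `alpha` here. The residual form, the ReLU activation, and all names are assumptions rather than the paper's exact Equations 8 and 9.

```python
def weighted_sum(weights, messages):
    """Aggregate neighbor messages using normalized anomaly scores."""
    dim = len(next(iter(messages.values())))
    return [sum(weights[n] * messages[n][j] for n in messages) for j in range(dim)]


def relu(vec):
    return [max(0.0, x) for x in vec]


def update_instance(h_inst, head_weights, head_messages, w_out, alpha=0.5):
    """Return the updated instance embedding after one round of message passing."""
    concat = []
    # One aggregated vector per attention head, concatenated across heads.
    for weights, messages in zip(head_weights, head_messages):
        concat.extend(weighted_sum(weights, messages))
    activated = relu(concat)
    # Project back to the instance embedding space.
    projected = [sum(w * x for w, x in zip(row, activated)) for row in w_out]
    # Assumed residual form: keep the instance embedding, add the scaled anomaly embedding.
    return [h + alpha * p for h, p in zip(h_inst, projected)]


# Two heads with one-dimensional messages (hypothetical values).
head_weights = [{"log": 0.75, "cpu": 0.25}, {"log": 0.5, "cpu": 0.5}]
head_messages = [{"log": [4.0], "cpu": [0.0]}, {"log": [-2.0], "cpu": [-2.0]}]
w_out = [[1.0, 0.0], [0.0, 1.0]]
updated = update_instance([1.0, 1.0], head_weights, head_messages, w_out)
```

A larger `alpha` lets the detected instance-level anomaly dominate the embedding that is later passed to the hypergraph stage, while a smaller value keeps the embedding closer to the encoded instance features.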
III-C Root Cause Analysis with Hypergraph Learning
Lastly, CHASE performs root cause analysis with constructed hyperedges representing causality flow on the invocation graph, where heterogeneous anomaly information is propagated to each instance node. Let be a hypergraph with incidence matrix . denotes the vertices of the hypergraph, denotes the edges of it and . Considering vertex , if can be reached from edge , it then can be denoted as . Otherwise, .
In a hypergraph, each hyperedge can encompass more than two vertices, meaning:
(10) |
We use to denote a diagonal matrix, the element of which represents the weight of each hyperedge. is the diagonal matrix which denotes the degree of each vertex and is the diagonal matrix which denotes the degree of each hyperedge as follows:
(11) | |||
(12) |
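To make the two degree definitions concrete, the sketch below computes them from an incidence matrix stored as a list of rows (one row per vertex, one column per hyperedge). It assumes the common convention that a vertex degree is the weighted count of hyperedges containing the vertex and a hyperedge degree is the number of vertices it contains. The function names and the toy matrix are hypothetical.

```python
def vertex_degrees(incidence, edge_weights):
    """Weighted number of hyperedges that contain each vertex."""
    return [sum(w * h for h, w in zip(row, edge_weights)) for row in incidence]


def hyperedge_degrees(incidence):
    """Number of vertices contained in each hyperedge (column sums)."""
    return [sum(col) for col in zip(*incidence)]


# Four vertices and two hyperedges: {v0, v1, v2} and {v1, v2, v3}.
incidence = [
    [1, 0],
    [1, 1],
    [1, 1],
    [0, 1],
]
d_v = vertex_degrees(incidence, [1.0, 1.0])
d_e = hyperedge_degrees(incidence)
```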
Next, we introduce the algorithm that generates hyperedges for the constructed invocation graph, so as to learn the multivariate causality information for root cause analysis with hypergraph convolution. The trace topology graph illustrated in Fig. 2(c) is generated by removing all the log nodes and metric nodes from the multimodal invocation graph, since the anomaly embeddings have already been learnt at this stage from Equation 9. Suppose an extra intervention , for instance, a 5-second network packet loss on the server, is applied to a random instance . Taking the output of instance as a random variable , then
(13) |
Here we take instance node with index 6 as an example to consider the effect (circled in red in Fig. 2(c)), and denote the output of instance 6 with random variable . Since the instances with index larger than 6, which are descendant nodes , are called by instance 6 and their states are considered to result from , intervention on these instances will not change the cause instance output . However, is not independent of the intervention on its ascendant nodes, including intervention directly imposed on itself, namely . This leads to the hyperedge construction process (see Algorithm 1).
INPUT: The invocation graph
OUTPUT: Incidence matrix
To be more specific, line 1 of Algorithm 1 initializes the incidence matrix to a zero matrix. Line 2 loops through all the vertices of the trace invocation graph , picking a single vertex each time and constructing multiple hyperedges representing both causes and effects. Lines 3-10 loop through each parent node of the target instance node and construct a hyperedge that connects all ascendants of the parent node. Lines 12-17 construct the hyperedge that connects all the descendant nodes. Fig. 2(c) gives an illustration of the hyperedge construction for the instance with index 6, which is marked in red. As instance 6 has two parent nodes, namely instances 2 and 5, a total of three hyperedges will be generated for instance 6.
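The construction described above can be sketched in Python as follows. This is one reading of the description, not a transcription of Algorithm 1: it assumes that each cause hyperedge contains the target vertex, one parent, and all ascendants of that parent, that the effect hyperedge contains the target vertex and all its descendants, and that hyperedges with fewer than two vertices are skipped. The example DAG is hypothetical and only mimics the situation of an instance 6 with parents 2 and 5.

```python
def reachable(start, adjacency):
    """All nodes reachable from start by following adjacency (start excluded)."""
    seen, stack = set(), list(adjacency.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency.get(node, ()))
    return seen


def build_hyperedges(nodes, calls):
    """Map each vertex to its cause hyperedges (one per parent) and its effect hyperedge."""
    children, parents = {}, {}
    for caller, callee in calls:
        children.setdefault(caller, []).append(callee)
        parents.setdefault(callee, []).append(caller)
    per_node = {}
    for v in nodes:
        edges = []
        for p in parents.get(v, []):
            # Cause hyperedge: the vertex, one parent, and all ascendants of that parent.
            edges.append(frozenset({v, p} | reachable(p, parents)))
        effect = {v} | reachable(v, children)
        if len(effect) > 1:  # assumption: skip single-vertex hyperedges
            edges.append(frozenset(effect))
        per_node[v] = edges
    return per_node


def incidence_matrix(nodes, per_node):
    """Rows are vertices, columns are hyperedges in construction order."""
    hyperedges = [e for v in nodes for e in per_node[v]]
    return [[1 if v in e else 0 for e in hyperedges] for v in nodes]


nodes = [1, 2, 5, 6, 7, 8]
calls = [(1, 2), (1, 5), (2, 6), (5, 6), (6, 7), (6, 8)]  # (caller, callee)
per_node = build_hyperedges(nodes, calls)
H = incidence_matrix(nodes, per_node)
```

For this toy graph, vertex 6 yields the hyperedges {1, 2, 6}, {1, 5, 6}, and {6, 7, 8}: two cause hyperedges, one per parent, and one effect hyperedge, which matches the count of three given in the text.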
With the hypergraph that represents the invocation causality constructed, we initialize the node embeddings with learnt from Equation 9. In order to localize the trace-level root cause instance, a convolution operation is performed on the hypergraph so that the causality information can propagate to each instance node through the constructed hyperedges. We treat all hyperedges as equally significant, hence . The hypergraph convolution can then be defined as follows:
(14) |
We set from Equation 9, and is the learnable projection that captures the causality of the trace level anomaly. In all, the root cause can be localized with a detection head matrix being applied to every instance node of the trace, and an end-to-end training schema can be accomplished by minimizing the overall loss:
(15) |
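For orientation, hypergraph convolution is commonly written in the spectral form below. This is the widely used formulation from the hypergraph neural network literature, given here as an assumption about the shape of the propagation rule rather than as the paper's Equation 14. The symbols are chosen for this sketch: $X^{(l)}$ for the instance embeddings at layer $l$, $H$ for the incidence matrix, $D_v$ and $D_e$ for the diagonal vertex and hyperedge degree matrices, $W$ for the diagonal hyperedge weight matrix, $\Theta^{(l)}$ for the learnable projection, and $\sigma$ for the activation function.

```latex
X^{(l+1)} = \sigma\!\left( D_v^{-1/2} \, H \, W \, D_e^{-1} \, H^{\top} \, D_v^{-1/2} \, X^{(l)} \, \Theta^{(l)} \right),
\qquad W = I \ \text{when all hyperedges are equally weighted.}
```

Read from right to left, the rule projects the vertex embeddings, gathers them into each hyperedge through $H^{\top}$, averages within the hyperedge through $D_e^{-1}$, and scatters the result back to the member vertices through $H$, with $D_v^{-1/2}$ normalizing for the number of hyperedges each vertex belongs to.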
IV Experiments
We evaluate the performance of CHASE framework in this section. Specifically, we compare the proposed framework with several baseline frameworks for RCA, including non-GNN based ones and GNN-based ones. To provide a comprehensive evaluation, we conduct experiments on two open-source datasets.
IV-A Experimental Setting
IV-A1 Datasets
We give the details of the datasets here. The first dataset is a public dataset named Generic AIOps Atlas (GAIA), provided by Cloudwise (https://github.com/CloudWise-OpenSource/GAIA-DataSet); the second dataset is a real-world dataset collected from the AIOps 2020 competition (https://github.com/NetManAIOps/AIOps-Challenge-2020-Data).
a) GAIA - The Generic AIOps Atlas dataset contains multimodal information from the business simulation system MicroSS, including metrics, logs and traces. It covers a mobile service, a log service, a web service, a database service and a Redis service, with 10 service instances in total. The anomalies are injected into the log to simulate the malfunction of service invocation. The dataset contains four types of anomalies: login failure, memory anomalies, access denied exceptions and missing files. It includes 1099 static traces, meaning that the number of instances involved in each trace, the invocation order and the trace topology all remain the same. Following the training and testing split from DiagFusion [22], we assign 160 traces for training and the remainder for validation and testing.
b) AIOps 2020 - This dataset is collected for the AIOps 2020 challenge hosted by Tsinghua Netman Lab. The challenge aims at testing the availability of all microservices before releasing them into the production environment. This environment encompasses hundreds of microservice instances, including network instances, kernel instances, docker instances, etc. Each microservice instance is deployed on multiple physical machines, resulting in a highly complex system environment. Moreover, since each request can be handled by a subset of components of the overall system without involving all instances, invocation traces are considered dynamic. In other words, the trace typology in the dataset differ from one another. Fig. 1 is a service invocation sample from the dataset, with both log and metric data (CPU and memory usage, etc.) recorded for each service instance. Message instances trigger asynchronous invocations, represented with red arrows in the graph, while network gateway instances, service instances and database instances trigger synchronous invocations, represented with black arrows. For data collection, the whole system had been running for 3 months, in which 68 manually injected failures occurred, and each of them lasted for 5 minutes approximately. During each failure, there are hundreds of traces continuously being deployed in the system, among which may or may not be erroneous since a single failure would not result in the collapse of the whole microservice system. In order to evidently illustrate the performance of both baseline models and the proposed CHASE, we split each anomaly duration into a five-minute span, a three-minute span and a one-minute span based on the true label which records the starting timestamp of each failure, and we infer that a better approach can detect a higher percentage of anomaly traces within these time spans. 
Because each failure gradually recovers after its injection, the five-minute span potentially contains a lower percentage of anomalous traces than the one-minute span: the longer the span, the more likely it is that the failure has already recovered, producing more normal traces.
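The span construction described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `traces_in_span`, the trace ids, the timestamps and the half-open window boundary are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def traces_in_span(trace_starts, failure_start, minutes):
    """Return ids of traces starting within `minutes` after `failure_start`.

    Assumption: the window is half-open, [failure_start, failure_start + minutes).
    """
    window_end = failure_start + timedelta(minutes=minutes)
    return [
        trace_id
        for trace_id, started_at in trace_starts.items()
        if failure_start <= started_at < window_end
    ]


# Hypothetical failure injected at 10:00:00, with four hypothetical traces.
failure_start = datetime(2020, 5, 1, 10, 0, 0)
trace_starts = {
    "t1": datetime(2020, 5, 1, 10, 0, 30),
    "t2": datetime(2020, 5, 1, 10, 2, 10),
    "t3": datetime(2020, 5, 1, 10, 4, 45),
    "t4": datetime(2020, 5, 1, 10, 6, 0),  # falls outside the 5-minute span
}

# The three spans are nested: 1-minute within 3-minute within 5-minute.
spans = {k: traces_in_span(trace_starts, failure_start, k) for k in (1, 3, 5)}
print(spans)
```

Because the spans are nested and all start at the injection timestamp, a longer span can only add later traces, which are the ones most likely to be observed after recovery.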
IV-A2 Baseline Methods
We compare our proposed framework with the seven baseline methods described below. The baselines fall into three categories: causality-based RCA approaches (i.e., PC and GES), correlation-learning-based approaches (i.e., CloudRanger and MicroRCA), and GNN-based approaches (i.e., CausalRCA, TrinityRCL and DiagFusion).
- PC [15] is widely used for causal relationship inference and has proven effective in identifying the root causes of system failures, process deviations, sensor failures, and insurance claims. It has been adopted in root cause analysis across various domains, such as system failure diagnosis, semiconductor manufacturing, wind turbines, and insurance.
- GES [16] stands for the Greedy Equivalence Search algorithm, which is mainly applied for causal relationship inference. Unlike PC, which relies on tests of independence, GES applies optimal structure identification with greedy search to generate the causal graph.
- CloudRanger is a root cause identification tool first introduced in [19]. It collects and analyzes system logs, metrics, and other data sources to build a machine learning model for root cause prediction.
- MicroRCA [20] is an automated, fine-grained root cause localization framework that analyzes monitoring data and localizes faulty services in the microservice architecture. It designs a gradient-based causal structure learning to capture linear and non-linear causal relations in monitoring metrics.
- TrinityRCL [25] localizes the root causes of anomalies at multiple levels of granularity by harnessing all types of telemetry data to construct a causal graph representing the intricate, dynamic, and nondeterministic relationships among the various entities related to the anomalies.
- CausalRCA [21] uses causal inference to automatically identify the root cause of performance issues in microservices. Specifically, it constructs a directed acyclic graph (DAG) that represents the causal relationships between services and their performance metrics, from which the root cause can be identified by inferring the causal relationships with the trained graph neural network.
- DiagFusion [22] is a robust failure diagnosis method that leverages multimodal data. It combines deployment data and traces to build a dependency graph, and applies a GNN to generate a learnt representation for each service instance and achieve two-fold failure diagnosis, i.e., root cause instance localization and failure type determination.
Note that for all baseline RCA frameworks we use the hyperparameters that yield the best performance reported in their papers and source code. Both datasets are split into training/validation/testing sets to avoid possible overfitting. For baseline methods that require PageRank to infer the root cause instance, we set the damping factor to 0.85, the maximum number of iterations to 100 and the maximum error tolerance to 0.01. For CHASE, we apply 3 attention layers for heterogeneous message passing, each with 8 attention heads. We use LeakyReLU as the activation function with a negative slope coefficient of 0.3. We apply a single hypergraph convolution layer for causality learning, with an embedding dimensionality of 128. The weight hyperparameter in Equation 9 is set to 0.5 by default.
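To make the PageRank settings above concrete, the following is a minimal sketch of power-iteration PageRank on a weighted directed graph, using a damping factor of 0.85, at most 100 iterations and an error tolerance of 0.01. It is not taken from any baseline's source code: the graph, the node names, the edge direction and the use of an L1 error as the stopping criterion are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, max_iter=100, tol=0.01):
    """Power-iteration PageRank.

    `edges` maps (source, target) -> non-negative edge weight.
    Iteration stops when the L1 change of the rank vector drops below `tol`
    (an assumed reading of "maximum error tolerance").
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }
        for (src, dst), weight in edges.items():
            if out_weight[src] > 0.0:
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        error = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if error < tol:
            break
    return rank


# Hypothetical causal graph: edges point towards the suspected cause.
edges = {
    ("web", "db"): 1.0,
    ("web", "redis"): 1.0,
    ("mobile", "db"): 1.0,
}
scores = pagerank(edges)
top_candidate = max(scores, key=scores.get)
print(top_candidate, round(sum(scores.values()), 6))
```

In this toy graph `db` receives rank from both `web` and `mobile`, so it ends up as the top-ranked root cause candidate.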
IV-A3 Evaluation Metrics
Our framework aims to accurately infer the root cause instance of each anomaly case. We adopt the evaluation metrics widely used in the root cause inference task [22], slightly modified to suit real-world scenarios. We choose Top-1 accuracy ($A@1$), Top-3 accuracy ($A@3$) and Top-5 average accuracy ($Avg@5$) as three evaluation metrics. Since often only the top three predicted results are manually examined in real-world cases, we include $A@1$, $A@3$ and $Avg@5$ to evaluate the robustness of the root cause analysis methods.
Top-k accuracy ($A@k$) is calculated as follows, with $v_t$ as the ground truth root cause instance of anomaly trace $t$, $R_t[k]$ as the Top-k root cause instance set generated by the root cause inference system based on the information of anomaly trace $t$, and $N$ as the size of the dataset:
$$A@k = \frac{1}{N}\sum_{t=1}^{N}\mathbb{1}\left[v_t \in R_t[k]\right] \tag{16}$$
$Avg@5$ is calculated based on the Top-k accuracy with $k$ ranging from 1 to 5:
$$Avg@5 = \frac{1}{5}\sum_{k=1}^{5} A@k \tag{17}$$
The percentage of erroneous traces occurring in a $k$-minute span (Percentage@$k$) is calculated as the number of traces detected as anomalous ($\hat{y}_t = 1$) divided by the total number of traces occurring within $k$ minutes after the labelled timestamp $t_0$, where $\hat{y}_t$ denotes the binary prediction result for trace $t$ and $\mathcal{T}_k$ denotes the set of traces occurring within $k$ minutes after $t_0$:
$$\text{Percentage@}k = \frac{\left|\{t \in \mathcal{T}_k : \hat{y}_t = 1\}\right|}{\left|\mathcal{T}_k\right|} \tag{18}$$
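As a hedged illustration of how these three metrics could be computed, the sketch below implements Top-k accuracy, its average over k = 1..5, and the fraction of traces predicted as anomalous within a span. The function names, instance names and toy rankings are hypothetical and are not drawn from either dataset.

```python
def top_k_accuracy(ground_truth, rankings, k):
    """Fraction of anomaly traces whose true root cause is in the top-k list."""
    hits = sum(
        1 for truth, ranked in zip(ground_truth, rankings) if truth in ranked[:k]
    )
    return hits / len(ground_truth)


def avg_at_5(ground_truth, rankings):
    """Mean of the Top-k accuracy for k = 1, ..., 5."""
    return sum(top_k_accuracy(ground_truth, rankings, k) for k in range(1, 6)) / 5


def percentage_at_k(predictions):
    """Fraction of traces in one k-minute span predicted as anomalous (label 1)."""
    return sum(predictions) / len(predictions)


# Hypothetical ground-truth root causes and ranked predictions for 4 traces.
ground_truth = ["web", "db", "redis", "web"]
rankings = [
    ["web", "db", "redis", "mobile", "log"],   # hit at rank 1
    ["web", "db", "redis", "mobile", "log"],   # hit at rank 2
    ["web", "db", "mobile", "log", "redis"],   # hit at rank 5
    ["db", "mobile", "log", "redis", "web"],   # hit at rank 5
]

a1 = top_k_accuracy(ground_truth, rankings, 1)   # 1 of 4 traces
a3 = top_k_accuracy(ground_truth, rankings, 3)   # 2 of 4 traces
avg5 = avg_at_5(ground_truth, rankings)
pct = percentage_at_k([1, 0, 0, 1, 0])           # 2 of 5 traces flagged
print(a1, a3, avg5, pct)
```

Note that the averaged metric rewards methods that place the true root cause high in the list even when they miss it at rank 1, which is why it serves as a robustness indicator.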
Dataset | Evaluation Metric | PC (causality-based) | GES (causality-based) | CloudRanger (correlation-based) | MicroRCA (correlation-based) | TrinityRCL (GNN-based) | CausalRCA (GNN-based) | DiagFusion (GNN-based) | Proposed Method |
---|---|---|---|---|---|---|---|---|---|
GAIA | A@1 | 0.2960 | 0.3003 | 0.3290 | 0.3421 | 0.4503 | 0.3652 | 0.4121 | 0.6135 |
GAIA | A@3 | 0.6368 | 0.5399 | 0.4771 | 0.5528 | 0.8244 | 0.4973 | 0.8157 | 0.8823 |
GAIA | Avg@5 | 0.5953 | 0.5154 | 0.4883 | 0.5712 | 0.7651 | 0.5966 | 0.7478 | 0.8276 |
AIOps 2020 | Percentage@5 | - | 0.11 | - | 0.08 | 0.12 | - | 0.12 | 0.15 |
AIOps 2020 | Percentage@3 | - | 0.08 | - | 0.13 | 0.14 | - | 0.14 | 0.16 |
AIOps 2020 | Percentage@1 | - | 0.14 | - | 0.17 | 0.15 | - | 0.14 | 0.22 |
"-" indicates that a baseline framework is inapplicable to the dataset due to dynamic trace topology.
IV-B Evaluation Results
To demonstrate the effectiveness of our proposed method, we compare CHASE with the baseline frameworks from different categories. The evaluation results are shown in Table II.
The causality-based methods extract the causal graph from historical traces, and both PC and GES apply PageRank on the causal graph with all edge weights equal to the out-degree of nodes. PC and GES clearly perform worse than the correlation-based methods, which generate the edge weights in the causal graph by calculating correlations through monitored metrics. Specifically, MicroRCA integrates service invocation graphs with causal graphs and applies customized random walks; it outperforms CloudRanger, which only considers the attributed graph with anomalous propagation edges, indicating the necessity of leveraging causal relationships. Regarding GNN-based methods, CausalRCA calculates edge weights through DAG-GNN with variational inference, requiring no training data labels. Due to the limitations of unsupervised learning, it performs worse than DiagFusion across all evaluation metrics, as DiagFusion takes advantage of multimodal data to generate the node features in the graph and is trained end-to-end on supervised RCA tasks. In general, our proposed method, which integrates microservice invocation graphs, the anomaly heterogeneous graph and adaptively learnt causality with a hypergraph, outperforms all baselines. On the GAIA dataset, our proposed method achieves 36.2%, 7.2% and 8.1% higher than the best baseline on A@1, A@3 and Avg@5, respectively.
On the more complex AIOps 2020 dataset, the invocation topology of traces varies between the training and testing sets. As a result, PC, CloudRanger and CausalRCA, which require a static trace topology and are not inductive on traces whose topology is unseen in the training set, cannot perform root cause analysis under this condition. Therefore, the evaluation results of these methods on the AIOps 2020 dataset are marked "-" in Table II. Our proposed method still retains a significant performance gain of 25.0%, 14.3% and 29.4% over the best baseline on the percentage of erroneous traces detected in the 5-minute, 3-minute and 1-minute spans, respectively.
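The reported gains on AIOps 2020 are relative improvements over the best-performing applicable baseline in each row of Table II. The short sketch below reproduces that arithmetic from the table values; the function name `relative_gain` is introduced here for illustration only.

```python
def relative_gain(proposed, baselines):
    """Relative improvement (in percent) over the best baseline score."""
    best = max(baselines)
    return (proposed - best) / best * 100.0


# Scores copied from Table II (AIOps 2020 rows, applicable baselines only:
# GES, MicroRCA, TrinityRCL, DiagFusion).
aiops_rows = {
    "Percentage@5": (0.15, [0.11, 0.08, 0.12, 0.12]),
    "Percentage@3": (0.16, [0.08, 0.13, 0.14, 0.14]),
    "Percentage@1": (0.22, [0.14, 0.17, 0.15, 0.14]),
}

gains = {
    metric: round(relative_gain(proposed, baselines), 1)
    for metric, (proposed, baselines) in aiops_rows.items()
}
print(gains)  # {'Percentage@5': 25.0, 'Percentage@3': 14.3, 'Percentage@1': 29.4}
```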
![Refer to caption](extracted/5697553/figures/sena.png)
IV-C Sensitivity Analysis
We conduct the sensitivity analysis on both datasets. Specifically, Fig. 3 (a1) and (a2) show how the number of attention layers affects the performance on the GAIA and AIOps 2020 datasets, respectively. Performance typically improves as the number of attention layers increases up to 3 and drops afterwards, indicating a potential overfitting problem. Regarding the different types of positional encoding, depicted in Fig. 3 (b1) and (b2), the traditional embedding and the learnable embedding demonstrate superior performance. We adopt the traditional embedding because it is composed only of constants calculated from Equation 2, which reduces the total number of learnable parameters and lowers the computational cost. The effect of the hidden state dimension for heterogeneous message passing and hypergraph convolution is depicted in Fig. 3 (c1) and (c2). The performance of the model rises sharply until the dimension reaches 128, and improves only slightly when the hidden state dimension is doubled to 256. Thus, considering the trade-off between computational efficiency and performance, we use a hidden state dimension of 128. For the number of hypergraph convolution layers, shown in Fig. 3 (d1) and (d2), the performance of the model keeps dropping as the number of layers increases. Hence, we apply a single hypergraph convolution layer. Intuitively, since a single hypergraph layer is already capable of capturing high-order multivariate correlation, stacking hypergraph layers introduces redundancy into the training process and causes undesirable performance drops.
IV-D Ablation Study
![Refer to caption](extracted/5697553/figures/attention.png)
Ablation Strategy | A@1 | A@3 | Avg@5 |
---|---|---|---|
Default Method | 0.6135 | 0.8823 | 0.8276 |
V1: w/o instance embedding | 0.5927 | 0.8554 | 0.7852 |
V2: w/o heterogeneous message passing | 0.3845 | 0.5403 | 0.6581 |
V3: w/o causal hyperedge | 0.4679 | 0.8106 | 0.7598 |
To evaluate the effectiveness of the main components of CHASE, we conducted an ablation study on the GAIA dataset using the following three variants. V1: remove the instance embedding layer while keeping the non-learnable positional embedding layer. V2: remove the heterogeneous message passing from the graph convolution layer. V3: remove the final hypergraph layer, which learns the propagation of causality. Table III lists the performance of each variant.
Since the instance embedding layer captures the differences between instances, removing these features in V1 loses information about the whole trace topology and invocation structure, leading to a slight performance drop of 3.3% in A@1, as shown in Table III.
For the V2 variant of CHASE, we replace the heterogeneous message passing layer with a homogeneous graph attention layer, which results in a 37% performance drop in A@1. This emphasizes the importance of modeling the heterogeneous correlation of the invocation graph. To illustrate more clearly the impact of learning the heterogeneous attributes among metrics, logs, and instances through message passing in CHASE, Fig. 4 compares the homogeneous edge weights with the heterogeneous edge weights on a trace in the GAIA dataset (with webservice2 as the root cause instance). The edges related to webservice2 (last columns in both figures) are granted higher weights, which indicates that the heterogeneous message passing layer enhances root cause inference by directing more attention towards the edges related to the root cause instance.
In the context of causality inference, directly attaching the detection head to the output of the heterogeneous message passing layer in the V3 variant causes the causal hypergraph to degrade into a heterogeneous graph neural network. In this scenario, information propagates only through K hops (where K represents the number of layers), which contradicts the nature of anomaly causality, as it may propagate across an arbitrary number of hops depending on the invocation. This leads to a performance drop of 23.7% in A@1.
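The K-hop limitation can be illustrated with a small sketch. With K rounds of neighbor-to-neighbor message passing, a node can only receive information from nodes at most K hops away, whereas nodes grouped in one hyperedge exchange information within a single hypergraph layer. The chain graph, the node names and the hyperedge below are hypothetical and do not reproduce the hyperedge construction of CHASE.

```python
from collections import deque


def nodes_within_hops(adjacency, source, max_hops):
    """Breadth-first search returning all nodes within `max_hops` of `source`."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, distance = frontier.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the allowed number of hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, distance + 1))
    return seen


# Hypothetical invocation chain: a -> b -> c -> d -> e.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}

# With K = 3 message passing layers, "a" can only influence nodes up to "d".
reach_3_layers = nodes_within_hops(chain, "a", 3)

# A single hypothetical hyperedge covering the whole causal path links
# "a" and "e" directly, so one hypergraph layer suffices.
hyperedge = {"a", "b", "c", "d", "e"}
linked_in_one_layer = {"a", "e"} <= hyperedge

print(sorted(reach_3_layers), linked_in_one_layer)
```

Under these assumptions, an anomaly at `a` whose effect surfaces at `e` is invisible to a three-layer graph network but remains within reach of a single hyperedge.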
Comparing V2 with V3, retaining the heterogeneity, as V3 does, is considerably more critical. V3 achieves 21.6% higher accuracy in A@1 than V2, since causal edges provide less topological information than actual invocation edges. Nevertheless, V3 demonstrates superior robustness (33.3% higher in A@3 and 13.3% higher in Avg@5) compared with V2, as causal edges denote stronger causality than actual invocation edges.
V Conclusion
For complex microservice systems, root cause analysis is essential for identifying and addressing the underlying issues that hinder optimal system performance. To tackle this challenge, we propose a causal heterogeneous graph based framework named CHASE to accurately locate the root cause instances of failures, facilitating effective troubleshooting and system maintenance. Starting from modeling microservice systems with graph representations, we apply encoders and embedding layers combined with positional encoding to encode the multimodal data of traces, logs and metrics into an invocation graph. CHASE leverages heterogeneous message passing on invocation graphs to tackle instance-level anomaly detection. Next, it constructs a hypergraph in which each hyperedge captures the causality flow for each instance. Finally, the designated hypergraph module locates the root cause anomaly with a detection head applied to each updated node embedding after hypergraph convolution. We comprehensively evaluate CHASE on two real-world datasets, and the evaluation results confirm the superiority of our proposed framework. For future work, hypergraph learning techniques for modeling spatiotemporal graphs can be applied to capture the causality structure of the trace, which evolves dynamically since the whole trace topology is gradually formed by sequential invocations spanning a certain time interval. Additionally, a single powerful autoregressive encoder that is intrinsically trained on multimodal data, such as a large language model, could replace the dedicated log and metric encoders of CHASE, in order to enhance the quality of the embeddings and to robustly unify the information from metrics and logs.
References
- [1] T. Zhang, Z. Shen, J. Jin, A. Tagami, X. Zheng, and Y. Yang, “ESDA: an energy-saving data analytics fog service platform,” in International Conference on Service-Oriented Computing. Springer, 2019, pp. 171–185.
- [2] Z. Li, J. Chen, R. Jiao, N. Zhao, Z. Wang, S. Zhang, Y. Wu, L. Jiang, L. Yan, Z. Wang, Z. Chen, W. Zhang, X. Nie, K. Sui, and D. Pei, “Practical root cause localization for microservice systems via trace analysis,” in International Symposium on Quality of Service. IEEE, 2021, pp. 1–10.
- [3] M. Baboi, A. Iftene, and D. Gîfu, “Dynamic microservices to create scalable and fault tolerance architecture,” Procedia Computer Science, vol. 159, pp. 1035–1044, 2019.
- [4] V. Ramu, “Performance impact of microservices architecture,” The Review of Contemporary Scientific and Academic Studies, vol. 3, 2023.
- [5] X. Peng, “Large-scale trace analysis for microservice anomaly detection and root cause localization,” in Proceedings of the Federated Africa and Middle East Conference on Software Engineering, 2022, pp. 93–94.
- [6] M. Ma, W. Lin, D. Pan, and P. Wang, “Self-adaptive root cause diagnosis for large-scale microservice architecture,” IEEE Transactions on Services Computing, vol. 15, pp. 1399–1410, 2020.
- [7] Y. Zhang, Z. Guan, H. Qian, L. Xu, H. Liu, Q. Wen, L. Sun, J. Jiang, L. Fan, and M. Ke, “CloudRCA: a root cause analysis framework for cloud computing platforms,” in Proceedings of the ACM International Conference on Information & Knowledge Management, 2021, pp. 4373–4382.
- [8] A. Ikram, S. Chakraborty, S. Mitra, S. Saini, S. Bagchi, and M. Kocaoglu, “Root cause analysis of failures in microservices through causal discovery,” Advances in Neural Information Processing Systems, vol. 35, pp. 31 158–31 170, 2022.
- [9] Á. Brandón, M. Solé, A. Huélamo, D. Solans, M. S. Pérez, and V. Muntés-Mulero, “Graph-based root cause analysis for service-oriented and microservice architectures,” Journal of Systems and Software, vol. 159, p. 110432, 2020.
- [10] M. Kim, R. Sumbaly, and S. Shah, “Root cause detection in a service-oriented architecture,” ACM SIGMETRICS Performance Evaluation Review, vol. 41, pp. 93–104, 2013.
- [11] Global Times. (2023) Didi system glitch lasts 12 hours, with company apologizing for service breakdown. [Online]. Available: https://www.globaltimes.cn/page/202311/1302630.shtml
- [12] G. Rong, H. Wang, S. Gu, Y. Xu, J. Sun, D. Shao, and H. Zhang, “Locating anomaly clues for atypical anomalous services: An industrial exploration,” IEEE Transactions on Dependable and Secure Computing, pp. 2746–2761, 2023.
- [13] G. P. Bhandari and R. Gupta, “Fault analysis of service-oriented systems: A systematic literature review,” IET Software, vol. 12, no. 6, pp. 446–460, 2018.
- [14] J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,” ACM Computing Surveys, vol. 55, no. 3, pp. 1–39, 2022.
- [15] M. Solé, V. Muntés-Mulero, A. I. Rana, and G. Estrada, “Survey on models and techniques for root-cause analysis,” arXiv:1701.08546, 2017.
- [16] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of Machine Learning Research, vol. 3, pp. 507–554, 2002.
- [17] J. Lin, P. Chen, and Z. Zheng, “Microscope: Pinpoint performance issues with causal graphs in micro-service environments,” in International Conference on Service-Oriented Computing. Springer, 2018, pp. 3–20.
- [18] B. Sharma, P. Jayachandran, A. Verma, and C. R. Das, “CloudPD: Problem determination and diagnosis in shared dynamic clouds,” in International Conference on Dependable Systems and Networks. IEEE, 2013, pp. 1–12.
- [19] P. Wang, J. Xu, M. Ma, W. Lin, D. Pan, Y. Wang, and P. Chen, “CloudRanger: Root cause identification for cloud native systems,” in International Symposium on Cluster, Cloud and Grid Computing. IEEE, 2018, pp. 492–502.
- [20] L. Wu, J. Tordsson, E. Elmroth, and O. Kao, “MicroRCA: Root cause localization of performance issues in microservices,” in Network Operations and Management Symposium. IEEE, 2020, pp. 1–9.
- [21] R. Xin, P. Chen, and Z. Zhao, “CausalRCA: Causal inference based precise fine-grained root cause localization for microservice applications,” Journal of Systems and Software, vol. 203, 2023.
- [22] S. Zhang, P. Jin, Z. Lin, Y. Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, M. Ma, W. Jin, D. Zhang, Z. Zhu, and D. Pei, “Robust failure diagnosis of microservice system through multimodal data,” arXiv:2302.10512, 2023.
- [23] A. Gholami and A. K. Srivastava, “Comparative analysis of ML techniques for data-driven anomaly detection, classification and localization in distribution system,” in North American Power Symposium. IEEE, 2021, pp. 1–6.
- [24] T. Jia, Y. Wu, C. Hou, and Y. Li, “LogFlash: Real-time streaming anomaly detection and diagnosis from system logs for large-scale software systems,” in International Symposium on Software Reliability Engineering. IEEE, 2021, pp. 80–90.
- [25] S. Gu, G. Rong, T. Ren, H. Zhang, H. Shen, Y. Yu, X. Li, J. Ouyang, and C. Chen, “TrinityRCL: Multi-granular and code-level root cause localization using multiple types of telemetry data in microservice systems,” IEEE Transactions on Software Engineering, vol. 49, no. 5, pp. 3071–3088, 2023.
- [26] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv:1609.02907, 2016.
- [27] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [28] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv:1710.10903, 2017.
- [29] T. Zhang, Y. Liu, Z. Shen, R. Xu, X. Chen, X. Huang, and X. Zheng, “An adaptive federated relevance framework for spatial temporal graph learning,” IEEE Transactions on Artificial Intelligence, vol. 5, pp. 2227–2240, 2024.
- [30] Y. Liu, Z. Zhao, T. Zhang, K. Wang, X. Chen, X. Huang, J. Yin, and Z. Shen, “Exploiting spatial-temporal data for sleep stage classification via hypergraph learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2024, pp. 5430–5434.
- [31] D. Wang, Z. Chen, J. Ni, L. Tong, Z. Wang, Y. Fu, and H. Chen, “Hierarchical graph neural networks for causal discovery and root cause localization,” arXiv:2302.01987, 2023.
- [32] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” arXiv:1607.01759, 2016.