ECLIPSE: Semantic Entropy-LCS for Cross-Lingual Industrial Log Parsing

Wei Zhang¹, Xianfu Cheng¹, Yi Zhang¹, Jian Yang¹†, Hongcheng Guo¹, Zhoujun Li¹†, Xiaolin Yin², Xiangyuan Guan³, Xu Shi³, Liangfan Zheng³, Bo Zhang³
¹State Key Laboratory of Complex & Critical Software Environment, Beihang University; ²Haier Smart Home; ³Cloudwise Research
zwpride, buaacxf, zhangyi2021, jiaya, hongchengguo, [email protected]; [email protected]; tim.shi, leven.zheng, [email protected]


Abstract.

Log parsing, a vital task for interpreting the vast and complex data produced within software architectures, faces significant challenges in the transition from academic benchmarks to the industrial domain. Existing log parsers, while highly effective on standardized public datasets, struggle to maintain performance and efficiency when confronted with the sheer scale and diversity of real-world industrial logs. These challenges are two-fold: 1) Massive log templates: the performance and efficiency of most existing parsers degrade significantly as logs grow in quantity and vary in length; 2) Complex and changeable semantics: traditional template-matching algorithms cannot accurately match the log templates of complicated industrial logs because they cannot exploit cross-language logs with similar semantics. To address these issues, we propose ECLIPSE, Enhanced Cross-Lingual Industrial log Parsing with Semantic Entropy-LCS, which robustly parses industrial logs, including cross-language ones. On the one hand, it integrates two efficient data-driven template-matching algorithms and Faiss indexing. On the other hand, driven by the powerful semantic understanding ability of the Large Language Model (LLM), it accurately extracts the semantics of log keywords and effectively reduces the retrieval space. We also introduce ECLIPSE-Bench, a Chinese and English cross-platform industrial log parsing benchmark, to evaluate the performance of mainstream parsers in industrial scenarios. Experimental results on public benchmarks and ECLIPSE-Bench underscore the superior performance and robustness of ECLIPSE: it delivers state-of-the-art performance compared with strong baselines while preserving a significant edge in processing efficiency. (We will release our code and dataset.)

Industrial Log Parsing, Large Language Model, Information Entropy

† Corresponding author.

1. Introduction

Logs play a crucial role in Algorithmic IT Operations (AIOps) compared to other operational and maintenance data, as they provide significant insights into system behavior (He et al., 2021; Zhu et al., 2019). By analyzing logs, we can complete a variety of downstream tasks, such as anomaly detection (Du et al., 2017; Nandi et al., 2016; Zhang et al., 2024a), fault diagnosis (He et al., 2018; Zhang et al., 2024b), and root cause analysis (Tak et al., 2016; Chuah et al., 2010). In general, most log analysis methods make log parsing the first step in automated log analysis (He et al., 2016). As illustrated in Figure 1, semi-structured logs are generated by log statements in code and need to be parsed into structured logs by algorithms. The most challenging part of this task is extracting templates from logs (Zhu et al., 2019), where a template is the part of a log that stays unchanged across messages. Traditionally, log parsing was accomplished by manually configuring regular expressions (Lang, 2013). With the exponential growth in the volume and complexity of logs brought by cloud computing, this approach has become impractical. Another method is source code-based log parsing (Nagappan et al., 2009; Xu et al., 2009), but it is often infeasible because third-party libraries are widely used and the source code of some libraries may not be accessible.

Figure 1. Diagram of the log structuring process. The semi-structured logs are generated by log statement code and then parsed into the structured logs by algorithms. The log template represents the part of the keywords that are not changed during the log parsing process.

Many data-driven log parsing techniques have been proposed. Some are based on the assumption that constants appear more frequently than variables in log messages (Vaarandi, 2003; Vaarandi and Pihelgas, 2015; Dai et al., 2020), while others treat log parsing as a clustering problem (Hamooni et al., 2016; Mizutani, 2013; Shima, 2016; Xiao et al., 2020) or use specific heuristic rules (Makanju et al., 2009; Du and Li, 2016; He et al., 2017; Chu et al., 2021). Although these methods theoretically make log parsing feasible for industrial applications, most have only been tested on simple public datasets and lack practical validation on complex industrial logs. As a result, they perform poorly on most industrial logs, with low efficiency and low accuracy. Our paper addresses this gap. Based on an analysis of industrial logs and existing algorithms, we identify the following issues overlooked by current log parsing algorithms: 1) Enormous number of log templates: A massive number of log templates hurts parsing efficiency, especially in industrial applications, where the heuristic rules used to accelerate computation cannot fundamentally solve the efficiency problem. For example, Spell (Du and Li, 2016) and Drain (He et al., 2017) perform poorly in industrial scenarios with many templates because of their frequent similarity computations. 2) Length insensitivity: Existing log parsing algorithms generally assume as a prior that logs from the same template share the same keywords and the same length, which prevents them from handling logs of one template whose lengths differ. Drain+ (Fu et al., 2022) was aware of this issue and introduced Jaccard similarity to merge logs that share the same tokens, but this disregards the order and position of identical tokens. 3) Complex semantics and multilingual logs: In industrial scenarios, logs from different programming languages, system sources, and natural languages often aggregate under the same data source. Data-driven log parsing methods are capable of handling the former, but cannot handle the latter.

To address these issues, we propose a powerful log parser called ECLIPSE. Specifically, driven by the strong keyword representation, cross-lingual understanding, and cross-semantic comprehension of the large language model (LLM), ECLIPSE builds a dynamic dictionary that maps semantic keywords to log templates. For each incoming log, the K-nearest-neighbor templates are recalled from the dynamic dictionary by Faiss. Then, our carefully designed Entropy-LCS, an entropy-improved longest common subsequence method, selects the best candidate among them as the log template and updates the dynamic dictionary in real time. To further validate the effectiveness of our algorithm in settings closer to industrial applications than public benchmark datasets, we construct a benchmark called ECLIPSE-Bench. The experimental results demonstrate the effectiveness of ECLIPSE on both public benchmarks and ECLIPSE-Bench. In addition, we compare the parsing efficiency of ECLIPSE with that of strong baseline algorithms in scenarios involving a large number of templates.

Overall, our main contributions can be summarized as follows:

• We propose a powerful log parsing system called ECLIPSE. ECLIPSE employs an LLM to drive cross-lingual and cross-semantic relevance detection and uses Faiss to keep log retrieval efficient under a massive number of templates, addressing the challenges in log parsing.

• We design Entropy-LCS, an information entropy-improved LCS, for real-time matching of a log against its K-nearest-neighbor templates. It not only addresses the problem of length insensitivity but also further improves the performance and flexibility of log parsing.

• We construct ECLIPSE-Bench, a bilingual industrial log parsing benchmark dataset, which supports both Chinese and English. We collected nearly 102M logs from 3 industrial domains and extracted 700 templates from these logs. We evaluate ECLIPSE on both public Loghub and ECLIPSE-Bench using F-measure, grouping accuracy, and execution time; ECLIPSE achieves advanced performance and highly competitive parsing efficiency, outperforming the existing industry-standard baseline methods.

2. Related Work

2.1. Existing matching strategies

A lot of work has been proposed in the field of log parsing in recent years. In general, log parsing methods can be classified into four categories: frequent pattern mining, clustering, heuristic rules, and deep learning methods. 1) Frequent pattern mining: This category assumes that constants generally occur more frequently in log messages than variables. SLCT (Vaarandi, 2003) and LogCluster (Vaarandi and Pihelgas, 2015) mainly group logs into several clusters based on the term frequencies of each log. Logram (Dai et al., 2020) uses frequent n-grams to separate constants and variables. 2) Clustering: This category usually clusters logs and then extracts templates from each cluster. LogMine (Hamooni et al., 2016) and SHISO (Mizutani, 2013) use hierarchical clustering to cluster logs and update the template. LKE (Fu et al., 2009) employs edit distance and k-means to cluster logs. LenMa (Shima, 2016) clusters logs based on length vectors. LPV (** relation. Spell (Du and Li, 2016) regards log parsing as an LCS problem. Drain (He et al., 2017) utilizes length and prefix tokens to partition logs. Recently, Prefix-Graph (Chu et al., 2021) proposes generating log templates from a prefix graph. 4) Deep learning methods: In recent years, deep learning-based log parsing algorithms have emerged. For example, Nulog (Nedelkoski et al., 2021) uses a Transformer-based encoder layer to train a log parsing network. Uniparser (Liu et al., 2022) designs three modules and uses contrastive learning for log parsing. LogAP (Rand and Miranskyy, 2021) employs Machine Translation (MT) to perform parsing tasks. However, deep learning algorithms are usually inefficient and costly in both training and inference. Therefore, it is still challenging to apply these methods in real scenarios.

2.2. LLM in log semantic parsing

With the rapid advances in language modeling (Guo et al., 2023b; Vaswani et al., 2017; Yang et al., 2020a; Shen et al., 2023; Zhang et al., 2024b; Wang et al., 2022; Bai et al., 2023; Yao et al., 2023; Shinn et al., 2023; Chi et al., 2020), and particularly the emergence of LLMs with Transformer-based architectures such as GPT-3.5, GPT-4 (OpenAI, 2023), and PaLM (Anil et al., 2023), their excellent language understanding, generation, generalization, and reasoning capabilities have greatly promoted the integration of Natural Language Processing (NLP) and AIOps tasks. The integration of LLMs with external databases and APIs further enhances their functionality (Chai et al., 2024; Guo et al., 2023b; Kojima et al., 2022; Wang et al., 2022), so that domain-specific knowledge can be integrated more effectively and updated continuously. Applied to the semantic analysis of logs, this greatly improves the accuracy of log parsing and anomaly detection (Rozière et al., 2023; Zhang et al., 2024a; Guo et al., 2023a). In addition, Retrieval-Augmented Generation gives LLMs access to external knowledge sources, so that even on more complex and knowledge-intensive tasks they generate answers that are more factual, specific, and diverse (Lewis et al., 2021; Bai et al., 2023; Yao et al., 2023). In our work, we use the LLM to reorder the representation vectors of log keywords so that the most relevant results are highlighted; then, drawing on the LLM's background knowledge, we select the most credible template matching length for the result sequence. This effectively reduces the number of templates that must be matched against the current input log, so the LLM acts as both an information retrieval enhancer and a filter, providing refined inputs for more accurate log parsing.

3. Analysis on Industrial Log Parsing Challenges

In this section, we aim to provide a detailed analysis of the issues mentioned earlier and use practical examples to illustrate the causes behind them.

3.1. Huge volume of log templates

The volume of industrial logs exceeds that of public datasets by an order of magnitude, and the resulting number of log templates is correspondingly large. As shown in Table 1, we analyzed the official websites and source code of several software systems and found that the number of log templates can reach several thousand (up to 5997 for ClickHouse). The public dataset Loghub contains few log templates only because it is built by sampling; without sampling, its number of log templates would be similarly large.

Table 1. Number of templates of some open source software.

Source	MySQL	Oracle	Cisco	ClickHouse
Number of templates	4251	648	1556	5997

The time complexity of a cluster-based algorithm is typically $O(mn)$ or $O(m\log n)$, where $m$ denotes the data size and $n$ denotes the number of clusters. In log parsing, the number of clusters is the number of templates. Therefore, the number of log templates directly affects parsing cost and must be taken into account.

3.2. Various lengths of logs from the same template

Many algorithms assume that logs belonging to the same template must have the same length, but this is not always the case in complex industrial logs, especially those with nested objects such as JSON or XML (Yang et al., 2019, 2020c, 2020b). As illustrated in Figure 2, $Log1$ and $Log2$ are printed by the same code; in other words, they belong to the same template. However, their lengths differ considerably because of the significant differences in the “BUSI_INFO” field of their JSON objects. This happens very frequently, but most algorithms cannot handle it because they treat equal log length as a prerequisite for logs to belong to the same template, as in Drain (He et al., 2017), IPLOM (Makanju et al., 2009), and LenMa (Shima, 2016).

Drain+ (Fu et al., 2022) has also recognized this issue. To address it, Drain+ first uses Drain (He et al., 2017) to generate a series of templates and then merges them by comparing their Jaccard similarity. However, the flaw of this approach is that Jaccard similarity only considers the number of identical tokens between two logs and ignores their order and position. Templates with similar tokens in different orders may therefore be merged, as illustrated in Figure 3. The two templates both contain the tokens “backup”, “not”, and “found”, but in totally different orders, which can cause serious problems when extracting parameters from the log. This type of problem appears more often in non-standard logs and in Chinese, Japanese, Korean, or other languages whose syntax is more flexible than that of English.
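To make the order-insensitivity concrete, the following minimal sketch compares Jaccard similarity with an LCS-based ratio. The two templates and the helper names `jaccard` and `lcs_length` are illustrative assumptions; they are not the templates shown in Figure 3 and not the Drain+ implementation.

```python
def jaccard(a, b):
    """Jaccard similarity over token sets; order and position are ignored."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# Hypothetical templates that share all tokens but in a different order.
t1 = "backup <*> not found".split()
t2 = "found <*> not backup".split()

jaccard_score = jaccard(t1, t2)                          # identical token sets
lcs_score = lcs_length(t1, t2) / max(len(t1), len(t2))   # order-aware ratio
```

Under these assumptions the Jaccard score treats the two templates as identical, while the order-aware ratio does not, which is the behavior the paragraph above describes.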

3.3. Others

In addition to these two main issues, most algorithms also neglect the semantic differences between logs that look alike at the character level. Logs reflect system behavior, and some events in the system are closely related. Therefore, developers often write similar-looking logs to maintain consistency. To illustrate this point, we take MySQL error logs as an example. We found three log templates on the official website that look very similar. As shown in Figure 4, these three templates describe three totally different operations performed on the Unix socket lock file. However, a data-driven algorithm is likely to treat them as one template. Similar situations also occur when the same operation is described on different objects. We analyzed the templates of MySQL error logs with error numbers MY-010000 to MY-010100 and found that around one-fifth of the templates have similar counterparts, which shows that this problem is widespread.

4. Methodology

ECLIPSE defines parsing as a template retrieval problem. To parse logs, ECLIPSE uses a special dynamic dictionary structure, called the template library, to store the templates seen so far. As each log arrives, ECLIPSE retrieves the most suitable template and then updates the template library. In this section, we describe in detail how ECLIPSE works.

4.1. Overview of ECLIPSE

Figure 5 illustrates how ECLIPSE uses a dynamic dictionary structure to store templates of previously seen logs. This structure takes log semantic keywords as keys, and each key maps to a value containing a list of templates and a Faiss index. The algorithm starts with an empty dictionary. When a new log arrives, ECLIPSE takes five steps to identify the most suitable template: 1) Preprocess the log. 2) Extract the semantic keywords from the log and use them as the key to retrieve the log templates and the Faiss index from the dynamic dictionary. 3) Use the corresponding Faiss index to recall the k-nearest-neighbor templates. 4) Use Entropy-LCS to match the log against the k-nearest-neighbor templates and choose the most suitable candidate. 5) Update the corresponding log templates and Faiss index.
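The five steps can be read as a small control loop around the dynamic dictionary. The skeleton below is only a sketch under stated assumptions: the class names `Bucket` and `TemplateLibrary` are hypothetical, each step is injected as a function, a plain list stands in for the Faiss index, and the naive stand-ins at the bottom (including an equal-length match, which is exactly the simplification the paper argues against) exist only so the skeleton can run.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Value stored under one keyword key."""
    templates: list = field(default_factory=list)  # token lists
    vectors: list = field(default_factory=list)    # stand-in for a Faiss index


class TemplateLibrary:
    """Dynamic dictionary: keyword key -> Bucket. All steps are injected."""

    def __init__(self, preprocess, extract_key, embed, recall, match, merge):
        self.buckets = {}
        self.preprocess, self.extract_key = preprocess, extract_key
        self.embed, self.recall = embed, recall
        self.match, self.merge = match, merge

    def parse(self, log):
        tokens = self.preprocess(log)                           # step 1
        key = self.extract_key(tokens)                          # step 2
        bucket = self.buckets.setdefault(key, Bucket())
        vector = self.embed(log)
        candidates = self.recall(vector, bucket.vectors)        # step 3
        hit = self.match(tokens, bucket.templates, candidates)  # step 4
        if hit is None:                                         # step 5: new template
            bucket.templates.append(tokens)
            bucket.vectors.append(vector)
            return tokens
        bucket.templates[hit] = self.merge(bucket.templates[hit], tokens)
        return bucket.templates[hit]


# Deliberately naive stand-ins, not the paper's algorithms.
def naive_match(tokens, templates, candidates):
    for i in candidates:
        t = templates[i]
        same = sum(a == b for a, b in zip(t, tokens))
        if len(t) == len(tokens) and same * 2 >= len(t):
            return i
    return None


def naive_merge(template, tokens):
    return [a if a == b else "<*>" for a, b in zip(template, tokens)]


library = TemplateLibrary(
    preprocess=str.split,
    extract_key=lambda toks: toks[0].lower(),
    embed=len,
    recall=lambda vec, vectors: range(len(vectors)),
    match=naive_match,
    merge=naive_merge,
)
library.parse("Access denied for user root")
result = library.parse("Access denied for user alice")
```

Keeping the steps pluggable mirrors the structure of Sections 4.2 to 4.6, where each step is described independently.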

4.2. Step 1. Preprocess

When a new log $l_{i}$ arrives, ECLIPSE preprocesses it before template searching. Previous work has shown that preprocessing can effectively improve parsing accuracy, so we also use some simple regex rules to replace special objects with tags before template searching. Special objects include IP addresses, URLs, timestamps, and so on. For example, in Figure 5, we replace “198.1.1.1” with the tag “$<$ip$>$”. After replacement, ECLIPSE uses simple delimiters such as spaces and colons to split the log into a token sequence $s_{i}$.
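A minimal sketch of this preprocessing step follows. The paper does not list its exact regex rules, so the patterns in `TAG_RULES`, the `<url>` and `<time>` tag names, and the function name `preprocess` are assumptions chosen for illustration.

```python
import re

# Illustrative rules only; the paper does not list its exact patterns.
# URLs are replaced first so that the colon in "http://" is not split on.
TAG_RULES = [
    ("<url>", re.compile(r"https?://\S+")),
    ("<ip>", re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")),
    ("<time>", re.compile(r"\b\d{2}:\d{2}:\d{2}\b")),
]
DELIMITERS = re.compile(r"[\s:]+")  # spaces and colons


def preprocess(log):
    """Replace special objects with tags, then split into a token sequence."""
    for tag, pattern in TAG_RULES:
        log = pattern.sub(tag, log)
    return [tok for tok in DELIMITERS.split(log) if tok]


tokens = preprocess("Access denied for user 'root'. IP: 198.1.1.1")
```

Tagging before splitting matters here: timestamps and URLs contain colons, so splitting first would break them into fragments that no longer match the rules.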

4.3. Step 2. Retrieve

As mentioned previously, most algorithms ignore the semantic differences between similar logs; Figure 4 gives an example. However, almost all semantic-based algorithms have efficiency problems, and they may still be unable to solve the problem shown in Figure 4.

To balance efficiency and effectiveness, we adopt a compromise. First, we use the LLM to process our collected logs and identify the most commonly used and important words and expressions through a keyword extraction task. In Figure 5, when the keyword library contains “access”, “denied”, “open”, and “close”, ECLIPSE extracts the keywords “Access denied” from the log “Access denied for user ‘root’. IP: $<$ip$>$” as its log abstract. Then, we construct a special dynamic dictionary that uses semantic keywords as keys and the corresponding log templates as values.
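The lookup side of this step can be sketched as follows. In the paper the keyword library is produced by an LLM; here it is a hard-coded set taken from the Figure 5 example, and the names `KEYWORD_LIBRARY`, `extract_key`, and `template_library` are hypothetical. The per-key Faiss index is omitted.

```python
from collections import defaultdict

# Hypothetical keyword library; in the paper it is produced by an LLM.
KEYWORD_LIBRARY = {"access", "denied", "open", "close"}


def extract_key(tokens):
    """Return the keywords found in a token sequence, in order, as a tuple."""
    return tuple(t.lower() for t in tokens if t.lower() in KEYWORD_LIBRARY)


# key (log abstract) -> list of templates; the Faiss index per key is omitted.
template_library = defaultdict(list)

tokens = ["Access", "denied", "for", "user", "'root'.", "IP", "<ip>"]
key = extract_key(tokens)
template_library[key].append(tokens)
```

Because only templates stored under the same key are considered later, logs such as "open ..." and "close ..." never compete with each other even when the rest of their text is nearly identical.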

4.4. Step 3. Recall

Faiss is an open-source library from Meta for efficiently retrieving the nearest neighbors of a vector, which makes it well suited to the efficiency problems caused by the huge volume of templates in industrial logs. To use it, we first need to embed logs into vectors.

However, logs are hard to embed because they contain many unseen tokens, such as tokens containing digits (e.g., “GIF534_234”), camel-case words (e.g., “LogAnomalyDetection”), or other special symbols (e.g., “\etc\wfs\”), which easily cause the out-of-vocabulary (OOV) problem. In our investigation, we found that logs of the same template tend to have similar punctuation distributions, while logs of different templates tend to have different ones. We therefore use char-level embedding to encode logs into punctuation vectors. As shown in Figure 6, the vector has dimension $m+1$, where $m$ denotes the number of selected punctuation features. Each of the first $m$ dimensions records the number of occurrences of the corresponding punctuation mark in the log, while the last dimension records the length of the log string. In ECLIPSE, we select the 39 most commonly used punctuation marks as punctuation features.
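The punctuation vector can be sketched in a few lines. The paper selects 39 punctuation marks but does not list them, so this sketch substitutes Python's 32 ASCII punctuation characters (`string.punctuation`); the function name `punctuation_vector` is also an assumption.

```python
import string

# Assumption: the paper's 39 punctuation features are not listed, so the
# 32 ASCII punctuation characters are used here instead (m = 32).
PUNCTUATION = string.punctuation


def punctuation_vector(log):
    """Occurrence count of each punctuation mark, followed by the log length."""
    counts = [log.count(ch) for ch in PUNCTUATION]
    return counts + [len(log)]


vec = punctuation_vector("Access denied for user 'root'. IP: <ip>")
```

The resulting vector has $m+1$ entries and depends only on characters, so unseen tokens such as identifiers or paths cannot cause an out-of-vocabulary failure.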

Before calling Faiss, we use the LLM to generate punctuation vectors. Through prompting, the LLM also determines in advance the number $k$ of candidate templates for Faiss, drawing on its background knowledge, and ranks the candidate templates by the semantic similarity of their vector contexts. After log $l_{i}$ is encoded into a punctuation vector $v_{i}$, the vectors are normalized to zero mean and unit variance so that every dimension is measured on the same scale. Next, ECLIPSE uses Faiss to recall the $k$-nearest-neighbor templates for the log vector $v_{i}$. Only these templates need to be considered in the subsequent similarity calculations.
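The normalization and recall can be illustrated without Faiss. The sketch below standardizes each dimension and then performs an exact brute-force nearest-neighbor search by squared L2 distance; it is a stand-in for the Faiss index, and the distance metric, the function names `standardize` and `recall`, and the toy vectors are assumptions rather than details given in the paper.

```python
import math


def standardize(vectors):
    """Z-score each dimension across the stored vectors."""
    dims, n = len(vectors[0]), len(vectors)
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((v[d] - means[d]) ** 2 for v in vectors) / n
        stds.append(math.sqrt(var) or 1.0)  # constant dimensions keep scale 1
    scaled = [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]
    return scaled, (means, stds)


def recall(query, scaled_templates, stats, k):
    """Indices of the k nearest template vectors by squared L2 distance."""
    means, stds = stats
    q = [(x - m) / s for x, m, s in zip(query, means, stds)]
    dists = [sum((a - b) ** 2 for a, b in zip(q, t)) for t in scaled_templates]
    return sorted(range(len(dists)), key=dists.__getitem__)[:k]


# Toy 3-dimensional vectors: two punctuation counts plus the log length.
template_vectors = [[2, 1, 39], [0, 4, 80], [2, 1, 41], [5, 0, 12]]
scaled, stats = standardize(template_vectors)
neighbors = recall([2, 1, 39], scaled, stats, k=2)
```

Without standardization the length dimension would dominate the distance, since its raw values are an order of magnitude larger than the punctuation counts.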

4.5. Step 4. Match

Only one of the recalled $k$-nearest-neighbor templates is finally selected as the candidate. We use Entropy-LCS, an information entropy-improved LCS, to choose the most suitable one. We first calculate the longest common subsequence $\gamma$ of the $k$-nearest-neighbor templates $T$, which may be empty. Subsequently, for each $t$ in $T$, we compare it with the LCS $\gamma$ and record the positions of the tokens that differ. Then, we calculate the information entropy $E(x)=-\sum_{i}p(x_{i})\log p(x_{i})$ for these positions in $t^{\prime}$. Following this, all positions of divergent tokens are analyzed exhaustively: for each position, the list of tokens observed there and its information entropy are calculated over all logs. Variables are identified as follows:

$$V=\begin{cases}\text{Yes},&\text{if }-\sum_{t\in T}\frac{f(t)}{N}\log_{2}\frac{f(t)}{N}>\theta\\ \text{No},&\text{if }-\sum_{t\in T}\frac{f(t)}{N}\log_{2}\frac{f(t)}{N}\leq\theta\end{cases}\tag{1}$$

where $V$ denotes the variation point determination. The entropy $H=-\sum_{t\in T}\frac{f(t)}{N}\log_{2}\frac{f(t)}{N}$ is computed from the distribution of tokens occurring at a specific position: $T$ is the set of tokens at that position, $f(t)$ is the frequency of token $t$ in all logs, $N$ is the total number of logs, and $\theta$ is the threshold value. The position is judged to be a variable ($Yes$) if the calculated entropy exceeds the threshold $\theta$, and a constant ($No$) otherwise.
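A worked instance of Equation (1) is sketched below. The function names, the toy token columns, and the toy threshold are assumptions; in particular the threshold used here is chosen for eight observations and is not the paper's default $\theta=4.5$.

```python
import math
from collections import Counter


def position_entropy(tokens_at_position):
    """Shannon entropy (base 2) of the token distribution at one position."""
    n = len(tokens_at_position)
    counts = Counter(tokens_at_position)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def is_variable(tokens_at_position, theta):
    """Equation (1): the position is a variable if its entropy exceeds theta."""
    return position_entropy(tokens_at_position) > theta


# Toy columns: the tokens seen at one position across eight logs.
constant_like = ["user"] * 8
variable_like = ["root", "alice", "bob", "carol", "dave", "erin", "frank", "grace"]

h_const = position_entropy(constant_like)  # a single token: no uncertainty
h_var = position_entropy(variable_like)    # eight equally likely tokens

TOY_THETA = 1.0  # assumption for this toy data only
```

With $n$ equally likely tokens the entropy is $\log_2 n$, so a position needs more than $2^{\theta}$ equally frequent distinct tokens to pass a threshold $\theta$; skewed distributions need even more.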

4.6. Step 5. Update

If Step 4 (Section 4.5) returns a suitable template, ECLIPSE updates the corresponding template in the template list. Tokens that are not present in the Entropy-LCS are replaced by a special symbol (e.g., “$<$*$>$”) in the updated template.

If Step 4 does not return a suitable template, ECLIPSE treats the log as a new template and saves it in the dynamic dictionary, using the log abstract extracted in Section 4.3 as the key. Additionally, the vector obtained in Section 4.4 is added to the Faiss index associated with this key.
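The template update can be sketched with a plain LCS, leaving out the entropy test. The functions `lcs_tokens` and `update_template` are hypothetical names, and collapsing each run of non-LCS tokens into a single wildcard is a design choice of this sketch (it lets logs of different lengths map to the same template), not something the paper specifies.

```python
def lcs_tokens(a, b):
    """Return one longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], m, n
    while i and j:  # backtrack through the table
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def update_template(template, log_tokens, wildcard="<*>"):
    """Keep LCS tokens; put one wildcard wherever either side has other tokens."""
    common = lcs_tokens(template, log_tokens)
    merged, i, j = [], 0, 0
    for tok in common + [None]:  # None flushes the trailing tokens
        gap = False
        while i < len(template) and template[i] != tok:
            i, gap = i + 1, True
        while j < len(log_tokens) and log_tokens[j] != tok:
            j, gap = j + 1, True
        if gap:
            merged.append(wildcard)
        if tok is not None:
            merged.append(tok)
            i, j = i + 1, j + 1
    return merged


template = "Access denied for user root IP <ip>".split()
template = update_template(template, "Access denied for user alice IP <ip>".split())
longer = update_template(template, "Access denied for user bob smith IP <ip>".split())
```

In this example the second log has one more token than the template, yet the merged template is unchanged, which is the length-tolerant behavior motivated in Section 3.2.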

Table 2. Summary of public datasets Loghub.

Category	Dataset	Description	Time Span	Data Size	Logs	Template (total)	Template (2k)
Distributed system logs	HDFS	Hadoop distributed file system log	38.7 hours	1.47 GB	11,175,629	30	14
Distributed system logs	Hadoop	Hadoop mapreduce job log	N.A.	48.61 MB	394,308	298	114
Distributed system logs	Spark	Spark job log	N.A.	2.75 GB	33,236,604	456	36
Distributed system logs	ZooKeeper	ZooKeeper service log	26.7 days	9.95 MB	74,380	95	50
Distributed system logs	OpenStack	OpenStack software log	N.A.	60.01 MB	207,820	51	43
Supercomputer logs	BGL	Blue Gene/L supercomputer log	214.7 days	708.76 MB	4,747,963	619	120
Supercomputer logs	HPC	High performance cluster log	N.A.	32.00 MB	433,489	104	46
Supercomputer logs	Thunderbird	Thunderbird supercomputer log	244 days	29.60 GB	211,212,192	4,040	149
Operating system logs	Windows	Windows event log	226.7 days	26.09 GB	114,608,388	4,833	50
Operating system logs	Linux	Linux system log	263.9 days	2.25 MB	25,567	488	118
Operating system logs	Mac	Mac OS log	7.0 days	16.09 MB	117,283	2,214	341
Mobile system logs	Android	Android framework log	N.A.	3.38 GB	30,348,042	76,923	166
Mobile system logs	HealthApp	Health App log	10.5 days	22.44 MB	253,395	220	75
Server application logs	Apache	Apache server error log	263.9 days	4.90 MB	56,481	44	6
Server application logs	OpenSSH	OpenSSH server log	28.4 days	70.02 MB	655,146	62	27
Standalone software logs	Proxifier	Proxifier software log	N.A.	2.42 MB	21,329	9	8

5. Experiment

5.1. Research Questions

To verify the effectiveness and efficiency of ECLIPSE, we conduct extensive experiments on both Loghub and ECLIPSE-Bench, aiming to answer the following questions:

RQ1: How effective is ECLIPSE on public Loghub and our ECLIPSE-Bench?

RQ2: How efficient is ECLIPSE with the change of log and template volume?

RQ3: How does ECLIPSE perform without LLM and Faiss to retrieve and recall by constructing the special dynamic dictionary?

RQ4: How do the hyperparameters of the recall and Entropy-LCS processes affect the performance of ECLIPSE?

5.2. Implementation and Environment

In our experiments, we set up a Virtual Machine (VM) with 64 Intel Core I5 CPU @ 2.0GHz processors and 16GB RAM. The operating system is Ubuntu 20.04. For its strong keyword representation, cross-lingual understanding, and cross-semantic comprehension, we choose ChatGPT-4 as the base LLM of ECLIPSE. ECLIPSE exposes three hyperparameters: the number of nearest-neighbor templates $k$ is set to $5$, the recall threshold $\tau$ is set to $0.5$, and the matching threshold $\theta$ is set to $4.5$. These default values are kept in most experiments. With these default parameters, ECLIPSE achieves high accuracy and reliability across various datasets.

5.3. Datasets

ECLIPSE is evaluated on the public Loghub and on ECLIPSE-Bench, the industrial dataset we collected. Loghub contains 16 public datasets collected from common open-source components, such as Spark, ZooKeeper, and Apache. For each log source, 2000 logs are sampled and labeled manually. ECLIPSE-Bench was collected from actual business scenarios in finance, communication, and manufacturing, spanning over 30 days, with over 300000 logs per domain, ranging from mixed-language logs to single-language logs. Table 3 and Table 2 summarize the industrial datasets of ECLIPSE-Bench and the public Loghub datasets, respectively.

Table 3. Summary of industrial datasets ECLIPSE-Bench.

Dataset	Templates	Avg. Len.	Various Lengths Proportion	Logs
Finance	438	75.21	40.8%	303579
Communication	192	47.36	60.9%	414347
Manufacturing	70	41.98	4.74%	306218

5.4. Baselines and Metrics

We choose Drain (He et al., 2017), IPLOM (Makanju et al., 2009), and Spell (Du and Li, 2016) as our baselines, since they work well on Loghub. To quantify the effectiveness of ECLIPSE, we use F-measure and Group Accuracy (GA), consistent with prior studies (He et al., 2017; Makanju et al., 2009; Du and Li, 2016). F-measure is a template-level metric that focuses on the ratio of correctly grouped templates, and GA is computed as the ratio of correctly grouped log messages to the total count of log messages. We also measure the execution time in seconds to compare the efficiency of ECLIPSE with that of the other parsers.
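As an illustration of GA, the sketch below assumes the strict criterion commonly used in log parsing benchmarks: a log counts as correctly grouped only if its predicted group contains exactly the same logs as its ground-truth group. The paper does not spell out this criterion, and the function name and toy labels are hypothetical.

```python
from collections import defaultdict


def grouping_accuracy(predicted, truth):
    """Fraction of logs whose predicted group equals their ground-truth group."""
    def groups(labels):
        g = defaultdict(set)
        for idx, label in enumerate(labels):
            g[label].add(idx)
        return g

    pred_groups, true_groups = groups(predicted), groups(truth)
    correct = sum(
        len(members)
        for members in pred_groups.values()
        # compare against the ground-truth group of any member of this group
        if members == true_groups[truth[next(iter(members))]]
    )
    return correct / len(truth)


# Toy labels: ground-truth template B is split into two predicted groups.
truth = ["A", "A", "B", "B", "C"]
predicted = ["t1", "t1", "t2", "t3", "t4"]
ga = grouping_accuracy(predicted, truth)
```

Under this criterion, splitting one template into two groups makes every log of that template count as wrong, which is why GA can be much lower than a template-level F-measure on the same output.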

5.5. RQ1: How effective is ECLIPSE on public Loghub and our ECLIPSE-Bench?

Table 4. F-measure and GA on public Loghub.

	Drain		Spell		IPLOM		ECLIPSE	
Dataset	F-measure	GA	F-measure	GA	F-measure	GA	F-measure	GA
HDFS	0.999	0.998	1	1	1	1	0.999	0.998
Hadoop	0.999	0.948	0.920	0.778	0.996	0.954	0.999	0.987
Spark	0.992	0.920	0.991	0.905	0.992	0.920	0.999	0.997
Zookeeper	0.999	0.967	0.999	0.964	0.999	0.962	0.999	0.991
BGL	0.999	0.963	0.957	0.787	0.999	0.939	0.999	0.949
HPC	0.991	0.887	0.986	0.654	0.978	0.829	0.992	0.903
Thunderbird	0.999	0.955	0.994	0.844	0.999	0.663	0.999	0.946
Windows	0.999	0.997	0.999	0.683	0.995	0.568	0.999	0.994
Linux	0.992	0.690	0.937	0.605	0.964	0.672	0.994	0.868
Android	0.996	0.911	0.992	0.919	0.949	0.712	0.991	0.801
HealthApp	0.918	0.780	0.887	0.639	0.958	0.822	0.971	0.819
Apache	1	1	1	1	1	1	1	1
Proxifier	0.785	0.527	0.832	0.527	0.786	0.517	0.999	0.977
OpenSSH	0.999	0.788	0.918	0.554	0.998	0.540	0.999	0.814
OpenStack	0.993	0.733	0.994	0.764	0.909	0.331	1	1
Mac	0.975	0.787	0.963	0.757	0.957	0.671	0.977	0.850
Average	0.977	0.866	0.961	0.774	0.968	0.756	0.995	0.931

Table 5. F-measure and GA on ECLIPSE-Bench.

	Drain		Spell		IPLOM		ECLIPSE	
Dataset	F-measure	GA	F-measure	GA	F-measure	GA	F-measure	GA
Finance	0.469	0.380	0.090	0.249	0.469	0.362	0.988	0.626
Communication	0.863	0.268	0.877	0.171	0.898	0.273	0.920	0.423
Manufacturing	0.984	0.793	0.906	0.193	0.569	0.469	0.998	0.832
Average	0.772	0.481	0.624	0.204	0.645	0.368	0.969	0.627

5.5.1. Public Loghub results

Table 4 reports the effectiveness of ECLIPSE and the three baselines. Although the baselines already achieve relatively good results on public Loghub, ECLIPSE still matches or outperforms them in terms of F-measure on 14 out of 16 datasets and in terms of GA on 10 out of 16 datasets. Overall, ECLIPSE achieves the highest average scores for both metrics, improving on the best baseline average (Drain) by 1.8 percentage points in F-measure (0.995 vs. 0.977) and 6.5 percentage points in GA (0.931 vs. 0.866). One of the most notable results is on the Proxifier dataset. This dataset is particularly suitable for ECLIPSE, as it contains 47.8% similar logs with semantic differences and 47.35% logs with varying lengths.

5.5.2. ECLIPSE-Bench results

On ECLIPSE-Bench, the improvement of ECLIPSE is more pronounced. It achieves an average F-measure of 0.969, a relative improvement of 25.5% over the best baseline average (0.772), and an average GA of 0.627, a relative improvement of 30.3% over the best baseline average (0.481). The largest improvement is observed on the Finance dataset, which contains the highest proportions of logs with different semantics and with various lengths: 17.9% and 40.8%, respectively.

5.6. RQ2: How efficient is ECLIPSE with the change of log and template volume?

To measure the efficiency of our proposed approach, we compare the execution time of ECLIPSE with the baselines as the number of logs and templates varies. Since LogHub does not have a large scale of templates, the data for the experiment are randomly sampled from the LogHub and ECLIPSE-Bench. For fairness, we utilize only a single executor and run each log parser five times for the average execution time.

For the comparison over the number of logs, we measure the execution time of each parser on log data of different scales $\{3000,30000,300000\}$ in Figure 8. The results show that ECLIPSE has the lowest time consumption regardless of the log volume. Moreover, as the number of logs grows, ECLIPSE becomes increasingly efficient relative to the baselines. In particular, when the number of logs reaches $300000$, ECLIPSE takes almost half the execution time of Spell. In Figure 7, the number of logs is fixed at $100000$, and we run experiments with different numbers of log templates $\{3,30,300,3000\}$. When the number of templates is small, ECLIPSE is not as efficient as the baselines. However, once the number of templates rises to a certain order of magnitude, ECLIPSE shows a significant advantage in efficiency. Both experiments demonstrate that ECLIPSE is far more efficient in industrial scenarios, especially with more log templates and larger log volumes.

5.7. RQ3: How does ECLIPSE perform without LLM and Faiss to retrieve and recall by constructing the special dynamic dictionary?

In this section, we conduct an ablation experiment to investigate the effectiveness and efficiency of LLM and Faiss in ECLIPSE. To verify this, we compare ECLIPSE with three variants of ECLIPSE: 1) ECLIPSE w/o LF (ECLIPSE without LLM and Faiss); 2) ECLIPSE w/o L (ECLIPSE without LLM); 3) ECLIPSE w/o F (ECLIPSE without Faiss).

5.7.1. Effect on Effectiveness

We conduct the ablation experiment to verify their effects on effectiveness based on our industrial datasets ECLIPSE-Bench as it’s closer to the real parsing situation. The results are shown in Figure 9, where the label ” $w/o$ ” means ”without”, ” $kl$ ” means ”keyword library” and ” $fi$ ” means ”Faiss”. As we can see, the keyword library has a certain impact on the effectiveness, with about 6.7% improvement on F-measure and 24.4% improvement on PA, while Faiss hardly affects the parsing effectiveness, just as we expect. This proves that recall based on the Faiss has a very high recall rate.

5.7.2. Effect on Efficiency

The efficiency contribution of these two components varies with the number of templates. As shown in Figure 10, without the keyword library and Faiss, the execution time of ECLIPSE increases significantly as the number of templates grows, whereas with them the growth rate is significantly lower. When the number of templates is only 3, there is little difference between ECLIPSE Base and the variants. But when the number of templates reaches 3000, ECLIPSE Base can be 6.76 times faster than variant 1), 1.68 times faster than variant 2), and 1.47 times faster than variant 3), which shows that both the keyword library and Faiss effectively improve the efficiency of ECLIPSE.

5.8. RQ4: How do the hyperparameters of the recall and E-LCS processes affect the performance of ECLIPSE?

In this section, we discuss the effect of three parameters: the number of nearest-neighbor templates $k$, the recall threshold $\tau$, and the matching threshold $\theta$. All experiments are conducted on Manufacturing, a subset of ECLIPSE-Bench.
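To make the roles of the three parameters concrete, the sketch below shows one plausible place where each could enter a recall-then-match pipeline. It is a simplified illustration, not the ECLIPSE implementation: Jaccard token overlap stands in for the Faiss similarity search, a plain LCS length stands in for the Entropy-LCS score, and reading $\tau$ as a minimum recall similarity and $\theta$ as a minimum matching score is an assumption. All function names are hypothetical.

```python
from typing import List, Optional, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Token-set overlap; an assumed stand-in for embedding similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def recall(log: Sequence[str], templates: List[Sequence[str]],
           k: int, tau: float) -> List[int]:
    """Indices of at most k templates whose similarity to the log is >= tau."""
    scored = [(jaccard(log, t), i) for i, t in enumerate(templates)]
    kept = [(s, i) for s, i in scored if s >= tau]
    kept.sort(key=lambda p: (-p[0], p[1]))  # most similar first
    return [i for _, i in kept[:k]]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def match(log: Sequence[str], templates: List[Sequence[str]],
          k: int = 5, tau: float = 0.5, theta: float = 4.5) -> Optional[int]:
    """Best recalled template if its score reaches theta, else None."""
    best_idx, best_score = None, float("-inf")
    for i in recall(log, templates, k, tau):
        score = lcs_length(log, templates[i])  # plain LCS, not Entropy-LCS
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx if best_score >= theta else None


templates = [
    "Connection from <*> closed after <*> ms".split(),
    "Disk usage on <*> exceeded <*> percent".split(),
]
log = "Connection from 10.0.0.7 closed after 35 ms".split()
matched = match(log, templates)  # index of the first template
```

In this sketch, a log that returns `None` would be treated as a new template. A larger $k$ only widens the candidate set, while $\tau$ and $\theta$ both act as hard cut-offs.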

For the validation of parameter $k$, we test three values $\{5,10,15\}$ and keep the other parameters constant. The results in Table 6 show that performance is only slightly affected by changes in $k$. As $k$ becomes larger, performance decreases a little, which demonstrates the robustness of the model.

For the validation of $\tau$, we test three values $\{0.1,0.5,0.9\}$ and keep the other parameters constant. The results in Table 7 show that the model is more sensitive to $\tau$. As $\tau$ becomes larger, performance first improves and then degrades. Our optimal experimental setting is 0.5 for $\tau$. For $\theta$, we test five values $\{3.5,4.0,4.5,5.0,5.5\}$ and keep the other parameters constant. The results in Table 8 show that the model is less influenced by $\theta$. As $\theta$ becomes larger, performance first increases and then decreases. Our best setting is 4.5 for $\theta$.

Table 6. Performance under different $k$

Dataset	$k$	$\tau$	$\theta$	F-measure	GA
	5	0.5	4.5	0.998	0.832
Manufacturing	10	0.5	4.5	0.988	0.821
	15	0.5	4.5	0.986	0.817

Table 7. Performance under different $\tau$

Dataset	$k$	$\tau$	$\theta$	F-measure	GA
	5	0.1	4.5	0.957	0.399
Manufacturing	5	0.5	4.5	0.998	0.832
	5	0.9	4.5	0.987	0.798

Table 8. Performance under different $\theta$

Dataset	$k$	$\tau$	$\theta$	F-measure	GA
	5	0.5	3.5	0.985	0.761
	5	0.5	4.0	0.988	0.787
Manufacturing	5	0.5	4.5	0.998	0.832
	5	0.5	5.0	0.996	0.826
	5	0.5	5.5	0.979	0.803
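The one-parameter-at-a-time comparison in Tables 6 to 8 can be reproduced with a few lines of code. The sketch below copies the reported F-measure and GA values and, for each parameter, picks the best value and computes the range of GA across the sweep as a rough sensitivity indicator. The helper names and the use of GA range as a sensitivity proxy are assumptions made for illustration.

```python
from typing import Dict, Tuple

# value -> (F-measure, GA), copied from Tables 6-8 (Manufacturing subset).
RESULTS: Dict[str, Dict[float, Tuple[float, float]]] = {
    "k": {5: (0.998, 0.832), 10: (0.988, 0.821), 15: (0.986, 0.817)},
    "tau": {0.1: (0.957, 0.399), 0.5: (0.998, 0.832), 0.9: (0.987, 0.798)},
    "theta": {3.5: (0.985, 0.761), 4.0: (0.988, 0.787), 4.5: (0.998, 0.832),
              5.0: (0.996, 0.826), 5.5: (0.979, 0.803)},
}


def best_setting(sweep: Dict[float, Tuple[float, float]]) -> float:
    """Value with the highest F-measure, using GA to break ties."""
    return max(sweep, key=lambda v: sweep[v])


def ga_spread(sweep: Dict[float, Tuple[float, float]]) -> float:
    """Difference between the best and worst GA in a sweep."""
    gas = [ga for _, ga in sweep.values()]
    return round(max(gas) - min(gas), 3)


best = {name: best_setting(sweep) for name, sweep in RESULTS.items()}
spread = {name: ga_spread(sweep) for name, sweep in RESULTS.items()}
```

On the reported numbers, the selected settings are $k=5$, $\tau=0.5$, and $\theta=4.5$. The GA range is largest for $\tau$ and smallest for $k$, which matches the sensitivity ordering described in the text.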

6. Threats to Validity

We identify the following major threats to validity: 1) Under-partitioning risk. ECLIPSE may under-partition, that is, merge different templates into one. However, we have found that the dynamic dictionary constructed with the LLM and Faiss helps considerably in reducing this risk; experiments on LogHub and ECLIPSE-Bench show that ECLIPSE performs well and that this problem rarely occurs in practice. 2) Insufficient attention to infrequent templates. This problem is easily overlooked in existing work, because frequent templates carry greater weight than infrequent ones in evaluation metrics. However, logs from infrequent templates may be more important in real industrial scenarios, because serious events often occur infrequently. This will be a key research direction for our future work.

7. Conclusion

In this paper, we propose a robust log parsing method called ECLIPSE. This method is designed to parse logs in a streaming manner, providing both accuracy and efficiency even in the most complex scenarios. When parsing a new log, ECLIPSE first quickly recalls a specified number of templates based on a dynamic library driven by LLM and Faiss, which has been proven effective through ablation experiments. Then, ECLIPSE utilizes our proposed Entropy-LCS to match the most suitable template and update it in real time. We evaluate ECLIPSE on both public LogHub and our ECLIPSE-Bench. The results show that while baselines perform well only on public datasets, ECLIPSE not only outperforms them on public datasets but also excels on industrial datasets. In terms of efficiency, ECLIPSE also has clear advantages, especially when the number of templates reaches a significantly high level, allowing it to parse logs quickly and accurately.
