OpsEval: A Comprehensive IT Operations Benchmark Suite
for Large Language Models

Yuhe Liu Tsinghua University, Changhua Pei Chinese Academy of Sciences, Longlong Xu, Bohan Chen, Mingze Sun Tsinghua University, Zhirui Zhang Beijing University of Posts and Telecommunications, Yongqian Sun, Shenglin Zhang Nankai University, Kun Wang Tsinghua University, Haiming Zhang, Jianhui Li, Gaogang Xie Chinese Academy of Sciences, Xidao Wen, Xiaohui Nie BizSeer, Minghua Ma Microsoft and Dan Pei Tsinghua University

Abstract.

Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential for maintaining the orderly and stable operation of existing information systems. According to Gartner’s prediction, the use of AI technology for automated IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, are showing great potential in the field of AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs in Ops tasks is yet to be determined. A comprehensive benchmark is required to optimize LLMs tailored for Ops (OpsLLM). In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs’ proficiency in various crucial scenarios at different ability levels. The benchmark includes 7184 multi-choice questions and 1736 question-answering (QA) questions in English and Chinese. By conducting a comprehensive performance evaluation of the current leading large language models, we show how various LLM techniques, including self-consistency, chain-of-thought, and few-shot in-context learning, affect performance on Ops tasks, and we discuss findings related to various topics, including model quantization, QA evaluation, and hallucination issues. To ensure the credibility of our evaluation, we invited dozens of domain experts to manually review our questions. At the same time, we have open-sourced 20% of the test QA to assist current researchers in preliminary evaluations of their OpsLLM models. The remaining 80% of the data, which is not disclosed, is used to eliminate the issue of test set leakage.
Additionally, we have constructed an online leaderboard that is updated in real-time and will continue to be updated, ensuring that any newly emerging LLMs will be evaluated promptly. Both our dataset and leaderboard have been made public. The data page is available at https://github.com/NetManAIOps/OpsEval-Datasets and the leaderboard is available at https://opseval.cstcloud.cn/content/leaderboard.

CCS Concepts: Networks → Network management; Computing methodologies → Artificial intelligence

1. Introduction

IT Operations (Ops) plays a crucial role in maintaining the efficient and stable operation of information systems such as cloud computing, 5G networks, and financial information systems. (Strictly speaking, 5G belongs to the field of communications technology (CT), but given its broad association with the information technology (IT) sector, for the sake of generality, we refer to it as IT operations, abbreviated as Ops, throughout the remainder of this paper.) As the Internet continues to expand rapidly, the scale and complexity of systems are escalating, leading to the emergence of artificial intelligence-assisted operations as a novel trend. Termed “AIOps” by Gartner (Lerner, 2017), this technique utilizes artificial intelligence to address tasks including, but not limited to, anomaly detection, fault analysis, generation of alert summaries, performance optimization, and capacity planning.

In recent years, large language models (LLMs) have witnessed significant advancements. The latest models, such as GPT-4V (OpenAI, 2023b), GPT-4 (OpenAI, 2023a), LLaMA-2 (et.al., 2023), and ChatGLM3 (202, 2024b), have demonstrated exceptional generalization and task planning capabilities. As a result, these models have provided numerous opportunities to enhance downstream domain-specific applications. With their advanced summarization, report analysis, and API-calling abilities, LLMs are well suited for Ops tasks such as answer generation, information summarization, and report analysis. Hereinafter, we refer to any LLM used for Ops as an OpsLLM, regardless of whether it has been optimized specifically for Ops.

Benchmarks exist for assessing general-purpose NLP-related capabilities, such as C-EVAL (Huang et al., 2023) and AGIEval (Zhong et al., 2023), and for specific domains, such as FinEval (Zhang et al., 2023) in the financial sector and CMB (Wang et al., 2023a) in the medical field. However, no benchmark exists that can evaluate the effectiveness of LLMs or OpsLLMs in Ops tasks. There is an urgent need for an Ops benchmark that informs us about the performance of current LLMs on Ops tasks. Moreover, a good benchmark can significantly aid the optimization process of OpsLLMs tailored for the Ops domain.

Nevertheless, due to the specialized nature of Ops tasks, constructing an Ops benchmark presents the following challenges:

• Ops data is primarily sensitive and proprietary to companies, with very little publicly available data, making it difficult for any single company to independently provide a sufficient amount of evaluation data to ensure confidence in the test results.
• There are many different subdomains (such as 5G communications, cloud computing, log analysis, and network management) and sub-tasks (such as network script generation and terminology explanation) within the Ops field. Different domains and tasks have different accuracy requirements for models. For example, network script generation requires high accuracy, whereas operational terminology explanation need not be as precise. Automatically annotating evaluation problems that have different accuracy requirements poses a significant challenge.
• Since existing LLMs are not explicitly trained for the Ops domain, the evaluation results are more sensitive to prompt engineering. Designing appropriate prompts for robust and accurate evaluation is challenging.
• Existing metrics like BLEU only consider the similarity of model output to ground-truth answers in natural language, which does not always indicate a good response in Ops tasks. Some terms and expressions in Ops scenarios have specific meanings that cannot be summarized in advance. Designing an automatic evaluation method that assesses the accuracy of Q&A at the semantic level is challenging.

To address these issues, we propose OpsEval, a comprehensive IT operations benchmark for LLMs. First, to tackle the challenge of benchmark data being mostly private and not publicly shareable, spread across various companies, we built a community around AIOps with currently 10 companies participating, allowing each entity in the community to continuously contribute a small amount of data. We then aggregate data under the same domain to ensure robustness in evaluation. Additionally, based on publicly available network management books, we generated both multi-choice and QA questions as supplements. To address the challenge of different accuracy requirements across multiple Ops scenarios and tasks, we employed model-based pre-clustering and manual review to annotate eight tasks and three abilities for independent evaluation. In response to the sensitivity of benchmark results to prompts, we systematically test model performance under self-consistency, chain-of-thought, and few-shot in-context learning. The prompts used in our evaluation are also disclosed in the paper. Lastly, to address the inaccuracy of existing metrics like BLEU in assessing the performance of question answering, we designed a three-dimensional evaluation metric that automatically assesses questions from the perspectives of fluency, accuracy and evidence. Experimental results show that our designed metrics align strongly with the annotations of operations experts.

The contributions of our paper are as follows:

• We propose OpsEval, which encompasses 8920 questions, and design a semi-automatic annotation method. OpsEval supports independent and robust evaluations across eight domains, providing guidance for model selection in the Ops domain.
• Based on OpsEval, we systematically tested over 17 mainstream LLMs, and the results show that the current models do not perform ideally on some Ops tasks, indicating a significant gap towards real-world applications.
• We designed a multi-dimensional evaluation metric based on GPT-4 (OpenAI, 2023a) whose consistency with expert annotations reaches 92%, replacing BLEU for automated domain-specific QA evaluations.
• We released an online leaderboard that updates the performance of newly emerging models on Ops tasks. Currently, the size of the evaluation dataset on this platform continues to grow.
• To assist researchers in preliminarily evaluating their OpsLLMs, we have carefully selected and released 20% of QAs from our benchmark, with the remaining 80% of undisclosed data preventing unfair evaluations due to data leakage (Wei et al., 2023b). We also provide an interface on our website where researchers can upload their models for evaluation against the full dataset.

2. Related Works

As LLMs evolve rapidly, their complex and varied capabilities are increasingly recognized, while traditional NLP metrics fall short of accurately evaluating these abilities. As a result, there is a growing trend towards proposing evaluation benchmarks tailored specifically for LLMs. These can be divided into two categories: general ability benchmarks and domain-specific benchmarks.

General ability benchmarks assess the general abilities of LLMs across various tasks. These tasks evaluate LLMs’ capacity for logical reasoning, general knowledge, common sense, and other similar abilities rather than being confined to a particular domain. HELM (Liang et al., 2022) employs seven distinct metrics in 42 unique scenarios, offering a comprehensive evaluation of LLMs’ capabilities across multiple dimensions. BIG-bench (Srivastava et al., 2022) comprises 204 tasks spanning a wide array of topics, with a particular focus on tasks deemed beyond the reach of current LLMs. C-Eval (Huang et al., 2023) is the first comprehensive Chinese evaluation suite designed to assess Chinese LLMs’ advanced knowledge and reasoning abilities rigorously. AGIEval (Zhong et al., 2023) curates authentic questions from examinations such as the Chinese College Entrance Exam (CCEE) and the SAT, constructing a fundamentally human-centric evaluation dataset. In a parallel endeavor, MMCU (Zeng, 2023) leverages questions from the CCEE and various professional exams to establish a robust comprehension benchmark. CG-Eval (Zeng et al., 2023) focuses on assessing the generation capabilities of LLMs, employing a testing framework that includes term definitions, short-answering, and computation problems.

Figure 1. The framework of OpsEval

Domain-specific benchmarks evaluate the abilities of LLMs to handle tasks in specific fields. These benchmarks require LLMs to possess specialized knowledge in a specific domain and to respond in a manner consistent with the cognitive patterns of that field. Despite the rapid progression of LLMs in specialized domains, the evaluation metrics for these specific areas have received less attention. FinEval (Zhang et al., 2023) is a benchmark explicitly designed to measure the advanced financial knowledge of Chinese LLMs. MultiMedQA (Singhal et al., 2022) is an extensive medical question-answering dataset, with questions derived from professional medical exams, research, and consultation records. Huatuo-26M (Li et al., 2023) comprises actual medical consultation records and medical knowledge question-answering content. CMB (Wang et al., 2023a) includes multi-choice questions (CMB-Exam) and complex clinical questions based on real case studies (CMB-Clin), with the correct answers established through expert consensus. NetOps (Miao et al., 2023) focuses on evaluations in the network field, which is relevant to the field of Ops. NetOps includes multi-choice questions in both English and Chinese and a few fill-in-the-blank and question-answering questions. However, it only focuses on wired network operations. In contrast, our research encompasses various Ops sub-domains. Furthermore, we have delineated the categorization of tasks and abilities, thereby providing a more nuanced and comprehensive assessment framework. As Ops problems in the real world often involve full-stack IT technologies, we believe OpsEval’s broader scope will provide significantly more benefits than NetOps.

3. OpsEval Benchmark

Figure 1 shows the overall framework of OpsEval from construction to evaluation. We first collected data from multiple sources, then enhanced and assured the quality of the collected questions through a question curation process. In the third step, we standardized the format of questions that had undergone manual review. Finally, we evaluated the leading large language models on the dataset using various prompt engineering techniques. The following subsections provide detailed descriptions of each step.

3.1. Data Collection

Our benchmark questions have been collected from various sources; we summarize them into four categories: company materials, certification exams, Ops textbooks, and automated generation. Each source is highly esteemed globally and reviewed by our collaborators.

Company Materials include production environment materials like Ops tickets and error logs, as well as internal documents and tests for Ops staff training. We have established cooperative relationships with 10 companies and received materials from them. The companies and enterprises we collaborate with cover various sectors, including Internet, telecommunications, finance, and Ops service/tool providers. Each company has designated experts to work with us to select questions. Information about the companies and experts can be found in the appendix.

Certification Exams include the knowledge assessments required to become Ops staff and are naturally in the form of multiple-choice and short-answer questions. We obtained the relevant study guidebooks for these certification exams from public book websites and extracted sample questions from them as one of the sources for Ops questions.

Operations Textbooks. We first constructed a seeding keyword list for the Ops field and searched for related books. The textbooks contain relatively complete knowledge content, which can provide experts with materials for question creation, and some books themselves also include a certain number of exercises at the end of the chapters.

Besides the sources described above, to enhance the diversity and depth of our test set, we also source question-answering questions from authoritative books covering a range of Ops domains by extracting textbook contents and providing them to GPT-4. In Sec 6, we discuss the methodology and challenges in Automated Generation, hoping to provide some experience in Ops benchmark creation.

By the end of the data collection process, we have collected 21250 questions in total, which are to be further selected and curated.

3.2. Quality Enhancement

We systematically carried out the processing of our original test set in the following stages:

Deduplication: Any repeated or highly similar questions are identified and removed to avoid redundancy in the test set. We calculate the cosine similarity of the question stems to detect duplicate questions and identify pairs of questions with a similarity above a certain threshold (th=0.7). These pairs are later manually reviewed to confirm if they are duplicates.
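The deduplication step above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the paper does not say how question stems are vectorized, so the sketch assumes simple bag-of-words term-frequency vectors, and the function names (`tokenize`, `cosine_similarity`, `candidate_duplicates`) and sample stems are hypothetical. Only the threshold of 0.7 comes from the text.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Lowercase word tokens (assumption: the paper does not specify a tokenizer).
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    # Cosine similarity between two bag-of-words term-frequency vectors.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_duplicates(stems, threshold=0.7):
    # Index pairs whose similarity exceeds the threshold; these go to manual review.
    return [
        (i, j)
        for i, j in combinations(range(len(stems)), 2)
        if cosine_similarity(stems[i], stems[j]) > threshold
    ]


# Illustrative stems, not taken from the OpsEval dataset.
stems = [
    "Which protocol is used to assign IP addresses automatically?",
    "Which protocol is used to automatically assign IP addresses?",
    "What is the default port of SSH?",
]
print(candidate_duplicates(stems))
```

The pairwise comparison is quadratic in the number of questions; for a collection of roughly twenty thousand stems, an approximate nearest-neighbor index would likely be preferable, but the flagged pairs would still be confirmed by reviewers as described above.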

Dependence Filtering: Since OpsEval primarily focuses on the ability of large language models to understand Ops knowledge, we have filtered out questions that rely on external images or document content to ensure the completeness of the question content itself.

Question Categorization: In the complex landscape of Ops, recognizing the multidimensional nature of tasks is essential. We devise a categorization that captures many tasks that professionals confront in practical applications. The categorization process consists of two steps: automated screening and manual review.

We first use GPT-4 for topic modeling to gain rough insights about the dataset, leveraging its capabilities to determine the relevance of each question to Ops. The topic modeling resulted in more than 20 tasks but had an imbalanced distribution. The prompt we used to let GPT-4 do the categorization is shown in Figure 2(a).

Secondly, during the manual review process (described later in this section), we involved dozens of experts in manual screening. By merging similar tasks and labeling different ability levels, we categorize the questions into eight tasks and three ability levels, respectively.

• Task Categorization: Our eight distinct Ops tasks are formulated by industry relevance, task frequency, and the significance of each area in Ops. Details of the eight tasks can be found in Appendix A.2.
• Ability Categorization: Based on which ability is required to answer them, questions are classified into three categories: Knowledge Recall, Analytical Thinking, and Practical Application, reflecting the challenges professionals might encounter in real-world scenarios. The three abilities are described in Appendix A.3.

Table 1. The distribution of different tasks and abilities of questions in OpsEval.

	Category	Percentage (%)
Task	Automation Scripts	3.3
	Monitoring and Alerting	5.2
	Performance Optimization	5.3
	Software Deployment	7.9
	Fault Analysis and Diagnostics	13.7
	Network Configuration	29.0
	General Ops Knowledge	20.2
	Miscellaneous	15.5
Ability	Knowledge Recall	49.8
	Analytical Thinking	39.9
	Practical Application	10.2

The distribution of the questions across these eight tasks and three ability levels is depicted in Table 1.

Manual Review: In the manual review step, we asked Ops experts from the industry to inspect the results of the previous three automated steps, including confirming duplicate and invalid questions and examining the classification results of GPT-4. Experts were also asked to drop the questions unrelated to the Ops field. We split the dataset into n folds and ensure that each fold is reviewed by at least two experts. By relying on experts’ consensus, we ensure the reliability and authority of the dataset after automated processing.

Table 2. The number of questions in OpsEval, grouped by their scenarios.

Scenario	Type	Questions
Wired Network	Multi-Choice	3901
5G Communication	Multi-Choice	2615
5G Communication	Question-Answering	1162
Oracle Database	Multi-Choice	497
Log Analysis	Question-Answering	420
DevOps	Question-Answering	154
Securities Information System	Multi-Choice	91
Hybrid Cloud	Multi-Choice	40
Financial IT	Multi-Choice	40
Total	Multi-Choice	7184
Total	Question-Answering	1736

As listed in Table 2, this quality enhancement process resulted in a refined test set of 7184 multi-choice and 1736 question-answering questions.

3.3. Question Formatting

After quality enhancement of the collected questions, we perform question formatting for the upcoming evaluation. Each question is structured to include the context, answer, task, and ability. Figure 2 shows three examples of the formatted questions. This structured approach facilitates easy evaluation and analysis.
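As a rough illustration of such a formatted record, the sketch below stores the four fields named above in a small data class and serializes it to JSON. The class name `OpsQuestion`, the JSON serialization, and the sample question are assumptions made for illustration; the actual on-disk format of OpsEval is not described in this section.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OpsQuestion:
    context: str  # question stem, including the options for multi-choice questions
    answer: str   # ground-truth option or reference answer
    task: str     # one of the eight Ops tasks
    ability: str  # one of the three ability levels


# Illustrative question, not taken from the OpsEval dataset.
q = OpsQuestion(
    context="Which command displays the routing table on Linux?\n"
            "A. ip route\nB. ls\nC. ps\nD. df",
    answer="A",
    task="Network Configuration",
    ability="Knowledge Recall",
)

record = json.dumps(asdict(q), ensure_ascii=False)
print(record)
```

Keeping the task and ability labels on every record is what allows results to be grouped per task and per ability later without re-annotating the questions.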

3.4. Evaluation Settings

In our evaluation of LLMs within the Ops domain, we categorize our assessment questions as multi-choice and question-answering.

Multi-choice questions offer a structured approach with definitive answers. These questions are straightforward and provide a clear metric for assessment. However, given the intricacies of advanced models, they may be influenced by the options provided, leading to their responses being driven more by pattern recognition than by a proper understanding of the content.

We use accuracy as the metric. The output of LLM may contain additional strings besides the chosen option, even if we make explicit requests and use few-shot in the prompt. Therefore, we use a choice-extracting function based on regular expressions to extract the predicted answer of LLMs. Then, we calculate the accuracy based on the extracted answer and the ground-truth labels.
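A choice-extracting function of this kind might look like the sketch below. The regular expression and the helper names `extract_choice` and `accuracy` are assumptions, since the paper does not give its pattern; the sketch also assumes single-answer questions with options A to D and English outputs (in Python, `\b` does not separate a Latin letter from an adjacent CJK character, so Chinese outputs would need a different pattern).

```python
import re


def extract_choice(output):
    # Return the first standalone option letter (A-D) in the output, or None.
    # Caveat: a leading English article "A" would also match this simple pattern.
    match = re.search(r"\b([A-D])\b", output)
    return match.group(1) if match else None


def accuracy(outputs, labels):
    # Fraction of outputs whose extracted choice equals the ground-truth label.
    correct = sum(extract_choice(o) == y for o, y in zip(outputs, labels))
    return correct / len(labels)


# Illustrative model outputs, not taken from the evaluation.
outputs = ["The answer is B.", "C", "Answer: (D) because ...", "I am not sure."]
labels = ["B", "C", "A", "A"]
print(accuracy(outputs, labels))
```

An output from which no option can be extracted is simply counted as wrong here, which is one plausible convention among several.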

Question-answering questions do not come with predefined options. They require the model to rely more on its comprehension and knowledge base, offering a clearer insight into its cognitive capabilities. Such questions can better assess LLMs’ ability to generate coherent and contextually relevant responses.

We use two types of metrics for question-answering questions: one is based on word overlap, and the other on semantic similarity. For the first type, we use ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), which are widely used in NLP tasks, especially in translation. For the second type, we use GPT-4 and experts to score the outputs of LLMs; these scores, called GPT4-Score and Expert-Evaluation, are designed explicitly for OpsEval. We provide a more detailed description of these metrics in Appendix A.5.
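To see why word-overlap metrics alone can mislead in Ops settings, consider the simplified unigram recall below. It is written in the spirit of ROUGE-1 recall but is not the official ROUGE implementation; the function name and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    # Share of reference tokens that also appear in the candidate (clipped counts).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[token], count) for token, count in ref.items())
    return overlap / sum(ref.values())


reference = "restart the nginx service"
# Opposite meaning, yet every reference word is present.
contradicting = "do not restart the nginx service"
# A loosely related paraphrase that shares only one word with the reference.
paraphrase = "reload the web server daemon"

print(unigram_recall(contradicting, reference))
print(unigram_recall(paraphrase, reference))
```

The contradicting answer receives a perfect overlap score while the paraphrase scores low, which is the kind of mismatch that motivates the semantic metrics described above.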

For GPT4-Score, we provide GPT-4 with the question and the reference answer, followed by one anonymous model’s output. Figure 2(b) shows the prompt we used to ask GPT-4 to score.

We design an expert evaluation for manually scoring LLMs’ outputs based on three criteria highly related to the needs of Ops. The three criteria are as follows:

• Fluency. Assessment of the linguistic fluency in the model’s output and compliance with the question-answering question’s answering requirements.
• Accuracy. Evaluation of the precision and correctness of the model’s output, including whether it adequately covers key points of the ground-truth answer.
• Evidence. Examination of whether the model’s output contains sufficient argumentation and evidential support to ensure the credibility and reliability of the answer.

For each LLM output to a question, we asked experts to assign a score between 0 and 3 for each criterion. During the scoring, the raw question, the detailed answer and its key points, and the output of an anonymous model are given at each iteration. Since avoiding bias towards specific models is important, no information about the model is given.

Fairness Consideration. We are currently making 20% of the data available to the public for Ops community contribution and research purposes; for fairness of the evaluation, however, the complete version of the OpsEval dataset is kept private and not disclosed. To evaluate a new model, users can submit a Docker image together with an initialization script that is run when a container is started from the image. We run the evaluation automatically, and the result is made available on the OpsEval website. Users can choose whether to disclose their results on the OpsEval leaderboard.

4. Experiment Design

In this section, we will show the experiment design of OpsEval. We evaluate various LLMs on OpsEval, aiming to understand the multiple abilities of different LLMs in addressing different question types (multi-choice and question-answering questions) and tasks. We also evaluate LLMs with different quantization parameters.

4.1. Models

Table 3. Models evaluated in this paper. The “Access” column shows whether we have full access to the model weights or can only access them through an API.

Model	Creator	#Parameters	Access
GPT-4	OpenAI	undisclosed	API
GPT-3.5-turbo	OpenAI	undisclosed	API
ERNIE-Bot-4.0	Baidu	undisclosed	API
GLM4	Tsinghua Zhipu	undisclosed	API
GLM3-turbo	Tsinghua Zhipu	undisclosed	API
LLaMA-2	Meta	7/13/70B	Weights
Qwen-Chat	Alibaba Cloud	7/14/72B	Weights
InternLM2-Chat	Shanghai AI Laboratory	7/20B	Weights
DevOps-Model-Chat	CodeFuse	14B	Weights
Baichuan2-Chat	Baichuan Intelligence	13B	Weights
ChatGLM3	Tsinghua Zhipu	6B	Weights
Mistral	Mistral	7B	Weights

Table 4. GPTQ models for LLaMA-2-70B

Model	Size	GPTQ Dataset	Description
LLaMA-2-70B	140GB	/	Raw LLaMA-2-70B model.
LLaMA-2-70B-Int4	35.33GB	wikitext	4-bit quantization model.
LLaMA-2-70B-Int3	26.78GB	wikitext	3-bit quantization model.

As shown in Table 3, we evaluate popular LLMs of different parameter sizes from different organizations. The detailed information of all LLMs in Table 3 can be found in Appendix B.1.

Besides, we evaluate LLaMA-2-70B with multiple quantization parameters to get an overview of the effect of different quantization parameters. Specifically, we use GPTQ (Frantar et al., 2022) models with 3-bit and 4-bit quantization parameters. The two quantized models are downloaded from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ, and the size of the raw LLaMA-2-70B model is calculated based on 70B parameters. The two GPTQ models evaluated in our experiments are calibrated on wikitext (Merity et al., 2016), a language modeling dataset extracted from the set of verified Good and Featured articles on Wikipedia. GPTQ is a post-training quantization (PTQ) method that makes the model smaller using a calibration dataset. It is a one-shot weight quantization method based on approximate second-order information that is both highly accurate and efficient. The details of the GPTQ models can be found in Table 4.
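The size figures in Table 4 can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes that the 140GB figure corresponds to 16 bits per weight and that 1GB is 10^9 bytes; both are assumptions, as is the helper name `weight_size_gb`.

```python
def weight_size_gb(num_params, bits_per_weight):
    # Raw weight storage in gigabytes (10^9 bytes), ignoring any metadata.
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # 70B parameters, as stated in the text

for bits in (16, 4, 3):
    print(bits, weight_size_gb(NUM_PARAMS, bits))
```

Under these assumptions the estimates are 140GB, 35GB, and 26.25GB. The reported 35.33GB and 26.78GB are slightly larger, which would be consistent with quantization metadata and layers kept at higher precision, although the paper does not break this down.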

4.2. Prompting Techniques

To get a comprehensive overview of the performance of popular LLMs on OpsEval, we use various settings to perform the evaluation. We evaluate LLMs in zero and few-shot settings (3-shot in our implementation). For each setting, we evaluate LLMs in four sub-settings of prompt engineering, that is, naive answers (Naive), self-consistency (SC (Wang et al., 2023b)), chain-of-thought (CoT (Wei et al., 2023a)), self-consistency with chain-of-thought (CoT+SC). We design prompts for both English and Chinese languages.

Naive. The Naive setting is to expect the LLMs to generate the answer without any other explanations. Since we have the task type of each question, we integrate the task into the prompt.

SC. Self-consistency selects the most consistent answer among several queries to an LLM. Although it aims “to replace the naive greedy decoding used in chain-of-thought prompting,” it can also be applied to naive answers, since an LLM may generate different answers to the same prompt. We set the number of queries in SC to 5.
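The selection step of self-consistency amounts to a majority vote over the sampled answers. The sketch below is a minimal version under the assumption that each of the 5 queries has already been reduced to an option letter (or `None` when no option could be extracted); the function name and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
from collections import Counter


def self_consistent_answer(answers):
    # Majority vote over sampled answers; ties go to the answer seen first.
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Five sampled answers to the same prompt (illustrative values).
samples = ["B", "C", "B", "B", "A"]
print(self_consistent_answer(samples))
```

When a model returns the same answer on every query, the vote cannot change the result, so the SC score equals the single-query score.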

CoT. The CoT setting aims to enable LLMs to obtain complex reasoning capabilities through intermediate reasoning steps. We construct specific prompts for CoT setting in both zero and few-shot evaluations. Details of the prompt construction can be found in Appendix B.2.

CoT+SC. We combine CoT and SC to boost the performance of CoT prompting. With SC, we issue the same query several times and choose the most consistent answer among the resulting reasoning paths. Like the SC setting, we set the number of queries in CoT+SC to 5.

As for question-answering problems, we combine each question’s task type and ability and the question as the prompt for LLMs. Figure 9 in Appendix A.4 shows an example of constructing the prompt.

5. Evaluation

5.1. Benchmark Leakage Test

For a benchmark suited for LLMs, it is necessary to ensure that the test set has not been included in the model’s pretraining process. Otherwise, the potential bias may damage the fairness of the benchmark. We adapted the methodology from Skywork (Wei et al., 2023b) to perform a leakage test on OpsEval’s dataset. We evaluate the language modeling (LM) loss on samples (a sample is a concatenation of question and answer) from different datasets for several foundation models and calculate the average loss. For each dataset, we compare the LM loss on the test split ($L_{test}$) and on a specially curated reference set ($L_{ref}$) generated by GPT-4, designed to mimic the testing dataset. We define a key metric, $\Delta L=L_{test}-L_{ref}$, which serves as an indicator of potential test data leakage during the training of the LLM: a lower value suggests possible leakage. Note that while Skywork (Wei et al., 2023b) only asked GPT-4 to generate questions similar to GSM8K, we require GPT-4 to rewrite each question while preserving its original meaning and accuracy. Because the original meaning is preserved, a low $\Delta L$ on the test set suggests that the LLM’s lower $L_{test}$ originates from over-fitting the distribution of tokens in the test set rather than from internalizing the knowledge behind the questions, thereby suggesting a leakage of the test set.
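The metric itself is a difference of two averages. The sketch below assumes the per-sample LM losses have already been computed by some model; the loss values are made up for illustration, and the helper names are hypothetical.

```python
def mean(values):
    # Arithmetic mean of a non-empty list of per-sample losses.
    return sum(values) / len(values)


def delta_loss(test_losses, ref_losses):
    # Delta L = L_test - L_ref, each being the average per-sample LM loss.
    return mean(test_losses) - mean(ref_losses)


# Hypothetical per-sample losses on original and GPT-4-rewritten questions.
test_losses = [2.0, 2.2, 1.8]
ref_losses = [1.9, 2.0, 1.8]

print(delta_loss(test_losses, ref_losses))
```

A clearly negative value would mean that the model finds the original wording easier to predict than an equivalent rewrite, which is the pattern associated with leakage in the text above.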

Table 5. Measurement of potential test data leakage during the training of LLM. This demonstrates the unbiased nature and non-leakage of the OpsEval test set.

Dataset	$L_{test}$	$L_{ref}$	$\Delta L$
Alpaca	1.994033	2.354260	-0.360228
Alpaca-GPT4	1.498862	1.763663	-0.391062
CEval	2.570809	2.309943	0.260866
MMLU	2.547598	2.189870	0.357728
OpsEval	1.885437	1.728079	0.105095

Table 5 shows the results of leakage measurement. In addition to the two standard evaluation benchmarks (CEval (Huang et al., 2023) and MMLU (Hendrycks et al., 2021)), we conducted the same experiments on the Alpaca dataset (Taori et al., 2023) and the Alpaca-GPT4 dataset (Peng et al., 2023), which are likely used in the pre-training of large models, using their $\Delta L$ as reference. The corpora likely involved in training show a significantly smaller (negative) $\Delta L$, whereas $\Delta L$ for the OpsEval dataset remains at a relatively small positive value. This demonstrates the unbiased nature and non-leakage of the OpsEval test set. The models we used in the leakage test are listed in Appendix B.1.

5.2. Overall Performance

Table 6. LLMs’ overall performance on Wired Network Operations English test set (3-shot). Models are ranked based on their best performance (marked as bold) among different settings.

Model	Naive	SC	CoT	CoT+SC
GPT-4	/	/	88.70	/
GLM-4	64.77	64.77	77.06	77.06
GPT-3.5-turbo	68.30	68.30	70.90	72.50
Qwen-72B-Chat	70.32	70.32	70.13	70.22
ERNIE-Bot-4.0	60.00	60.00	70.00	70.00
LLaMA-2-70B	55.00	56.20	66.80	67.20
DevOps-Model-14B-Chat	63.85	61.96	41.15	44.01
GLM-3-turbo	59.53	59.53	63.65	63.65
Qwen-14B-Chat	62.60	59.70	50.58	55.88
LLaMA-2-13B	53.30	53.00	56.80	61.00
InternLM2-Chat-20B	60.48	60.48	45.10	45.10
LLaMA-2-7B	48.20	46.80	52.00	55.20
Qwen-7B-Chat	52.10	51.00	48.30	49.80
Baichuan2-13B-Chat	51.90	51.60	44.50	47.45
InternLM2-Chat-7B	48.20	48.20	49.74	49.74
Mistral-7B	47.22	47.22	45.58	45.58
ChatGLM3-6B	42.10	42.10	43.47	43.47
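The ranking rule stated in the caption of Table 6 (order models by their best score across the four settings) can be reproduced with a few lines. The sketch copies four rows from the table and represents a setting that was not run (“/”) as `None`; the helper name `best_score` is hypothetical.

```python
def best_score(row):
    # Best accuracy across settings; None marks a setting that was not run ("/").
    return max(v for v in row.values() if v is not None)


# Four rows copied from Table 6 (3-shot, Wired Network Operations, English).
scores = {
    "LLaMA-2-70B": {"Naive": 55.00, "SC": 56.20, "CoT": 66.80, "CoT+SC": 67.20},
    "GPT-4": {"Naive": None, "SC": None, "CoT": 88.70, "CoT+SC": None},
    "DevOps-Model-14B-Chat": {"Naive": 63.85, "SC": 61.96, "CoT": 41.15, "CoT+SC": 44.01},
    "GLM-4": {"Naive": 64.77, "SC": 64.77, "CoT": 77.06, "CoT+SC": 77.06},
}

ranking = sorted(scores, key=lambda m: best_score(scores[m]), reverse=True)
print(ranking)
```

Ranking by the best setting explains why DevOps-Model-14B-Chat sits above models with higher CoT scores: its best result comes from the Naive setting.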

The results of the few-shot evaluation with four settings on the Wired Network Operations test set are shown in Table 6. Results of the other scenarios and settings are shown in Appendix B.3. Due to time, cost, and API rate limits, we evaluate GPT-4 only in the 3-shot CoT setting, where it serves as an upper-bound reference for all LLMs. From the overall performance results, we draw several findings.

On both English and Chinese questions, GPT-4 consistently outperforms all other models, surpassing even their best results across settings. LLMs with larger parameter sizes, such as Qwen-72B-Chat, ERNIE-Bot-4.0, and LLaMA-2-70B, approach the performance of ChatGPT when employing the self-consistency (SC) and chain-of-thought (CoT) prompting methods. Smaller models, such as DevOps-Model-14B-Chat and Qwen-14B-Chat, exhibit competitive performance in multi-choice questions, approaching the capabilities of models with 70B parameters, thanks to their fine-tuning process and the quality of their training data.

Furthermore, the effectiveness of the four prompt settings varies across different LLMs. We examine LLMs’ zero-shot and few-shot performances under four settings mentioned earlier in Sec. 4 for both English and Chinese test sets. From the evaluation results, we can conclude the following observations:

(1)

For most models, the performance improves from the Naive setting to SC, CoT, and CoT+SC. Notably, few-shot performance is better than zero-shot performance.
(2)

Among these settings, CoT prompts yield the most significant improvement in LLMs’ answering capability. SC prompts result in relatively minor improvements. LLMs’ responses are consistent across repeated questions, aligning with the desired outcome in operational tasks where reliability and consistency are essential.
(3)

LLMs fine-tuned specifically for Chinese exhibit better performance on English and Chinese test sets than LLMs without Chinese fine-tuning. We discuss further insights into these observations in Appendix B.5.
(4)

In a few cases, more advanced evaluation methods surprisingly lead to poorer results. The detailed analysis for this can be found in Appendix B.6.1.
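The SC setting discussed above can be illustrated with a minimal sketch of majority voting over repeated samples. The function name `self_consistent_answer`, the `sample_answer` callable, and the default of five samples are assumptions made for illustration; they are not taken from the OpsEval implementation.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(
    sample_answer: Callable[[str], str],  # hypothetical: one LLM call returning an option letter
    prompt: str,
    n_samples: int = 5,  # assumed sample count, not specified in this section
) -> str:
    """Query the model several times and return the most frequent answer."""
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    # most_common(1) returns the answer with the highest vote count;
    # ties are resolved in favor of the answer seen first.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # A stand-in sampler that replays fixed answers instead of calling a model.
    replies = iter(["A", "B", "A", "C", "A"])
    print(self_consistent_answer(lambda _prompt: next(replies), "Which option?"))
```

If a model returns the same answer on every sample, the vote cannot change the result, which is consistent with the observation that SC brings only minor improvements for highly consistent LLMs.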

5.3. Performance on Different Tasks and Abilities

To investigate how LLMs perform in each Ops task and to what extent they possess the abilities of Knowledge Recall, Analytical Thinking, and Practical Application, we summarize the results of LLMs of different parameter sizes. We group LLMs’ performance based on the task and ability classification mentioned in Sec. 3.2 and plot them on two radar charts in Figure 4.

In terms of the eight tasks we tested, LLMs generally yield higher accuracy in General Knowledge tasks, while their performance drops and varies drastically in highly specialized tasks like Automation Scripts and Network Configuration, reflecting the impact of specialized corpus and domain knowledge on the performance of LLMs.

Among the three abilities, LLMs perform the best in Practical Application, followed by Knowledge Recall. It is expected that LLMs perform poorly in Analytical Thinking questions, as accurately deducing conclusions from existing facts remains a challenging research topic for LLMs. LLMs perform best in Practical Application because the LLMs we test are trained on corpora that include best Ops practices, familiarizing them with solutions to many real-world tasks.

By grouping LLMs by their parameter size, we find that although LLMs with 10B-20B parameters have higher accuracy in their best cases compared with LLMs with no more than 7B parameters, different 10B-20B LLMs’ performance varies drastically, sometimes even lower than that of 7B. LLMs with no more than 7B parameters, on the other hand, have a more stable performance range within the group.

5.4. Performance on Question-Answering

Table 7. LLMs’ performance on English network operations question-answering problems. GPT4 for GPT4-Score, FL for Fluency, AC for Accuracy, EV for Evidence, Total is the sum of the previous three columns.

Model	ROUGE	BLEU	GPT4	Expert Evaluation
				FL	AC	EV	Total
GPT-3.5-turbo	12.26	6.78	8.47	3.00	1.96	1.20	6.16
LLaMA2-70B	7.74	4.20	7.28	2.92	1.48	1.32	5.72
LLaMA2-13B-Chat	4.98	3.43	7.16	2.82	1.34	1.62	5.78
Baichuan2-13B-Chat	4.76	0.35	5.85	2.40	1.12	1.02	4.54
Qwen-7B-Chat	11.82	4.33	5.63	2.56	1.14	0.84	4.54
ChatGLM3-6B	9.71	5.07	4.88	2.84	0.76	0.76	4.36
InternLM2-7B-Chat	13.27	0.54	4.52	1.80	0.70	0.10	2.60

Table 8. Correlation coefficients between Expert-Evaluation Total and other metrics

Metric	GPT4-Score	BLEU-Score	ROUGELsum
Correlation coefficient	0.9211	0.6108	-0.4559

Table 9. Correlation coefficients between GPT4-Score and sub-metrics of Expert-Evaluation

Metric	Fluency	Accuracy	Evidence	Total
Correlation coefficient	0.8700	0.9084	0.7978	0.9211

Table 7 presents the evaluation results on 200 English question-answering questions across four metrics: ROUGE, BLEU, GPT4-Score, and Expert-Evaluation, sorted by GPT4-Score. We conducted question-answering tests on GPT-4 to verify its capabilities before using it for scoring. Since GPT-4 is currently the most capable among large models, its performance on question-answering questions surpasses that of other models. However, we did not include it in the table to avoid bias and misunderstanding.

The rankings based on ROUGE and BLEU scores do not align well with GPT4-Score and Expert-Evaluation, as shown in Table 8. LLMs with poor performance may still generate the expected keywords, resulting in higher ROUGE and BLEU scores. In contrast, LLMs with good performance might receive lower ROUGE/BLEU scores due to differences in wording compared to the standard answers.

Regarding GPT4-Score, the rankings closely resemble those based on Expert-Evaluation. In Table 9, we calculate the correlation coefficients between GPT4-Score and different sub-metrics of Expert-Evaluation to gain more insights. Among the three sub-metrics, rankings of GPT4-Score align most closely with the Accuracy metric, suggesting that GPT-4 is most reliable on factuality thanks to its vast knowledge base. The format and length of the generated content also heavily influence GPT4-Score, as suggested by the high positive correlation between GPT4-Score and Fluency. On the other hand, there are more discrepancies in rankings concerning the Evidence metric, indicating that GPT4-Score needs to consider the role of arguments and evidence in cases where answers are ambiguous.
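As an illustration of how such a coefficient can be computed, the sketch below calculates the Pearson correlation between the GPT4 and Expert-Evaluation Total columns of Table 7. This is an assumption-laden example: the text does not state which correlation coefficient is used or whether it is computed per model or per question, so the value obtained here from the seven per-model rows need not match the figure reported in Tables 8 and 9.

```python
from math import sqrt
from typing import Sequence

# (GPT4-Score, Expert-Evaluation Total) per model, copied from Table 7.
TABLE_7 = {
    "GPT-3.5-turbo": (8.47, 6.16),
    "LLaMA2-70B": (7.28, 5.72),
    "LLaMA2-13B-Chat": (7.16, 5.78),
    "Baichuan2-13B-Chat": (5.85, 4.54),
    "Qwen-7B-Chat": (5.63, 4.54),
    "ChatGLM3-6B": (4.88, 4.36),
    "InternLM2-7B-Chat": (4.52, 2.60),
}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


gpt4_scores = [row[0] for row in TABLE_7.values()]
expert_totals = [row[1] for row in TABLE_7.values()]
r_model_level = pearson(gpt4_scores, expert_totals)

if __name__ == "__main__":
    print(f"Model-level Pearson r: {r_model_level:.4f}")
```

The same `pearson` helper can be applied to the FL, AC, and EV columns to compare GPT4-Score against each expert sub-metric.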

In the Expert-Evaluation, where Evidence is a significant criterion, LLMs with more elaborate arguments can outperform GPT-3.5-turbo in total scores even when their Accuracy scores are much lower than the latter.

5.5. Performance on Different Quantization parameters

Figure 6 shows the accuracy of LLaMA-2-70B under different quantization parameters on English multi-choice questions. We do both few-shot and zero-shot evaluations with the naive setting. Using quantization during inference degrades performance.

LLaMA2-70B-Int4 can achieve an accuracy close to LLaMA-2-70B without quantization. Specifically, on English multi-choice questions, the accuracy of the GPTQ model with 4-bit quantization parameters is 3.50% lower in zero-shot evaluation and 0.27% in few-shot evaluation compared to LLaMA-2-70B. As for Chinese questions, the accuracy of LLaMA2-70B-Int4 is 3.67% lower in zero-shot evaluation and 5.18% in few-shot evaluation compared to LLaMA-2-70B.

However, LLaMA2-70B-Int3 has a performance degradation that cannot be ignored, as shown in Figure 6. On average, the accuracy of LLaMA2-70B-Int3 has a 12.46% degradation compared to LLaMA-2-70B and a 9.30% degradation compared to LLaMA2-70B-Int4. A possible reason is that 3-bit quantization discards too much of the information held in the full-precision model.
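The intuition that fewer bits lose more information can be sketched with a toy experiment. Note that this example uses plain uniform round-to-nearest quantization on synthetic Gaussian weights, which is a deliberately simpler technique than the GPTQ method used in the evaluation; the function names and the synthetic data are assumptions for illustration only.

```python
import random
from typing import Dict, List


def quantize_rtn(weights: List[float], bits: int) -> List[float]:
    """Uniform round-to-nearest quantization onto 2**bits levels (not GPTQ)."""
    steps = 2 ** bits - 1  # number of intervals between the levels
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / steps
    # Snap each weight to the nearest grid point, then map back to floats.
    return [lo + round((w - lo) / scale) * scale for w in weights]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Synthetic stand-in for a weight tensor; real model weights are not used here.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
errors: Dict[int, float] = {
    bits: mse(weights, quantize_rtn(weights, bits)) for bits in (3, 4, 8)
}

if __name__ == "__main__":
    for bits, err in errors.items():
        print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

Going from 4 bits to 3 bits roughly halves the number of representable levels, so the reconstruction error of the weights grows noticeably, which is in line with the sharper accuracy drop observed for the Int3 model.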

6. Discussion

6.1. Implications of the evaluation results

We summarize our key observations from the evaluation here:

•

Few-shot and CoT can significantly increase performance if the model is tuned to adapt to these techniques, while SC may have little influence on highly consistent LLMs.
•

LLMs perform best in general knowledge while deducing specialized conclusions remains challenging.
•

GPT4-Score is suitable as an automatic metric in large-scale qualitative evaluations.
•

Quantization with more than 3 bits can be an effective way to reduce computation and memory costs while preserving performance.

Overall, LLMs yield evaluation results on the OpsEval benchmark generally lower than those in general domains like MMLU (Hendrycks et al., 2021) and CEval (Huang et al., 2023). This comparison highlights the necessity of explicitly fine-tuning OpsLLM for the Ops field.

Our tests on the model with various prompt engineering techniques indicate that prompt engineering significantly impacts the effectiveness of eliciting operations knowledge from the model. This suggests that further research into prompt engineering is needed to enhance the performance of large models in this specific vertical domain.

6.2. Automated QA generation for question-answering questions

During the data collection process, we have considered automating question-answer generation. First, we sampled the question-answer pairs and manually assessed their accuracy and domain relevance. Later, we used typical manual evaluation examples for few-shot learning, enabling GPT to evaluate question-answer pairs based on our evaluation criteria automatically.

Directly generated question-answers tend to be simple judgment or concept questions rather than reasoning questions that better demonstrate the model’s capabilities and knowledge density. Our goal is to ensure that while the topics of the questions remain relevant to the seed questions, their specific content is distinct from the original questions. By maintaining the overarching framework in the Ops domain, we can expand the number and types of questions, enabling a more comprehensive evaluation of model capabilities. Additionally, we can incorporate external knowledge during the data generation, continually enhancing our ability to evaluate new content.

6.3. Examination Methodology of Hallucination Issues in LLM within OpsEval

Recognizing the necessity for LLMs to alleviate the hallucination problem, we explore the corresponding evaluation methodology.

Our approach involves both multi-choice and question-answering evaluations. Multi-choice questions assess the LLMs based on their knowledge and diverse Ops capabilities. However, to effectively uncover hallucinations, we rely on question-answering evaluation. This involves expert review, where outputs are scrutinized against three main criteria: Fluency, Accuracy, and Evidence. The Evidence criterion, in particular, is pivotal in highlighting hallucination issues. It underscores the significance of LLMs providing evidence or a reasoning process to support their outputs.

The intersection of hallucination problems and evidence-based evaluation is a focal point in our research. By emphasizing evidence, we aim to mitigate the issue of hallucinations, ensuring that LLM outputs are not only plausible but also verifiable. We plan to refine our multi-choice question design. This will enable us to automate the evaluation of model robustness against hallucinations more effectively. This adjustment is anticipated to further enhance our understanding of LLMs’ capabilities and limitations in dealing with hallucinations, thereby contributing to developing more reliable and trustworthy LLMs in the Ops field.

7. Conclusion

In this paper, we introduce OpsEval, a comprehensive Ops benchmark suite designed for LLMs. Unprecedented in its approach, OpsEval evaluates LLMs’ proficiency across various Ops scenarios while considering varying ability levels encompassing knowledge recall, analytical thinking, and practical application. This comprehensive benchmark comprises 7,200 questions in both multi-choice and question-answering formats, in both English and Chinese. After careful review by experts from the industry, we have made the datasets in OpsEval available to the public on GitHub, and published the evaluation leaderboard on the official OpsEval website.

Supported by quantitative and qualitative results, we elucidate the nuanced impact of various LLM techniques, such as chain-of-thought and few-shot in-context learning, on Ops performance. Notably, the GPT4-score emerges as a more reliable metric when compared to widely used BLEU and ROUGE, suggesting its potential as a replacement for manual labeling in large-scale quantitative evaluations.

The identified flexibility within the OpsEval framework presents opportunities for future exploration. The adaptability of this benchmark facilitates the seamless integration of additional fine-grained tasks, providing a foundation for continued research and optimization of LLMs tailored for Ops.

References

202 (2023) 2023. QwenLM/Qwen-7B. https://github.com/QwenLM/Qwen-7B
202 (2024a) 2024a. baidu/ERNIE-Bot-4.0. https://cloud.baidu.com/doc/WENXINWORKSHOP/s/clntwmv7t
202 (2024b) 2024b. THUDM/ChatGLM3-6B. https://github.com/THUDM/ChatGLM3
Bai et al. (2023) Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an-Examiner. arXiv:2306.04181 [cs.CL]
Baichuan (2023) Baichuan. 2023. Baichuan 2: Open Large-scale Language Models. arXiv preprint arXiv:2309.10305 (2023). https://arxiv.org/abs/2309.10305
Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. 2023. A Survey on Evaluation of Large Language Models. arXiv:2307.03109 [cs.CL]
Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 320–335.
et.al. (2023) Hugo Touvron et.al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
Frantar et al. (2022) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022).
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
Huang et al. (2023) Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. 2023. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv e-prints (2023), arXiv–2305.
Lerner (2017) Andrew Lerner. 2017. AIOps Platforms—Gartner.
Li et al. (2023) Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, and Benyou Wang. 2023. Huatuo-26M, a Large-scale Chinese Medical QA Dataset. arXiv e-prints (2023), arXiv–2305.
Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic Evaluation of Language Models. arXiv e-prints (2022), arXiv–2211.
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013
Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer Sentinel Mixture Models. arXiv:1609.07843 [cs.CL]
Miao et al. (2023) Yukai Miao, Yu Bai, Li Chen, Dan Li, Haifeng Sun, Xizheng Wang, Ziqiu Luo, Dapeng Sun, Xiuting Xu, Qi Zhang, Chao Xiang, and Xinchi Li. 2023. An Empirical Study of NetOps Capability of Pre-Trained Large Language Models. CoRR abs/2309.05557 (2023). https://doi.org/10.48550/arXiv.2309.05557
OpenAI (2022) OpenAI. 2022. ChatGPT: Optimizing Language Models for Dialogue. OpenAI Blog (2022). https://openai.com/blog/chatgpt/
OpenAI (2023a) OpenAI. 2023a. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
OpenAI (2023b) OpenAI. 2023b. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
Peng et al. (2023) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction Tuning with GPT-4. arXiv preprint arXiv:2304.03277 (2023).
Singhal et al. (2022) Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2022. Large Language Models Encode Clinical Knowledge. arXiv preprint arXiv:2212.13138 (2022).
Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv e-prints (2022), arXiv–2206.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Team (2023) InternLM Team. 2023. InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities. https://github.com/InternLM/InternLM.
Wang et al. (2023a) Xidong Wang, Guiming Hardy Chen, Dingjie Song, Zhiyi Zhang, Zhihong Chen, Qingying Xiao, Feng Jiang, Jianquan Li, Xiang Wan, Benyou Wang, et al. 2023a. CMB: A Comprehensive Medical Benchmark in Chinese. arXiv e-prints (2023), arXiv–2308.
Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]
Wei et al. (2023a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023a. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
Wei et al. (2023b) Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, and Yahui Zhou. 2023b. Skywork: A More Open Bilingual Foundation Model. arXiv:2310.19341 [cs.CL]
Zeng (2023) Hui Zeng. 2023. Measuring Massive Multitask Chinese Understanding. arXiv e-prints (2023), arXiv–2304.
Zeng et al. (2023) Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, and Na Zhang. 2023. Evaluating the Generation Capabilities of Large Chinese Language Models. arXiv e-prints (2023), arXiv–2308.
Zhang et al. (2023) Liwen Zhang, Weige Cai, Zhaowei Liu, Zhi Yang, Wei Dai, Yujie Liao, Qianru Qin, Yifei Li, Xingyu Liu, Zhiqiang Liu, et al. 2023. FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models. arXiv e-prints (2023), arXiv–2308.
Zhong et al. (2023) Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv e-prints (2023), arXiv–2304.

Appendix A Details of OpsEval Benchmark

A.1. Information on the companies and experts participating in OpsEval

Table 10. Information of companies collaborating in OpsEval

Organization	Domain	URL
Bank of Shanghai	Financial IT	https://www.bosc.cn/zh/
Bizseer	Ops service/tool provider	https://www.bizseer.com/
ChinaEtek	Internet	https://www.ce-service.com.cn/
Data Foundation	Internet	https://www.dfcdata.com.cn/
Guotai Junan	Securities	https://www.gtja.com/
Huawei	Communication	https://www.huawei.com/
Lenovo	Hybrid Cloud	https://www.lenovo.com/
Rizhiyi	Log Analysis	https://www.rizhiyi.com/
ZTE	Communication	https://www.zte.com.cn/china/
Zabbix	Ops service/tool provider	https://www.zabbix.com/
Total	10

Table 10 shows the companies participating in the creation of the OpsEval benchmark suite. Their industries include the Internet, telecommunications, cloud computing, finance, and securities, and each company has dispatched at least two experts to participate in the OpsEval work.

A.2. Task Types of Questions

We categorize all questions in OpsEval into 8 tasks. The details of each task are as follows:

•

General Knowledge pertains to foundational concepts and universal practices within the Ops domain.
•

Fault Analysis and Diagnostics focuses on detecting and addressing discrepancies or faults within a network or system, and deducing the primary causes behind those disruptions.
•

Network Configuration revolves around suggesting optimal configurations for network devices like routers, switches, and firewalls to ensure their efficient and secure operations.
•

Software Deployment deals with the dissemination and management of software applications throughout the network or system, verifying their correct installation.
•

Monitoring and Alerts harnesses monitoring tools to supervise network and system efficiency and implements alert mechanisms to notify administrators of emerging issues.
•

Performance Optimization is centered on refining the network and system for peak performance and recognizing potential enhancement areas.
•

Automation Scripts involves the formulation of automation scripts to facilitate processes and decrease manual intervention for administrators.
•

Miscellaneous comprises tasks that do not strictly adhere to the aforementioned classifications or involve a combination of various tasks.

A.3. Ability Levels of Questions

Different questions require different levels of ability to answer. We classify all questions in OpsEval into 3 categories. The details of each ability are as follows:

(1)

Knowledge Recall: Questions under this category primarily test a model’s capacity to recognize and recall core concepts and foundational knowledge. Such questions are akin to situations where a professional might need to identify a standard procedure or recognize a well-known issue based solely on previous knowledge.
(2)

Analytical Thinking: These questions demand more than mere recall. They necessitate a deeper level of thought, expecting the model to dissect a problem, correlate diverse pieces of information, and derive a coherent conclusion. This mirrors real-world scenarios where professionals troubleshoot complex issues by connecting various dots and leveraging their comprehensive understanding.
(3)

Practical Application: These questions challenge a model’s ability to apply its foundational knowledge or analytical conclusions to provide actionable recommendations for specific scenarios. It epitomizes situations where professionals are expected to make decisions or suggest solutions based on in-depth analysis and expertise.

Figure 7 illustrates examples in our question set, shedding light on our classification methodology.

A.4. An Example of Subjective Questions

A saved subjective question in OpsEval is presented in Figure 8, which contains not only the raw question but also its type of task.

As shown in Figure 9, we combine the task and ability of each question with the question itself as the prompt for LLMs.

Table 11. LLMs’ overall performance on 5G communication operations test set

Model	English Test Set								Chinese Test Set
	Zero-shot				3-shot				Zero-shot				3-shot
	Naive	SC	CoT	CoT+SC	Naive	SC	CoT	CoT+SC	Naive	SC	CoT	CoT+SC	Naive	SC	CoT	CoT+SC
Qwen-72B-Chat	53.19	53.19	55.25	55.52	58.13	58.13	58.72	58.99	64.79	64.79	65.79	65.72	70.19	70.19	68.31	68.38
GPT-4	/	/	56.30	65.49	/	/	59.62	63.54	/	/	57.19	62.11	/	/	61.55	65.68
InternLM2-Chat-20B	39.10	39.10	37.70	37.70	47.70	47.70	33.50	33.50	44.60	44.60	47.00	47.00	62.20	62.20	38.30	38.30
Qwen-14B-Chat	33.71	36.25	41.24	42.51	51.19	50.39	57.18	59.18	41.71	41.44	45.58	47.98	53.52	49.92	54.72	58.85
DevOps-Model-14B-Chat	31.04	30.51	42.84	47.37	52.25	49.38	45.90	47.23	41.04	42.70	48.71	53.57	56.85	57.25	51.30	54.29
ERNIE-Bot-4.0	43.66	43.66	51.99	51.99	44.00	44.00	50.00	50.00	45.99	45.99	48.98	48.98	46.00	46.00	54.00	54.00
LLaMA-2-70B	23.64	23.64	39.31	39.31	38.98	39.12	47.90	47.90	24.38	24.38	43.63	43.63	44.65	44.65	48.84	48.84
Mistral-7B	26.91	26.91	30.65	30.65	40.52	40.52	46.84	46.84	1.27	1.27	42.05	42.05	30.72	30.72	46.44	46.44
InternLM2-Chat-7B	36.80	36.80	31.70	31.70	46.30	46.30	36.90	36.90	38.80	38.80	44.60	44.60	46.00	46.00	35.80	35.80
LLaMA-2-13B	15.62	18.32	29.88	34.45	23.16	29.14	37.59	44.3	25.43	27.16	29.17	29.99	36.56	36.15	37.70	39.02
GPT-3.5-turbo	34.92	34.82	38.53	43.50	39.40	39.19	40.93	42.58	36.98	36.83	37.95	39.25	39.17	39.77	41.93	42.15
Qwen-7B-Chat	33.85	33.74	32.45	34.10	32.91	32.70	36.65	36.65	36.27	36.50	33.27	33.51	42.22	40.59	31.28	31.46
ChatGLM3-6B	30.40	30.40	30.70	30.70	26.90	26.90	37.20	37.20	32.60	32.60	35.40	35.40	28.30	28.30	40.90	40.90
Baichuan2-13B-Chat	14.10	15.30	24.10	25.80	32.30	33.10	25.60	27.70	35.64	35.91	30.59	30.52	34.65	35.6	30.21	32.05
LLaMA-2-7B	19.14	21.62	25.70	27.11	21.38	24.85	32.38	34.83	23.57	23.47	27.65	29.26	30.30	30.03	30.98	31.93
Note: The best accuracy of each language for each LLM is in bold font. The best accuracy of all LLMs for each setting is underlined.

Table 12. LLMs’ overall performance on database operations test set

Model	English Test Set								Chinese Test Set
	Zero-shot				3-shot				Zero-shot				3-shot
	Naive	SC	CoT	CoT+SC	Naive	SC	CoT	CoT+SC	Naive	SC	CoT	CoT+SC	Naive	SC	CoT	CoT+SC
GPT-4	/	/	59.02	64.56	/	/	58.35	62.58	/	/	59.38	65.17	/	/	44.06	48.09
InternLM2-Chat-20B	/	/	59.21	59.21	/	/	/	/	/	/	/	/	/	/	/	/
ERNIE-Bot-4.0	43.80	43.80	47.14	47.14	46.00	46.00	54.0	54.0	48.56	48.56	50.64	50.64	48.00	48.00	54.0	54.0
Qwen-72B-Chat	47.28	47.48	48.09	48.09	49.70	49.70	43.46	43.66	48.29	48.49	49.50	49.70	49.70	49.70	45.27	44.87
GPT-3.5-turbo	38.63	38.83	40.04	42.05	36.62	37.63	42.66	43.86	36.42	35.81	39.24	43.26	39.84	39.44	27.16	27.77
Qwen-14B-Chat	24.95	28.37	33.00	36.62	27.97	28.37	27.97	24.14	27.57	27.57	32.39	36.02	40.04	35.41	30.38	33.4
DevOps-Model-14B-Chat	25.15	26.96	35.41	38.83	33.2	34.81	27.36	27.36	24.75	22.74	28.37	27.77	36.62	37.02	27.57	26.36
LLaMA-2-70B	19.72	19.72	27.97	27.97	26.56	26.56	32.6	32.6	15.29	15.29	34.81	34.81	26.76	26.76	33.8	33.8
Qwen-7B-Chat	18.91	19.11	22.13	23.94	26.76	25.55	34.81	34.81	18.51	17.71	27.36	28.37	29.78	29.58	33.60	33.60
LLaMA-2-13B	16.10	20.32	23.94	29.58	20.12	22.33	24.35	33.80	23.94	24.35	29.58	31.99	24.55	26.76	21.13	20.72
LLaMA-2-7B	22.13	23.74	23.74	26.56	19.32	20.52	28.77	33.60	20.72	20.72	27.16	27.97	21.53	18.51	18.31	17.91
Mistral-7B	17.10	17.10	26.76	26.76	31.19	31.19	27.97	27.97	0.20	0.20	26.76	26.76	10.26	10.26	32.19	32.19
InternLM2-Chat-7B	27.16	27.16	28.17	28.17	29.98	29.98	30.18	30.18	28.57	28.57	31.79	31.79	30.78	30.78	31.19	31.19
ChatGLM3-6B	20.93	20.93	25.15	25.15	24.75	24.75	29.18	29.18	21.33	21.33	28.97	28.97	21.73	21.73	29.58	29.58
Baichuan2-13B-Chat	17.10	19.11	18.71	22.94	25.96	26.56	20.93	24.55	25.75	25.55	20.12	21.33	27.77	26.76	22.74	24.75
Note: The best accuracy of each language for each LLM is in bold font. The best accuracy of all LLMs for each setting is underlined.

A.5. Metrics used in Question-Answering Evaluation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating machine translation and summarization. ROUGE-N is the overlap of n-grams between the prediction and the reference. ROUGE-L measures sentence-level structural similarity based on the longest common subsequence shared by the prediction and the reference. ROUGE can be understood as the recall of the ground-truth answer. ROUGE scores are normalized to the range from 0 to 100; the higher the score, the better.
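To make the recall interpretation concrete, the following is a simplified sketch of ROUGE-N recall and ROUGE-L recall on a 0 to 100 scale. It assumes lowercased whitespace tokenization and is not the package used in OpsEval; packaged ROUGE implementations typically add stemming and also report precision and F-measure. The example sentences are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple


def _ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(prediction: str, reference: str, n: int = 1) -> float:
    """Share of reference n-grams that also occur in the prediction."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return 100.0 * overlap / total


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(prediction: str, reference: str) -> float:
    """LCS length divided by the reference length."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not ref:
        return 0.0
    return 100.0 * _lcs_length(pred, ref) / len(ref)


# Hypothetical example pair, not taken from the OpsEval data.
EXAMPLE: Tuple[str, str] = (
    "please restart nginx service now",  # prediction
    "restart the nginx service",         # reference
)

if __name__ == "__main__":
    print(rouge_n_recall(*EXAMPLE, n=1), rouge_l_recall(*EXAMPLE))
```

Because only overlap with the reference matters, an answer that repeats the right keywords can score well even when its reasoning is wrong, which is the weakness discussed in Sec. 5.4.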

BLEU (Bilingual Evaluation Understudy) can be understood as the precision of the generated answer. We utilize the sacrebleu Python package to calculate BLEU in OpsEval. BLEU scores are normalized to the range from 0 to 100; the higher the score, the better.

GPT4-Score is a score generated by GPT-4 with a deliberately crafted prompt. Scoring by LLMs is increasingly used (Bai et al., 2023; Chang et al., 2023), especially as the parameter sizes of LLMs grow. The scoring prompt consists of the question, the ground-truth key point, the ground-truth detailed answer, and the answer of the LLM to be scored. The score is between 1 and 10, and higher is better. The prompt for GPT-4 scoring is shown in Figure 2(b).
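The composition of the scoring prompt can be sketched as follows. The template wording, the function names, and the score-parsing rule are all hypothetical; the actual prompt is the one shown in Figure 2(b), which this sketch does not reproduce. Only the four ingredients and the 1 to 10 range come from the description above.

```python
import re
from typing import Optional

# Hypothetical template; the real wording is given in Figure 2(b).
SCORING_TEMPLATE = (
    "You are grading an answer to an IT operations question.\n"
    "Question: {question}\n"
    "Key point of the reference answer: {keypoint}\n"
    "Detailed reference answer: {reference}\n"
    "Answer to be scored: {candidate}\n"
    "Give an integer score from 1 to 10, where higher is better."
)


def build_scoring_prompt(
    question: str, keypoint: str, reference: str, candidate: str
) -> str:
    """Combine the four ingredients named in the text into one prompt."""
    return SCORING_TEMPLATE.format(
        question=question,
        keypoint=keypoint,
        reference=reference,
        candidate=candidate,
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone integer between 1 and 10, if any."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    prompt = build_scoring_prompt(
        "What does DHCP do?",
        "Automatic IP address assignment",
        "DHCP automatically assigns IP addresses and related settings to hosts.",
        "It hands out IP addresses to clients.",
    )
    print(prompt)
    print(parse_score("Score: 8"))
```

Returning `None` for replies without a valid score lets the caller decide whether to retry or discard the sample, rather than silently recording a wrong number.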

Appendix B Additional details of experiments

B.1. Detailed Information of LLMs Evaluated

GPT-4 (OpenAI, 2023a) is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. It is currently recognized as the strongest language model. ChatGPT (OpenAI, 2022) is an earlier AI-powered language model developed by OpenAI and built upon GPT-3.5. We use the GPT-3.5-turbo version in our experiments. LLaMA 2 (et.al., 2023) is a second-generation open-source LLM from Meta, which is very popular because it is open source. It can process multiple languages, including Chinese. We evaluate three sizes (70B, 13B and 7B as shown in 3) of LLaMA 2.

Although LLaMA 2 can process Chinese input, its small Chinese vocabulary limits its ability to understand and generate Chinese text. We therefore also evaluate several Chinese-oriented LLMs published by institutions in China. ERNIE-Bot 4.0 (202, 2024a) is the latest self-developed language model released by Baidu. As claimed by Baidu, ERNIE-Bot 4.0 rivals OpenAI’s GPT-4. Qwen (202, 2023) (abbr. Tongyi Qianwen) is a series of LLMs developed by Alibaba Cloud, and Qwen-Chat is a series of large-model-based AI assistants trained with alignment techniques on top of the pretrained Qwen. We evaluate three sizes (72B, 14B and 7B as shown in 3) of Qwen-Chat. Baichuan2-13B-Chat (Baichuan, 2023) is an aligned chat model based on Baichuan2-13B-Base (Baichuan, 2023), an open-source LLM published by Baichuan Intelligence. GLM (Du et al., 2022), developed by Tsinghua Knowledge Engineering Group, is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks. Based on GLM, Zhipu AI released GLM4 (the newest version of the GLM model) (202, 2024b) and GLM3 (the third version of the GLM model). For GLM3, we use the GLM3-turbo (202, 2024b) version and ChatGLM3-6B (202, 2024b) in our experiments. InternLM2-Chat-20B and InternLM2-Chat-7B (Team, 2023), recently developed by Shanghai AI Laboratory, are multilingual models with billions of parameters, trained through multi-stage progressive training on trillions of tokens. Furthermore, we evaluate DevOps-Model-14B-Chat (devopspal2023), an open-source Chinese DevOps-oriented model mainly dedicated to delivering practical value in the field of DevOps.

In general, since some models (among them GPT-4, GPT-3.5-turbo, ERNIE-Bot-4.0, GLM4, GLM3-turbo) are not locally available, we evaluate them via API calls. For the remaining models, we perform local inference during evaluation.

B.2. An Example of CoT Prompt

For zero-shot evaluation in the CoT setting, we obtain the answer of an LLM in two rounds. First, we append ’Let’s think step by step.’ to the question so that the LLM outputs its reasoning. Second, we combine the question and the complete reasoning into the final prompt, which is fed to the LLM to get the final answer. An example is shown in Figure 10. For few-shot evaluation in the CoT setting, we analyze each option of the question as a reasoning process and craft three Q-A examples whose answers contain the CoT reasoning process. An example is shown in Figure 11.
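The two-round zero-shot procedure described above can be sketched as follows. The `generate` callable stands in for any LLM call, and the answer-extraction suffix is an assumption made for illustration; the exact prompt layout used in OpsEval is the one shown in Figure 10.

```python
from typing import Callable

COT_TRIGGER = "Let's think step by step."
# Hypothetical suffix for the second round; the real wording is in Figure 10.
ANSWER_SUFFIX = "Therefore, the answer (option letter) is:"


def zero_shot_cot(generate: Callable[[str], str], question: str) -> str:
    """Run the two-round zero-shot CoT procedure and return the final answer."""
    # Round 1: elicit the reasoning by appending the CoT trigger.
    reasoning_prompt = f"{question}\n{COT_TRIGGER}"
    reasoning = generate(reasoning_prompt)
    # Round 2: feed back the question plus the full reasoning to get the answer.
    final_prompt = f"{reasoning_prompt}\n{reasoning}\n{ANSWER_SUFFIX}"
    return generate(final_prompt).strip()


if __name__ == "__main__":
    # A stand-in model that returns canned text instead of calling a real LLM.
    def fake_generate(prompt: str) -> str:
        if prompt.endswith(COT_TRIGGER):
            return "Port 21 is the control port of FTP, so option A fits."
        return " A"

    print(zero_shot_cot(fake_generate, "Which service uses port 21? A. FTP B. SSH"))
```

Keeping the model call behind a plain callable makes the same routine usable for both API-based and locally hosted models.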

B.3. Overview Performance on Different Test Sets

In Table 11 and Table 12, we present the overall performance of different LLMs on two further test sets in OpsEval: 5G Communication Technology Operations and Database Operations.

B.4. Performance on Different Quantization Models

Figure 12 shows the accuracy of LLaMA-2-70B under different quantization parameters on Chinese objective questions. We do both few-shot and zero-shot evaluations with the naive setting.

B.5. Performance on Different Languages

In Figure 13, we compare the few-shot performance of various LLMs under the CoT+SC setting for both English and Chinese questions. Notably, some of the LLMs that have undergone specific training or fine-tuning with Chinese language corpus, such as Chinese-Alpaca-2-13B, Qwen-7B-Chat, and ChatGLM2-6B, still perform better in answering English questions than Chinese ones.

Although performance tends to be lower on Chinese questions than on the original English questions, the comparison still offers valuable insights into the language capabilities of the LLMs. Notably:

(1) ChatGLM2-6B experiences the smallest decline in performance when transitioning to Chinese questions. This smaller decline can be attributed to its substantial exposure to Chinese language data during training rather than simple fine-tuning on top of an existing base model.

(2) LLaMA-2-13B exhibits the most significant drop in performance when switching to Chinese questions. This indicates that the shift in language impacts LLMs’ general understanding ability and capacity to extract domain-specific knowledge.
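The comparison behind these two observations amounts to ranking models by the gap between their English and Chinese accuracy. The sketch below shows that computation under stated assumptions: the function names are hypothetical, and the model names and scores are placeholders chosen for illustration, not results from Figure 13.

```python
from typing import Dict, List, Tuple


def accuracy_drop(english_acc: float, chinese_acc: float) -> float:
    """Drop in accuracy (percentage points) when moving from English to Chinese."""
    return english_acc - chinese_acc


def rank_by_drop(scores: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Sort models from smallest to largest English-to-Chinese drop.

    `scores` maps a model name to (english_accuracy, chinese_accuracy).
    A negative drop means the model does better on Chinese questions.
    """
    drops = {name: accuracy_drop(en, zh) for name, (en, zh) in scores.items()}
    return sorted(drops.items(), key=lambda item: item[1])


# Placeholder values only; these are NOT measurements from the paper.
placeholder_scores = {
    "model_a": (60.0, 55.0),
    "model_b": (50.0, 30.0),
    "model_c": (40.0, 45.0),
}

ranking = rank_by_drop(placeholder_scores)
for name, drop in ranking:
    print(f"{name}: {drop:+.1f} points")
```

In this toy input, `model_c` has a negative drop, which corresponds to the case where a model answers Chinese questions better than English ones.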

We also observe an interesting phenomenon with Baichuan-13B-Chat in the 3-shot evaluation under the CoT+SC setting: its performance on Chinese questions is significantly better than on English ones. We examine the LLM’s outputs and analyze a sample question to shed light on this phenomenon in Appendix B.6.2.

B.6. Case Study

B.6.1. Case study: Why advanced settings sometimes lag behind

In certain cases, more advanced evaluation methods surprisingly lead to poorer results. We analyze the potential reasons behind this phenomenon:

• Some models may respond poorly to the guidance provided by the CoT prompts when required to think step by step, leading to subpar outputs. Figure 14 shows one example where CoT failed: the tested model cannot comprehend the idea of thinking step by step. Thus, instead of analyzing each option, it repeated the question and jumped directly to its answer. Even though the model correctly answered “FTP server” when asked in English, it failed to give the expected option A. This failure case motivates the use of few-shot prompting when applying the CoT method.

• Few-shot prompts may lead some models to believe that the task involves generating questions rather than answering them, resulting in performance issues. Figure 15 provides an example of this problem.
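The first failure case, where a model names the right content ("FTP server") without giving the option letter, shows why answer extraction matters for multiple-choice scoring. The following sketch is one possible heuristic, not the extraction logic used in OpsEval: the function name, the regular expression, and the fallback rule are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional


def extract_option(output: str, options: Dict[str, str]) -> Optional[str]:
    """Heuristically map a free-form model output to an option letter.

    `options` maps a letter ("A".."D") to the option text.
    Returns None when no single option can be identified.
    """
    # 1) Prefer an explicit upper-case letter after "answer", e.g.
    #    "The answer is A." or "Answer: (B)". Only the words "answer"/"is"
    #    are case-insensitive, so "answer is a server" is not read as "A".
    match = re.search(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?\b", output)
    if match and match.group(1) in options:
        return match.group(1)

    # 2) Fallback: accept the answer if exactly one option's text is quoted.
    lowered = output.lower()
    hits = [letter for letter, text in options.items() if text.lower() in lowered]
    if len(hits) == 1:
        return hits[0]

    # Ambiguous or empty output: leave it unscored rather than guess.
    return None
```

A fallback of this kind would credit the "FTP server" response, at the cost of occasionally accepting an option that is only mentioned in passing; whether that trade-off is acceptable depends on the evaluation protocol.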

B.6.2. Case study: How Baichuan outperforms in Chinese

Figure 16 shows an example where Baichuan-13B-Chat failed in the English 3-shot CoT+SC setting, with the correct English analysis from LLaMA-2-13B and the correct Chinese analysis from Baichuan-13B-Chat itself for comparison. The malfunctioning output contains an endless analysis of a single option with no punctuation, which prevents the model from analyzing the remaining options. This observation suggests that Baichuan-13B-Chat relies heavily on the input language (Chinese in this case), even though it possesses a foundational knowledge base related to Ops.
