\pdfcolInitStack

tcb@breakable

[Uncaptioned image] CoIR: A Comprehensive Benchmark for Code Information Retrieval Models

Xiangyang Li  Kuicai Dong11footnotemark: 1  Yi Quan Lee11footnotemark: 1  Wei Xia  Yichun Yin  
Hao ZhangYong LiuYasheng WangRuiming Tang22footnotemark: 2
Huawei Noah’s Ark Lab
{lixiangyang34, kuicai.dong, lee.yi.quan, xiawei24, yinyichun,
zhang.hao3, liu.yong6, wangyasheng, tangruiming}@huawei.com
These authors contributed equally to this work.Corresponding authors.
Abstract

Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Addressing this gap, we present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of CoIR and its diverse dataset composition. Further, we evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, CoIR has been developed as a user-friendly Python framework, readily installable via pip. It shares same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through CoIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems111https://github.com/CoIR-team/coir.

1 Introduction

Information retrieval (IR) is a process that facilitates effective and efficient retrieval of relevant information from a large scale corpus. The advances of pretrained Transformers [39] like BERT [5] and RoBERTa [23] enable IR systems [14, 40, 3, 22] to leverage rich semantic embeddings to interpret and fulfill natural language search queries. Specifically, IR have successfully improve a wide range of Natural Language Processing (NLP) tasks such as Question Answering (QA) [19, 19, 32, 1], Retrieval-Augmented Generation (RAG) [26, 21, 33, 48], Information Extraction [28, 50, 34], Text Summarization [8, 25], and etc. Although proven to be effective in text-based retrieval, standard IR methods often fall short in specialized domains such as code retrieval [12].

Unlike standard text, code data is semi-structured and inherently logical, consisting of syntactic rules and semantic information that require specific parsing and understanding. The distinctive nature of code makes it crucial for specialized IR systems to adapt and interpret this format accurately. Recognizing the importance of code data, pioneering works such as CodeBERT [6], CodeGPT [24] and Unixcoder [9] have conducted pre-training specifically on code corpora to perform well on code-related tasks. Code information retrieval is a critical component in accelerating development processes and improving code quality. Efficient code retrieval helps developers quickly find not only relevant code snippets, but also related information like code explanations, bug analyses, code summaries, and similar code instances. Commercial products have recently integrated tools for code retrieval, such as VS Code [35] and GitHub Code Search. Moreover, code-RAG systems [47, 46, 4, 37] have effectively leveraged on code retrieval to minimize hallucinations (i.e., errors in generated code) by Large Language Models (LLMs), thereby ensuring more accurate and reliable outputs.

Benchamark Domain #PL
Retrieval
Tasks
Eval
Package
CodeSeachNet Github 6 Text-to-Code ×\times×
CoSQA Web Query 1 Text-to-Code ×\times×
XCodeEval Contest 17
Text-to-Code
Code-to-Code
×\times×
CoIR (Ours)
Github,Web Query,
Database, Contest,
Deep Learning,
StackOverFlow,
Code Instruction
14
Text-to-Code
Code-to-Code
Code-to-Text
Hybrid Code
\checkmark
Table 1: Comparison between CoIR and other code retrieval benchmarks. #PL is the number of main programming languages used.

Due to the importance of code retrieval, benchmarks (see Table 1) such as CodeSearchNet [12], CosQA [11], and XcodeEval [17], have been proposed to evaluate the code retrieval effectiveness. Despite these efforts, there remain three principal limitations that need to be addressed. Limitation 1: current benchmarks focus on limited number of code retrieval tasks. Commonly, these involve using a textual query to search for corresponding code snippets. However, the practical needs of code retrieval are far more diverse. In real-world scenarios, queries and retrieved corpus can involve not just text or code, but a combination of both. For instance, one might input a code snippet coupled with bug information, and seek detailed explanations, summaries, or even fixed code as output. Existing benchmarks do not adequately cater to such complex and varied query types, limiting the scope of their applicability and the robustness of the models tested. Limitation 2: there is a noticeable lack of diversity in data domains. For example, CodeSearchNet exclusively extracts code and code-comment pairs from GitHub , which represents a specific format of coding practices in open-source projects. Similarly, XcodeEval focuses only on coding related to contest challenges, which may be overly specialized. Such narrow focus implies that models trained and evaluated on these datasets could exhibit suboptimal performance in broader coding contexts. Limitation 3: there is no standard evaluation framework for code retrieval, which complicates the comparison and development of methods in this field. CodeSearchNet, CosQA, and XcodeEval employ various types of evaluation metrics tailored to their specific tasks and formats. This variety in evaluation approaches means that metrics used in one benchmark are often not directly applicable to others, leading to potential inconsistencies in measuring model performance across different benchmarks.

To address the aforementioned limitations, we propose a comprehensive benchmark for code retrieval named CoIR (Code Information Retrieval Benchamark). Table 1 shows the comparison between CoIR and existing benchmarks. Observe that CoIR offers much richer domain coverage and retrieval tasks, as well as a standardized evaluation package. Specifically, CoIR consists of 10 distinct datasets as shown in Figure 1: 8 existing and 2 newly-curated ones. CoIR encompasses 4 main retrieval tasks: (1) Text-to-Code Retrieval (retrieving code documents given textual query), (2) Code-to-Code Retrieval (retrieving code documents given code query), (3) Code-to-Text Retrieval (retrieving textual documents given code query), and (4) Hybrid Code Retrieval (retrieving hybrid of code and textual document given hybrid query). The 4 main tasks can be further breakdown into 8 retrieval sub-tasks: code contest retrieval, web query code retrieval, text-to-sql retrieval, code summary retrieval, code context retrieval, similar code retrieval, single-turn code QA retrieval, and multi-turn code QA retrieval. Additionally, CoIR offers datasets of various sizes, ranging from 1K to 1M corpus. The average number of tokens of query and corpus ranges from 37 to 4.4K and 113 to 1.5K, respectively.

With the aid of CoIR, we evaluate 9 popular retrieval systems in the context of code retrieval. We discovered that even state-of-the-art retrievers perform suboptimally on CoIR, indicating the substantial challenges presented in code retrieval. Furthermore, to minimize the evaluation efforts of code retrieval, we provide a straightforward and user-friendly python framework. It can be simply installed via pip-install. Furthermore, we follow BEIR [38] and MTEB [29] to design the format of all datasets in CoIR. Hence, researchers can seamlessly utilize the framework of MTEB and BEIR benchmarks to evaluate any models on CoIR.

Refer to caption
Figure 1: Overview of CoIR benchmark.

2 Related Work

Existing Benchmarks. Benchmarks like BEIR [38] and MTEB [29] are pivotal in driving progress in retrieval and embedding models. BEIR covers a diverse range of retrieval tasks and MTEB unifies various embedding evaluation tasks. However, they primarily focus on text and lack dedicated evaluation for code retrieval. Regarding code retrieval, notable benchmarks include CodeSearchNet [12], CosQA [11], and XcodeEval [17]. CodeSearchNet constructs pairs of functions and their corresponding comments sourced from GitHub, focusing primarily on function retrieval. CosQA constructs a text-to-code retrieval dataset using web queries and code snippets, albeit with a limited corpus size of 21k pairs and only covers the python programming language. XcodeEval, based on CodeForces222https://codeforces.com/, offers both text-to-code and code-to-code retrieval tasks. However, it lacks domain diversity and a comprehensive evaluation framework.

These benchmarks suffer significant limitations, including narrow scope of retrieval tasks, lack of diversity in both domains and languages, and absence of unified evaluation framework (in Section 1). In contrast, CoIR provides a comprehensive suite of evaluation tasks designed to assess various aspects of code information retrieval across a broader range of domains and tasks. Furthermore, it introduces a unified evaluation software to facilitate seamless and standardized benchmarking.

Retrieval Models. Dense retrievers, which encode rich textual information into high-dimension semantic representations, are the dominant solution in IR. Dense Passage Retrieval (DPR)  [16] is one such pioneering work that fine-tunes text encoders, such as BERT [5], to allow dense vector representations to be better suited for retrieval by comparing the dot-product similarity between the query embedding against every documents’ embedding in QA tasks. Contriever [14] is a significant advancement in the field of dense retrieval. Unlike traditional methods, Contriever employs contrastive learning techniques to improve the discriminative power of the embeddings. E5 [40] focuses on reducing the dependencies on large annotated datasets through efficient self-supervised learning strategies. Its architecture is designed to balance representation quality and computational efficiency, making it a viable option for various retrieval applications. Moreover, GTE [22] integrates graph-based techniques to enhance dense embedding quality. BGE [44] introduced the RetroMAE [43] pre-training paradigm to further enhance retrieval efficiency. BGE-M3 [2] uses a combination of retrieval methods mentioned in DPR, ColBERT [18], COIL [7] to achieve improved retrieval performance. Recently, with the increasing popularity of LLMs, several approaches [30, 36, 41] have emerged that leverage these models for encoding textual data, as seen in E5-Mistral [41] and OpenAI-Ada-002 333https://platform.openai.com/docs/guides/embeddings.

While there are a plethora of works dedicated to enhancing dense retrievals, the foundation, and evaluation of these works revolve around retrieval for open-domain question-answering datasets such as MS-Marco  [31], Natural Questions [20]. Even though there has been some advancement in producing encoders for code-related datasets, for instance, CodeBERT [6] and Voyage-Code-002 444https://blog.voyageai.com/2024/01/23/voyage-code-2-elevate-your-code-retrieval/ are being pre-trained on code datasets, and UniXCoder [9], which exploits unique syntactic features of programming languages to encode code data, the number of neural retrieval works dedicated on improving code retrieval techniques are far fewer compared to the variety we see in general open-domain QA retrieval. By providing a suite of benchmarks for various code retrieval tasks we hope to drive innovation and advance state-of-the-art methodologies in code retrieval.

3 The CoIR Benchmark

Table 2: Statistics of datasets in CoIR benchmark. # is the quantity of query/corpus instances. L()subscript𝐿L_{(\cdot)}italic_L start_POSTSUBSCRIPT ( ⋅ ) end_POSTSUBSCRIPT refers to the average numbers of words per query/corpus. Datasets marked by are created by us.
Main Task Sub Task Domain Dataset Language
#Query
(train/dev/test)
#Corpus LQuerysubscript𝐿QueryL_{\text{Query}}italic_L start_POSTSUBSCRIPT Query end_POSTSUBSCRIPT LCorpussubscript𝐿CorpusL_{\text{Corpus}}italic_L start_POSTSUBSCRIPT Corpus end_POSTSUBSCRIPT
Text-to-Code Retrieval
Code Contest
Retrieval
Code
Contest
APPS py 5k/-/3.8K 9K 1.4K 575
Web Query to
Code Retrieval
Web
query
CosQA py 19k/-/500 21K 37 276
Text to SQL
Retrieval
Database
Synthetic
Text2SQL
sql 100k/-/6K 106K 83 127
Code-to-Text
Retrieval
Code Summary
Retrieval
Github CodeSearchNet
go, java, js
php, py, ruby
905k/41k/53K 1M 594 156
Code-to-Code Retrieval
Code Context
Retrieval
Github
CodeSearchNet
-CCR
go, java, js
php, py, ruby
905k/41k/53K 1M 154 113
Similar Code Retrieval
Deep
Learning
CodeTrans
Ocean-DL
py 564/72/180 816 1.6K 1.5K
Contest
CodeTrans
Ocean-Contest
c++, py 561/226/446 1K 770 1.5K
Hybrid Code Retrieval Single-turn Code QA
Stack
Overflow
StackOverflow
QA
miscellaneous 13k/3k/2K 20K 1.4K 1.2K
Code
Instruction
CodeFeedBack
-ST
html, c, css, sql
js, sql, py, shell
ruby, rust, swift
125k/-/31K 156K 722 1.5K
Multi-turn
Code QA
Code
Instruction
CodeFeeback
-MT
miscellaneous 53k/-/13K 66K 4.4K 1.5K

3.1 Desiderata

CoIR is designed to offer a one-stop zero-shot evaluation benchmark for various code retrieval tasks. To construct a comprehensive and representative evaluation benchmark, an effective selection strategy is critical for collecting the tasks and datasets with desired properties. To achieve more efficient evaluations, a one-click executable end-to-end evaluation pipeline is also crucial. Specifically, CoIR focuses on the following characteristics: (1) Diversity: the goal of CoIR is to evaluate the capability of different code retrieval models from different perspectives. To comprehensively cover diverse retrieval needs, the CoIR benchmark encompasses a total of 𝟒4\bm{4}bold_4 primary code retrieval task , 𝟖8\bm{8}bold_8 code retrieval sub-tasks, 𝟏𝟎10\bm{10}bold_10 datasets, 𝟏𝟒14\bm{14}bold_14 main programming languages, as shown in Figure 1. (2) Usability: evaluating the capabilities of target models across multiple retrieval tasks and datasets is time-consuming and labor-intensive. This process requires writing a different set of evaluation code each time to adapt to different tasks and data formats, as well as manually collecting results for further statistical analysis. In contrast, CoIR offers an efficient end-to-end executable pipeline that enables one-click evaluation, encompassing both open-source retrieval models and proprietary API models, as well as various metrics such as nDCG, precision, recall, MAP, etc. We also provide comprehensive guidelines and relevant evaluation instances to clarify the usage instruction. All evaluation results will be stored in JSON format for subsequent utilization. The dataset statistics are detailed in Table 2 and detailed information on the preparation of each dataset can be found in B.

3.2 Main Task I: Text-to-Code Retrieval

Code Contest Retrieval. Code contest retrieval is the task of retrieving relevant code solutions in response to coding problems described in natural language. This task presents significant challenges due to the inherent complexity and variability of natural language descriptions and their corresponding code implementations, as well as the discrepancies between human and machine languages. In this task, we utilize the APPS dataset [10], a comprehensive collection of coding problems sourced from various competitive programming platforms, including Codewars, AtCoder, Kattis, and Codeforces.

Web Query Code Retrieval. Web query code retrieval aims to retrieve relevant code snippets or functions in response to concise web queries. Unlike code contest retrieval, the queries in this context are concise, typically consisting of only a few words. For this task, we utilize the CosQA [11] dataset, which consists of 20.6k labeled pairs of textual queries and corresponding Python functions.

Text-to-SQL Retrieval. Text-to-SQL is a critical code generation task due to the significance of SQL and its dialects for database information retrieval. In this task, the model requires to retrieval of appropriate SQL queries in response to natural language questions. We employ the Synthetic Text-to-SQL dataset [27], which is currently the largest and most diverse synthetic Text-to-SQL dataset available, containing around 106k examples.

3.3 Main Task II: Code-to-Text Retrieval

Code Summary Retrieval. Code summary retrieval evaluates a model’s ability to use code to retrieve code summaries or annotations. For this task, we employ the CodeSearchNet dataset [12], which consists of numerous code functions accompanied by code comments. This dataset spans six programming languages and includes over one million documents, providing a rich source of information for evaluating model performance.

3.4 Main Task III: Code-to-Code Retrieval

Code Context Retrieval. Code context retrieval is the task of retrieving the most relevant code segment that completes a given initial segment of code. This task is critical for code completion purposes. Here we modify the original CodeSearchNet dataset [12] to better suit our needs. Specifically, for each code snippet or function in CodeSearchNet, we randomly divide the code into two segments: (1) the initial segment serves as our query, and (2) the remaining segment forms the target corpus to be retrieved for this query. The length of each query is uniformly and randomly selected to comprise between 40%percent4040\%40 % and 70%percent7070\%70 % of the total number of characters in the original code or function.

Similar Code Retrieval. In the task of similar code retrieval, the primary objective is to assess a model’s ability to retrieve similar code snippets. Specifically, given a code snippet in one programming language or deep learning framework, the model needs to retrieve semantically equivalent code in a different language or framework. We utilize the CodeTransOcean dataset [45] for this purpose, creating two sub-datasets for similar code retrieval. These sub-datasets are named “CodeTransOcean-DL” and “CodeTransOcean-Contest”, with the suffix indicating their respective domains. The “CodeTransOcean-DL” subset contains code written in different deep learning frameworks within the same programming language, such as TensorFlow and Paddlepaddle. Whereas the “CodeTransOcean-Contest” subset includes code written in different programming languages, such as Python and C++, for the same coding contest or competitive programming problem.

3.5 Main Task IV: Hybrid Code Retrieval

Single-turn Code Question Answer Retrieval. In single-turn code question-answering (QA), a retrieval model is required to find the corresponding answer for a given natural language question. Both the question and the answer typically consist of a mix of text and code snippets. For this task, we use two code QA datasets: StackOverflow QA 555https://www.kaggle.com/datasets/stackoverflow/stacksample/data and CodeFeedQA [49]. The StackOverflow QA is derived from the original StackOverflow dataset by pairing questions with their highest upvoted answers, resulting in 19,9311993119,93119 , 931 pairs. Additionally, we sampled 1,20212021,2021 , 202 query instances to validate the retrieval model’s performance. The CodeFeedQA is a synthesized code instruction dataset generated by LLMs. From this dataset, we sampled 20%percent2020\%20 % of the queries to assess the model’s performance.

Multi-turn Code Question Answer. In multi-turn code question-answer retrieval, the retrieval model must effectively utilize the context from multiple dialogue turns to accurately retrieve the answer for the subsequent turn. This task is challenging due to the extensive dialogue context, which can exceed 4,00040004,0004 , 000 tokens, whereas most current retrieval models are limited to a context length of 512512512512 tokens. Specifically, we employ the CodeFeedback multi-turn question-answer dataset, with dialogues generated by LLMs. Our test set comprises a total of 13,2271322713,22713 , 227 queries and 66,3836638366,38366 , 383 corpus.

Refer to caption
Refer to caption
Figure 2: Jaccard similarity and Programming Language Distributions of CoIR.

3.6 Dataset and Diversity Analysis

In addition to the multitude of tasks, the datasets in CoIR also encompass a broad array of programming languages, such as Python, Java, and SQL, each featuring unique attributes. The distribution of these programming languages is long-tailed, as depicted in the bar graph in Figure 2. Despite this, the datasets maintain diversity, originating from varied sources including code contest websites, GitHub repositories, StackOverflow responses, etc. To quantify the diversity of CoIR, we calculated the weighted Jaccard similarity scores on unigram word overlap across all dataset pairs, displayed in a heatmap in Figure 2 (see Appendix C.2 for calculation details). The heatmap reveals generally low Jaccard similarity scores among dataset pairs, with notable exceptions being CodeFeedback Single-Turn (CodeFeedback-ST) and CodeFeedback Multi-Turn (CodeFeedback-MT), which are derived from the same domain. This signifies the challenge of CoIR as a benchmark. For optimal performance, a method must not only excel in major programming languages but also exhibit robust generalization ability across various domains.

3.7 CoIR Evaluation Software and Framework

To advance the assessment of code retrieval capabilities, we introduce a streamlined, user-friendly Python framework for the CoIR benchmark evaluation. This framework is installable via pip using the command pip install coir-eval and features a straightforward script that evaluates model performance across multiple datasets, outputting results in JSON format. Meanwhile, CoIR is compatible with several popular open-source frameworks, including HuggingFace and Sentence-Transformers, as well as API-based models such as OpenAI-Ada-002 and Voyage-Code-002. In line with BEIR and MTEB, all datasets have been standardized into a uniform format, facilitating the use of MTEB and BEIR frameworks for evaluating CoIR.

4 Experiment Setup

In this section, we evaluate and analyze the performance of the current state-of-the-art retrieval models on the eight subtasks of the CoIR benchmark. More details can be found in Appendix D.

Benchmarked Models. To comprehensively evaluate the capabilities of various state-of-the-art retrieval models for code retrieval tasks, we select 9 different retrieval models. For open-source dense models, we included E5-Base [40], GTE-Base [22], BGE-Base [44], Contriever [14], E5-Mistral [41], BGE-M3 [3], and UniXcoder [9]. For proprietary models, we included OpenAI-Ada-002 666https://platform.openai.com/docs/guides/embeddings and Voyage-Code-002 777https://blog.voyageai.com/2024/01/23/voyage-code-2-elevate-your-code-retrieval/. Notably, both UniXcoder and Voyage-Code-002 have been pre-trained on code data, making them robust baselines for comparison.

Implementation Details. For all open-source dense retrieval models, we set the input length of both queries and the corpus to 512. For proprietary models such as Voyage-Code-002, due to the constraint of tokens per minute (TPM), we set the length of the queries to 256 and the corpus to 512. Following previous work [41], we utilize (i) last token pooling for E5-Mistral, and (ii) mean pooling for other models, to obtain semantic vector representations. Cosine similarity is employed to calculate the similarity between the query and corpus for ranking.

Evaluation Metric. In code retrieval tasks, selecting an appropriate evaluation metric is essential for both assessing ranking sensitivity and relevance accuracy. Building on prior research [42], Normalized Discounted Cumulative Gain (NDCG) emerges as a superior metric. It effectively addresses the limitations of rank-insensitive measures such as precision and recall, as well as binary metrics like MRR and MAP that fail to account for varying relevance levels. Specifically, NDCG considers not only the order of retrieved items but also their relevance intensity. Following BEIR [38], we use NDCG@10 to evaluate the performance of retrieval models. For comprehensive evaluation, our scripts also provide metrics such as MAP, Recall, and Precision.

5 Results and Analysis

Table 3: NDCG@10 score of various retrievers on CoIR. The best score is marked in boldface.
Task (\rightarrow) Text-to-Code Code-to-Text Code-to-Code Hybrid Code Avg
Model (param.) \downarrow Apps CosQA Synthetic Code Code CodeTrans StackOver CodeFeedBack
Text2sql SearchNet SN-CCR -Contest -DL Flow QA -ST -MT
Contriever (110M) 5.14 14.21 45.46 34.72 35.74 44.16 24.21 66.05 55.11 39.23 36.40
E5-base (110M) 11.52 32.59 52.31 67.99 56.87 62.50 21.87 86.86 74.52 41.99 50.90
BGE-Base (110M) 4.05 32.76 45.59 69.60 45.56 38.50 21.71 73.55 64.99 31.42 42.77
GTE-Base (110M) 3.24 30.24 46.19 43.35 35.50 33.81 28.80 62.71 55.19 28.48 36.75
UniXcoder (123M) 1.36 25.14 50.45 60.20 58.36 41.82 31.03 44.67 36.02 24.21 37.33
BGE-M3 (567M) 7.37 22.73 48.76 43.23 47.55 47.86 31.16 61.04 49.94 33.46 39.31
E5-Mistral (7B) 21.33 31.27 65.98 54.25 65.27 82.55 33.24 91.54 72.71 33.65 55.18
OpenAI-Ada-002 8.70 28.88 58.32 74.21 69.13 53.34 26.04 72.40 47.12 17.74 45.59
Voyage-Code-002 26.52 29.79 69.26 81.79 73.45 72.77 27.28 87.68 65.35 28.74 56.26

5.1 Main Results

Table 3 showcases the performance of existing embedding models on the CoIR benchmark. From the table, several interesting conclusions can be drawn:

No Single Model Dominates Across All Tasks. Despite Voyage-Code-002 achieving the highest mean score of 52.86 on the benchmark, it does not universally surpass other models in every task. Models like E5-base and E5-Mistral exhibit superior performance in specific retrieval sub-tasks. This highlights the CoIR benchmark’s complexity and suggests that different models excel in different facets, catering to varied task requirements.

Demonstration of Benchmark Diversity and Challenge.. Models achieving top performance in mainstream text retrieval benchmarks, such as E5-Mistral, do not consistently excel across CoIR sub-tasks. For example, E5-Mistral scores highly in CodeTrans-Contest and StackOverflow QA but shows only median performance in APPS and CosQA for Text-to-Code tasks, and falls behind E5-base in CodeFeedBack-MT. This variation underscores the CoIR benchmark’s challenge and the need for adaptive, robust code retrieval capabilities.

Effectiveness of Code-Specific Optimization. Models tailored specifically for code retrieval, like Voyage-Code-002, achieve impressive results on CoIR. This indicates that specialized optimization can significantly boost performance. Voyage-Code-002 excels in Text-to-Code and Code-to-Text tasks, underscoring the advantages of specific training and optimization for code retrieval.

5.2 Analysis of Embedding and Retrieval Efficiency

In practical industrial settings, efficiency is also a critical factor for evaluating retrieval model. Factors such as embedding latency for query/corpus, retrieval latency from all embeddings, and the physical memory footprint of the index are critical. In this section, we analyze the efficiency of the current retrieval model using a portion of the CoIR dataset, designated as CodeFeedBack-ST, which comprises 156k corpus and 31k queries. For embedding latency, we record the time taken by each model to process a batch of data. This time was then divided by the number of samples in the batch to derive the average embedding time per sample. For retrieval latency, we measure the averaged retrieval time of all queries, where each query is tasked with retrieving 1,000 documents. The latency results are presented in Table 4.

CodeFeedBack-ST
156k corpus & 31k queries
Embed
Latency
Retrieval
Latency
Index
Rank Model Dim. GPU GPU Size
9 Contriever 768 7.8ms 38.1µs 0.3G
3 E5-Base 768 7.4ms 38.1µs 0.3G
5 BGE-Base 768 7.6ms 38.1µs 0.3G
8 GTE-Base 768 7.8ms 38.1µs 0.3G
7 UniXcoder 768 7.7ms 38.1µs 0.3G
6 BGE-M3 1024 31.4ms 42.9µs 0.6G
2 E5-Mistral 4096 1840ms 115.5µs 2.3G
4 OpenAI-Ada-002 1536 - 56.8µs 0.9G
1 Voyage-Code-002 1536 - 56.8µs 0.9G
Table 4: The average embedding/retrieval latency per instance using CodeFeedBack-ST. We retain models that have rankings in both BEIR and CoIR.

Trade-off between Accuracy and Latency. High-performing models like E5-Mistral with excellent accuracy, suffer from noticeable embedding latency. Specifically, E5-Mistral’s average embedding latency is 1840ms per sample, significantly higher than other evaluated models. This discrepancy highlights the common trade-off in retrieval systems: higher accuracy often comes at the expense of increased latency.

Index Size Considerations. Index sizes vary significantly across different models. Simpler models such as Contriever, E5-Base, BGE-Base, GTE-Base, and UniXcoder have relatively small index sizes, around 0.3GB. In contrast, more complex models like E5-Mistral, OpenAI-Ada-002, and Voyage-Code-2 have larger index sizes ranging from 0.6GB up to 2.3GB. This suggests a trade-off between accuracy and memory demands. Advanced models with better performance is at the cost of more memory, which could be a limitation in resource-constrained environments.

5.3 Impact of Input Length for Code Retrieval

This section examines how input length affects the performance of code retrieval tasks. We analyze results using four datasets: CodeFeedBack-MT, CodeTransOcean-DL, APPS, and StackOverflow QA. Each dataset has an average query and corpus length that exceeds 1,000 words. We utilize two models, GTE and BGE-M3, both of which have been optimized for long documents and support context length of 8k tokens. We experiment on two settings, with input length capped at 512 and 4,096 respectively. Note that tokens beyond the cap will be truncated. We report the results in Table 5.

Model Code Code APPS Stack
(input #tokens) FB-MT TO-DL OF-QA
GTE (512) 28.48 28.80 3.24 62.71
GTE (4k) 51.32 27.33 5.08 78.63
BGE-M3 (512) 33.46 31.16 7.37 61.04
BGE-M3 (4k) 27.49 32.75 6.80 56.53
Table 5: Effects of different input length on BGE-M3 and GTE.

Impact of Input Length on Model Performance. For GTE model, extending the input length from 512 to 4,096 shows notable improvements in retrieval performance across most datasets. Specifically, retrieval scores in CodeFeedBack-MT and StackOverflow QA increase significantly from 38.20 to 51.32, and from 64.36 to 78.63, respectively. In contrast, the BGE-M3 model shows inconsistent results. CodeTransOcean-DL shows a slight improvement in scores from 31.16 to 32.75, whereas scores for CodeFeedBack-MT fall from 33.46 to 27.49. One possible reason can be that: although BGE-M3 has been optimized for long documents, the significant differences between code data and text data result in a performance degradation as the document length increases.

5.4 Comparative Analysis of CoIR and BEIR Rankings

Model Rank
CoIR BEIR
Contriever 7 7
GTE-Base 6 2
BGE-M3 5 6
BGE-Base 4 3
OpenAI-Ada-002 3 5
E5-Base 2 4
E5-Mistral 1 1
Table 6: Retriever rankings in CoIR vs rankings in BEIR.

This section delves into the comparative performance of seven retrieval models that have been evaluated under both the CoIR and BEIR after adjusting the rankings to account for consistency across two benchmarks. We outline the relative rankings of several models. To ensure a direct and fair comparison across the same set of models, voyage-code-002 and UniXcoder were excluded due to their pre-training on code data. We focus on outlining the differences and the similarities in their rankings post-adjustment.

Comparative Insights. The E5-Mistral model consistently ranks first in both the CoIR and BEIR benchmarks, demonstrating its versatility and efficiency in handling both text and code retrieval tasks. There is notable ranking variability among models like GTE-Base and E5-Base. GTE-Base ranks 2ndnd{}^{\text{nd}}start_FLOATSUPERSCRIPT nd end_FLOATSUPERSCRIPT in BEIR but falls to 6thth{}^{\text{th}}start_FLOATSUPERSCRIPT th end_FLOATSUPERSCRIPT in CoIR. This indicates that models excelling in text retrieval may not necessarily perform well in code retrieval. Conversely, E5-Base is ranked higher in CoIR than in BEIR, reflecting its better adaptation to code retrieval challenges.

These comparative insights underscore the importance of specialized benchmarks like CoIR for code retrieval tasks. While certain models show consistent performance, the significant discrepancies in others reveal that code retrieval introduces unique challenges not entirely addressed by text-focused benchmarks. Therefore, CoIR plays a crucial role in the assessment and enhancement of retrieval models, ensuring they are capable of code searching and retrieval tasks.

6 Conclusion

In this paper, we introduce CoIR: a comprehensive benchmark for code retrieval. CoIR encompasses 4 distinct code retrieval tasks, 8 fine-grained retrieval tasks, supports 14 main different programming languages, and incorporates 10 different datasets comprising over 2 million corpus. Furthermore, we evaluate the performance of 9 different retrieval models on CoIR. Our findings reveal that even the state-of-the-art retrieval models exhibit unsatisfactory performance on CoIR, underscoring its challenging nature in code retrieval. Moreover, CoIR can be easily installed using pip, enhancing its integration into current research processes. It adopts the same data schema as widely-used benchmarks such as MTEB and BEIR, which allows for straightforward cross-benchmark comparisons. With CoIR, we aim to stimulate advances in code retrieval, encouraging researchers to develop more robust retrieval models to benefit the community.

References

  • Abbasiantaeb and Momtazi [2021] Zahra Abbasiantaeb and Saeedeh Momtazi. Text-based question answering from information retrieval and deep neural network perspectives: A survey. WIREs Data Mining Knowl. Discov., 11(6), 2021. doi: 10.1002/WIDM.1412. URL https://doi.org/10.1002/widm.1412.
  • Bai et al. [2024] Yang Bai, Anthony M. Colas, Christan Grant, and Daisy Zhe Wang. M3: A multi-task mixed-objective learning framework for open-domain multi-hop dense sentence retrieval. In Nicoletta Calzolari, Min-Yen Kan, Véronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, pages 10846–10857. ELRA and ICCL, 2024. URL https://aclanthology.org/2024.lrec-main.947.
  • Chen et al. [2024] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. CoRR, abs/2402.03216, 2024. doi: 10.48550/ARXIV.2402.03216. URL https://doi.org/10.48550/arXiv.2402.03216.
  • Choi et al. [2023] YunSeok Choi, CheolWon Na, Hyojun Kim, and Jee-Hyong Lee. READSUM: retrieval-augmented adaptive transformer for source code summarization. IEEE Access, 11:51155–51165, 2023. doi: 10.1109/ACCESS.2023.3271992. URL https://doi.org/10.1109/ACCESS.2023.3271992.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Feng et al. [2020] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.139. URL https://aclanthology.org/2020.findings-emnlp.139.
  • Gao et al. [2021] Luyu Gao, Zhuyun Dai, and Jamie Callan. COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3030–3042, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.241. URL https://aclanthology.org/2021.naacl-main.241.
  • Glavas and Snajder [2014] Goran Glavas and Jan Snajder. Event graphs for information retrieval and multi-document summarization. Expert Syst. Appl., 41(15):6904–6916, 2014. doi: 10.1016/J.ESWA.2014.04.004. URL https://doi.org/10.1016/j.eswa.2014.04.004.
  • Guo et al. [2022] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. UniXcoder: Unified cross-modal pre-training for code representation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7212–7225, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.499. URL https://aclanthology.org/2022.acl-long.499.
  • Hendrycks et al. [2021] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/c24cd76e1ce41366a4bbe8a49b02a028-Paper-round2.pdf.
  • Huang et al. [2021] Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. CoSQA: 20,000+ web queries for code search and question answering. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5690–5700, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.442. URL https://aclanthology.org/2021.acl-long.442.
  • Husain et al. [2019] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. CoRR, abs/1909.09436, 2019. URL http://arxiv.longhoe.net/abs/1909.09436.
  • Ioffe [2010] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In 2010 IEEE International Conference on Data Mining, pages 246–255, 2010. URL https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf.
  • Izacard et al. [2022] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=jKN1pXi7b0.
  • Johnson et al. [2021] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Trans. Big Data, 7(3):535–547, 2021. doi: 10.1109/TBDATA.2019.2921572. URL https://doi.org/10.1109/TBDATA.2019.2921572.
  • Karpukhin et al. [2020] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550.
  • Khan et al. [2023] Mohammad Abdullah Matin Khan, M. Saiful Bari, Xuan Long Do, Weishi Wang, Md. Rizwan Parvez, and Shafiq R. Joty. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval. CoRR, abs/2303.03004, 2023. doi: 10.48550/ARXIV.2303.03004. URL https://doi.org/10.48550/arXiv.2303.03004.
  • Khattab and Zaharia [2020] Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 39–48, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450380164. doi: 10.1145/3397271.3401075. URL https://doi.org/10.1145/3397271.3401075.
  • Kolomiyets and Moens [2011] Oleksandr Kolomiyets and Marie-Francine Moens. A survey on question answering technology from an information retrieval perspective. Inf. Sci., 181(24):5412–5434, 2011. doi: 10.1016/J.INS.2011.07.047. URL https://doi.org/10.1016/j.ins.2011.07.047.
  • Kwiatkowski et al. [2019] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019. doi: 10.1162/tacl_a_00276. URL https://aclanthology.org/Q19-1026.
  • Lewis et al. [2020] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
  • Li et al. [2023] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. CoRR, abs/2308.03281, 2023. doi: 10.48550/ARXIV.2308.03281. URL https://doi.org/10.48550/arXiv.2308.03281.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, **gfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.longhoe.net/abs/1907.11692.
  • Lu et al. [2021] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/c16a5320fa475530d9583c34fd356ef5-Abstract-round1.html.
  • Mahalakshmi and Fatima [2022] P. Mahalakshmi and N. Sabiyath Fatima. Summarization of text and image captioning in information retrieval using deep learning techniques. IEEE Access, 10:18289–18297, 2022. doi: 10.1109/ACCESS.2022.3150414. URL https://doi.org/10.1109/ACCESS.2022.3150414.
  • Mao et al. [2021] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4089–4100, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.316. URL https://aclanthology.org/2021.acl-long.316.
  • Meyer et al. [2024] Yev Meyer, Marjan Emadi, Dhruv Nathawani, Lipika Ramaswamy, Kendrick Boyd, Maarten Van Segbroeck, Matthew Grossman, Piotr Mlocek, and Drew Newberry. Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate sql queries from natural language prompts, April 2024. URL https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
  • Milward and Thomas [2000] David Milward and James Thomas. From information retrieval to information extraction. In ACL-2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, pages 85–97, Hong Kong, China, October 2000. Association for Computational Linguistics. doi: 10.3115/1117755.1117767. URL https://aclanthology.org/W00-1109.
  • Muennighoff et al. [2023] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148.
  • Muennighoff et al. [2024] Niklas Muennighoff, Hong** Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. CoRR, abs/2402.09906, 2024. doi: 10.48550/ARXIV.2402.09906. URL https://doi.org/10.48550/arXiv.2402.09906.
  • Nguyen et al. [2016] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Tarek Richard Besold, Antoine Bordes, Artur S. d’Avila Garcez, and Greg Wayne, editors, Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org, 2016. URL https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
  • Otegi et al. [2022] Arantxa Otegi, Iñaki San Vicente, Xabier Saralegi, Anselmo Peñas, Borja Lozano, and Eneko Agirre. Information retrieval and question answering: A case study on COVID-19 scientific literature. Knowl. Based Syst., 240:108072, 2022. doi: 10.1016/J.KNOSYS.2021.108072. URL https://doi.org/10.1016/j.knosys.2021.108072.
  • Salemi and Zamani [2024] Alireza Salemi and Hamed Zamani. Evaluating retrieval quality in retrieval-augmented generation. CoRR, abs/2404.13781, 2024. doi: 10.48550/ARXIV.2404.13781. URL https://doi.org/10.48550/arXiv.2404.13781.
  • Sarhan [2023] Ingy Sarhan. Open Information Extraction for Knowledge Representation: Triple Extraction and Information Retrieval From Unstructured Text. PhD thesis, Utrecht University, Netherlands, 2023. URL https://dspace.library.uu.nl/handle/1874/428396.
  • Sole [2019] Alessandro Del Sole. Introducing visual studio code. Visual Studio Code, 2019. URL https://api.semanticscholar.org/CorpusID:69922364.
  • Springer et al. [2024] Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, and Aditi Raghunathan. Repetition improves language model embeddings. CoRR, abs/2402.15449, 2024. doi: 10.48550/ARXIV.2402.15449. URL https://doi.org/10.48550/arXiv.2402.15449.
  • Su et al. [2024] Hong** Su, Shuyang Jiang, Yuhang Lai, Haoyuan Wu, Boao Shi, Che Liu, Qian Liu, and Tao Yu. ARKS: active retrieval in knowledge soup for code generation. CoRR, abs/2402.12317, 2024. doi: 10.48550/ARXIV.2402.12317. URL https://doi.org/10.48550/arXiv.2402.12317.
  • Thakur et al. [2021] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/65b9eea6e1cc6bb9f0cd2a47751a186f-Abstract-round2.html.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • Wang et al. [2022] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. CoRR, abs/2212.03533, 2022. doi: 10.48550/ARXIV.2212.03533. URL https://doi.org/10.48550/arXiv.2212.03533.
  • Wang et al. [2024] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. CoRR, abs/2401.00368, 2024. doi: 10.48550/ARXIV.2401.00368. URL https://doi.org/10.48550/arXiv.2401.00368.
  • Wang et al. [2013] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. A theoretical analysis of ndcg type ranking measures. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 25–54, Princeton, NJ, USA, 12–14 Jun 2013. PMLR. URL https://proceedings.mlr.press/v30/Wang13.html.
  • Xiao et al. [2022] Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 538–548, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.35. URL https://aclanthology.org/2022.emnlp-main.35.
  • Xiao et al. [2023] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighof. C-pack: Packaged resources to advance general chinese embedding. CoRR, abs/2309.07597, 2023. doi: 10.48550/ARXIV.2309.07597. URL https://doi.org/10.48550/arXiv.2309.07597.
  • Yan et al. [2023] Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. CodeTransOcean: A comprehensive multilingual benchmark for code translation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5067–5089, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.337. URL https://aclanthology.org/2023.findings-emnlp.337.
  • Zhang et al. [2023a] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, ** Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.151. URL https://aclanthology.org/2023.emnlp-main.151.
  • Zhang et al. [2023b] Xiangyu Zhang, Yu Zhou, Guang Yang, and Taolue Chen. Syntax-aware retrieval augmented code generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1291–1302, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.90. URL https://aclanthology.org/2023.findings-emnlp.90.
  • Zhao et al. [2024] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. CoRR, abs/2402.19473, 2024. doi: 10.48550/ARXIV.2402.19473. URL https://doi.org/10.48550/arXiv.2402.19473.
  • Zheng et al. [2024] Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. Opencodeinterpreter: Integrating code generation with execution and refinement. CoRR, abs/2402.14658, 2024. doi: 10.48550/ARXIV.2402.14658. URL https://doi.org/10.48550/arXiv.2402.14658.
  • Ziembinski [2015] Radoslaw Z. Ziembinski. Unsupervised extraction of graph-stream structure for purpose of knowledge retrieval and information fusion. In Maria Ganzha, Leszek A. Maciaszek, and Marcin Paprzycki, editors, Position Papers of the 2015 Federated Conference on Computer Science and Information Systems, FedCSIS 2015, Lódz, Poland, September 13-16, 2015, volume 6 of Annals of Computer Science and Information Systems, pages 53–60, 2015. doi: 10.15439/2015F288. URL https://doi.org/10.15439/2015F288.

Appendix A Limitations

Even though we cover a wide range of tasks and domains in CoIR, no benchmark is perfect and each set of benchmark has its own limitations. It is crucial to make the limitations explicit in order to better interpret (1) retrieval results on these benchmark datasets and (2) to curate a better benchmark in the future that complements existing benchmarks in the field.

  1. 1.

    Multilingual Tasks: Although we aim for a diverse retrieval evaluation benchmark, due to the limited availability of multilingual retrieval datasets for code information retrieval , all datasets covered in the CoIR benchmark are currently English. Future work could include multi- and cross-lingual tasks and models.

  2. 2.

    Multi-faceted Search: Due to the existing paradigm that heavily emphasizes on semantic-based information retrieval, our benchmark mainly focuses on queries that aim to benchmark retrieval performance solely based on textual information. However, real-world information needs is often complex and could possibly rely on various corpus meta-data, especially so in code datasets where meta-data could play a huge role. For instance, the versioning of programming language or software libraries could make a huge difference to whether a functioning code snippet is retrieved. In essence incorporating queries that are multi-faceted can more accurately reflect real-world informational needs.

  3. 3.

    n𝑛nitalic_n-ary Match: With each query corresponding to exactly one ground-truth corpus, we dismiss the real-world informational needs where (1) a single query is could be relevant to multiple copora and (2) the informational needs of a single query can only be satisfied by multiple corpora simultaneously. Dedicating a particular section of CoIR for n𝑛nitalic_n-nary / list-wise labels for each query would allow us to address how model perform on informational needs that are diverse and further enhance the diversity of tasks of the benchmark.

Appendix B Dataset Details

Examples of queries and corpora present in each dataset can be viewed in tables 7 and 8.

Dataset Query Relevant-Corpus Granularity
Text Query to Code Corpus Apps You are playing a very popular game called Cubecraft. Initially, you have one stick and want to craft k torches. <Text omitted for brevity> For each test case, print the answer: the minimum number of trades you need to craft at least k torches. The answer always exists under the given constraints. <Example Test Case Omitted> for haaghfj in range(int(input())): x,y,k = list(map(int,input().split())) print(k + (y * k + k - 1 +x-2) // (x - 1)) Function
CosQA python adjacency matrix from edge list def get_adjacent_matrix(self): edges = self.edges num_edges = len(edges) + 1 adj = np.zeros([num_edges, num_edges]) for k in range(num_edges - 1): adj[edges[k].L, edges[k].R] = 1 adj[edges[k].R, edges[k].L] = 1 return adj Function
Synthetic Text2Sql Which buildings in the UK have both a green roof and solar panels installed? SELECT b.name FROM Building b JOIN GreenRoof gr ON b.id = gr.building_id JOIN SolarPanel sp ON b.id = sp.building_id WHERE b.country = UK’; Entire Code
Code Query to Code Corpus CodeSe- archNet -CCR def get_cumulative_spend(key): """ Get the sum of spending for this category up to and including the given month. """ query = (‘ROUND(SUM(total_ex_vat), 2) AS total FROM {table} WHERE date <= "{year}-{month:02}-01" AND lot="{lot}" AND customer_sector="{sector}" AND supplier_type="{sme_large}"’. format( table=_RAW_SALES_TABLE, year=key.year, month=key.month, lot=key.lot, sector=key.sector, sme_large=key.sme_large)) logging.debug(query) result = scraperwiki.sqlite.select(query) logging.debug(result) value = result[0][‘total’] return float(result[0][‘total’]) if value is not None else 0.0 Code Snippet
CodeTrans Ocean-DL import tensorflow as tf from d2l import tensorflow as d2l net = tf.keras.models.Sequential([tf. keras. layers. Flatten(), tf.keras. layers. Dense(256, activation=’relu’), tf.keras.layers.Dense(10)]) batch_s, lr, num_epochs = 256, 0.1, 10 loss = tf.keras.losses. SparseCategoricalCrossentropy() trainer = tf.keras.optimizers.SGD(lr) train_iter, test_iter = d2l.load_data_fashion_mnist(batch_s) d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer from d2l import paddle as d2l import paddle from paddle import nn net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)) for layer in net: if type(layer) == nn.Linear: weight_attr = paddle.framework.ParamAttr( initializer=paddle.nn.initializer. Normal(mean=0.0, std=0.01)) layer.weight_attr = weight_attr batch_size, lr, num_epochs = 256, 0.1, 10 ... Code truncated for brevity Entire Code
CodeTrans Ocean- Contest def setup(): println(distance("kitten", "sitting")) def distance(a, b): costs = [] for j in range(len(b) + 1): costs.append(j) for i in range(1, len(a) + 1): costs[0], nw = i, i - 1 for j in range(1, len(b) + 1): cj=min(1+min(costs[j],costs[j-1]), nw if a[i-1]==b[j-1] else nw+1) nw, costs[j] = costs[j], cj return costs[len(b)] #include <algorithm> #include ... template <typename StringType> size_t levenshtein_distance(const StringType& s1, const StringType& s2) { const size_t m = s1.size(); const size_t n = s2.size(); if (m == 0) return n; if (n == 0) return m; std::vector<size_t> costs(n + 1); std::iota(costs.begin(),costs.end(),0); size_t i =... ... Code truncated for brevity} Entire Code
Table 7: Examples of queries and relevant corpora for 6 datasets related to Text-to-Code and Code-to-Code retrieval (i.e. Apps, CosQA, Synthetic Text2sql, CodeSearchNet-CCR, CodeTransOcean-DL, and CodeTransOcean-Contest) in CoIR. For brevity, we omit or truncate some portion of the query and corpus.
Dataset Query Relevant-Corpus Granularity
Code Query to Code Corpus CodeSe- archNet def precedes(self, other): """Returns a boolean indicating whether this element precedes other element""" try: ancestor = next(commonancestors( AbstractElement, self, other)) except StopIteration: raise Exception("Elements share no common ancestor") # now we just do a depth first search # and see who comes first def callback(e): .... Code truncated for brevity Returns a boolean indicating whether this element precedes the other element Document String
Hybrid Code Query to Hybrid Code Corpus StackOver Flow QA With Rust Regular Expressions, how can I use named capture groups preceding a string? I’m using the sd tool which uses rust regular expressions, and I am trying to use it with a named capture group, however… <Text ommitted for brevity> Here is a contrived example to illustrate it: echo abc | sd -p ‘(?P<cg>b)’ $cgB # outputs: ac # desired output: abBc echo abc | sd -p ‘(?P<cg>b)’ $cg B # outputs as expected: ab Bc # however, places a space there I’ve tried $<cg>B, $cg(B), $cg0B, all don’t give abBc. I’ve also checked the rust regex docs however the x flag, and other techniques seem only applicable to the search pattern, not the replace pattern. We don’t need the sd tool to reproduce this behavior. Here it is in pure Rust: let re = regex::Regex::new(r"(?P<n>b)") .unwrap(); let before = "abc"; assert_eq!(re.replace_all(before, "$nB"), "ac"); assert_eq!(re.replace_all(before, "${n}B"), "abBc"); The brace replacement syntax isn’t described in the front documentation but on that of the replace method: The longest possible name is used. e.g., $1a looks up the capture group named 1a and not the capture group at index 1. To exert more precise control over the name, use braces, e.g., ${1}a. <Text omitted for brevity> Code Snippet mixed with text
CodeFeed Back-ST user: How would you extend the following code to add the features “undo" and “redo" while maintaining a constant time complexity for adding items? def add_item(item, list): list.append(item) def undo(list): if len(list) > 0: list.pop() def redo(undone_list, list): if len(undone_list) > 0: item = undone_list.pop() list.append(item) assistant: To add the features "undo" and "redo" while maintaining a constant time complexity for adding items, you can use … Here’s an extended version of the code: def add_item(item, item_list, undone_list, redone_list): item_list.append(item) redone_list.clear() # Clear the # redone_list as new item is added def undo(item_list, undone_list, redone_list): if len(item_list) > 0: .... Code omitted for brevity In this version, we add three new parameters to the functions: ‘undone_list’… <Text omitted for brevity> Code Snippet or Functions mixed with text
CodeFeed Back-MT user: Convert a given 2D point to the polar representation. The point has coordinates (x,y), where x and y are integers between -1000 and 1000, inclusive. Additional Requirement: …<Text omitted for brevity> assistant: To convert a 2D point to polar representation, we need to calculate the polar coordinates (r, theta) using the given Cartesian coordinates (x, y). We can use the arctangent function to calculate the value of theta. <Text omitted for brevity> Now, let’s write the code to convert the given 2D point to polar representation using the atan2 function in Python: import math def convert_to_polar(x, y): r = math.sqrt(x*x + y*y) # calculating the distance from the origin theta = math.atan2(y, x) # calculating the angle in radians using atan2 function return (r, theta) ... Code truncated for brevity user: <Text omitted for brevity>, assistant: <Text omitted for brevity>, user: Please generate the code to meet the original requirements without using arithmetic operations. Clarifying comments and documentation will also be appreciated. assistant: I apologize for any misunderstanding, but converting from Cartesian coordinates to polar coordinates inherently requires mathematical operations – notably, addition, multiplication, and trigonometric operations. These are defined by the mathematical formulas for converting between the two coordinate systems: r = sqrt(x^2 + y^2), theta = atan2(y, x) These formulas can’t be implemented without using arithmetic operations, due to their inherent mathematical nature. As per your requirement of not using any arithmetic operations to solve this problem, it’s essential to understand that it would be impossible to convert to polar coordinates, as the problem mainly depends on these arithmetic operations. Again, I emphasize that this kind of computation relies heavily on arithmetic operations, and it is a significant constraint that would make finding a software or coding solution impossible. However, if there’s any related issue or a different problem you need assistance with, I would be more than happy to assist!" Code Snippet or Functions mixed with text
Table 8: Examples of queries and relevant corpora for 4 datasets related to Code-to-Text and Hybrid Code retrieval (i.e. CodeSearchNet, StackOverflow QA, CodeFeedback-ST and CodeFeedback-MT) in CoIR. For brevity, we omit or truncate some portion of the query and corpus

B.1 Text-to-Code Retrieval Datasets

APPS [10]. The original APPS dataset is a code generation dataset derived from programming problems shared on open-access sites frequented by programmers, including Codewars, AtCoder, Kattis, and Codeforces, where each example consists of a question description and its corresponding code solution. To adapt the original APPS dataset for retrieval, we use the original problem descriptions as the query to retrieve from a corpus of all code solutions. We retain the original dataset’s train-test split and remove examples that do not have a corresponding code solution, resulting in a total of 5,000 samples for the training set and 3,765 samples for the test set.

CoSQA [11]. The CosQA dataset comprises 20,604 human-annotated labels for pairs of natural language web queries and corresponding code snippets. We retain the original train/dev/test dataset splits of 19,604/500/500 and utilize the natural language web queries as queries to retrieve from a corpus of all code snippets as intended in CoSQA.

Synthetic Text2Sql [27]. The Synthetic Text2Sql dataset is a comprehensive collection of high-quality synthetic Text-to-SQL samples, meticulously designed and generated using Gretel Navigator 888https://docs.gretel.ai/. Each example consists of the following: (1) a problem description which can be resolved using SQL (2) information on the schema of relevant tables used (3) the corresponding SQL code solution (4) meta-data revolving problem described, for instance the type of domain or industry the problem falls under (e.g. healthcare, aerospace etc.) and the nature of the task at hand (e.g. reporting, analytics, dashboarding etc.). Here we use the natural language question description as the query to retrieve from a corpus of corresponding SQL code solutions. We follow the train-test split provided by Gretel where the 105,851 queries are divided into 100,000 queries for train and the remaining 5,851 for test.

B.2 Code-to-Code Retrieval Datasets

CodeSearchNet-CCR [12]. CodeSearchNet-Code Context Retrieval (CCR) is modified from the original CodeSearchNet dataset with 1 million (docstring, code) pairs sourced from open-source repositories hosted on GitHub. For each example present in the dataset, we randomly divide each code function into two code segments where all the initial segment serves as our query and all latter segments forms the corpus to be retrieved using the corresponding initial segment. The length of each query is uniformly and randomly selected to comprise between 40% and 70% of the total number of characters in the original code or function. We retain the original train/dev/test split of 905k/41k/53k and also retain the partition by their respective language from CodeSearchNet.

CodeTransOcean-DL [45]. CodeTransOcean-DeepLearning (DL) is derived from the DLTrans, a dataset featured in the CodeTransOcean benchmark which focuses on code translation. The original dataset consists of pairs of semantically equivalent deep learning code written using different deep learning libraries namely Tensorflow, Pytorch, PaddlePaddle and MXNet. For each pair of data, we use one of the codes in one of the framework as query to retrieve its counterpart written in a different framework. We follow the original train/dev/test split of 564/72/180 in CodeTransOcean-DL.

CodeTransOcean-Contest [45]. Similar to CodeTransOcean-DL, the CodeTransOcean-Contest dataset is derived from the MultilingualTrans dataset in the CodeTransOcean benchmark. The dataset features a collection of code for various problems, for instance binary tree traversal, sorting algorithms, written in various languages featured in Rosetta Code 999https://rosettacode.org/wiki/Rosetta_Code, a programming chrestomathy website. In CodeTransOcean-Contest, we focus on retrieving semantically equivalent C++, Python code pairs as these two languages differs greatly in terms of syntax and language features. We filter the examples in MultilingualTrans that does not contain any C++ nor Python code pairs and we use the Python code in each pair as queries to retrieve its counterpart from all C++ code in the filtered dataset. We retain the dataset split as in MultilingualTrans resulting in a 561/226/221 examples for train/dev/test respectively.

B.3 Code-to-Text Retrieval Datasets

CodeSearchNet [12]. The CodeSearchNet is a dataset consisting of 1 million (docstring, code) pairs sourced from open-source repositories hosted on GitHub. It contains code and documentation for several programming languages. Instead of retrieving relevant code from the original description, we reverse the roles of the docstring and code by retrieving the relevant docstring using the code as a query. We follow the original CodeSearchNet train/dev/test split of 905k/41k/53k and also retain the partition by their respective language.

B.4 Hybrid Code Retrieval Datasets

StackOverflow QA101010https://www.kaggle.com/datasets/stackoverflow/stacksample/data. We modify the original StackOverflow dataset from Kaggle 111111https://www.kaggle.com/, which contains questions posted by users, the corresponding highest voted answer to the user’s questions and tags pertaining to the user’s questions. Both the question and the answer typically consist of a mix of text and code snippets. Out of a total of 1,048,576 questions, we randomly sample 19,931 questions and their corresponding answers from the StackOverflow QA dataset. We use the questions raised by the users as a query to retrieve the corresponding answer from other correct answers posted by users. We randomly split the data into train/dev/test sets with a 13,951/3,986/1,994 split respectively.

CodeFeedBack-ST [49]. We utilize data from CodeFeedback-Filtered-Instruction to generate retrieval dataset for CodeFeedBack-ST. CodeFeedback-Filtered-Instruction is a collection of instruction-tuning datasets where a language model takes cue from user’s instruction in order to generate code to fulfil the user’s instructions. The dataset consists of pairs of users instruction and the response to be expected from an assistant where both the user’s instructions and the assistant’s response could contain a mixture of text and code snippets. We use the user’s instructions as query to retrieve the corresponding relevant expected assistant’s reply. Of the 156,526 examples present, we split the train and test dataset into 125,221 and 31,307 train and test examples respectively.

CodeFeedBack-MT [49]. We utilize examples from Code-Feedback dataset to generate retrieval dataset for CodeFeedBack-Multi Turn (MT). Code-Feedback is a collection of instruction-tuning datasets involving simulated multi-turn dialogues between 2 LLMs; one plays the role of a user and the other plays the role of assistant. To ensure that that the generated replies of the assistant are of high quality, the LLM playing the role of assistant is aided by code compilers alongside multiple generation attempts to arrive at a desirable response. Each example comprises a sequence of exchanges alternating between the user and assistant, starting with the user and ending with the assistant. To construct CodeFeedBack-MT, we split each sequence into 2 portions, the first portion consists of the initial dialogue history, prior to the last reply by the assistant, and second part is made up of the final reply by the assistant. We use the initial portion of the dialogue as query to retrieve the corresponding reply by the assistant. We split the dataset into 53,106 and 13,277 for train and test respectively.

Appendix C Metric

C.1 NDCG Metric

Normalized Discounted Cumulative Gain (NDCG) is a popular metric used to evaluate the quality of rankings, particularly in information retrieval and recommender systems. It measures the usefulness, or gain, of an relevant item based on its position in the result list, discounted logarithmically by the position.

The Discounted Cumulative Gain (DCG) is the sum of the gains of relevant items, discounted logarithmically by their positions in the ranking. The normalized version, NDCG, compares the DCG of the ranked list to the DCG of the ideal ranking. This normalization ensures that the score is within the range [0, 1].

DCG Calculation:

The DCG at position p𝑝pitalic_p is calculated as:

DCGp=i=1p2reli1log2(i+1)subscriptDCG𝑝superscriptsubscript𝑖1𝑝superscript2𝑟𝑒subscript𝑙𝑖1subscript2𝑖1\text{DCG}_{p}=\sum_{i=1}^{p}\frac{2^{rel_{i}}-1}{\log_{2}(i+1)}DCG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT divide start_ARG 2 start_POSTSUPERSCRIPT italic_r italic_e italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT - 1 end_ARG start_ARG roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_i + 1 ) end_ARG

where reli𝑟𝑒subscript𝑙𝑖rel_{i}italic_r italic_e italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the relevance score of the item at position i𝑖iitalic_i.

IDCG (Ideal DCG) Calculation:

The Ideal DCG (IDCG) is the DCG of the ideal ranking. This is calculated by sorting all items by their relevance scores in descending order and then computing the DCG using the same formula.

IDCGp=i=1p2reli1log2(i+1)subscriptIDCG𝑝superscriptsubscript𝑖1𝑝superscript2𝑟𝑒subscriptsuperscript𝑙𝑖1subscript2𝑖1\text{IDCG}_{p}=\sum_{i=1}^{p}\frac{2^{rel^{\ast}_{i}}-1}{\log_{2}(i+1)}IDCG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT divide start_ARG 2 start_POSTSUPERSCRIPT italic_r italic_e italic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT - 1 end_ARG start_ARG roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_i + 1 ) end_ARG

where reli𝑟𝑒subscriptsuperscript𝑙𝑖rel^{\ast}_{i}italic_r italic_e italic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the relevance score of the item at position i𝑖iitalic_i in the ideal ranking.

NDCG Calculation:

The NDCG at position p𝑝pitalic_p is the ratio of the DCG at position p𝑝pitalic_p to the IDCG at position p𝑝pitalic_p:

NDCGp=DCGpIDCGpsubscriptNDCG𝑝subscriptDCG𝑝subscriptIDCG𝑝\text{NDCG}_{p}=\frac{\text{DCG}_{p}}{\text{IDCG}_{p}}NDCG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = divide start_ARG DCG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG start_ARG IDCG start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG

This normalization ensures that NDCG values are bounded between 0 and 1, making comparisons across different queries or datasets meaningful.

By evaluating the NDCG, we can get a sense of how well our ranking system is performing relative to the best possible ranking. This is particularly useful when the relevance varies across items and lower-ranked items are less likely to be seen and thus should contribute less to the overall evaluation metric.

C.2 Weighted Jaccard Similarity

The weighted Jaccard similarity J(S,T)𝐽𝑆𝑇J(S,T)italic_J ( italic_S , italic_T ) [13] is calculated as the unique word overlap for all words present in both the datasets. We define the normalized frequency for an unique word k𝑘kitalic_k in a dataset to be the frequency of word k𝑘kitalic_k divided over the sum of frequencies of all words in the dataset.

Let Sksubscript𝑆𝑘{S_{k}}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT be the normalized frequency of word k𝑘kitalic_k in the source dataset S𝑆Sitalic_S and Tksubscript𝑇𝑘{T_{k}}italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT for the target dataset T𝑇Titalic_T respectively. The weighted Jaccard similarity between S𝑆Sitalic_S and T𝑇Titalic_T is defined as:

J(S,T)=kmin(Sk,Tk)kmax(Sk,Tk)𝐽𝑆𝑇subscript𝑘subscript𝑆𝑘subscript𝑇𝑘subscript𝑘subscript𝑆𝑘subscript𝑇𝑘J(S,T)=\frac{\sum_{k}\min(S_{k},T_{k})}{\sum_{k}\max(S_{k},T_{k})}italic_J ( italic_S , italic_T ) = divide start_ARG ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT roman_min ( italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT roman_max ( italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG

where the sum is over all unique words k𝑘kitalic_k present in datasets S𝑆Sitalic_S and T𝑇Titalic_T.

Appendix D Implementation Details

D.1 Implementation Environments

The experimental setup for performance and efficiency evaluation was conducted utilizing a Tesla V100 GPU with 32GB of memory and Intel Titan CPU and CUDA 11.2. The versions of the software utilized in this study are PyTorch 2.0.1 and Transformers 4.38.1. We utilized the Faiss IndexFlat[15] to store vector embedding index.

D.2 Retriever Resources

Model Public Model Checkpoints (Link)
Contriever https://huggingface.co/facebook/contriever-msmarco
E5-base https://huggingface.co/intfloat/e5-base-v2
BGE-Base https://huggingface.co/BAAI/bge-base-en-v1.5
GTE-Base https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5
UniXcoder https://huggingface.co/microsoft/unixcoder-base
BGE-M3 https://huggingface.co/BAAI/bge-m3
E5-Mistral https://huggingface.co/intfloat/e5-mistral-7b-instruct
OpenAI-Ada-002 https://openai.com/
Voyage-Code-002 https://www.voyageai.com/
Table 9: Publicly available model links used for evaluation in CoIR.

The huggingface models or API used of retrieval model used in CoIR is listed in Table 9.

D.3 Dataset Resources

Table 10: Corpus Name and Link used for datasets in CoIR.

The urls of dataset resources used in CoIR is listed in Table 10.

D.4 Dataset Licenses

The licenses of all datasets used in CoIR are listed as follows:

  1. \bullet

    APPS: Provided under “MIT License” for non-commercial research purposes.

  2. \bullet

    CodeTransOcean: Provided under Apache License 2.0 license.

  3. \bullet

    CodeSearchNet: Provided under “MIT License” for non-commercial research purposes.

  4. \bullet

    CoSQA: Provided under “MIT License” for non-commercial research purposes.

  5. \bullet

    Synthetic Text2sql: Provided under Apache License 2.0 license.

  6. \bullet

    Code-Feedback: Provided under Apache License 2.0 license.

  7. \bullet

    CodeFeedback-Filtered-Instruction: Provided under Apache License 2.0 license.