1 DCST, Tsinghua University, China
2 Quan Cheng Laboratory, China
3 Columbia University, USA
4 MegaTech.AI, China
Email: [email protected]

Towards an In-Depth Comprehension of Case Relevance for Better Legal Retrieval

Haitao Li (1, 2; ORCID 0009-0006-8766-8610), You Chen (1, 2; ORCID 0009-0008-9873-4315), Zhekai Ge (3; ORCID 0009-0005-5619-8591), Qingyao Ai (1, 2; ORCID 0000-0002-5030-709X), Yiqun Liu (1, 2; ORCID 0000-0002-0140-4512), Quan Zhou (4, 2; ORCID 0009-0003-8097-4621), Shuai Huo (4, 2; ORCID 0009-0007-1276-0268)
Abstract

Legal retrieval techniques play an important role in preserving the fairness and equality of the judicial system. As a well-known annual international competition, COLIEE aims to advance the development of state-of-the-art retrieval models for legal texts. This paper elaborates on the methodology employed by the TQM team in COLIEE2024. Specifically, we explored various lexical matching and semantic retrieval models, with a focus on enhancing the understanding of case relevance. Additionally, we integrated various features using learning-to-rank techniques. Furthermore, fine-grained heuristic pre-processing and post-processing methods were proposed to mitigate the impact of irrelevant information. As a result, our methodology achieved remarkable performance in COLIEE2024, securing first place in Task 1 and third place in Task 3. We anticipate that our proposed approach can contribute valuable insights to the advancement of legal retrieval technology.

Keywords: Legal case retrieval · Dense retrieval · Pre-training

1 Introduction

Efficient legal retrieval is essential in the judicial process. It supports lawyers in argumentation, guides judges in decision-making, and aids scholars in analyzing legal trends. With the evolution of the legal field into the digital age, the ability to efficiently navigate vast legal databases with advanced search techniques is essential for maintaining justice and ensuring judicial fairness [28, 2, 36, 1, 21, 20, 14].

The Competition on Legal Information Extraction/Entailment (COLIEE) has emerged as a significant platform for advancing the state-of-the-art in legal information processing and retrieval. The competition consists of several tasks focusing on two categories: legal retrieval and legal entailment.

This year, our team TQM focused on the legal retrieval tasks, i.e., Task 1 and Task 3. Task 1 involves retrieving relevant documents to support a given query case within the case law system. Task 3 involves retrieving civil law articles related to Japanese Legal Bar exam questions under the statutory law system. Through a thorough comprehension of case relevance, the TQM team achieved commendable results in COLIEE2024.

In legal practice, case relevance is complex and differs from that of conventional web search [23, 27, 19]. In the context of legal retrieval, relevance transcends mere lexical matches or semantic similarities. The relevance of legal cases usually involves an in-depth analysis of the facts of the case, legal principles, and prior jurisprudence [19, 29, 15]. This requires the retrieval system to understand not only the words and concepts in the text, but also to gain insight into their interactions within a particular legal framework. Traditional methods often prove inadequate in capturing the nuanced aspects that determine case relevance, including the construction of legal arguments, key legal facts, and the particular nature of applicable laws.

Therefore, during COLIEE2024, our team, TQM, not only investigated the effectiveness of established methods in legal retrieval but also explored new strategies to improve the model’s understanding of case relevance. Specifically, within the traditional lexical matching approach, we employed BM25_ngram to underscore the significance of law-specific terms in determining relevance. In the semantic similarity approach, we utilized the translation process between different structures of legal cases to deepen the understanding of key facts. Subsequently, we employed learning-to-rank techniques to integrate different features. In addition, we designed fine-grained heuristic pre-processing and post-processing methods to mitigate the impact of irrelevant information. The official results show that our team attained first place in Task 1 and third place in Task 3, which demonstrates the effectiveness of our approach.

The paper is structured as follows: Section 2 offers an overview of foundational concepts in legal case retrieval and dense retrieval. Section 3 elaborates on the COLIEE2024 legal case retrieval task, encompassing its description, datasets, and evaluation metrics. Section 4 delves into the technical aspects of the study. Following this, Section 5 presents the results of our experiments. The paper concludes with Section 6, summarizing key findings and outlining directions for future research.

2 Related Work

2.1 Legal Retrieval

In the area of legal retrieval, the integration of deep learning techniques has become foundational, giving rise to a plethora of methodologies such as CNN-based models [30], BiDAF [26], and SMASH-RNN [10], among others. Pre-trained Transformer models have emerged as the preferred architecture in this domain, notably powering innovations like LEGAL-BERT [3] and Lawformer [32]. In addition, Jiang et al. [11] demonstrated improvements in cross-lingual retrieval by using Multilingual BERT to handle the multilingual nature of legal documentation. Recent contributions have further enriched this field: work on context-aware citation recommendation [9] and graph-based legal reasoning [38] has enhanced the relevance and semantic richness of case retrieval methods. Li et al. also proposed SAILER [14], which utilizes the structure of legal documents for pre-training and achieves the best results on some legal benchmarks. These developments highlight the potential of advances in AI and machine learning for legal information retrieval.

2.2 Dense Retrieval

Dense retrieval marks a radical departure from traditional retrieval: it leverages dual encoders to map queries and documents into dense embeddings and capture intricate contextual nuances [33, 7]. This method has been progressively improved through a series of innovative works. Zhan et al. [17] introduced dynamic negative sampling to refine the matching process, and Chen et al. [5] proposed ARES, which incorporates retrieval axioms during pre-training and substantially improves performance. Similarly, Karpukhin et al. [12] introduced DPR (Dense Passage Retrieval), which surpassed traditional IR methods by a large margin in large-scale open-domain question-answering tasks, and Xiong et al. [34] introduced ANCE (Approximate Nearest Neighbor Negative Contrastive Learning), which dynamically updates the negative samples and further optimizes the retrieval process. These studies demonstrate the potential of dense retrieval to provide more accurate results across various IR applications.

3 Task Overview

3.1 Task 1: The Case Law Retrieval Task

3.1.1 Task Description

The Competition on Legal Information Extraction/Entailment (COLIEE), an annual international contest, is committed to advancing state-of-the-art methodologies in legal text processing. In COLIEE2024, four tasks are presented, with our exclusive focus directed towards the legal retrieval task.

Task 1, referred to as the Case Law Retrieval task, involves the identification of supporting cases that substantiate the decisions of query cases within an extensive corpus. Formally, for a given query case denoted as $q$ and a set of candidate cases represented by $S$, the objective is to identify all supporting cases, designated as $S_q^* = \{S_1, S_2, \ldots, S_n\}$, from the extensive candidate pool. Participants are allowed to submit any number of supporting cases for each individual query in this task. Hence, it is also crucial to identify the conditions fulfilled by the relevant cases.

The data corpus utilized for Task 1 comprises a collection of case law documents from the Federal Court of Canada, provided by Compass Law. Detailed statistics of this dataset are presented in Table 1. Our analysis shows a significant difference in the average number of relevant documents per query between the COLIEE2023 training and test sets. We therefore account for a possible similar bias when designing the post-processing for COLIEE2024. We employ the test set of COLIEE2023 as the validation set and apply the best parameters found on COLIEE2023 to COLIEE2024.

Table 1: Dataset statistics of COLIEE Task 1.
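| Statistic | COLIEE2021 Train | COLIEE2021 Test | COLIEE2022 Train | COLIEE2022 Test | COLIEE2023 Train | COLIEE2023 Test | COLIEE2024 Train | COLIEE2024 Test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| # of queries | 650 | 250 | 898 | 300 | 959 | 319 | 1278 | 400 |
| # of candidate cases per query | 4415 | 4415 | 3531 | 1263 | 4400 | 1335 | 5616 | 1734 |
| avg # of relevant candidates/paragraphs | 5.17 | 3.60 | 4.68 | 4.21 | 4.68 | 2.69 | 4.16 | - |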

3.1.2 Metrics

For COLIEE2024 Task 1, the evaluation metrics are precision, recall, and the F1-measure:

$$\text{Precision} = \frac{\#TP}{\#TP + \#FP} \tag{1}$$

$$\text{Recall} = \frac{\#TP}{\#TP + \#FN} \tag{2}$$

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

where $\#TP$ represents the total number of correctly retrieved candidate cases across all queries, $\#FP$ denotes the number of incorrectly retrieved candidate cases across all queries, and $\#FN$ signifies the number of noticed candidate cases that were missed across all queries. Notably, the evaluation employs a micro-average approach, where the evaluation measure is computed from the pooled results of all queries. This differs from a macro-average approach, which calculates the evaluation measure for each query individually before averaging these values.
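As an illustration, the micro-averaged measures in Eqs. (1)-(3) can be sketched as follows. The function name `micro_prf1` and the dictionary layout (query ID mapped to a set of case IDs) are hypothetical choices made for this example, not part of the official evaluation script.

```python
from typing import Dict, Set, Tuple


def micro_prf1(
    predicted: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1: counts are pooled over all queries."""
    tp = fp = fn = 0
    for query_id, gold_cases in gold.items():
        retrieved = predicted.get(query_id, set())
        tp += len(retrieved & gold_cases)  # correctly retrieved cases
        fp += len(retrieved - gold_cases)  # retrieved but not noticed
        fn += len(gold_cases - retrieved)  # noticed but not retrieved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with two queries (IDs are made up).
predicted = {"q1": {"c1", "c2"}, "q2": {"c3"}}
gold = {"q1": {"c1"}, "q2": {"c3", "c4"}}
precision, recall, f1 = micro_prf1(predicted, gold)
```

Because the counts are pooled before the ratios are taken, queries with many noticed cases weigh more heavily than they would under macro-averaging.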

Figure 1: Example of Task 3 in COLIEE2024. The label is Yes, which means that the text is relevant to the articles.

3.2 Task 3: The Statute Law Retrieval Task

3.2.1 Task Description

This task focuses on retrieving civil law articles relevant to a given "Yes/No" question. For a legal bar exam question denoted as $Q$ and a set of Japanese Civil Code articles represented as $S = \{S_1, \ldots, S_n\}$, the objective is to compile a subset $E$ from $S$ that aids in answering $Q$. The questions for this task are sourced from Japanese Legal Bar Exams and are translated into English, along with the entire corpus of Japanese Civil Law articles.

The dataset of this task consists of 1097 pairs, a legal corpus (Civil Code) with 768 articles, and 109 test queries. Participants need to find the relevant articles for each test query. Examples from this dataset are shown in Figure 1. More precisely, this task resembles a ranking task, since the candidate set has only 768 legal entries. We selected the 101 questions whose IDs begin with R04 to form a validation set. This subset was used to evaluate various models and settings.

3.2.2 Metrics

For COLIEE 2024 Task 3, the evaluation criteria include macro-average precision, recall, and F2-measure, diverging from the micro-average measures traditionally used in Task 1.

$$\text{Precision} = \text{Average of } \frac{|\text{Correctly retrieved articles for each query}|}{|\text{Retrieved articles for each query}|} \tag{4}$$

$$\text{Recall} = \text{Average of } \frac{|\text{Correctly retrieved articles for each query}|}{|\text{Correct articles for each query}|} \tag{5}$$

$$F\text{-measure} = \frac{5 \times \text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}} \tag{6}$$
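A minimal sketch of the macro-averaged measures in Eqs. (4)-(6) is given below. It applies Eq. (6) to the averaged precision and recall, exactly as the equation is written; averaging a per-query F2 instead would give a slightly different number. The function name `macro_prf2` and the input layout are hypothetical.

```python
from typing import Dict, Set, Tuple


def macro_prf2(
    predicted: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Macro-averaged precision and recall, combined into F2 as in Eq. (6)."""
    precisions, recalls = [], []
    for query_id, gold_articles in gold.items():
        retrieved = predicted.get(query_id, set())
        correct = len(retrieved & gold_articles)
        precisions.append(correct / len(retrieved) if retrieved else 0.0)
        recalls.append(correct / len(gold_articles) if gold_articles else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    denominator = 4 * precision + recall
    f2 = 5 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f2


# Toy example with two questions (article IDs are made up).
predicted = {"q1": {"a1", "a2"}, "q2": {"a3"}}
gold = {"q1": {"a1"}, "q2": {"a3", "a4"}}
precision, recall, f2 = macro_prf2(predicted, gold)
```

The factor 5 in the numerator and 4 in the denominator come from setting beta = 2 in the general F-beta formula, which weights recall more heavily than precision.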

4 Method

In this section, we present our approach and motivation for the legal case retrieval task in COLIEE2024.

4.1 Task 1: The Case Law Retrieval Task

In this section, we present our solution in detail for Task 1 of COLIEE2024. Overall, we followed the framework of last year’s first place team THUIR [21]. We first pre-process the data to eliminate noisy information. After that, we implemented the classical lexical matching method and the state-of-the-art semantic retrieval model. The difference is that we improve both approaches from the perspective of case relevance. Following this, we use learning to rank to fuse features from different perspectives for better modeling of case relevance. Finally, we propose heuristic post-processing strategies by observing common properties of relevant cases.

4.1.1 Pre-processing

Following Li et al. [21], we perform fine-grained data pre-processing before training. Specifically, our initial step was to remove the text before the “[1]” marker in each case document, which typically contains procedural details such as time and court. Subsequently, we eliminated all placeholders, notably “FRAGMENT_SUPPRESSED”, to avoid interference in similarity computations. Additionally, where legal documents contained French text, we utilized the Langdetect tool to identify and remove French passages. Documents predominantly in French were translated into English to retain their essential information. For summary extraction, we selectively extracted the sections under “summary” subheadings, which generally encapsulate key case elements, and placed them at the beginning of the processed text. Through this pre-processing, we aimed to remove as much noisy information as possible from the case documents, since such information does not contribute to the relevance judgment.
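The first two pre-processing steps can be sketched as follows. This is a simplified illustration, not the pipeline used in the competition: the function name `preprocess_case` is hypothetical, and French detection, translation, and summary extraction are deliberately left out.

```python
import re

# Placeholder named in the text above; further placeholders could be added here.
PLACEHOLDERS = ("FRAGMENT_SUPPRESSED",)


def preprocess_case(text: str) -> str:
    """Strip the procedural header and placeholders from a case document."""
    # Keep everything from the first "[1]" paragraph marker onwards.
    start = text.find("[1]")
    if start != -1:
        text = text[start:]
    # Remove placeholders so they do not affect similarity computations.
    for placeholder in PLACEHOLDERS:
        text = text.replace(placeholder, " ")
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


raw = "Federal Court, 2010-05-01\n[1] The applicant seeks FRAGMENT_SUPPRESSED judicial review."
clean = preprocess_case(raw)
```

Documents without a “[1]” marker are left untouched by the first step in this sketch, which is an assumption; the text does not say how such documents were handled.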

4.1.2 Lexical Matching Models

In previous competitions, many participants have discovered that traditional lexical matching models can produce competitive results. This phenomenon can be attributed to two primary factors. Firstly, bag-of-words models do not impose limitations on the text length, rendering them well-suited for handling legal case documents with lengthy texts. Secondly, the legal domain encompasses numerous specialized terms, where relevance is often discernible through word matching. Therefore, in this section, we experimented with the following methods:

  • BM25 [25], a probabilistic relevance model grounded in the bag-of-words concept, calculates the relevance between a query $q$ and a document $d$. The formulation of BM25 is as follows:

    $$BM25(d, q) = \sum_{i=1}^{M} \frac{IDF(t_i) \cdot TF(t_i, d) \cdot (k_1 + 1)}{TF(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{len(d)}{avgdl}\right)} \tag{7}$$

    where $k_1$ and $b$ are free hyperparameters, $TF$ denotes term frequency, and $IDF$ signifies inverse document frequency. The term $avgdl$ represents the average document length across the dataset.

  • QLD [37] is an efficient probabilistic statistical model that assesses relevance by evaluating the likelihood of generating the query. The QLD score is computed as follows:

    $$\log p(q|d) = \sum_{i: c(q_i; d) > 0} \log \frac{p_s(q_i|d)}{\alpha_d \, p(q_i|\mathcal{C})} + n \log \alpha_d + \sum_{i} \log p(q_i|\mathcal{C}) \tag{8}$$

    For more information, please refer to the work of Zhai et al. [37].

  • BM25_ngram is a modified version of BM25 designed to better determine relevance through lexical matching. Legal case documents contain many uncommon specialized terms that hold unique meanings in specific contexts, so specific combinations of terms can offer additional evidence for relevance identification. Therefore, we implemented BM25_ngram by adapting the ngram_range parameter of the TfidfVectorizer. The ngram_range parameter specifies the lower and upper boundaries of the range of n-values for the n-grams to be extracted.
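To make the idea concrete, the sketch below scores documents with Eq. (7) over word n-grams instead of single words. It is written in plain Python rather than on top of TfidfVectorizer, so it illustrates the technique and is not the implementation used in the competition. The class name `BM25Ngram`, the n-gram range (1, 2), and the IDF variant are assumptions; the defaults k1 = 3.0 and b = 1.0 are taken from the BM25 row of Table 2.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n_min: int, n_max: int) -> List[str]:
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


class BM25Ngram:
    """BM25 (Eq. 7) in which every 'term' is a word n-gram."""

    def __init__(self, docs: List[str], n_min: int = 1, n_max: int = 2,
                 k1: float = 3.0, b: float = 1.0) -> None:
        self.n_min, self.n_max, self.k1, self.b = n_min, n_max, k1, b
        self.doc_tf = [Counter(self._terms(doc)) for doc in docs]
        self.doc_len = [sum(tf.values()) for tf in self.doc_tf]
        self.avgdl = sum(self.doc_len) / len(docs)
        # Document frequency: number of documents containing each n-gram.
        doc_freq = Counter(term for tf in self.doc_tf for term in tf)
        n_docs = len(docs)
        # Assumed IDF variant (always positive); the text does not specify one.
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def _terms(self, text: str) -> List[str]:
        return ngrams(text.lower().split(), self.n_min, self.n_max)

    def score(self, query: str, doc_index: int) -> float:
        tf = self.doc_tf[doc_index]
        length_norm = self.k1 * (
            1 - self.b + self.b * self.doc_len[doc_index] / self.avgdl
        )
        total = 0.0
        for term in set(self._terms(query)):  # distinct query n-grams (assumption)
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + length_norm)
        return total


docs = [
    "judicial review of the immigration decision",
    "contract dispute over unpaid wages",
    "application for judicial review dismissed",
]
bm25 = BM25Ngram(docs)
scores = [bm25.score("judicial review", i) for i in range(len(docs))]
```

With bigrams included, the phrase "judicial review" contributes as a term of its own in addition to its two component words, which is how multi-word legal expressions gain extra weight.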

Figure 2: Pre-training designs of DELTA.

4.1.3 Semantic Retrieval Models

Semantic retrieval models can effectively avoid the problem of lexical mismatch and have been widely used in legal retrieval. However, pre-trained language models often perform unsatisfactorily due to their limited input length and the difficulty of effectively understanding legal structures. Recently, a series of works has achieved state-of-the-art results by designing specific pre-training objectives for legal case retrieval. In this section, we implement SAILER and optimize it for better identification of legal case relevance.

  • SAILER [14] is a structure-aware pre-trained model. It fully utilizes the structure of legal documents to construct information bottlenecks and achieves state-of-the-art results on legal case retrieval tasks. We continued to fine-tune SAILER with the training sets of COLIEE2023 and COLIEE2022.

  • DELTA [16] is an improved version of SAILER that enhances the understanding of key facts in legal cases and improves discriminative ability. Specifically, DELTA introduces a deep decoder that translates the Fact section into the Reasoning section. A word alignment mechanism is then employed to determine the key facts. Finally, the representation of the case in the vector space is pulled closer to the key facts and pushed away from the non-key facts. The framework of DELTA is shown in Figure 2.

Table 2: Features employed in our learning-to-rank approach for COLIEE2024 Task 1. Placeholders include “FRAGMENT_SUPPRESSED”, “REFERENCE_SUPPRESSED”, and “CITATION_SUPPRESSED”.

| Feature ID | Feature Name | Description |
| --- | --- | --- |
| 1 | query_length | Length of the query |
| 2 | candidate_length | Length of the candidate paragraph |
| 3 | query_ref_num | Number of placeholders in the query case |
| 4 | doc_ref_num | Number of placeholders in the candidate case |
| 5 | BM25 | Query-candidate scores with BM25 ($k_1 = 3.0$, $b = 1.0$) |
| 6 | BM25_rank | Rank of documents in the search list of the query by BM25 score |
| 7 | QLD | Query-candidate scores with QLD |
| 8 | QLD_rank | Rank of documents in the search list of the query by QLD score |
| 9 | BM25_ngram | Query-candidate scores with BM25_ngram |
| 10 | BM25_ngram_rank | Rank of documents in the search list of the query by BM25_ngram score |
| 11 | SAILER | Inner product of query and candidate vectors generated by SAILER |
| 12 | SAILER_rank | Rank of documents in the search list of the query by SAILER score |
| 13 | DELTA | Inner product of query and candidate vectors generated by DELTA |
| 14 | DELTA_rank | Rank of documents in the search list of the query by DELTA score |

4.1.4 Learning to Rank

Following previous work [35, 18, 4, 31, 8], we utilize LightGBM to integrate all feature scores. Table 2 shows the details of all the features. A total of 14 features were used to produce the final score. For ranking optimization, we employ the Normalized Discounted Cumulative Gain (NDCG) as our objective. The model with the highest performance on the validation set is selected for subsequent testing.
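Two ingredients of this step can be illustrated without the ranking library itself: turning a score list into a rank feature (the `*_rank` rows of Table 2) and computing NDCG for a ranked list. The sketch below is a plain-Python illustration; the function names are hypothetical, and the exponential gain with a log2 discount is one common NDCG convention that is assumed here rather than stated in the text. LightGBM provides a `lambdarank` objective for NDCG-oriented training, but whether that exact setting was used is not stated.

```python
import math
from typing import Dict, List


def rank_feature(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each candidate to its 1-based rank under one scoring model."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {candidate: rank for rank, candidate in enumerate(ordered, start=1)}


def ndcg_at_k(ranked_labels: List[int], k: int) -> float:
    """NDCG@k for relevance labels listed in the order the model ranked them."""

    def dcg(labels: List[int]) -> float:
        return sum(
            (2 ** rel - 1) / math.log2(position + 2)
            for position, rel in enumerate(labels[:k])
        )

    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal else 0.0


# Toy example: BM25 scores of three candidates for one query (values made up).
bm25_scores = {"c1": 12.3, "c2": 7.1, "c3": 9.8}
bm25_rank = rank_feature(bm25_scores)

perfect = ndcg_at_k([1, 1, 0], k=3)
imperfect = ndcg_at_k([0, 1, 1], k=3)
```

Rank features complement raw scores because scores from different models live on different scales, whereas ranks are directly comparable across models.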

4.1.5 Post-processing

Finally, we post-processed the ranking scores from the relevance perspective to remove irrelevant documents. In addition to filtering by trial date, filtering query cases, and dynamic cut-off, which were proposed by Li et al. [21], we add filtering duplicate cases as a post-processing strategy. The specific details are as follows:

  • Filtering by trial date. Considering that a query case typically cites cases preceding its trial date, it is logical to filter the candidate set based on this criterion. By extracting all dates mentioned within each case, we determine the latest date as the trial date, thereby minimizing erroneous exclusions. In instances where dates cannot be extracted from query cases, we retain all cases in the candidate set.

  • Filtering query cases. We find that query cases rarely appear as noticed cases for other queries. Therefore, we remove all query cases from the search results.

  • Filtering duplicate cases. We find that noticed cases do not repeat within the query cases of COLIEE2021, COLIEE2022, and COLIEE2023, respectively, which indicates that deleting duplicate cases might be effective. Kim et al. [13] also removed repeated cases in a previous retrieval task, using the maximum number of duplicate cases as a hyperparameter. Because removing duplicate cases may delete all the candidate cases for some query cases, we define $t$ as the maximum number of duplicate cases and then supplement the $s$ highest-scoring cases for query cases left without any candidate case. A grid search on the validation set is used to find the optimal $t$ and $s$.

  • Dynamic cut-off. To accommodate the variability in the number of supporting cases associated with different query cases, we implement a dynamic cut-off mechanism for each query case. This involves three hyperparameters: $h$, $l$, and $p$. Here, $h$ represents the maximum and $l$ the minimum number of supporting cases to be retrieved per query case. Additionally, if the highest score achieved by the supporting cases for a specific query case is denoted as $S$, then only those supporting cases scoring above $p \times S$ are selected. A grid search is employed to determine the optimal values of $h$, $l$, and $p$.
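The dynamic cut-off can be sketched as below. How the three hyperparameters interact when they conflict is not spelled out in the text, so this example assumes that the relative threshold is applied first, the list is then padded to at least l cases, and finally truncated to at most h cases. The function name `dynamic_cutoff` is hypothetical, and scores are assumed to be positive.

```python
from typing import List, Tuple


def dynamic_cutoff(
    ranked: List[Tuple[str, float]], h: int, l: int, p: float
) -> List[str]:
    """Select supporting cases from (case_id, score) pairs sorted by descending score."""
    if not ranked:
        return []
    top_score = ranked[0][1]  # S: the highest score for this query
    kept = [case_id for case_id, score in ranked if score > p * top_score]
    if len(kept) < l:
        # Assumed behaviour: fall back to the l best cases.
        kept = [case_id for case_id, _ in ranked[:l]]
    return kept[:h]  # never return more than h cases


ranked = [("c1", 10.0), ("c2", 8.0), ("c3", 4.0), ("c4", 1.0)]
selected = dynamic_cutoff(ranked, h=3, l=1, p=0.46)
padded = dynamic_cutoff(ranked, h=3, l=3, p=0.46)
```

In the first call the threshold is 0.46 × 10.0 = 4.6, so only the two cases scoring above it are kept; in the second call the minimum l = 3 forces the third case to be added.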

4.2 Task 3: The Statute Law Retrieval Task

In this section, we follow the framework of Task 1 to implement Task 3. Specifically, we design heuristic pre-processing and post-processing strategies and implement advanced retrievers and rankers. Finally, we use learning to rank to integrate all scores.

4.2.1 Pre-processing

In Task 3, we primarily pre-process the retrieval pool, i.e., the legal articles. Specifically, we started by removing the lead-in information from the Civil Code, for example “Part I General Provisions” and “Chapter I Common Provisions”. We consider that this information does not contribute to the relevance judgment. Subsequently, we deleted all explanatory descriptions in brackets, such as “(Standards for Construction)”. We consider that these are too general and do not help differentiate legal articles. Finally, we obtained a mapping from article IDs to their content to form the retrieval set.

4.2.2 Retriever

We implemented the following retrievers to get the most relevant legal articles from the full set:

  • BM25 [25] is a robust lexical matching method. In Task 3, we set $k_1$ to 0.99 and $b$ to 0.75.

  • QLD [37] is another effective probabilistic lexical model. The detailed description can be found in Section 4.1.

4.2.3 Reranker

After retrieving the top-200 relevant legal articles, we use rerankers to further rank them. The models are as follows:

  • BERT [6] is the classic pre-trained language model, which employs a multi-layer bidirectional Transformer encoder architecture. BERT leverages both the Masked Language Model (MLM) and Next Sentence Prediction (NSP) as its pre-training tasks.

  • RoBERTa [22] represents an advancement over BERT, utilizing a more extensive dataset for pre-training. Unlike BERT, RoBERTa is exclusively pre-trained using the Masked Language Model (MLM) task.

  • LEGAL-BERT [3] has been pre-trained on an extensive English legal database and has demonstrated state-of-the-art performance across a variety of legal tasks.

  • monoT5 [24] adopts an encoder-decoder architecture. It operates by generating a “true” or “false” token, reflecting the relevance between queries and candidates. The model then considers the probability of generating “true” as the ultimate relevance score.
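The monoT5 scoring rule can be illustrated with a short sketch. It assumes the model exposes the logits of the “true” and “false” tokens at the first decoding step and normalizes over only these two tokens; the function name `monot5_score` and its inputs are hypothetical, and no model is loaded here.

```python
import math


def monot5_score(true_logit: float, false_logit: float) -> float:
    """Probability of "true" under a softmax restricted to the two target tokens."""
    # Subtract the maximum before exponentiating for numerical stability.
    offset = max(true_logit, false_logit)
    exp_true = math.exp(true_logit - offset)
    exp_false = math.exp(false_logit - offset)
    return exp_true / (exp_true + exp_false)


undecided = monot5_score(0.0, 0.0)
confident = monot5_score(2.0, -1.0)
```

Sorting candidate articles by this probability in descending order yields the reranked list.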

For BERT, RoBERTa, and LEGALBERT, we train them with the cross-encoder architecture. Specifically, the query and the legal article are concatenated and fed into the encoder, and the vector of the [CLS] token is passed through an MLP layer to obtain the final score. The loss function for training is as follows:

$$L(q, d^{+}, d^{-}_{1}, \ldots, d^{-}_{n}) = -\log \frac{\exp(s(q, d^{+}))}{\exp(s(q, d^{+})) + \sum_{j=1}^{n} \exp(s(q, d^{-}_{j}))} \tag{9}$$

where $d^{+}$ and $d^{-}$ denote relevant and negative (irrelevant) articles, respectively. We employ irrelevant articles from the top-200 articles retrieved by BM25 as negative examples. For monoT5, we trained three versions: monoT5_base, monoT5_large, and monoT5_3B.
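For a single training instance, the loss in Eq. (9) can be computed from the scalar scores alone, as in the following sketch. The function name `contrastive_loss` is hypothetical, and the scores stand in for the outputs $s(q, d)$ of the cross-encoder, which is not reproduced here.

```python
import math
from typing import Sequence


def contrastive_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log-likelihood of the relevant article among all scored articles."""
    scores = [pos_score, *neg_scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    offset = max(scores)
    log_denominator = offset + math.log(sum(math.exp(s - offset) for s in scores))
    return log_denominator - pos_score


tie = contrastive_loss(0.0, [0.0])
weak_margin = contrastive_loss(1.0, [0.0, 0.0])
strong_margin = contrastive_loss(5.0, [0.0, 0.0])
```

The loss shrinks as the relevant article's score rises above those of the negatives, which is what pushes the reranker to separate them.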

4.2.4 Learning to Rank

Similar to Task 1, we integrate all the features using LightGBM. The features utilized in Task 3 are displayed in Table 3. A total of 9 features were employed to produce the final score. We adopt Precision@1 as the optimization objective and select the model with the best performance on the validation set for testing.

Table 3: Features that we used for learning to rank in COLIEE2024 Task 3.
| Feature ID | Feature Name | Description |
| --- | --- | --- |
| 1 | query_length | Length of the query |
| 2 | article_length | Length of the candidate article |
| 3 | BM25 | Query-article scores with BM25 |
| 4 | QLD | Query-article scores with QLD |
| 5 | BERT | Query-article scores with BERT |
| 6 | RoBERTa | Query-article scores with RoBERTa |
| 7 | LEGALBERT | Query-article scores with LEGAL-BERT-base |
| 8 | monoT5_large | Query-article scores with monoT5_large |
| 9 | monoT5_3B | Query-article scores with monoT5_3B |

4.2.5 Post-processing

Finally, we performed heuristic post-processing on the ranking scores. Our analysis showed that the majority of queries are associated with no more than two relevant legal articles. Therefore, we denote the maximum score for a query as $S$, and only articles whose scores exceed $S \times p$ are considered relevant. The hyperparameter $p$ is tuned to keep the proportion of queries with two relevant articles consistent across the training and validation sets.
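One way to read this tuning procedure is sketched below: for each candidate value of p, count the share of validation queries that receive more than one article, and pick the value whose share is closest to the share observed in the training labels. This reading, the function names, the grid, and the toy scores are all assumptions made for illustration; scores are assumed to be positive.

```python
from typing import Dict, List, Sequence


def select_articles(scores: Dict[str, float], p: float) -> List[str]:
    """Keep the articles whose score exceeds p times the best score S."""
    top_score = max(scores.values())
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [article for article, score in ordered if score > p * top_score]


def multi_article_rate(all_scores: Sequence[Dict[str, float]], p: float) -> float:
    """Fraction of queries for which more than one article is selected."""
    multi = sum(len(select_articles(scores, p)) > 1 for scores in all_scores)
    return multi / len(all_scores)


def tune_p(all_scores: Sequence[Dict[str, float]], target_rate: float,
           grid: Sequence[float]) -> float:
    """Pick the p whose multi-article rate is closest to the target rate."""
    return min(grid, key=lambda p: abs(multi_article_rate(all_scores, p) - target_rate))


validation_scores = [
    {"a1": 0.9, "a2": 0.8, "a3": 0.1},
    {"a1": 0.7, "a2": 0.2},
    {"a1": 0.95, "a2": 0.5, "a3": 0.45},
    {"a1": 0.6, "a2": 0.1},
]
# Hypothetical share of training queries with more than one relevant article.
best_p = tune_p(validation_scores, target_rate=0.25, grid=[0.3, 0.5, 0.7, 0.9])
```

A larger p yields a stricter threshold and therefore fewer queries with several predicted articles, so the rate decreases monotonically as p grows.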

Table 4: Performance and optimal hyperparameters on the COLIEE2024 validation set.

| Model | F1 score | Precision | Recall | p | h | l | t | s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TQM_run1 | 0.3824 | 0.3708 | 0.3046 | 0.7 | 5 | 4 | 1 | 2 |
| TQM_run2 | 0.4294 | 0.4064 | 0.4552 | 0.3 | 7 | 4 | 1 | 2 |
| TQM_run3 | 0.4592 | 0.4530 | 0.4656 | 0.46 | 7 | 1 | 1 | 2 |

Table 5: Results on the official test of COLIEE2024 Task 1. Best results are marked bold.

| Team | Submission | F1 | Precision | Recall |
| --- | --- | --- | --- | --- |
| TQM | task1_test_answer_2024_run1 | 0.4432 | 0.5057 | 0.3944 |
| TQM | task1_test_answer_2024_run3 | 0.4342 | 0.5082 | 0.3790 |
| UMNLP | task1_umnlp_run1 | 0.4134 | 0.4000 | 0.4277 |
| UMNLP | task1_umnlp_run2 | 0.4097 | 0.3755 | 0.4507 |
| UMNLP | task1_umnlp_runs_combined | 0.4046 | 0.3597 | 0.4622 |
| YR | task1_yr_run1 | 0.3605 | 0.3210 | 0.4110 |
| TQM | task1_test_answer_2024_run2 | 0.3548 | 0.4196 | 0.3073 |
| YR | task1_yr_run2 | 0.3483 | 0.3245 | 0.3758 |
| YR | task1_yr_run3 | 0.3417 | 0.3184 | 0.3688 |
| JNLP | 64b7b-07f39 | 0.3246 | 0.3110 | 0.3393 |
| JNLP | 07f39 | 0.3222 | 0.3347 | 0.3105 |
| JNLP | 64b7b-48fe5 | 0.3103 | 0.3017 | 0.3195 |
| WJY | submit_1 | 0.3032 | 0.2700 | 0.3457 |
| BM24 | task1_test_result | 0.1878 | 0.1495 | 0.2522 |
| CAPTAIN | captain_mstr | 0.1688 | 0.1793 | 0.1594 |
| CAPTAIN | captain_ft5 | 0.1574 | 0.1586 | 0.1562 |
| NOWJ | nowjtask1run2 | 0.1313 | 0.0895 | 0.2465 |
| NOWJ | nowjtask1run3 | 0.1306 | 0.0957 | 0.2055 |
| NOWJ | nowjtask1run1 | 0.1224 | 0.0813 | 0.2478 |
| WJY | submit_3 | 0.1179 | 0.0870 | 0.1831 |
| WJY | submit_2 | 0.1174 | 0.0824 | 0.2042 |
| MIG | test1_ans | 0.0508 | 0.0516 | 0.0499 |
| UBCS | run3 | 0.0276 | 0.0140 | 0.7196 |
| UBCS | run2 | 0.0275 | 0.0140 | 0.7177 |
| UBCS | run1 | 0.0272 | 0.0139 | 0.7100 |
| CAPTAIN | captain_bm25 | 0.0019 | 0.0019 | 0.0019 |

5 Experiment Results

In this section, we present the results of our experiments and the corresponding analysis.

5.1 Task 1. The Case Law Retrieval Task

5.1.1 Submissions

For COLIEE2024 Task 1, we submitted 3 runs with the following details:

  • task1_test_answer_2024_run1: We implemented the lexical matching model QLD. In the post-processing stage, we searched for the best parameters $t, s, h, l, p$ on the validation set based on the QLD scores and applied them to the test set.

  • task1_test_answer_2024_run2: We implemented the improved lexical matching model BM25_ngram. The best parameters $t, s, h, l, p$ were found by a search on the validation set based on the BM25_ngram scores in the post-processing stage and were then applied to the test set.

  • task1_test_answer_2024_run3: LightGBM integrates all the features to produce the final score. The best post-processing parameters are then selected based on this score and applied to the test set.

Table 6: The performance of various models on the COLIEE2024 Task 3 validation set. Best results are marked bold.
Model F2 Precision Recall
BM25 0.5267 0.6039 0.5181
QLD 0.3888 0.4257 0.3844
BERT 0.6698 0.7524 0.6600
RoBERTa 0.6637 0.7524 0.6534
LEGALBERT 0.6929 0.7920 0.6815
monoT5_base 0.6951 0.7821 0.6848
monoT5_large 0.7072 0.8019 0.6963
monoT5_3B 0.7171 0.8118 0.7062
Table 7: Results on the official test of COLIEE2024 Task 3. Best results are marked bold. * indicates runs that use LLMs with undisclosed training data. · indicates runs that use LLMs with disclosed training data. # indicates runs without LLMs.
Submission_id F2 Precision Recall MAP R_5 R_10 R_30
JNLP.constr-join* 0.7408 0.6502 0.7982 0.8010 0.8769 0.9154 0.9462
CAPTAIN.bjpAllMonoT5· 0.7335 0.6713 0.7752 0.8149 0.8615 0.9308 0.9538
TQM-run1# 0.7171 0.7202 0.7339 0.7899 0.8308 0.9000 0.9615
CAPTAIN.bjpAllMonoP· 0.7171 0.6743 0.7477 0.7731 0.8538 0.9308 0.9538
CAPTAIN.bjpAll# 0.7135 0.6227 0.7844 0.8149 0.8615 0.9308 0.9538
JNLP.Mistral* 0.7123 0.6682 0.7477 0.7434 0.8308 0.9154 0.9538
NOWJ-25mulreftask-ensemble# 0.7081 0.6334 0.7661 0.7562 0.8231 0.8769 0.9077
AMHR02· 0.6876 0.5972 0.7569 0.7405 0.7846 0.8308 0.8462
AMHR03· 0.6825 0.6456 0.7202 0.7405 0.7846 0.8308 0.8462
AMHR01· 0.6749 0.5734 0.7569 0.7405 0.7846 0.8308 0.8462
NOWJ-25multask-ensemble# 0.6654 0.5934 0.7431 0.7180 0.7231 0.8077 0.8692
NOWJ-25mulref-ensemble# 0.6649 0.5916 0.7202 0.7315 0.8154 0.8462 0.8923
TQM-run2# 0.6621 0.5734 0.7110 0.7082 0.7769 0.8077 0.8077
JNLP.RankLLaMA* 0.6555 0.6606 0.6651 0.7400 0.8385 0.9154 0.9538
UA-mp_net# 0.6409 0.4908 0.7385 0.7127 0.8000 0.8538 0.9000
UA-anglE# 0.6399 0.4679 0.7477 0.6935 0.7538 0.8077 0.8769
TQM-run3# 0.6330 0.5963 0.6606 0.7492 0.8154 0.8692 0.9308
BM24-1* 0.4945 0.2590 0.7294 - - - -
MIG2# 0.1665 0.1604 0.1881 0.2125 0.2615 0.2923 0.3769
MIG1# 0.1637 0.1187 0.2064 0.2049 0.2385 0.2923 0.3846
MIG3# 0.1629 0.1631 0.1789 0.2049 0.2385 0.2923 0.3846
PSI01 0.0785 0.0826 0.0780 0.2312 0.3692 0.4769 0.6308

5.1.2 Results

Table 4 shows the effectiveness and optimal parameters of submission runs on the validation set. Table 5 shows the final official evaluation results. From the experimental results, we can draw the following conclusions:

  • On the validation set, the lexical matching model BM25_ngram achieved competitive results. Learning to rank effectively combines the perspectives of the lexical matching and semantic retrieval models and achieves the best results.

  • However, the official test results showed a different pattern: BM25_ngram had the worst results among our runs and QLD achieved the best performance. We speculate that this is due to a shift in the distribution of terms between the test sets of COLIEE2023 and COLIEE2024. Because the distribution of the BM25_ngram scores differs between the two datasets, learning to rank performs slightly worse than the single QLD model.

  • Overall, our approach took first place in the legal case retrieval task and shows sufficient robustness, which is crucial in legal scenarios where large-scale annotated data is lacking.
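As a reading aid for the result tables, the F-measure can be sketched as follows. The helper name `f_beta` is our own; the formula is the standard weighted harmonic mean of precision and recall, where `beta = 1` gives F1 and `beta = 2` gives the recall-weighted F2. The F1 values of our runs in Table 5 are consistent with this formula applied directly to the reported precision and recall. The same check does not reproduce the F2 values in Table 7 from the listed precision and recall, presumably because those scores are averaged per query, which is an assumption on our part.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of two TQM runs, taken from Table 5.
print(round(f_beta(0.5057, 0.3944), 4))  # run1 -> 0.4432
print(round(f_beta(0.5082, 0.3790), 4))  # run3 -> 0.4342
```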

5.2 Task 3. The Statute Law Retrieval Task

5.2.1 Submissions

In Task 3, we submitted 3 runs as follows:

  • TQM_run1: We fine-tuned monoT5_3B using the training data and performed post-processing.

  • TQM_run2: LightGBM was employed to integrate all features, with Precision@1 as the optimization objective.

  • TQM_run3: LightGBM was employed to integrate all features, with Precision@2 as the optimization objective.
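For readers unfamiliar with the metric behind the last two runs, Precision@k can be sketched as below. This only illustrates the metric itself; how it is plugged into LightGBM training is not described here, and the function name `precision_at_k` and the example identifiers are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked articles that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked[:k]
    hits = sum(1 for article in top_k if article in relevant)
    # Dividing by k (not len(top_k)) penalizes rankings shorter than k.
    return hits / k


# Hypothetical ranking for one query with two relevant articles.
ranking = ["Article 95", "Article 177", "Article 96"]
gold = {"Article 95", "Article 96"}
print(precision_at_k(ranking, gold, k=1), precision_at_k(ranking, gold, k=2))
```

Precision@1 rewards only the top-ranked article, whereas Precision@2 also credits a correct second article, which matches the observation that most queries have at most two relevant articles.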

5.2.2 Results

Table 6 shows the performance of various models on the validation set. Table 7 shows the official evaluation results. We derive the following observations from the experimental results.

  • Table 6 shows that the rankers perform better than the retrievers. The best single-model result was achieved by monoT5_3B.

  • However, the performance drops significantly after learning to rank on the test set. We think this is due to overfitting caused by too little training data. How to effectively integrate each feature deserves further research.

  • Overall, our submission had the best performance among all the runs without LLMs, and ranked third among all the submissions. The two runs ranked above it both use LLMs, which suggests that LLMs can be effective in enhancing the understanding of the law and thus improving performance.

6 Conclusion

This paper presents the TQM team's approaches to the legal case retrieval task in the COLIEE 2024 competition. We try to enhance the model's understanding of case relevance from multiple perspectives and achieve some progress. We obtained the best performance in Task 1 among all submissions and third place in Task 3. In the future, we will continue to explore infusing legal knowledge into the model to better understand case relevance.

References

  • [1] Althammer, S., Askari, A., Verberne, S., Hanbury, A.: Dossier@ coliee 2021: leveraging dense retrieval and summarization-based re-ranking for case law retrieval. arXiv preprint arXiv:2108.03937 (2021)
  • [2] Bench-Capon, T., Araszkiewicz, M., Ashley, K., Atkinson, K., Bex, F., Borges, F., Bourcier, D., Bourgine, P., Conrad, J.G., Francesconi, E., et al.: A history of ai and law in 50 papers: 25 years of the international conference on ai and law. Artificial Intelligence and Law 20(3), 215–319 (2012)
  • [3] Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., Androutsopoulos, I.: Legal-bert: The muppets straight out of law school. arXiv preprint arXiv:2010.02559 (2020)
  • [4] Chen, J., Li, H., Su, W., Ai, Q., Liu, Y.: Thuir at wsdm cup 2023 task 1: Unbiased learning to rank (2023)
  • [5] Chen, J., Liu, Y., Fang, Y., Mao, J., Fang, H., Yang, S., Xie, X., Zhang, M., Ma, S.: Axiomatically regularized pre-training for ad hoc search. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1524–1534 (2022)
  • [6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [7] Dong, Q., Liu, Y., Ai, Q., Li, H., Wang, S., Liu, Y., Yin, D., Ma, S.: I3 retriever: Incorporating implicit interaction in pre-trained language models for passage retrieval. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. pp. 441–451 (2023)
  • [8] Han, X., Tu, Y., Li, H., Ai, Q., Liu, Y.: Thuir_ss at the ntcir-17 session search (ss) task (2023)
  • [9] Huang, Z., Low, C., Teng, M., Zhang, H., Ho, D.E., Krass, M.S., Grabmair, M.: Context-aware legal citation recommendation using deep learning. In: Proceedings of the eighteenth international conference on artificial intelligence and law. pp. 79–88 (2021)
  • [10] Jiang, J.Y., Zhang, M., Li, C., Bendersky, M., Golbandi, N., Najork, M.: Semantic text matching for long-form documents. In: The world wide web conference. pp. 795–806 (2019)
  • [11] Jiang, Z., El-Jaroudi, A., Hartmann, W., Karakos, D., Zhao, L.: Cross-lingual information retrieval with bert. arXiv preprint arXiv:2004.13005 (2020)
  • [12] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020)
  • [13] Kim, M.Y., Rabelo, J., Babiker, H.K.B., Rahman, M.A., Goebel, R.: Legal information retrieval and entailment using transformer-based approaches. The Review of Socionetwork Strategies pp. 1–21 (2024)
  • [14] Li, H., Ai, Q., Chen, J., Dong, Q., Wu, Y., Liu, Y., Chen, C., Tian, Q.: Sailer: Structure-aware pre-trained language model for legal case retrieval (2023)
  • [15] Li, H., Ai, Q., Chen, J., Dong, Q., Wu, Z., Liu, Y., Chen, C., Tian, Q.: Blade: Enhancing black-box large language models with small domain-specific models. arXiv preprint arXiv:2403.18365 (2024)
  • [16] Li, H., Ai, Q., Han, X., Chen, J., Dong, Q., Liu, Y., Chen, C., Tian, Q.: Delta: Pre-train a discriminative encoder for legal case retrieval via structural word alignment. arXiv preprint arXiv:2403.18435 (2024)
  • [17] Li, H., Ai, Q., Zhan, J., Mao, J., Liu, Y., Liu, Z., Cao, Z.: Constructing tree-based index for efficient and effective dense retrieval (2023)
  • [18] Li, H., Chen, J., Su, W., Ai, Q., Liu, Y.: Towards better web search performance: Pre-training, fine-tuning and learning to rank. arXiv preprint arXiv:2303.04710 (2023)
  • [19] Li, H., Shao, Y., Wu, Y., Ai, Q., Ma, Y., Liu, Y.: Lecardv2: A large-scale chinese legal case retrieval dataset (2023)
  • [20] Li, H., Su, W., Wang, C., Wu, Y., Ai, Q., Liu, Y.: Thuir@coliee 2023: Incorporating structural knowledge into pre-trained language models for legal case retrieval (2023)
  • [21] Li, H., Wang, C., Su, W., Wu, Y., Ai, Q., Liu, Y.: Thuir@coliee 2023: More parameters and legal knowledge for legal case entailment (2023)
  • [22] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • [23] Ma, Y., Shao, Y., Wu, Y., Liu, Y., Zhang, R., Zhang, M., Ma, S.: Lecard: a legal case retrieval dataset for chinese law system. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2342–2348 (2021)
  • [24] Nogueira, R., Jiang, Z., Lin, J.: Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713 (2020)
  • [25] Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval 3(4), 333–389 (2009)
  • [26] Seo, M., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603 (2016)
  • [27] Shao, Y., Li, H., Wu, Y., Liu, Y., Ai, Q., Mao, J., Ma, Y., Ma, S.: An intent taxonomy of legal case retrieval. ACM Trans. Inf. Syst. 42(2) (dec 2023). https://doi.org/10.1145/3626093
  • [28] Shao, Y., Mao, J., Liu, Y., Ma, W., Satoh, K., Zhang, M., Ma, S.: Bert-pli: Modeling paragraph-level interactions for legal case retrieval. In: IJCAI. pp. 3501–3507 (2020)
  • [29] Shao, Y., Wu, Y., Liu, Y., Mao, J., Ma, S.: Understanding relevance judgments in legal case retrieval. ACM Transactions on Information Systems 41(3), 1–32 (2023)
  • [30] Tran, V., Nguyen, M.L., Satoh, K.: Building legal case retrieval systems with lexical matching and summarization using a pre-trained phrase scoring model. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. pp. 275–282 (2019)
  • [31] Tu, Y., Li, H., Chu, Z., Ai, Q., Liu, Y.: Thuir at the ntcir-17 fairweb-1 task: An initial exploration of the relationship between relevance and fairness. Proceedings of NTCIR-17. https://doi.org/10.20736/0002001317 (2023)
  • [32] Xiao, C., Hu, X., Liu, Z., Tu, C., Sun, M.: Lawformer: A pre-trained language model for chinese legal long documents. AI Open 2, 79–84 (2021)
  • [33] Xie, X., Dong, Q., Wang, B., Lv, F., Yao, T., Gan, W., Wu, Z., Li, X., Li, H., Liu, Y., et al.: T2ranking: A large-scale chinese benchmark for passage ranking. arXiv preprint arXiv:2304.03679 (2023)
  • [34] Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020)
  • [35] Yang, S., Li, H., Chu, Z., Zhan, J., Liu, Y., Zhang, M., Ma, S.: Thuir at the ntcir-16 www-4 task. Proceedings of NTCIR-16. to appear (2022)
  • [36] Yu, W., Sun, Z., Xu, J., Dong, Z., Chen, X., Xu, H., Wen, J.R.: Explainable legal case matching via inverse optimal transport-based rationale extraction. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 657–668 (2022)
  • [37] Zhai, C.: Statistical language models for information retrieval. Synthesis lectures on human language technologies 1(1), 1–141 (2008)
  • [38] Zhang, K., Chen, C., Wang, Y., Tian, Q., Bai, L.: Cfgl-lcr: A counterfactual graph learning framework for legal case retrieval. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 3332–3341 (2023)