Towards an In-Depth Comprehension of Case Relevance for Better Legal Retrieval
Abstract
Legal retrieval techniques play an important role in preserving the fairness and equality of the judicial system. COLIEE, a well-known annual international competition, aims to advance the development of state-of-the-art retrieval models for legal texts. This paper elaborates on the methodology employed by the TQM team in COLIEE2024. Specifically, we explored various lexical matching and semantic retrieval models, with a focus on enhancing the understanding of case relevance. Additionally, we integrated various features using the learning-to-rank technique. Furthermore, we propose fine-grained heuristic pre-processing and post-processing methods to mitigate irrelevant information. Consequently, our methodology achieved remarkable performance in COLIEE2024, securing first place in Task 1 and third place in Task 3. We anticipate that our proposed approach can contribute valuable insights to the advancement of legal retrieval technology.
Keywords: Legal case retrieval · Dense retrieval · Pre-training
1 Introduction
Efficient legal retrieval is essential in the judicial process. It supports lawyers in argumentation, guides judges in decision-making, and aids scholars in analyzing legal trends. With the evolution of the legal field into the digital age, the ability to efficiently navigate vast legal databases with advanced search techniques is critical to maintaining justice and ensuring judicial fairness [28, 2, 36, 1, 21, 20, 14].
The Competition on Legal Information Extraction/Entailment (COLIEE) has emerged as a significant platform for advancing the state-of-the-art in legal information processing and retrieval. The competition consists of several tasks focusing on two categories: legal retrieval and legal entailment.
This year, our team TQM primarily focused on the legal retrieval tasks, i.e., Task 1 and Task 3. Task 1 involves retrieving relevant documents to support a given query case within the case law system. Task 3 involves retrieving civil law articles relevant to Japanese legal bar exam questions under the statutory law system. Through a thorough comprehension of case relevance, the TQM team achieved commendable results in COLIEE2024.
In legal practice, case relevance is complex and differs from that of conventional web search [23, 27, 19]. In the context of legal retrieval, relevance transcends mere lexical matches or semantic similarities. The relevance of legal cases usually involves an in-depth analysis of the facts of the case, legal principles, and prior jurisprudence [19, 29, 15]. This requires the retrieval system to understand not only the words and concepts in the text, but also to gain insight into their interactions within a particular legal framework. Traditional methods often prove inadequate in capturing the nuanced aspects that determine case relevance, including the construction of legal arguments, key legal facts, and the particular nature of applicable laws.
Therefore, during COLIEE2024, our team, TQM, not only investigated the effectiveness of established methods in legal retrieval but also explored new strategies to improve the model’s understanding of case relevance. Specifically, within the traditional lexical matching approach, we employed BM25_ngram to underscore the significance of law-specific terms in determining relevance. Additionally, in the semantic similarity approach, we utilized the translation process between different structures of legal cases to deepen the understanding of key facts. Subsequently, we employed learning-to-rank techniques to integrate different features. In addition, we designed careful heuristic pre-processing and post-processing methods to mitigate the impact of irrelevant information. The official results show that our team attained first place in Task 1 and third place in Task 3, which demonstrates the effectiveness of our approach.
The paper is structured as follows: Section 2 offers an overview of foundational concepts in legal case retrieval and dense retrieval. Section 3 elaborates on the COLIEE2024 legal case retrieval task, encompassing its description, datasets, and evaluation metrics. Section 4 delves into the technical aspects of the study. Following this, Section 5 presents the results of our experiments. The paper concludes with Section 6, summarizing key findings and outlining directions for future research.
2 Related Work
2.1 Legal Retrieval
In the area of legal retrieval, the integration of deep learning techniques has become foundational, giving rise to a plethora of methodologies such as CNN-based models [30], BiDAF [26], and SMASH-RNN [10], among others. Transformer-based models have emerged as the preferred architecture in this domain, notably powering innovations like LEGAL-BERT [3] and Lawformer [32]. Besides, Jiang et al. [11] demonstrated improvements in cross-lingual retrieval by using Multilingual BERT to handle the linguistic diversity of legal documentation. Recent contributions have further enriched this field. Work on context-aware citation recommendation [9] and graph-based legal reasoning [38] has enhanced the relevance and semantic richness of case retrieval methods. Also, Li et al. proposed SAILER [14], which utilizes the structure of legal documents for pre-training and achieves the best results on some legal benchmarks. These developments highlight the potential of advances in AI and machine learning for legal information retrieval.
2.2 Dense Retrieval
A radical departure from traditional retrieval has emerged through dense retrieval, which leverages dual encoders to map queries and documents into dense embeddings and capture intricate contextual nuances [33, 7]. This method has been progressively improved through a series of innovative works: Zhan et al. [17] introduced dynamic negative sampling to refine the matching process, and Chen et al. [5] unveiled ARES, which incorporates retrieval axioms during pre-training and substantially improved performance. Similarly, Karpukhin et al. [12] introduced DPR (Dense Passage Retrieval), which surpassed traditional IR methods by a large margin in large-scale open-domain question-answering tasks, and Xiong et al. [34] introduced ANCE (Approximate Nearest Neighbor Negative Contrastive Learning), which dynamically updates the negative samples and further optimizes the retrieval process. These studies demonstrate the potential of dense retrieval to revolutionize IR technologies, providing more accurate results across various applications.
3 Task Overview
3.1 Task 1. The Case Law Retrieval Task
3.1.1 Task Description
The Competition on Legal Information Extraction/Entailment (COLIEE), an annual international contest, is committed to advancing state-of-the-art methodologies in legal text processing. In COLIEE2024, four tasks are presented, with our exclusive focus directed towards the legal retrieval task.
Task 1, referred to as the Case Law Retrieval task, involves the identification of supporting cases that substantiate the decisions of query cases within an extensive corpus. Formally, for a given query case denoted as $q$ and a set of candidate cases represented by $C$, the objective is to identify all supporting cases, designated as $C_q^{*} \subseteq C$, from the extensive candidate pool. Participants are allowed to submit any number of supporting cases for each individual query in this task. Hence, it is also crucial to identify the conditions fulfilled by the relevant cases.
The data corpus utilized for Task 1 comprises a collection of case law documents from the Federal Court of Canada, provided by Compass Law. Detailed statistics of this dataset are presented in Table 1. Through our analysis, we find that there is a significant difference in the average number of relevant documents per query between the COLIEE2023 training and test sets. Therefore, we account for a similar possible bias when designing post-processing for COLIEE2024. We employ the test set of COLIEE2023 as the validation set and apply the best parameters found on COLIEE2023 to COLIEE2024.
Statistic | COLIEE2021 Train | COLIEE2021 Test | COLIEE2022 Train | COLIEE2022 Test | COLIEE2023 Train | COLIEE2023 Test | COLIEE2024 Train | COLIEE2024 Test |
---|---|---|---|---|---|---|---|---|
# of queries | 650 | 250 | 898 | 300 | 959 | 319 | 1278 | 400 |
# of candidate cases per query | 4415 | 4415 | 3531 | 1263 | 4400 | 1335 | 5616 | 1734 |
avg # of relevant candidates/paragraphs | 5.17 | 3.60 | 4.68 | 4.21 | 4.68 | 2.69 | 4.16 | - |
3.1.2 Metrics
For COLIEE 2024 Task 1, the evaluation metrics will include precision, recall, and the F1-measure:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
where $TP$ represents the total number of accurately retrieved candidate cases across all queries, $FP$ denotes the number of incorrectly retrieved candidate cases for all queries, and $FN$ signifies the count of overlooked noticed candidate paragraphs in all queries. Notably, the evaluation process employed a micro-average approach, where the evaluation measure is computed based on the collective results of all queries. This differs from a macro-average approach, which calculates the evaluation measure for each query individually before averaging these values.
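The micro-averaged computation can be illustrated with a minimal sketch. The function name `micro_prf` and the dictionary-of-sets input format are assumptions made for this example, not part of the official evaluation script.

```python
def micro_prf(predicted, gold):
    """Micro-averaged precision, recall and F1.

    predicted, gold: dict mapping query id -> set of case ids
    (hypothetical input format chosen for this sketch).
    Queries that appear only in `predicted` are ignored.
    """
    tp = fp = fn = 0
    for qid, gold_cases in gold.items():
        retrieved = predicted.get(qid, set())
        tp += len(retrieved & gold_cases)   # correctly retrieved cases
        fp += len(retrieved - gold_cases)   # retrieved but not noticed
        fn += len(gold_cases - retrieved)   # noticed but not retrieved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = {"q1": {"a", "b"}, "q2": {"c"}}
pred = {"q1": {"a", "x"}, "q2": {"c"}}
print(micro_prf(pred, gold))
```

Because the counts are pooled over all queries before dividing, queries with many noticed cases weigh more heavily than they would under macro-averaging.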
3.2 Task 3. The Statute Law Retrieval Task
3.2.1 Task Description
This task focuses on retrieving civil law articles relevant to a given "Yes/No" question. For a legal bar exam question denoted as $Q$ and a set of Japanese Civil Code Articles represented as $A$, the objective is to compile a subset $A_Q \subseteq A$ that aids in answering $Q$. The questions for this task are sourced from Japanese Legal Bar Exams and are translated into English, along with the entire corpus of Japanese Civil Law articles.
The dataset of this task consists of 1097 pairs, a legal corpus (Civil Code) with 768 articles, and 109 test queries. Participants need to find the relevant articles for each test query. Examples from this dataset are shown in Figure 1. Since the candidate set contains only 768 legal articles, this task is closer to a ranking task than a retrieval task. We selected the 101 questions whose IDs begin with R04 to form a validation set. This subset was utilized to evaluate various models and settings.
3.2.2 Metrics
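For COLIEE 2024 Task 3, the evaluation criteria include macro-average precision, recall, and F2-measure, diverging from the micro-average measures used in Task 1.
$$\text{Precision} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad P_i = \frac{|R_i \cap G_i|}{|R_i|} \tag{4}$$
$$\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} R^{c}_i, \qquad R^{c}_i = \frac{|R_i \cap G_i|}{|G_i|} \tag{5}$$
$$\text{F2} = \frac{1}{N} \sum_{i=1}^{N} \frac{5 \times P_i \times R^{c}_i}{4 \times P_i + R^{c}_i} \tag{6}$$
where $N$ is the number of queries, $R_i$ is the set of articles retrieved for the $i$-th query, and $G_i$ is the set of articles relevant to it.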
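As a hedged illustration of macro-averaging, the sketch below computes precision, recall, and F2 per query and then averages them. It assumes that F2 is averaged over per-query F2 values rather than derived from the averaged precision and recall; the function name `macro_f2` and the input format are hypothetical.

```python
def macro_f2(predicted, gold):
    """Macro-averaged precision, recall and F2.

    predicted, gold: dict mapping query id -> set of article ids
    (hypothetical input format chosen for this sketch).
    """
    precisions, recalls, f2s = [], [], []
    for qid, relevant in gold.items():
        retrieved = predicted.get(qid, set())
        hits = len(retrieved & relevant)
        p = hits / len(retrieved) if retrieved else 0.0
        r = hits / len(relevant) if relevant else 0.0
        # F2 weights recall more heavily than precision (beta = 2).
        f2 = 5 * p * r / (4 * p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f2s.append(f2)
    n = len(gold)
    return sum(precisions) / n, sum(recalls) / n, sum(f2s) / n


gold = {"q1": {"a1"}, "q2": {"a2", "a3"}}
pred = {"q1": {"a1", "a9"}, "q2": {"a2"}}
print(macro_f2(pred, gold))
```

In this toy example the macro precision and recall are both 0.75, while the averaged per-query F2 is lower than the F2 one would compute from those two averages, which is why the order of averaging matters when comparing reported numbers.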
4 Method
In this section, we present our approach and motivation for the legal case retrieval task in COLIEE2024.
4.1 Task 1. The Case Law Retrieval Task
In this section, we present our solution in detail for Task 1 of COLIEE2024. Overall, we followed the framework of last year’s first place team THUIR [21]. We first pre-process the data to eliminate noisy information. After that, we implemented the classical lexical matching method and the state-of-the-art semantic retrieval model. The difference is that we improve both approaches from the perspective of case relevance. Following this, we use learning to rank to fuse features from different perspectives for better modeling of case relevance. Finally, we propose heuristic post-processing strategies by observing common properties of relevant cases.
4.1.1 Pre-processing
Following Li et al. [21], we perform careful data pre-processing before training. To be specific, our initial step involved the removal of text before the “[1]” character in each case document, which typically includes procedural details such as time and court. Subsequently, we eliminated all placeholders, notably “FRAGMENT_SUPPRESSED”, to avoid interference in similarity computations. Additionally, in cases where legal documents contained French text, we utilized the Langdetect tool to identify and remove French passages. Documents predominantly in French were translated into English to retain their essential information. In the process of summary extraction, we selectively extracted sections under “summary” subheadings, which generally encapsulate key case elements, and integrated these at the beginning of the processed text. Through this pre-processing, we aimed to remove as much noisy information from the case documents as possible, since such information does not contribute to the relevance judgment.
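The first three cleaning steps can be sketched as follows. This is a minimal illustration, not the team's actual pipeline: the function name `clean_case`, the line-based paragraph splitting, the optional angle brackets around the placeholder, and the pluggable `is_french` callback are all assumptions. Summary extraction and translation are not covered.

```python
import re

# Placeholder token named in the text; surrounding angle brackets are an assumption.
PLACEHOLDER = re.compile(r"<?FRAGMENT_SUPPRESSED>?")


def clean_case(text, is_french=lambda paragraph: False):
    """Apply header removal, placeholder removal and French filtering.

    is_french: callable deciding whether a paragraph is French
    (hypothetical hook; a language detector can be plugged in here).
    """
    # 1. Drop procedural header text before the first "[1]" marker.
    start = text.find("[1]")
    if start != -1:
        text = text[start:]
    # 2. Remove suppressed-fragment placeholders.
    text = PLACEHOLDER.sub(" ", text)
    # 3. Split into paragraphs (one per line here) and drop French ones.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    kept = [p for p in paragraphs if not is_french(p)]
    # Collapse the whitespace left behind by removed placeholders.
    return "\n".join(re.sub(r"\s+", " ", p) for p in kept)


raw = (
    "Date: 2001\nCourt: FC\n"
    "[1] The applicant cites FRAGMENT_SUPPRESSED in support.\n"
    "[2] Le demandeur conteste la décision.\n"
    "[3] The appeal is dismissed."
)
print(clean_case(raw, is_french=lambda p: "Le demandeur" in p))
```

In practice the `is_french` hook could wrap a language detector such as the `detect` function of the langdetect package; the toy lambda above only stands in for it so the sketch has no external dependency.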
4.1.2 Lexical Matching Models
In previous competitions, many participants have discovered that traditional lexical matching models can produce competitive results. This phenomenon can be attributed to two primary factors. Firstly, bag-of-words models do not impose limitations on the text length, rendering them well-suited for handling legal case documents with lengthy texts. Secondly, the legal domain encompasses numerous specialized terms, where relevance is often discernible through word matching. Therefore, in this section, we experimented with the following methods:
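• BM25 [25], a probabilistic relevance model grounded in the bag-of-words concept, calculates relevance between a query $q$ and a document $d$. The formulation of BM25 is presented as follows:
$$\text{BM25}(d, q) = \sum_{i=1}^{M} \frac{\text{IDF}(t_i) \cdot \text{TF}(t_i, d) \cdot (k_1 + 1)}{\text{TF}(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{\text{len}(d)}{\text{avgdl}}\right)} \tag{7}$$
where $t_1, \dots, t_M$ are the query terms and $k_1$, $b$ are free hyperparameters. TF denotes term frequency and IDF signifies inverse document frequency. The term avgdl represents the average document length across the dataset.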
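• QLD [37], an efficient probabilistic statistical model, assesses relevance scores by evaluating the likelihood of query generation. The computation of the QLD score is outlined as follows:
$$\log p(q \mid d) = \sum_{i:\, c(q_i; d) > 0} \log \frac{p_s(q_i \mid d)}{\alpha_d \, p(q_i \mid \mathcal{C})} + n \log \alpha_d + \sum_{i} \log p(q_i \mid \mathcal{C}) \tag{8}$$
For more information, please refer to Zhai et al.’s work [37].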
• BM25_ngram is a modified version of BM25 designed to better determine relevance through lexical matching. Legal case documents contain many uncommon specialized terms that carry unique meanings in specific contexts, so specific combinations of terms can offer additional signals for relevance identification. Therefore, we implemented BM25_ngram by adapting the ngram_range parameter of the TfidfVectorizer. The ngram_range parameter specifies the lower and upper boundaries for the range of n-values corresponding to the different n-grams to be extracted.
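To make the idea concrete, the sketch below scores documents with BM25 over word n-grams instead of single words. It is a dependency-free stand-in rather than the TfidfVectorizer-based implementation described above: whitespace tokenization, the smoothed IDF variant, and all function names are assumptions, and the defaults k1 = 3.0 and b = 1.0 are borrowed from the BM25 feature setting in Table 2.

```python
import math
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """All word n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def bm25_ngram_scores(query, docs, ngram_range=(1, 2), k1=3.0, b=1.0):
    """Score each document against the query using BM25 over n-gram terms.

    Assumes a non-empty list of non-empty documents.
    """
    lo, hi = ngram_range
    doc_terms = [Counter(ngrams(d.lower().split(), lo, hi)) for d in docs]
    doc_lens = [sum(c.values()) for c in doc_terms]
    avgdl = sum(doc_lens) / len(docs)
    df = Counter()
    for c in doc_terms:
        df.update(c.keys())  # document frequency of every n-gram
    q_terms = set(ngrams(query.lower().split(), lo, hi))
    scores = []
    for c, dl in zip(doc_terms, doc_lens):
        s = 0.0
        for t in q_terms:
            tf = c.get(t, 0)
            if tf == 0:
                continue
            # Smoothed, non-negative IDF (an assumption of this sketch).
            idf = math.log(1 + (len(docs) - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores


docs = [
    "the duty of procedural fairness applies",
    "fairness of the procedural duty",
    "immigration appeal dismissed",
]
print(bm25_ngram_scores("procedural fairness", docs, ngram_range=(1, 1)))
print(bm25_ngram_scores("procedural fairness", docs, ngram_range=(1, 2)))
```

With unigrams only, the shorter second document scores higher because both contain the two query words. Once bigrams are included, the first document is rewarded for containing the exact phrase "procedural fairness", which is the kind of term-combination signal the n-gram variant is meant to capture.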
4.1.3 Semantic Retrieval Models
Semantic retrieval models can effectively avoid the problem of lexical mismatch and have been widely used in legal retrieval. However, pre-trained language models often perform unsatisfactorily due to their limited input length and the difficulty of effectively understanding legal structures. Recently, a series of works has achieved state-of-the-art results by designing specific pre-training objectives for legal case retrieval. In this section, we implement SAILER and optimize it for better identification of legal case relevance.
• SAILER [14] is a structure-aware pre-trained model. It fully utilizes the structure of legal documents to construct information bottlenecks and achieves state-of-the-art results on legal case retrieval tasks. We continued to fine-tune SAILER with the training sets of COLIEE2023 and COLIEE2022.
• DELTA [16] is an improved version of SAILER, which enhances the understanding of key facts in legal cases and improves discriminative ability. To be specific, DELTA introduces a deep decoder that translates the Fact section into the Reasoning section. Afterwards, a word alignment mechanism is employed to determine key facts. Following this, the representation of the case in the vector space is pulled closer to the key facts and pushed away from the non-key facts. The framework of DELTA is shown in Figure 2.
Feature ID | Feature Name | Description |
---|---|---|
1 | query_length | Length of the query |
2 | candidate_length | Length of the candidate paragraph |
3 | query_ref_num | Number of placeholders in the query case |
4 | doc_ref_num | Number of placeholders in the candidate case |
5 | BM25 | Query-candidate scores with BM25 (k_1 = 3.0 , b = 1.0) |
6 | BM25_rank | Rank of documents in the search list of the query by BM25 score |
7 | QLD | Query-candidate scores with QLD |
8 | QLD_rank | Rank of documents in the search list of the query by QLD score |
9 | BM25_ngram | Query-candidate scores with BM25_ngram |
10 | BM25_ngram_rank | Rank of documents in the search list of the query by BM25_ngram score |
11 | SAILER | Inner product of query and candidate vectors generated by SAILER |
12 | SAILER_rank | Rank of documents in the search list of the query by SAILER score |
13 | DELTA | Inner product of query and candidate vectors generated by DELTA |
14 | DELTA_rank | Rank of documents in the search list of the query by DELTA score |
4.1.4 Learning to Rank
Following previous work [35, 18, 4, 31, 8], we utilize LightGBM to integrate all feature scores. Table 2 shows the details of all the features. A total of 14 features were used to compute the final score. For optimizing ranking, we employ the Normalized Discounted Cumulative Gain (NDCG) as our objective. The model demonstrating the highest performance on the validation set is selected for subsequent testing.
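The score and rank features in Table 2 can be assembled per query along the following lines. This is a sketch under assumptions: the helper names and the nested-dictionary input are hypothetical, and only the score/rank feature pairs are built (the length and placeholder-count features are omitted).

```python
def rank_positions(scores):
    """Map candidate id -> 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return {cid: pos for pos, cid in enumerate(ordered, start=1)}


def build_features(model_scores):
    """Build score and rank features for the candidates of ONE query.

    model_scores: {model_name: {candidate_id: score}}, with every model
    scoring the same candidates (hypothetical input format).
    Returns (feature_names, {candidate_id: [feature values]}).
    """
    names, ranks = [], {}
    for model, scores in model_scores.items():
        names += [model, f"{model}_rank"]
        ranks[model] = rank_positions(scores)
    candidates = sorted(next(iter(model_scores.values())))
    rows = {
        cid: [v for m in model_scores for v in (model_scores[m][cid], ranks[m][cid])]
        for cid in candidates
    }
    return names, rows


names, rows = build_features({
    "BM25": {"c1": 2.0, "c2": 5.0},
    "SAILER": {"c1": 0.9, "c2": 0.1},
})
print(names)
print(rows)
```

Rows built this way for all queries could then be stacked into a feature matrix and passed, together with relevance labels and per-query group sizes, to a learning-to-rank model such as LightGBM's `LGBMRanker`.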
4.1.5 Post-processing
Finally, we post-processed the ranking scores from the relevance perspective to remove irrelevant documents. Apart from Filtering by trial date, Filtering query cases, and Dynamic cut-off, which were proposed in the previous work of Li et al. [21], we add Filtering duplicate cases as a post-processing strategy. The specific details are as follows:
• Filtering by trial date. Considering that a query case typically cites cases preceding its trial date, it is logical to filter the candidate set based on this criterion. By extracting all dates mentioned within each case, we determine the latest date as the trial date, thereby minimizing erroneous exclusions. In instances where dates cannot be extracted from query cases, we retain all cases in the candidate set.
• Filtering query cases. We find that query cases rarely appear as noticed cases for other queries. Therefore, we remove all query cases from the search results.
• Filtering duplicate cases. We find that noticed cases are not repeated across the query cases of COLIEE2021, COLIEE2022, and COLIEE2023, respectively, indicating that deleting duplicate cases might be effective. Kim et al. [13] also removed repeated cases in the previous retrieval task, utilizing the maximum number of duplicate cases as a hyperparameter. Noticing that removing duplicate cases may delete all the candidate cases for some query cases, we define $t$ as the maximum number of duplicate cases and then supplement $s$ cases with higher scores for those query cases left without candidate cases. Grid search on the validation set is utilized to find the optimal $t$ and $s$.
• Dynamic cut-off. To accommodate the variability in the number of supporting cases associated with different query cases, we implement a dynamic cut-off mechanism for each query case. This involves defining three hyperparameters: $h$, $l$, and $p$. Here, $h$ represents the maximum and $l$ the minimum number of supporting cases to be retrieved per query case. Additionally, if the highest score achieved by supporting cases for a specific query case is denoted as $S$, then only those supporting cases scoring above $p \times S$ are selected. A grid search technique is employed to ascertain the optimal values of the hyperparameters $h$, $l$, and $p$.
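Two of the strategies above, filtering query cases and the dynamic cut-off, can be sketched as follows. The interpretation that a case is kept when its score exceeds a ratio times the top score, bounded by a minimum and a maximum count, is an assumption of this sketch, as are the function names and the requirement that scores are positive and sorted in descending order.

```python
def dynamic_cutoff(ranked, p, low, high):
    """Select supporting cases from a ranked list for one query.

    ranked: list of (case_id, score), sorted by descending positive score.
    p: ratio applied to the top score; low/high: min/max number of cases.
    """
    if not ranked:
        return []
    top_score = ranked[0][1]
    kept = [cid for cid, score in ranked if score > p * top_score]
    if len(kept) < low:
        # Too few cases passed the threshold: fall back to the top `low`.
        kept = [cid for cid, _ in ranked[:low]]
    return kept[:high]


def postprocess(ranked, query_ids, p, low, high):
    """Drop candidates that are themselves query cases, then cut off."""
    filtered = [(cid, s) for cid, s in ranked if cid not in query_ids]
    return dynamic_cutoff(filtered, p, low, high)


ranked = [("a", 10.0), ("q7", 9.0), ("b", 8.0), ("c", 4.0), ("d", 1.0)]
# p, low and high below mirror one row of Table 4 (p = 0.46, l = 1, h = 7).
print(postprocess(ranked, {"q7"}, p=0.46, low=1, high=7))
```

A grid search over `p`, `low`, and `high` on a validation set would then wrap this function and keep the combination with the best F1.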
4.2 Task 3. The Statute Law Retrieval Task
In this section, we follow the framework of Task 1 to implement Task 3. Specifically, we design heuristic pre-processing and post-processing strategies and implement advanced retrievers and rankers. Finally, we use learning to rank to integrate all scores.
4.2.1 Pre-processing
In Task 3, we primarily pre-process the retrieval pool, i.e., the legal articles. Specifically, we started by removing the lead-in information from the Civil Code, for example “Part I General Provisions” and “Chapter I Common Provisions”. We consider that this information does not contribute to the relevance judgment. Subsequently, we deleted all explanatory descriptions in brackets, such as “(Standards for Construction)”. We consider that these are too general and do not help differentiate legal articles. Finally, we obtain a mapping from article IDs to their specific content to form the retrieval set.
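A minimal sketch of this pre-processing is given below. It assumes a plain-text Civil Code in which structural headings, bracketed captions, and articles each start on their own line, and that articles begin with "Article N"; the regular expressions, the function name `parse_civil_code`, and the sample text are illustrative assumptions rather than the exact format of the competition data.

```python
import re

# Structural lead-in lines such as "Part I General Provisions" (assumed format).
HEADING = re.compile(r"^(Part|Chapter|Section|Subsection|Division)\s+[IVXLC\d]+\b")
# A line consisting solely of one bracketed caption, e.g. "(Standards for Construction)".
CAPTION = re.compile(r"^\([^()]*\)$")
# Start of an article, e.g. "Article 1 ..." or "Article 398-2 ..." (assumed format).
ARTICLE = re.compile(r"^Article\s+(\d+(?:-\d+)*)\s*(.*)$")


def parse_civil_code(text):
    """Return a mapping from article id to article content."""
    articles, current = {}, None
    for line in (raw.strip() for raw in text.splitlines()):
        if not line or HEADING.match(line) or CAPTION.match(line):
            continue  # skip blank lines, lead-in headings and captions
        match = ARTICLE.match(line)
        if match:
            current = match.group(1)
            articles[current] = match.group(2)
        elif current is not None:
            # Continuation paragraph of the current article.
            articles[current] = (articles[current] + " " + line).strip()
    return articles


sample = """Part I General Provisions
Chapter I Common Provisions
(Fundamental Principles)
Article 1 (1) Private rights must be congruent with the public welfare.
(2) The exercise of rights must be done in good faith.
(Standards for Construction)
Article 2 This Code must be construed in accordance with the dignity of individuals."""
print(parse_civil_code(sample))
```

The caption pattern only matches lines that are entirely one bracketed phrase, so numbered paragraphs such as "(2) ..." are kept as article content.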
4.2.2 Retriever
We implemented the following retrievers to get the most relevant legal articles from the full set: BM25 and QLD, both described in Section 4.1.2.
4.2.3 Reranker
After obtaining the retrieved legal articles, we use rerankers to further rank them. The models are as follows:
• BERT [6] is the classic pre-trained language model, which employs a multi-layer bidirectional Transformer encoder architecture. BERT leverages both Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) as its pre-training tasks.
• RoBERTa [22] represents an advancement over BERT, utilizing a more extensive dataset for pre-training. Unlike BERT, RoBERTa is exclusively pre-trained using the Masked Language Modeling (MLM) task.
• LEGAL-BERT [3] has been pre-trained on an extensive English legal corpus and has demonstrated state-of-the-art performance across a variety of legal tasks.
• monoT5 [24] adopts an encoder-decoder architecture. It operates by generating a “true” or “false” token, reflecting the relevance between queries and candidates. The model then takes the probability of generating “true” as the final relevance score.
For BERT, RoBERTa, and LEGALBERT, we train them with the cross-encoder architecture. Specifically, the query and the legal article are concatenated and fed into the encoder, and the vector of the [CLS] token is passed through an MLP layer to get the final score. The loss function for training is as follows:
$$\mathcal{L}(q, d^{+}, d^{-}_{1}, \dots, d^{-}_{n}) = -\log \frac{\exp(s(q, d^{+}))}{\exp(s(q, d^{+})) + \sum_{j=1}^{n} \exp(s(q, d^{-}_{j}))} \tag{9}$$
where $d^{+}$ and $d^{-}$ are relevant and negative articles, and $s(q, d)$ is the score produced by the cross-encoder. We employ irrelevant articles from the articles retrieved by BM25 as negative examples. For monoT5, we trained three versions: monoT5_base, monoT5_large, and monoT5_3B.
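One common way to train a cross-encoder with one relevant article and several negatives is a softmax cross-entropy over their relevance scores. The sketch below assumes that form; the function name `contrastive_loss` is hypothetical, and the scores stand in for the outputs of the MLP layer.

```python
import math


def contrastive_loss(pos_score, neg_scores):
    """Softmax cross-entropy with the relevant article as the target.

    pos_score: cross-encoder score of the relevant article.
    neg_scores: scores of negative articles (e.g. sampled from BM25 results).
    """
    scores = [pos_score] + list(neg_scores)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


print(contrastive_loss(2.0, [0.5, -1.0]))   # relevant article scored highest
print(contrastive_loss(-1.0, [0.5, 2.0]))   # relevant article scored lowest
```

The loss approaches zero as the relevant article's score grows relative to the negatives, and it grows when a negative article outscores the relevant one, which pushes the model to rank relevant articles first.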
4.2.4 Learning to Rank
Similar to Task 1, we integrate all the features using LightGBM. The features utilized in Task 3 are displayed in Table 3. A total of 9 features were employed to compute the final score. The optimization objective differs between runs (Section 5.2.1), and we select the best model based on performance on the validation set for testing purposes.
Feature ID | Feature Name | Description |
---|---|---|
1 | query_length | Length of the query |
2 | article_length | Length of the candidate article |
3 | BM25 | Query-article scores with BM25 |
4 | QLD | Query-article scores with QLD |
5 | BERT | Query-article scores with BERT |
6 | RoBERTa | Query-article scores with RoBERTa |
7 | LEGALBERT | Query-article scores with LEGAL-BERT-base |
8 | monoT5_large | Query-article scores with monoT5_large |
9 | monoT5_3B | Query-article scores with monoT5_3B |
4.2.5 Post-processing
Finally, we performed heuristic post-processing on the ranking scores. Upon analysis, it was observed that the majority of queries are associated with no more than two relevant legal articles. Therefore, we denote the maximum score for one query as $S$. Only articles whose score exceeds a threshold determined by $S$ and a hyperparameter are considered relevant. The hyperparameter is finely tuned to maintain consistency in the proportion of queries with two relevant laws across both the training and validation sets.
Model | F1 score | Precision | Recall | p | h | l | t | s |
---|---|---|---|---|---|---|---|---|
TQM_run1 | 0.3824 | 0.3708 | 0.3046 | 0.7 | 5 | 4 | 1 | 2 |
TQM_run2 | 0.4294 | 0.4064 | 0.4552 | 0.3 | 7 | 4 | 1 | 2 |
TQM_run3 | 0.4592 | 0.4530 | 0.4656 | 0.46 | 7 | 1 | 1 | 2 |
Team | Submission | F1 | Precision | Recall |
---|---|---|---|---|
TQM | task1_test_answer_2024_run1 | 0.4432 | 0.5057 | 0.3944 |
TQM | task1_test_answer_2024_run3 | 0.4342 | 0.5082 | 0.3790 |
UMNLP | task1_umnlp_run1 | 0.4134 | 0.4000 | 0.4277 |
UMNLP | task1_umnlp_run2 | 0.4097 | 0.3755 | 0.4507 |
UMNLP | task1_umnlp_runs_combined | 0.4046 | 0.3597 | 0.4622 |
YR | task1_yr_run1 | 0.3605 | 0.3210 | 0.4110 |
TQM | task1_test_answer_2024_run2 | 0.3548 | 0.4196 | 0.3073 |
YR | task1_yr_run2 | 0.3483 | 0.3245 | 0.3758 |
YR | task1_yr_run3 | 0.3417 | 0.3184 | 0.3688 |
JNLP | 64b7b-07f39 | 0.3246 | 0.3110 | 0.3393 |
JNLP | 07f39 | 0.3222 | 0.3347 | 0.3105 |
JNLP | 64b7b-48fe5 | 0.3103 | 0.3017 | 0.3195 |
WJY | submit_1 | 0.3032 | 0.2700 | 0.3457 |
BM24 | task1_test_result | 0.1878 | 0.1495 | 0.2522 |
CAPTAIN | captain_mstr | 0.1688 | 0.1793 | 0.1594 |
CAPTAIN | captain_ft5 | 0.1574 | 0.1586 | 0.1562 |
NOWJ | nowjtask1run2 | 0.1313 | 0.0895 | 0.2465 |
NOWJ | nowjtask1run3 | 0.1306 | 0.0957 | 0.2055 |
NOWJ | nowjtask1run1 | 0.1224 | 0.0813 | 0.2478 |
WJY | submit_3 | 0.1179 | 0.0870 | 0.1831 |
WJY | submit_2 | 0.1174 | 0.0824 | 0.2042 |
MIG | test1_ans | 0.0508 | 0.0516 | 0.0499 |
UBCS | run3 | 0.0276 | 0.0140 | 0.7196 |
UBCS | run2 | 0.0275 | 0.0140 | 0.7177 |
UBCS | run1 | 0.0272 | 0.0139 | 0.7100 |
CAPTAIN | captain_bm25 | 0.0019 | 0.0019 | 0.0019 |
5 Experiment Results
In this section, we present the results of our experiments and the corresponding analysis.
5.1 Task 1. The Case Law Retrieval Task
5.1.1 Submissions
For COLIEE2024 Task 1, we submitted 3 runs with the following details:
• task1_test_answer_2024_run1: We implemented the lexical matching model QLD, searched for the best post-processing parameters on the validation set based on the QLD scores, and applied them to the test set.
• task1_test_answer_2024_run2: We implemented the improved lexical matching model BM25_ngram, searched for the best post-processing parameters on the validation set based on the BM25_ngram scores, and applied them to the test set.
• task1_test_answer_2024_run3: LightGBM integrates all the features to produce the final score, after which the best post-processing parameters are obtained based on this score and applied to the test set.
Model | F2 | Precision | Recall |
---|---|---|---|
BM25 | 0.5267 | 0.6039 | 0.5181 |
QLD | 0.3888 | 0.4257 | 0.3844 |
BERT | 0.6698 | 0.7524 | 0.6600 |
RoBERTa | 0.6637 | 0.7524 | 0.6534 |
LEGALBERT | 0.6929 | 0.7920 | 0.6815 |
monoT5_base | 0.6951 | 0.7821 | 0.6848 |
monoT5_large | 0.7072 | 0.8019 | 0.6963 |
monoT5_3B | 0.7171 | 0.8118 | 0.7062 |
Submission_id | F2 | Precision | Recall | MAP | R_5 | R_10 | R_30 |
---|---|---|---|---|---|---|---|
JNLP.constr-join* | 0.7408 | 0.6502 | 0.7982 | 0.8010 | 0.8769 | 0.9154 | 0.9462 |
CAPTAIN.bjpAllMonoT5· | 0.7335 | 0.6713 | 0.7752 | 0.8149 | 0.8615 | 0.9308 | 0.9538 |
TQM-run1# | 0.7171 | 0.7202 | 0.7339 | 0.7899 | 0.8308 | 0.9000 | 0.9615 |
CAPTAIN.bjpAllMonoP· | 0.7171 | 0.6743 | 0.7477 | 0.7731 | 0.8538 | 0.9308 | 0.9538 |
CAPTAIN.bjpAll# | 0.7135 | 0.6227 | 0.7844 | 0.8149 | 0.8615 | 0.9308 | 0.9538 |
JNLP.Mistral* | 0.7123 | 0.6682 | 0.7477 | 0.7434 | 0.8308 | 0.9154 | 0.9538 |
NOWJ-25mulreftask-ensemble# | 0.7081 | 0.6334 | 0.7661 | 0.7562 | 0.8231 | 0.8769 | 0.9077 |
AMHR02· | 0.6876 | 0.5972 | 0.7569 | 0.7405 | 0.7846 | 0.8308 | 0.8462 |
AMHR03· | 0.6825 | 0.6456 | 0.7202 | 0.7405 | 0.7846 | 0.8308 | 0.8462 |
AMHR01· | 0.6749 | 0.5734 | 0.7569 | 0.7405 | 0.7846 | 0.8308 | 0.8462 |
NOWJ-25multask-ensemble# | 0.6654 | 0.5934 | 0.7431 | 0.7180 | 0.7231 | 0.8077 | 0.8692 |
NOWJ-25mulref-ensemble# | 0.6649 | 0.5916 | 0.7202 | 0.7315 | 0.8154 | 0.8462 | 0.8923 |
TQM-run2# | 0.6621 | 0.5734 | 0.7110 | 0.7082 | 0.7769 | 0.8077 | 0.8077 |
JNLP.RankLLaMA* | 0.6555 | 0.6606 | 0.6651 | 0.7400 | 0.8385 | 0.9154 | 0.9538 |
UA-mp_net# | 0.6409 | 0.4908 | 0.7385 | 0.7127 | 0.8000 | 0.8538 | 0.9000 |
UA-anglE# | 0.6399 | 0.4679 | 0.7477 | 0.6935 | 0.7538 | 0.8077 | 0.8769 |
TQM-run3# | 0.6330 | 0.5963 | 0.6606 | 0.7492 | 0.8154 | 0.8692 | 0.9308 |
BM24-1* | 0.4945 | 0.2590 | 0.7294 | - | - | - | - |
MIG2# | 0.1665 | 0.1604 | 0.1881 | 0.2125 | 0.2615 | 0.2923 | 0.3769 |
MIG1# | 0.1637 | 0.1187 | 0.2064 | 0.2049 | 0.2385 | 0.2923 | 0.3846 |
MIG3# | 0.1629 | 0.1631 | 0.1789 | 0.2049 | 0.2385 | 0.2923 | 0.3846 |
PSI01 | 0.0785 | 0.0826 | 0.0780 | 0.2312 | 0.3692 | 0.4769 | 0.6308 |
5.1.2 Results
Table 4 shows the effectiveness and optimal parameters of submission runs on the validation set. Table 5 shows the final official evaluation results. From the experimental results, we can draw the following conclusions:
• On the validation set, the lexical matching model BM25_ngram achieved competitive results. Learning to rank effectively combines the perspectives of the lexical matching models and the semantic retrieval models to achieve the best results.
• However, the official test results showed different behavior. BM25_ngram had the worst results and QLD achieved the best performance. We speculate that this is due to a shift in the distribution of terms between the test sets of COLIEE2023 and COLIEE2024. Since the distribution of the BM25_ngram scores differs between the two datasets, learning to rank performs slightly worse than the best single model.
• Overall, our approach achieves first place in the legal case retrieval task and shows sufficient robustness, which is crucial in legal scenarios where large-scale annotated data is lacking.
5.2 Task 3. The Statute Law Retrieval Task
5.2.1 Submissions
In Task 3, we submitted 3 runs as follows:
• TQM_run1: We fine-tuned monoT5_3B using the training data and performed post-processing.
• TQM_run2: LightGBM was employed to integrate all features, with Precision@1 as the optimization objective.
• TQM_run3: LightGBM was employed to integrate all features, with Precision@2 as the optimization objective.
5.2.2 Results
Table 6 shows the performance of various models on the validation set. Table 7 shows the official evaluation results. We derive the following observations from the experimental results.
• From Table 6, it can be observed that the rerankers perform better than the retrievers. The best single-model result was achieved by monoT5.
• However, the performance drops significantly after learning to rank on the test set. We think this is due to overfitting caused by too little training data. How to effectively integrate each feature deserves further research.
• Overall, our submission had the best performance among all the runs without LLMs, and ranked third among all the submissions. This suggests that LLMs can be effective in enhancing the understanding of the law and thus improving performance.
6 Conclusion
This paper presents TQM Team’s approaches to the legal case retrieval task in the COLIEE 2024 competition. We try to enhance the understanding of the model for case relevance from multiple perspectives and achieve some progress. We obtained the best performance in Task 1 among all submissions, and the third place in Task 3. In the future we will continue to explore infusing legal knowledge into the model to better understand case relevance.
References
- [1] Althammer, S., Askari, A., Verberne, S., Hanbury, A.: Dossier@ coliee 2021: leveraging dense retrieval and summarization-based re-ranking for case law retrieval. arXiv preprint arXiv:2108.03937 (2021)
- [2] Bench-Capon, T., Araszkiewicz, M., Ashley, K., Atkinson, K., Bex, F., Borges, F., Bourcier, D., Bourgine, P., Conrad, J.G., Francesconi, E., et al.: A history of ai and law in 50 papers: 25 years of the international conference on ai and law. Artificial Intelligence and Law 20(3), 215–319 (2012)
- [3] Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., Androutsopoulos, I.: Legal-bert: The muppets straight out of law school. arXiv preprint arXiv:2010.02559 (2020)
- [4] Chen, J., Li, H., Su, W., Ai, Q., Liu, Y.: Thuir at wsdm cup 2023 task 1: Unbiased learning to rank (2023)
- [5] Chen, J., Liu, Y., Fang, Y., Mao, J., Fang, H., Yang, S., Xie, X., Zhang, M., Ma, S.: Axiomatically regularized pre-training for ad hoc search. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1524–1534 (2022)
- [6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- [7] Dong, Q., Liu, Y., Ai, Q., Li, H., Wang, S., Liu, Y., Yin, D., Ma, S.: I3 retriever: Incorporating implicit interaction in pre-trained language models for passage retrieval. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. pp. 441–451 (2023)
- [8] Han, X., Tu, Y., Li, H., Ai, Q., Liu, Y.: Thuir_ss at the ntcir-17 session search (ss) task (2023)
- [9] Huang, Z., Low, C., Teng, M., Zhang, H., Ho, D.E., Krass, M.S., Grabmair, M.: Context-aware legal citation recommendation using deep learning. In: Proceedings of the eighteenth international conference on artificial intelligence and law. pp. 79–88 (2021)
- [10] Jiang, J.Y., Zhang, M., Li, C., Bendersky, M., Golbandi, N., Najork, M.: Semantic text matching for long-form documents. In: The world wide web conference. pp. 795–806 (2019)
- [11] Jiang, Z., El-Jaroudi, A., Hartmann, W., Karakos, D., Zhao, L.: Cross-lingual information retrieval with bert. arXiv preprint arXiv:2004.13005 (2020)
- [12] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020)
- [13] Kim, M.Y., Rabelo, J., Babiker, H.K.B., Rahman, M.A., Goebel, R.: Legal information retrieval and entailment using transformer-based approaches. The Review of Socionetwork Strategies pp. 1–21 (2024)
- [14] Li, H., Ai, Q., Chen, J., Dong, Q., Wu, Y., Liu, Y., Chen, C., Tian, Q.: Sailer: Structure-aware pre-trained language model for legal case retrieval (2023)
- [15] Li, H., Ai, Q., Chen, J., Dong, Q., Wu, Z., Liu, Y., Chen, C., Tian, Q.: Blade: Enhancing black-box large language models with small domain-specific models. arXiv preprint arXiv:2403.18365 (2024)
- [16] Li, H., Ai, Q., Han, X., Chen, J., Dong, Q., Liu, Y., Chen, C., Tian, Q.: Delta: Pre-train a discriminative encoder for legal case retrieval via structural word alignment. arXiv preprint arXiv:2403.18435 (2024)
- [17] Li, H., Ai, Q., Zhan, J., Mao, J., Liu, Y., Liu, Z., Cao, Z.: Constructing tree-based index for efficient and effective dense retrieval (2023)
- [18] Li, H., Chen, J., Su, W., Ai, Q., Liu, Y.: Towards better web search performance: Pre-training, fine-tuning and learning to rank. arXiv preprint arXiv:2303.04710 (2023)
- [19] Li, H., Shao, Y., Wu, Y., Ai, Q., Ma, Y., Liu, Y.: Lecardv2: A large-scale chinese legal case retrieval dataset (2023)
- [20] Li, H., Su, W., Wang, C., Wu, Y., Ai, Q., Liu, Y.: Thuir@coliee 2023: Incorporating structural knowledge into pre-trained language models for legal case retrieval (2023)
- [21] Li, H., Wang, C., Su, W., Wu, Y., Ai, Q., Liu, Y.: Thuir@coliee 2023: More parameters and legal knowledge for legal case entailment (2023)
- [22] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
- [23] Ma, Y., Shao, Y., Wu, Y., Liu, Y., Zhang, R., Zhang, M., Ma, S.: Lecard: a legal case retrieval dataset for chinese law system. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2342–2348 (2021)
- [24] Nogueira, R., Jiang, Z., Lin, J.: Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713 (2020)
- [25] Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval 3(4), 333–389 (2009)
- [26] Seo, M., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603 (2016)
- [27] Shao, Y., Li, H., Wu, Y., Liu, Y., Ai, Q., Mao, J., Ma, Y., Ma, S.: An intent taxonomy of legal case retrieval. ACM Transactions on Information Systems 42(2) (2023). https://doi.org/10.1145/3626093
- [28] Shao, Y., Mao, J., Liu, Y., Ma, W., Satoh, K., Zhang, M., Ma, S.: Bert-pli: Modeling paragraph-level interactions for legal case retrieval. In: IJCAI. pp. 3501–3507 (2020)
- [29] Shao, Y., Wu, Y., Liu, Y., Mao, J., Ma, S.: Understanding relevance judgments in legal case retrieval. ACM Transactions on Information Systems 41(3), 1–32 (2023)
- [30] Tran, V., Nguyen, M.L., Satoh, K.: Building legal case retrieval systems with lexical matching and summarization using a pre-trained phrase scoring model. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. pp. 275–282 (2019)
- [31] Tu, Y., Li, H., Chu, Z., Ai, Q., Liu, Y.: Thuir at the ntcir-17 fairweb-1 task: An initial exploration of the relationship between relevance and fairness. In: Proceedings of NTCIR-17 (2023). https://doi.org/10.20736/0002001317
- [32] Xiao, C., Hu, X., Liu, Z., Tu, C., Sun, M.: Lawformer: A pre-trained language model for chinese legal long documents. AI Open 2, 79–84 (2021)
- [33] Xie, X., Dong, Q., Wang, B., Lv, F., Yao, T., Gan, W., Wu, Z., Li, X., Li, H., Liu, Y., et al.: T2ranking: A large-scale chinese benchmark for passage ranking. arXiv preprint arXiv:2304.03679 (2023)
- [34] Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020)
- [35] Yang, S., Li, H., Chu, Z., Zhan, J., Liu, Y., Zhang, M., Ma, S.: Thuir at the ntcir-16 www-4 task. Proceedings of NTCIR-16. to appear (2022)
- [36] Yu, W., Sun, Z., Xu, J., Dong, Z., Chen, X., Xu, H., Wen, J.R.: Explainable legal case matching via inverse optimal transport-based rationale extraction. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 657–668 (2022)
- [37] Zhai, C.: Statistical language models for information retrieval. Synthesis lectures on human language technologies 1(1), 1–141 (2008)
- [38] Zhang, K., Chen, C., Wang, Y., Tian, Q., Bai, L.: Cfgl-lcr: A counterfactual graph learning framework for legal case retrieval. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 3332–3341 (2023)