A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents

Nishchal Prasad (IRIT, Toulouse, France), Mohand Boughanem (IRIT, Toulouse, France), and Taoufiq Dkaki (IRIT, Toulouse, France)
Abstract.

Automatic legal judgment prediction and its explanation suffer from the problem of long case documents, which generally exceed tens of thousands of words and have a non-uniform structure. Predicting judgments from such documents and extracting their explanation becomes a challenging task, more so for documents with no structural annotation. We define this problem as "scarce annotated legal documents" and address their lack of structural information and their long lengths with a deep-learning-based classification framework for judgment prediction, which we call MESc ("Multi-stage Encoder-based Supervised with-clustering"). We explore the adaptability of LLMs with multi-billion parameters (GPT-Neo and GPT-J) to legal texts and their intra-domain (legal) transfer learning capacity. Alongside this, we compare their performance and adaptability with MESc and the impact of combining embeddings from their last layers. For such hierarchical models, we also propose an explanation extraction algorithm named ORSE (Occlusion sensitivity-based Relevant Sentence Extractor), based on the input-occlusion sensitivity of the model, to explain the predictions with the most relevant sentences from the document. We explore these methods and test their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. MESc achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods, while ORSE applied on MESc achieves a total average gain of 50% over the baseline explainability scores.

Extractive explanation, Legal judgment prediction, Scarce annotated documents, Long document classification, Multi-stage classification framework

1. Introduction

A legal case proceeding cycle generally involves pre-filing and investigation, pleadings, discovery, pre-trial motions, trial, post-trial motions and appeals, and enforcement (see https://www.law.cornell.edu/wex/civil_procedure). In many steps of this cycle (case filing, appeals, etc.), the lawyer or the judge needs to analyze each case and devote a certain amount of time to reach a conclusion. This can involve analyzing vast amounts of data and legal precedents, which can be a time-consuming process given the complexity and length of the case. The number of legal cases in a country is also proportionally related to its population. This leads to a backlog of cases, especially in countries with huge populations, ultimately setting back the progress of their legal systems (https://www.globaltimes.cn/page/202204/1260044.shtml; Katju, 2019).

Automating such legal case procedures can help speed up and strengthen the decision-making process, saving time and benefiting both the legal authorities and the people involved. (The scope of this work is within the ethical considerations discussed in Section 7.)

One of the fundamental problems within this larger goal is the prediction of the outcome based only on the case’s raw text (which includes the facts, arguments, appeals, etc., but not the final decision), which corresponds to the typical real-life scenario. Alongside the prediction, the reasoning as to why such a prediction (decision/judgment) was made is essential to understand, rely upon, and use that prediction in a more informed way.

Many machine learning techniques have been applied to legal texts to predict judgments as a text classification problem ((Feng et al., 2022), (Cui et al., 2022)). While this may seem like a general text classification task, legal texts differ from general texts and are more complex, broadly in two ways: structure and syntax, and lexicon and grammar ((Zhong et al., 2020b), (Chalkidis et al., 2020), (Nallapati and Manning, 2008)). The structure of legal case documents (preamble, facts, appeals, etc.) is not uniform in most settings, and their complex syntax and lexicon make them more difficult to annotate, a task that requires legal professionals. This adds to another challenge, the long lengths of legal case documents, which can reach more than 10000 tokens (Table 1). The lack of structure information and the long lengths of such legal documents can be defined as a more specific text classification problem, which in our work we call "scarce-annotated documents".

What are scarce-annotated legal documents? Legal documents have a structure that comprises a preamble, facts, justifications, case arguments, appeals, and old decisions. Except for the preamble (the heading), the remaining parts (with varying lengths) can occur in no specific order. This also varies across individual documents, making them non-uniform in nature. In such a case, it becomes necessary to provide labels (annotations) defining this structure, so that a model can understand the relevant parts of the document and make a better prediction. These relevant parts also get hidden as the length of the document increases (tens of thousands of tokens), making the document noisy. We define this scarcity of structure information in large legal case documents as scarce-annotated. Note that the documents have their class (final decision/judgment) labels, but they carry no other information regarding their structure.

We explore this problem of classification of large scarce-annotated legal documents with their explanation.

For classification, to approximate the structure we cluster the features generated (after processing individual parts of these documents) by a transformer-based language model, and then use the resulting cluster labels alongside the features to make the predictions. We test whether these approximated labels help in judgment prediction with experiments on four different datasets (ILDC (Malik et al., 2021) and LexGLUE’s (Chalkidis et al., 2022b) ECtHR(A), ECtHR(B) and SCOTUS), which can be found in the later sections of the paper. The work on classification is summarised below:

  • We define the problem of judgment prediction from large scarce-annotated legal documents and propose a multi-stage neural classification framework named "Multi-stage Encoder-based Supervised with-clustering" (MESc). This works by extracting embeddings from the last four layers of a fine-tuned encoder of a large language model (LLM) and using an unsupervised clustering mechanism on them to approximate structure labels. Alongside the embeddings, these labels are processed through another set of transformer encoder layers for final classification.

  • With ablation on MESc we show the importance of combining features from the last layers of transformer-based large language models (LLMs) (BERT (Devlin et al., 2019), GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021)) for such documents, with the impact on their prediction upon using the approximated structure labels.

  • We also study the adaptability of multi-billion-parameter LLMs to the hierarchical framework and to scarce-annotated legal documents, together with their intra-domain (legal) transfer learning capacity (both with fine-tuning and in MESc).

  • MESc achieves a minimum gain of approximately 2 points in classification on ILDC and the LexGLUE subset.

To explain the predictions, we developed a novel explanation extraction algorithm to rank and extract the relevant sentences that impacted the prediction. These sentences cannot exactly serve as the explanation, as they are just marked sentences. However, in a situation where no annotations are available to train an abstractive explanation algorithm, they can serve as a representative for an explanation and guide an expert on what led to a certain prediction. The results also showcase the challenge associated with such documents and leave scope for future developments.

The work on explanation is summarized below:

  • We propose an extractive explanation algorithm, Occlusion sensitivity based Relevant Sentence Extractor (ORSE), based on the input sensitivity of a model which ranks relevant sentences from a document, to serve as an explanation for its predicted class.

  • We test ORSE for explanation extraction on ILDCexpert (Malik et al., 2021), with GPT-J (Wang and Komatsuzaki, 2021) and InLegalBERT (Paul et al., 2022), obtaining new benchmarks.

2. Related works

Several strategies have been investigated in the past with machine-learning techniques to predict the result of legal cases in specific categories (criminal, civil, etc.) with rich annotations (Xiao et al. (Xiao et al., 2018), Xu et al. (Xu et al., 2020), Zhong et al. (Zhong et al., 2018), Chen et al. (Chen et al., 2019)). These studies on well-structured and annotated legal documents show the effect and importance of having good structural information. While creating such a dataset is both time and resource (highly skilled) demanding, researchers have worked on legal documents in a more general and raw setting.

Chalkidis et al. (Chalkidis et al., 2019) presented a dataset of European Court of Human Rights case proceedings in English, with each case assigned a score indicating its importance. They described a Legal Judgment Prediction (LJP) task for their dataset, which seeks to predict the outcome of a court case using the case facts and law violations. They also curated another version of this dataset (Chalkidis et al., 2021) to give a rational explanation for the decision prediction made on them. In the US case setting, Kaufman et al. (Kaufman et al., 2019) used an AdaBoost decision tree to predict U.S. Supreme Court rulings. Tuggener et al. (Tuggener et al., 2020) curated LEDGAR, a multilabel dataset of legal provisions in US contracts. Malik et al. (Malik et al., 2021) curated the Indian Legal Document Corpus (ILDC) of unannotated and unstructured documents, and used it to build baseline models for their Case Judgment Prediction and Explanation (CJPE) task, upon which Prasad et al. (Prasad et al., 2022) showed the possibility of intra-domain (legal) transfer learning using LEGAL-BERT on Indian legal texts (ILDC).

Pretrained language models based on transformers (Devlin et al. (Devlin et al., 2019), Vaswani et al. (Vaswani et al., 2017)) have shown widespread success in all fields of natural language processing (NLP), but only for short texts spanning a few hundred tokens. There have been several approaches to handle longer sequences with transformer encoders (Beltagy et al. (Beltagy et al., 2020), Kitaev et al. (Kitaev et al., 2020), Zaheer et al. (Zaheer et al., 2020), Ainslie et al. (Ainslie et al., 2020)), but they demand expensive domain-specific pretraining for adaptation to legal texts and are not guaranteed to scale compared to their vanilla counterparts (Tay et al. (Tay et al., 2022)). Since we try to get the structure information of the document, we choose to process the document in short sequences rather than as a whole. These short sequences help us to learn and approximate their structure labels. So, we take a different approach to handle large documents with smaller pre-trained transformer encoder models (such as BERT (Devlin et al., 2019)), based on the hierarchical idea of "divide, learn and combine" (Chalkidis et al. (Chalkidis et al., 2022a), Zhang et al. (Zhang et al., 2019), Yang et al. (Yang et al., 2016)), where the document is split (into parts, then sentences and words, etc.) and the features of each component are learned and combined hierarchically from the bottom up to get the whole document’s representation.

The domain-specific pre-training of transformer encoders has also accelerated the development of NLP in legal systems, with better performance compared to the general pre-trained variants (Chalkidis et al. (Chalkidis et al., 2020)’s LEGAL-BERT trained on court cases of the US, UK, and EU, Zheng et al. (Zheng et al., 2021)’s BERT trained on the US court cases dataset CaseHOLD, Paul et al. (Paul et al., 2022)’s InLegalBERT and InCaseLawBERT trained on Indian legal cases). Recently, with the emergence of multi-billion-parameter LLMs such as GPT-3 (Brown et al., 2020), LLaMA (Touvron et al., 2023), and LaMDA (Thoppilan et al., 2022), and their superior performance in natural language understanding, researchers have tried to adapt (with few-shot learning) their smaller variants to legal texts (Trautmann et al. (Trautmann et al., 2022), Yu et al. (Yu et al., 2022)). We check their adaptability with the hierarchical idea of "divide, learn and combine" and their intra-domain (legal) transfer learning with full fine-tuning, compared to the intra-domain pre-training (as done in LEGAL-BERT and InLegalBERT). To do so we use three such variants of GPT (GPT-Neo (1.3 and 2.7) (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021)) pre-trained on the Pile (Gao et al., 2021), which has a subset (FreeLaw) of court opinions of US legal cases.

To rely upon the judgment prediction of a legal case, an explanation leading to that judgment is of paramount importance. In scenarios where explanation annotations of legal texts are lacking, an extractive explanation method is a good fit to create interpretations of the predicted judgments. To provide interpretations for their judgment prediction task, Zhong et al. (Zhong et al., 2020a) employed deep reinforcement learning and created a "question-answering" based model called QAjudge. Jiang et al. (Jiang et al., 2018) try to extract readable snippets of text (called rationales) from legal texts using reinforcement learning for their judgment classification problem. To improve the interpretability of charge prediction systems, Ye et al. (Ye et al., 2018) propose a label-conditioned Seq2Seq model which, for a given predicted charge, chooses relevant reasoning in the legal document. Since we try to develop an extractive explanation algorithm that uses no training data and relies solely on a trained model, we turn toward the idea of the input sensitivity of a model (Zeiler et al. (Zeiler and Fergus, 2014), Petsiuk et al. (Petsiuk et al., 2018)), which has been used in the interpretation of computer vision models: the pixels are scored (according to a scoring parameter) against their absence in the input, and they are finally chosen according to the desirability of the scores (higher or lower).

3. Method

Figure 1. MESc classification Framework

3.1. Classification Framework (MESc)

To handle large documents, the MESc architecture shares the general hierarchical idea of divide, learn and combine ((Chalkidis et al., 2022a), (Zhang et al., 2019), (Yang et al., 2016)), but it differs from the previous works in the following main aspects:

a) It employs custom fine-tuning and uses the last four layers of the fine-tuned transformer encoder for extracting representations for parts (chunks) of the document.

b) It approximates the document structure by applying unsupervised learning (clustering) on these representations’ embeddings and uses this information alongside, for classification.

c) Different configurations of transformer encoder layers are used and experimented with to attend over the combined embeddings to learn the intra-chunk representation to get a global document representation.

d) It divides the process into four stages: custom fine-tuning, extracting embeddings, processing the embeddings (supervised + unsupervised learning), and classification.

An overview of MESc can be seen in Figure 1. The stages are detailed below:

An input document $D$ is tokenized into a sequence of tokens, $D=\{t_{1,D},t_{2,D},\cdots,t_{L_{D},D}\}$, via a tokenizer specific to a chosen pre-trained language model (BERT, GPT, etc.), where $t\in\mathbb{N}$ and $\mathbb{N}$ is the vocabulary of the tokenizer. This token sequence is split into a set of blocks $\{C_{1,D},C_{2,D},\cdots,C_{N_{D},D}\}$, which we call chunks, each overlapping the previous block by $o$ tokens. Each chunk is $C_{i,D}=\{t_{(i+c-o),D},\cdots,t_{(i+2c-o),D}\}$, with $c$ being the maximum number of tokens in a chunk, which is a predefined parameter for MESc (e.g. 512). $N_{D}=\lceil\frac{L_{D}}{c-o}\rceil$ is the total number of chunks for a document having $L_{D}$ tokens in total, with $o\ll c$. $N_{D}$ varies with the length of the document.
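The chunking step can be illustrated with a minimal sketch. It assumes a stride of $c-o$ tokens between chunk starts, which reproduces the chunk count $N_{D}=\lceil L_{D}/(c-o)\rceil$ stated above. The function name `split_into_chunks` and the toy values of `c` and `o` are assumptions for illustration, not part of MESc.

```python
import math


def split_into_chunks(tokens, c, o):
    """Split tokens into chunks of at most c tokens; consecutive chunks share o tokens."""
    if not 0 <= o < c:
        raise ValueError("the overlap o must satisfy 0 <= o < c")
    stride = c - o  # each new chunk starts c - o tokens after the previous one
    return [tokens[start:start + c] for start in range(0, len(tokens), stride)]


# Toy values (assumptions for illustration; the text gives c = 512 as an example).
tokens = list(range(10))  # stand-in token ids, so L_D = 10
c, o = 4, 1
chunks = split_into_chunks(tokens, c, o)

# The number of chunks matches ceil(L_D / (c - o)).
expected_count = math.ceil(len(tokens) / (c - o))
print(chunks, len(chunks) == expected_count)
```

With this convention the final chunk can be shorter than $c$ tokens, and for some lengths it contains only tokens that the previous chunk already covers.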

Stage 1 - Custom fine-tuning:

To each chunk of a document, we associate the document label $l_{D}$ and combine them together to form a token matrix:

(1) $I_{D}\in\mathbb{R}^{N_{D}\times c\times 1}\leftarrow[\{C_{1,D},l_{D}\},\{C_{2,D},l_{D}\},\cdots,\{C_{N_{D},D},l_{D}\}]$

This is used as the input for the document when fine-tuning the pre-trained encoder, where $N_{D}$ is the batch size for one pass through the encoder.

This allows the encoder to adapt to the domain-specific legal texts, which helps get richer features for the next stage.
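As a rough illustration of how the Stage 1 input could be assembled, the sketch below pairs every chunk of a document with the document label. Padding short chunks to length $c$ and the padding id `pad_id=0` are assumptions made here so that the batch has a fixed width. The function name `build_finetuning_batch` is hypothetical.

```python
def build_finetuning_batch(chunks, doc_label, c, pad_id=0):
    """Pad each chunk to length c and pair it with the document-level label."""
    batch = []
    for chunk in chunks:
        if len(chunk) > c:
            raise ValueError("a chunk cannot be longer than c tokens")
        padded = list(chunk) + [pad_id] * (c - len(chunk))  # assumed padding scheme
        batch.append((padded, doc_label))
    return batch


# Toy document with three chunks (c = 4) and a binary label.
chunks = [[11, 12, 13, 14], [14, 15, 16, 17], [17, 18]]
batch = build_finetuning_batch(chunks, doc_label=1, c=4)
for token_ids, label in batch:
    print(token_ids, label)
```

Every chunk inherits the same label, so the encoder is fine-tuned on chunk-level inputs with document-level supervision.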

Stage 2 - Extracting chunk embeddings:

For a document, we pass its chunks $C_{i}$ through the fine-tuned encoder and extract its representation embeddings ($E_{i,D}$) from the last $l$ layers. $E_{i,D}\in\mathbb{R}^{l\times d}$, where $d$ is the feature dimension (we use $l=4$). The representation embeddings can be either the first token (as in BERT) or the last token for causal language models (as in GPT). We accumulate all $E_{i,D}$ of a document to form an embedding matrix:

(2) $E_{D}\in\mathbb{R}^{N_{D}\times l\times d}\leftarrow\bigl[E_{1,D},E_{2,D},\cdots,E_{N_{D},D}\bigr]$

The $E_{i,D}$ acts as a representation of the chunk in this context, and combining them yields an approximate representation of the entire document. Doing this for all the documents gives us generated training data.
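The selection of a chunk representation from the last $l$ layers can be sketched as follows. Plain nested lists stand in for the encoder's per-layer hidden states to keep the example dependency-free. The name `chunk_embedding` and the toy values are assumptions, and the sketch assumes an unpadded chunk, so the last position is the last real token.

```python
def chunk_embedding(hidden_states, l=4, causal=False):
    """Return one d-dimensional vector per layer for the last l layers.

    hidden_states[k][t] is the vector of token t at layer k.
    The first token is used for BERT-style encoders, the last token for causal models.
    """
    if l > len(hidden_states):
        raise ValueError("the encoder has fewer than l layers")
    token_index = -1 if causal else 0
    return [layer[token_index] for layer in hidden_states[-l:]]


# Toy hidden states: 6 layers, 3 tokens, d = 2 (values encode layer and token).
hidden_states = [[[float(10 * k + t), float(k)] for t in range(3)] for k in range(6)]

E_bert_style = chunk_embedding(hidden_states, l=4)             # first-token vectors
E_gpt_style = chunk_embedding(hidden_states, l=4, causal=True)  # last-token vectors
print(E_bert_style)
print(E_gpt_style)
```

Each call returns an $l\times d$ block, which corresponds to one $E_{i,D}$. Stacking these blocks over all chunks of a document gives the $N_{D}\times l\times d$ matrix of equation 2.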

Stage 3 - Processing the extracted representations:

Since the features extracted from the last layers of a fine-tuned encoder have different embedding spaces, their contribution can be either positive or redundant. So for this stage, we choose to combine together the last $p\leq l$ layers in $E_{D}$ for further training. We experiment with different values of $p$ before fixing one value. This gives $E_{D}^{(p)}\in\mathbb{R}^{N_{D}\times p\times d},\ p\in\{1,2,3,4\}$. (We used 1, 2 and 4 in our experiments to compare their effects.) We concatenate together the representations from these $p$ layers to get,

(3) $E_{i,D}^{(p)}\in\mathbb{R}^{pd\times 1}\leftarrow\bigl[E_{i,D}^{(l)}\,|\,E_{i,D}^{(l-1)}\,|\cdots|\,E_{i,D}^{(l-p+1)}\bigr]$

This gives,

(4) $\widehat{E}_{D}^{(p)}\in\mathbb{R}^{N_{D}\times pd}\leftarrow\bigl\{E_{1,D}^{(p)}\,|\,E_{2,D}^{(p)}\,|\cdots|\,E_{N_{D},D}^{(p)}\bigr\}$

We also experimented with the element-wise addition of representations in $E_{D}^{(p)}$ and found their performance to be lower by 1 point in most of the experiments of Section 5, hence we exclude it from MESc.
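The concatenation of the last $p$ layer representations into a single $pd$-dimensional vector per chunk can be sketched as below. The function names and toy values are assumptions for illustration. The layer order inside the concatenated vector (deepest layer first) is one possible reading of equation 3.

```python
def concat_last_p(E_chunk, p):
    """Concatenate the last p layer vectors of one chunk, deepest layer first.

    E_chunk holds l vectors of size d, ordered from shallowest to deepest layer.
    """
    if not 1 <= p <= len(E_chunk):
        raise ValueError("p must satisfy 1 <= p <= l")
    combined = []
    for layer_vector in reversed(E_chunk[-p:]):
        combined.extend(layer_vector)
    return combined


def document_matrix(E_doc, p):
    """Build the N_D x (p*d) matrix by concatenating layers chunk by chunk."""
    return [concat_last_p(E_chunk, p) for E_chunk in E_doc]


# Toy document: N_D = 3 chunks, l = 4 layers, d = 2.
E_doc = [[[float(100 * i + 10 * k + j) for j in range(2)] for k in range(4)]
         for i in range(3)]

M = document_matrix(E_doc, p=2)
print(len(M), len(M[0]))  # N_D rows, p*d columns
print(M[0])
```

Element-wise addition, which the text reports as slightly weaker, would instead keep the width at $d$ by summing the $p$ vectors.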

  1. Approximating the structure labels ($S_{D}$) (unsupervised learning): To get the information on the document’s structure, i.e. its parts (facts, arguments, concerned laws, etc.), we use a clustering mechanism (HDBSCAN (McInnes et al., 2017)). We cluster the extracted chunk embeddings of the $p$ chosen layers, $\widehat{E}_{D}^{(p)}$, to map similar parts of different documents together, where the labels of one part of a document are learned by its similarity with another part of another document. The idea is that the embeddings of similar parts from different documents will group together, forming a pool of cluster labels that can help identify its part in the document. One such dummy example can be seen in Figure 2, where the $E_{i,D}^{(p)}$ of documents 1 and 2 learn their cluster (label) pools (in this example the subscripts denote the document first and the chunk second): arguments of one type, $a_{1}=\{E^{(p)}_{1,1},E^{(p)}_{1,2},E^{(p)}_{2,1}\}$; facts of one type, $f_{1}=\{E^{(p)}_{1,3},E^{(p)}_{1,5},E^{(p)}_{2,2},E^{(p)}_{2,3},E^{(p)}_{2,5}\}$; and facts of another type, $f_{2}=\{E^{(p)}_{1,4},E^{(p)}_{1,6},E^{(p)}_{2,4},E^{(p)}_{2,6}\}$. So for document 1 the approximated structure becomes $S_{1}=\{a_{1},a_{1},f_{1},f_{2},f_{1},f_{2}\}$ and for document 2 it is $S_{2}=\{a_{1},f_{1},f_{1},f_{2},f_{1},f_{2}\}$. Note that the distinction between a fact, an argument, etc. is made here only for the purpose of illustration. In the actual setting it is unknown, and these structure labels carry no specific name or meaning; they only give the model information about the document’s structure.

    Figure 2. An example of clustering of chunk representations of two documents to generate structure labels.

    Since the performance of the HDBSCAN clustering mechanism decreases significantly with an increase in data dimension, we use a dimensionality reduction algorithm (pUMAP (McInnes et al., 2018)), before clustering. For all the chunks of a document, their approximated structure labels are combined with the output of stage 3 (2), before processing through the final classification stage (4).

  2. Global document representation (supervised learning): For intra-chunk attention, we use transformer encoder layers (Vaswani et al. (Vaswani et al., 2017)), so that one chunk attends to another through multi-head attention followed by a feed-forward neural network (FFN) layer. With respect to a chunk’s position in the document, we add its positional embeddings ((Devlin et al., 2019)) to $E_{D}^{(p)}$ and process it through $e$ transformer layers $T^{(e)}_{\{h,d_{f}\}}$, with $h$ attention heads and $d_{f}=pd$ as the dimension of the FFN. $e$ and $h$ are both hyperparameters whose choice depends upon the input feature lengths (Section 5 evaluates different values of these parameters). Since $e\geq 3$ sometimes overfits the model in our experiments, we fix $e=2$ for MESc. The output is max-pooled and passed through a feed-forward neural network $FFN_{T}$ of $128$ nodes to get:

    (5) $G\left(\widehat{E}_{D}^{(p)}\right)=FFN_{T}\left(\mathrm{maxpool}\left(T^{(e)}_{\{h,d_{f}\}}\left(\widehat{E}_{D}^{(p)}\right)\right)\right)\in\mathbb{R}^{128}$
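Two pieces of Stage 3 can be made concrete with a small sketch: turning cluster assignments into a per-document structure sequence $S_{D}$, and max-pooling over the chunk axis to obtain a fixed-size vector regardless of $N_{D}$. The clustering itself (dimensionality reduction followed by HDBSCAN) and the transformer layers are not reproduced here. The cluster ids are passed in directly, using the dummy pools of Figure 2. The names `structure_labels` and `max_pool` are assumptions for illustration.

```python
from collections import defaultdict


def structure_labels(chunk_keys, cluster_ids):
    """Group cluster ids by document, ordered by chunk position.

    chunk_keys: one (document, chunk position) pair per clustered embedding.
    cluster_ids: the cluster id assigned to each embedding, in the same order.
    """
    per_document = defaultdict(list)
    for (document, position), cluster_id in zip(chunk_keys, cluster_ids):
        per_document[document].append((position, cluster_id))
    return {document: [cluster_id for _, cluster_id in sorted(items)]
            for document, items in per_document.items()}


def max_pool(chunk_vectors):
    """Element-wise maximum over the chunk axis: N_D x k values become k values."""
    return [max(column) for column in zip(*chunk_vectors)]


# Dummy cluster pools from the Figure 2 example: label -> (document, chunk) members.
pools = {
    "a1": [(1, 1), (1, 2), (2, 1)],
    "f1": [(1, 3), (1, 5), (2, 2), (2, 3), (2, 5)],
    "f2": [(1, 4), (1, 6), (2, 4), (2, 6)],
}
chunk_keys = [key for members in pools.values() for key in members]
cluster_ids = [label for label, members in pools.items() for _ in members]

S = structure_labels(chunk_keys, cluster_ids)
print(S[1])
print(S[2])

pooled = max_pool([[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]])
print(pooled)
```

Because the maximum is taken over chunks, documents with different numbers of chunks all map to a vector of the same width, which is what allows a fixed-size feed-forward network to follow.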
Stage 4 - Classification:

The structure labels and the output of the feed-forward network of stage 3(b) are concatenated and processed through an internal feed-forward network $FFN_{i}$ (32 nodes, with softmax activation) and an external feed-forward network $FFN_{e}$ ($u$ label (class) nodes, with a task-specific activation function: sigmoid for labels, softmax for classes) to get the output $O(D)$ for a document $D$, as shown in equation 6.

(6) $O\left(D\right)=FFN_{e}\left(FFN_{i}\left(\left[G\left(\widehat{E}_{D}^{(p)}\right)|S_{D}\right]\right)\right)\in\mathbb{R}^{u}$

$O$ and $G$ are learnt together for final classification while $S_{D}$ is learnt independently.

3.2. Extractive Explanation Algorithm - ”Occlusion-based Relevant Sentence Extraction (ORSE)”

To understand which parts of the document contribute to the predicted decision, we extract the sentences that have a high impact on that prediction. We develop an extractive explanation algorithm for hierarchical classification models (similar to MESc, (Pappagari et al., 2019)) based on the input sensitivity of the models in the hierarchy. Since this is an extractive process, there is no need for a "pre-summarised" (detailed annotation) explanation, which would be required to train an explanation model. A hierarchical model is composed of several individual models divided into levels, where each level is responsible for learning different components of the input, which are combined in moving up the hierarchy. Taking a document as an example, the input to these models can be processed hierarchically from the bottom up by learning from the words, then combining them into sentences and into paragraphs/parts, which can be further accumulated to give a full input representation.

In this work, we target ORSE to explain the reason why a decision prediction is made by our hierarchical predictive model (MESc) where the algorithm processes the document in its two steps of hierarchy.

1) Find the highly sensitive (impactful) chunks (parts of the document).

2) Find the highly sensitive sentences from these chunks.

We define ORSE (Algorithm 1) keeping in mind the long lengths of large documents. It can also be adapted to shorter documents (a few hundred to a thousand tokens), for which the extraction can be done with just the first step.

For a classification model $M$ and input $I=\{i_{j}|0\leq j\leq n\}$, where $n$ is the input length, consider $M(I)=O_{I}$ to be the prediction without any occlusion, where $P$ is the final predicted class label(s), and $M(\{I|i_{j}\})=O_{I}^{(j)}$ to be the prediction after the occlusion of $i_{j}$ in $I$. The occlusion is done by masking individual parts (e.g. $0$ masking) before feeding the input into a classification model. If $P$ is from the final model in the hierarchy, we take it as the absolute class label with which we rank the inputs for all models in the hierarchy. We define an occlusion-sensitivity impact function $L$, depending on the classification problem type, as

(7) $L(O_{I}^{(j)},P)=\begin{cases}CCE_{loss}(O_{I}^{(j)},P),&\text{for multi-class}\\ BCE_{loss}(O_{I}^{(j)},P),&\text{for binary, multi-label}\end{cases}$

where $CCE_{loss}$ = Categorical-Cross-Entropy loss and $BCE_{loss}$ = Binary-Cross-Entropy loss. Other loss functions can also be used depending on the task. The intuition behind this impact measure is to see, through the change in the loss, how important the occluded part of the input is for a prediction. A higher loss means more impact. $L$ is to be chosen such that it is always $>0$.

To rank these losses in terms of their impact, we measure the deviance of an input's occluded component's impact value from the impact value of the whole input (without any occlusion). With $L$ we compute the "weighted occlusion sensitivity score" $S$,

(8) $S(s,L_{j},L_{I})=s\times\left(L_{I}^{(j)}-L_{I}+\delta\right)$

where $s$ is the score weight, and $L_{I}^{(j)}$ and $L_{I}$ are, respectively, the impact score on occluding $i_{j}$ in $I$ and the impact of $I$ with respect to $P$ (the absolute class label). We add a constant $\delta$ to shift the axis so that $S$ stays above $0$.
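Equations 7 and 8 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: the helper names `bce_impact`, `cce_impact` and `sensitivity`, the clipping constant `EPS`, and the toy probabilities are all hypothetical, and $\delta=1$ is chosen only for the example.

```python
import math
from typing import List

EPS = 1e-12  # numerical guard against log(0); an assumption of this sketch


def bce_impact(probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy between occluded outputs and label(s) P (Eq. 7)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def cce_impact(probs: List[float], label: int) -> float:
    """Categorical cross-entropy for a single predicted class index P (Eq. 7)."""
    return -math.log(max(probs[label], EPS))


def sensitivity(weight: float, loss_occluded: float, loss_full: float,
                delta: float) -> float:
    """Weighted occlusion sensitivity score S (Eq. 8)."""
    return weight * (loss_occluded - loss_full + delta)


# Binary toy case: the predicted label is 1 and the self-impact is taken as 0.
loss_full = 0.0
# Occluding part a lowers the probability of the predicted label to 0.6,
# occluding part b leaves it at 0.95.
score_a = sensitivity(1.0, bce_impact([0.6], [1]), loss_full, delta=1.0)
score_b = sensitivity(1.0, bce_impact([0.95], [1]), loss_full, delta=1.0)
```

Part a, whose removal hurts the prediction more, receives the higher score, and the shift by $\delta$ keeps both scores positive.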

Algorithm 1 Occlusion sensitivity-based Relevant Sentence Extractor (ORSE)
1: From 3.1, select a classification model $M$ and its backbone fine-tuned encoder $T$. $k=\%$ of sentences to choose.
2: for all documents do
3:     Divide the document into chunks of length $c$.
4:     $E\leftarrow$ Extract all chunk embeddings from $T$.
5:     $O_{E}\leftarrow M(E)$, probability output.
6:     $P\leftarrow$ absolute predicted label from $O_{E}$
7:     $L_{E}\leftarrow 0$, impact with itself $L(P,P)$ (Eq. 7)
8:     for chunk $c_{i}$ in $E$ do
9:         Mask $c_{i}$
10:        $O_{E}^{(i)}\leftarrow M(\{E|c_{i}\})$, probability output after masking $c_{i}$.
11:        $L_{c_{i}}\leftarrow L(O_{E}^{(i)},P)$ (Eq. 7)
12:        $S_{c_{i}}\leftarrow S(1,L_{c_{i}},L_{E})$ (Eq. 8)
13:    end for
14:    $S_{E}\leftarrow$ concatenate all $(c_{i},S_{c_{i}})$.
15:    $S_{E}\leftarrow$ Sort $S_{E}$ in descending order of $S_{c_{i}}$.
16:    for ($C_{i}$, $s$) in $S_{E}$ do
17:        $O_{C_{i}}\leftarrow T(C_{i})$, probability output from $T$
18:        $L_{C_{i}}\leftarrow L(O_{C_{i}},P)$ (Eq. 7)
19:        Split $C_{i}$ into sentences, $\{s_{j}|1\leq j\leq$ total sentences$\}$.
20:        for $s_{j}$ in $C_{i}$ do
21:           Mask $s_{j}$.
22:           $O_{s_{j}}\leftarrow T(\{C_{i}|s_{j}\})$, probability after masking $s_{j}$.
23:           $L_{s_{j}}\leftarrow L(O_{s_{j}},P)$ (Eq. 7)
24:           $S_{s_{j}}\leftarrow S(s,L_{s_{j}},L_{C_{i}})$ (Eq. 8)
25:           $A_{score}\leftarrow$ concatenate all $(i,s_{j},S_{s_{j}})$.
26:        end for
27:    end for
28:    Sort $A_{score}$ in descending order of $S_{s_{j}}$.
29:    $A_{score}[k]\leftarrow$ keep the top $k\%$ sentences.
30:    $A_{score}[k]\leftarrow$ rearrange in the order of ($i$, $s_{j}$).
31: end for

We give a description of ORSE adapted to MESc in Algorithm 1 and detail the steps involved. For a document, we start from the top-level model $M$ (stage 4) to find the highly sensitive chunks (steps 2-14). We calculate the probability output from $M$. Since $M$ is at the top level of the hierarchy, we take its prediction as the absolute predicted label $P$ and take the self-impact score as $0$ (steps 2-7). We mask/occlude the chunks and calculate their impact scores (Eq. 7) and then their weighted occlusion sensitivity scores (Eq. 8) with respect to the whole document, i.e. the self-impact score (steps 8-12). Since this is the top level, we use $1$ as the weight. We sort the accumulated scores in descending order of their sensitivity score (i.e. a higher value is given more importance).

To rank the sentences (steps 15-28), we iteratively start from the highest-scored chunk and take its probability output from the fine-tuned transformer $T$ (from stage 1) to calculate its impact score w.r.t. $P$ (steps 17-18). We then split this chunk ($C_{i}$) into sentences and iteratively mask/occlude a sentence $s_{j}$ inside the chunk to calculate its weighted occlusion sensitivity score ($S_{s_{j}}$) (steps 19-24). To weigh the overall importance of each sentence of this chunk against the sentences belonging to other chunks, we weigh the impact shift of $s_{j}$ with the sensitivity score of $C_{i}$ from the previous level of the hierarchy. We store the sentences along with their chunk number and sensitivity score in $A_{score}$. We sort $A_{score}$ in descending order of $S_{s_{j}}$. Since this is the last level of the hierarchy, we stop and take the top $k\%$ sentences. To restore the sequential order of the sentences in the document, we arrange $A_{score}[k]$ according to the chunk number and the position of the sentence in the chunk.
These sentences serve as the explanation for a document's prediction. The time complexity is model dependent, and is $O(n^{2})$ here, due to the quadratic complexity of the fine-tuned transformer ($T$) used, where asymptotically $n$ is the average length of all the documents for a batch.
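The two-level procedure of Algorithm 1 can be sketched end to end on a toy binary task. This is a simplified illustration, not the implementation evaluated in the paper: `toy_prob` is a hypothetical keyword-based stand-in for both $M$ and $T$, occlusion is done by removing text instead of zero-masking embeddings, $\delta=1$, and the rounding rule used to turn $k$ into a sentence count is an assumption.

```python
import math
from typing import Callable, List, Tuple

Chunk = List[str]  # in this toy setting a chunk is a list of sentences


def impact(prob: float, label: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy of one probability against the label P (Eq. 7)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -math.log(prob if label == 1 else 1.0 - prob)


def orse(chunks: List[Chunk],
         doc_model: Callable[[List[Chunk]], float],
         chunk_model: Callable[[Chunk], float],
         k: float, delta: float = 1.0) -> List[Tuple[int, int, str]]:
    """Return the top-k fraction of sentences as (chunk, sentence, text)."""
    label = 1 if doc_model(chunks) >= 0.5 else 0          # steps 5-6
    loss_doc = 0.0                                        # step 7
    chunk_scores = []
    for i in range(len(chunks)):                          # steps 8-13
        masked = chunks[:i] + [[]] + chunks[i + 1:]       # occlude chunk i
        loss = impact(doc_model(masked), label)
        chunk_scores.append((i, 1.0 * (loss - loss_doc + delta)))
    chunk_scores.sort(key=lambda t: t[1], reverse=True)   # step 15

    scored = []
    for i, weight in chunk_scores:                        # steps 16-27
        loss_chunk = impact(chunk_model(chunks[i]), label)
        for j, sentence in enumerate(chunks[i]):
            masked = chunks[i][:j] + chunks[i][j + 1:]    # occlude sentence j
            loss = impact(chunk_model(masked), label)
            scored.append((i, j, sentence,
                           weight * (loss - loss_chunk + delta)))
    scored.sort(key=lambda t: t[3], reverse=True)         # step 28
    keep = max(1, round(k * len(scored)))                 # step 29 (assumed rounding)
    return sorted((i, j, s) for i, j, s, _ in scored[:keep])  # step 30


def toy_prob(sentences: List[str]) -> float:
    """Hypothetical model: more 'allowed' mentions give a higher P(class 1)."""
    hits = sum("allowed" in s for s in sentences)
    return 1.0 / (1.0 + math.exp(-(2 * hits - 1)))


document = [
    ["The appellant filed a petition.", "The hearing was adjourned."],
    ["The earlier appeal was allowed.", "Costs were discussed."],
    ["The court noted the delay."],
]
top = orse(document,
           doc_model=lambda cs: toy_prob([s for c in cs for s in c]),
           chunk_model=toy_prob, k=0.2)
```

On this toy document, `top` holds only the sentence stating that the appeal was allowed, since removing it is the only occlusion that changes the stand-in model's output. Multiplying the sentence-level shift by the chunk score is what lets sentences from different chunks be compared in a single ranking.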

4. Experimental setup

For our backbone transformer encoder, we used the domain-specific pre-trained models LEGAL-BERT (Chalkidis et al., 2020) and InLegalBERT (Paul et al., 2022), and chose GPT-Neo (Black et al., 2021) and GPT-J (Wang and Komatsuzaki, 2021) for experimenting with larger LLMs with multi-billion parameters. The documents are tokenized with each encoder's own tokenizer. No chunks were excluded in any stage of MESc.

Stage 1: For BERT-based encoders, the chunk size was set to $\leq 510$ tokens (with 90-token overlaps), plus the global tokens ([CLS], [SEP]) and padding to make the input chunk size $512$. For GPT-based encoders, we experimented with the same overlap but two different chunk sizes, $512$ and $2048$, the latter being the maximum possible input length. We abbreviate the encoders fine-tuned on 512 input length as ($\alpha$) and the ones fine-tuned with 2048 input length as ($\gamma$).
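The chunking described above can be sketched as a sliding window. This is a minimal illustration under assumptions: the function names are hypothetical, the special-token ids are placeholders and not those of any particular tokenizer, and how the final, shorter window is handled in the experiments is not stated in the text, so here it is simply kept and padded.

```python
from typing import List


def chunk_tokens(tokens: List[int], size: int = 510,
                 overlap: int = 90) -> List[List[int]]:
    """Split token ids into windows of at most `size` tokens that share
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not tokens:
        return []
    stride = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


def to_bert_input(chunk: List[int], cls_id: int, sep_id: int, pad_id: int,
                  max_len: int = 512) -> List[int]:
    """Wrap a chunk with [CLS]/[SEP] ids and right-pad it to `max_len`."""
    ids = [cls_id] + chunk + [sep_id]
    if len(ids) > max_len:
        raise ValueError("chunk too long for max_len")
    return ids + [pad_id] * (max_len - len(ids))


# Toy document of 1000 token ids; the special-token ids are placeholders.
chunks = chunk_tokens(list(range(1000)))
inputs = [to_bert_input(c, cls_id=-1, sep_id=-2, pad_id=-3) for c in chunks]
```

With a window of 510 and an overlap of 90 the stride is 420 tokens, so the 1000-token toy document yields three chunks, and every padded input has length 512.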

For testing the fine-tuned models, we considered the last tokens of each document. We evaluated all GPT-Neo($\alpha$) and GPT-J($\alpha$) models on an input length of 512 tokens to compare their performance with LEGAL-BERT($\alpha$) and InLegalBERT($\alpha$).

These encoders were fine-tuned for 4 epochs with learning rates in $\{2e^{-6}, 3e^{-5}\}$, and we chose the best-performing one for stage 2. For the full fine-tuning of GPT-J and all GPT-Neo models, we used 6 Nvidia A100 GPUs (80GB) with the ZeRO-3 optimization strategy implemented in DeepSpeed (https://www.deepspeed.ai/) together with Hugging Face's Accelerate library (https://huggingface.co/docs/accelerate/index).

Stage 2: We used the [CLS] token for embedding extraction in BERT-based models, and for the causal language models (GPT-Neo and GPT-J) we used the last token's embedding as the global representation of the respective chunk.
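The difference between the two extraction rules can be shown on toy hidden states. In a causal language model each position attends only to the positions before it, so the last real token is the only one that has seen the whole chunk. The sketch below is illustrative only: the function names and the toy values are hypothetical, and the hidden states are plain lists standing in for one layer's output.

```python
from typing import List

Vector = List[float]


def cls_embedding(hidden_states: List[Vector]) -> Vector:
    """BERT-style: the first position holds the [CLS] token."""
    return hidden_states[0]


def last_token_embedding(hidden_states: List[Vector],
                         attention_mask: List[int]) -> Vector:
    """Causal-LM style: take the last position that is not padding."""
    real = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real:
        raise ValueError("chunk has no real tokens")
    return hidden_states[real[-1]]


# Toy chunk: 3 real tokens followed by 2 padding positions, 2-dim states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 1, 0, 0]
```

Selecting by the attention mask instead of taking the final position avoids returning a padding vector for chunks shorter than the maximum length.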

Stage 3 & 4: The Adam optimizer (learning rate = $3.5e^{-6}$) was used after stage 2 of MESc. For the binary and multi-label classification problems we used "binary cross-entropy" loss, and for the multi-class problem we used "categorical cross-entropy" loss. We experimented with $N=\{1,2,3\}$ transformer encoder layers in stage 3 with $h=8$, trained for 5 epochs, and chose $N=2$ after analyzing their performance (Section 5). For the clustering that approximates our structure labels, we use HDBSCAN (minimum cluster size = $15$) and pUMAP (https://umap-learn.readthedocs.io/en/latest/parametric_umap.html) with $64$ output dimensions.

The best-performing MESc configuration for ILDC (Table 4), with $512$ input chunks and $k=\{0.2,0.3,0.4\}$, was used for ORSE (Table 5).

4.1. Dataset

We choose legal datasets that have large documents with a non-uniform structure throughout and no structural annotations or information. Suiting our problem of large scarce-annotation documents, we found one such dataset in the Indian legal court setting, named ILDC (Malik et al., 2021), and the same requirement is met by a subset of the LexGLUE dataset (Chalkidis et al., 2022b). The ILDC dataset includes 39898 highly unstructured English-language case transcripts from the Supreme Court of India (SCI), where the final decisions have been removed from the end of each document. Upon analyzing the documents from their sources and the dataset, we found that they are highly unstructured and noisy. The initial decision between "rejected" and "accepted" made by the SCI judge(s) serves as the decision label of each document. To assess how well the judgment prediction algorithms explain themselves, a portion of the corpus (ILDCexpert, a separate test set of 56 documents) is labeled with gold-standard explanations by five distinct legal experts; these are pieces of text selected from the document that are most relevant to the judgment. We use this to evaluate our extractive explanation algorithm ORSE.

The LexGLUE dataset (Chalkidis et al., 2022b) comprises seven datasets from the European Union and US court case settings, for uniformly assessing model performance across a range of legal NLP tasks. From these we choose ECtHR (Task A), ECtHR (Task B), and SCOTUS, as they are classification tasks involving our problem of long unstructured legal documents. ECtHR (A and B) contain court cases from the European Court of Human Rights, labeled with the articles of the European Convention on Human Rights (ECHR) that were violated (Task A) or allegedly violated (Task B). The dataset contains factual paragraphs from the description of the cases. SCOTUS consists of court cases from the highest federal court in the United States of America, with metadata from SCDB (http://scdb.wustl.edu/). The number of labels, the average and maximum document lengths (in tokens), and the task descriptions can be found in Table 1. The tokenization in Table 1 is done using the tokenizer of GPT-J.

ILDC, ECtHR (Task A), ECtHR (Task B), and SCOTUS are a good fit for our problem and for testing our approach. For performance comparison on LexGLUE, we used the SOTA benchmark of Chalkidis et al. (Chalkidis et al., 2022b), and for ILDC we used the benchmark from its paper (Malik et al., 2021) and the experiments of Shounak et al. (Paul et al., 2022) on their models specifically pre-trained on Indian legal cases. Since LexGLUE lacks an explanation set like ILDCexpert, we could not test the effectiveness of ORSE on LexGLUE.

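Table 1. Dataset statistics (document lengths in tokens, given as average / maximum)

| Name | Documents (Train) | Documents (Validation) | Documents (Test) | Avg / Max length (Train) | Avg / Max length (Validation) | Avg / Max length (Test) | No. of labels | Problem Type |
|---|---|---|---|---|---|---|---|---|
| ILDC | 37387 | 994 | 1517 | 4120 / 501275 | 5104 / 58048 | 5238 / 55703 | 2 | Binary |
| ECtHR(A) | 9000 | 1000 | 1000 | 2011 / 46500 | 2210 / 18352 | 2401 / 20835 | 10 | Multi-Label |
| ECtHR(B) | 9000 | 1000 | 1000 | 2011 / 46500 | 2210 / 18352 | 2401 / 20835 | 10 | Multi-Label |
| SCOTUS | 5000 | 1400 | 1400 | 8291 / 126377 | 12639 / 56310 | 12597 / 124955 | 13 | Multi-Class |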

5. Results and discussion

Table 2. Custom-finetuning results on the chosen pre-trained transformer encoder language models (in Section 4), e = epoch

$\alpha$: fine-tuned and evaluated with 512 input length, $\beta$: evaluating $\alpha$ on its maximum input length, $\gamma$: fine-tuned and evaluated with maximum input length.

LexGLUE datasets ($\mu$-F1/m-F1):

| Dataset | Encoder | Variant | Validation | Test | e |
|---|---|---|---|---|---|
| ECtHR (A) | LEGAL-BERT | $\alpha$ | 0.6408/0.5095 | 0.6285/0.4866 | 4 |
| ECtHR (A) | GPT-Neo 1.3B | $\alpha$ | 0.6705/0.5940 | 0.6619/0.5659 | 2 |
| ECtHR (A) | GPT-Neo 1.3B | $\beta$ | 0.6708/0.5958 | 0.6620/0.5716 | 2 |
| ECtHR (A) | GPT-Neo 2.7B | $\alpha$ | 0.6815/0.5833 | 0.6849/0.5445 | 2 |
| ECtHR (A) | GPT-Neo 2.7B | $\beta$ | 0.6739/0.5947 | 0.6811/0.5649 | 2 |
| ECtHR (A) | GPT-J 6B | $\alpha$ | 0.7260/0.6715 | 0.7142/0.5927 | 3 |
| ECtHR (A) | GPT-J 6B | $\beta$ | 0.7567/0.6945 | 0.7330/0.6245 | 3 |
| ECtHR (A) | GPT-J 6B | $\gamma$ | 0.7855/0.7550 | 0.7451/0.6467 | 3 |
| ECtHR (B) | LEGAL-BERT | $\alpha$ | 0.6961/0.6255 | 0.7089/0.6405 | 3 |
| ECtHR (B) | GPT-Neo 1.3B | $\alpha$ | 0.7459/0.6938 | 0.7542/0.7091 | 2 |
| ECtHR (B) | GPT-Neo 1.3B | $\beta$ | 0.7542/0.6946 | 0.7574/0.7009 | 2 |
| ECtHR (B) | GPT-Neo 2.7B | $\alpha$ | 0.7524/0.7147 | 0.7448/0.6826 | 2 |
| ECtHR (B) | GPT-Neo 2.7B | $\beta$ | 0.7619/0.7252 | 0.7513/0.7072 | 2 |
| ECtHR (B) | GPT-J 6B | $\alpha$ | 0.7769/0.7244 | 0.7715/0.7326 | 3 |
| ECtHR (B) | GPT-J 6B | $\beta$ | 0.8069/0.7611 | 0.8049/0.7631 | 3 |
| ECtHR (B) | GPT-J 6B | $\gamma$ | 0.8308/0.8039 | 0.8316/0.7927 | 3 |
| SCOTUS | LEGAL-BERT | $\alpha$ | 0.7296/0.5924 | 0.6876/0.5357 | 6 |
| SCOTUS | GPT-Neo 1.3B | $\alpha$ | 0.7300/0.6582 | 0.7114/0.6035 | 2 |
| SCOTUS | GPT-Neo 1.3B | $\beta$ | 0.7614/0.6772 | 0.7371/0.6310 | 2 |
| SCOTUS | GPT-Neo 1.3B | $\gamma$ | 0.7731/0.6830 | 0.7502/0.6438 | 2 |
| SCOTUS | GPT-Neo 2.7B | $\alpha$ | 0.7314/0.6571 | 0.7057/0.6025 | 1 |
| SCOTUS | GPT-Neo 2.7B | $\beta$ | 0.7686/0.6851 | 0.7364/0.6564 | 1 |
| SCOTUS | GPT-Neo 2.7B | $\gamma$ | 0.7828/0.6931 | 0.7636/0.6619 | 1 |
| SCOTUS | GPT-J 6B | $\alpha$ | 0.7592/0.6875 | 0.7200/0.6276 | 3 |
| SCOTUS | GPT-J 6B | $\beta$ | 0.7950/0.7295 | 0.7571/0.6625 | 3 |
| SCOTUS | GPT-J 6B | $\gamma$ | 0.8178/0.7513 | 0.7850/0.7196 | 3 |

ILDC (accuracy(%)/m-F1):

| Dataset | Encoder | Variant | Validation | Test | e |
|---|---|---|---|---|---|
| ILDC | InLegalBERT | $\alpha$ | 76.15/76.8 | 76.00/76.10 | 4 |
| ILDC | GPT-Neo 1.3B | $\alpha$ | 74.25/0.7421 | 72.91/0.7291 | 1 |
| ILDC | GPT-Neo 1.3B | $\beta$ | 76.66/0.7662 | 77.26/0.7725 | 1 |
| ILDC | GPT-Neo 2.7B | $\alpha$ | 76.96/0.7675 | 74.29/0.7424 | 1 |
| ILDC | GPT-Neo 2.7B | $\beta$ | 81.59/0.8144 | 81.21/0.8118 | 1 |
| ILDC | GPT-J 6B | $\alpha$ | 75.15/0.7511 | 73.96/0.7396 | 1 |
| ILDC | GPT-J 6B | $\beta$ | 79.78/0.7972 | 81.93/0.8192 | 1 |
| ILDC | GPT-J 6B | $\gamma$ | 83.60/0.8347 | 83.72/0.8366 | 1 |

Table 3. Results on LexGLUE for different configurations of MESc (* is the encoder model used for embedding extraction). All values are $\mu$-F1/m-F1.

| Encoder* | $p$ layers, $e$ $\times$ Encoder | Structure Labels | ECtHR (A) Validation | ECtHR (A) Test | ECtHR (B) Validation | ECtHR (B) Test | SCOTUS Validation | SCOTUS Test |
|---|---|---|---|---|---|---|---|---|
| LexGLUE benchmark (Chalkidis et al., 2022b) | - | - | 0.725/0.682 | 0.712/0.647 | 0.797/0.768 | 0.804/0.747 | 0.776/0.633 | 0.766/0.665 |
| LEGAL-BERT* ($\alpha$) | $p$=1, 1 $\times$ | No | 0.7005/0.6118 | 0.6825/0.5806 | 0.7470/0.6791 | 0.7418/0.6890 | 0.7719/0.6926 | 0.7136/0.5916 |
| LEGAL-BERT* ($\alpha$) | $p$=1, 2 $\times$ | No | 0.6984/0.6056 | 0.6923/0.5935 | 0.7507/0.6835 | 0.7386/0.6742 | 0.7729/0.6895 | 0.7152/0.5817 |
| LEGAL-BERT* ($\alpha$) | $p$=4, 1 $\times$ | No | 0.7718/0.6994 | 0.7546/0.6226 | 0.8084/0.7709 | 0.8102/0.7573 | 0.7928/0.6866 | 0.7396/0.5865 |
| LEGAL-BERT* ($\alpha$) | $p$=4, 1 $\times$ | Yes | 0.7652/0.6870 | 0.7582/0.6378 | 0.8087/0.7727 | 0.8122/0.7725 | 0.7959/0.7025 | 0.7525/0.6194 |
| LEGAL-BERT* ($\alpha$) | $p$=4, 2 $\times$ | No | 0.7662/0.6479 | 0.7543/0.6337 | 0.8078/0.7574 | 0.8118/0.7564 | 0.7899/0.6849 | 0.7431/0.6054 |
| LEGAL-BERT* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 0.7682/0.6883 | 0.7618/0.6508 | 0.8089/0.7748 | 0.8157/0.7670 | 0.7952/0.6872 | 0.7550/0.6208 |
| LEGAL-BERT* ($\alpha$) | $p$=4, 3 $\times$ | No | 0.7884/0.6905 | 0.7523/0.6311 | 0.8138/0.7863 | 0.8132/0.7699 | 0.7928/0.6866 | 0.7399/0.5635 |
| LEGAL-BERT* ($\alpha$) | $p$=4, 3 $\times$ | Yes | 0.7714/0.6815 | 0.7510/0.6309 | 0.8075/0.7564 | 0.8100/0.7621 | 0.7729/0.6536 | 0.7392/0.5783 |
| GPT-Neo 1.3B* ($\alpha$) | $p$=2, 2 $\times$ | No | 0.7231/0.6862 | 0.7115/0.6359 | 0.7960/0.7307 | 0.8030/0.7702 | 0.7862/0.7185 | 0.7536/0.6479 |
| GPT-Neo 1.3B* ($\alpha$) | $p$=2, 2 $\times$ | Yes | 0.7407/0.7033 | 0.7273/0.6448 | 0.7939/0.7479 | 0.8040/0.7808 | 0.7797/0.7284 | 0.7646/0.6592 |
| GPT-Neo 1.3B* ($\alpha$) | $p$=4, 2 $\times$ | No | 0.7358/0.7059 | 0.7146/0.6277 | 0.7947/0.7483 | 0.8086/0.7664 | 0.7722/0.7026 | 0.7429/0.6352 |
| GPT-Neo 1.3B* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 0.7248/0.7076 | 0.7068/0.6410 | 0.7968/0.7537 | 0.8060/0.7757 | 0.7651/0.7036 | 0.7418/0.6377 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=2, 2 $\times$ | No | 0.7380/0.6750 | 0.7457/0.6224 | 0.7896/0.7689 | 0.7949/0.7620 | 0.7845/0.7140 | 0.7676/0.6570 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=2, 2 $\times$ | Yes | 0.7634/0.7105 | 0.7567/0.6644 | 0.7986/0.7693 | 0.8072/0.7696 | 0.7988/0.7348 | 0.7627/0.6630 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=4, 2 $\times$ | No | 0.7510/0.6641 | 0.7524/0.6355 | 0.7897/0.7599 | 0.7940/0.7503 | 0.7818/0.7273 | 0.7577/0.6554 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 0.7600/0.6857 | 0.7587/0.6561 | 0.7825/0.7686 | 0.7935/0.7635 | 0.7903/0.7245 | 0.7641/0.6775 |
| GPT-J 6B* ($\alpha$) | $p$=2, 2 $\times$ | No | 0.7516/0.7138 | 0.7222/0.6263 | 0.7997/0.7588 | 0.7931/0.7692 | 0.7888/0.7215 | 0.7505/0.6658 |
| GPT-J 6B* ($\alpha$) | $p$=2, 2 $\times$ | Yes | 0.7529/0.7255 | 0.7163/0.6406 | 0.8048/0.7722 | 0.7977/0.7760 | 0.7791/0.7343 | 0.7598/0.6715 |
| GPT-J 6B* ($\alpha$) | $p$=4, 2 $\times$ | No | 0.7351/0.6743 | 0.7156/0.6118 | 0.7707/0.7414 | 0.7800/0.7605 | 0.7967/0.7211 | 0.7490/0.6333 |
| GPT-J 6B* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 0.7544/0.7377 | 0.7219/0.6437 | 0.7891/0.7534 | 0.7795/0.7625 | 0.7872/0.7268 | 0.7485/0.6593 |
| GPT-J 6B* ($\gamma$) | $p$=2, 2 $\times$ | No | 0.7565/0.7182 | 0.7384/0.6434 | 0.8055/0.7954 | 0.8094/0.7675 | 0.8141/0.7508 | 0.7688/0.6773 |
| GPT-J 6B* ($\gamma$) | $p$=2, 2 $\times$ | Yes | 0.7619/0.7383 | 0.7470/0.6571 | 0.8201/0.8076 | 0.8169/0.7801 | 0.8164/0.7755 | 0.7814/0.6853 |
| GPT-J 6B* ($\gamma$) | $p$=4, 2 $\times$ | No | 0.7512/0.7007 | 0.7296/0.6333 | 0.8142/0.7937 | 0.8113/0.7763 | 0.8170/0.7533 | 0.7728/0.6786 |
| GPT-J 6B* ($\gamma$) | $p$=4, 2 $\times$ | Yes | 0.7586/0.7174 | 0.7484/0.6548 | 0.8192/0.7977 | 0.8134/0.7802 | 0.8248/0.7555 | 0.7867/0.6966 |
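Table 4. Results on ILDC (Malik et al., 2021) for the best configurations of MESc (* is the encoder model used for embedding extraction). All values are Accuracy(%)/m-F1.

| Encoder* | $p$ layers, $e$ $\times$ Encoder | Structure labels | Validation | Test |
|---|---|---|---|---|
| Shounak et al. (Paul et al., 2022) benchmark | - | - | - | -/83.09 |
| ILDC (Malik et al., 2021) benchmark | - | - | - | 77.78/77.79 |
| InLegalBERT* ($\alpha$) | $p$=1, 1 $\times$ | No | 84.10/84.21 | 83.72/83.73 |
| InLegalBERT* ($\alpha$) | $p$=1, 1 $\times$ | Yes | 84.51/84.53 | 83.65/83.65 |
| InLegalBERT* ($\alpha$) | $p$=1, 2 $\times$ | No | 83.90/84.00 | 83.45/83.47 |
| InLegalBERT* ($\alpha$) | $p$=1, 2 $\times$ | Yes | 85.11/85.15 | 83.78/83.78 |
| InLegalBERT* ($\alpha$) | $p$=4, 1 $\times$ | No | 84.30/84.32 | 83.41/83.41 |
| InLegalBERT* ($\alpha$) | $p$=4, 1 $\times$ | Yes | 85.23/85.25 | 84.15/84.15 |
| InLegalBERT* ($\alpha$) | $p$=4, 2 $\times$ | No | 84.30/84.32 | 83.72/83.68 |
| InLegalBERT* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 85.15/85.17 | 84.11/84.13 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=2, 2 $\times$ | No | 84.13/84.12 | 82.97/82.79 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=2, 2 $\times$ | Yes | 84.71/84.67 | 83.65/83.64 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=4, 2 $\times$ | No | 84.10/84.09 | 83.01/83.00 |
| GPT-Neo 2.7B* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 84.30/84.29 | 83.22/83.21 |
| GPT-J 6B* ($\alpha$) | $p$=2, 2 $\times$ | No | 83.43/83.42 | 82.84/82.78 |
| GPT-J 6B* ($\alpha$) | $p$=2, 2 $\times$ | Yes | 84.32/84.31 | 83.21/83.19 |
| GPT-J 6B* ($\alpha$) | $p$=4, 2 $\times$ | No | 83.45/83.46 | 82.73/82.73 |
| GPT-J 6B* ($\alpha$) | $p$=4, 2 $\times$ | Yes | 84.22/84.21 | 83.37/83.36 |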

5.1. Results on classification framework

$\mu$-F1 (micro) and $m$-F1 (macro) are used to measure performance on the LexGLUE dataset, and accuracy (%) and macro-F1 on the ILDC dataset. We place more emphasis on $\mu$-F1 for the LexGLUE dataset, taking the class imbalance into account, while we also consider $m$-F1 to compare performance with the previous LexGLUE benchmarks (Chalkidis et al., 2022b). We list the detailed experimental results for the best configurations of MESc in Tables 3 and 4, and the fine-tuned performance of the LLMs used in Table 2.
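The difference between the two averages under class imbalance can be seen in a small single-label example. The sketch below is illustrative and not the evaluation code used for the reported numbers: the function names are hypothetical, the label set is taken as the union of true and predicted labels, and an undefined F1 (no true or predicted samples) is set to 0, which is an assumption.

```python
from typing import Dict, List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Micro-F1 pools the counts over classes; macro-F1 averages per-class F1."""
    labels = sorted(set(y_true) | set(y_pred))
    counts: Dict[int, List[int]] = {c: [0, 0, 0] for c in labels}  # tp, fp, fn
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted class
            counts[t][2] += 1  # false negative for the true class
    micro = f1(*(sum(c[k] for c in counts.values()) for k in range(3)))
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


# Imbalanced toy data: 8 samples of class 0, 2 of class 1,
# and a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
```

Here the majority-class predictor reaches a micro-F1 of 0.8 but a macro-F1 of about 0.44, because the minority class contributes an F1 of 0 to the unweighted mean.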

5.1.1. Intra-domain(legal) transfer learning:

As can be seen from Table 2, on LexGLUE's subset all the GPT models used here adapt better than LEGAL-BERT, with a minimum gain of $\approx 3$ points on $\mu$-F1 and $\approx 6$ points on m-F1. On the other hand, on the ILDC dataset, the 512-fine-tuned variants evaluated with 512 input length performed worse than or similar to InLegalBERT, while increasing the evaluation input length to 2048 yields an increase of more than 1 point in performance. For GPT-J fine-tuned with 2048 input length, the performance increase compared to its 512 variant is at least $\approx 2$ points on all the datasets. We can see that increasing the input length for fine-tuning helps to capture more feature information for such documents. Also, going from the 1.3 billion parameters of GPT-Neo-1.3B to the 2.7 billion of GPT-Neo-2.7B to the 6 billion of GPT-J, the performance increases by a margin of at least 2 points, which shows the parameter count playing an important role in adapting to and understanding these documents. Even though GPT-Neo and GPT-J are pre-trained on US legal cases (Pile (Gao et al., 2021)), they adapt better to the European and Indian legal documents, with a minimum gain of $\approx 7$ points ($\gamma$) on the ECtHR (A & B) and ILDC datasets over their domain-specific pre-trained counterparts LEGAL-BERT and InLegalBERT, respectively.

5.1.2. Performance with MESc:

Looking at table 3 and table 4 we interpret the results in two directions.

Encoders fine-tuned on 512 input length (α):

For LEGAL-BERT and InLegalBERT on all datasets, MESc achieves a significant increase in performance, by at least 4 points in all metrics, over their fine-tuned LLM counterparts with just the last layer. Combining the last four layers in a 1× encoder yields a performance boost of 4 points or more on the ECtHR datasets, while there is not much improvement on ILDC and SCOTUS. With the approximated structure labels, there is a slight performance increase on the test set of ILDC, with a ≈ 1 point increase on the validation set. The same goes for SCOTUS, with a ≈ 1 point increase on its validation and test sets. With the same configuration and a 2× encoder, we see a much bigger performance gain with the structure labels, achieving new baseline performance on the ECtHR (A), ECtHR (B), and ILDC datasets. For SCOTUS, this improvement over the baseline is only on the validation set. This is because of the high skew of class labels in the test dataset (e.g., label 5 has only 5 samples). With these results, we fixed certain parameters in MESc for further experiments with the extracted embeddings from GPT-Neo and GPT-J. For them, experiments with 2× encoders and only the last layer performed worse than using the last 2 (or 4) layers with 2× encoders, so we exclude those results from this paper. For LexGLUE, as can be seen in table 3, concatenating the embeddings from the last two layers of GPT-Neo or GPT-J had a significant impact over their vanilla fine-tuned variants, by a minimum margin of 3 points for GPT-Neo-1.3B and 1 point for GPT-Neo-2.7B and GPT-J. This increases further by a minimum of 1 point when including the approximated structure labels, showing the importance of having structural information for such sparse-annotated documents.
For ILDC in table 4, concatenating the last four layers did not improve the performance much, while including the generated structure labels did increase the performance by 1 point on the validation set and slightly on the test set.
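The layer-concatenation step discussed above can be illustrated with a short NumPy sketch. It assumes that each encoder layer yields one vector per chunk (which token or pooling produces that vector is not specified here), and the shapes, the helper name `concat_last_layers`, and the random data are hypothetical placeholders rather than the actual MESc extraction code.

```python
import numpy as np


def concat_last_layers(hidden_states, n_last):
    """Concatenate the per-chunk vectors of the last `n_last` layers.

    hidden_states: list with one array per layer, each of shape
    (num_chunks, hidden_dim). Returns (num_chunks, n_last * hidden_dim).
    """
    return np.concatenate(hidden_states[-n_last:], axis=-1)


# Hypothetical sizes: 12 layers, 3 chunks in the document, 768-dim vectors.
rng = np.random.default_rng(0)
num_layers, num_chunks, hidden_dim = 12, 3, 768
hidden_states = [
    rng.standard_normal((num_chunks, hidden_dim)) for _ in range(num_layers)
]

last4 = concat_last_layers(hidden_states, 4)  # BERT-style setting in the text
last2 = concat_last_layers(hidden_states, 2)  # GPT-Neo / GPT-J setting
print(last4.shape, last2.shape)
```

Concatenation keeps each layer's features intact at the cost of a wider chunk embedding, which is one reason the number of layers is treated as a tunable choice.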

Encoders fine-tuned on 2048 input length (γ):

For the sparse-annotated documents in LexGLUE and ILDC, we did a comparative study of the performance of MESc (on GPT-J 6B* (γ)) against its backbone fine-tuned encoder (GPT-J 6B (γ)) (tables 2 and 3) to see the effect of increasing the number of parameters and the input length. GPT-J 6B (γ) fine-tuned on its maximum input length (2048) achieves better (or similar) performance than the MESc overhead trained on its extracted embeddings. For SCOTUS, MESc achieves better performance (2 points, m-F1) on the validation set but lower (2 points, m-F1) on the test set. It achieves almost similar performance (m-F1) on ECtHR (B), 1 point higher (m-F1) on ECtHR (A)’s test set, and lower performance on ILDC. To check whether this is also the case with GPT-Neo-1.3B and GPT-Neo-2.7B, we fine-tuned them with their maximum input length (2048) on SCOTUS (which, through our experiments, can be seen as more difficult to classify). For GPT-Neo (1.3B and 2.7B), fine-tuning on the maximum input length did not show the same results as with GPT-J: compared with GPT-Neo-1.3B (γ) and GPT-Neo-2.7B (γ), even MESc (on GPT-Neo-1.3B (α)) and MESc (on GPT-Neo-2.7B (α)) perform better (> 1 point m-F1), respectively. To analyze this, we plot the distribution of the number of documents with respect to their chunk counts (chunk length = 2048) in the datasets; one such example, for ECtHR, can be found in figure 3. As observed, most of the documents fit within 1-2 chunks (median = 1), which means that with the longer input of 2048, most of the important information is not fragmented during the fine-tuning process of stage 1 and is learned together. Along with this, the higher number of parameters in GPT-J is able to adapt better to most of the documents. Since most (> 90%) of the documents fit in very few chunks, deepening the model with extra layers (stages 3 & 4) does not add value.
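The chunk-count analysis described above can be sketched as follows. The token lengths are made-up values chosen only for illustration, and non-overlapping chunks are assumed; the actual chunking (for example, any overlap between chunks) may differ.

```python
import math
from collections import Counter
from statistics import median

CHUNK_LEN = 2048  # chunk length used in the analysis above


def chunk_count(n_tokens, chunk_len=CHUNK_LEN):
    """Number of non-overlapping chunks needed to cover a document."""
    return max(1, math.ceil(n_tokens / chunk_len))


# Hypothetical document lengths in tokens (not taken from any dataset).
token_lengths = [850, 1200, 1500, 1900, 2048, 2100, 3500, 4097, 9000]

counts = [chunk_count(n) for n in token_lengths]
histogram = Counter(counts)  # chunk count -> number of documents
share_small = sum(1 for c in counts if c <= 2) / len(counts)

for n_chunks in sorted(histogram):
    print(f"{n_chunks} chunk(s): {histogram[n_chunks]} document(s)")
print(f"median chunk count = {median(counts)}")
print(f"share of documents with at most 2 chunks = {share_small:.2f}")
```

When the median chunk count is 1, most documents are seen whole by the stage-1 encoder, so the later hierarchical stages have little extra context to aggregate.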

With these results, we find that:

  (1) Concatenating the last two layers in GPT-Neo (1.3B, 2.7B) or GPT-J provides the optimum number of feature variances. And for BERT-based models, the last 4 layers worked better. Globally concatenating the embeddings helped to get a better approximation of the structure labels and improves the performance.

  (2) MESc adapts well to LLMs (BERT-based models, GPT-Neo-1.3B, GPT-Neo-2.7B) with less than 6 billion parameters (GPT-J).

  (3) MESc works better than its counterpart LLM under the condition that the length of most of the documents in the dataset is much greater than the maximum input length of the LLM.

Figure 3. Number of documents vs. the number of chunks for ECtHR.

5.2. Results on extractive explanation (ORSE)

Table 5. ORSE vs ILDCExpert (Malik et al., 2021)

Metric               Expert 1   Expert 2   Expert 3   Expert 4   Expert 5

Baseline scores
ROUGE-1              0.444      0.517      0.401      0.391      0.501
ROUGE-2              0.303      0.295      0.296      0.297      0.294
ROUGE-L              0.439      0.407      0.423      0.444      0.407
BLEU                 0.16       0.28       0.099      0.093      0.248
Jaccard Similarity   0.333      0.317      0.328      0.324      0.318

ORSE @ k=20% (MESc with InLegalBERT (α))
ROUGE-1              0.4844     0.4815     0.4657     0.4678     0.5083
ROUGE-2              0.3339     0.3125     0.3299     0.331      0.3523
ROUGE-L              0.4679     0.4518     0.4537     0.4577     0.488
BLEU                 0.1682     0.3253     0.0969     0.08       0.2973
Jaccard Similarity   0.346      0.3381     0.3306     0.318      0.3637

ORSE @ k=30% (MESc with InLegalBERT (α))
ROUGE-1              0.5441     0.5108     0.5351     0.5445     0.55
ROUGE-2              0.3939     0.3452     0.4018     0.4078     0.3976
ROUGE-L              0.5266     0.4815     0.5225     0.5338     0.5302
BLEU                 0.2888     0.3979     0.2104     0.1901     0.4049
Jaccard Similarity   0.4051     0.3657     0.3992     0.3896     0.4044

ORSE @ k=40% (MESc with InLegalBERT (α))
ROUGE-1              0.5809     0.5201     0.5857     0.6016     0.5665
ROUGE-2              0.4364     0.3574     0.4597     0.4738     0.4185
ROUGE-L              0.5649     0.4942     0.5741     0.5914     0.5476
BLEU                 0.3918     0.3915     0.3416     0.3221     0.4397
Jaccard Similarity   0.4445     0.3739     0.4535     0.4476     0.4216

ORSE @ k=30% (MESc with GPT-J 6B (α))
ROUGE-1              0.5448     0.5152     0.5327     0.5484     0.5488
ROUGE-2              0.3996     0.3497     0.4009     0.4159     0.3931
ROUGE-L              0.5277     0.4858     0.5182     0.5387     0.5283
BLEU                 0.2834     0.4078     0.2057     0.1861     0.3982
Jaccard Similarity   0.4076     0.372      0.3984     0.3948     0.4057

ORSE @ k=40% (MESc with GPT-J 6B (α))
ROUGE-1              0.5822     0.5297     0.5864     0.6077     0.567
ROUGE-2              0.4414     0.3685     0.4611     0.4816     0.4164
ROUGE-L              0.5659     0.5033     0.573      0.5984     0.5464
BLEU                 0.3854     0.4082     0.3288     0.3113     0.4305
Jaccard Similarity   0.4479     0.3854     0.4563     0.4566     0.4246

We used the two best-performing configurations of MESc on ILDC (InLegalBERT (α) and GPT-J 6B (α)) to extract sentences with ORSE, varying k over 20%, 30%, and 40%. The sentences extracted by our extractive explanation algorithm are compared with the gold explanations given by five different annotators (1, 2, 3, 4, 5) in ILDCExpert. The similarity between the two is measured with ROUGE-1, ROUGE-2, ROUGE-L (Lin, 2004), Jaccard similarity, and BLEU (Papineni et al., 2002), and the results are shown in table 5. We compare the scores from our algorithm with the baseline scores on ILDCExpert (Malik et al., 2021). With InLegalBERT (α) and k = top 20% of ranked sentences, ORSE performs almost similar to the baseline in ROUGE-1 while being slightly better overall in the other metrics. With k = 30%, ORSE surpasses the baselines with a total gain of 19% on ROUGE-1, 31% on ROUGE-2, 22.38% on ROUGE-L, 69.5% on BLEU, and 21.23% on Jaccard Similarity. For k = 40% the gain is much higher, with a total gain of 26.65% on ROUGE-1, 44.49% on ROUGE-2, 30.76% on ROUGE-L, 114% on BLEU, and 32.16% on Jaccard Similarity. The explanations extracted from the GPT-J 6B (α) variant are slightly better than those from InLegalBERT (α) for both k = 30% and k = 40%. Overall, ORSE performs better than the baseline, with the best scores having a total average gain of 50% over the baseline on all the metrics.
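As a worked check of the arithmetic, the sketch below recomputes the gains for k = 40% with InLegalBERT (α) from the values in table 5. It assumes that "total gain" means the relative increase of the scores summed over the five experts; this reading is an assumption, not something stated explicitly in the text, although it gives values close to the figures quoted above.

```python
# Scores copied from table 5, one value per expert annotator (1..5).
baseline = {
    "ROUGE-1": [0.444, 0.517, 0.401, 0.391, 0.501],
    "ROUGE-2": [0.303, 0.295, 0.296, 0.297, 0.294],
    "ROUGE-L": [0.439, 0.407, 0.423, 0.444, 0.407],
    "BLEU": [0.16, 0.28, 0.099, 0.093, 0.248],
    "Jaccard": [0.333, 0.317, 0.328, 0.324, 0.318],
}
orse_k40 = {  # ORSE @ k=40%, MESc with InLegalBERT
    "ROUGE-1": [0.5809, 0.5201, 0.5857, 0.6016, 0.5665],
    "ROUGE-2": [0.4364, 0.3574, 0.4597, 0.4738, 0.4185],
    "ROUGE-L": [0.5649, 0.4942, 0.5741, 0.5914, 0.5476],
    "BLEU": [0.3918, 0.3915, 0.3416, 0.3221, 0.4397],
    "Jaccard": [0.4445, 0.3739, 0.4535, 0.4476, 0.4216],
}


def total_gain(base, new):
    """Relative gain in percent of the summed scores (assumed definition)."""
    return 100.0 * (sum(new) - sum(base)) / sum(base)


gains = {metric: total_gain(baseline[metric], orse_k40[metric]) for metric in baseline}
average_gain = sum(gains.values()) / len(gains)

for metric, gain in gains.items():
    print(f"{metric}: {gain:.2f}%")
print(f"average over metrics: {average_gain:.1f}%")
```

Averaging the five per-metric gains in this way gives a value just under 50%, which matches the rounded overall figure.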

6. Conclusion

We explore the problem of classifying large and unstructured legal documents and develop a multi-stage hierarchical classification framework (MESc). We examine the effect of including structural information through our approximated structure labels in such documents, and also explore the impact of combining the embeddings from the last layers of a fine-tuned transformer encoder model in MESc. Along with BERT-based LLMs, we also explored the adaptability of larger LLMs (GPT-Neo and GPT-J), with multi-billion parameters, to MESc. We check MESc’s limits (section 5.1.2) with these LLMs to suggest the optimal condition for its performance. GPT-Neo and GPT-J adapted well to legal cases from India and Europe even though they were pre-trained only on US legal case documents, showing the intra-domain (legal) transfer learning capacity of these multi-billion parameter language models. Our experiments achieve a new benchmark in the classification of the ILDC and the LexGLUE subset (ECtHR (A), ECtHR (B), and SCOTUS). For the explanation in such hierarchical models, we developed an extractive explanation algorithm (ORSE) based on the sensitivity of a model to its inputs at each level of the hierarchy. ORSE ranks the sentences according to their impact on the prediction/classification and achieves an average performance gain of 50% on ILDCExpert over the previous benchmark. We aim to further develop the explanation algorithm to adapt it to a general neural framework in our future work. Alongside, we also aim to leverage this work in-domain on French and European legal cases by further exploring the problem of length and non-uniform structure in these legal case documents.

7. Ethical Considerations

Our work aligns with the ethical considerations of the datasets (ILDC (Malik et al., 2021) and LexGLUE (Chalkidis et al., 2022b)) used here for the experimentation and evaluation of our approach. We add certain points to this. The framework developed here is in no way meant to create a “robotic” judge or to replace one in real life. Rather, we try to create such frameworks to analyze how deep learning and natural language processing techniques can be applied to legal documents to extract and provide legal professionals with patterns and insights that may not be implicitly visible. The methods developed here are in no way foolproof for predicting judgments and generating an explanatory response, and should not be used for this purpose in real-life settings (courts) or to guide people unfamiliar with legal proceedings. The results from our framework should not be used by a non-professional to make high-stakes decisions in one’s life concerning legal cases.

Acknowledgements.
This work is supported by the LAWBOT project (ANR-20-CE38-0013) and HPC/AI resources from GENCI-IDRIS grant number 2022-AD011013937.

References

  • Ainslie et al. (2020) Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding Long and Structured Inputs in Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 268–284. https://doi.org/10.18653/v1/2020.emnlp-main.19
  • Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. https://doi.org/10.48550/ARXIV.2004.05150
  • Black et al. (2021) Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  • Chalkidis et al. (2019) Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4317–4323. https://doi.org/10.18653/v1/P19-1424
  • Chalkidis et al. (2022a) Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022a. An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification. https://arxiv.org/abs/2210.05529
  • Chalkidis et al. (2020) Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 2898–2904. https://doi.org/10.18653/v1/2020.findings-emnlp.261
  • Chalkidis et al. (2021) Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, and Prodromos Malakasiotis. 2021. Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 226–241. https://doi.org/10.18653/v1/2021.naacl-main.22
  • Chalkidis et al. (2022b) Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022b. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 4310–4330. https://doi.org/10.18653/v1/2022.acl-long.297
  • Chen et al. (2019) Huajie Chen, Deng Cai, Wei Dai, Zehui Dai, and Yadong Ding. 2019. Charge-Based Prison Term Prediction with Deep Gating Network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 6362–6367. https://doi.org/10.18653/v1/D19-1667
  • Cui et al. (2022) Junyun Cui, Xiaoyu Shen, Feiping Nie, Z. Wang, Jinglong Wang, and Yulong Chen. 2022. A Survey on Legal Judgment Prediction: Datasets, Metrics, Models and Challenges. ArXiv abs/2204.04859 (2022).
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
  • Feng et al. (2022) Yi Feng, Chuanyi Li, and Vincent Ng. 2022. Legal Judgment Prediction: A Survey of the State of the Art. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Lud De Raedt (Ed.). International Joint Conferences on Artificial Intelligence Organization, 5461–5469. https://doi.org/10.24963/ijcai.2022/765 Survey Track.
  • Gao et al. (2021) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR abs/2101.00027 (2021). arXiv:2101.00027 https://arxiv.org/abs/2101.00027
  • Jiang et al. (2018) Xin Jiang, Hai Ye, Zhunchen Luo, WenHan Chao, and Wenjia Ma. 2018. Interpretable Rationale Augmented Charge Prediction System. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Santa Fe, New Mexico, 146–151. https://aclanthology.org/C18-2032
  • Katju (2019) Justice Markandey Katju. 2019. Backlog of cases crippling judiciary. (2019). https://www.tribuneindia.com/news/archive/comment/backlog-of-cases-crippling-judiciary-776503
  • Kaufman et al. (2019) Aaron Russell Kaufman, Peter Kraft, and Maya Sen. 2019. Improving Supreme Court Forecasting Using Boosted Decision Trees. Political Analysis 27, 3 (2019), 381–387. https://doi.org/10.1017/pan.2018.59
  • Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rkgNKkHtvB
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013
  • Malik et al. (2021) Vijit Malik, Rishabh Sanjay, Shubham Kumar Nigam, Kripa Ghosh, Shouvik Kumar Guha, Arnab Bhattacharya, and Ashutosh Modi. 2021. ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation. CoRR abs/2105.13562 (2021). arXiv:2105.13562 https://arxiv.org/abs/2105.13562
  • McInnes et al. (2017) Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 205. https://doi.org/10.21105/joss.00205
  • McInnes et al. (2018) Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://doi.org/10.48550/ARXIV.1802.03426
  • Nallapati and Manning (2008) Ramesh Nallapati and Christopher D. Manning. 2008. Legal Docket Classification: Where Machine Learning Stumbles. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii, 438–446. https://aclanthology.org/D08-1046
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
  • Pappagari et al. (2019) Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical Transformers for Long Document Classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 838–844. https://doi.org/10.1109/ASRU46091.2019.9003958
  • Paul et al. (2022) Shounak Paul, Arpan Mandal, Pawan Goyal, and Saptarshi Ghosh. 2022. Pre-training Transformers on Indian Legal Text. https://doi.org/10.48550/ARXIV.2209.06049
  • Petsiuk et al. (2018) Vitali Petsiuk, Abir Das, and Kate Saenko. 2018. RISE: Randomized Input Sampling for Explanation of Black-box Models. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018. BMVA Press, 151. http://bmvc2018.org/contents/papers/1064.pdf
  • Prasad et al. (2022) Nishchal Prasad, Mohand Boughanem, and Taoufiq Dkaki. 2022. Effect of Hierarchical Domain-specific Language Models and Attention in the Classification of Decisions for Legal Cases. In Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France, July 4-7, 2022 (CEUR Workshop Proceedings, Vol. 3178). CEUR-WS.org. http://ceur-ws.org/Vol-3178/CIRCLE_2022_paper_21.pdf
  • Tay et al. (2022) Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, and Donald Metzler. 2022. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? https://doi.org/10.48550/ARXIV.2207.10551
  • Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. 2022. LaMDA: Language Models for Dialog Applications. arXiv:2201.08239 [cs.CL]
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
  • Trautmann et al. (2022) Dietrich Trautmann, Alina Petrova, and Frank Schilder. 2022. Legal Prompt Engineering for Multilingual Legal Judgement Prediction. CoRR abs/2212.02199 (2022). https://doi.org/10.48550/arXiv.2212.02199 arXiv:2212.02199
  • Tuggener et al. (2020) Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 1235–1241. https://aclanthology.org/2020.lrec-1.155
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
  • Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
  • Xiao et al. (2018) Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018. CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. CoRR abs/1807.02478 (2018). arXiv:1807.02478 http://arxiv.org/abs/1807.02478
  • Xu et al. (2020) Nuo Xu, Pinghui Wang, Long Chen, Li Pan, Xiaoyan Wang, and Junzhou Zhao. 2020. Distinguish Confusing Law Articles for Legal Judgment Prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 3086–3095. https://doi.org/10.18653/v1/2020.acl-main.280
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 1480–1489. https://doi.org/10.18653/v1/N16-1174
  • Ye et al. (2018) Hai Ye, Xin Jiang, Zhunchen Luo, and Wenhan Chao. 2018. Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1854–1864. https://doi.org/10.18653/v1/N18-1168
  • Yu et al. (2022) Fangyi Yu, Lee Quartey, and Frank Schilder. 2022. Legal Prompting: Teaching a Language Model to Think Like a Lawyer. arXiv:2212.01326 [cs.CL]
  • Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 17283–17297. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
  • Zeiler and Fergus (2014) Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 818–833.
  • Zhang et al. (2019) Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 5059–5069. https://doi.org/10.18653/v1/P19-1499
  • Zheng et al. (2021) Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (São Paulo, Brazil) (ICAIL ’21). Association for Computing Machinery, New York, NY, USA, 159–168. https://doi.org/10.1145/3462757.3466088
  • Zhong et al. (2018) Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Chaojun Xiao, Zhiyuan Liu, and Maosong Sun. 2018. Legal Judgment Prediction via Topological Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 3540–3549. https://doi.org/10.18653/v1/D18-1390
  • Zhong et al. (2020a) Haoxi Zhong, Yuzhong Wang, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020a. Iteratively Questioning and Answering for Interpretable Legal Judgment Prediction. Proceedings of the AAAI Conference on Artificial Intelligence 34, 01 (Apr. 2020), 1250–1257. https://doi.org/10.1609/aaai.v34i01.5479
  • Zhong et al. (2020b) Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020b. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5218–5230. https://doi.org/10.18653/v1/2020.acl-main.466