A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents

Nishchal Prasad (IRIT, Toulouse, France), Mohand Boughanem (IRIT, Toulouse, France), and Taoufiq Dkaki (IRIT, Toulouse, France)

Abstract.

Automatic legal judgment prediction and its explanation suffer from the problem of long case documents, which generally exceed tens of thousands of words and have a non-uniform structure. Predicting judgments from such documents and extracting their explanation is a challenging task, more so on documents with no structural annotation. We define this problem as “scarce annotated legal documents” and address their lack of structural information and their long lengths with a deep-learning-based classification framework for judgment prediction, which we call MESc (“Multi-stage Encoder-based Supervised with-clustering”). We explore the adaptability of LLMs with multi-billion parameters (GPT-Neo and GPT-J) to legal texts and their intra-domain (legal) transfer learning capacity. Alongside this, we compare their performance and adaptability with MESc and the impact of combining embeddings from their last layers. For such hierarchical models, we also propose an explanation extraction algorithm named ORSE (Occlusion sensitivity-based Relevant Sentence Extractor), based on the input-occlusion sensitivity of the model, to explain the predictions with the most relevant sentences from the document. We test the effectiveness of these methods with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States, using the ILDC dataset and a subset of the LexGLUE dataset. MESc achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods, while ORSE applied on MESc achieves a total average gain of 50% over the baseline explainability scores.

Extractive explanation, Legal judgment prediction, Scarce annotated documents, Long document classification, Multi-stage classification framework

1. Introduction

A legal case proceeding cycle generally involves pre-filing and investigation, pleadings, discovery, pre-trial motions, trial, post-trial motions and appeals, and enforcement (see https://www.law.cornell.edu/wex/civil_procedure). At many steps in this cycle (case filing, appeals, etc.), the lawyer or the judge needs to analyze each case and devote a certain amount of time to reach a conclusion. This can involve analyzing vast amounts of data and legal precedents, which can be a time-consuming process given the complexity and length of the case. The number of legal cases in a country is also proportionally related to its population. This leads to a backlog of cases, especially in countries with huge populations, ultimately setting back the progress of its legal system (Katju, 2019; https://www.globaltimes.cn/page/202204/1260044.shtml).

Automating such legal case procedures can help speed up and strengthen the decision-making process, saving time and benefiting both the legal authorities and the people involved. (The scope of this work is within the ethical considerations discussed in Section 7.)

One of the fundamental problems within this larger goal is the prediction of the outcome based only on the case’s raw text (which includes the facts, arguments, appeals, etc., but not the final decision), which corresponds to the typical real-life scenario. Alongside the prediction, the reasoning as to why such a prediction (decision/judgment) was made is essential to understand, rely upon, and use that prediction in a more informative way.

Many machine learning techniques have been applied to legal texts in the past to predict judgments as a text classification problem (Feng et al., 2022; Cui et al., 2022). While this may seem like a general text classification task, legal texts differ from general texts and are more complex, broadly in two ways: structure and syntax, and lexicon and grammar (Zhong et al., 2020b; Chalkidis et al., 2020; Nallapati and Manning, 2008). The structure of legal case documents (preamble, facts, appeals, etc.) is not uniform in most settings, and their complex syntax and lexicon make them difficult to annotate, requiring legal professionals. This compounds another challenge, the long length of legal case documents, which can exceed 10000 tokens (Table 1). The lack of structure information and the long lengths of legal documents can be defined as a more specific text classification problem, which in our work we call “scarce-annotated documents”.

What are scarce-annotated legal documents? Legal documents have a structure that comprises a preamble, facts, justifications, case arguments, appeals, and old decisions. Except for the preamble (the heading), the remaining parts (with varying lengths) can occur in no specific order. This order also varies across individual documents, making them non-uniform in nature. In such a case, it becomes necessary to provide labels (annotations) defining this structure, so that a model can identify the relevant parts of the document and make a better prediction. These relevant parts also get hidden as the length of the document increases (to tens of thousands of tokens), making the document noisy. We define this scarcity of structure information in large legal case documents as scarce-annotated. Note that the documents do have their class (final decision/judgment) label(s), but they carry no other information regarding their structure.

We explore this problem of classification of large scarce-annotated legal documents with their explanation.

For classification, we approximate the structure by clustering the features generated by a transformer-based language model (after processing individual parts of these documents), and then use the resulting cluster labels alongside the features to make the predictions. We examine whether these approximated labels help in judgment prediction with experiments on four different datasets (ILDC (Malik et al., 2021) and LexGLUE’s (Chalkidis et al., 2022b) ECtHR(A), ECtHR(B) and SCOTUS), which can be found in the later sections of the paper. The work on classification is summarized below:

• We define the problem of judgment prediction from large scarce-annotated legal documents and propose a multi-stage neural classification framework named “Multi-stage Encoder-based Supervised with-clustering” (MESc). It works by extracting embeddings from the last four layers of a fine-tuned encoder of a large language model (LLM) and applying an unsupervised clustering mechanism on them to approximate structure labels. Alongside the embeddings, these labels are processed through another set of transformer encoder layers for final classification.
• With ablation on MESc, we show the importance of combining features from the last layers of transformer-based large language models (LLMs) (BERT (Devlin et al., 2019), GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021)) for such documents, and the impact of the approximated structure labels on their predictions.
• We also study the adaptability of multi-billion-parameter LLMs to the hierarchical framework and to scarce-annotated legal documents, while studying their intra-domain (legal) transfer learning capacity (both with fine-tuning and in MESc).
• MESc achieves a minimum gain of $\approx$ 2 points in classification on ILDC and the LexGLUE subset.

To explain the predictions, we developed a novel explanation extraction algorithm to rank and extract the relevant sentences that impacted the prediction. These sentences cannot exactly serve as the explanation, as they are just marked sentences. However, in a situation where no annotations are available to train an abstractive explanation algorithm, they can serve as a representative for an explanation, guiding an expert on what led to a certain prediction. The results also showcase the challenge associated with such documents, leaving scope for future developments.

The work on explanation is summarized below:

• We propose an extractive explanation algorithm, Occlusion sensitivity-based Relevant Sentence Extractor (ORSE), based on the input sensitivity of a model, which ranks relevant sentences from a document to serve as an explanation for its predicted class.
• We test ORSE for explanation extraction on ILDC_expert (Malik et al., 2021), with GPT-J (Wang and Komatsuzaki, 2021) and InLegalBERT (Paul et al., 2022), obtaining new benchmarks.

2. Related works

Several strategies have been investigated in the past with machine-learning techniques to predict the result of legal cases in specific categories (criminal, civil, etc.) with rich annotations (Xiao et al. (Xiao et al., 2018), Xu et al. (Xu et al., 2020), Zhong et al. (Zhong et al., 2018), Chen et al. (Chen et al., 2019)). These studies on well-structured and annotated legal documents show the effect and importance of having good structural information. While creating such a dataset is both time and resource (highly skilled) demanding, researchers have worked on legal documents in a more general and raw setting.

Chalkidis et al. (Chalkidis et al., 2019) presented a dataset of European Court of Human Rights case proceedings in English, with each case assigned a score indicating its importance. They described a Legal Judgment Prediction (LJP) task for their dataset, which seeks to predict the outcome of a court case using the case facts and law violations. They also curated another version of this dataset (Chalkidis et al., 2021) to give a rational explanation for the decision prediction made on them. In the US case setting, Kaufman et al. (Kaufman et al., 2019) used AdaBoost decision tree to predict the U.S. Supreme Court rulings. Tuggener et al. (Tuggener et al., 2020) curated LEDGAR, a multilabel dataset of legal provisions in US contracts. Malik et al. (Malik et al., 2021) curated the Indian Legal Document Corpus (ILDC) of unannotated and unstructured documents, and used it to build baseline models for their Case Judgment Prediction and Explanation (CJPE) task upon which Prasad et al. (Prasad et al., 2022) showed the possibility of intra-domain(legal) transfer learning using LEGAL-BERT on Indian legal texts (ILDC).

Pretrained language models based on transformers (Devlin et al. (Devlin et al., 2019), Vaswani et al. (Vaswani et al., 2017)) have shown widespread success in all fields of natural language processing (NLP), but only for short texts spanning a few hundred tokens. There have been several approaches to handle longer sequences with transformer encoders (Beltagy et al. (Beltagy et al., 2020), Kitaev et al. (Kitaev et al., 2020), Zaheer et al. (Zaheer et al., 2020), Ainslie et al. (Ainslie et al., 2020)), but these demand expensive domain-specific pretraining for adaptation to legal texts and are not guaranteed to scale compared to their vanilla counterparts (Tay et al. (Tay et al., 2022)). Since we try to recover the structure information of the document, we choose to process the document in short sequences rather than as a whole. These short sequences help us to learn and approximate their structure labels. So, we take a different approach to handle large documents with smaller pre-trained transformer encoder models (such as BERT (Devlin et al., 2019)), based on the hierarchical idea of “divide, learn and combine” (Chalkidis et al. (Chalkidis et al., 2022a), Zhang et al. (Zhang et al., 2019), Yang et al. (Yang et al., 2016)), where the document is split (into parts, then sentences and words, etc.) and the features of each component are learned and combined hierarchically from the bottom up to get the whole document’s representation.

The domain-specific pre-training of transformer encoders has also accelerated the development of NLP in legal systems, with better performance compared to the general pre-trained variants (Chalkidis et al. (Chalkidis et al., 2020)’s LEGAL-BERT trained on court cases of the US, UK, and EU, Zheng et al. (Zheng et al., 2021)’s BERT trained on the US court cases dataset CaseHOLD, Paul et al. (Paul et al., 2022)’s InLegalBERT and InCaseLawBERT trained on Indian legal cases). Recently, with the emergence of multi-billion-parameter LLMs such as GPT-3 (Brown et al., 2020), LLaMA (Touvron et al., 2023), and LaMDA (Thoppilan et al., 2022), and their superior performance in natural language understanding, researchers have tried to adapt (with few-shot learning) their smaller variants to legal texts (Trautmann et al. (Trautmann et al., 2022), Yu et al. (Yu et al., 2022)). We check their adaptability with the hierarchical idea of “divide, learn and combine” and their intra-domain (legal) transfer learning with full fine-tuning, compared to intra-domain pre-training (as done in LEGAL-BERT and InLegalBERT). To do so, we use three such variants of GPT (GPT-Neo (1.3 and 2.7) (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021)) pre-trained on the Pile (Gao et al., 2021), which has a subset (FreeLaw) of court opinions from US legal cases.

To rely upon the judgment prediction of a legal case, an explanation leading to that judgment is of paramount importance. In scenarios where explanation annotations of legal texts are lacking, an extractive explanation method is a good fit to create interpretations of the predicted judgments. To provide interpretations for their judgment prediction task, Zhong et al. (Zhong et al., 2020a) employed deep reinforcement learning and created a “question-answering” based model called QAjudge. Jiang et al. (Jiang et al., 2018) extract readable snippets of text (called rationales) from legal texts using reinforcement learning for their judgment classification problem. To improve the interpretability of charge prediction systems, Ye et al. (Ye et al., 2018) propose a label-conditioned Seq2Seq model which, for a given predicted charge, chooses relevant reasoning in the legal document. Since we try to develop an extractive explanation algorithm that uses no training data and relies solely on a trained model, we turn toward the idea of the input sensitivity of a model (Zeiler et al. (Zeiler and Fergus, 2014), Petsiuk et al. (Petsiuk et al., 2018)), which has been used in the interpretation of computer vision models: the pixels are scored (according to a scoring parameter) against their absence in the input, and are finally chosen according to the desirability of the scores (higher or lower).

3. Method

Figure 1. MESc classification framework.

3.1. Classification Framework (MESc)

To handle large documents MESc architecture shares the general hierarchical idea of divide, learn and combine ((Chalkidis et al., 2022a), (Zhang et al., 2019), (Yang et al., 2016)) but it differs from the previous works in the following main aspects:

a) It employs custom fine-tuning and uses the last four layers of the fine-tuned transformer encoder for extracting representations for parts(chunks) of the document.

b) It approximates the document structure by applying unsupervised learning (clustering) on these representations’ embeddings and uses this information alongside, for classification.

c) Different configurations of transformer encoder layers are used and experimented with to attend over the combined embeddings, learning the inter-chunk representation to get a global document representation.

d) It divides the process into four stages: custom fine-tuning, extracting embeddings, processing the embeddings (supervised + unsupervised learning), and classification. An overview of MESc can be seen in Figure 1. The stages are detailed below.

An input document $D$ is tokenized into a sequence of tokens, $D=\{t_{1,D},t_{2,D},\cdots,t_{L_{D},D}\}$ via a tokenizer specific to a chosen pre-trained language model (BERT, GPT etc.), where $t\in\mathbb{N}$ and $\mathbb{N}$ is the vocabulary of the tokenizer. This token sequence is split into a set of blocks $\{C_{1,D},C_{2,D},\cdots,C_{N_{D},D}\}$, each overlapping the previous block by $o$ tokens, which we call chunks. Each chunk is $C_{i,D}=\{t_{(i-1)(c-o)+1,D},\cdots,t_{\min((i-1)(c-o)+c,\,L_{D}),D}\}$, with $c$ being the maximum number of tokens in a chunk, which is a predefined parameter for MESc (e.g. 512). $N_{D}=\lceil\frac{L_{D}}{c-o}\rceil$ is the total number of chunks for a document having $L_{D}$ tokens in total, with $o\ll c$ . $N_{D}$ varies with the length of the document.
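The chunking step can be illustrated with a minimal Python sketch. It assumes a stride of $c-o$ tokens between chunk starts, which is what the count $N_{D}=\lceil\frac{L_{D}}{c-o}\rceil$ implies; the function name `split_into_chunks` and the integer stand-in tokens are hypothetical and not taken from the MESc implementation.

```python
import math


def split_into_chunks(tokens, c=512, o=90):
    """Split a token sequence into chunks of at most c tokens.

    Consecutive chunks start c - o tokens apart, so each chunk shares
    its first o tokens with the end of the previous one.
    """
    if not 0 <= o < c:
        raise ValueError("overlap o must satisfy 0 <= o < c")
    stride = c - o
    # N_D = ceil(L_D / (c - o)), as in the text.
    n_chunks = math.ceil(len(tokens) / stride)
    return [tokens[i * stride: i * stride + c] for i in range(n_chunks)]


# Stand-in token ids 1..1000 (a real tokenizer would produce vocabulary ids).
tokens = list(range(1, 1001))
chunks = split_into_chunks(tokens, c=512, o=90)
print(len(chunks), [len(ch) for ch in chunks])
```

Only the last chunk can be shorter than $c$; in practice it would be padded to the encoder's input length.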

Stage 1 - Custom fine-tuning:

To each chunk of a document, we associate the document label $l_{D}$ and combine them together to form a token matrix:

$$I_{D}\in\mathbb{R}^{N_{D}\times c\times 1}\leftarrow[\{C_{1,D},l_{D}\},\{C_{2,D},l_{D}\},\cdots,\{C_{N_{D},D},l_{D}\}] \tag{1}$$

This is used as input for the document for fine-tuning the pre-trained encoder, where $N_{D}$ is the batch size for one pass through the encoder.

This allows the encoder to adapt to the domain-specific legal texts, which helps get richer features for the next stage.

Stage 2 - Extracting chunk embeddings:

For a document, we pass its chunks $C_{i,D}$ through the fine-tuned encoder and extract their representation embeddings ( $E_{i,D}$ ) from the last $l$ layers. $E_{i,D}\in\mathbb{R}^{l\times d}$ , where $d$ is the feature dimension (we use $l=4$ ). The representation embedding can be either the first token (as in BERT) or the last token for causal language models (as in GPT). We accumulate all $E_{i,D}$ of a document to form an embedding matrix:

$$E_{D}\in\mathbb{R}^{N_{D}\times l\times d}\leftarrow\bigl[E_{1,D},E_{2,D},\cdots,E_{N_{D},D}\bigr] \tag{2}$$

The $E_{i,D}$ acts as a representation of the chunk in this context, and combining them yields an approximate representation of the entire document. Doing this for all the documents gives us generated training data.

Stage 3 - Processing the extracted representations:

Since the features extracted from the last layers of a fine-tuned encoder lie in different embedding spaces, their contribution can be either positive or redundant. So for this stage, we choose to combine the last $p\leq l$ layers in $E_{D}$ for further training. We experiment with different values of $p$ before fixing one. This gives $E_{D}^{(p)}\in\mathbb{R}^{N_{D}\times p\times d},p\in\{1,2,3,4\}$ . (We used 1, 2 and 4 in our experiments to compare their effects.) We concatenate the representations from these $p$ layers to get,

$$E_{i,D}^{(p)}\in\mathbb{R}^{pd\times 1}\leftarrow\bigl[E_{i,D}^{(l)}\,|\,E_{i,D}^{(l-1)}\,|\,\cdots\,|\,E_{i,D}^{(l-p+1)}\bigr] \tag{3}$$

This gives,

$$\widehat{E}_{D}^{(p)}\in\mathbb{R}^{N_{D}\times pd}\leftarrow\bigl\{E_{1,D}^{(p)}\,|\,E_{2,D}^{(p)}\,|\,\cdots\,|\,E_{N_{D},D}^{(p)}\bigr\} \tag{4}$$

We also experimented with the element-wise addition of representations in $E_{D}^{(p)}$ and found their performance to be lower by 1 point in most of the experiments of section 5, hence we exclude it in MESc.
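As a rough illustration of the concatenation in stage 3, the following sketch joins the last $p$ layer embeddings of each chunk into one vector of length $pd$, starting from the last layer. Plain Python lists stand in for tensors, and the helper names `concat_last_layers` and `document_matrix` are hypothetical.

```python
def concat_last_layers(chunk_layers, p):
    """Concatenate the last p layer embeddings of one chunk.

    chunk_layers holds l vectors of length d, ordered from the earliest
    extracted layer to the last one. The result has length p * d and is
    ordered as [last layer | second-to-last | ...].
    """
    n_layers = len(chunk_layers)
    if not 1 <= p <= n_layers:
        raise ValueError("p must satisfy 1 <= p <= l")
    combined = []
    for layer in reversed(chunk_layers[n_layers - p:]):
        combined.extend(layer)
    return combined


def document_matrix(doc_layers, p):
    """Stack the concatenated chunk vectors into an N_D x (p * d) matrix."""
    return [concat_last_layers(chunk, p) for chunk in doc_layers]


# Toy chunk with l = 4 layers and d = 3 features per layer.
layers = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
print(concat_last_layers(layers, 2))
```

Element-wise addition, the alternative mentioned above, would instead keep the length at $d$ by summing the $p$ vectors.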

(1) Approximating the structure labels ( $S_{D}$ ) (Unsupervised learning): To get information on the document’s structure, i.e. its parts (facts, arguments, concerned laws, etc.), we use a clustering mechanism (HDBSCAN (McInnes et al., 2017)). We cluster the extracted chunk embeddings of the $p$ chosen layers, $\widehat{E}_{D}^{(p)}$ , to map similar parts of different documents together, where the label of one part of a document is learned from its similarity with a part of another document. The idea is that the embeddings of similar parts from different documents will group together, forming a pool of cluster labels that can help identify each part in the document. One such dummy example can be seen in Figure 2, where the chunk embeddings of documents 1 and 2 (subscripted in this example as document number, then chunk number) learn their cluster (label) pools for arguments of one type, $a_{1}$ = { $E^{(p)}_{1,1},E^{(p)}_{1,2},E^{(p)}_{2,1}$ }, facts of one type, $f_{1}$ = { $E^{(p)}_{1,3},E^{(p)}_{1,5},E^{(p)}_{2,2},E^{(p)}_{2,3},E^{(p)}_{2,5}$ }, and facts of another type, $f_{2}$ = { $E^{(p)}_{1,4},E^{(p)}_{1,6},E^{(p)}_{2,4},E^{(p)}_{2,6}$ }. So for document 1 the approximated structure becomes $S_{1}$ = { $a_{1}$ , $a_{1}$ , $f_{1}$ , $f_{2}$ , $f_{1}$ , $f_{2}$ } and for document 2 it is $S_{2}$ = { $a_{1}$ , $f_{1}$ , $f_{1}$ , $f_{2}$ , $f_{1}$ , $f_{2}$ }. Note that the distinction between a fact, an argument, etc. is made here only for the purpose of illustration. In the actual setting it is unknown, and these structure labels don’t carry any specific name or meaning; they only give the model an understanding of the document’s structure.

Figure 2. An example of clustering of chunk representations of two documents to generate structure labels.

Since the performance of the HDBSCAN clustering mechanism decreases significantly with an increase in data dimension, we use a dimensionality reduction algorithm (pUMAP (McInnes et al., 2018)) before clustering. For all the chunks of a document, their approximated structure labels are combined with the output of step (2) of stage 3, before processing through the final classification stage (stage 4).
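The dummy example of Figure 2 can be reproduced with a small sketch of the bookkeeping around the clustering step. It assumes that some clusterer has already assigned one integer cluster id to every chunk of the corpus, in document order; the clustering itself is not implemented here, and the function name `structure_labels`, the ids, and the name mapping are hypothetical. HDBSCAN implementations typically mark noise points with the id -1, which would simply appear as one more label.

```python
def structure_labels(cluster_ids, chunks_per_doc):
    """Regroup a flat list of per-chunk cluster ids into one label
    sequence S_D per document, preserving chunk order."""
    if sum(chunks_per_doc) != len(cluster_ids):
        raise ValueError("cluster ids do not match the chunk counts")
    sequences, start = [], 0
    for n_chunks in chunks_per_doc:
        sequences.append(cluster_ids[start:start + n_chunks])
        start += n_chunks
    return sequences


# Hypothetical ids for the 6 + 6 chunks of documents 1 and 2.
flat_ids = [0, 0, 1, 2, 1, 2, 0, 1, 1, 2, 1, 2]
S = structure_labels(flat_ids, chunks_per_doc=[6, 6])

# Names are for illustration only; the real labels carry no meaning.
names = {0: "a1", 1: "f1", 2: "f2"}
S_named = [[names[c] for c in doc] for doc in S]
print(S_named)
```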

(2) Global document representation (Supervised learning): For inter-chunk attention, we use transformer encoder layers (Vaswani et al. (Vaswani et al., 2017)), so that one chunk attends to another through multi-head attention followed by a feed-forward neural network (FFN) layer. With respect to a chunk’s position in the document, we add its positional embeddings (Devlin et al., 2019) to $\widehat{E}_{D}^{(p)}$ and process it through $e$ transformer layers $T^{(e)}_{\{h,d_{f}\}}$ , with $h$ attention heads and $d_{f}=pd$ as the dimension of the FFN. $e$ and $h$ are both hyperparameters whose choice depends upon the input feature lengths (Section 5 evaluates different values of these parameters). Since $e\geq 3$ sometimes led to overfitting in our experiments, we fix $e=2$ for MESc. The output is max-pooled and passed through a feed-forward neural network $FFN_{T}$ of $128$ nodes to get:

$$G\left(\widehat{E}_{D}^{(p)}\right)=FFN_{T}\left(\mathrm{maxpool}\left(T^{(e)}_{\{h,d_{f}\}}\left(\widehat{E}_{D}^{(p)}\right)\right)\right)\in\mathbb{R}^{128} \tag{5}$$
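The role of the max-pooling in Eq. 5 can be shown with a short sketch: it reduces a variable number of chunk vectors ( $N_{D}$ differs per document) to one fixed-size vector that a feed-forward layer can consume. The transformer layers and $FFN_{T}$ are omitted, plain lists stand in for tensors, and the function name `max_pool_over_chunks` is hypothetical.

```python
def max_pool_over_chunks(chunk_features):
    """Element-wise maximum over the chunk axis: (N_D x d_f) -> (d_f,)."""
    if not chunk_features:
        raise ValueError("a document needs at least one chunk")
    return [max(column) for column in zip(*chunk_features)]


# Two toy documents with different numbers of chunks, d_f = 2.
short_doc = [[0.1, 0.9], [0.5, 0.2]]
long_doc = [[0.3, 0.1], [0.2, 0.8], [0.7, 0.4]]
pooled_short = max_pool_over_chunks(short_doc)
pooled_long = max_pool_over_chunks(long_doc)
print(pooled_short, pooled_long)
```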

Stage 4 - Classification:

The structure labels and the output of the feed-forward network of stage 3(2) are concatenated together. The result is processed through an internal feed-forward network $FFN_{i}$ (32 nodes, with softmax activation) and an external feed-forward network $FFN_{e}$ ( $u$ label (class) nodes with a task-specific activation function, sigmoid or softmax) to get the output $O(D)$ for a document $D$ , as shown in Equation 6.

$$O\left(D\right)=FFN_{e}\left(FFN_{i}\left(\left[G\left(\widehat{E}_{D}^{(p)}\right)|S_{D}\right]\right)\right)\in\mathbb{R}^{u} \tag{6}$$

$O$ and $G$ are learnt together for final classification while $S_{D}$ is learnt independently.

3.2. Extractive Explanation Algorithm - “Occlusion sensitivity-based Relevant Sentence Extractor (ORSE)”

To understand the relevant parts of the document contributing to the decision prediction, we extract sentences that have a high impact on it. We develop an extractive explanation algorithm for hierarchical classification models (such as MESc or the model of Pappagari et al. (2019)) based on the input sensitivity of the models in the hierarchy. Since this is an extractive process, there is no need for a “pre-summarised” (detailed annotation) explanation, which is required to train an explanation model. A hierarchical model is composed of many individual models divided into levels, where each level is responsible for learning different components of the input, which are combined together when moving up the hierarchy. Taking a document as an example, the input to these models can be processed hierarchically from the bottom up by learning from the words, then combining them into sentences, and then into paragraphs/parts, which can be further accumulated to give a full input representation.

In this work, we target ORSE to explain the reason why a decision prediction is made by our hierarchical predictive model (MESc) where the algorithm processes the document in its two steps of hierarchy.

1) Find the highly sensitive (impactful) chunks (parts of the document).

2) Find the highly sensitive sentences from these chunks.

We define ORSE (Algorithm 1) keeping in mind the long lengths of large documents. It can also be adapted to shorter documents (a few hundred to a thousand tokens), for which the extraction can be done with just the 1st step.

For a classification model $M$ and input $I={\{i_{j}|0\leq j\leq n}\}$ , where $n$ is the input length, consider $M(I)=O_{I}$ to be the prediction without any occlusion where $P$ is the final predicted class label(s), and $M(\{I|i_{j}\})=O_{I}^{(j)}$ to be the prediction after the occlusion of $i_{j}$ in $I$ . The occlusion is done by masking individual parts (e.g. $0$ masking) before feeding into a classification model. If $P$ is from the final model in the hierarchy we take it as the absolute class label with which we rank the inputs for all models in the hierarchy. We define an occlusion-sensitivity impact function $L$ , depending on the classification problem type as,

$$L(O_{I}^{(j)},P)=\begin{cases}CCE_{loss}(O_{I}^{(j)},P),&\text{for multi-class}\\ BCE_{loss}(O_{I}^{(j)},P),&\text{for binary, multi-label}\end{cases} \tag{7}$$

where $CCE_{loss}=$ Categorical-Cross-Entropy loss and $BCE_{loss}=$ Binary-Cross-Entropy loss. Other loss functions can also be used depending on the task. The intuition behind this impact measure is to see, through the change in the loss, how important the occluded part of the input is for a prediction. A higher loss means more impact. $L$ is to be chosen such that it is always $>0$ .

To rank these losses in terms of their impact, we measure the deviance of an input’s occluded component’s impact value from the impact value of the whole input (without any occlusion). With $L$ we compute the “weighted occlusion sensitivity score” $S$ ,

$$S\left(s,L_{I}^{(j)},L_{I}\right)=s\times\left(L_{I}^{(j)}-L_{I}+\delta\right) \tag{8}$$

where $s$ is the score weight, $L_{I}^{(j)}$ is the impact score on occluding $i_{j}$ in $I$ , and $L_{I}$ is the impact of the whole input $I$ with respect to $P$ (the absolute class label). We add the constant $\delta$ to shift the axis so that $S$ stays above $0$ .
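Equations 7 and 8 can be made concrete with a minimal sketch. The loss helpers below are plain textbook cross-entropy definitions written for this illustration; the function names, the probability values, and $\delta=1$ are assumptions, not settings reported in this work.

```python
import math


def cce_loss(probs, label, eps=1e-12):
    """Categorical cross-entropy of class probabilities against the
    index of the absolute predicted label P (multi-class case)."""
    return -math.log(max(probs[label], eps))


def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy against a 0/1 label vector
    (binary or multi-label case)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def sensitivity_score(s, loss_occluded, loss_full, delta=1.0):
    """Weighted occlusion sensitivity score of Eq. 8."""
    return s * (loss_occluded - loss_full + delta)


# Illustrative probabilities: occluding a part lowers the confidence in
# the predicted class (index 1), so its loss and score go up.
full_probs = [0.1, 0.9]
occluded_probs = [0.4, 0.6]
score = sensitivity_score(
    1.0, cce_loss(occluded_probs, 1), cce_loss(full_probs, 1), delta=1.0
)
print(score)
```

A part whose occlusion leaves the prediction unchanged scores exactly $s\times\delta$, so scores above that baseline indicate parts that supported the prediction.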

Algorithm 1 Occlusion sensitivity-based Relevant Sentence Extractor (ORSE)

1: From 3.1, select a classification model $M$ and its backbone fine-tuned encoder $T$ ; $k=\%$ of sentences to choose.

2: for all documents do

3: Divide the document into chunks of length $c$

4: $E\leftarrow$ Extract all chunk embeddings from $T$

5: $O_{E}\leftarrow M(E)$ , probability output.

6: $P\leftarrow$ absolute predicted label from $O_{E}$

7: $L_{E}\leftarrow 0$ , impact with itself $L(P,P)$ (Eq. 7)

8: for chunk $c_{i}\in E$ do

9: Mask $c_{i}$

10: $O_{E}^{(i)}\leftarrow M(\{E|c_{i}\})$ , probability output after masking $c_{i}$

11: $L_{c_{i}}\leftarrow L(O_{E}^{(i)},P)$ (Eq. 7)

12: $S_{c_{i}}\leftarrow S(1,L_{c_{i}},L_{E})$ (Eq. 8)

13: end for

14: $S_{E}\leftarrow$ concatenate all $(c_{i},S_{c_{i}})$

15: $S_{E}\leftarrow$ Sort $S_{E}$ in descending order of $S_{c_{i}}$

16: for ( $C_{i}$ , $s$ ) in $S_{E}$ do

17: $O_{C_{i}}\leftarrow T(C_{i})$ , probability output from $T$

18: $L_{C_{i}}\leftarrow L(O_{C_{i}},P)$ (Eq. 7)

19: Split $C_{i}$ into sentences, $\{s_{j}|1\leq j\leq$ total sentences $\}$

20: for $s_{j}\in C_{i}$ do

21: Mask $s_{j}$

22: $O_{s_{j}}\leftarrow T(\{C_{i}|s_{j}\})$ , probability after masking $s_{j}$

23: $L_{s_{j}}\leftarrow L(O_{s_{j}},P)$ (Eq. 7)

24: $S_{s_{j}}\leftarrow S(s,L_{s_{j}},L_{C_{i}})$ (Eq. 8)

25: $A_{score}\leftarrow$ concatenate all $(i,s_{j},S_{s_{j}})$

26: end for

27: end for

28: Sort $A_{score}$ in descending order of $S_{s_{j}}$

29: $A_{score}[k]\leftarrow$ keep the top $k\%$ sentences.

30: $A_{score}[k]\leftarrow$ rearrange in the order of $(i,s_{j})$

31: end for

We give a description of ORSE adapted to MESc in Algorithm 1, and detail the steps involved. We start from the top-level model $M$ (stage 4) to find the highly sensitive chunks (steps 2-14) of a document. We calculate the probability output from $M$ . Since $M$ is at the top level of the hierarchy, we take its prediction as the absolute predicted label $P$ , and take the self-impact score as $0$ (steps 2-6). We mask/occlude the chunks one at a time and calculate their impact scores (Eq. 7) and then their weighted occlusion sensitivity scores (Eq. 8) with respect to the whole document, i.e. the self-impact score (steps 8-11). Since this is the top level, we use $1$ as the weight. We sort the accumulated scores in order of their sensitivity score (i.e. a higher value is given more importance).

To rank the sentences (steps 15-28), we iteratively start from the highest-scored chunk and take its probability output from the fine-tuned transformer $T$ (from stage 1) to calculate its impact score w.r.t. $P$ (step 17). We then split this chunk ( $C_{i}$ ) into sentences and iteratively mask/occlude a sentence $s_{j}$ inside the chunk to calculate its weighted occlusion sensitivity score ( $S_{s_{j}}$ ) (steps 19-24). To weigh the overall importance of each sentence of this chunk against the sentences belonging to other chunks, we weigh the impact shift of $s_{j}$ with the sensitivity score of $C_{i}$ from the previous level of the hierarchy. We store the sentences along with their chunk number and sensitivity score in $A_{score}$ . We sort $A_{score}$ , ranking in the order of $S_{s_{j}}$ . Since this is the last level of the hierarchy, we stop and take the top $k\%$ sentences. To present the sentences in their sequential order of occurrence in the document, we arrange $A_{score}[k]$ according to the chunk number and the sentence position in the chunk. These sentences serve as the explanation for a document’s prediction. The time complexity is model dependent, and is $O(n^{2})$ here, due to the quadratic complexity of the fine-tuned transformer ( $T$ ) used, where asymptotically $n$ is the average length of all the documents for a batch.
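The two levels of Algorithm 1 can be traced end to end with a toy sketch. Everything below is illustrative: `toy_chunk_model` and `toy_doc_model` are keyword-counting stand-ins for $T$ and $M$, occlusion is done by removing text rather than zero-masking embeddings, the loss is the multi-class case of Eq. 7, and $\delta=1$ is an assumed value.

```python
import math


def cce(probs, label, eps=1e-12):
    """Categorical cross-entropy against the predicted class index."""
    return -math.log(max(probs[label], eps))


def toy_logit(sentences):
    """Hypothetical scorer: 'allowed' pushes toward class 1."""
    text = " ".join(sentences).lower()
    return 2.0 * text.count("allowed") - 2.0 * text.count("dismissed")


def toy_chunk_model(sentences):
    """Stand-in for the fine-tuned encoder T on one chunk."""
    p1 = 1.0 / (1.0 + math.exp(-toy_logit(sentences)))
    return [1.0 - p1, p1]


def toy_doc_model(chunks):
    """Stand-in for the top-level model M on all chunks."""
    z = sum(toy_logit(chunk) for chunk in chunks)
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]


def orse(chunks, chunk_model, doc_model, k=0.4, delta=1.0):
    """Return the predicted label and the top-k fraction of sentences."""
    probs = doc_model(chunks)
    label = max(range(len(probs)), key=probs.__getitem__)  # absolute label P
    loss_doc = 0.0  # self-impact L(P, P)

    # Level 1: occlude one chunk at a time, weight 1 at the top level.
    chunk_scores = []
    for i in range(len(chunks)):
        masked = chunks[:i] + [[]] + chunks[i + 1:]
        loss = cce(doc_model(masked), label)
        chunk_scores.append((i, 1.0 * (loss - loss_doc + delta)))
    chunk_scores.sort(key=lambda item: item[1], reverse=True)

    # Level 2: occlude one sentence at a time inside each chunk,
    # weighting by the chunk's score from level 1.
    ranked = []
    for i, chunk_score in chunk_scores:
        loss_chunk = cce(chunk_model(chunks[i]), label)
        for j in range(len(chunks[i])):
            masked = chunks[i][:j] + chunks[i][j + 1:]
            loss = cce(chunk_model(masked), label)
            ranked.append((i, j, chunk_score * (loss - loss_chunk + delta)))
    ranked.sort(key=lambda item: item[2], reverse=True)

    keep = max(1, round(k * len(ranked)))
    top = sorted(ranked[:keep], key=lambda item: (item[0], item[1]))
    return label, [chunks[i][j] for i, j, _ in top]


document = [
    ["The petition was filed in 2010.", "The hearing was adjourned twice."],
    ["The lower court misread the statute.", "The appeal is allowed."],
    ["Costs were discussed briefly.", "No order as to costs."],
]
label, explanation = orse(document, toy_chunk_model, toy_doc_model, k=0.4)
print(label, explanation)
```

Each occlusion costs one forward pass, so the sketch makes one call to the top-level model per chunk and one call to the chunk-level model per sentence, in addition to the unoccluded passes.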

4. Experimental setup

For our backbone transformer encoder, we used the domain-specific pre-trained models LEGAL-BERT (Chalkidis et al., 2020) and InLegalBERT (Paul et al., 2022), and chose GPT-Neo (Black et al., 2021) and GPT-J (Wang and Komatsuzaki, 2021) for experimenting with larger LLMs with multi-billion parameters. The tokenizers used to tokenize the documents are those of the respective encoders. No chunks were excluded in any stage of MESc.

Stage 1: For BERT-based encoders, the chunk size was set to $\leq 510$ (90 token overlaps) with global tokens ([CLS], [SEP]) and padding to make the input chunk size $512$ . For GPT-based encoders, we experimented with the same overlap but two different chunk sizes, $512$ and $2048$ , the latter being the maximum possible input length. We denote the encoders fine-tuned on 512 input length by ( $\alpha$ ) and those fine-tuned on 2048 input length by ( $\gamma$ ).

For testing the fine-tuned models, we considered the last tokens for each document. For all GPT-Neo( $\alpha$ ) and GPT-J( $\alpha$ ) we evaluated on input length of 512 tokens to compare their performance with LEGAL-BERT( $\alpha$ ) and InLegalBERT( $\alpha$ ).

These encoders were fine-tuned for 4 epochs with learning rates in $\{2\times 10^{-6},3\times 10^{-5}\}$ , and we chose the best-performing one for stage 2. For the full fine-tuning of GPT-J and all GPT-Neo models, we used 6 Nvidia A100 GPUs (80GB) with the ZeRO-3 optimization strategy implemented in DeepSpeed (https://www.deepspeed.ai/) with Hugging Face’s Accelerate library (https://huggingface.co/docs/accelerate/index).

Stage 2: We used the [CLS] token for embeddings extraction in BERT-based models and for the causal language models(GPT-Neo and GPT-J) we used the last token’s embedding as the global representation for the respective chunk.

Stage 3 & 4: The Adam optimizer (learning rate = $3.5\times 10^{-6}$ ) was used after stage 2 of MESc. For the binary and multi-label classification problems we used “binary cross-entropy loss”, and for the multi-class problems “categorical cross-entropy loss”. We experimented with $e\in\{1,2,3\}$ transformer encoder layers in stage 3 with $h=8$ , trained for 5 epochs, and chose $e=2$ after analyzing their performance (section 5). For clustering to approximate our structure labels we use HDBSCAN (minimum cluster size = $15$ ) and pUMAP (https://umap-learn.readthedocs.io/en/latest/parametric_umap.html) with $64$ output dimensions.

The best-performing MESc configuration for ILDC (Table 4), with $512$ input chunks and $k=\{0.2,0.3,0.4\}$ was used for ORSE (Table 5).
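To make the role of $k$ concrete, the following sketch ranks sentences by a generic occlusion sensitivity and keeps the top fraction $k$. It is a flat, single-level simplification written for illustration only and should not be read as the ORSE algorithm itself, which operates on the hierarchy of MESc; the function names and the keyword-based `toy_score` stand in for a real model's prediction score and are hypothetical.

```python
from typing import Callable, List


def rank_by_occlusion(sentences: List[str],
                      score: Callable[[List[str]], float]) -> List[int]:
    """Rank sentence indices by how much the score drops when each is removed."""
    base = score(sentences)
    drops = []
    for i in range(len(sentences)):
        occluded = sentences[:i] + sentences[i + 1:]
        drops.append(base - score(occluded))
    # Larger drop means the sentence mattered more to the prediction.
    return sorted(range(len(sentences)), key=lambda i: drops[i], reverse=True)


def top_k_sentences(sentences: List[str],
                    score: Callable[[List[str]], float],
                    k: float = 0.3) -> List[str]:
    """Keep the top fraction k of ranked sentences, in document order."""
    order = rank_by_occlusion(sentences, score)
    n_keep = max(1, round(k * len(sentences)))
    return [sentences[i] for i in sorted(order[:n_keep])]


def toy_score(sents: List[str]) -> float:
    """Hypothetical stand-in for a classifier's confidence."""
    return float(sum(5 * s.count("dismissed") + 2 * s.count("appeal") for s in sents))


document = [
    "The appellant filed an appeal.",
    "The hearing was on Monday.",
    "The appeal is dismissed.",
    "Costs are awarded.",
    "The bench rose.",
]
explanation = top_k_sentences(document, toy_score, k=0.4)
```

With `k = 0.4` on five sentences, two sentences are kept; under the toy score these are the two that mention the appeal.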

4.1. Dataset

We choose legal datasets with large documents that have a non-uniform structure throughout and carry no structural annotations or information. Suiting our problem of large scarce-annotated documents, we found one such dataset in the Indian legal court setting, named ILDC (Malik et al., 2021), and the same characteristics in a subset of the LexGLUE dataset (Chalkidis et al., 2022b). The ILDC dataset includes 39898 highly unstructured English-language case transcripts from the Supreme Court of India (SCI), in which the final decisions have been removed from the end of each document. Upon analyzing the documents from their sources and in the dataset, we found that they are highly unstructured and noisy. The initial decision (“rejected” or “accepted”) made by the SCI judge(s) serves as the decision label of each document. To assess how well judgment prediction algorithms explain themselves, a part of the corpus (ILDC_Expert, a separate test set of 56 documents) is labeled with gold-standard explanations by five distinct legal experts; these are pieces of text selected from the document that are most relevant to the judgment. We use this set to evaluate our extractive explanation algorithm ORSE.

The LexGLUE dataset (Chalkidis et al., 2022b) comprises seven datasets from the European Union and US court settings for uniformly assessing model performance across a range of legal NLP tasks. From these we choose ECtHR (Task A), ECtHR (Task B), and SCOTUS, as they are classification tasks involving our problem of long unstructured legal documents. ECtHR (A and B) consist of cases from the European Court of Human Rights, labeled with the articles of the European Convention on Human Rights (ECHR) that were violated or allegedly violated. The dataset contains factual paragraphs from the description of the cases. SCOTUS consists of court cases from the highest federal court in the United States of America, with metadata from SCDB (http://scdb.wustl.edu/). The number of labels, the average and maximum document lengths (in tokens), and the task type are given in Table 1. The tokenization in Table 1 is done with the tokenizer of GPT-J.

ILDC, ECtHR (Task A), ECtHR (Task B), and SCOTUS are a good fit for our problem and for testing our approach. For performance comparison on LexGLUE, we used the SOTA benchmark of Chalkidis et al. (Chalkidis et al., 2022b); for ILDC, we used the benchmark from its paper (Malik et al., 2021) and the experiments of Shounak et al. (Paul et al., 2022) with their models specifically pre-trained on Indian legal cases. Since LexGLUE lacks an explanation set like ILDC_Expert, we could not test the effectiveness of ORSE on LexGLUE.

Table 1. Dataset statistics

Name	No. of documents (Train / Validation / Test)	Average / maximum length in tokens (Train)	Average / maximum length in tokens (Validation)	Average / maximum length in tokens (Test)	No. of labels	Problem type
ILDC	37387 / 994 / 1517	4120 / 501275	5104 / 58048	5238 / 55703	2	Binary
ECtHR(A)	9000 / 1000 / 1000	2011 / 46500	2210 / 18352	2401 / 20835	10	Multi-Label
ECtHR(B)	9000 / 1000 / 1000	2011 / 46500	2210 / 18352	2401 / 20835	10	Multi-Label
SCOTUS	5000 / 1400 / 1400	8291 / 126377	12639 / 56310	12597 / 124955	13	Multi-Class

5. Results and discussion

Table 2. Custom-finetuning results on the chosen pre-trained transformer encoder language models (in Section 4), e = epoch

$\alpha$ : fine-tuned and evaluated with 512 input length, $\beta$ : evaluating $\alpha$ on its maximum input length, $\gamma$ : fine-tuned and evaluated with maximum input length.

Dataset	Encoder	Variant	e	Validation	Test
LexGLUE subset ( $\mu$ -F1/m-F1)
ECtHR (A)	LEGAL-BERT	$\alpha$	4	0.6408/0.5095	0.6285/0.4866
ECtHR (A)	GPT-Neo 1.3B	$\alpha$	2	0.6705/0.5940	0.6619/0.5659
ECtHR (A)	GPT-Neo 1.3B	$\beta$	2	0.6708/0.5958	0.6620/0.5716
ECtHR (A)	GPT-Neo 2.7B	$\alpha$	2	0.6815/0.5833	0.6849/0.5445
ECtHR (A)	GPT-Neo 2.7B	$\beta$	2	0.6739/0.5947	0.6811/0.5649
ECtHR (A)	GPT-J 6B	$\alpha$	3	0.7260/0.6715	0.7142/0.5927
ECtHR (A)	GPT-J 6B	$\beta$	3	0.7567/0.6945	0.7330/0.6245
ECtHR (A)	GPT-J 6B	$\gamma$	3	0.7855/0.7550	0.7451/0.6467
ECtHR (B)	LEGAL-BERT	$\alpha$	3	0.6961/0.6255	0.7089/0.6405
ECtHR (B)	GPT-Neo 1.3B	$\alpha$	2	0.7459/0.6938	0.7542/0.7091
ECtHR (B)	GPT-Neo 1.3B	$\beta$	2	0.7542/0.6946	0.7574/0.7009
ECtHR (B)	GPT-Neo 2.7B	$\alpha$	2	0.7524/0.7147	0.7448/0.6826
ECtHR (B)	GPT-Neo 2.7B	$\beta$	2	0.7619/0.7252	0.7513/0.7072
ECtHR (B)	GPT-J 6B	$\alpha$	3	0.7769/0.7244	0.7715/0.7326
ECtHR (B)	GPT-J 6B	$\beta$	3	0.8069/0.7611	0.8049/0.7631
ECtHR (B)	GPT-J 6B	$\gamma$	3	0.8308/0.8039	0.8316/0.7927
SCOTUS	LEGAL-BERT	$\alpha$	6	0.7296/0.5924	0.6876/0.5357
SCOTUS	GPT-Neo 1.3B	$\alpha$	2	0.7300/0.6582	0.7114/0.6035
SCOTUS	GPT-Neo 1.3B	$\beta$	2	0.7614/0.6772	0.7371/0.6310
SCOTUS	GPT-Neo 1.3B	$\gamma$	2	0.7731/0.6830	0.7502/0.6438
SCOTUS	GPT-Neo 2.7B	$\alpha$	1	0.7314/0.6571	0.7057/0.6025
SCOTUS	GPT-Neo 2.7B	$\beta$	1	0.7686/0.6851	0.7364/0.6564
SCOTUS	GPT-Neo 2.7B	$\gamma$	1	0.7828/0.6931	0.7636/0.6619
SCOTUS	GPT-J 6B	$\alpha$	3	0.7592/0.6875	0.7200/0.6276
SCOTUS	GPT-J 6B	$\beta$	3	0.7950/0.7295	0.7571/0.6625
SCOTUS	GPT-J 6B	$\gamma$	3	0.8178/0.7513	0.7850/0.7196
ILDC (accuracy(%)/m-F1)
ILDC	InLegalBERT	$\alpha$	4	76.15/76.8	76.00/76.10
ILDC	GPT-Neo 1.3B	$\alpha$	1	74.25/0.7421	72.91/0.7291
ILDC	GPT-Neo 1.3B	$\beta$	1	76.66/0.7662	77.26/0.7725
ILDC	GPT-Neo 2.7B	$\alpha$	1	76.96/0.7675	74.29/0.7424
ILDC	GPT-Neo 2.7B	$\beta$	1	81.59/0.8144	81.21/0.8118
ILDC	GPT-J 6B	$\alpha$	1	75.15/0.7511	73.96/0.7396
ILDC	GPT-J 6B	$\beta$	1	79.78/0.7972	81.93/0.8192
ILDC	GPT-J 6B	$\gamma$	1	83.60/0.8347	83.72/0.8366

Table 3. Results on LexGLUE for different configurations of MESc (* is the encoder model used for embedding extraction)

		ECtHR (A)		ECtHR (B)		SCOTUS
$p$ layers, $e$ x Encoder	Structure Labels	Validation	Test	Validation	Test	Validation	Test
(all scores are $\mu$ -F1/m-F1)
LexGLUE benchmark (Chalkidis et al., 2022b)		0.725/0.682	0.712/0.647	0.797/0.768	0.804/0.747	0.776/0.633	0.766/0.665
LEGAL-BERT* ( $\alpha$ )
$p$ =1, 1 x	No	0.7005/0.6118	0.6825/0.5806	0.7470/0.6791	0.7418/0.6890	0.7719/0.6926	0.7136/0.5916
$p$ =1, 2 x	No	0.6984/0.6056	0.6923/0.5935	0.7507/0.6835	0.7386/0.6742	0.7729/0.6895	0.7152/0.5817
$p$ =4, 1 x	No	0.7718/0.6994	0.7546/0.6226	0.8084/0.7709	0.8102/0.7573	0.7928/0.6866	0.7396/0.5865
$p$ =4, 1 x	Yes	0.7652/0.6870	0.7582/0.6378	0.8087/0.7727	0.8122/0.7725	0.7959/0.7025	0.7525/0.6194
$p$ =4, 2 x	No	0.7662/0.6479	0.7543/0.6337	0.8078/0.7574	0.8118/0.7564	0.7899/0.6849	0.7431/0.6054
$p$ =4, 2 x	Yes	0.7682/0.6883	0.7618/0.6508	0.8089/0.7748	0.8157/0.7670	0.7952/0.6872	0.7550/0.6208
$p$ =4, 3 x	No	0.7884/0.6905	0.7523/0.6311	0.8138/0.7863	0.8132/0.7699	0.7928/0.6866	0.7399/0.5635
$p$ =4, 3 x	Yes	0.7714/0.6815	0.7510/0.6309	0.8075/0.7564	0.8100/0.7621	0.7729/0.6536	0.7392/0.5783
GPT-Neo 1.3B* ( $\alpha$ )
$p$ =2, 2 x	No	0.7231/0.6862	0.7115/0.6359	0.7960/0.7307	0.8030/0.7702	0.7862/0.7185	0.7536/0.6479
$p$ =2, 2 x	Yes	0.7407/0.7033	0.7273/0.6448	0.7939/0.7479	0.8040/0.7808	0.7797/0.7284	0.7646/0.6592
$p$ =4, 2 x	No	0.7358/0.7059	0.7146/0.6277	0.7947/0.7483	0.8086/0.7664	0.7722/0.7026	0.7429/0.6352
$p$ =4, 2 x	Yes	0.7248/0.7076	0.7068/0.6410	0.7968/0.7537	0.8060/0.7757	0.7651/0.7036	0.7418/0.6377
GPT-Neo 2.7B* ( $\alpha$ )
$p$ =2, 2 x	No	0.7380/0.6750	0.7457/0.6224	0.7896/0.7689	0.7949/0.7620	0.7845/0.7140	0.7676/0.6570
$p$ =2, 2 x	Yes	0.7634/0.7105	0.7567/0.6644	0.7986/0.7693	0.8072/0.7696	0.7988/0.7348	0.7627/0.6630
$p$ =4, 2 x	No	0.7510/0.6641	0.7524/0.6355	0.7897/0.7599	0.7940/0.7503	0.7818/0.7273	0.7577/0.6554
$p$ =4, 2 x	Yes	0.7600/0.6857	0.7587/0.6561	0.7825/0.7686	0.7935/0.7635	0.7903/0.7245	0.7641/0.6775
GPT-J 6B* ( $\alpha$ )
$p$ =2, 2 x	No	0.7516/0.7138	0.7222/0.6263	0.7997/0.7588	0.7931/0.7692	0.7888/0.7215	0.7505/0.6658
$p$ =2, 2 x	Yes	0.7529/0.7255	0.7163/0.6406	0.8048/0.7722	0.7977/0.7760	0.7791/0.7343	0.7598/0.6715
$p$ =4, 2 x	No	0.7351/0.6743	0.7156/0.6118	0.7707/0.7414	0.7800/0.7605	0.7967/0.7211	0.7490/0.6333
$p$ =4, 2 x	Yes	0.7544/0.7377	0.7219/0.6437	0.7891/0.7534	0.7795/0.7625	0.7872/0.7268	0.7485/0.6593
GPT-J 6B* ( $\gamma$ )
$p$ =2, 2 x	No	0.7565/0.7182	0.7384/0.6434	0.8055/0.7954	0.8094/0.7675	0.8141/0.7508	0.7688/0.6773
$p$ =2, 2 x	Yes	0.7619/0.7383	0.7470/0.6571	0.8201/0.8076	0.8169/0.7801	0.8164/0.7755	0.7814/0.6853
$p$ =4, 2 x	No	0.7512/0.7007	0.7296/0.6333	0.8142/0.7937	0.8113/0.7763	0.8170/0.7533	0.7728/0.6786
$p$ =4, 2 x	Yes	0.7586/0.7174	0.7484/0.6548	0.8192/0.7977	0.8134/0.7802	0.8248/0.7555	0.7867/0.6966

Table 4. Results on ILDC(Malik et al., 2021) for the best configurations of MESc (* is the encoder model used for embedding extraction)

ILDC, all scores are Accuracy(%)/m-F1

Shounak et al. (Paul et al., 2022) benchmark: -/83.09
ILDC (Malik et al., 2021) benchmark: 77.78/77.79

$p$ layers, $e$ x Encoder	Structure labels	Validation	Test
InLegalBERT* ( $\alpha$ )
$p$ =1, 1 x	No	84.10/84.21	83.72/83.73
$p$ =1, 1 x	Yes	84.51/84.53	83.65/83.65
$p$ =1, 2 x	No	83.90/84.00	83.45/83.47
$p$ =1, 2 x	Yes	85.11/85.15	83.78/83.78
$p$ =4, 1 x	No	84.30/84.32	83.41/83.41
$p$ =4, 1 x	Yes	85.23/85.25	84.15/84.15
$p$ =4, 2 x	No	84.30/84.32	83.72/83.68
$p$ =4, 2 x	Yes	85.15/85.17	84.11/84.13
GPT-Neo 2.7B* ( $\alpha$ )
$p$ =2, 2 x	No	84.13/84.12	82.97/82.79
$p$ =2, 2 x	Yes	84.71/84.67	83.65/83.64
$p$ =4, 2 x	No	84.10/84.09	83.01/83.00
$p$ =4, 2 x	Yes	84.30/84.29	83.22/83.21
GPT-J 6B* ( $\alpha$ )
$p$ =2, 2 x	No	83.43/83.42	82.84/82.78
$p$ =2, 2 x	Yes	84.32/84.31	83.21/83.19
$p$ =4, 2 x	No	83.45/83.46	82.73/82.73
$p$ =4, 2 x	Yes	84.22/84.21	83.37/83.36

5.1. Results on classification framework

We use $\mu$ -F1 (micro) and m-F1 (macro) to measure performance on the LexGLUE dataset, and accuracy (%) and macro-F1 on the ILDC dataset. We emphasize $\mu$ -F1 on the LexGLUE dataset to take class imbalance into account, while also considering m-F1 to compare performance with previous LexGLUE benchmarks (Chalkidis et al., 2022b). Detailed experimental results for the best configurations of MESc are listed in Tables 3 and 4, and the fine-tuned performance of the LLMs used is given in Table 2.
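The difference between the two averages can be illustrated with a small, self-contained computation. This is a generic sketch of micro- and macro-averaged F1 for multi-label indicator vectors (the helper names and the toy labels are illustrative, not taken from the experiments): micro-F1 pools the counts over all labels, whereas macro-F1 averages the per-label F1 scores, so a rare label that is always missed lowers macro-F1 much more than micro-F1.

```python
from typing import List, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: List[List[int]],
                   y_pred: List[List[int]]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for binary indicator label vectors."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # [tp, fp, fn] per label
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    micro = f1_from_counts(sum(c[0] for c in counts),
                           sum(c[1] for c in counts),
                           sum(c[2] for c in counts))
    return micro, macro


# Toy data: label 0 is frequent, label 1 is rare and never predicted.
y_true = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 0]]
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
```

In this toy case micro-F1 is 0.8 while macro-F1 is about 0.44, which shows why both numbers are worth reporting on imbalanced label sets.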

5.1.1. Intra-domain(legal) transfer learning:

As can be seen from Table 2, on the LexGLUE subset all the GPT models used here adapt better than LEGAL-BERT, with a minimum gain of $\approx$ 3 points on $\mu$ -F1 and $\approx$ 6 points on m-F1. On the ILDC dataset, on the other hand, the performance of the variants fine-tuned and evaluated with 512 input length dropped or remained similar to that of InLegalBERT, while increasing the evaluation input length to 2048 raises the performance by more than 1 point. For GPT-J fine-tuned with 2048 input length, the performance increase over its 512 variant is at least $\approx$ 2 points on all datasets. This shows that a longer input length during fine-tuning helps to capture more feature information for such documents. Going from the 1.3 billion parameters of GPT-Neo-1.3B to the 2.7 billion of GPT-Neo-2.7B and the 6 billion of GPT-J also increases the performance by at least 2 points, which shows that the parameter count plays an important role in adapting to and understanding these documents. Even though GPT-Neo and GPT-J are pre-trained on US legal cases (Pile (Gao et al., 2021)), they adapt better to the European and Indian legal documents, with a minimum gain of $\approx$ 7 points ( $\gamma$ ) on the ECtHR (A & B) and ILDC datasets over their domain-specific pre-trained counterparts LEGAL-BERT and InLegalBERT, respectively.

5.1.2. Performance with MESc:

Looking at table 3 and table 4 we interpret the results in two directions.

Encoders fine-tuned on 512 input length $(\alpha)$ :

For LEGAL-BERT and InLegalBERT, MESc with just the last layer improves on the corresponding fine-tuned encoder by at least 4 points in all metrics on all datasets. Combining the last four layers with a 1 $\times$ encoder yields a performance boost of 4 points or more on the ECtHR datasets, while there is not much improvement on ILDC and SCOTUS. With the approximated structure labels, there is a slight performance increase on the test set of ILDC and an increase of $\approx$ 1 point on its validation set. The same goes for SCOTUS, with an increase of $\approx$ 1 point on its validation and test sets. With the same configuration and a 2 $\times$ encoder, the structure labels bring a much larger improvement, achieving new baseline performance on the ECtHR (A), ECtHR (B), and ILDC datasets. For SCOTUS, this improvement over the baseline appears only on the validation set, because of the high skew of class labels in the test set (for example, label 5 has only 5 samples).

With these results, we fixed certain parameters in MESc for further experiments with the embeddings extracted from GPT-Neo and GPT-J. For these models, experiments with 2 $\times$ encoders and only the last layer performed worse than those with the last 2 (or 4) layers and 2 $\times$ encoders, so we exclude them from this paper. For LexGLUE, as can be seen in Table 3, concatenating the embeddings from the last two layers of GPT-Neo or GPT-J improved significantly on their vanilla fine-tuned variants, by a minimum margin of 3 points for GPT-Neo-1.3B and 1 point for GPT-Neo-2.7B and GPT-J. This increases further by a minimum of 1 point when the approximated structure labels are included, showing the importance of structural information for such scarce-annotated documents.

For ILDC (Table 4), concatenating the last four layers did not improve the performance much, while including the generated structure labels increased the performance by 1 point on the validation set and slightly on the test set.

Encoders fine-tuned on 2048 input length $(\gamma)$ :

For the scarce-annotated documents in LexGLUE and ILDC, we compared the performance of MESc (on GPT-J 6B* ( $\gamma$ )) with that of its backbone fine-tuned encoder (GPT-J 6B ( $\gamma$ )) (Tables 2 and 3) to see the effect of increasing the number of parameters and the input length. GPT-J 6B ( $\gamma$ ), fine-tuned on its maximum input length (2048), achieves better (or similar) performance than the MESc overhead trained on its extracted embeddings. For SCOTUS, MESc achieves better performance (2 points, m-F1) on the validation set but lower performance (2 points, m-F1) on the test set. Its performance (m-F1) is almost the same on ECtHR (B), 1 point higher on the test set of ECtHR (A), and lower on ILDC. To check whether this also holds for GPT-Neo-1.3B and GPT-Neo-2.7B, we fine-tuned them with their maximum input length (2048) on SCOTUS (which, in our experiments, appears to be the more difficult dataset to classify). For GPT-Neo (1.3B and 2.7B), fine-tuning on the maximum input length did not give the same results as for GPT-J: MESc (on GPT-Neo-1.3B ( $\alpha$ )) and MESc (on GPT-Neo-2.7B ( $\alpha$ )) perform better (> 1 point, m-F1) than GPT-Neo-1.3B ( $\gamma$ ) and GPT-Neo-2.7B ( $\gamma$ ), respectively. To analyze this, we plot the distribution of the number of documents with respect to their chunk counts (chunk length = 2048) in the datasets; one such example for ECtHR can be found in figure 3. As observed, most of the documents fit in 1-2 chunks (median = 1), which means that with the longer input of 2048, most of the important information is not fragmented during the fine-tuning of stage 1 and is learned together. In addition, the higher number of parameters in GPT-J allows it to adapt better to most of the documents. Since most (> 90%) of the documents fit in very few chunks, deepening the model with extra layers (stages 3 & 4) adds no value.
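The chunk-count analysis above can be reproduced in spirit with a few lines. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the 90-token overlap is taken from the stage 1 setup, and the document lengths are made-up placeholder values rather than statistics from ECtHR, SCOTUS, or ILDC.

```python
import math
from statistics import median
from typing import List, Tuple


def count_chunks(n_tokens: int, chunk_len: int = 2048, overlap: int = 90) -> int:
    """Number of overlapping chunks needed to cover a document."""
    if n_tokens <= chunk_len:
        return 1
    stride = chunk_len - overlap
    return 1 + math.ceil((n_tokens - chunk_len) / stride)


def chunk_summary(doc_lengths: List[int], few: int = 2) -> Tuple[float, float]:
    """Return (median chunk count, share of documents with at most `few` chunks)."""
    counts = [count_chunks(n) for n in doc_lengths]
    share = sum(c <= few for c in counts) / len(counts)
    return median(counts), share


# Hypothetical token lengths, for illustration only.
lengths = [900, 1500, 2011, 2400, 3000, 46500]
median_chunks, share_few = chunk_summary(lengths)
```

When the median chunk count is close to 1 and the share of documents with few chunks is high, a single long-input encoder already sees most documents whole, which matches the explanation given for the GPT-J ( $\gamma$ ) results.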

With these results, we find that:

(1) Concatenating the last two layers in GPT-Neo (1.3B, 2.7B) or GPT-J provides the optimum number of feature variances, while for BERT-based models the last four layers worked better. Overall, concatenating the embeddings helped to get a better approximation of the structure labels and improved the performance.
(2) MESc adapts well to LLMs (BERT-based models, GPT-Neo-1.3B, GPT-Neo-2.7B) with fewer parameters than the 6-billion-parameter GPT-J.
(3) MESc works better than its counterpart LLM under the condition that the length of most of the documents in the dataset is much greater than the maximum input length of the LLM.

5.2. Results on extractive explanation (ORSE)

Table 5. ORSE vs ILDC_expert (Malik et al., 2021)

	Expert
	1	2	3	4	5
	Baseline-scores
ROUGE-1	0.444	0.517	0.401	0.391	0.501
ROUGE-2	0.303	0.295	0.296	0.297	0.294
ROUGE-L	0.439	0.407	0.423	0.444	0.407
BLEU	0.16	0.28	0.099	0.093	0.248
Jaccard Similarity	0.333	0.317	0.328	0.324	0.318
	ORSE @ k=20% (MESc with InLegalBERT ( $\alpha$ ))
ROUGE-1	0.4844	0.4815	0.4657	0.4678	0.5083
ROUGE-2	0.3339	0.3125	0.3299	0.331	0.3523
ROUGE-L	0.4679	0.4518	0.4537	0.4577	0.488
BLEU	0.1682	0.3253	0.0969	0.08	0.2973
Jaccard Similarity	0.346	0.3381	0.3306	0.318	0.3637
	ORSE @ k=30% (MESc with InLegalBERT ( $\alpha$ ))
ROUGE-1	0.5441	0.5108	0.5351	0.5445	0.55
ROUGE-2	0.3939	0.3452	0.4018	0.4078	0.3976
ROUGE-L	0.5266	0.4815	0.5225	0.5338	0.5302
BLEU	0.2888	0.3979	0.2104	0.1901	0.4049
Jaccard Similarity	0.4051	0.3657	0.3992	0.3896	0.4044
	ORSE @ k=40% (MESc with InLegalBERT ( $\alpha$ ))
ROUGE-1	0.5809	0.5201	0.5857	0.6016	0.5665
ROUGE-2	0.4364	0.3574	0.4597	0.4738	0.4185
ROUGE-L	0.5649	0.4942	0.5741	0.5914	0.5476
BLEU	0.3918	0.3915	0.3416	0.3221	0.4397
Jaccard Similarity	0.4445	0.3739	0.4535	0.4476	0.4216
	ORSE @ k=30% (MESc with GPT-J 6B ( $\alpha$ ))
ROUGE-1	0.5448	0.5152	0.5327	0.5484	0.5488
ROUGE-2	0.3996	0.3497	0.4009	0.4159	0.3931
ROUGE-L	0.5277	0.4858	0.5182	0.5387	0.5283
BLEU	0.2834	0.4078	0.2057	0.1861	0.3982
Jaccard Similarity	0.4076	0.372	0.3984	0.3948	0.4057
	ORSE @ k=40% (MESc with GPT-J 6B ( $\alpha$ ))
ROUGE-1	0.5822	0.5297	0.5864	0.6077	0.567
ROUGE-2	0.4414	0.3685	0.4611	0.4816	0.4164
ROUGE-L	0.5659	0.5033	0.573	0.5984	0.5464
BLEU	0.3854	0.4082	0.3288	0.3113	0.4305
Jaccard Similarity	0.4479	0.3854	0.4563	0.4566	0.4246

We used the two best-performing configurations of MESc on ILDC (InLegalBERT ( $\alpha$ ) and GPT-J 6B ( $\alpha$ )) to extract sentences with ORSE, varying k over 20%, 30%, and 40%. The sentences extracted by our extractive explanation algorithm are compared with the gold explanations given by five different annotators (1, 2, 3, 4, 5) in ILDC_Expert. The similarity between the two is measured with ROUGE-1, ROUGE-2, ROUGE-L (Lin, 2004), Jaccard similarity, and BLEU (Papineni et al., 2002); the results are shown in Table 5. We compare the scores of our algorithm with the baseline scores on ILDC_Expert (Malik et al., 2021). With InLegalBERT ( $\alpha$ ) and k = top 20% of ranked sentences, ORSE performs almost the same as the baseline on ROUGE-1 and slightly better overall on the other metrics. With k = 30%, ORSE surpasses the baselines with a total gain of 19% on ROUGE-1, 31% on ROUGE-2, 22.38% on ROUGE-L, 69.5% on BLEU, and 21.23% on Jaccard Similarity. For k = 40% the gain is much higher, with a total gain of 26.65% on ROUGE-1, 44.49% on ROUGE-2, 30.76% on ROUGE-L, 114% on BLEU, and 32.16% on Jaccard Similarity. The explanations extracted with the GPT-J 6B ( $\alpha$ ) variant are slightly better than those with InLegalBERT ( $\alpha$ ) for both k = 30% and k = 40%. Overall, ORSE performs better than the baseline, with the best scores having a total average gain of 50% over the baseline across all metrics.
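The reported gains can be checked directly against Table 5. The sketch below assumes that "total gain" means the relative increase of the scores summed over the five experts; this reading is an assumption, chosen because it reproduces the k = 40% figures for InLegalBERT ( $\alpha$ ). The variable and function names are illustrative.

```python
from statistics import mean

# Per-expert scores (experts 1..5) copied from Table 5.
baseline = {
    "ROUGE-1": [0.444, 0.517, 0.401, 0.391, 0.501],
    "ROUGE-2": [0.303, 0.295, 0.296, 0.297, 0.294],
    "ROUGE-L": [0.439, 0.407, 0.423, 0.444, 0.407],
    "BLEU": [0.16, 0.28, 0.099, 0.093, 0.248],
    "Jaccard": [0.333, 0.317, 0.328, 0.324, 0.318],
}
# ORSE @ k=40% (MESc with InLegalBERT (alpha)).
orse_k40 = {
    "ROUGE-1": [0.5809, 0.5201, 0.5857, 0.6016, 0.5665],
    "ROUGE-2": [0.4364, 0.3574, 0.4597, 0.4738, 0.4185],
    "ROUGE-L": [0.5649, 0.4942, 0.5741, 0.5914, 0.5476],
    "BLEU": [0.3918, 0.3915, 0.3416, 0.3221, 0.4397],
    "Jaccard": [0.4445, 0.3739, 0.4535, 0.4476, 0.4216],
}


def total_gain_percent(base, new):
    """Relative increase (in %) of the scores summed over all experts."""
    return 100.0 * (sum(new) - sum(base)) / sum(base)


gains = {metric: total_gain_percent(baseline[metric], orse_k40[metric])
         for metric in baseline}
average_gain = mean(list(gains.values()))

for metric, gain in gains.items():
    print(f"{metric}: {gain:.2f}%")
print(f"average over metrics: {average_gain:.1f}%")
```

Under this reading, the gains come out at roughly 26.65% (ROUGE-1), 44.5% (ROUGE-2), 30.76% (ROUGE-L), 114.4% (BLEU), and 32.17% (Jaccard), and their mean is close to 50%.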

6. Conclusion

We explore the problem of classifying large and unstructured legal documents and develop a multi-stage hierarchical classification framework (MESc). We examine the effect of including structure information through our approximated structure labels in such documents, and also explore the impact of combining the embeddings from the last layers of a fine-tuned transformer encoder model in MESc. Along with BERT-based LLMs, we also explore the adaptability of larger, multi-billion-parameter LLMs (GPT-Neo and GPT-J) to MESc. We check MESc’s limits (section 5.1.2) with these LLMs to suggest the optimal condition for its performance. GPT-Neo and GPT-J adapted well to legal cases from India and Europe even though they were pre-trained only on US legal case documents, showing the intra-domain (legal) transfer learning capacity of these multi-billion-parameter language models. Our experiments achieve a new benchmark in the classification of the ILDC and the LexGLUE subset (ECtHR (A), ECtHR (B), and SCOTUS). For explanation in such hierarchical models, we developed an extractive explanation algorithm (ORSE) based on the sensitivity of a model to its inputs at each level of the hierarchy. ORSE ranks the sentences according to their impact on the prediction and achieves an average performance gain of 50% on ILDC_Expert over the previous benchmark. In future work, we aim to further develop the explanation algorithm to adapt it to a general neural framework. We also aim to apply this work in-domain to French and European legal cases by further exploring the problem of length and non-uniform structure in these legal case documents.

7. Ethical Considerations

Our work aligns with the ethical considerations of the datasets (ILDC (Malik et al., 2021) and LexGLUE (Chalkidis et al., 2022b)) used here for the experimentation and evaluation of our approach. We add certain points to this. The framework developed here is in no way meant to create a “robotic” judge or to replace one in real life. Rather, we build such frameworks to analyze how deep learning and natural language processing techniques can be applied to legal documents to extract patterns and insights that may not be readily visible and provide them to legal professionals. The methods developed here are in no way foolproof for predicting judgments and generating explanations, and should not be used for that purpose in real-life settings (courts) or to guide people unfamiliar with legal proceedings. The results from our framework should not be used by a non-professional to make high-stakes decisions concerning legal cases.

Acknowledgements.

This work is supported by the LAWBOT project (ANR-20-CE38-0013) and HPC/AI resources from GENCI-IDRIS grant number 2022-AD011013937.

References

Ainslie et al. (2020) Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding Long and Structured Inputs in Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 268–284. https://doi.org/10.18653/v1/2020.emnlp-main.19
Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. https://doi.org/10.48550/ARXIV.2004.05150
Black et al. (2021) Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Chalkidis et al. (2019) Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4317–4323. https://doi.org/10.18653/v1/P19-1424
Chalkidis et al. (2022a) Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022a. An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification. https://arxiv.org/abs/2210.05529
Chalkidis et al. (2020) Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 2898–2904. https://doi.org/10.18653/v1/2020.findings-emnlp.261
Chalkidis et al. (2021) Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, and Prodromos Malakasiotis. 2021. Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 226–241. https://doi.org/10.18653/v1/2021.naacl-main.22
Chalkidis et al. (2022b) Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022b. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 4310–4330. https://doi.org/10.18653/v1/2022.acl-long.297
Chen et al. (2019) Huajie Chen, Deng Cai, Wei Dai, Zehui Dai, and Yadong Ding. 2019. Charge-Based Prison Term Prediction with Deep Gating Network. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 6362–6367. https://doi.org/10.18653/v1/D19-1667
Cui et al. (2022) Junyun Cui, Xiaoyu Shen, Feiping Nie, Z. Wang, Jinglong Wang, and Yulong Chen. 2022. A Survey on Legal Judgment Prediction: Datasets, Metrics, Models and Challenges. ArXiv abs/2204.04859 (2022).
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423
Feng et al. (2022) Yi Feng, Chuanyi Li, and Vincent Ng. 2022. Legal Judgment Prediction: A Survey of the State of the Art. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Luc De Raedt (Ed.). International Joint Conferences on Artificial Intelligence Organization, 5461–5469. https://doi.org/10.24963/ijcai.2022/765 Survey Track.
Gao et al. (2021) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR abs/2101.00027 (2021). arXiv:2101.00027 https://arxiv.org/abs/2101.00027
Jiang et al. (2018) Xin Jiang, Hai Ye, Zhunchen Luo, WenHan Chao, and Wenjia Ma. 2018. Interpretable Rationale Augmented Charge Prediction System. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Santa Fe, New Mexico, 146–151. https://aclanthology.org/C18-2032
Katju (2019) Justice Markandey Katju. 2019. Backlog of cases crippling judiciary. (2019). https://www.tribuneindia.com/news/archive/comment/backlog-of-cases-crippling-judiciary-776503
Kaufman et al. (2019) Aaron Russell Kaufman, Peter Kraft, and Maya Sen. 2019. Improving Supreme Court Forecasting Using Boosted Decision Trees. Political Analysis 27, 3 (2019), 381–387. https://doi.org/10.1017/pan.2018.59
Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rkgNKkHtvB
Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013
Malik et al. (2021) Vijit Malik, Rishabh Sanjay, Shubham Kumar Nigam, Kripa Ghosh, Shouvik Kumar Guha, Arnab Bhattacharya, and Ashutosh Modi. 2021. ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation. CoRR abs/2105.13562 (2021). arXiv:2105.13562 https://arxiv.org/abs/2105.13562
McInnes et al. (2017) Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 205. https://doi.org/10.21105/joss.00205
McInnes et al. (2018) Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. https://doi.org/10.48550/ARXIV.1802.03426
Nallapati and Manning (2008) Ramesh Nallapati and Christopher D. Manning. 2008. Legal Docket Classification: Where Machine Learning Stumbles. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii, 438–446. https://aclanthology.org/D08-1046
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
Pappagari et al. (2019) Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical Transformers for Long Document Classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 838–844. https://doi.org/10.1109/ASRU46091.2019.9003958
Paul et al. (2022) Shounak Paul, Arpan Mandal, Pawan Goyal, and Saptarshi Ghosh. 2022. Pre-training Transformers on Indian Legal Text. https://doi.org/10.48550/ARXIV.2209.06049
Petsiuk et al. (2018) Vitali Petsiuk, Abir Das, and Kate Saenko. 2018. RISE: Randomized Input Sampling for Explanation of Black-box Models. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018. BMVA Press, 151. http://bmvc2018.org/contents/papers/1064.pdf
Prasad et al. (2022) Nishchal Prasad, Mohand Boughanem, and Taoufiq Dkaki. 2022. Effect of Hierarchical Domain-specific Language Models and Attention in the Classification of Decisions for Legal Cases. In Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France, July 4-7, 2022 (CEUR Workshop Proceedings, Vol. 3178). CEUR-WS.org. http://ceur-ws.org/Vol-3178/CIRCLE_2022_paper_21.pdf
Tay et al. (2022) Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, and Donald Metzler. 2022. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? https://doi.org/10.48550/ARXIV.2207.10551
Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. 2022. LaMDA: Language Models for Dialog Applications. arXiv:2201.08239 [cs.CL]
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
Trautmann et al. (2022) Dietrich Trautmann, Alina Petrova, and Frank Schilder. 2022. Legal Prompt Engineering for Multilingual Legal Judgement Prediction. CoRR abs/2212.02199 (2022). https://doi.org/10.48550/arXiv.2212.02199 arXiv:2212.02199
Tuggener et al. (2020) Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 1235–1241. https://aclanthology.org/2020.lrec-1.155
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
Xiao et al. (2018) Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018. CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. CoRR abs/1807.02478 (2018). arXiv:1807.02478 http://arxiv.org/abs/1807.02478
Xu et al. (2020) Nuo Xu, Pinghui Wang, Long Chen, Li Pan, Xiaoyan Wang, and Junzhou Zhao. 2020. Distinguish Confusing Law Articles for Legal Judgment Prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 3086–3095. https://doi.org/10.18653/v1/2020.acl-main.280
Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 1480–1489. https://doi.org/10.18653/v1/N16-1174
Ye et al. (2018) Hai Ye, Xin Jiang, Zhunchen Luo, and Wenhan Chao. 2018. Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1854–1864. https://doi.org/10.18653/v1/N18-1168
Yu et al. (2022) Fangyi Yu, Lee Quartey, and Frank Schilder. 2022. Legal Prompting: Teaching a Language Model to Think Like a Lawyer. arXiv:2212.01326 [cs.CL]
Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for Longer Sequences. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 17283–17297. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
Zeiler and Fergus (2014) Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 818–833.
Zhang et al. (2019) Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 5059–5069. https://doi.org/10.18653/v1/P19-1499
Zheng et al. (2021) Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (São Paulo, Brazil) (ICAIL ’21). Association for Computing Machinery, New York, NY, USA, 159–168. https://doi.org/10.1145/3462757.3466088
Zhong et al. (2018) Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Chaojun Xiao, Zhiyuan Liu, and Maosong Sun. 2018. Legal Judgment Prediction via Topological Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 3540–3549. https://doi.org/10.18653/v1/D18-1390
Zhong et al. (2020a) Haoxi Zhong, Yuzhong Wang, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020a. Iteratively Questioning and Answering for Interpretable Legal Judgment Prediction. Proceedings of the AAAI Conference on Artificial Intelligence 34, 01 (Apr. 2020), 1250–1257. https://doi.org/10.1609/aaai.v34i01.5479
Zhong et al. (2020b) Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020b. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5218–5230. https://doi.org/10.18653/v1/2020.acl-main.466
