-
TARexp: A Python Framework for Technology-Assisted Review Experiments
Authors:
Eugene Yang,
David D. Lewis
Abstract:
Technology-assisted review (TAR) is an important industrial application of information retrieval (IR) and machine learning (ML). While a small TAR research community exists, the complexity of TAR software and workflows is a major barrier to entry. Drawing on past open source TAR efforts, as well as design patterns from the IR and ML open source software, we present an open source Python framework…
▽ More
Technology-assisted review (TAR) is an important industrial application of information retrieval (IR) and machine learning (ML). While a small TAR research community exists, the complexity of TAR software and workflows is a major barrier to entry. Drawing on past open source TAR efforts, as well as design patterns from the IR and ML open source software, we present an open source Python framework for conducting experiments on TAR algorithms. Key characteristics of this framework are declarative representations of workflows and experiment plans, the ability for components to play variable numbers of workflow roles, and state maintenance and restart capabilities. Users can draw on reference implementations of standard TAR algorithms while incorporating novel components to explore their research interests. The framework is available at https://github.com/eugene-yang/tarexp.
△ Less
Submitted 24 April, 2022; v1 submitted 23 February, 2022;
originally announced February 2022.
-
TAR on Social Media: A Framework for Online Content Moderation
Authors:
Eugene Yang,
David D. Lewis,
Ophir Frieder
Abstract:
Content moderation (removing or limiting the distribution of posts based on their contents) is one tool social networks use to fight problems such as harassment and disinformation. Manually screening all content is usually impractical given the scale of social media data, and the need for nuanced human interpretations makes fully automated approaches infeasible. We consider content moderation from…
▽ More
Content moderation (removing or limiting the distribution of posts based on their contents) is one tool social networks use to fight problems such as harassment and disinformation. Manually screening all content is usually impractical given the scale of social media data, and the need for nuanced human interpretations makes fully automated approaches infeasible. We consider content moderation from the perspective of technology-assisted review (TAR): a human-in-the-loop active learning approach developed for high recall retrieval problems in civil litigation and other fields. We show how TAR workflows, and a TAR cost model, can be adapted to the content moderation problem. We then demonstrate on two publicly available content moderation data sets that a TAR workflow can reduce moderation costs by 20% to 55% across a variety of conditions.
△ Less
Submitted 29 August, 2021;
originally announced August 2021.
-
Certifying One-Phase Technology-Assisted Reviews
Authors:
David D. Lewis,
Eugene Yang,
Ophir Frieder
Abstract:
Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications. Most stop** rules for one-phase TAR workflows lack valid statistical guarantees, which has discouraged their use in some legal contexts. Drawing on the theory of quantile estimation, we provide the first broadly applicable and statistically valid sample-based stop** ru…
▽ More
Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications. Most stop** rules for one-phase TAR workflows lack valid statistical guarantees, which has discouraged their use in some legal contexts. Drawing on the theory of quantile estimation, we provide the first broadly applicable and statistically valid sample-based stop** rules for one-phase TAR. We further show theoretically and empirically that overshooting a recall target, which has been treated as innocuous or desirable in past evaluations of stop** rules, is a major source of excess cost in one-phase TAR workflows. Counterintuitively, incurring a larger sampling cost to reduce excess recall leads to lower total cost in almost all scenarios.
△ Less
Submitted 29 August, 2021;
originally announced August 2021.
-
Heuristic Stop** Rules For Technology-Assisted Review
Authors:
Eugene Yang,
David D. Lewis,
Ophir Frieder
Abstract:
Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections. These workflows often must meet a target for the proportion of relevant documents found (i.e. recall) while also holding down costs. A variety of heuristic stop** rules have been suggested for striking this tradeoff in particular settings, but none have been…
▽ More
Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections. These workflows often must meet a target for the proportion of relevant documents found (i.e. recall) while also holding down costs. A variety of heuristic stop** rules have been suggested for striking this tradeoff in particular settings, but none have been tested against a range of recall targets and tasks. We propose two new heuristic stop** rules, Quant and QuantCI based on model-based estimation techniques from survey research. We compare them against a range of proposed heuristics and find they are accurate at hitting a range of recall targets while substantially reducing review costs.
△ Less
Submitted 17 June, 2021;
originally announced June 2021.
-
On Minimizing Cost in Legal Document Review Workflows
Authors:
Eugene Yang,
David D. Lewis,
Ophir Frieder
Abstract:
Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high recall review tasks. Attorneys and legal technologists have debated whether review should be a single iterative process (one-phase TAR workflows) or whether model training and review should be separate (two-phase TAR workflows), with implications for the cho…
▽ More
Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high recall review tasks. Attorneys and legal technologists have debated whether review should be a single iterative process (one-phase TAR workflows) or whether model training and review should be separate (two-phase TAR workflows), with implications for the choice of active learning algorithm. The relative cost of manual labeling for different purposes (training vs. review) and of different documents (positive vs. negative examples) is a key and neglected factor in this debate. Using a novel cost dynamics analysis, we show analytically and empirically that these relative costs strongly impact whether a one-phase or two-phase workflow minimizes cost. We also show how category prevalence, classification task difficulty, and collection size impact the optimal choice not only of workflow type, but of active learning method and stop** point.
△ Less
Submitted 17 June, 2021;
originally announced June 2021.
-
Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review
Authors:
Eugene Yang,
Sean MacAvaney,
David D. Lewis,
Ophir Frieder
Abstract:
Technology-assisted review (TAR) refers to iterative active learning workflows for document review in high recall retrieval (HRR) tasks. TAR research and most commercial TAR software have applied linear models such as logistic regression to lexical features. Transformer-based models with supervised tuning are known to improve effectiveness on many text classification tasks, suggesting their use in…
▽ More
Technology-assisted review (TAR) refers to iterative active learning workflows for document review in high recall retrieval (HRR) tasks. TAR research and most commercial TAR software have applied linear models such as logistic regression to lexical features. Transformer-based models with supervised tuning are known to improve effectiveness on many text classification tasks, suggesting their use in TAR. We indeed find that the pre-trained BERT model reduces review cost by 10% to 15% in TAR workflows simulated on the RCV1-v2 newswire collection. In contrast, we likewise determined that linear models outperform BERT for simulated legal discovery topics on the Jeb Bush e-mail collection. This suggests the match between transformer pre-training corpora and the task domain is of greater significance than generally appreciated. Additionally, we show that just-right language model fine-tuning on the task collection before starting active learning is critical. Too little or too much fine-tuning hinders performance, worse than that of linear models, even for a favorable corpus such as RCV1-v2.
△ Less
Submitted 19 January, 2022; v1 submitted 3 May, 2021;
originally announced May 2021.
-
A Sequential Algorithm for Training Text Classifiers
Authors:
David D. Lewis,
William A. Gale
Abstract:
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertain…
▽ More
The ability to cheaply train text classifiers is critical to their use in information retrieval, content analysis, natural language processing, and other tasks involving data which is partly or fully textual. An algorithm for sequential sampling during machine learning of statistical classifiers was developed and tested on a newswire text categorization task. This method, which we call uncertainty sampling, reduced by as much as 500-fold the amount of training data that would have to be manually classified to achieve a given level of effectiveness.
△ Less
Submitted 24 July, 1994; v1 submitted 24 July, 1994;
originally announced July 1994.