-
Bio-SIEVE: Exploring Instruction Tuning Large Language Models for Systematic Review Automation
Authors:
Ambrose Robinson,
William Thorne,
Ben P. Wu,
Abdullah Pandor,
Munira Essat,
Mark Stevenson,
Xingyi Song
Abstract:
Medical systematic reviews can be very costly and resource intensive. We explore how Large Language Models (LLMs) can support and be trained to perform literature screening when provided with a detailed set of selection criteria. Specifically, we instruction tune LLaMA and Guanaco models to perform abstract screening for medical systematic reviews. Our best model, Bio-SIEVE, outperforms both ChatG…
▽ More
Medical systematic reviews can be very costly and resource intensive. We explore how Large Language Models (LLMs) can support and be trained to perform literature screening when provided with a detailed set of selection criteria. Specifically, we instruction tune LLaMA and Guanaco models to perform abstract screening for medical systematic reviews. Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches, and generalises better across medical domains. However, there remains the challenge of adapting the model to safety-first scenarios. We also explore the impact of multi-task training with Bio-SIEVE-Multi, including tasks such as PICO extraction and exclusion reasoning, but find that it is unable to match single-task Bio-SIEVE's performance. We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process and explore its future developmental opportunities. We release our models, code and a list of DOIs to reconstruct our dataset for reproducibility.
△ Less
Submitted 12 August, 2023;
originally announced August 2023.
-
Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science
Authors:
Yida Mu,
Ben P. Wu,
William Thorne,
Ambrose Robinson,
Nikolaos Aletras,
Carolina Scarton,
Kalina Bontcheva,
Xingyi Song
Abstract:
Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and Open…
▽ More
Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, in the context of six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt; use of synonyms for label names; and the influence of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10\%.
△ Less
Submitted 24 March, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
HadISD: a quality-controlled global synoptic report database for selected variables at long-term stations from 1973--2011
Authors:
Robert J. H. Dunn,
Kate M. Willett,
Peter W. Thorne,
Emma V. Woolley,
Imke Durre,
Aiguo Dai,
David E. Parker,
Russ E. Vose
Abstract:
[Abridged] This paper describes the creation of HadISD: an automatically quality-controlled synoptic resolution dataset of temperature, dewpoint temperature, sea-level pressure, wind speed, wind direction and cloud cover from global weather stations for 1973--2011. The full dataset consists of over 6000 stations, with 3427 long-term stations deemed to have sufficient sampling and quality for clima…
▽ More
[Abridged] This paper describes the creation of HadISD: an automatically quality-controlled synoptic resolution dataset of temperature, dewpoint temperature, sea-level pressure, wind speed, wind direction and cloud cover from global weather stations for 1973--2011. The full dataset consists of over 6000 stations, with 3427 long-term stations deemed to have sufficient sampling and quality for climate applications requiring sub-daily resolution. As with other surface datasets, coverage is heavily skewed towards Northern Hemisphere mid-latitudes.
The dataset is constructed from a large pre-existing ASCII flatfile data bank that represents over a decade of substantial effort at data retrieval, reformatting and provision. These raw data have had varying levels of quality control applied to them by individual data providers. The work proceeded in several steps: merging stations with multiple reporting identifiers; reformatting to netCDF; quality control; and then filtering to form a final dataset. Particular attention has been paid to maintaining true extreme values where possible within an automated, objective process. Detailed validation has been performed on a subset of global stations and also on UK data using known extreme events to help finalise the QC tests. Further validation was performed on a selection of extreme events world-wide (Hurricane Katrina in 2005, the cold snap in Alaska in 1989 and heat waves in SE Australia in 2009). Although the filtering has removed the poorest station records, no attempt has been made to homogenise the data thus far. Hence non-climatic, time-varying errors may still exist in many of the individual station records and care is needed in inferring long-term trends from these data.
A version-control system has been constructed for this dataset to allow for the clear documentation of any updates and corrections in the future.
△ Less
Submitted 26 October, 2012;
originally announced October 2012.