TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets

tai Chen (Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA); Yaojun Hu (College of Computer Science and Technology, Zhejiang University, Hangzhou, China); Yue Wang (College of Computer Science and Technology, Zhejiang University, Hangzhou, China); Yingzhou Lu (School of Medicine, Stanford University, Stanford, CA, USA); Xu Cao (Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA); Miao Lin (Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China); Hongxia Xu (Medical Big Data Center, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China); Jian Wu (The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China); Cao Xiao (GE HealthCare, Chicago, USA); Jimeng Sun (Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, USA); Lucas Glass (IQVIA, Boston, USA); Kexin Huang (Computer Science Department, Stanford University, Stanford, CA, USA); Marinka Zitnik (Informatics, Harvard Medical School, Harvard University, USA); Tianfan Fu (Department of Computational Science, Rensselaer Polytechnic Institute, NY, USA)

Abstract

Clinical trials are pivotal for developing new medical treatments, yet they carry risks such as patient mortality, adverse events, and enrollment failure that can waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition, which require medical expertise and a deep understanding of trial designs, have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of meticulously curated AI-ready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design: prediction of trial duration, patient dropout rate, serious adverse event, mortality rate, trial approval outcome, and trial failure reason, as well as drug dose finding and design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets’ usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development. The curated dataset, metrics, and basic models are publicly available at https://github.com/ML2Health/ML2ClinicalTrials/tree/main/AI4Trial.

1 Background & Summary

The clinical trial process is an essential step in developing new treatments (e.g., drugs or medical devices). It involves evaluating their safety, appropriate dosage, and effectiveness in treating specific diseases in the human body. However, these exploratory trials, often spanning three or four phases, have a high failure rate [1, 2]. Compounding the issue, clinical trials are known for being time-consuming, labor-intensive, and costly. Clinical development programs containing the set of Phase 1-3 trials typically span 7-11 years, cost an average of 2 billion USD, and achieve approval rates of only around 15% [3]. Clinical trials are inherently risky (the word "trial" is in the name), and artificial intelligence (AI) is well suited to producing estimates that help reduce this risk.

Years of clinical trials have generated a vast amount of multi-modal data, encompassing aspects such as inclusion/exclusion criteria designs, adverse event statistics, and patient enrollment results. This extensive data offers a robust foundation for developing advanced artificial intelligence algorithms. However, identifying key clinical trial challenges and effectively leveraging the complex variables within this data require a blend of deep medical knowledge and AI expertise. This complexity has kept skilled AI experts from fully utilizing the data.

The ClinicalTrials.gov website (https://clinicaltrials.gov/) provides comprehensive information on clinical trials, including study protocols, participant eligibility criteria, and study results, making it a valuable resource for AI engineers and medical professionals. This centralized repository covers more than 480,000 clinical trial records (as of Feb 2024) from all 50 US states and international trials from 221 countries. However, identifying key clinical trial challenges suitable for AI solutions and selecting appropriate variables for different challenges remain problematic for data scientists who lack relevant background knowledge.

To facilitate cross-disciplinary research and fully leverage the expertise of data scientists and AI experts [4, 5], this paper identifies 8 critical clinical trial challenges and organizes 23 corresponding AI-ready datasets to support their involvement in these tasks. The data, representing clinical trials registered before February 16, 2024, were collected from ClinicalTrials.gov. We extracted elements and attributes from the XML records of each clinical trial and converted them into tabular data formats, which are better suited for processing by AI models, including deep learning models. Additionally, we transformed some features into more informative forms; for example, we converted health condition information into ICD-10 codes. We also contextualize our input variables with data from DrugBank [6] and TrialTrove (https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/trialtrove) to assemble a comprehensive set of information for clinical trial AI.

When curating these datasets, we manually determined the prediction objectives for each task and selected variables according to the timing of applying AI in real-world practice. For instance, we ensured that trial result information was not included if the AI task is to be performed before trial completion. Features with a limited number of discrete options were organized into categorical features. Each task ultimately has a clearly defined prediction objective and a collection of input tabular variables. Unlike traditional tabular datasets, these datasets may contain multi-modal input features, such as free text (e.g., eligibility criteria) and graph data (e.g., drug molecular graphs).

Figure 1: Overview of TrialBench. (left) TrialBench comprises 23 AI-ready clinical trial datasets for 8 well-defined tasks: clinical trial duration forecasting, patient dropout rate prediction, serious adverse event prediction, all-cause mortality rate prediction, trial approval outcome prediction, trial failure reason identification, eligibility criteria design, and drug dose finding. For each task, we extracted appropriate multi-modal variables and prediction targets from ClinicalTrials.gov, implemented evaluation metrics, and constructed a multi-modal baseline model that both assesses dataset quality and serves as a reference for comparison. We integrate drug SMILES strings, textual descriptions (e.g., eligibility criteria), Medical Subject Heading (MeSH) terms, disease ICD-10 codes, and other categorical or numerical features as up to five distinct modalities. The multi-modal model utilizes message-passing neural networks (MPNNs) [7], Bio-BERT [8], a MeSH embedding layer [9], the Graph-based Attention Model (GRAM) [10], and DANet basic blocks [11] to process each modality, respectively. (right) We present the trial failure reason identification task as an illustrative example.

The datasets in TrialBench lend themselves to the study of the following open questions in artificial intelligence:

1. Trial Duration Forecasting: The extensive duration of clinical trials, averaging 7 to 11 years, poses a significant challenge in the development of new treatments [12]. Moreover, research indicates that most trials encounter unforeseen delays attributed to strategic challenges, commercial barriers, operational issues, and high toxicity levels [13, 2, 14]. Accurate prediction of trial duration helps pharmaceutical companies better estimate costs, decide how many sites to include if acceleration is needed, and plan for market launches.
2. Patient Dropout Rate Prediction: Previous research has pointed out that approximately 30% of participants eventually drop out of trials [15]. If patients drop out without completing the primary outcomes, the investment in enrollment will be wasted. More importantly, patient dropout can introduce bias in the study’s conclusions [16]. Thus, predicting patient dropout is crucial for the success of clinical trials, as it helps to minimize follow-up loss and reduce this potential bias. Previous researchers have identified several factors influencing dropout rates, such as age, sex, and education levels [17], providing a solid foundation for using artificial intelligence to predict patient dropout rates during the trial design stage.
3. Serious Adverse Event Prediction: Serious adverse events (SAEs) [18] are critical in assessing the safety profile of new treatments. SAEs can lead to significant trial modifications, including early termination, impacting the development timeline and financial investment [19]. Accurate prediction of SAEs can inform the choice of patient population size and the length of patient monitoring, allowing for better risk management and resource allocation.
4. Mortality Event Prediction: When serious adverse events reach a critical level, unsafe treatments or severe diseases may result in fatalities. Unexpectedly high mortality rates can raise ethical concerns and necessitate comprehensive safety reviews [20]. Predicting mortality events is essential in clinical trial design to ensure patient safety and maintain ethical standards. Mortality events can lead to the early termination of trials and also significantly impact the development timeline and financial resources. Developing artificial intelligence algorithms to identify clinical trials at risk of mortality events can significantly enhance risk management and resource allocation.
5. Trial Approval Prediction: Clinical trials suffer from low approval rates [21]. Forecasting the approval probability of a clinical trial before it starts, and avoiding trials with low approval expectations, would save a great amount of funding, time, and other resources. In other words, an accurate clinical trial approval model can help clinical decision-makers prioritize trials that are more likely to succeed and improve profitability.
6. Trial Failure Reason Identification: Failures in clinical trials are commonplace, yet they waste millions of dollars and years of time, thereby delaying the development of new treatments [22]. Therefore, when designing clinical trials, it is particularly important to anticipate and mitigate risks such as efficacy issues, safety concerns, and poor enrollment. By understanding the root risks of failure in advance, trialists can make informed decisions to improve the design.
7. Eligibility Criteria Design: The design of inclusion and exclusion eligibility criteria is one of the most crucial aspects of clinical trial design, significantly impacting the success or failure of the trial. Both overly restrictive and excessively lenient criteria [23] can lead to low enrollment rates in clinical trials [24, 25] or to inconclusive results. Currently, the design of these criteria primarily relies on the experience of physicians, which can sometimes lack objectivity. Using existing natural language processing technologies, particularly large language models (LLMs), to automate the design of inclusion and exclusion eligibility criteria is a promising direction.
8. Drug Dose Finding: In Phase II trials, controlling drug dosage is crucial for ensuring safety while maintaining medication efficacy [26]. However, drug dosage control largely relies on the experience of physicians, which can be subjective. Artificial intelligence algorithms offer a promising approach to predict and optimize drug dosages based on drug SMILES and target diseases. By accurately predicting the appropriate drug dosage, clinical trials can enhance patient safety, improve treatment efficacy, and potentially reduce the duration and costs associated with trial phases.

Fig. 1 illustrates the TrialBench platform, containing 8 well-defined clinical trial design tasks. The TrialBench platform provides 23 corresponding AI-ready datasets, implemented evaluation metrics, and baseline models. AI experts can easily access the datasets and targets to develop advanced models, evaluate models on specific metrics, and compare them against baseline models for reference.

2 Methods

2.1 AI-solvable Clinical Trial Task Definitions

In this paper, we identify eight AI-solvable clinical trial tasks. For each task, we elaborate on its background, explain how it would help clinical trial design and management, curate the dataset, evaluate the performance of well-known artificial intelligence methods, and report the empirical results. Table 1 summarizes and compares all the AI-solvable clinical trial tasks and corresponding datasets. We provide the following three aspects for each learning task: (1) Background. Background of the learning task. (2) Definition. A formal definition of the learning task (input feature and output). (3) Broad impact. The broader impact that progress on the task would have on real clinical trials.

Table 1: Summary of AI-solvable clinical trial tasks. There are five modalities in total: drug molecule structure (SMILES string), disease code (ICD-10), text (e.g., summary of clinical trial, eligibility criteria), categorical/numerical features (e.g., gender of patients, blood pressure), and MeSH (Medical Subject Headings) terms.

Problem	AI Task	Input Modality	Input	Output	# Data
trial duration forecasting	Regression	all 5 modalities	See Fig. 7	trial duration (e.g., 3.2 years)	141,940
patient dropout event forecasting	Classification/ Regression	all 5 modalities	See Fig. 8	patient dropout event (0/1) / rate [0,1)	62,058
serious adverse event forecasting	Classification	all 5 modalities	See Fig. 8	adverse event (0/1)	31,306
mortality event prediction	Classification	all 5 modalities	See Fig. 8	mortality event (0/1)	31,306
trial approval forecasting	Classification	all 5 modalities	See Fig. 9	trial approval label (0/1)	43,202
trial failure reason identification	Classification	all 5 modalities	See Fig. 10	trial failure reasons (4 categories)	41,369
eligibility criteria design	Generation	MeSH, SMILES, ICD-10, Texts	See Fig. 11	eligibility criteria (natural language)	136,443
drug dose finding	Classification	SMILES, MeSH	SMILES & intervention MeSH	drug dosage (4 categories)	12,790

2.1.1 Trial Duration Prediction

Background. The duration of a clinical trial is defined as the number of years from the trial’s start date to its completion date, representing a continuous numerical value. The clinical trial duration is directly related to its cost because longer trials require more extended use of resources, including personnel, facilities, and materials, leading to increased expenses [27].

Definition. This task focuses on predicting trial duration (time span from the enrollment of the first participant to the conclusion of the study) based on multi-modal trial features such as eligibility criteria, target disease, etc.

Broad impact. Predicting the duration of clinical trials offers several significant benefits that enhance drug development efficiency and effectiveness. AI-driven predictions allow for better planning and resource allocation, leading to more accurate staffing, budgeting, and management of clinical sites. This enhances decision-making by enabling stakeholders to prioritize projects based on expected timelines and identify risks early, allowing for proactive measures to mitigate delays. Ultimately, accurate duration predictions help pharmaceutical companies estimate costs, determine the number of sites needed for potential acceleration, and plan market launches.

2.1.2 Patient Dropout Forecasting

Background. Clinical trials often suffer from high patient dropout rates, which can compromise the validity of the results and lead to increased costs and delays.

Definition. This task aims to predict the patient dropout rate (percentage) of the clinical trial based on multi-modal trial features such as eligibility criteria, target disease, and so on.

Broad impact. Predicting patient dropout rates in clinical trials holds significant promise for improving the efficiency and effectiveness of drug development. High dropout rates often necessitate the recruitment of additional participants to meet the required sample size, which can be both time-consuming and costly.

2.1.3 Serious Adverse Event Prediction

Background. Adverse event prediction is crucial in clinical trials as it directly impacts the safety, efficacy, and overall success of the trial. The primary concern in any clinical trial is the safety of the participants [28].

Definition. This task aims to forecast the probability of a serious adverse event given multi-modal clinical trial features such as drug molecule, target disease, eligibility criteria, etc.

Broad impact. Predicting adverse events helps in identifying potential risks to patients before they occur, allowing for proactive measures to be taken. On the other hand, regulatory organizations such as the FDA and EMA have strict guidelines for monitoring and reporting adverse events in clinical trials [29]. Accurate prediction and early detection of adverse events can ensure compliance with these regulations.

2.1.4 Mortality Rate Prediction

Background. The mortality rate in a clinical trial refers to the proportion of participants who die during the study. The mortality rate is an important measure used to assess the safety and potential risks associated with a treatment or intervention being tested in the trial.

Definition. This task aims to forecast the probability of a mortality event given multi-modal clinical trial features such as drug molecule, target disease, eligibility criteria, etc.

Broad impact. Accurately predicting the mortality rate of a clinical trial significantly enhances patient safety by identifying potential risks early, allowing for timely interventions. This leads to more efficient trial designs, optimizing resource allocation and reducing costs. Furthermore, it accelerates the drug development process, bringing effective treatments to market faster, and increases compliance with regulatory standards, thereby building public trust and ethical standards in clinical research.

2.1.5 Trial Approval Prediction

Background. Clinical trial approval refers to whether a drug can pass a certain phase of clinical trial, which is the most important outcome of a clinical trial. It is a binary variable.

Definition. This task aims to predict the probability of trial approval given multi-modal trial features such as drug molecule, disease code, and eligibility criteria.

Broad impact. Predicting trial approval can significantly enhance the efficiency and success rates of drug development. By accurately forecasting which drugs are likely to pass clinical trial phases, companies can focus their resources on the most promising candidates, reducing wasted time and money on less viable options. This targeted approach can accelerate the development of effective treatments, bringing them to market faster and improving patient outcomes. Additionally, reliable approval predictions can streamline regulatory processes and increase investor confidence in the pharmaceutical industry.

2.1.6 Trial Failure Reason Identification

Background. Clinical trials usually fail for one of several reasons [30]: (1) Business decision (e.g., lack of funding, company strategy shift, pipeline reorganization, drug strategy shift). Business decisions are challenging to predict, so we do not include these trials in our dataset. (2) Poor enrollment. Insufficient enrollment can compromise the statistical power of the study, making it difficult to detect a significant effect of the drug. Poor enrollment can also lead to delays in the trial timeline and increased costs, as more resources are required to recruit additional participants. (3) Safety. Unexpected adverse reactions or side effects can occur, posing significant risks to participants’ health. This can lead to the trial being halted or terminated. (4) Efficacy (effectiveness). In a trial, the tested drug is expected to outperform the standard treatment for the target disease. Thus, demonstrating efficacy is typically required.

Definition. Given clinical trial features, the goal of this task is to use an AI model to classify the trial into one of four categories: (1) successful trial; (2) failure due to poor enrollment; (3) failure due to a drug safety issue; (4) failure due to lack of efficacy. It is a multi-class classification problem.

Broad impact. Accurately predicting the reasons for clinical trial failures can greatly enhance the efficiency of drug development by preventing costly delays and optimizing resource allocation. This leads to faster delivery of effective treatments to patients, improving patient outcomes and public health. Additionally, better-designed trials with higher success rates can encourage greater confidence and participation in clinical research.

2.1.7 Eligibility Criteria Design

Background. To achieve statistically significant results, a clinical trial must meet its target sample size [31]. Insufficient patient numbers can lead to underpowered studies, which may fail to demonstrate the effectiveness of a treatment or may miss important safety information. Eligibility criteria are essential to patient recruitment [32]. They describe the patient recruitment requirements in unstructured natural language. Eligibility criteria comprise multiple inclusion and exclusion criteria, which specify what is desired and undesired when recruiting patients. Each individual criterion is usually a natural language sentence.

Definition. This task aims to design eligibility criteria given a series of clinical trial features such as target disease, phase, drug molecules, etc.

Broad impact. Using AI models to design eligibility criteria for clinical trials offers several significant advantages. AI can predict which patients are more likely to meet the eligibility criteria based on historical data and real-world evidence. This speeds up the recruitment process by identifying suitable candidates faster and reducing the time and cost associated with screening large numbers of unsuitable participants.

2.1.8 Drug Dose Finding

Background. One of the primary goals of clinical trials is to determine the drug dose. Determining the correct dosage of a drug is crucial to ensure its effectiveness in treating a particular condition. In the early stages of drug development, predicting the optimal dosage is essential for designing clinical trials [26, 33].

Definition. This task aims to predict drug dosage based on drug molecular structure and target disease.

Broad impact. By estimating the dose-response relationship and identifying the dosage range that balances efficacy and safety, researchers can design more informative and efficient clinical studies.

2.1.9 Generalization of These Tasks

In real-world clinical trials, the design, structure, and cost of trials, as well as the drug structures of interest, evolve significantly over time [1, 34, 35]. Thus, these prediction tasks require a model to generalize to a set of unseen clinical trials that are structurally distant from the known clinical trial set. Time information is available for all the trials, so we use a time split (detailed in Section 2.6): we learn from earlier trials and test on later ones to assess the model fairly.
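A time split of this kind can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the field name `start_date`, the toy NCT IDs, and the cutoff date are hypothetical, and the split actually used for TrialBench is the one described in Section 2.6.

```python
from datetime import date

def time_split(trials, cutoff):
    """Split trials into (train, test) by start date.

    Trials that started before `cutoff` go to train; all others go to test,
    so the model is always evaluated on trials later than those it saw.
    """
    train = [t for t in trials if t["start_date"] < cutoff]
    test = [t for t in trials if t["start_date"] >= cutoff]
    return train, test

# Toy records; the NCT IDs and dates are made up for illustration.
trials = [
    {"nct_id": "NCT00000001", "start_date": date(2012, 5, 1)},
    {"nct_id": "NCT00000002", "start_date": date(2019, 3, 15)},
    {"nct_id": "NCT00000003", "start_date": date(2021, 8, 30)},
]

# The cutoff is an arbitrary example value, not the one used in the paper.
train, test = time_split(trials, cutoff=date(2020, 1, 1))
print([t["nct_id"] for t in train])
print([t["nct_id"] for t in test])
```

Splitting on a date rather than at random keeps later trials out of the training set, which is the property the paragraph above relies on.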

2.2 Data Acquisition

We create the dataset benchmark from multiple public data sources, including ClinicalTrials.gov, DrugBank, TrialTrove, and the ICD-10 coding system, as elaborated below.

• ClinicalTrials.gov. ClinicalTrials.gov is a publicly accessible database maintained by the U.S. National Library of Medicine (NLM) at the National Institutes of Health (NIH). It provides detailed information about clinical trials conducted around the world, including those funded by public and private entities. Each clinical trial in ClinicalTrials.gov is provided as an XML file, which we parse to extract relevant variables. For each trial, we retrieve the NCT ID (a unique identifier for each clinical study), disease names, associated drugs, title, summary, trial phase, eligibility criteria, results of statistical analyses, and other details. Some of these features are not always available. For example, observational clinical trials do not involve treatment and drugs.

• DrugBank. DrugBank [6] (https://www.drugbank.com/) is a comprehensive, freely accessible online database that provides detailed information about drugs and their biological targets. It integrates chemical, pharmacological, and pharmaceutical data with comprehensive drug target information, making it a valuable resource for researchers, healthcare professionals, and students in the fields of drug discovery, pharmacology, and medicinal chemistry.

• TrialTrove. TrialTrove (https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/trialtrove) is a comprehensive database and intelligence platform designed to provide detailed information and analysis on clinical trials across the pharmaceutical and biotechnology industries. TrialTrove serves as a critical resource for professionals involved in clinical development, competitive intelligence, and market analysis. We obtain the trial outcomes of some trials from the released/public subset of the TrialTrove database [36, 37].

• ICD-10. ICD-10-CM (International Classification of Diseases, 10th Revision, Clinical Modification) is a medical coding system for classifying diagnoses and reasons for visits in U.S. healthcare settings. Diseases are extracted from https://clinicaltrials.gov/ and linked to ICD-10 codes and disease descriptions using the Clinical Table Search Service API (clinicaltables.nlm.nih.gov) and then to CCS codes via hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp.
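To make the XML parsing step for ClinicalTrials.gov records concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample record is hand-written, and the tag names (`id_info/nct_id`, `brief_title`, `condition`, `eligibility/criteria/textblock`, and so on) are assumptions modeled on the legacy ClinicalTrials.gov XML layout; they should be checked against the actual files before use.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written record; tag names are assumed, not copied from a real file.
SAMPLE_XML = """
<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example study of drug X</brief_title>
  <phase>Phase 2</phase>
  <condition>Hypertension</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>drug X</intervention_name>
  </intervention>
  <eligibility>
    <criteria><textblock>Inclusion Criteria: adults aged 18 or older.</textblock></criteria>
  </eligibility>
</clinical_study>
"""

def parse_trial(xml_text):
    """Extract a flat dictionary of selected fields from one trial record."""
    root = ET.fromstring(xml_text)
    # Keep only interventions whose type is "Drug".
    drugs = [
        node.findtext("intervention_name")
        for node in root.findall("intervention")
        if node.findtext("intervention_type") == "Drug"
    ]
    return {
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        "phase": root.findtext("phase"),
        "diseases": [c.text for c in root.findall("condition")],
        "drugs": drugs,
        # findtext returns None when a node is absent, which covers
        # features that are not always available.
        "criteria": root.findtext("eligibility/criteria/textblock"),
        "why_stopped": root.findtext("why_stopped"),
    }

record = parse_trial(SAMPLE_XML)
print(record["nct_id"], record["drugs"], record["why_stopped"])
```

Returning `None` for missing nodes mirrors the remark above that some features are not always available.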

2.3 Data Source Linking

Next, we describe how we process and link the parsed trial data to AI-ready input and output format:

• Drug names are extracted from ClinicalTrials.gov and linked to their molecular structures (SMILES strings and molecular graphs) using the DrugBank database.

• Disease data are extracted from ClinicalTrials.gov and linked to ICD-10 (International Classification of Diseases, Tenth Revision) codes and disease descriptions using clinicaltables.nlm.nih.gov and then to CCS codes via hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp.

• Trial outcomes are available from TrialTrove and are linked through the NCT ID.
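The three linking steps above amount to key-based lookups. The sketch below illustrates the idea with small in-memory dictionaries; the table names, the lowercase name normalization, and the toy entries are assumptions for illustration and do not reflect the actual DrugBank, Clinical Table Search Service, or TrialTrove interfaces.

```python
# Hypothetical lookup tables standing in for DrugBank, the ICD-10 lookup
# service, and the released TrialTrove subset.
DRUG_TO_SMILES = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}
DISEASE_TO_ICD10 = {"essential hypertension": ["I10"]}
OUTCOME_BY_NCT_ID = {"NCT00000000": 1}

def normalize(name):
    """Lowercase and trim a name so that lookups tolerate formatting noise."""
    return name.strip().lower()

def link_trial(trial):
    """Attach SMILES strings, ICD-10 codes, and an outcome label to a trial."""
    smiles, unmatched = [], []
    for drug in trial["drugs"]:
        hit = DRUG_TO_SMILES.get(normalize(drug))
        if hit is None:
            unmatched.append(drug)  # keep track of drugs with no structure
        else:
            smiles.append(hit)
    icd10 = []
    for disease in trial["diseases"]:
        icd10.extend(DISEASE_TO_ICD10.get(normalize(disease), []))
    return {
        "nct_id": trial["nct_id"],
        "smiles": smiles,
        "icd10": icd10,
        "unmatched_drugs": unmatched,
        # Outcomes are joined on the NCT ID; None means no label is known.
        "outcome": OUTCOME_BY_NCT_ID.get(trial["nct_id"]),
    }

linked = link_trial({
    "nct_id": "NCT00000000",
    "drugs": ["Aspirin", "drug X"],
    "diseases": ["Essential Hypertension"],
})
print(linked)
```

Recording unmatched drug names separately makes it easy to see how much of a trial's input could not be linked to a structure.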

2.4 Dataset Curation and Feature Organization

We apply a series of selection filters to ensure the selected trials are of high quality. ClinicalTrials.gov provides hundreds of multi-modal features for each trial, organized in XML format; the hierarchy of these features is shown in Figure 6. We use only the features that are available before a trial starts and remove the remaining features. Different tasks rely on different subsets of features. Based on clinical trial knowledge, we manually select the appropriate features for each task. In addition, we remove features whose values are identical or all null across trials. The additional selection criteria for each task are as follows.

• Trial duration forecasting: We only consider trials whose start and completion dates are available. We only consider trials with actual completion dates and remove cases where only an anticipated completion date is provided. We found that trials with a duration over 10 years are outliers [38, 39], so we removed them to facilitate regression analysis.

• Patient dropout rate prediction: We include trials whose results are available at ClinicalTrials.gov and that report both the number of dropout patients and the total number of enrolled patients.

• Serious adverse event prediction: We include trials whose results are available at ClinicalTrials.gov and that report serious adverse events.

• Mortality event prediction: We include trials whose results are available at ClinicalTrials.gov and that report mortality events.

• Trial approval outcome prediction: We include trials whose results and trial outcome information are available at either ClinicalTrials.gov or the released subset of TrialTrove [36, 37, 40, 41].

• Trial failure reason identification: We include trials whose results and outcome information are available at ClinicalTrials.gov and that can be assigned to one of the four categories (three failure reasons or success) mentioned above.

• Eligibility criteria design: To ensure the high quality of the selected eligibility criteria, we only include completed trials, which indicate successful patient recruitment and reasonable criteria design, and remove the others.

• Drug dose finding: We include trials whose drug dosage information is available on ClinicalTrials.gov. Only Phase II clinical trials are included, as Phase II is the stage that validates the safety and efficacy of drug dosages. Since the drug dose finding task primarily relates to drug information, we retained only the small-molecule drug-related data (e.g., MeSH) and sourced SMILES from DrugBank. We encourage AI experts to utilize external knowledge from sources such as PubMed and DrugBank for advanced AI model development.
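As an illustration of how such filters translate into code, the sketch below applies the three trial-duration criteria from the list above (both dates present, an actual rather than anticipated completion date, and a duration of at most 10 years). The record layout, the `completion_date_type` field with the value "Actual", and the 365.25 days-per-year conversion are assumptions made for this example.

```python
from datetime import date

MAX_YEARS = 10.0        # trials longer than this are treated as outliers
DAYS_PER_YEAR = 365.25  # assumed conversion from days to years

def duration_in_years(trial):
    """Return the trial duration in years, or None if the trial is filtered out."""
    start = trial.get("start_date")
    end = trial.get("completion_date")
    if start is None or end is None:
        return None  # both dates must be available
    if trial.get("completion_date_type") != "Actual":
        return None  # anticipated completion dates are not used
    years = (end - start).days / DAYS_PER_YEAR
    if years > MAX_YEARS:
        return None  # outlier removed before regression
    return years

# Toy records with made-up dates.
kept = duration_in_years({
    "start_date": date(2015, 1, 1),
    "completion_date": date(2017, 1, 1),
    "completion_date_type": "Actual",
})
anticipated = duration_in_years({
    "start_date": date(2023, 1, 1),
    "completion_date": date(2026, 1, 1),
    "completion_date_type": "Anticipated",
})
too_long = duration_in_years({
    "start_date": date(2000, 1, 1),
    "completion_date": date(2012, 1, 1),
    "completion_date_type": "Actual",
})
print(kept, anticipated, too_long)
```

The same pattern, a function that returns either a label or `None`, extends naturally to the other per-task criteria.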

Apart from flattening the XML nodes and attributes into tabular features, we also pre-process several features into formats better suited to deep learning approaches. For example, we transform the information recorded in the XML node named “ipd_info_type” into multiple tabular features. The “ipd_info_type” node specifies the types of documents provided, such as “Study Protocol”, “Statistical Analysis Plan (SAP)”, “Informed Consent Form (ICF)”, and “Clinical Study Report (CSR)”. A single clinical trial may provide several types of documents. We therefore convert this information into multiple binary features, one per document type. The columns are named “ipd_info_type-Analytic Code”, “ipd_info_type-Clinical Study Report (CSR)”, “ipd_info_type-Informed Consent Form (ICF)”, “ipd_info_type-Statistical Analysis Plan (SAP)”, and “ipd_info_type-Study Protocol”. If a document type appears in the data, the corresponding column value is 1; otherwise, it is 0. Similar strategies were applied to other nodes with discrete values, such as “study_design_info/masking”, “arm_group/arm_group_type”, and “intervention/intervention_type”.
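The conversion of “ipd_info_type” into binary columns is a multi-hot encoding. A minimal sketch follows; the column names and document types come from the text above, while the function name `multi_hot` and its signature are hypothetical.

```python
# Document types and column prefix as listed in the text.
DOC_TYPES = [
    "Analytic Code",
    "Clinical Study Report (CSR)",
    "Informed Consent Form (ICF)",
    "Statistical Analysis Plan (SAP)",
    "Study Protocol",
]

def multi_hot(values, prefix="ipd_info_type", vocabulary=DOC_TYPES):
    """Map a list of discrete values to one binary column per vocabulary entry."""
    present = set(values)
    return {f"{prefix}-{name}": int(name in present) for name in vocabulary}

# A trial that provides two document types.
row = multi_hot(["Study Protocol", "Statistical Analysis Plan (SAP)"])
for column, value in row.items():
    print(column, value)
```

Passing a different `prefix` and `vocabulary` would cover the other discrete nodes mentioned above, such as “study_design_info/masking”.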

2.5 Data Annotation

Data annotation (a.k.a. labeling data) is a fundamental step when curating a dataset. Labels of all the datasets can be inferred from various data sources. For some tasks, such as drug dose finding, trial approval prediction, and trial failure reason identification, we use external tools such as GPT to obtain the label from the raw text.

•

Trial duration forecasting: The duration of a clinical trial refers to the number of years the trial lasts, i.e., the difference between the start and complete date. It is a continuous numerical value. For some trials, the start and completion date are available in ClinicalTrials.gov. We can use this information to calculate the trial duration.
•

Patient dropout rate prediction: Some clinical trials on ClinicalTrials.gov present the number of dropout patients and the number of enrolled patients. We compute the patient dropout rate by dividing the number of dropout patients by the number of enrolled patients. The resulting dropout rate is a percentage.
•

Serious adverse event prediction: ClinicalTrials.gov presents the results of some trials. Adverse events are reported for some of these trials.
•

Mortality event prediction: The results of clinical trials presented on ClinicalTrials.gov may include mortality events. We binarize the mortality event as the prediction target indicating whether a mortality event occurred, and remove all other trials that lack mortality event information.
•

Trial approval outcome prediction: The annotations come from two sources. First, the HINT paper [36, 37, 40, 41, 42] builds a benchmark dataset for trial approval prediction, with approval labels sourced from TrialTrove. Additionally, ClinicalTrials.gov provides termination reasons for some trials, such as poor enrollment or lack of efficacy, in the “why stopped” node of the XML files. We incorporate these trials, along with termination reasons indicating failed approval, into the dataset as negative samples.
•

Trial failure reason identification: For some of the terminated trials, ClinicalTrials.gov provides a “why stopped” tag that describes the failure reason in natural language. We use the OpenAI ChatGPT API (https://openai.com/index/openai-api/) to automatically convert these descriptions into three categories of failure reason: (1) poor enrollment; (2) drug safety issue; (3) lack of efficacy (in treating the target disease). The prompt and instruction given to ChatGPT are shown below; we required ChatGPT to complete the “reasons” part:

We input “why stopped” contexts of 10 clinical trials into ChatGPT in each iteration. We also use the passed trials from the released subset of TrialTrove, following [37, 36].
•

Eligibility criteria design: For some trials, the eligibility criteria are organized in a textual format and are available on ClinicalTrials.gov. We considered the inclusion/exclusion eligibility criteria of trials marked as “completed” as the ground truth.
•

Drug dose finding: One aim of phase II clinical trials is to determine the dosage of the drug. ClinicalTrials.gov presents the drug dosage information of some trials in natural language. We use the OpenAI ChatGPT API to extract the label from natural language; the prompt is shown below.

We categorize these doses into four classes: (1) dose $<$ 1 mg/kg; (2) 1 mg/kg $<$ dose $<$ 10 mg/kg; (3) 10 mg/kg $<$ dose $<$ 100 mg/kg; (4) dose $>$ 100 mg/kg. For dosages expressed in units such as mg per person or mg/hour, we assume an individual weight of 60 kg and convert using 24 hours per day to keep the units consistent.
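As a rough illustration of the dose binning just described, the sketch below converts a dose to mg/kg and assigns one of the four classes. The function names and unit strings are hypothetical. Because the text uses strict inequalities, the handling of exact boundary values (1, 10, and 100 mg/kg) is not specified; assigning them to the higher class here is an assumption made only so the function is total.

```python
ASSUMED_WEIGHT_KG = 60.0  # body weight assumed in the text
HOURS_PER_DAY = 24.0      # conversion factor stated in the text


def to_mg_per_kg(value: float, unit: str) -> float:
    """Convert a dose to mg/kg. The unit strings are illustrative assumptions."""
    if unit == "mg/kg":
        return value
    if unit == "mg":  # mg per person
        return value / ASSUMED_WEIGHT_KG
    if unit == "mg/hour":  # scaled to one day, then divided by body weight
        return value * HOURS_PER_DAY / ASSUMED_WEIGHT_KG
    raise ValueError(f"unsupported unit: {unit}")


def dose_class(mg_per_kg: float) -> int:
    """Map a dose in mg/kg to classes 1-4.

    Assumption: values exactly on a boundary fall into the higher class.
    """
    if mg_per_kg < 1:
        return 1
    if mg_per_kg < 10:
        return 2
    if mg_per_kg < 100:
        return 3
    return 4


examples = [(30, "mg"), (5, "mg/hour"), (25, "mg/kg"), (150, "mg/kg")]
labels = [dose_class(to_mg_per_kg(value, unit)) for value, unit in examples]
print(labels)
```

For instance, 30 mg per person corresponds to 0.5 mg/kg under the 60 kg assumption and therefore falls into class 1.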

2.6 Data Split/Segmentation

Artificial intelligence models need to be evaluated on (future) unseen data. To simulate that setting, data split strategies are employed to partition the dataset into training, validation, and testing sets for unbiased evaluation of the artificial intelligence models. In this paper, we leverage temporal split, which refers to splitting the data samples based on their time stamps. The earlier data samples are used for training and validation, while the later data are used for testing. The reason is that the design of later clinical trials relies on earlier clinical trials. The training/test split ratio is 8:2.
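A temporal split of this kind can be sketched in a few lines. The trial identifiers and dates below are made up, and the function name `temporal_split` is hypothetical; only the sort-by-time logic and the 8:2 ratio come from the text. A validation set would be carved out of the earlier (training) portion in the same manner.

```python
from datetime import date
from typing import List, Tuple

Trial = Tuple[str, date]  # (identifier, time stamp); identifiers here are placeholders


def temporal_split(trials: List[Trial], train_fraction: float = 0.8) -> Tuple[List[Trial], List[Trial]]:
    """Sort trials by time stamp; earlier trials go to training, later ones to testing."""
    ordered = sorted(trials, key=lambda trial: trial[1])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


trials = [
    ("trial_c", date(2015, 3, 1)),
    ("trial_a", date(2008, 6, 1)),
    ("trial_e", date(2021, 1, 15)),
    ("trial_b", date(2011, 9, 30)),
    ("trial_d", date(2018, 11, 5)),
]
train_set, test_set = temporal_split(trials)
print([name for name, _ in train_set], [name for name, _ in test_set])
```

By construction, every training trial starts no later than any test trial, which mimics predicting on future, unseen trials.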

Table 2: Statistics of all the curated AI-solvable clinical trial datasets.

Tasks	# trials (I/II/III/IV)	# drugs	# med device	# other inter	# diseases	Intervention study (%)
trial duration forecasting	143.8K (13.5K/13.4K/9.2K/7.1K)	40.8K	21.1K	83.6K	44.6K	77.3%
patient dropout event forecasting	62.1K (4.2K/15.8K/11.5K/6.9K)	29.7K	10.9K	20.7K	21.9K	94.5%
serious adverse event forecasting	31.3K (2.0K/8.1K/4.8K/2.9K)	15.9K	6.6K	12.4K	15.9K	96.0%
mortality event prediction	31.3K (2.0K/8.1K/4.8K/2.9K)	15.9K	6.6K	12.4K	15.9K	96.0%
trial approval forecasting	43.2K (4.5K/12.5K/9.2K/4.5K)	24.1K	3.3K	12.6K	19.5K	93.0%
trial failure reason identification	41.4K (4.3K/8.8K/4.2K/3.5K)	17.7K	6.6K	16.9K	21.9K	86.8%
eligibility criteria design	136.4K (19.4K/14.2K/10.8K/10.6K)	48.5K	16.2K	75.0K	36.6K	84.9%
drug dose finding	12.8K (0/12.8K/0/0)	11.0K	0.1K	1.2K	7.3K	100%

3 Data Records

The ClinicalTrials.gov website (https://clinicaltrials.gov/) provides public data resources for clinical trials. It is supported by the U.S. National Library of Medicine and covers more than 420,000 clinical trial records from all 50 US states as well as international trials from 221 countries. The number of recorded trials has grown rapidly over time, as shown in Figure 2.

Table 3: Comparison of different phases from several angles.

	Phase I	Phase II	Phase III
Time spent	1-2 years	1-2 years	2-3 years
Cost ($)	225 M	225 M	250 M
Result	5-10 candidates	2-5 candidates	1-2 candidates
Major objective	safety	safety and dosing	safety and efficacy
# of patients	20-80	100-300	300-3000
Recruited patient	healthy	with diseases	with diseases

There are hundreds of multi-modal features in ClinicalTrials.gov for each trial, organized in XML format; the hierarchy of these features is shown in Figure 6. Below, we review some essential features. Some trials have missing features, e.g., some incomplete trials do not have a completion date or outcome.

•

Trial questions. A clinical trial aims to answer the question: Is the treatment effective in treating the target diseases for patients? First, the treatment must be safe for the human body. Second, the new drug candidate should be better than the current standard treatment.
•

National Clinical Trial number (NCT ID) is the identifier of the clinical trial. It consists of 11 characters and begins with NCT, e.g., NCT02929095. NCT ID is assigned based on the temporal order of registration date and starts from NCT00000000.
•

Study type. Clinical trials can be categorized into interventional and observational. Interventional clinical trials involve drugs, medical devices, or surgery as treatment. In contrast, observational trials do not assign participants to a treatment or other intervention. Instead, the researchers observe participants or measure certain outcomes to determine clinical outcomes.
•

Phase. Phase I tests the toxicity and side effects of the drug; phase II determines the efficacy of the drug (i.e., whether the drug works); phase III focuses on the effectiveness of the drug (i.e., whether the drug is better than the current standard practice). When the trial passes phase III, it can be submitted to the FDA for approval. In many cases, even after approval, the drug’s effectiveness and safety still need to be monitored, and a phase IV trial is sometimes conducted for this purpose. Table 3 summarizes the differences between phases I, II, and III.
•
Eligibility criteria describe the patient recruitment requirements in unstructured natural language. Eligibility criteria comprise multiple inclusion and exclusion criteria, which specify what is desired and undesired when recruiting patients. Each individual criterion is usually a natural language sentence. For example, in the clinical trial entitled “Efficacy and Safety Study of MP-513 in Combination With Thiazolidinedione in Patients With Type 2 Diabetes” (https://clinicaltrials.gov/ct2/show/NCT01026194), which is a phase III trial, the inclusion criteria contain
- Patients who are 20 - 75 years old.
- Patients who are under dietary management and taking therapeutic exercise for diabetes over 12 weeks before administration of an investigational drug.
- Patients whose HbA1c is between 6.5% and 10.0%.
- Patients who took Thiazolidinedione for diabetes over 16 weeks before administration of the investigational drug.
- Patients who were not administered diabetes therapeutic drugs prohibited for concomitant use within 12 weeks before administration of the investigational drug.
The exclusion criteria contain
- Patients with type 1 diabetes, diabetes mellitus caused by pancreas impairment, or secondary diabetes (cushing disease, acromegaly, etc).
- Patients who are accepting treatments of arrhythmias.
- Patients with serious diabetic complications.
- Patients who are excessive alcohol addicts.
- Patients with a severe hepatic disorder or a severe renal disorder.
- Patients who are pregnant, lactating, and probably pregnant patients, and patients who can not agree to contraception.
•

Disease (also known as condition, or indication) describes the diseases that the drug is intended to treat. It is in unstructured natural language. For example, NCT00428389 studies the safety of switching from Donepezil to Rivastigmine patch in patients with probable Alzheimer’s Disease, where Alzheimer’s disease is the disease that the trial wants to treat. Sometimes, a single trial may target multiple diseases or patients with co-morbidities.
•

Disease code. The disease is usually described by natural language, and it is hard to reveal the relationship between different diseases [43, 44]. To address this issue, we map disease names to disease codes and leverage the disease hierarchy for machine learning modeling. For example, several ICD-10 codes correspond to Alzheimer’s disease, including “G30.0” (Alzheimer’s disease with early onset), “G30.1” (Alzheimer’s disease with late onset), “G30.8” (Other Alzheimer’s disease), “G30.9” (Alzheimer’s disease, unspecified) [45, 46].
•

Title of the clinical trial is usually in unstructured natural language.
•

Summary of the clinical trial is also written in unstructured natural language. It consists of 2-5 sentences that describe the tested treatment, the target disease, and the main objective of the clinical trial.
•

Study type. There are mainly two study types: interventional and observational. Interventional trials assess an intervention/treatment, which can be drugs, medical devices, surgery, activity (exercise), procedure, etc. In contrast, observational trials do not involve an intervention or treatment; instead, patients receive their normal treatment while researchers observe/track their health records and analyze the results. We restrict our attention to the subset of interventional trials using drug candidates as the interventions.
•

Drug (also known as intervention or treatment). The trial document lists the drug names. We also know the category of the drug, i.e., whether it is a small-molecule drug or a biologic. The treatment usually involves one or multiple drug molecules. We can also map the drug candidate to its molecular structure, such as its SMILES string (the simplified molecular-input line-entry system (SMILES) is a line notation for describing the structure of chemical species using short ASCII strings).
•

Trial site. One trial is usually conducted in multiple trial sites so that scientists can recruit sufficient patients. Scientists also hope to reduce the bias of patient groups and enhance their diversity, so the geographic location of the trial sites is also considered.
•

Patient. The trial runner needs to recruit eligible patient volunteers at the trial sites, based on their electronic health records (EHR), to conduct the trial. The requirements for recruiting patients are provided in the eligibility criteria.
•

Electronic Health Record (EHR). An electronic health record (EHR) is the longitudinal digital record of patients and contains patients’ medical histories. The growing volume and availability of Electronic Health Record (EHR) data have sparked an interest in using machine learning methods for supporting drug development [47]. For example, machine learning approaches such as [48, 49] have been proposed to map patient EHR data to clinical trial eligibility criteria. EHR data comprises medical records of $N$ different patients. The medical record of each patient is longitudinal data.
•

Start date is the registration date of the clinical trial. The NCT ID is assigned based on the order of start dates.
•

Completion date refers to the date when the clinical trial is completed. Incomplete clinical trials have expected completion dates.

•

Sponsors of the clinical trial can be pharmaceutical companies or research institutes. For example, the trial entitled “PF-06863135 As Single Agent And In Combination With Immunomodulatory Agents In Relapse/Refractory Multiple Myeloma” (https://clinicaltrials.gov/ct2/show/NCT03269136) is supported by Pfizer; the trial entitled “Five, Plus Nuts and Beans for Kidneys” (https://clinicaltrials.gov/ct2/show/NCT03299816) is supported by Johns Hopkins University. Some trials may have multiple sponsors. Table 4 lists the top 20 sponsors that conduct the most interventional clinical trials.

Table 4: The 20 sponsors with the largest number of interventional clinical trials. We count all the clinical trials that were publicly available at https://clinicaltrials.gov/ as of February 2024. The top 20 sponsors include both pharmaceutical companies and academic institutes.

Sponsor	# of trials
GlaxoSmithKline	3017
National Cancer Institute (NCI)	2826
Pfizer	2346
M.D. Anderson Cancer Center	2099
AstraZeneca	2095
Novartis Pharmaceuticals	2071
Mayo Clinic	1911
Cairo University	1835
National Institute of Allergy and Infectious Diseases (NIAID)	1758
Massachusetts General Hospital	1725
Merck Sharp & Dohme LLC	1691
Boehringer Ingelheim	1689
Assistance Publique - Hôpitaux de Paris	1686
Eli Lilly	1678
Hoffmann-La Roche	1575
University of California, San Francisco	1404
Stanford University	1323
Duke University	1301
Sanofi	1267
Memorial Sloan Kettering Cancer Center	1253

•

Outcome. Trial outcomes are usually complex, involving many statistics and analyses. In some tasks, such as clinical trial outcome prediction, the outcome can be abstracted into binary labels, e.g., whether the tested drug passed a particular phase.
•

Failure reason. Clinical trials suffer from high failure rates due to multiple reasons, including business decisions (e.g., lack of funding, company strategy shift), poor enrollment, drug safety issues (e.g., adverse effects), and lack of efficacy.

Table 5 demonstrates a real clinical trial example.

Table 5: A real example of a clinical trial record.

Feature	Descriptions
NCTID	NCT00610792
disease	Ovarian Cancer
phase	II
title	Phase 2 Study of Twice Weekly VELCADE and CAELYX in Patients With Ovarian Cancer Failing Platinum Containing Regimens
summary	This is a Phase 2, multicenter open-label, uncontrolled 2-step design. Patients will be arranged in two groups based on the response to their last platinum containing therapy.
	The two groups are, 1) Platinum-Resistant Patients: patients with the progressive disease while on platinum-containing therapy or stable disease after at least 4 cycles; patients relapsing following an objective response while still receiving treatment; patients relapsing after an objective response within 6 months from the discontinuation of the last chemotherapy and 2) Platinum-Sensitive Patients: patients who relapsed following an objective response
study type	interventional
drug	bortezomib and pegylated liposomal doxorubicin
start date	July 2006
completed date	September 2009
sponsor	Millennium Pharmaceuticals, Inc.
outcome	withdrawn
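To make the record in Table 5 more concrete, the sketch below checks the NCT ID format and derives a duration in years from the start and completion dates. It is only an illustration under stated assumptions: the text says an NCT ID has 11 characters and begins with NCT, and the pattern below additionally assumes the remaining eight characters are digits, as in the examples given. Month-level granularity for the duration and the helper names are also assumptions.

```python
import re
from datetime import datetime

# Assumption: "NCT" followed by eight digits (11 characters in total).
NCT_PATTERN = re.compile(r"^NCT\d{8}$")


def is_valid_nct_id(nct_id: str) -> bool:
    return bool(NCT_PATTERN.match(nct_id))


def duration_in_years(start: str, completion: str) -> float:
    """Difference between two 'Month YYYY' dates, in years (month granularity).

    Parsing with %B assumes English month names (the default C locale).
    """
    start_date = datetime.strptime(start, "%B %Y")
    end_date = datetime.strptime(completion, "%B %Y")
    months = (end_date.year - start_date.year) * 12 + (end_date.month - start_date.month)
    return months / 12.0


# Values copied from Table 5.
record = {
    "NCTID": "NCT00610792",
    "start date": "July 2006",
    "completed date": "September 2009",
}
valid = is_valid_nct_id(record["NCTID"])
duration = duration_in_years(record["start date"], record["completed date"])
print(valid, round(duration, 2))
```

For this record, July 2006 to September 2009 spans 38 months, i.e., roughly 3.17 years.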

Figure 2 shows the time distribution of all the selected trials. We observe that as the demand for new treatments increases, the number of newly initiated trials rises over time. Table 2 reports some essential statistics of the curated datasets, including the number of involved trials, drugs, and diseases, and the proportion of interventional trials.

3.1 Multi-Modal Features

Clinical trials involve diverse modalities of data, as shown in the following.

Categorical Feature

For example, there are mainly two study types: interventional and observational. The intervention type can be a small-molecule drug, biologics, or surgery, etc. Clinical trial sponsors can be pharmaceutical companies or research institutes, e.g., Johns Hopkins University, or Pfizer.

Numerical Feature

Numerical features, such as the minimum/maximum age of recruited patients and the number of real/expected recruited patients, are also common in clinical trials.

Text Feature

In clinical trials, there is a large amount of text data that contains rich information for artificial intelligence modeling. For example, eligibility criteria describe the patient recruitment requirements in unstructured natural language; each clinical trial contains a summary, which consists of 2-5 natural language sentences that describe the tested treatment, the target disease to treat, and the main objective of the clinical trial. To process such datasets, we treat the text data as sequences of tokens (e.g., words). How to extract useful information from unstructured text has been extensively studied with several well-known deep neural network architectures, such as the recurrent neural network (RNN), convolutional neural network (CNN), and transformer architecture.

Drug Molecule

The most expressive and intuitive data representation of a drug molecule is the 2D molecular graph [50], where each node corresponds to an atom in the molecule while an edge corresponds to a chemical bond. The molecular graph mainly contains two essential components: node identities and node interconnectivity. The nodes’ identities include atom types, e.g., carbon, oxygen, nitrogen, etc. The nodes’ connectivity can be represented as an adjacency matrix, where the $(i,j)$-th element denotes the connectivity between the $i$-th and $j$-th nodes.
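As a small illustration of this representation, the sketch below builds the node identities and adjacency matrix for ethanol (SMILES `CCO`), with hydrogen atoms left implicit. The atom and bond lists are written by hand for the example rather than parsed from SMILES, and the helper name is hypothetical.

```python
from typing import List, Tuple

# Ethanol (SMILES "CCO"), heavy atoms only: two carbons and one oxygen.
atoms: List[str] = ["C", "C", "O"]            # node identities
bonds: List[Tuple[int, int]] = [(0, 1), (1, 2)]  # chemical bonds as index pairs


def adjacency_matrix(num_atoms: int, bond_list: List[Tuple[int, int]]) -> List[List[int]]:
    """Entry (i, j) is 1 if atoms i and j share a bond, and 0 otherwise."""
    matrix = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bond_list:
        matrix[i][j] = 1
        matrix[j][i] = 1  # bonds are undirected, so the matrix is symmetric
    return matrix


adjacency = adjacency_matrix(len(atoms), bonds)
for row in adjacency:
    print(row)
```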

MeSH Terms

The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary used to index, catalog, and search biomedical and health-related information. It consists of sets of terms in a hierarchical structure that enables more precise and efficient retrieval of information. Unlike ICD-10, which primarily classifies diseases and medical conditions, MeSH is also used to index and retrieve information on broader health-related topics such as anatomy, drugs, and diseases.

Disease Code

Diseases lie at the heart of clinical trials and drug discovery. There are several standardized disease coding systems that healthcare providers use for the electronic exchange of clinical health information, including the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), and Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) [51]. These coding systems contain disease concepts organized into hierarchies. We take the ICD-10-CM code as an example. An ICD-10-CM code is alphanumeric and has up to seven characters. Each code begins with a letter, and two numbers follow that letter. The first three characters of an ICD-10-CM code are the “category”, which describes the general type of injury or disease. A decimal point and the subcategory follow the category. For example, the code “G44” represents “Other headache syndromes”; the code “G44.31” represents “Acute post-traumatic headache”; the code “G44.311” represents “Acute post-traumatic headache, intractable”. G44.311 has two ancestors: G44 and G44.31, where an ancestor represents a higher-level category of the current code. The description of all the ICD-10-CM codes is available at https://www.icd10data.com/ICD10CM/Codes. We also illustrate the hierarchy in Figure 4.
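The ancestor relation in the G44.311 example can be sketched with simple prefix truncation. This is an assumption-laden illustration rather than the method used to build the datasets: the function name is hypothetical, and ancestors are restricted to prefixes that appear in a supplied vocabulary, which here contains only the three codes named in the text.

```python
from typing import List, Set


def icd10_ancestors(code: str, known_codes: Set[str]) -> List[str]:
    """Return known codes that are proper prefixes of `code`, nearest ancestor first.

    Prefixes shorter than the three-character category are ignored.
    """
    ancestors: List[str] = []
    prefix = code[:-1]
    while len(prefix) >= 3:
        candidate = prefix.rstrip(".")  # drop a trailing decimal point
        if candidate in known_codes and candidate not in ancestors:
            ancestors.append(candidate)
        prefix = prefix[:-1]
    return ancestors


# Illustrative vocabulary limited to the codes mentioned in the text.
known = {"G44", "G44.31", "G44.311"}
result = icd10_ancestors("G44.311", known)
print(result)
```

With a fuller vocabulary, any additional intermediate code that is a prefix of the input would also be returned.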

4 Technical Validation

To show that the processed datasets are AI-ready and of reasonable quality, we evaluate mainstream artificial intelligence algorithms on these datasets. We leverage a multi-modal deep neural network to represent the multi-modal features and concatenate all these representations to make the prediction. In this section, we first discuss the multi-modal deep learning method, then describe the experimental setup, and finally present the experimental results.

4.1 Multi-modal Deep Neural Networks

For all the classification and regression tasks, we apply state-of-the-art multi-modal deep neural networks to represent multi-modal features. Each representation is an embedding vector with continuous values. Then, we combine these representations and make the prediction. For the eligibility criteria design task, we use the OpenAI ChatGPT API (https://openai.com/index/openai-api/) with a prompt to produce eligibility criteria.

Categorical and Numerical Features

Recently, numerous tabular data processing models [52, 53, 54] have been proposed for numerical and categorical feature processing. Among them, DANets [11] stands out due to the modularity of its key component and its ability to achieve competitive performance without hyper-parameter tuning. The key component, the basic block module, supports flexible stacking, making DANets suitable as a submodule for processing numerical and categorical features. After preprocessing (e.g., normalization), three lightweight basic blocks are sequentially stacked to hierarchically select, extract, and merge features from input categorical and numerical features, ultimately yielding a 50-dimensional embedding.

Disease Code

Graph-based Attention Model (GRAM) is an attention-based neural network model that leverages the hierarchical information inherent to disease codes (medical ontologies) [10]. Specifically, each disease code is assigned a basic embedding; e.g., the disease code $d_{i}$ has a basic embedding denoted $\mathbf{e}_{i}\in\mathbb{R}^{d}$. Then, to capture the hierarchical dependencies, the embedding of the current disease $d_{i}$ (denoted $\mathbf{h}_{i}$) is represented as a weighted average of the basic embeddings of itself and its ancestors, where the weights are computed by an attention model. It is formally defined as

$$\mathbf{h}_{i}=\sum_{j\in\text{Ancestors}(i)\cup\{i\}}\alpha_{ji}\mathbf{e}_{j}, \tag{1}$$

where $\alpha_{ji}\in(0,1)$ represents the attention weight and is defined as

$$\alpha_{ji}=\frac{\exp\big(\phi([\mathbf{e}_{j}^{\top},\mathbf{e}_{i}^{\top}]^{\top})\big)}{\sum_{k\in\text{Ancestors}(i)\cup\{i\}}\exp\big(\phi([\mathbf{e}_{k}^{\top},\mathbf{e}_{i}^{\top}]^{\top})\big)},\qquad \sum_{j\in\text{Ancestors}(i)\cup\{i\}}\alpha_{ji}=1, \tag{2}$$

where the attention model $\phi(\cdot)$ is an MLP with a single hidden layer whose input is the concatenation of the two basic embeddings and whose output is a scalar; $\mathbf{e}_{i}$ serves as the query while all the ancestors’ embeddings $\big\{\mathbf{e}_{j}\big\}$ serve as the keys. $\text{Ancestors}(i)$ represents the set of all the ancestors of the disease code $d_{i}$. The GRAM model is illustrated in Figure 5.
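The attention-weighted average in Equations (1) and (2) can be illustrated with a dependency-free sketch. All numbers are randomly generated placeholders, not trained parameters; the tanh hidden layer, the embedding size, and the function names are assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Tuple


def mlp_score(x: List[float], w1: List[List[float]], b1: List[float], w2: List[float]) -> float:
    """phi(.): a single-hidden-layer MLP mapping a concatenated vector to a scalar."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))


def gram_embedding(code: str, ancestors: Dict[str, List[str]], basic: Dict[str, List[float]],
                   w1: List[List[float]], b1: List[float], w2: List[float]
                   ) -> Tuple[List[float], Dict[str, float]]:
    """Compute h_i as the softmax-weighted sum of e_j over the code and its ancestors."""
    members = ancestors[code] + [code]
    e_i = basic[code]
    # phi takes the concatenation [e_j, e_i], as in Equation (2).
    scores = [mlp_score(basic[j] + e_i, w1, b1, w2) for j in members]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [v / total for v in exps]
    h_i = [sum(a * basic[j][k] for a, j in zip(alphas, members)) for k in range(len(e_i))]
    return h_i, dict(zip(members, alphas))


rng = random.Random(0)
dim, hidden_dim = 4, 3  # toy sizes, chosen arbitrarily
basic = {c: [rng.uniform(-1, 1) for _ in range(dim)] for c in ["G44", "G44.31", "G44.311"]}
ancestors = {"G44.311": ["G44", "G44.31"]}
w1 = [[rng.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

h, alphas = gram_embedding("G44.311", ancestors, basic, w1, b1, w2)
print(alphas)
```

In a trained model, the basic embeddings and the parameters of $\phi$ would be learned jointly with the downstream task.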

MeSH Terms

Similar to modern word embeddings that represent word semantics, Medical Subject Headings (MeSH) codes from the MeSH thesaurus can also be represented using embedding approaches. MeSH-Embedding [9] has pretrained a MeSH embedding layer using the node2vec algorithm [55] with default parameters. For MeSH terms that have not been included in pretraining the MeSH embedding layer [9], we employ a new parametric embedding layer learned from scratch.
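The fallback for MeSH terms outside the pretrained vocabulary can be sketched as a lookup with two tables. This is only an illustration of the idea: the class name, the term names, and the vector values are made up, and the training step that would update the newly created vectors is omitted.

```python
import random
from typing import Dict, List


class MeshEmbedding:
    """Hypothetical lookup: pretrained vectors where available, new vectors otherwise."""

    def __init__(self, pretrained: Dict[str, List[float]], dim: int, seed: int = 0) -> None:
        self.pretrained = pretrained              # vectors taken from pretraining
        self.dim = dim
        self.new_terms: Dict[str, List[float]] = {}  # vectors to be learned from scratch
        self._rng = random.Random(seed)

    def lookup(self, term: str) -> List[float]:
        if term in self.pretrained:
            return self.pretrained[term]
        if term not in self.new_terms:
            # Randomly initialized; in practice these values would be updated during training.
            self.new_terms[term] = [self._rng.gauss(0.0, 0.1) for _ in range(self.dim)]
        return self.new_terms[term]


# Made-up three-dimensional vector standing in for a pretrained embedding.
table = MeshEmbedding({"Alzheimer Disease": [0.12, -0.40, 0.33]}, dim=3)
seen = table.lookup("Alzheimer Disease")
unseen = table.lookup("Hypothetical New Term")
print(seen, unseen)
```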

Text Feature

Bidirectional Encoder Representations from Transformers (BERT) [56] is a powerful pretraining technique that has its roots in the Transformer architecture and was specifically designed for natural language processing (NLP) tasks. In recent years, it has been widely applied to drug discovery and has proven to be effective in modeling text data. BERT is constructed by stacking multiple layers of Transformer blocks. The output of each layer is used as the input to the subsequent layer, allowing the model to learn increasingly complex representations of the input data. This results in a deep, bidirectional architecture that captures contextual information from both the preceding and following tokens in a sequence. The key advantage of using BERT for this task is that it enables the model to leverage knowledge learned from massive unlabeled data to better understand the relationships between the sequences and their corresponding properties. This allows the model to make more accurate predictions than a model trained from scratch using only the limited labeled data available for the specific task. In this paper, we use Bio-BERT [8], a variant of BERT that is pretrained on biomedical literature.

Drug Molecule

A drug molecule is essentially a 2D planar graph. A graph neural network (GNN) is a neural network architecture that takes graph-structured data as input, passes information between connected nodes and edges to capture their interactions, and learns vector representations of the graph nodes and the entire graph [57]. Message Passing Neural Network (MPNN) [7] is a popular variant of GNN, which updates the information of edges in a graph. First, on the node level, each node $v$ has a feature vector denoted $\mathbf{e}_{v}$. For example, node $v$ in a molecular graph $G$ is an atom, and $\mathbf{e}_{v}$ includes the atom type, valence, and other atomic properties. $\mathbf{e}_{v}$ can be a one-hot vector indicating the category of the node $v$. On the edge level, $\mathbf{e}_{uv}$ is the feature vector for edge $(u,v)$. $\mathcal{N}(u)$ represents the set of all the neighbor nodes of the node $u$. At the $l$-th layer, $\mathbf{m}^{(l)}_{uv}$ and $\mathbf{m}^{(l)}_{vu}$ are the directional edge embeddings representing the message from node $u$ to node $v$ and vice versa. They are iteratively updated as

$$\mathbf{m}_{uv}^{(l)}=f_{1}\bigg(\mathbf{e}_{u}\oplus\mathbf{e}_{uv}^{(l-1)}\oplus\sum_{w\in\mathcal{N}(u)\backslash v}\mathbf{m}_{wu}^{(l-1)}\bigg),\qquad l=1,\cdots,L, \tag{3}$$

where $\oplus$ denotes the concatenation of two vectors; $f_{1}(\cdot)$ is a multilayer perceptron (MLP); and $\mathbf{m}_{uv}^{(l)}$ is the message vector from node $u$ to node $v$ at the $l$-th iteration, initialized as the all-zero vector, i.e., $\mathbf{m}_{uv}^{(0)}=\mathbf{0}$, following the rule of thumb [58, 59]. After $L$ iterations ($L$ is the depth), another MLP $f_{2}(\cdot)$ is used to aggregate these messages. Each node has an embedding vector given by

$$\mathbf{h}_{u}=f_{2}\bigg(\mathbf{e}_{u}\oplus\sum_{v\in\mathcal{N}(u)}\mathbf{m}_{vu}^{(L)}\bigg). \tag{4}$$

Since we are interested in the graph-level representation $\mathbf{h}_{G}$, we further use a readout function (e.g., averaging) to aggregate all the node embeddings.
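The message-passing scheme in Equations (3) and (4), followed by an average readout, can be sketched without any deep learning library. This is a toy illustration and not the PyTorch implementation used in the experiments: $f_{1}$ and $f_{2}$ are reduced to single random linear layers with ReLU, the edge feature is treated as fixed across layers (a simplifying assumption), and all names and sizes are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def make_linear_relu(in_dim: int, out_dim: int, rng: random.Random) -> Callable[[Vector], Vector]:
    """Stand-in for an MLP: one random linear layer followed by ReLU."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

    def layer(x: Vector) -> Vector:
        assert len(x) == in_dim
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return layer


def vec_sum(vectors: List[Vector], dim: int) -> Vector:
    total = [0.0] * dim
    for vector in vectors:
        total = [t + x for t, x in zip(total, vector)]
    return total


def mpnn_graph_embedding(node_feats: Dict[int, Vector], edge_feats: Dict[Tuple[int, int], Vector],
                         hidden_dim: int, depth: int, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    node_dim = len(next(iter(node_feats.values())))
    edge_dim = len(next(iter(edge_feats.values())))
    f1 = make_linear_relu(node_dim + edge_dim + hidden_dim, hidden_dim, rng)
    f2 = make_linear_relu(node_dim + hidden_dim, hidden_dim, rng)

    # Each undirected bond yields two directed edges sharing the same feature.
    directed: Dict[Tuple[int, int], Vector] = {}
    for (u, v), feat in edge_feats.items():
        directed[(u, v)] = feat
        directed[(v, u)] = feat
    neighbors: Dict[int, List[int]] = {u: [] for u in node_feats}
    for u, v in directed:
        neighbors[u].append(v)

    messages = {edge: [0.0] * hidden_dim for edge in directed}  # m^(0) = 0
    for _ in range(depth):  # Equation (3)
        updated = {}
        for (u, v), feat in directed.items():
            incoming = vec_sum([messages[(w, u)] for w in neighbors[u] if w != v], hidden_dim)
            updated[(u, v)] = f1(node_feats[u] + feat + incoming)
        messages = updated

    node_emb = {}  # Equation (4)
    for u in node_feats:
        incoming = vec_sum([messages[(v, u)] for v in neighbors[u]], hidden_dim)
        node_emb[u] = f2(node_feats[u] + incoming)

    # Readout: average of all node embeddings gives h_G.
    return [sum(h[k] for h in node_emb.values()) / len(node_emb) for k in range(hidden_dim)]


# Ethanol (SMILES "CCO"): one-hot atom types [is_carbon, is_oxygen]; edge feature [is_single_bond].
node_feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
edge_feats = {(0, 1): [1.0], (1, 2): [1.0]}
h_G = mpnn_graph_embedding(node_feats, edge_feats, hidden_dim=4, depth=3)
print(h_G)
```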

Representation Fusion

After obtaining the representations of multi-modal data, we concatenate these representations, feed the concatenated vector into an MLP, and make the prediction. For binary classification tasks (e.g., trial approval prediction), we use the sigmoid function as the activation function in the output layer to yield a predicted probability; for multi-category classification tasks (e.g., trial failure reason identification), we use softmax as the activation function in the output layer to produce a probability distribution over all the categories; for regression tasks (e.g., trial duration prediction), we do not use an activation function in the output layer, so that it produces a continuous-valued prediction. We use the cross-entropy criterion as the loss function for classification tasks and the mean squared error (MSE) as the loss function for regression tasks.
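The fusion step and the task-specific output heads can be sketched as follows. The modality vectors and logits are invented toy values, the intermediate MLP is omitted, and the function names are hypothetical; the sketch only shows concatenation, the sigmoid and softmax heads, and the two loss functions named above.

```python
import math
from typing import List


def fuse(representations: List[List[float]]) -> List[float]:
    """Concatenate per-modality embeddings into one vector."""
    return [x for rep in representations for x in rep]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(probs: List[float], target_index: int) -> float:
    return -math.log(probs[target_index])


def mse(preds: List[float], targets: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


# Toy embeddings for three modalities (values are placeholders).
tabular, disease, text = [0.2, -0.1], [0.5], [0.3, 0.1, -0.4]
fused = fuse([tabular, disease, text])

# Hypothetical outputs of the final MLP layer for three task types.
approval_prob = sigmoid(0.0)                # binary classification head
failure_probs = softmax([1.0, 0.0, -1.0])   # multi-category classification head
duration_pred = 2.5                         # regression head: no activation

print(len(fused), approval_prob, failure_probs, cross_entropy(failure_probs, 0))
```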

Table 6: Experimental results on the curated datasets using multi-modal deep learning method.

patient dropout prediction (classification)
Phase	PR-AUC ( $\uparrow$ )	F1 ( $\uparrow$ )	ROC-AUC ( $\uparrow$ )	Precision ( $\uparrow$ )	Recall ( $\uparrow$ )	Accuracy ( $\uparrow$ )
I	0.6907 $\pm$ 0.0174	0.7176 $\pm$ 0.0137	0.7226 $\pm$ 0.0107	0.7331 $\pm$ 0.0185	0.7030 $\pm$ 0.0176	0.6738 $\pm$ 0.0129
II	0.7775 $\pm$ 0.0081	0.8628 $\pm$ 0.0053	0.7309 $\pm$ 0.0085	0.7778 $\pm$ 0.0081	0.9686 $\pm$ 0.0034	0.7634 $\pm$ 0.0080
III	0.9126 $\pm$ 0.0060	0.9512 $\pm$ 0.0031	0.7345 $\pm$ 0.0150	0.9126 $\pm$ 0.0060	0.9932 $\pm$ 0.0012	0.9073 $\pm$ 0.0056
IV	0.7093 $\pm$ 0.0101	0.8272 $\pm$ 0.0069	0.6711 $\pm$ 0.0105	0.7093 $\pm$ 0.0101	0.9924 $\pm$ 0.0025	0.7071 $\pm$ 0.0101
patient dropout prediction (regression)
Phase	MAE ( $\downarrow$ )	RMSE ( $\downarrow$ )	$R^{2}$ ( $\uparrow$ )
I	0.4451 $\pm$ 0.0030	0.4608 $\pm$ 0.0025	0.6284 $\pm$ 0.0290
II	0.4203 $\pm$ 0.0024	0.4432 $\pm$ 0.0020	0.4033 $\pm$ 0.0169
III	0.4054 $\pm$ 0.0040	0.4285 $\pm$ 0.0034	0.4172 $\pm$ 0.0154
IV	0.4180 $\pm$ 0.0038	0.4385 $\pm$ 0.0030	0.2188 $\pm$ 0.0318
adverse event prediction
Phase	PR-AUC ( $\uparrow$ )	F1 ( $\uparrow$ )	ROC-AUC ( $\uparrow$ )	Precision ( $\uparrow$ )	Recall ( $\uparrow$ )	Accuracy ( $\uparrow$ )
I	0.7259 $\pm$ 0.0300	0.7932 $\pm$ 0.0229	0.8740 $\pm$ 0.0185	0.8055 $\pm$ 0.0311	0.7824 $\pm$ 0.0315	0.8211 $\pm$ 0.0177
II	0.8201 $\pm$ 0.0085	0.8670 $\pm$ 0.0054	0.7988 $\pm$ 0.0123	0.8272 $\pm$ 0.0086	0.9109 $\pm$ 0.0067	0.7910 $\pm$ 0.0076
III	0.8938 $\pm$ 0.0098	0.9312 $\pm$ 0.0059	0.8638 $\pm$ 0.0129	0.8951 $\pm$ 0.0098	0.9704 $\pm$ 0.0061	0.8779 $\pm$ 0.0099
mortality rate prediction
Phase	PR-AUC ( $\uparrow$ )	F1 ( $\uparrow$ )	ROC-AUC ( $\uparrow$ )	Precision ( $\uparrow$ )	Recall ( $\uparrow$ )	Accuracy ( $\uparrow$ )
I	0.6103 $\pm$ 0.0382	0.7454 $\pm$ 0.0273	0.9009 $\pm$ 0.0094	0.6877 $\pm$ 0.0423	0.8160 $\pm$ 0.0313	0.8511 $\pm$ 0.0147
II	0.6697 $\pm$ 0.0149	0.7303 $\pm$ 0.0121	0.8110 $\pm$ 0.0103	0.7577 $\pm$ 0.0174	0.7051 $\pm$ 0.0161	0.7609 $\pm$ 0.0107
III	0.6282 $\pm$ 0.0229	0.7258 $\pm$ 0.0173	0.7976 $\pm$ 0.0155	0.6649 $\pm$ 0.0241	0.7994 $\pm$ 0.0143	0.7095 $\pm$ 0.0159
trial approval prediction
Phase	PR-AUC ( $\uparrow$ )	F1 ( $\uparrow$ )	ROC-AUC ( $\uparrow$ )	Precision ( $\uparrow$ )	Recall ( $\uparrow$ )	Accuracy ( $\uparrow$ )
I	0.5794 $\pm$ 0.0211	0.7011 $\pm$ 0.0159	0.7824 $\pm$ 0.0121	0.6148 $\pm$ 0.0219	0.8102 $\pm$ 0.0153	0.7012 $\pm$ 0.0124
II	0.5099 $\pm$ 0.0101	0.5895 $\pm$ 0.0081	0.7714 $\pm$ 0.0076	0.6176 $\pm$ 0.0111	0.5640 $\pm$ 0.0100	0.7089 $\pm$ 0.0077
III	0.6383 $\pm$ 0.0088	0.7416 $\pm$ 0.0074	0.7405 $\pm$ 0.0118	0.6520 $\pm$ 0.0085	0.8599 $\pm$ 0.0086	0.6677 $\pm$ 0.0074
IV	0.4137 $\pm$ 0.0171	0.5845 $\pm$ 0.0172	0.6417 $\pm$ 0.0176	0.4137 $\pm$ 0.0171	0.9969 $\pm$ 0.0019	0.4315 $\pm$ 0.0161
drug dose finding
Phase	PR-AUC ( $\uparrow$ )	F1 ( $\uparrow$ )	ROC-AUC ( $\uparrow$ )	Precision ( $\uparrow$ )	Recall ( $\uparrow$ )	Accuracy ( $\uparrow$ )
All	0.5333 $\pm$ 0.0160	0.5072 $\pm$ 0.0125	0.7617 $\pm$ 0.0073	0.5796 $\pm$ 0.0186	0.4811 $\pm$ 0.0107	0.5882 $\pm$ 0.0086
trial failure reason identification
Phase	PR-AUC ( $\uparrow$ )	F1 ( $\uparrow$ )	ROC-AUC ( $\uparrow$ )	Precision ( $\uparrow$ )	Recall ( $\uparrow$ )	Accuracy ( $\uparrow$ )
I	0.2798 $\pm$ 0.0096	0.2028 $\pm$ 0.0104	0.5599 $\pm$ 0.0166	0.1901 $\pm$ 0.0228	0.2523 $\pm$ 0.0062	0.6157 $\pm$ 0.0169
II	0.2857 $\pm$ 0.0058	0.1505 $\pm$ 0.0029	0.5627 $\pm$ 0.0081	0.1077 $\pm$ 0.0029	0.25 $\pm$ 0.0	0.4310 $\pm$ 0.0119
III	0.2880 $\pm$ 0.0086	0.1972 $\pm$ 0.0111	0.5583 $\pm$ 0.0179	0.1971 $\pm$ 0.0179	0.2670 $\pm$ 0.0076	0.4517 $\pm$ 0.0164
IV	0.2473 $\pm$ 0.0050	0.1691 $\pm$ 0.0070	0.4709 $\pm$ 0.0297	0.2215 $\pm$ 0.0221	0.2480 $\pm$ 0.0048	0.4327 $\pm$ 0.0182
trial duration prediction
Phase	MAE ( $\downarrow$ )	RMSE ( $\downarrow$ )	$R^{2}$ ( $\uparrow$ )
I	0.8334 $\pm$ 0.0133	1.2611 $\pm$ 0.0261	0.6514 $\pm$ 0.0085
II	1.2980 $\pm$ 0.0202	1.1756 $\pm$ 0.0316	0.4125 $\pm$ 0.0081
III	1.4411 $\pm$ 0.0226	1.8356 $\pm$ 0.0302	0.3148 $\pm$ 0.0085
eligibility criteria design
Phase	cosine sim. ( $\uparrow$ )	informative ( $\uparrow$ )	redundancy ( $\downarrow$ )
All	0.6988	0.6518	0.1181

4.2 Experimental Setup

Implementation Details

All the code is implemented in Python 3.8. All the deep learning models are implemented in PyTorch. We use GPT 4.0 for data annotation and some generation tasks; the total cost was around 100 US dollars. The embedding size of all the representations is set to 100. We use Adam [60] as the numerical optimizer to minimize the loss function, with an initial learning rate of $10^{-3}$ and zero weight decay. The batch size is set to 64. The maximum number of training epochs is set to 20.

Evaluation Metrics

For classification tasks, we assess the model performance using accuracy, PR-AUC (the area under the Precision-Recall curve), F1 score (the harmonic mean of precision and recall), and ROC-AUC (the Area Under the Receiver Operating Characteristic Curve). For regression tasks, we use RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), Concordance Index, and Pearson Correlation as metrics. For generation tasks (eligibility criteria design), we design some semantic metrics to measure the alignment between real and designed criteria, including text embeddings’ cosine similarity, informativeness, and redundancy, detailed in supplementary materials.

4.3 Experimental Results

In this section, we report the experimental results of multi-modal deep learning methods on all the curated tasks and datasets in Table 6. We find that directly applying a multi-modal deep learning method yields decent performance on most of the curated tasks. Specifically, of the 14 binary classification datasets (across patient dropout prediction, adverse event prediction, mortality rate prediction, and trial approval prediction), the method achieves an F1 score of at least 0.7 on 11. On regression and generation tasks, the simple multi-modal deep learning method also achieves decent performance. These results support the AI-readiness and quality of the curated datasets.

5 Usage Note

This paper extracts various properties of clinical trials and integrates them with multiple data sources. These properties are essential for analyzing and predicting different aspects of clinical trial performance and outcomes. The properties extracted include:

•

Trial duration: The length of time a clinical trial lasts, from its start date to its completion date. This helps in understanding the efficiency and planning required for trials.
•

Patient dropout rate: The proportion of participants who leave the trial before its completion. This is critical for assessing the trial’s ability to retain participants and the reliability of the results.
•

Serious adverse event: Instances of significant negative health effects observed during the trial, which are crucial for evaluating the safety profile of the treatment being tested.
•

Mortality rate: The proportion of participants who die during the trial. This measure is vital for assessing the potential risks associated with the treatment.
•

Trial approval outcome: Whether a drug can pass a certain phase of the clinical trial, which is a binary outcome indicating success or failure.
•

Trial failure reason: The identification of reasons why a clinical trial may fail, such as poor enrollment, safety issues, or lack of efficacy. This helps in improving the design of future trials.
•

Eligibility criteria design: The inclusion and exclusion criteria for participants are essential for ensuring that the right population is targeted for the trial.
•

Drug dosage: Estimating the appropriate dosage of drugs being tested to ensure safety and efficacy.

These properties and the datasets provided in this study enable researchers and AI practitioners to apply advanced machine learning models to predict and optimize various aspects of clinical trials. The datasets include multi-modal data, such as drug molecules, disease codes, textual descriptions, and categorical/numerical features, making them versatile for different predictive tasks. By leveraging these datasets, researchers can improve clinical trial design, enhance patient safety, optimize resource allocation, and ultimately accelerate the development of new medical treatments.

Intended Users

AI4Trial is intended for healthcare, biomedical, and artificial intelligence researchers and data scientists who want to apply AI algorithms and develop novel methods to tackle the problems formulated in TrialBench datasets and tasks.

Hosting and Maintenance Plan

All datasets in TrialBench are hosted and version-tracked via GitHub and are publicly available for direct download using the persistent data identifier. Our core development team is committed and has the resources to maintain and actively develop TrialBench for, at minimum, the next five years. We plan to grow TrialBench in several dimensions by including new learning tasks, datasets, and leaderboards. We welcome external contributors.

Computing Resources

We use a server with an NVIDIA GeForce RTX 3090 GPU, Intel(R) Xeon(R) CPU with 50GB RAM for all empirical experiments in this manuscript.

Limitations

Artificial intelligence for clinical trials is a vast and fast-growing field, and there are important tasks and datasets yet to be included in TrialBench. However, TrialBench is an ongoing effort, and we strive to continuously include more datasets and tasks in the future.

Licensing

Most of the data features come from ClinicalTrials.gov, a service of the U.S. National Institutes of Health that provides access to information on publicly and privately supported clinical studies. The data available on ClinicalTrials.gov is generally free for use. Some TrialBench tasks involve data from DrugBank, which is available for free to academic institutions and non-profit organizations for research and educational purposes. The subset of TrialTrove is released by Fu’s study [36] and is publicly available for non-commercial use.

Ethics Statement

The development and dissemination of the TrialBench dataset adhere to stringent ethical standards to ensure the protection of patient privacy, the integrity of the data, and the responsible use of the information. The source of the data is clearly documented, and proper attribution is given to ClinicalTrials.gov and other databases such as DrugBank [6] and TrialTrove. This transparency ensures that users of the TrialBench dataset understand the origin of the data and the context in which it was collected.

Code Availability

The curated dataset and relevant code are publicly available at https://github.com/ML2Health/ML2ClinicalTrials/tree/main/AI4Trial.

Competing Interests

The authors declare no competing interests.

References

[1] Eichler, H.-G. & Sweeney, F. The evolution of clinical trials: Can we address the challenges of the future? Clinical trials 15, 27–32 (2018).
[2] Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B 12, 3049–3062 (2022).
[3] Martin, L., Hutchens, M., Hawkins, C. & Radnov, A. How much do clinical trials cost. Nat Rev Drug Discov 16, 381–382 (2017).
[4] Huang, K. et al. Therapeutics data commons: machine learning datasets and tasks for therapeutics. NeurIPS Track Datasets and Benchmarks (2021).
[5] Huang, K. et al. Artificial intelligence foundation for therapeutic science. Nature Chemical Biology 1–4 (2022).
[6] Wishart, D. S. et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research 46, D1074–D1082 (2018).
[7] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In International conference on machine learning, 1263–1272 (PMLR, 2017).
[8] Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
[9] Helboukkouri, I. Mesh embeddings. https://github.com/helboukkouri/mesh-embeddings (ongoing). [Accessed: 2024-06-02].
[10] Choi, E., Bahadori, M. T., Song, L., Stewart, W. F. & Sun, J. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 787–795 (2017).
[11] Chen, J., Liao, K., Wan, Y., Chen, D. Z. & Wu, J. DANETs: Deep abstract networks for tabular data classification and regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 3930–3938 (2022).
[12] McDowell, A. How long do clinical trial phases take? https://www.antidote.me/blog/how-long-do-clinical-trial-phases-take (2024). [Accessed: 2024-06-02].
[13] Harrison, R. K. Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov 15, 817–818 (2016).
[14] Long, B., Lai, S.-W., Wu, J. & Bellur, S. Predicting Phase I lymphoma clinical trial durations using machine learning: An in-depth analysis and broad application insights. Clinics and Practice 14, 69–88 (2023).
[15] Alexander, W. The uphill path to successful clinical trials: keeping patients enrolled. Pharmacy and Therapeutics 38, 225 (2013).
[16] Britton, A., Murray, D., Bulstrode, C., McPherson, K. & Denham, R. Loss to follow-up: does it matter? The Lancet 345, 1511–1512 (1995).
[17] Sharma, S. K., Tobin, J. D. & Brant, L. J. Factors affecting attrition in the baltimore longitudinal study of aging. Experimental gerontology 21, 329–340 (1986).
[18] Francisco, S. Adverse events in clinical trials: Definitions and documentation. Advers. Event Tin Clininal Trials Defin. Doc 1–30 (2014).
[19] Phillips, R., Hazell, L., Sauzet, O. & Cornelius, V. Analysis and reporting of adverse events in randomised controlled trials: a review. BMJ Open 9, e024537 (2019).
[20] Silverman, H. Ethical issues during the conduct of clinical trials. Proceedings of the American Thoracic Society 4, 180–184 (2007).
[21] Friedman, L. M., Furberg, C. D., DeMets, D. L., Reboussin, D. M. & Granger, C. B. Fundamentals of clinical trials (Springer, 2015).
[22] He, H., Liu, L., Morin, E. E., Liu, M. & Schwendeman, A. Survey of clinical translation of cancer nanomedicines—lessons learned from successes and failures. Accounts of chemical research 52, 2445–2461 (2019).
[23] Food and Drug Administration. Enhancing the diversity of clinical trial populations — eligibility criteria, enrollment practices, and trial designs guidance for industry. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/enhancing-diversity-clinical-trial-populations-eligibility-criteria-enrollment-practices-and-trial (2020). [Accessed: 2024-06-02].
[24] Van Spall, H. G., Toren, A., Kiss, A. & Fowler, R. A. Eligibility criteria of randomized controlled trials published in high-impact general medical journals: a systematic sampling review. JAMA 297, 1233–1240 (2007).
[25] Huang, G. D. et al. Clinical trials recruitment planning: a proposed framework from the clinical trials transformation initiative. Contemporary Clinical Trials 66, 74–79 (2018).
[26] Ting, N. Dose finding in drug development (Springer Science & Business Media, 2006).
[27] Glick, H. A., Doshi, J. A., Sonnad, S. S. & Polsky, D. Economic evaluation in clinical trials (OUP Oxford, 2014).
[28] Singh, S. & Loke, Y. K. Drug safety assessment in clinical trials: methodological challenges and opportunities. Trials 13, 1–8 (2012).
[29] Van Gerven, J. & Bonelli, M. Commentary on the ema guideline on strategies to identify and mitigate risks for first-in-human and early clinical trials with investigational medicinal products. British Journal of Clinical Pharmacology 84, 1401 (2018).
[30] Kobak, K. A., Kane, J. M., Thase, M. E. & Nierenberg, A. A. Why do clinical trials fail?: the problem of measurement error in clinical trials: time to test new paradigms? Journal of clinical psychopharmacology 27, 1–5 (2007).
[31] Chow, S.-C., Shao, J., Wang, H. & Lokhnygina, Y. Sample size calculations in clinical research (chapman and hall/CRC, 2017).
[32] Peters-Lawrence, M. H. et al. Clinical trial implementation and recruitment: lessons learned from the early closure of a randomized clinical trial. Contemporary clinical trials 33, 291–297 (2012).
[33] Chang, Y.-T. et al. Integrated identification of disease specific pathways using multi-omics data. bioRxiv 666065 (2019).
[34] Sheridan, R. P. Time-split cross-validation as a method for estimating the goodness of prospective prediction. Journal of chemical information and modeling 53, 783–790 (2013).
[35] Zhang, B. et al. DDN2.0: R and python packages for differential dependency network analysis of biological systems. bioRxiv 2021–04 (2021).
[36] Fu, T., Huang, K., Xiao, C., Glass, L. M. & Sun, J. HINT: Hierarchical interaction network for clinical-trial-outcome predictions. Patterns 3, 100445 (2022).
[37] Fu, T., Huang, K. & Sun, J. Automated prediction of clinical trial outcome (2023). US Patent App. 17/749,065.
[38] Fu, Y. et al. DDN3.0: Determining significant rewiring of biological network structure with differential dependency networks. Bioinformatics btae376 (2024).
[39] Lu, Y., Sato, K. & Wang, J. Deep learning based multi-label image classification of protest activities. arXiv preprint arXiv:2301.04212 (2023).
[40] Chen, T., Hao, N., Van Rechem, C., Chen, J. & Fu, T. Uncertainty quantification and interpretability for clinical trial approval prediction. Health Data Science 4, 0126 (2024).
[41] Chen, T., Hao, N., Lu, Y. & Van Rechem, C. Uncertainty quantification on clinical trial outcome prediction. arXiv preprint arXiv:2401.03482 (2024).
[42] Wang, Y. et al. TWIN-GPT: Digital twins for clinical trials via large language model. arXiv preprint arXiv:2404.01273 (2024).
[43] Lu, Y. Multi-omics Data Integration for Identifying Disease Specific Biological Pathways. Ph.D. thesis, Virginia Tech (2018).
[44] Wu, C.-T. et al. Cosbin: cosine score-based iterative normalization of biologically diverse samples. Bioinformatics Advances 2, vbac076 (2022).
[45] Lu, Y. et al. COT: an efficient and accurate method for detecting marker genes among many subtypes. Bioinformatics Advances 2, vbac037 (2022).
[46] Chen, L. et al. Data-driven detection of subtype-specific differentially expressed genes. Scientific reports 11, 332 (2021).
[47] Fu, T., Hoang, T. N., Xiao, C. & Sun, J. DDL: Deep dictionary learning for predictive phenotyping. In IJCAI: proceedings of the conference, vol. 2019, 5857 (NIH Public Access, 2019).
[48] Zhang, X., Xiao, C., Glass, L. M. & Sun, J. Deepenroll: Patient-trial matching with deep embedding and entailment prediction. In Proceedings of The Web Conference 2020, 1029–1037 (2020).
[49] Gao, J., Xiao, C., Glass, L. M. & Sun, J. COMPOSE: Cross-modal pseudo-siamese network for patient trial matching. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 803–812 (2020).
[50] Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and modeling 57, 1757–1772 (2017).
[51] Anker, S. D., Morley, J. E. & von Haehling, S. Welcome to the ICD-10 code for sarcopenia. Journal of cachexia, sarcopenia and muscle 7, 512–514 (2016).
[52] Chen, J. et al. Excelformer: Can a dnn be a sure bet for tabular prediction? In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2024).
[53] Chen, J., Liao, K., Fang, Y., Chen, D. & Wu, J. Tabcaps: A capsule neural network for tabular data classification with bow routing. In The Eleventh International Conference on Learning Representations (2022).
[54] Gorishniy, Y., Rubachev, I., Khrulkov, V. & Babenko, A. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems 34, 18932–18943 (2021).
[55] Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864 (2016).
[56] Devlin, J., Chang, M., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 4171–4186 (Association for Computational Linguistics, 2019).
[57] Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. The International Conference on Learning Representations (ICLR) (2016).
[58] Fu, T., Xiao, C. & Sun, J. CORE: Automatic molecule optimization using copy and refine strategy. AAAI (2020).
[59] Fu, T., Xiao, C., Li, X., Glass, L. M. & Sun, J. MIMOSA: Multi-constraint molecule sampling for molecule optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 125–133 (2021).
[60] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv (2014).

Appendix A Evaluation Metrics

Classification Metrics

Binary classification is the most widely used classification setting. For example, in clinical trial approval prediction, labels are usually binary: 1 denotes that the trial passes, while 0 indicates that it fails. The prediction can be the probability that the trial passes. In binary classification, test data points fall into four groups based on their ground truth and the model’s prediction:

1.

a positive sample that is correctly predicted as positive, also known as a True Positive (TP);
2.

a negative sample that is wrongly predicted as positive, also known as a False Positive (FP);
3.

a negative sample that is correctly predicted as negative, also known as a True Negative (TN);
4.

a positive sample that is wrongly predicted as negative, also known as a False Negative (FN).

We use different evaluation metrics for multi-category classification and binary classification.

•

Precision. The precision is the performance of a classifier on the samples that are predicted as positive. It is formally defined as $\text{precision}=\frac{TP}{TP+FP}$ .
•

Recall. The recall score measures the performance of the classifier to find all the positive samples. It is formally defined as $\text{recall}=\frac{TP}{TP+FN}$ .
•

Accuracy. Accuracy is the fraction of correctly predicted/classified samples. It is formally defined as $\text{accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$ . It is worth mentioning that when the data samples are highly imbalanced, e.g., positive samples are far more than negative ones, the accuracy is not a proper metric as it is trivial to achieve high accuracy by always predicting the majority class.
•

PR-AUC (Precision-Recall Area Under Curve). The area under the Precision-Recall curve summarizes the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
•

F1. The F1 score is the harmonic mean of the precision and recall. That is, $\text{F1}=\frac{2}{\frac{1}{\text{precision }}+\frac{1}{\text{recall}}}$ .
•

ROC-AUC. The Area Under the Receiver Operating Characteristic Curve summarizes the trade-off between the true positive rate and the false positive rate for a predictive model using different probability thresholds. It is also called AUROC in some literature.

For all these metrics, the numerical values range from 0 to 1, and a higher value represents better performance. We typically report multiple metrics to measure the performance comprehensively.
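As an illustration of the definitions above, the following minimal sketch computes precision, recall, accuracy, and F1 from TP/FP/TN/FN counts, and ROC-AUC as the fraction of positive-negative pairs that are ranked correctly. The function names, the 0.5 decision threshold, and the toy labels are assumptions made for this example; this is not the evaluation code shipped with TrialBench.

```python
from typing import Dict, List


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Dict[str, int]:
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["TP"] += 1
        elif truth == 0 and pred == 1:
            counts["FP"] += 1
        elif truth == 0 and pred == 0:
            counts["TN"] += 1
        else:  # truth == 1 and pred == 0
            counts["FN"] += 1
    return counts


def threshold_metrics(y_true: List[int], y_prob: List[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Precision, recall, accuracy, and F1 at a fixed (assumed) threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    c = confusion_counts(y_true, y_pred)
    tp, fp, tn, fn = c["TP"], c["FP"], c["TN"], c["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


def roc_auc(y_true: List[int], y_prob: List[float]) -> float:
    """ROC-AUC via pair counting: the probability that a random positive
    receives a higher score than a random negative (ties count as 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical labels and predicted probabilities).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.4, 0.2, 0.1]
print(threshold_metrics(y_true, y_prob))
print(roc_auc(y_true, y_prob))
```

The pair-counting form of ROC-AUC is quadratic in the number of samples; it is used here only because it makes the definition explicit.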

Regression Metrics

In the regression task, both ground truth and prediction are continuous values.

•

Mean Squared Error (MSE) measures the average of the squares of the difference between the forecasted value and the actual value. It is defined as $\text{MSE}=\frac{1}{N}\sum^{N}_{i=1}(y_{i}-{\widehat{y}}_{i})^{2}$ , where $N$ is the size of the test set; $y_{i}$ and $\widehat{y}_{i}$ denote the ground truth and predicted score of the $i$ -th data sample in the test set, respectively. MSE value ranges from 0 to positive infinity. A lower MSE value indicates better performance.
•

Mean Absolute Error (MAE) measures the average absolute difference between the predicted value and the actual value. It is defined as $\text{MAE}=\frac{1}{N}\sum^{N}_{i=1}|y_{i}-{\widehat{y}}_{i}|$ , where $N$ is the size of the test set; $y_{i}$ and $\widehat{y}_{i}$ denote the ground truth and predicted score of the $i$ -th data sample in the test set, respectively. MAE value ranges from 0 to positive infinity. Because the errors are not squared, MAE is less sensitive to outliers than MSE. A lower MAE value indicates better performance.
•

Concordance Index (also known as c-index) is defined as the number of concordant pairs divided by the total number of possible evaluation pairs, i.e., the proportion of pairs whose predictions are ordered the same way as their ground truths. A higher value represents better prediction performance.

•

Pearson Correlation (PC) is defined as the covariance of the prediction and the ground truth divided by the product of their standard deviations. For two random variables $x$ and $y$ , Pearson Correlation is formally defined as

\text{PC}=\frac{\mathbb{E}[(x-\mu_{x})({y}-\mu_{y})]}{\sigma_{x}\sigma_{y}},

(5)

In the regression task, suppose there are $N$ data points in the test set, $y_{i}$ is the ground truth of the $i$ -th data sample, $\widehat{y}_{i}$ is the prediction for $i$ -th data, Pearson Correlation becomes

\text{PC}=\frac{\sum_{i=1}^{N}(y_{i}-\mu_{y})(\widehat{y}_{i}-\mu_{\widehat{y}})}{\sigma_{y}\sigma_{\widehat{y}}},

(6)

where $\mu_{y}=\frac{1}{N}\sum_{j=1}^{N}y_{j}$ and $\mu_{\widehat{y}}=\frac{1}{N}\sum_{j=1}^{N}\widehat{y}_{j}$ are the means of the ground truth and the prediction, respectively. $\sigma_{y}=\sqrt{\sum_{i=1}^{N}(y_{i}-\mu_{y})^{2}}$ and $\sigma_{\widehat{y}}=\sqrt{\sum_{i=1}^{N}(\widehat{y}_{i}-\mu_{\widehat{y}})^{2}}$ are the (unnormalized) standard deviations of the ground truth and the prediction, respectively; the $\frac{1}{N}$ factors cancel between the numerator and the denominator. The value ranges from -1 to 1. A higher Pearson correlation value indicates better performance.

•

R-squared ( $R^{2}$ ) score is defined as the proportion of the variation in the dependent variable that is predictable from the independent variable(s). It is also known as the coefficient of determination in statistics. Suppose we have $N$ continuous ground truth values $y_{1},\cdots,y_{N}$ and $N$ corresponding predictions $\widehat{y_{1}},\cdots,\widehat{y_{N}}$ . The difference $y_{i}-\widehat{y_{i}}$ is called the residual. We then define the residual sum of squares as

\text{SS}_{\text{res}}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\widehat{y_{i}})^{2},

(7)

and define the total sum of squares as

\text{SS}_{\text{total}}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\bar{y})^{2},

(8)

where $\bar{y}=\frac{1}{N}\sum_{i=1}^{N}y_{i}$ is the mean of the ground truth. Then $R^{2}$ score is defined as

R^{2}=1-\frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{total}}}.

(9)

Higher $R^{2}$ scores indicate better performance. For a perfect prediction model, the prediction exactly matches the ground truth, so $\text{SS}_{\text{res}}=0$ and $R^{2}=1$ . A weak predictor that always predicts $\bar{y}$ yields $R^{2}=0$ . A predictor that performs worse than this constant baseline yields a negative $R^{2}$ score.

•

Spearman’s rank correlation coefficient is the Pearson correlation coefficient between the rank variables. Given two groups of variables $X$ and $Y$ , it is formally defined as

\text{Spearman}=\text{PC}(R(X),R(Y))=\frac{\mathbb{E}\big[(R(X)-\mu_{R(X)})(R(Y)-\mu_{R(Y)})\big]}{\sigma_{R(X)}\sigma_{R(Y)}},

(10)

where PC is Pearson Correlation defined above; $R(X)$ is the rank variable of $X$ , whose value is from the set $\{1,2,\cdots,|R(X)|\}$ . In regression tasks, $X$ and $Y$ represent the ground truth and prediction, respectively.
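The pointwise and correlation metrics in this list (MSE, MAE, Pearson correlation, and $R^{2}$) can be sketched in a few lines of plain Python. The function names and the toy values are assumptions for illustration only. In `pearson`, each $\sigma$ is taken as the square root of the summed squared deviations so that the ratio stays within $[-1, 1]$; in `r2`, the $\frac{1}{N}$ factors of the two sums of squares cancel and are therefore omitted.

```python
import math
from typing import Sequence


def mse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def pearson(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Pearson correlation between ground truth and prediction."""
    mu_y = sum(y) / len(y)
    mu_h = sum(y_hat) / len(y_hat)
    cov = sum((a - mu_y) * (b - mu_h) for a, b in zip(y, y_hat))
    sigma_y = math.sqrt(sum((a - mu_y) ** 2 for a in y))
    sigma_h = math.sqrt(sum((b - mu_h) ** 2 for b in y_hat))
    return cov / (sigma_y * sigma_h)


def r2(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_total."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_total = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_total


# Toy example (hypothetical values, e.g. trial durations in years).
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 4.0]
print("MSE:", mse(y, y_hat))
print("RMSE:", math.sqrt(mse(y, y_hat)))
print("MAE:", mae(y, y_hat))
print("Pearson:", pearson(y, y_hat))
print("R2:", r2(y, y_hat))
```

RMSE is simply the square root of MSE, so for any set of predictions it is never smaller than MAE.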

In some cases, we are interested in the difference between ground truth and prediction, and would typically choose mean squared error or mean absolute error as the evaluation metrics. Sometimes, we are interested in the correlation between ground truth and prediction. In these cases, we prefer to use Pearson correlation and R-squared score ( $R^{2}$ score). Also, in some applications, the rank-ordering of predictions is more important than the difference between prediction and ground truth. In this case, we prefer to use Spearman’s rank correlation coefficient or concordance index.
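For the rank-based metrics, a minimal sketch is given below. It assumes average ranks for tied values in Spearman's coefficient and, for the concordance index, skips pairs whose ground truths are tied and gives half credit to pairs whose predictions are tied; these tie-handling conventions and all function names are choices made for this example rather than details stated in the text.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long sequences."""
    mu_x, mu_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mu_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mu_y) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Spearman's coefficient: Pearson correlation of the rank variables."""
    return pearson(average_ranks(y), average_ranks(y_hat))


def concordance_index(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Fraction of comparable pairs ordered the same way by y and y_hat."""
    concordant, comparable = 0.0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue  # tied ground truth: pair is not comparable
            comparable += 1
            d_true, d_pred = y[i] - y[j], y_hat[i] - y_hat[j]
            if d_pred == 0:
                concordant += 0.5  # tied prediction: half credit
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    return concordant / comparable


# Toy example (hypothetical values).
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.2, 0.9, 3.5, 3.1]
print("Spearman:", spearman(y, y_hat))
print("C-index:", concordance_index(y, y_hat))
```

Both metrics depend only on the ordering of the predictions, so any strictly increasing transformation of `y_hat` leaves them unchanged.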

Generation Metrics

In the generation tasks (eligibility criteria design, trial summarization), we design several metrics to evaluate the quality of the generated text content.

•

Cosine Similarity between text embeddings. To measure the semantic similarity, we use the OpenAI API (GPT 4.0; https://openai.com/index/openai-api/) to obtain the text embeddings of the generated and ground-truth content and then calculate their cosine similarity.
•

Informativeness. In text generation, informativeness is one of the key metrics for evaluating the quality of generated text. Informativeness measures the richness of useful information provided by the generated text. High informativeness usually means the text contains more relevant details, facts, and insights, rather than simply repeating known information or containing empty content. We use TF-IDF (Term Frequency-Inverse Document Frequency) to measure the informativeness of the generated text relative to a reference text: (1) TF-IDF vectorization: Combine the generated text and reference text, and vectorize them using the TF-IDF method. TF-IDF is a statistical method used to evaluate the importance of a word in a document. (2) cosine similarity: Calculate the cosine similarity between the generated text’s TF-IDF vector and the reference text’s average TF-IDF vector. The higher the cosine similarity, the more similar the generated text is to the reference text. (3) informativeness score: The informativeness score is calculated as 1 - cosine similarity. A higher score indicates that the generated text contains more different information relative to the reference text, thus having greater informativeness.
•

Redundancy. Redundancy refers to the presence of unnecessary repetitive information or extraneous content in the text, which can affect the quality and readability of the text. Methods for evaluating redundancy typically involve detecting repeated words, phrases, or sentences. We need the following four steps for measuring the redundancy of generated text: (1) N-gram list: Split the generated text into a list of words and generate an N-gram list. (2) count N-gram occurrences: Count the number of times each N-gram appears in the text. (3) calculate unique N-gram quantity: Count the number of N-grams that appear only once. (4) calculate redundancy: Redundancy is calculated as 1 minus the ratio of the number of unique N-grams to the total number of N-grams.
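The informativeness and redundancy steps above can be sketched as follows. Several details are assumptions made for this illustration because the text does not fix them: whitespace tokenization with lowercasing, a smoothed IDF of the form $\log\frac{1+n}{1+\text{df}}+1$, a single reference text (so the "average" reference vector is that text's own vector), bigrams ($N=2$) as the default, and the total number of N-grams counted with repetitions. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as term -> weight."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """TF-IDF vectors over a small corpus (assumed smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def informativeness(generated: str, reference: str) -> float:
    """1 - cosine similarity between the two TF-IDF vectors."""
    gen_vec, ref_vec = tfidf_vectors([generated, reference])
    return 1.0 - cosine(gen_vec, ref_vec)


def redundancy(text: str, n: int = 2) -> float:
    """1 - (N-grams occurring exactly once) / (total N-gram occurrences)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    unique = sum(1 for c in counts.values() if c == 1)
    return 1.0 - unique / len(ngrams)


# Toy example (hypothetical criteria text).
generated = "age over 18 years age over 18 years"
reference = "adults aged 18 years or older"
print("informativeness:", informativeness(generated, reference))
print("redundancy:", redundancy(generated))
```

Under these assumptions, a text with no repeated bigram has redundancy 0, and a generated text identical to the reference has informativeness 0.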