
Challenges with the application and adoption of artificial intelligence for drug discovery

Authors: Ghita Ghislat, Saiveth Hernandez-Hernandez, Chayanit Piwajanusorn, Pedro J. Ballester

Abstract: Artificial intelligence (AI) is exhibiting tremendous potential to reduce the massive costs and long timescales of drug discovery. There are however important challenges limiting the impact and scope of AI models. Typically, these models are evaluated on benchmarks that are unlikely to anticipate their prospective performance, which inadvertently misguides their development. Indeed, while all the developed models excel in a selected benchmark, only a small proportion of them are ultimately reported to have prospective value (e.g. by discovering potent and innovative drug leads for a therapeutic target). Here we discuss a range of data issues (bias, inconsistency, skewness, irrelevance, small size, high dimensionality), how they challenge AI models and which issue-specific mitigations have been effective. Next, we point out the challenges faced by uncertainty quantification techniques aimed at enhancing these AI models. We also discuss how conceptual errors, unrealistic benchmarks and performance misestimation can confound the evaluation of models and thus their development. Lastly, we explain how human bias, whether from AI experts or drug discovery experts, constitutes another challenge that can be alleviated with prospective studies.

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2406.00873

Scaffold Splits Overestimate Virtual Screening Performance

Authors: Qianrong Guo, Saiveth Hernandez-Hernandez, Pedro J Ballester

Abstract: Virtual Screening (VS) of vast compound libraries guided by Artificial Intelligence (AI) models is a highly productive approach to early drug discovery. Data splitting is crucial for better benchmarking of such AI models. Traditional random data splits produce similar molecules between training and test sets, conflicting with the reality of VS libraries which mostly contain structurally distinct compounds. Scaffold split, grouping molecules by shared core structure, is widely considered to reflect this real-world scenario. However, here we show that the scaffold split also overestimates VS performance. The reason is that molecules with different chemical scaffolds are often similar, which hence introduces unrealistically high similarities between training molecules and test molecules following a scaffold split. Our study examined three representative AI models on 60 NCI-60 datasets, each with approximately 30,000 to 50,000 molecules tested on a different cancer cell line. Each dataset was split with three methods: scaffold, Butina clustering and the more accurate Uniform Manifold Approximation and Projection (UMAP) clustering. Regardless of the model, model performance is much worse with UMAP splits from the results of the 2100 models trained and evaluated for each algorithm and split. These robust results demonstrate the need for more realistic data splits to tune, compare, and select models for VS. For the same reason, avoiding the scaffold split is also recommended for other molecular property prediction problems. The code to reproduce these results is available at https://github.com/ScaffoldSplitsOverestimateVS

Submitted 30 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.
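
The scaffold, Butina and UMAP splits named in the abstract above all rely on the same mechanism: each molecule is assigned to a group, and whole groups go to either the training set or the test set. The Python sketch below illustrates only that mechanism and is not the authors' code. The function `group_split`, its parameters and the example labels are hypothetical, and computing real scaffold or cluster labels would need a cheminformatics toolkit that is not shown here.

```python
import random
from collections import defaultdict


def group_split(group_ids, test_fraction=0.2, seed=0):
    """Split item indices so that no group is shared by train and test.

    group_ids: one label per item (assumed to be a scaffold or cluster ID).
    Returns two sorted lists of indices: (train, test).
    """
    # Collect the indices belonging to each group.
    groups = defaultdict(list)
    for index, group in enumerate(group_ids):
        groups[group].append(index)

    # Shuffle the group labels reproducibly.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Fill the test set one whole group at a time until it reaches
    # the requested size; every remaining group goes to training.
    target = test_fraction * len(group_ids)
    train, test = [], []
    for key in keys:
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return sorted(train), sorted(test)


# Hypothetical cluster labels for ten molecules.
labels = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
train_idx, test_idx = group_split(labels, test_fraction=0.3)
```

Because groups are never divided, the test set can be somewhat larger than the requested fraction. How dissimilar the test molecules are from the training molecules depends entirely on how the group labels were produced, which is the point the abstract makes about scaffold labels.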

arXiv:1212.0504

doi 10.1371/journal.pone.0061318

Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties

Authors: Michael P. Menden, Francesco Iorio, Mathew Garnett, Ultan McDermott, Cyril Benes, Pedro J. Ballester, Julio Saez-Rodriguez

Abstract: Predicting the response of a specific cancer to a therapy is a major goal in modern oncology that should ultimately lead to a personalised treatment. High-throughput screenings of potentially active compounds against a panel of genomically heterogeneous cancer cell lines have unveiled multiple relationships between genomic alterations and drug responses. Various computational approaches have been proposed to predict sensitivity based on genomic features, while others have used the chemical properties of the drugs to ascertain their effect. In an effort to integrate these complementary approaches, we developed machine learning models to predict the response of cancer cell lines to drug treatment, quantified through IC50 values, based on both the genomic features of the cell lines and the chemical properties of the considered drugs. Models predicted IC50 values in an 8-fold cross-validation and an independent blind test with coefficient of determination R² of 0.72 and 0.64 respectively. Furthermore, models were able to predict with comparable accuracy (R² of 0.61) IC50s of cell lines from a tissue not used in the training stage. Our in silico models can be used to optimise the experimental design of drug-cell screenings by estimating a large proportion of missing IC50 values rather than experimentally measuring them. The implications of our results go beyond virtual drug screening design: potentially thousands of drugs could be probed in silico to systematically test their potential efficacy as anti-tumour agents based on their structure, thus providing a computational framework to identify new drug repositioning opportunities as well as ultimately be useful for personalized medicine by linking the genomic traits of patients to drug sensitivity.

Submitted 18 March, 2013; v1 submitted 3 December, 2012; originally announced December 2012.

Comments: 26 pages, 7 figures, including supplemental information, presented by Michael Menden at the 5th annual RECOMB Conference on Regulatory and Systems Genomics with DREAM Challenges; accepted in PLOS ONE

Showing 1–3 of 3 results for author: Ballester, P J