Current Methods for Drug Property Prediction in the Real World
Authors:
Jacob Green,
Cecilia Cabrera Diaz,
Maximilian A. H. Jakobs,
Andrea Dimitracopoulos,
Mark van der Wilk,
Ryan D. Greenhalgh
Abstract:
Predicting drug properties is key in drug discovery to enable de-risking of assets before expensive clinical trials, and to find highly active compounds faster. Interest from the Machine Learning community has led to the release of a variety of benchmark datasets and proposed methods. However, it remains unclear for practitioners which method or approach is most suitable, as different papers benchmark on different datasets and methods, leading to varying conclusions that are not easily compared. Our large-scale empirical study links together numerous earlier works on different datasets and methods, thus offering a comprehensive overview of the existing property classes, datasets, and their interactions with different methods. We emphasise the importance of uncertainty quantification and the time, and therefore cost, of applying these methods in the drug development decision-making cycle. We discover that the best method depends on the dataset, and that engineered features with classical ML methods often outperform deep learning. Specifically, QSAR datasets are typically best analysed with classical methods such as Gaussian Processes, while ADMET datasets are sometimes better described by Trees or Deep Learning methods such as Graph Neural Networks or language models. Our work highlights that practitioners do not yet have a straightforward, black-box procedure to rely on, and sets the precedent for creating practitioner-relevant benchmarks. Deep learning approaches must be proven on these benchmarks to become the practical method of choice in drug property prediction.
Submitted 25 July, 2023;
originally announced September 2023.
Multi-wavelength scaling relations in galaxy groups: a detailed comparison of GAMA and KiDS observations to BAHAMAS simulations
Authors:
Arthur Jakobs,
Massimo Viola,
Ian McCarthy,
Ludovic van Waerbeke,
Henk Hoekstra,
Aaron Robotham,
Gary Hinshaw,
Alireza Hojjati,
Hideki Tanimura,
Tilman Tröster,
Ivan Baldry,
Catherine Heymans,
Hendrik Hildebrandt,
Konrad Kuijken,
Peder Norberg,
Joop Schaye,
Cristóbal Sifon,
Edo van Uitert,
Edwin Valentijn,
Gijs Verdoes Kleijn,
Lingyu Wang
Abstract:
We study the scaling relations between the baryonic content and total mass of groups of galaxies, as these systems provide a unique way to examine the role of non-gravitational processes in structure formation. Using Planck and ROSAT data, we conduct detailed comparisons of the stacked thermal Sunyaev-Zel'dovich (tSZ) and X-ray scaling relations of galaxy groups found in the Galaxy And Mass Assembly (GAMA) survey and the BAHAMAS hydrodynamical simulation. We use weak gravitational lensing data from the Kilo Degree Survey (KiDS) to determine the average halo mass of the studied systems. We analyse the simulation in the same way, using realistic weak lensing, X-ray, and tSZ synthetic observations. Furthermore, to keep selection biases under control, we apply exactly the same galaxy selection and group identification procedures to the observations and the simulation. From this comparison, we find that the simulations reproduce the richness, size, and stellar mass functions of GAMA groups, as well as the stacked weak lensing and tSZ signals in bins of group stellar mass. However, the simulations predict X-ray luminosities that are higher than observed for this optically-selected group sample. As the same simulations were previously shown to match the luminosities of X-ray-selected groups, this suggests that X-ray-selected systems may form a biased subset. Finally, we demonstrate that our observational processing of the X-ray and tSZ signals is free of significant biases. We find, however, that our optical group selection procedure has some room for improvement.
Submitted 9 April, 2021; v1 submitted 14 December, 2017;
originally announced December 2017.