The N-ary in the Coal Mine: Avoiding Mixture Model Failure with Proper Validation
Authors:
Travis Maxfield,
Joshua Hochuli,
James Wellnitz,
Cleber Melo-Filho,
Konstantin I. Popov,
Eugene Muratov,
Alex Tropsha
Abstract:
Modeling the properties of chemical mixtures is a difficult but important part of any modeling process intended to apply to the often messy and impure phenomena of everyday life, in areas such as food and environmental safety and healthcare. Part of this difficulty stems from the increased complexity of designing suitable model validation schemes for mixture data, which previous work has addressed only for binary mixture models. We extend these previously defined validation strategies for QSAR modeling of binary mixtures to the more complex case of general, $N$-ary mixtures and argue that these strategies are applicable to many modeling tasks beyond simple chemical mixtures. Additionally, we propose a method of establishing a baseline model performance for each mixture dataset to be used in model selection comparisons. This baseline is intended to account for the statistical dependence generically present between the properties of mixtures that share constituents. We contend that without such a baseline, model performance can be dramatically overestimated, and we demonstrate this with multiple case studies using real and simulated data.
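The concern about mixtures that share constituents can be illustrated with a small sketch. This is not the authors' procedure: it is one simple way, under assumed names, to build a split in which a held-out set of constituents never appears in training. The function `split_by_constituent`, its parameters, and the toy mixtures are all hypothetical.

```python
import random


def split_by_constituent(mixtures, test_fraction=0.2, seed=0):
    """Illustrative split: hold out a random subset of constituents.

    A mixture goes to the test set if it contains at least one held-out
    constituent; otherwise it goes to the training set. Training mixtures
    therefore never contain a held-out constituent.
    (Hypothetical helper, not taken from the paper.)
    """
    rng = random.Random(seed)
    # Collect every distinct constituent across all mixtures.
    constituents = sorted({c for mixture in mixtures for c in mixture})
    rng.shuffle(constituents)
    n_held = max(1, round(test_fraction * len(constituents)))
    held_out = set(constituents[:n_held])

    train_idx, test_idx = [], []
    for i, mixture in enumerate(mixtures):
        if set(mixture).isdisjoint(held_out):
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx, held_out


# Toy N-ary mixtures, each a tuple of constituent identifiers (made up).
toy_mixtures = [
    ("a", "b"),
    ("a", "c"),
    ("b", "c", "d"),
    ("d", "e"),
    ("e", "f", "a"),
]
train_idx, test_idx, held_out = split_by_constituent(toy_mixtures, 0.3, seed=1)
```

A purely random split of the same rows would typically place mixtures with shared constituents on both sides, which is the kind of dependence the abstract argues can inflate apparent performance.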
Submitted 11 August, 2023;
originally announced August 2023.
Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets
Authors:
Rahul Yedida,
Saad Mohammad Abrar,
Cleber Melo-Filho,
Eugene Muratov,
Rada Chirkova,
Alexander Tropsha
Abstract:
Objective: We aim to learn potential novel cures for diseases from unstructured text sources. More specifically, we seek to extract drug-disease pairs of potential cures for diseases by simple reasoning over the structure of spoken text.
Materials and Methods: We use Google Cloud to transcribe podcast episodes of an NPR radio show. We then build a pipeline that systematically pre-processes the text to ensure quality input to the core classification model, whose output feeds into a series of post-processing steps that produce filtered results. The classification model itself uses a language model pre-trained on PubMed text. The modular nature of our pipeline eases future development in this area, since higher-quality components can be substituted at each stage. As a validation measure, we use ROBOKOP, an engine over a medical knowledge graph with only validated pathways, as a ground-truth source for checking the existence of the proposed pairs. For the proposed pairs not found in ROBOKOP, we provide further verification using Chemotext.
Results: We found 30.4% of our proposed pairs in the ROBOKOP database. For example, our model successfully identified that Omeprazole can help treat heartburn. We discuss the significance of this result, showing some examples of the proposed pairs.
Discussion and Conclusion: The agreement of our results with the existing knowledge source indicates a step in the right direction. Given the plug-and-play nature of our framework, it is easy to add, remove, or modify parts to improve the model as necessary. We discuss the results, showing some examples, and note that this is a potentially new line of research with further scope for exploration. Although our approach was originally oriented toward radio podcast transcripts, it is input-agnostic and could be applied to any source of textual data and to any problem of interest.
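The validation step described above, checking proposed pairs against a reference source and setting aside the rest for further verification, can be sketched as follows. This is a minimal illustration only: it does not call ROBOKOP or Chemotext, the reference is assumed to be a plain in-memory list of pairs, and `check_against_reference` and the toy pairs are hypothetical.

```python
def normalize(pair):
    """Lower-case and trim a (drug, disease) pair for comparison."""
    drug, disease = pair
    return drug.strip().lower(), disease.strip().lower()


def check_against_reference(proposed, reference):
    """Split proposed pairs into those found in the reference and the rest.

    Returns (found, missing, hit_rate). The reference here is a plain
    collection of pairs standing in for a knowledge-graph lookup.
    (Hypothetical helper, not the authors' implementation.)
    """
    reference_set = {normalize(p) for p in reference}
    found = [p for p in proposed if normalize(p) in reference_set]
    missing = [p for p in proposed if normalize(p) not in reference_set]
    hit_rate = len(found) / len(proposed) if proposed else 0.0
    return found, missing, hit_rate


# Toy data: the first pair echoes the example in the abstract,
# the second is a placeholder.
proposed_pairs = [("Omeprazole", "heartburn"), ("drug X", "condition Y")]
reference_pairs = [("omeprazole", "Heartburn")]
found, missing, hit_rate = check_against_reference(proposed_pairs, reference_pairs)
```

In this sketch, the pairs in `missing` are the ones that would go on to a secondary check, as the abstract describes for pairs not found in ROBOKOP.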
Submitted 22 October, 2020;
originally announced November 2020.