-
Identifying metabolites from protein identifiers with P2M
Authors:
Christine H. Chang,
Bryan J. Killinger,
Ryan S. Renslow,
Sean M. Colby
Abstract:
The identification of metabolites from complex biological samples often involves matching experimental mass spectrometry data to signatures of compounds derived from massive chemical databases. However, misidentifications may result due to the complexity of potential chemical space that leads to databases containing compounds with nearly identical structures. Prior knowledge of compounds that may be enzymatically consumed or produced by an organism can help reduce misidentifications by restricting initial database searching to compounds that are likely to be present in a biological system. While databases such as UniProt allow for the identification of small molecules that may be consumed or generated by enzymes encoded in an organism's genome, currently no tool exists for identifying SMILES strings of metabolites associated with protein identifiers and expanding R-containing substructures to fully defined, biologically relevant chemical structures. Here we present Proteome2Metabolome (P2M), a tool that performs these tasks using external database querying behind a simple command line interface. Beyond mass spectrometry based applications, P2M can be generally used to identify biologically relevant chemical structures likely to be observed in a biological system.
Submitted 7 July, 2023;
originally announced July 2023.
-
AI for Chemical Space Gap Filling and Novel Compound Generation
Authors:
Monee Y. McGrady,
Sean M. Colby,
Jamie R. Nuñez,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
When considering large sets of molecules, it is helpful to place them in the context of a "chemical space" - a multidimensional space defined by a set of descriptors that can be used to visualize and analyze compound grouping as well as identify regions that might be void of valid structures. The chemical space of all possible molecules in a given biological or environmental sample can be vast and largely unexplored, mainly due to current limitations in processing of 'big data' by brute force methods (e.g., enumeration of all possible compounds in a space). Recent advances in artificial intelligence (AI) have led to multiple new cheminformatics tools that incorporate AI techniques to characterize and learn the structure and properties of molecules in order to generate plausible compounds, thereby contributing to more accessible and explorable regions of chemical space without the need for brute force methods. We have used one such tool, a deep-learning software called DarkChem, which learns a representation of the molecular structure of compounds by compressing them into a latent space. With DarkChem's design, distance in this latent space is often associated with compound similarity, making sparse regions interesting targets for compound generation due to the possibility of generating novel compounds. In this study, we used 1 million small molecules (less than 1000 Da) to create a representative chemical space (defined by calculated molecular properties) of all small molecules. We identified regions with few or no compounds and investigated their location in DarkChem's latent space. From these spaces, we generated 694,645 valid molecules, all of which represent molecules not found in any chemical database to date. These molecules filled 50.8% of the probed empty spaces in molecular property space. Generated molecules are provided in the supporting information.
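The idea of locating empty regions of a property space, and then measuring how many of them newly generated molecules fill, can be illustrated with a simple grid. The sketch below is not the procedure used in the study: the two-dimensional grid, the bin width, the function names, and the toy coordinates are all assumptions made for illustration.

```python
def cell_of(point, bin_width):
    """Map a point in property space to the index of its grid cell."""
    return tuple(int(value // bin_width) for value in point)


def empty_cells(points, bin_width, shape):
    """Return every cell of a 2D grid of size `shape` that holds no point."""
    occupied = {cell_of(p, bin_width) for p in points}
    all_cells = {(i, j) for i in range(shape[0]) for j in range(shape[1])}
    return all_cells - occupied


def fraction_filled(empty, generated, bin_width):
    """Fraction of previously empty cells hit by at least one generated point."""
    hit = {cell_of(p, bin_width) for p in generated} & empty
    return len(hit) / len(empty) if empty else 0.0


# Toy data (hypothetical): two known molecules and one generated molecule,
# each described by two calculated properties.
known = [(0.5, 0.5), (1.5, 0.5)]
generated = [(0.2, 1.7)]

gaps = empty_cells(known, bin_width=1.0, shape=(2, 2))
filled = fraction_filled(gaps, generated, bin_width=1.0)
```

Here the known molecules leave two of the four cells empty, and the single generated molecule lands in one of them.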
Submitted 28 January, 2022;
originally announced January 2022.
-
DEIMoS: an open-source tool for processing high-dimensional mass spectrometry data
Authors:
Sean M. Colby,
Christine H. Chang,
Jessica L. Bade,
Jamie R. Nunez,
Madison R. Blumer,
Daniel J. Orton,
Kent J. Bloodsworth,
Ernesto S. Nakayasu,
Richard D. Smith,
Yehia M. Ibrahim,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
We present DEIMoS: Data Extraction for Integrated Multidimensional Spectrometry, a Python application programming interface (API) and command-line tool for high-dimensional mass spectrometry data analysis workflows that offers ease of development and access to efficient algorithmic implementations. Functionality includes feature detection, feature alignment, collision cross section (CCS) calibration, isotope detection, and MS/MS spectral deconvolution, with the output comprising detected features aligned across study samples and characterized by mass, CCS, tandem mass spectra, and isotopic signature. Notably, DEIMoS operates on N-dimensional data, largely agnostic to acquisition instrumentation; algorithm implementations simultaneously utilize all dimensions to (i) offer greater separation between features, thus improving detection sensitivity, (ii) increase alignment/feature matching confidence among datasets, and (iii) mitigate convolution artifacts in tandem mass spectra. We demonstrate DEIMoS with LC-IMS-MS/MS data to illustrate the advantages of a multidimensional approach in each data processing step.
Submitted 6 December, 2021;
originally announced December 2021.
-
SPECTRe: Substructure Processing, Enumeration, and Comparison Tool Resource: An efficient tool to encode all substructures of molecules represented in SMILES
Authors:
Yasemin Yesiltepe,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
Functional groups and moieties are chemical descriptors of biomolecules that can be used to interpret their properties and functions, leading to the understanding of chemical or biological mechanisms. These chemical building blocks, or sub-structures, enable the identification of common molecular subgroups, assessing the structural similarities and critical interactions among a set of biological molecules with known activities, and designing novel compounds with similar chemical properties. Here, we introduce a Python-based tool, SPECTRe (Substructure Processing, Enumeration, and Comparison Tool Resource), designed to provide all substructures in a given molecular structure, regardless of the molecule size, employing efficient enumeration and generation of substructures represented in a human-readable SMILES format through the use of classical graph traversal (breadth-first and depth-first search) algorithms. We demonstrate the application of SPECTRe for a set of 10,375 molecules in the molecular weight range 27 to 350 Da (<=26 non-hydrogen atoms), spanning a wide array of structure-based chemical functionalities and chemical classes. We found that the substructure count as a measure of molecular complexity depends strongly on the number of unique atom and bond types present, degree of branching, and presence of rings. The substructure counts are found to be similar for a set of molecules belonging to particular chemical classes and classified based on the characteristic features of certain topologies. We demonstrate that SPECTRe shows promise to be useful in many applications of cheminformatics such as virtual screening for drug discovery, property prediction, fingerprint-based molecular similarity searching, and data mining for identifying frequent substructures.
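Enumerating substructures by graph traversal can be sketched on a bare molecular graph. The example below is an illustrative assumption, not SPECTRe's implementation: it treats a substructure as a connected set of atoms, uses a breadth-first expansion over atom sets, and works on a hand-written adjacency dictionary rather than on SMILES.

```python
from collections import deque


def connected_substructures(adjacency):
    """Enumerate every connected set of atoms in a molecular graph.

    `adjacency` maps each atom index to the indices of its bonded neighbors.
    Breadth-first expansion: start from single atoms and grow each set by
    one neighboring atom at a time, skipping sets already visited.
    """
    seen = set()
    queue = deque(frozenset([atom]) for atom in adjacency)
    while queue:
        sub = queue.popleft()
        if sub in seen:
            continue
        seen.add(sub)
        for atom in sub:
            for neighbor in adjacency[atom]:
                if neighbor not in sub:
                    queue.append(sub | {neighbor})
    return seen


# Three four-atom skeletons (hydrogens omitted): a chain, a branched star,
# and a ring.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

counts = {
    "chain": len(connected_substructures(chain)),
    "star": len(connected_substructures(star)),
    "ring": len(connected_substructures(ring)),
}
```

With the same number of atoms, the branched and cyclic skeletons yield more connected atom sets than the linear chain, which matches the qualitative trend the abstract reports for branching and rings.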
Submitted 4 November, 2021;
originally announced November 2021.
-
A Validated Method for Predicting Small Molecule Ionization Sites using Gibb's Free Energies
Authors:
Jessica L. Bade,
Sean M. Colby,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
Accurate molecular identification of metabolites can unlock new areas of the molecular universe and allow greater insight into complex biological and environmental systems than currently possible. Analytical approaches for measuring the metabolome, such as NMR spectroscopy, and separation techniques coupled with mass spectrometry, such as LC-IMS-MS, have risen to this challenge by yielding rich experimental data that can be queried by cross-reference with similar information for known standards in reference libraries. Confident identification of molecules in metabolomics studies, though, is often limited by the diversity of available data across chemical space, the unavailability of authentic reference standards, and the corresponding lack of comprehensiveness of standard reference libraries. The In Silico Chemical Library Engine (ISiCLE) addresses these hindrances by providing a first-principles cheminformatics pipeline that yields collisional cross section (CCS) values for any given molecule without the need for training data. In this program, chemical identifiers undergo MD simulations, quantum chemical transformations, and ion mobility calculations for the generation of predicted CCS values. Here, we present a new module for ISiCLE that addresses the sensitivity of CCS predictions to ionization site location. An update to adduct creation methods is proposed concerning a transition from pKa- and pKb-led predictions to a Gibbs free energy (GFE) based determination of true ionization site location. A validation set of experimentally confirmed molecular protonation sites was assembled from literature and cross-referenced with the respective pKb-predicted locations and GFE values for all potential ionization site placements. Upon evaluation of the two methods, the lowest GFE value was found to predict the true ionization site location with 100% accuracy, while pKb was less accurate.
Submitted 4 November, 2021;
originally announced November 2021.
-
Collision cross section specificity for small molecule identification workflows
Authors:
Jamie Nunez,
Eva Brayfindley,
Sean M. Colby,
Monee McGrady,
Kristin H. Jarman,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
The physical-chemical property of molecular collision cross section (CCS) is increasingly used to assist in small molecule identification; however, questions remain regarding the extent of its true utility in contributing to such identifications, especially given its correlation with mass. To investigate the contribution of CCS to uniqueness within a given library, we measured its discriminatory capacity as a function of error in CCS values (from measurement or prediction), CCS variance, parent mass, mass error, and/or reference database size using a multi-directional grid search. While experimental CCS databases exist, they are currently small; thus, we used a CCS prediction tool, DarkChem, to provide theoretical CCS values for use in this study. These predicted CCS values were then modified to mirror experimental variance. By augmenting our search within a library based on mass alone with CCS at a variety of accuracies, we found that, (i) the use of multiple adducts (i.e. alternative ionized forms of the same parent compound) for the same molecule, compared to using a single adduct, greatly improves specificity and (ii) even a single CCS leads to a significant specificity boost when low CCS error (e.g. 1% composite error) can be achieved. Based on these results, we recommend using multiple adducts to build up evidence of presence, as each adduct supplies additional information per dimension. Additionally, the utility of ion mobility spectrometry when coupled with mass spectrometry should still be considered, regardless of whether CCS is considered as an identification metric, due to advantages such as increased peak resolution, sensitivity (e.g. from reducing load on the detector at any given time), improvements in data-independent MS/MS spectra acquisition, and cleaner tandem mass spectral fragmentation patterns.
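How adding CCS narrows a mass-only library search can be sketched with a simple tolerance filter. Everything in the example is an assumption for illustration: the function name, the ppm and percent tolerances, and the four library entries (whose masses and CCS values are made up) do not come from the study.

```python
def candidates(library, query_mass, query_ccs=None, mass_ppm=5.0, ccs_pct=1.0):
    """Return names of library entries matching the query within tolerance.

    `library` holds (name, mass, ccs) tuples. Mass is always checked (in ppm);
    CCS is checked (in percent) only when `query_ccs` is given.
    """
    hits = []
    for name, mass, ccs in library:
        if abs(mass - query_mass) / query_mass * 1e6 > mass_ppm:
            continue
        if query_ccs is not None:
            if abs(ccs - query_ccs) / query_ccs * 100 > ccs_pct:
                continue
        hits.append(name)
    return hits


# Hypothetical library: three near-isobaric entries and one unrelated entry.
library = [
    ("A", 180.0634, 130.0),
    ("B", 180.0634, 135.0),
    ("C", 180.0640, 133.0),
    ("D", 181.0700, 130.2),
]

mass_only = candidates(library, query_mass=180.0634)
mass_and_ccs = candidates(library, query_mass=180.0634, query_ccs=130.1)
```

In this toy case the mass-only search leaves three candidates, while adding a CCS match at 1% tolerance leaves one. Repeating the CCS check for several adducts of the same compound would apply the same filter once per adduct.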
Submitted 4 November, 2021;
originally announced November 2021.
-
Similarity Downselection: A Python implementation of a heuristic search algorithm for finding the set of the n most dissimilar items with an application in conformer sampling
Authors:
Felicity F. Nielson,
Sean M. Colby,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
Finding the set of the n items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows large. Finding the set of the n most dissimilar items is different than simply sorting an array of numbers because there exists a pairwise relationship between each item and all other items in the population. For instance, if you have a set of the most dissimilar n=4 items, one or more of the items from n=4 might not be in the set n=5. An exact solution would have to search all possible combinations of size n in the population, exhaustively. We present an open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that for SDS to find the set of n most dissimilar conformers, our method is not only orders of magnitude faster, but is also more accurate than running the Monte Carlo for 1,000,000 iterations, each searching for set sizes n=3-7 out of a population of 50,000. We also benchmark SDS against the exact solution for example small populations, showing SDS produces a solution close to the exact solution in these instances.
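The contrast between an exhaustive search and a fast heuristic can be sketched as follows. This is a generic greedy heuristic written for illustration, not the SDS algorithm itself; the objective (summed pairwise distance), the function names, and the toy population are assumptions, since the abstract does not specify them.

```python
from itertools import combinations


def total_dissimilarity(items, dist):
    """Sum of pairwise distances within a candidate set."""
    return sum(dist(a, b) for a, b in combinations(items, 2))


def exact_most_dissimilar(population, n, dist):
    """Exhaustive search over all size-n combinations (small inputs only)."""
    return max(
        combinations(population, n),
        key=lambda subset: total_dissimilarity(subset, dist),
    )


def greedy_most_dissimilar(population, n, dist):
    """Greedy heuristic for n >= 2: seed with the farthest pair, then keep
    adding the item with the largest summed distance to those chosen."""
    chosen = list(max(combinations(population, 2), key=lambda p: dist(*p)))
    remaining = [x for x in population if x not in chosen]
    while len(chosen) < n:
        best = max(remaining, key=lambda x: sum(dist(x, c) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def abs_diff(a, b):
    """Toy dissimilarity between two numbers."""
    return abs(a - b)


population = [0, 1, 2, 9, 10]
exact = exact_most_dissimilar(population, 3, abs_diff)
greedy = greedy_most_dissimilar(population, 3, abs_diff)
```

The exhaustive version evaluates every combination, so its cost grows combinatorially with n and the population size, whereas the greedy version only scans the remaining items once per added member. On this toy input both reach the same total dissimilarity, but a greedy choice is not guaranteed to be optimal in general.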
Submitted 6 May, 2021;
originally announced May 2021.
-
Exploring the impacts of conformer selection methods on ion mobility collision cross section predictions
Authors:
Felicity F. Nielson,
Sean M. Colby,
Dennis G. Thomas,
Ryan S. Renslow,
Thomas O. Metz
Abstract:
The prediction of structure-dependent molecular properties, such as collision cross sections as measured using ion mobility spectrometry, is crucially dependent on the selection of the correct population of molecular conformers. Here, we report an in-depth evaluation of multiple conformation selection techniques, including simple averaging, Boltzmann weighting, lowest energy selection, low energy threshold reductions, and similarity reduction. Generating 50,000 conformers each for 18 molecules, we used the In Silico Chemical Library Engine (ISiCLE) to calculate the collision cross sections for the entire dataset. First, we employed Monte Carlo simulations to understand the variability between conformer structures as generated using simulated annealing. Then we applied Monte Carlo simulations to the aforementioned conformer selection techniques as applied to the simulated molecular property, the ion mobility collision cross section. Based on our analyses, we found Boltzmann weighting to be a good tradeoff between precision and theoretical accuracy. Combining multiple techniques revealed that energy thresholds and root-mean-squared deviation-based similarity reductions can save considerable computational expense while maintaining property prediction accuracy. Molecular dynamics conformer generation tools like AMBER can continue to generate new lowest energy conformers even after tens of thousands of generations, decreasing precision between runs. This reduced precision can be ameliorated and theoretical accuracy increased by running density functional theory geometry optimization on carefully selected conformers.
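Three of the selection techniques named above (simple averaging, lowest energy selection, and Boltzmann weighting) can be written in a few lines. This is a minimal sketch under stated assumptions: energies are taken to be in kcal/mol, the temperature defaults to 298.15 K, and the function names and toy values are illustrative rather than taken from ISiCLE.

```python
import math

# Boltzmann constant expressed per mole, in kcal/(mol*K).
K_B = 0.0019872041


def simple_average(values):
    """Unweighted mean of a property over all conformers."""
    return sum(values) / len(values)


def lowest_energy_value(energies, values):
    """Property value of the single lowest-energy conformer."""
    return min(zip(energies, values))[1]


def boltzmann_weighted_average(energies, values, temperature=298.15):
    """Average a property with weights exp(-(E - E_min) / (k_B * T)).

    Subtracting the minimum energy leaves the normalized weights unchanged
    and keeps the exponentials numerically well behaved.
    """
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / (K_B * temperature)) for e in energies]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Toy conformers (hypothetical): relative energies and CCS values.
energies = [0.0, 1.0]
ccs_values = [100.0, 200.0]

mean_ccs = simple_average(ccs_values)
lowest_ccs = lowest_energy_value(energies, ccs_values)
weighted_ccs = boltzmann_weighted_average(energies, ccs_values)
```

The Boltzmann-weighted value falls between the lowest-energy value and the simple average, because low-energy conformers dominate the weights without fully excluding the others.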
Submitted 14 October, 2020;
originally announced October 2020.
-
Advancing Standards-Free Methods for the Identification of Small Molecules in Complex Samples
Authors:
Jamie R. Nuñez,
Sean M. Colby,
Dennis G. Thomas,
Malak M. Tfaily,
Nikola Tolic,
Elin M. Ulrich,
Jon R. Sobus,
Thomas O. Metz,
Justin G. Teeguarden,
Ryan S. Renslow
Abstract:
The current gold standard for unambiguous identification in metabolomics analysis is based on comparing two or more orthogonal properties from the analysis of authentic, pure reference materials (standards) to experimental data acquired in the same laboratory with the same analytical methods. This represents a significant limitation for comprehensive chemical identification of small molecules in complex samples since this process is time-consuming and costly, and the majority of molecules are not yet represented by standards, leading to a need for standards-free identification. To address this need, we are advancing chemical property calculations and developing multi-attribute scoring and matching algorithms to utilize data from multiple analytical platforms through the utilization and creation of the in silico Chemical Library Engine (ISiCLE) and the Multi-Attribute Matching Engine (MAME). Here, we describe our results in a blinded analysis of synthetic chemical mixtures as part of the U.S. Environmental Protection Agency's (EPA) Non-Targeted Analysis Collaborative Trial (ENTACT). The blinded false negative rate (FNR), false discovery rate (FDR), and accuracy were 57%, 77%, and 91%, respectively. For high confidence identifications, the FDR was 35%. After unblinding of the sample compositions, we improved our approach by optimizing the scoring parameters used to increase confidence. The final FNR, FDR, and accuracy were 67%, 53%, and 96%, respectively. For high confidence identifications, the FDR was 10%. This study demonstrates that standards-free small molecule identification and multi-attribute matching methods can significantly reduce reliance on standards.
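The three reported metrics can be computed from a confusion matrix using their standard textbook definitions. The abstract does not spell out how the study counted true and false calls, so the definitions below are an assumption, and the counts in the example are invented purely to show the arithmetic.

```python
def identification_metrics(tp, fp, tn, fn):
    """Return (FNR, FDR, accuracy) from confusion-matrix counts.

    FNR      = FN / (FN + TP)   fraction of present compounds that were missed
    FDR      = FP / (FP + TP)   fraction of reported compounds that were wrong
    accuracy = (TP + TN) / all  fraction of all calls that were correct
    """
    fnr = fn / (fn + tp)
    fdr = fp / (fp + tp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return fnr, fdr, accuracy


# Hypothetical counts: a large library in which most candidates are absent
# from the sample and are correctly left unreported.
fnr, fdr, accuracy = identification_metrics(tp=30, fp=10, tn=950, fn=10)
```

With these toy counts, FNR and FDR are both 25% while accuracy is 98%: when true negatives dominate, accuracy can stay high even if a sizable share of reported identifications is wrong.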
Submitted 16 October, 2018;
originally announced October 2018.
-
ISiCLE: A molecular collision cross section calculation pipeline for establishing large in silico reference libraries for compound identification
Authors:
Sean M. Colby,
Dennis G. Thomas,
Jamie R. Nunez,
Douglas J. Baxter,
Kurt R. Glaesemann,
Joseph M. Brown,
Meg A. Pirrung,
Niranjan Govind,
Justin G. Teeguarden,
Thomas O. Metz,
Ryan S. Renslow
Abstract:
Comprehensive and confident identifications of metabolites and other chemicals in complex samples will revolutionize our understanding of the role these chemically diverse molecules play in biological systems. Despite recent advances, metabolomics studies still result in the detection of a disproportionate number of features that cannot be confidently assigned to a chemical structure. This inadequacy is driven by the single most significant limitation in metabolomics: the reliance on reference libraries constructed by analysis of authentic reference chemicals. To this end, we have developed the in silico chemical library engine (ISiCLE), a high-performance computing-friendly cheminformatics workflow for generating libraries of chemical properties. In the instantiation described here, we predict probable three-dimensional molecular conformers using chemical identifiers as input, from which collision cross sections (CCS) are derived. The approach employs state-of-the-art first-principles simulation, distinguished by use of molecular dynamics, quantum chemistry, and ion mobility calculations to generate structures and libraries, all without training data. Importantly, optimization of ISiCLE included a refactoring of the popular MOBCAL code for trajectory-based mobility calculations, improving its computational efficiency by over two orders of magnitude. Calculated CCS values were validated against 1,983 experimentally-measured CCS values and compared to previously reported CCS calculation approaches. An online database is introduced for sharing both calculated and experimental CCS values (metabolomics.pnnl.gov), initially including a CCS library with over 1 million entries. Finally, three successful applications of molecule characterization using calculated CCS are described. This work represents a promising method to address the limitations of small molecule identification.
Submitted 21 September, 2018;
originally announced September 2018.
-
Optimizing colormaps with consideration for color vision deficiency to enable accurate interpretation of scientific data
Authors:
Jamie R. Nuñez,
Christopher R. Anderton,
Ryan S. Renslow
Abstract:
Color vision deficiency (CVD) affects more than 4% of the population and leads to a different visual perception of colors. Though this has been known for decades, colormaps with many colors across the visual spectra are often used to represent data, leading to the potential for misinterpretation or difficulty with interpretation by someone with this deficiency. Until the creation of the module presented here, there were no colormaps mathematically optimized for CVD using modern color appearance models. While there have been some attempts to make aesthetically pleasing or subjectively tolerable colormaps for those with CVD, our goal was to make optimized colormaps for the most accurate perception of scientific data by as many viewers as possible. We developed a Python module, cmaputil, to create CVD-optimized colormaps, which imports colormaps and modifies them to be perceptually uniform in CVD-safe colorspace while linearizing and maximizing the brightness range. The module is made available to the science community to enable others to easily create their own CVD-optimized colormaps. Here, we present an example CVD-optimized colormap created with this module that is optimized for viewing by those without a CVD as well as those with red-green colorblindness. This colormap, cividis, enables nearly-identical visual-data interpretation to both groups, is perceptually uniform in hue and brightness, and increases in brightness linearly.
Submitted 1 August, 2018; v1 submitted 29 November, 2017;
originally announced December 2017.