-
MTLComb: multi-task learning combining regression and classification tasks for joint feature selection
Authors:
Han Cao,
Sivanesan Rajan,
Bianka Hahn,
Ersoy Kocak,
Daniel Durstewitz,
Emanuel Schwarz,
Verena Schneider-Lindner
Abstract:
Multi-task learning (MTL) is a learning paradigm that enables the simultaneous training of multiple communicating algorithms. Although MTL has been successfully applied to either regression or classification tasks alone, incorporating mixed types of tasks into a unified MTL framework remains challenging, primarily due to variations in the magnitudes of losses associated with different tasks. This challenge, particularly evident in MTL applications with joint feature selection, often results in biased selections. To overcome this obstacle, we propose a provable loss weighting scheme that analytically determines the optimal weights for balancing regression and classification tasks. This scheme significantly mitigates the otherwise biased feature selection. Building upon this scheme, we introduce MTLComb, an MTL algorithm and software package encompassing optimization procedures, training protocols, and hyperparameter estimation procedures. MTLComb is designed for learning shared predictors among tasks of mixed types. To showcase the efficacy of MTLComb, we conduct tests on both simulated data and biomedical studies pertaining to sepsis and schizophrenia.
Submitted 16 May, 2024;
originally announced May 2024.
-
MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension
Authors:
Xingyu Lu,
He Cao,
Zi**g Liu,
Shengyuan Bai,
Leqing Chen,
Yuan Yao,
Hai-Tao Zheng,
Yu Li
Abstract:
Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To rectify the absence of factual evaluation, we present MoleculeQA, a novel question answering (QA) dataset comprising 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option and three negative options, is semantically consistent with a molecular description from an authoritative molecular corpus. MoleculeQA is not only the first benchmark for molecular factual bias evaluation but also the largest QA dataset for molecular research. A comprehensive evaluation of existing molecular LLMs on MoleculeQA exposes their deficiencies in specific areas and pinpoints several particularly crucial factors for molecular understanding.
Submitted 12 March, 2024;
originally announced March 2024.
-
An Autonomous Large Language Model Agent for Chemical Literature Data Mining
Authors:
Kexin Chen,
Hanqun Cao,
Junyou Li,
Yuyang Du,
Menghao Guo,
Xin Zeng,
Lanqing Li,
Jiezhong Qiu,
Pheng Ann Heng,
Guangyong Chen
Abstract:
Chemical synthesis, which is crucial for advancing material synthesis and drug discovery, impacts various sectors including environmental science and healthcare. The rise of technology in chemistry has generated extensive chemical data, challenging researchers to discern patterns and refine synthesis processes. Artificial intelligence (AI) helps by analyzing data to optimize synthesis and increase yields. However, AI faces challenges in processing literature data due to the unstructured format and diverse writing style of chemical literature. To overcome these difficulties, we introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. This AI agent employs large language models (LLMs) for prompt generation and iterative optimization. It functions as a chemistry assistant, automating data collection and analysis, thereby saving manpower and enhancing performance. We evaluate the framework's efficacy using accuracy, recall, and F1 score on reaction condition data, and we compare our method with human experts in terms of content correctness and time efficiency. The proposed approach marks a significant advancement in automating chemical literature extraction and demonstrates the potential for AI to revolutionize data management and utilization in chemistry.
Submitted 20 February, 2024;
originally announced February 2024.
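The recall and F1 metrics named in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function name `extraction_scores`, the representation of extracted reaction conditions as plain strings, and the exact-match set comparison are all assumptions made for the example.

```python
from typing import Dict, Set


def extraction_scores(predicted: Set[str], gold: Set[str]) -> Dict[str, float]:
    """Precision, recall and F1 for extracted items, using exact set matching.

    Hypothetical helper: treats each extracted reaction condition as a string
    and counts an item as correct only if it appears verbatim in the gold set.
    """
    true_pos = len(predicted & gold)   # items extracted and correct
    false_pos = len(predicted - gold)  # items extracted but not in gold
    false_neg = len(gold - predicted)  # gold items that were missed

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up example items, purely for illustration.
    pred = {"solvent: THF", "temperature: 25 C", "catalyst: Pd/C"}
    gold = {"temperature: 25 C", "catalyst: Pd/C", "time: 2 h"}
    print(extraction_scores(pred, gold))
```

Exact matching is the strictest choice; a real evaluation of free-text extraction would likely need some normalization of units and names before comparison.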
-
InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
Authors:
He Cao,
Zi**g Liu,
Xingyu Lu,
Yuan Yao,
Yu Li
Abstract:
The rapid evolution of artificial intelligence in drug discovery encounters challenges with generalization and extensive training, yet Large Language Models (LLMs) offer promise in reshaping interactions with complex molecular data. Our novel contribution, InstructMol, a multi-modal LLM, effectively aligns molecular structures with natural language via an instruction-tuning approach, utilizing a two-stage training strategy that adeptly combines limited domain-specific data with molecular and textual information. InstructMol showcases substantial performance improvements in drug discovery-related molecular tasks, surpassing leading LLMs and significantly reducing the gap with specialized models, thereby establishing a robust foundation for a versatile and dependable drug discovery assistant.
Submitted 27 November, 2023;
originally announced November 2023.
-
Leveraging Side Information for Ligand Conformation Generation using Diffusion-Based Approaches
Authors:
Jiamin Wu,
He Cao,
Yuan Yao
Abstract:
Ligand molecule conformation generation is a critical challenge in drug discovery. Deep learning models have been developed to tackle this problem, particularly through the use of generative models in recent years. However, these models often generate conformations that lack meaningful structure and randomness due to the absence of essential side information. Examples of such side information include the chemical and geometric features of the target protein, ligand-target compound interactions, and ligand chemical properties. Without these constraints, the generated conformations may not be suitable for further selection and design of new drugs. To address this limitation, we propose a novel method for generating ligand conformations that leverages side information and incorporates flexible constraints into standard diffusion models. Drawing inspiration from the concept of message passing, we introduce a ligand-target message passing block, a mechanism that facilitates the exchange of information between target nodes and ligand nodes, thereby incorporating target node features. To capture non-covalent interactions, we introduce ligand-target compound inter and intra edges. To further improve the biological relevance of the generated conformations, we train energy models using scalar chemical features. These models guide the progress of the standard Denoising Diffusion Probabilistic Models, resulting in more biologically meaningful conformations. We evaluate the performance of SIDEGEN using the PDBBind-2020 dataset, comparing it against other methods. The results demonstrate improvements in both Aligned RMSD and Ligand RMSD evaluations. Specifically, our model outperforms GeoDiff (trained on PDBBind-2020) by 20% in terms of the median aligned RMSD metric.
Submitted 2 August, 2023;
originally announced September 2023.
-
Breathing cluster in complex neuron-astrocyte networks
Authors:
Ya Wang,
Liang Wang,
Huawei Fan,
Jun Ma,
Hui Cao,
Xingang Wang
Abstract:
Brain activity is characterized by spatially distributed neural clusters of coherent firing and by spontaneous switching of the clusters between synchronous and asynchronous states. Evidence from in vivo experiments suggests that astrocytes, a type of glial cell previously regarded as providing only structural and metabolic support to neurons, participate actively in brain functions and play a crucial role in regulating neural firing activity, yet the mechanism remains unknown. Introducing the astrocyte as a reservoir of the glutamate released from neuron synapses, here we propose a model of a complex neuron-astrocyte network and employ it to explore the role of astrocytes in regulating the synchronization behavior of networked neurons. It is found that a fraction of neurons on the network can be synchronized as a cluster, while the remaining neurons stay desynchronized. Moreover, during the course of network evolution, the cluster switches intermittently between the synchronous and asynchronous states, hence the phenomenon of the "breathing cluster". By the method of symmetry-based analysis, we conduct a theoretical investigation of the stability of the cluster and the mechanism generating the breathing activity. It is revealed that the contents of the cluster are determined by the network symmetry and that the breathing activity is due to the interplay between the neural network and the astrocyte. The breathing phenomenon is demonstrated in network models of different structures and neural dynamics. These studies give insight into the cellular mechanism by which astrocytes regulate neural activity and shed light on the spontaneous state switching of the neocortex.
Submitted 26 January, 2023;
originally announced February 2023.
-
Deciphering RNA Secondary Structure Prediction: A Probabilistic K-Rook Matching Perspective
Authors:
Cheng Tan,
Zhangyang Gao,
Hanqun Cao,
Xingran Chen,
Ge Wang,
Lirong Wu,
Jun Xia,
Jiangbin Zheng,
Stan Z. Li
Abstract:
The secondary structure of ribonucleic acid (RNA) is more stable and accessible in the cell than its tertiary structure, making it essential for functional prediction. Although deep learning has shown promising results in this field, current methods suffer from poor generalization and high complexity. In this work, we reformulate RNA secondary structure prediction as a K-Rook problem, thereby simplifying the prediction process into probabilistic matching within a finite solution space. Building on this innovative perspective, we introduce RFold, a simple yet effective method that learns to predict the best-matching K-Rook solution from the given sequence. RFold employs a bi-dimensional optimization strategy that decomposes the probabilistic matching problem into row-wise and column-wise components to reduce the matching complexity, simplifying the solving process while guaranteeing the validity of the output. Extensive experiments demonstrate that RFold achieves competitive performance and about eight times faster inference than the state-of-the-art approaches. The code and Colab demo are available at http://github.com/A4Bio/RFold.
Submitted 19 June, 2024; v1 submitted 2 December, 2022;
originally announced December 2022.
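The rook analogy in the abstract above can be illustrated with a small validity check. This sketch is not taken from RFold and does not reproduce its optimization. It assumes that a secondary structure is encoded as a symmetric 0/1 contact matrix in which each base pairs with at most one other base, so that no row or column holds more than one 1, like non-attacking rooks on a board. The function names `pairs_to_matrix` and `is_valid_pairing` are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def pairs_to_matrix(n: int, pairs: List[Tuple[int, int]]) -> Matrix:
    """Build a symmetric n x n contact matrix from a list of (i, j) base pairs."""
    contact = [[0] * n for _ in range(n)]
    for i, j in pairs:
        contact[i][j] = 1
        contact[j][i] = 1
    return contact


def is_valid_pairing(contact: Matrix) -> bool:
    """Check the assumed rook-style constraints on a contact matrix.

    The matrix must be square, binary and symmetric, have an empty diagonal
    (no base pairs with itself), and contain at most one 1 per row. Symmetry
    then implies at most one 1 per column as well.
    """
    n = len(contact)
    if any(len(row) != n for row in contact):
        return False
    for i in range(n):
        if contact[i][i] != 0:
            return False
        if sum(contact[i]) > 1:
            return False
        for j in range(n):
            if contact[i][j] not in (0, 1):
                return False
            if contact[i][j] != contact[j][i]:
                return False
    return True


if __name__ == "__main__":
    ok = pairs_to_matrix(6, [(0, 5), (1, 4)])
    bad = pairs_to_matrix(6, [(0, 5), (0, 4)])  # base 0 paired twice
    print(is_valid_pairing(ok), is_valid_pairing(bad))
```

A real predictor would also have to respect which base types can pair and a minimum loop length; those constraints are left out here because the abstract does not state them.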
-
Scalable lipid droplet microarray fabrication, validation, and screening
Authors:
Tracey N. Bell,
Aubrey E. Kusi-Appiaha,
Pengfei Lyu,
L. Zhu,
F. Zhu,
David Van Winkle,
Hongyuan Cao,
M. Singh,
Steven Lenhert
Abstract:
High throughput screening of small molecules and natural products is costly, requiring significant amounts of time, reagents, and operating space. Although microarrays have proven effective in the miniaturization of screening for certain biochemical assays, such as nucleic acid hybridization or antibody binding, they are not widely used for drug discovery in cell culture due to the need for cells to internalize lipophilic drug candidates. Lipid droplet microarrays are a promising solution to this problem as they are capable of delivering lipophilic drugs to cells at dosages comparable to solution delivery. However, the scalability of the array fabrication, assay validation, and screening steps has limited the utility of this approach. Here we demonstrate a scalable process for lipid droplet array fabrication, assay validation in cell culture, and drug screening. A nanointaglio printing process has been adapted for use with a printing press. The arrays are stabilized for immersion into aqueous solution using a vapor coating process. In addition to delivery of lipophilic compounds, we found that we are also able to encapsulate and deliver a water-soluble compound in this way. The arrays can be functionalized by extracellular matrix proteins such as collagen prior to cell culture, as the mechanism for uptake is based on direct contact with the lipid delivery vehicles rather than diffusion of the drug out of the microarray spots. We demonstrate this method for delivery to 3 different cell types and the screening of 90 natural product extracts on a microarray covering an area of less than 0.1 cm². The arrays are suitable for miniaturized screening, for instance in BSL-3 conditions where space is limited and for applications where cell numbers are limited, such as in functional precision medicine.
Submitted 13 October, 2022;
originally announced October 2022.
-
Computational Protein Design with Deep Learning Neural Networks
Authors:
**gxue Wang,
Huali Cao,
John Z. H. Zhang,
Yifei Qi
Abstract:
Computational protein design has a wide variety of applications. Despite its remarkable success, designing a protein for a given structure and function is still a challenging task. On the other hand, the number of solved protein structures is rapidly increasing while the number of unique protein folds has reached a steady number, suggesting more structural information is being accumulated on each fold. Deep learning neural networks are a powerful method for learning from such large data sets and have shown superior performance in many machine learning fields. In this study, we applied the deep learning neural network approach to computational protein design for predicting the probability of the 20 natural amino acids at each residue in a protein. A large set of protein structures was collected and a multi-layer neural network was constructed. A number of structural properties were extracted as input features and the best network achieved an accuracy of 38.3%. Using the network output as residue type restraints improved the average sequence identity in designing three natural proteins using Rosetta. Moreover, the predictions from our network show ~3% higher sequence identity than a previous method. Results from this study may benefit further development of computational protein design methods.
Submitted 23 February, 2018; v1 submitted 22 January, 2018;
originally announced January 2018.
-
Proton Conducting Graphene Oxide Coupled Neuron Transistors for Brain-Inspired Cognitive Systems
Authors:
Chang** Wan,
Liqiang Zhu,
Yanghui Liu,
** Feng,
Zhao** Liu,
Hailiang Cao,
Peng Xiao,
Yi Shi,
Qing Wan
Abstract:
The neuron is the most important building block of the brain, and information processing in an individual neuron involves the transformation of input synaptic spike trains into an appropriate output spike train. Hardware implementation of a neuron by an individual ionic/electronic hybrid device is of great significance for enhancing our understanding of the brain and solving sensory processing and complex recognition tasks. Here, we provide a proof-of-principle artificial neuron based on a proton conducting graphene oxide (GO) coupled oxide-based electric-double-layer (EDL) transistor with multiple driving inputs and one modulatory input terminal. Paired-pulse facilitation, dendritic integration and orientation tuning were successfully emulated. Additionally, neuronal gain control (arithmetic) in the scheme of rate coding is experimentally demonstrated. Our results provide a new-concept approach for building brain-inspired cognitive systems.
Submitted 20 October, 2015;
originally announced October 2015.
-
On the Origins and Control of Community Types in the Human Microbiome
Authors:
Travis E. Gibson,
Amir Bashan,
Hong-Tai Cao,
Scott T. Weiss,
Yang-Yu Liu
Abstract:
Microbiome-based stratification of healthy individuals into compositional categories, referred to as "community types", holds promise for drastically improving personalized medicine. Despite this potential, the existence of community types and the degree of their distinctness have been highly debated. Here we adopted a dynamic systems approach and found that heterogeneity in the interspecific interactions or the presence of strongly interacting species is sufficient to explain community types, independent of the topology of the underlying ecological network. By controlling the presence or absence of these strongly interacting species we can steer the microbial ecosystem to any desired community type. This open-loop control strategy still holds even when the community types are not distinct but appear as dense regions within a continuous gradient. This finding can be used to develop viable therapeutic strategies for shifting the microbial composition to a healthy configuration.
Submitted 21 January, 2016; v1 submitted 16 June, 2015;
originally announced June 2015.