Search | arXiv e-print repository

Optimizing Drug Design by Merging Generative AI With Active Learning Frameworks

Authors: Isaac Filella-Merce, Alexis Molina, Marek Orzechowski, Lucía Díaz, Yang Ming Zhu, Julia Vilalta Mor, Laura Malo, Ajay S Yekkirala, Soumya Ray, Victor Guallar

Abstract: Traditional drug discovery programs are being transformed by the advent of machine learning methods. Among these, Generative AI methods (GM) have gained attention due to their ability to design new molecules and enhance specific properties of existing ones. However, current GM methods have limitations, such as low affinity towards the target, unknown ADME/PK properties, or the lack of synthetic tractability. To improve the applicability domain of GM methods, we have developed a workflow based on a variational autoencoder coupled with active learning steps. The designed GM workflow iteratively learns from molecular metrics, including drug likeliness, synthesizability, similarity, and docking scores. In addition, we also included a hierarchical set of criteria based on advanced molecular modeling simulations during a final selection step. We tested our GM workflow on two model systems, CDK2 and KRAS. In both cases, our model generated chemically viable molecules with a high predicted affinity toward the targets. Particularly, the proportion of high-affinity molecules inferred by our GM workflow was significantly greater than that in the training data. Notably, we also uncovered novel scaffolds significantly dissimilar to those known for each target. These results highlight the potential of our GM workflow to explore novel chemical space for specific targets, thereby opening up new possibilities for drug discovery endeavors.

Submitted 4 May, 2023; originally announced May 2023.

arXiv:2102.01521 [pdf]

doi 10.1128/mSystems.00095-21

Pathogenesis, Symptomatology, and Transmission of SARS-CoV-2 through Analysis of Viral Genomics and Structure

Authors: Halie M. Rando, Adam L. MacLean, Alexandra J. Lee, Ronan Lordan, Sandipan Ray, Vikas Bansal, Ashwin N. Skelly, Elizabeth Sell, John J. Dziak, Lamonica Shinholster, Lucy D'Agostino McGowan, Marouen Ben Guebila, Nils Wellhausen, Sergey Knyazev, Simina M. Boca, Stephen Capone, Yanjun Qi, YoSon Park, Yuchen Sun, David Mai, Joel D. Boerckel, Christian Brueffer, James Brian Byrd, Jeremy P. Kamil, Jinhui Wang, et al. (9 additional authors not shown)

Abstract: The novel coronavirus SARS-CoV-2, which emerged in late 2019, has since spread around the world and infected hundreds of millions of people with coronavirus disease 2019 (COVID-19). While this viral species was unknown prior to January 2020, its similarity to other coronaviruses that infect humans has allowed for rapid insight into the mechanisms that it uses to infect human hosts, as well as the ways in which the human immune system can respond. Here, we contextualize SARS-CoV-2 among other coronaviruses and identify what is known and what can be inferred about its behavior once inside a human host. Because the genomic content of coronaviruses, which specifies the virus's structure, is highly conserved, early genomic analysis provided a significant head start in predicting viral pathogenesis and in understanding potential differences among variants. The pathogenesis of the virus offers insights into symptomatology, transmission, and individual susceptibility. Additionally, prior research into interactions between the human immune system and coronaviruses has identified how these viruses can evade the immune system's protective mechanisms. We also explore systems-level research into the regulatory and proteomic effects of SARS-CoV-2 infection and the immune response. Understanding the structure and behavior of the virus serves to contextualize the many facets of the COVID-19 pandemic and can influence efforts to control the virus and treat the disease.

Submitted 3 December, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

arXiv:2009.10766 [pdf, ps, other]

doi 10.1371/journal.pone.0246920

DTI-SNNFRA: Drug-Target interaction prediction by shared nearest neighbors and fuzzy-rough approximation

Authors: Sk Mazharul Islam, Sk Md Mosaddek Hossain, Sumanta Ray

Abstract: In-silico prediction of repurposable drugs is an effective drug discovery strategy that supplements de novo drug discovery from scratch. Reduced development time, lower cost and absence of severe side effects are significant advantages of using drug repositioning. The most recent and most advanced artificial intelligence (AI) approaches have boosted drug repurposing enormously in terms of throughput and accuracy. However, the growing number of drugs, targets and their massive interactions produces imbalanced data which may not be suitable as direct input to a classification model. Here, we have proposed DTI-SNNFRA, a framework for predicting drug-target interaction (DTI), based on shared nearest neighbour (SNN) and fuzzy-rough approximation (FRA). It uses sampling techniques to collectively reduce the vast search space covering the available drugs, targets and millions of interactions between them. DTI-SNNFRA operates in two stages: first, it uses SNN followed by a partitioning clustering for sampling the search space. Next, it computes the degree of fuzzy-rough approximations and proper degree threshold selection for the negative samples' undersampling from all possible interaction pairs between drugs and targets obtained in the first stage. Finally, classification is performed using the positive and selected negative samples. We have evaluated the efficacy of DTI-SNNFRA using AUC (Area under ROC Curve), Geometric Mean, and F1 Score. The model performs exceptionally well with a high prediction score of 0.95 for ROC-AUC. The predicted drug-target interactions are validated through an existing drug-target database (Connectivity Map (Cmap)).

Submitted 20 February, 2021; v1 submitted 22 September, 2020; originally announced September 2020.

Journal ref: PLOS ONE 16(2): e0246920. (2021) 1-19

arXiv:2007.06971 [pdf]

doi 10.1016/j.intimp.2020.106705

Use of Machine Learning and Artificial Intelligence to predict SARS-CoV-2 infection from Full Blood Counts in a population

Authors: Abhirup Banerjee, Surajit Ray, Bart Vorselaars, Joanne Kitson, Michail Mamalakis, Simonne Weeks, Mark Baker, Louise S. Mackenzie

Abstract: Since December 2019 the novel coronavirus SARS-CoV-2 has been identified as the cause of the pandemic COVID-19. Early symptoms overlap with other common conditions such as common cold and Influenza, making early screening and diagnosis crucial goals for health practitioners. The aim of the study was to use machine learning (ML), an artificial neural network (ANN) and a simple statistical test to identify SARS-CoV-2 positive patients from full blood counts without knowledge of symptoms or history of the individuals. The dataset included in the analysis and training contains anonymized full blood counts results from patients seen at the Hospital Israelita Albert Einstein, at São Paulo, Brazil, and who had samples collected to perform the SARS-CoV-2 rt-PCR test during a visit to the hospital. Patient data was anonymised by the hospital; clinical data was standardized to have a mean of zero and a unit standard deviation. This data was made public with the aim to allow researchers to develop ways to enable the hospital to rapidly predict and potentially identify SARS-CoV-2 positive patients. We find that with full blood counts random forest, shallow learning and a flexible ANN model predict SARS-CoV-2 patients with high accuracy between populations on regular wards (AUC = 94-95%) and those not admitted to hospital or in the community (AUC = 80-86%). Here, AUC is the Area Under the receiver operating characteristics Curve and a measure for model performance. Moreover, a simple linear combination of 4 blood counts can be used to have an AUC of 85% for patients within the community. The normalised data of different blood parameters from SARS-CoV-2 positive patients exhibit a decrease in platelets, leukocytes, eosinophils, basophils and lymphocytes, and an increase in monocytes.

Submitted 12 April, 2021; v1 submitted 14 July, 2020; originally announced July 2020.

Journal ref: International Immunopharmacology, Volume 86, 2020, 106705

arXiv:2007.02338 [pdf, other]

Predicting potential drug targets and repurposable drugs for COVID-19 via a deep generative model for graphs

Authors: Sumanta Ray, Snehalika Lall, Anirban Mukhopadhyay, Sanghamitra Bandyopadhyay, Alexander Schönhuth

Abstract: Coronavirus Disease 2019 (COVID-19) has been creating a worldwide pandemic situation. Repurposing drugs, already shown to be free of harmful side effects, for the treatment of COVID-19 patients is an important option in launching novel therapeutic strategies. Therefore, reliable molecule interaction data are a crucial basis, where drug-/protein-protein interaction networks establish invaluable, year-long carefully curated data resources. However, these resources have not yet been systematically exploited using high-performance artificial intelligence approaches. Here, we combine three networks, two of which are year-long curated, and one of which, on SARS-CoV-2-human host-virus protein interactions, was published only most recently (30th of April 2020), raising a novel network that puts drugs, human and virus proteins into mutual context. We apply Variational Graph AutoEncoders (VGAEs), representing most advanced deep learning based methodology for the analysis of data that are subject to network constraints. Reliable simulations confirm that we operate at utmost accuracy in terms of predicting missing links. We then predict hitherto unknown links between drugs and human proteins against which virus proteins preferably bind. The corresponding therapeutic agents present splendid starting points for exploring novel host-directed therapy (HDT) options.

Submitted 5 July, 2020; originally announced July 2020.

Comments: 19 pages, 5 figures

arXiv:2001.03426 [pdf]

DNA Linear Block Codes: Generation, Error-detection and Error-correction of DNA Codeword

Authors: Mandrita Mondal, Kumar S. Ray

Abstract: In the modern age, the increasing complexity of computation and communication technology is leading us towards the necessity of a new paradigm. As a result, unconventional approaches like DNA coding theory are gaining considerable attention. The storage capacity, information processing and transmission properties of DNA molecules stimulate the notion of DNA coding theory as well as DNA cryptography. In this paper we generate DNA codewords using DNA linear block codes, which ensure the secure transmission of information. In the proposed code design strategy, a DNA-based XOR operation (DNAX) is applied for effective construction of DNA codewords, which are quadruples generated over the alphabet consisting of the four DNA bases adenine, thymine, guanine, and cytosine. By worked-out examples we explain the use of the generator matrix and parity check matrix in encryption and decryption of coded data in the form of short single-stranded DNA sequences. The newly developed technique can detect as well as correct errors in transmission of DNA codewords through biological channels from sender to the intended receiver. Through DNA coding theory we are expanding the paths towards data compression and error correction in the form of DNA strands. This leads us towards a broader domain of DNA cryptography.

Submitted 3 August, 2023; v1 submitted 31 December, 2019; originally announced January 2020.

Comments: 16 pages, 1 figure, 5 tables

Journal ref: International Journal of Bioinformatics and Intelligent Computing. 2022;1(2):103-126

arXiv:1908.08623 [pdf]

Exact inference under the perfect phylogeny model

Authors: Surjyendu Ray, Bei Jia, Sam Safavi, Tim van Opijnen, Ralph Isberg, Jason Rosch, José Bento

Abstract: Motivation: Many inference tools use the Perfect Phylogeny Model (PPM) to learn trees from noisy variant allele frequency (VAF) data. Learning in this setting is hard, and existing tools use approximate or heuristic algorithms. An algorithmic improvement is important to help disentangle the limitations of the PPM's assumptions from the limitations in our capacity to learn under it. Results: We make such improvement in the scenario, where the mutations that are relevant for evolution can be clustered into a small number of groups, and the trees to be reconstructed have a small number of nodes. We use a careful combination of algorithms, software, and hardware, to develop EXACT: a tool that can explore the space of all possible phylogenetic trees, and performs exact inference under the PPM with noisy data. EXACT allows users to obtain not just the most-likely tree for some input data, but exact statistics about the distribution of trees that might explain the data. We show that EXACT outperforms several existing tools for this same task. Availability: https://github.com/surjray-repos/EXACT

Submitted 22 August, 2019; originally announced August 2019.

arXiv:1904.05528 [pdf]

Review on DNA Cryptography

Authors: Mandrita Mondal, Kumar S. Ray

Abstract: Cryptography is the science that secures data and communication over the network by applying mathematics and logic to design strong encryption methods. In the modern era of e-business and e-commerce the protection of confidentiality, integrity and availability (CIA triad) of stored information as well as of transmitted data is crucial. Deoxyribonucleic acid (DNA) is a genetic molecule consisting of two linked strands that wind around each other to form a double helical structure. The backbone of each strand is made of alternating deoxyribose sugar and phosphate groups. To each sugar one of four bases is attached, i.e., adenine (A), cytosine (C), guanine (G) or thymine (T). DNA molecules, having the capacity to store, process and transmit information, inspire the idea of DNA cryptography. It is a rapidly emerging unconventional technique which combines the chemical characteristics of biological DNA sequences with classical cryptography to ensure non-vulnerable transmission of data. This innovative method is based on the notion of DNA computing. The methodologies of DNA cryptography are not coded mathematically; thus, it could be too secure to be cracked easily.

Submitted 3 August, 2023; v1 submitted 15 March, 2019; originally announced April 2019.

Comments: 21 pages, 12 figures, 6 tables

Journal ref: International Journal of Bioinformatics and Intelligent Computing. 2023;2(1):44-72

arXiv:1903.04260

Review on DNA Strand Algebra and its Application

Authors: Mandrita Mondal, Kumar S. Ray

Abstract: Several technological limitations of traditional silicon based computing are leading towards the paradigm shift, from silicon to carbon, in computational world. Among the unconventional modes of computing evolved in past several decades, DNA computing has been considered to be quite promising in solving computational and reasoning problems by using DNA strands. Along with the sequential operations, the huge parallelism of DNA computing methodologies engaging numerous numbers of DNA strands induce the consideration of concurrent high-level formalisms. In this paper we have reviewed the algebraic explanation of concurrent DNA processes using DNA strand algebra, process calculus and DNA strand graph. We have demonstrated the application of syntax and semantics of the illustrated methodologies in the domains of reasoning and theorem proving with resolution refutation. Finally, we have presented DNA cryptography as one of the prominent areas for the future scope of research work where DNA strand algebra can be used as formal modelling tool to authenticate the security, logic and reasoning of the existing protocols.

Submitted 30 September, 2020; v1 submitted 4 March, 2019; originally announced March 2019.

Comments: This paper has already been illustrated in arXiv:1702.05383, arXiv:1703.10481

arXiv:1810.03158 [pdf, ps, other]

doi 10.1016/j.ecocom.2018.07.004

Modelling the effects of awareness-based interventions to control the mosaic disease of Jatropha curcas

Authors: F. Al Basir, K. B. Blyuss, S. Ray

Abstract: Plant diseases are responsible for substantial and sometimes devastating economic and societal costs and thus are a major limiting factor for stable and sustainable agricultural production. Diseases of crops are particularly crippling in developing countries that are heavily dependent on agriculture for food security and income. Various techniques have been developed to reduce the negative impact of plant diseases and eliminate the associated parasites, but the success of these approaches strongly depends on population awareness and the degree of engagement with disease control and prevention programs. In this paper we derive and analyse a mathematical model of mosaic disease of Jatropha curcas, an important biofuel plant, with particular emphasis on the effects of interventions in the form of nutrients and insecticides, whose use depends on the level of population awareness. Two contributions to disease awareness are considered in the model: global awareness campaigns, and awareness from observing infected plants. All steady states of the model are found, and their stability is analysed in terms of system parameters. We identify parameter regions associated with eradication of disease, stable endemic infection, and periodic oscillations in the level of infection. Analytical results are supported by numerical simulations that illustrate the behaviour of the model in different dynamical regimes. Implications of theoretical results for practical implementation of disease control are discussed.

Submitted 7 October, 2018; originally announced October 2018.

Comments: 18 pages, 6 figures

Journal ref: Ecol. Compl. 36, 92-100 (2018)

arXiv:1703.10481 [pdf]

DNA Tweezers Based on Semantics of DNA Strand Graph

Authors: Mandrita Mondal, Kumar S. Ray

Abstract: Because of the limitations of classical silicon based computational technology, several alternatives to traditional method in form of unconventional computing have been proposed. In this paper we will focus on DNA computing which is showing the possibility of excellence for its massive parallelism, potential for information storage, speed and energy efficiency. In this paper we will describe how syllogistic reasoning by DNA tweezers can be presented by the semantics of process calculus and DNA strand graph. Syllogism is an essential ingredient for commonsense reasoning of an individual. This paper enlightens the procedure to deduce a precise conclusion from a set of propositions by using formal language theory in form of process calculus and the expressive power of DNA strand graph.

Submitted 29 March, 2017; originally announced March 2017.

Comments: 22 pages, 11 figures. arXiv admin note: substantial text overlap with arXiv:1702.05383

arXiv:1507.01731 [pdf]

Prediction of Radiation Fog by DNA Computing

Authors: Kumar Sankar Ray, Mandrita Mondal

Abstract: In this paper we propose a wet lab algorithm for prediction of radiation fog by DNA computing. The concept of DNA computing is essentially exploited for generating the classifier algorithm in the wet lab. The classifier is based on a new concept of similarity based fuzzy reasoning suitable for wet lab implementation. This new concept of similarity based fuzzy reasoning is different from conventional approach to fuzzy reasoning based on similarity measure and also replaces the logical aspect of classical fuzzy reasoning by DNA chemistry. Thus, we add a new dimension to existing forms of fuzzy reasoning by bringing it down to nanoscale. We exploit the concept of massive parallelism of DNA computing by designing this new classifier in the wet lab. This newly designed classifier is very much generalized in nature and apart from prediction of radiation fog this methodology can be applied to other types of data also. To achieve our goal we first fuzzify the given observed parameters in a form of synthetic DNA sequence which is called fuzzy DNA and which handles the vague concept of human reasoning.

Submitted 7 July, 2015; originally announced July 2015.

Comments: 36 pages

arXiv:1506.04923 [pdf]

Logical Inference by DNA Strand Algebra

Authors: Kumar Sankar Ray, Mandrita Mondal

Abstract: Based on the concept of DNA strand displacement and DNA strand algebra we have developed a method for logical inference which is not based on silicon based computing. Essentially, it is a paradigm shift from silicon to carbon. In this paper we have considered the inference mechanism, viz. modus ponens, to draw conclusion from any observed fact. Thus, the present approach to logical inference based on DNA strand algebra is basically an attempt to develop expert system design in the domain of DNA computing. We have illustrated our methodology with respect to worked out example. Our methodology is very flexible for implementation of different expert system applications.

Submitted 16 June, 2015; originally announced June 2015.

Comments: 18 pages, 10 figures

arXiv:physics/0203092 [pdf, ps, other]

doi 10.1103/PhysRevE.65.061909

Class of self-limiting growth models in the presence of nonlinear diffusion

Authors: Sandip Kar, Suman Kumar Banik, Deb Shankar Ray

Abstract: The source term in a reaction-diffusion system, in general, does not involve explicit time dependence. A class of self-limiting growth models dealing with animal and tumor growth and bacterial population in a culture, on the other hand are described by kinetics with explicit functions of time. We analyze a reaction-diffusion system to study the propagation of spatial front for these models.

Submitted 29 March, 2002; originally announced March 2002.

Comments: RevTex, 13 pages, 5 figures. To appear in Physical Review E

Showing 1–14 of 14 results for author: Ray, S