-
Multistable protocells can aid the evolution of prebiotic autocatalytic sets
Authors:
Angad Yuvraj Singh,
Sanjay Jain
Abstract:
We present a simple mathematical model that captures the evolutionary capabilities of a prebiotic compartment or protocell. In the model the protocell contains an autocatalytic set whose chemical dynamics is coupled to the growth-division dynamics of the compartment. Bistability in the dynamics of the autocatalytic set results in a protocell that can exist with two distinct growth rates. Stochasticity in chemical reactions plays the role of mutations and causes transitions from one growth regime to another. We show that the system exhibits 'natural selection', where a 'mutant' protocell in which the autocatalytic set is active arises by chance in a population of inactive protocells, and then takes over the population because of its higher growth rate or 'fitness'. The work integrates three levels of dynamics: intracellular chemical, single protocell, and population (or ecosystem) of protocells.
Submitted 30 October, 2023;
originally announced October 2023.
-
Explaining black box text modules in natural language with language models
Authors:
Chandan Singh,
Aliyah R. Hsu,
Richard Antonello,
Shailee Jain,
Alexander G. Huth,
Bin Yu,
Jianfeng Gao
Abstract:
Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs/outputs.
We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 3 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is made available on GitHub.
Submitted 15 November, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
Graph Regularized Probabilistic Matrix Factorization for Drug-Drug Interactions Prediction
Authors:
Stuti Jain,
Emilie Chouzenoux,
Kriti Kumar,
Angshul Majumdar
Abstract:
Co-administration of two or more drugs can result in adverse drug reactions. Identifying drug-drug interactions (DDIs) is necessary, especially for drug development and for repurposing old drugs. DDI prediction can be viewed as a matrix completion task, for which matrix factorization (MF) appears as a suitable solution. This paper presents a novel Graph Regularized Probabilistic Matrix Factorization (GRPMF) method, which incorporates expert knowledge through a novel graph-based regularization strategy within an MF framework. An efficient and sound optimization algorithm is proposed to solve the resulting non-convex problem in an alternating fashion. The performance of the proposed method is evaluated on the DrugBank dataset, and comparisons are provided against state-of-the-art techniques. The results demonstrate the superior performance of GRPMF when compared to its counterparts.
Submitted 19 October, 2022;
originally announced October 2022.
-
An infectious diseases hazard map for India based on mobility and transportation networks
Authors:
Onkar Sadekar,
Mansi Budamagunta,
G. J. Sreejith,
Sachin Jain,
M. S. Santhanam
Abstract:
We propose a risk measure and construct an infectious diseases hazard map for India. Given an outbreak location, a hazard index is assigned to each city using an effective distance that depends on inter-city mobilities instead of geographical distance. We demonstrate its utility using an SIR model augmented with air, rail, and road data between the top 446 cities. Simulations show that the effective distance from the outbreak location reliably predicts the time of arrival of infection in other cities. The hazard index predictions compare well with the observed spread of SARS-CoV-2. The hazard map can also be useful in other outbreaks.
Submitted 4 August, 2021; v1 submitted 24 May, 2021;
originally announced May 2021.
-
Brain Signals to Rescue Aphasia, Apraxia and Dysarthria Speech Recognition
Authors:
Gautam Krishna,
Mason Carnahan,
Shilpa Shamapant,
Yashitha Surendranath,
Saumya Jain,
Arundhati Ghosh,
Co Tran,
Jose del R Millan,
Ahmed H Tewfik
Abstract:
In this paper, we propose a deep learning-based algorithm to improve the performance of automatic speech recognition (ASR) systems for aphasia, apraxia, and dysarthria speech by utilizing electroencephalography (EEG) features recorded synchronously with aphasia, apraxia, and dysarthria speech. We demonstrate a significant decoding performance improvement of more than 50% at test time for the isolated speech recognition task, and we also provide preliminary results indicating performance improvement for the more challenging continuous speech recognition task by utilizing EEG features. The results presented in this paper show the first step towards demonstrating the possibility of utilizing non-invasive neural signals to design a real-time robust speech prosthetic for stroke survivors recovering from aphasia, apraxia, and dysarthria. Our aphasia, apraxia, and dysarthria speech-EEG data set will be released to the public to help further advance this interesting and crucial research.
Submitted 17 July, 2021; v1 submitted 27 February, 2021;
originally announced March 2021.
-
Mixture Model Framework for Traumatic Brain Injury Prognosis Using Heterogeneous Clinical and Outcome Data
Authors:
Alan D. Kaplan,
Qi Cheng,
K. Aditya Mohan,
Lindsay D. Nelson,
Sonia Jain,
Harvey Levin,
Abel Torres-Espin,
Austin Chou,
J. Russell Huie,
Adam R. Ferguson,
Michael McCrea,
Joseph Giacino,
Shivshankar Sundaram,
Amy J. Markowitz,
Geoffrey T. Manley
Abstract:
Prognoses of Traumatic Brain Injury (TBI) outcomes are neither easily nor accurately determined from clinical indicators. This is due in part to the heterogeneity of damage inflicted to the brain, ultimately resulting in diverse and complex outcomes. Using a data-driven approach on many distinct data elements may be necessary to describe this large set of outcomes and thereby robustly depict the nuanced differences among TBI patients' recovery. In this work, we develop a method for modeling large heterogeneous data types relevant to TBI. Our approach is geared toward the probabilistic representation of mixed continuous and discrete variables with missing values. The model is trained on a dataset encompassing a variety of data types, including demographics, blood-based biomarkers, and imaging findings. In addition, it includes a set of clinical outcome assessments at 3, 6, and 12 months post-injury. The model is used to stratify patients into distinct groups in an unsupervised learning setting. We use the model to infer outcomes using input data, and show that the collection of input data reduces uncertainty of outcomes over a baseline approach. In addition, we quantify the performance of a likelihood scoring technique that can be used to self-evaluate the extrapolation risk of prognosis on unseen patients.
Submitted 20 July, 2021; v1 submitted 22 December, 2020;
originally announced December 2020.
-
New mixture models for decoy-free false discovery rate estimation in mass-spectrometry proteomics
Authors:
Yisu Peng,
Shantanu Jain,
Yong Fuga Li,
Michal Gregus,
Alexander R. Ivanov,
Olga Vitek,
Predrag Radivojac
Abstract:
Motivation: Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. Results: We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs, and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms.
Submitted 16 September, 2020;
originally announced September 2020.
-
Construction and Usage of a Human Body Common Coordinate Framework Comprising Clinical, Semantic, and Spatial Ontologies
Authors:
Katy Börner,
Ellen M. Quardokus,
Bruce W. Herr II,
Leonard E. Cross,
Elizabeth G. Record,
Yingnan Ju,
Andreas D. Bueckle,
James P. Sluka,
Jonathan C. Silverstein,
Kristen M. Browne,
Sanjay Jain,
Clive H. Wasserfall,
Marda L. Jorgensen,
Jeffrey M. Spraggins,
Nathan H. Patterson,
Mark A. Musen,
Griffin M. Weber
Abstract:
The National Institutes of Health's (NIH) Human Biomolecular Atlas Program (HuBMAP) aims to create a comprehensive high-resolution atlas of all the cells in the healthy human body. Multiple laboratories across the United States are collecting tissue specimens from different organs of donors who vary in sex, age, and body size. Integrating and harmonizing the data derived from these samples and 'mapping' them into a common three-dimensional (3D) space is a major challenge. The key to making this possible is a 'Common Coordinate Framework' (CCF), which provides a semantically annotated, 3D reference system for the entire body. The CCF enables contributors to HuBMAP to 'register' specimens and datasets within a common spatial reference system, and it supports a standardized way to query and 'explore' data in a spatially and semantically explicit manner. [...] This paper describes the construction and usage of a CCF for the human body and its reference implementation in HuBMAP. The CCF consists of (1) a CCF Clinical Ontology, which provides metadata about the specimen and donor (the 'who'); (2) a CCF Semantic Ontology, which describes 'what' part of the body a sample came from and details anatomical structures, cell types, and biomarkers (ASCT+B); and (3) a CCF Spatial Ontology, which indicates 'where' a tissue sample is located in a 3D coordinate system. An initial version of all three CCF ontologies has been implemented for the first HuBMAP Portal release. It was successfully used by Tissue Mapping Centers to semantically annotate and spatially register 48 kidney and spleen tissue blocks. The blocks can be queried and explored in their clinical, semantic, and spatial context via the CCF user interface in the HuBMAP Portal.
Submitted 28 July, 2020;
originally announced July 2020.
-
Morphological Reconstruction of Detached Dendritic Spines via Geodesic Path Prediction
Authors:
Sammit Jain,
Suvadip Mukherjee,
Lydia Danglot,
Jean-Christophe Olivo-Marin
Abstract:
Morphological reconstruction of dendritic spines from fluorescent microscopy is a critical open problem in neuro-image analysis. Existing segmentation tools are ill-equipped to handle thin spines with long, poorly illuminated neck membranes. We address this issue, and introduce an unsupervised path prediction technique based on a stochastic framework which seeks the optimal solution from a path-space of possible spine neck reconstructions. Our method is specifically designed to reduce bias due to outliers, and is adept at reconstructing challenging shapes from images plagued by noise and poor contrast. Experimental analyses on two photon microscopy data demonstrate the efficacy of our method, where an improvement of 12.5% is observed over the state-of-the-art in terms of mean absolute reconstruction error.
Submitted 21 September, 2020; v1 submitted 19 March, 2020;
originally announced March 2020.
-
Neural Network Segmentation of Interstitial Fibrosis, Tubular Atrophy, and Glomerulosclerosis in Renal Biopsies
Authors:
Brandon Ginley,
Kuang-Yu Jen,
Avi Rosenberg,
Felicia Yen,
Sanjay Jain,
Agnes Fogo,
Pinaki Sarder
Abstract:
Glomerulosclerosis, interstitial fibrosis, and tubular atrophy (IFTA) are histologic indicators of irrecoverable kidney injury. In standard clinical practice, the renal pathologist visually assesses, under the microscope, the percentage of sclerotic glomeruli and the percentage of renal cortical involvement by IFTA. Estimation of IFTA is a subjective process due to a varied spectrum and definition of morphological manifestations. Modern artificial intelligence and computer vision algorithms have the ability to reduce inter-observer variability through rigorous quantitation. In this work, we apply convolutional neural networks for the segmentation of glomerulosclerosis and IFTA in periodic acid-Schiff stained renal biopsies. The convolutional network approach achieves high performance in intra-institutional holdout data, and achieves moderate performance in inter-institutional holdout data, which the network had never seen in training. The convolutional approach demonstrated interesting properties, such as learning to predict regions better than the provided ground truth as well as developing its own conceptualization of segmental sclerosis. Subsequent estimations of IFTA and glomerulosclerosis percentages showed high correlation with ground truth.
Submitted 28 February, 2020;
originally announced February 2020.
-
WU-NEAT: A clinically validated, open-source MATLAB toolbox for limited-channel neonatal EEG analysis
Authors:
Z. A. Vesoulis,
P. G. Gamble,
S. Jain,
N. M. El Ters,
S. M. Liao,
A. M. Mathur
Abstract:
Goal: Limited-channel EEG research in neonates is hindered by lack of open, accessible analytic tools. To overcome this limitation, we have created the Washington University-Neonatal EEG Analysis Toolbox (WU-NEAT), containing two of the most commonly used tools, provided in an open-source, clinically-validated package running within MATLAB. Methods: The first algorithm is the amplitude-integrated EEG (aEEG), which is generated by filtering, rectifying and time-compressing the original EEG recording, with subsequent semi-logarithmic display. The second algorithm is the spectral edge frequency (SEF), calculated as the critical frequency below which a user-defined proportion of the EEG spectral power is located. The aEEG algorithm was validated by three experienced reviewers. Reviewers evaluated aEEG recordings of fourteen preterm/term infants, displayed twice in random order, once using a reference algorithm and again using the WU-NEAT aEEG algorithm. Using standard methodology, reviewers assigned a background pattern classification. Inter/intra-rater reliability was assessed. For the SEF, calculations were made using the same fourteen recordings, first with the reference and then with the WU-NEAT algorithm. Results were compared using Pearson's correlation coefficient. Results: For the aEEG algorithm, intra- and inter-rater reliability was 100% and 98%, respectively. For the SEF, the mean (SD) Pearson correlation coefficient between algorithms was 0.96 (0.04). Conclusion: We have demonstrated a clinically-validated toolbox for generating the aEEG as well as calculating the SEF from EEG data. Open-source access will enable widespread use of common analytic algorithms which are device-independent and not subject to obsolescence, thereby facilitating future collaborative research in neonatal EEG.
Submitted 11 May, 2018;
originally announced May 2018.
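The SEF definition quoted in the abstract above (the frequency below which a user-defined proportion of spectral power lies) can be illustrated with a minimal Python sketch. This is not the WU-NEAT implementation, which is a MATLAB toolbox; the function name `spectral_edge_frequency`, its arguments, and the 0.9 default proportion are assumptions made for illustration, and the power spectrum is assumed to have been computed already.

```python
from itertools import accumulate


def spectral_edge_frequency(freqs, power, proportion=0.9):
    """Return the lowest frequency at which cumulative power reaches
    `proportion` of the total spectral power.

    `freqs` must be sorted in ascending order and `power` holds the
    non-negative spectral power in each frequency bin. The 0.9 default
    is an illustrative assumption; the abstract only says the proportion
    is user-defined.
    """
    if len(freqs) != len(power) or len(freqs) == 0:
        raise ValueError("freqs and power must be non-empty and equal in length")
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    total = sum(power)
    if total <= 0:
        raise ValueError("total spectral power must be positive")

    threshold = proportion * total
    # Walk up the spectrum until the running sum crosses the threshold.
    for freq, cumulative in zip(freqs, accumulate(power)):
        if cumulative >= threshold:
            return freq
    return freqs[-1]  # Guard against floating-point shortfall at the last bin.


# Toy spectrum (hypothetical values): half of the power lies at or below 2 Hz.
print(spectral_edge_frequency([1, 2, 3, 4, 5], [4, 3, 2, 1, 0], proportion=0.5))
```

The sketch returns a bin frequency rather than interpolating between bins; a production tool might interpolate for finer resolution.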
-
Feedbacks from the metabolic network to the genetic network reveal regulatory modules in E. coli and B. subtilis
Authors:
Santhust Kumar,
Saurabh Mahajan,
Sanjay Jain
Abstract:
The genetic regulatory network (GRN) plays a key role in controlling the response of the cell to changes in the environment. Although the structure of GRNs has been the subject of many studies, their large scale structure in the light of feedbacks from the metabolic network (MN) has received relatively little attention. Here we study the causal structure of the GRNs, namely the chain of influence of one component on the other, taking into account feedback from the MN. First we consider the GRNs of E. coli and B. subtilis without feedback from MN and illustrate their causal structure. Next we augment the GRNs with feedback from their respective MNs by including (a) links from genes coding for enzymes to metabolites produced or consumed in reactions catalyzed by those enzymes and (b) links from metabolites to genes coding for transcription factors whose transcriptional activity the metabolites alter by binding to them. We find that the inclusion of feedback from MN into GRN significantly affects its causal structure, in particular the number of levels and relative positions of nodes in the hierarchy, and the number and size of the strongly connected components (SCCs). We then study the functional significance of the SCCs. For this we identify condition specific feedbacks from the MN into the GRN by retaining only those enzymes that are essential for growth in specific environmental conditions simulated via the technique of flux balance analysis (FBA). We find that the SCCs of the GRN augmented by these feedbacks can be ascribed specific functional roles in the organism. Our algorithmic approach thus reveals relatively autonomous subsystems with specific functionality, or regulatory modules in the organism. This automated approach could be useful in identifying biologically relevant modules in other organisms for which network data is available, but whose biology is less well studied.
Submitted 9 March, 2018;
originally announced March 2018.
-
Assessing the Effects of Treatment in HIV-TB Co-infection Model
Authors:
Sachin Kumar,
Shikha Jain
Abstract:
We propose a population model for HIV-TB co-infection dynamics by considering treatments for HIV infection, active tuberculosis and co-infection. The HIV-only and TB-only models are analyzed separately, as well as the full model. The basic reproduction numbers for TB ($\mathcal{R}_0^T$) and HIV ($\mathcal{R}_0^H$) and the overall reproduction number for the system $\mathcal{R}_0= \max\{\mathcal{R}_0^T, \mathcal{R}_0^H\}$ are computed. The equilibria and their stability are studied. The main model undergoes supercritical transcritical bifurcation at $\mathcal{R}_0^T=1$ and $\mathcal{R}_0^H=1$, where the parameters $β^*=βe$ and $λ^*=λσ$ act as bifurcation parameters, respectively. Numerical simulation indicates the existence of an interior equilibrium when both reproduction numbers are greater than unity. We explore the effect of early and late HIV treatment on disease-induced deaths during the TB treatment course. Mathematical analysis of our model shows that successful disease eradication requires single-disease treatment, that is, treatment for HIV-only and TB-only infected individuals, in addition to co-infection treatment; in its absence, disease eradication is extremely difficult even for $\mathcal{R}_0<1$. When both diseases are epidemic, the treatment for TB-only infected individuals is very effective in reducing the total infected population and disease-induced deaths in comparison to the treatment for HIV infected individuals, while these are minimum when both single-disease treatments are given with co-infection treatment.
Submitted 9 August, 2018; v1 submitted 28 September, 2017;
originally announced September 2017.
-
Multi-radial LBP Features as a Tool for Rapid Glomerular Detection and Assessment in Whole Slide Histopathology Images
Authors:
Olivier Simon,
Rabi Yacoub,
Sanjay Jain,
Pinaki Sarder
Abstract:
We demonstrate a simple and effective automated method for the segmentation of glomeruli from large (~1 gigapixel) histopathological whole-slide images (WSIs) of thin renal tissue sections and biopsies, using an adaptation of the well-known local binary patterns (LBP) image feature vector to train a support vector machine (SVM) model. Our method offers high precision (>90%) and reasonable recall (>70%) for glomeruli from WSIs, is readily adaptable to glomeruli from multiple species, including mouse, rat, and human, and is robust to diverse slide staining methods. Using 5 Intel(R) Core(TM) i7-4790 CPUs with 40 GB RAM, our method typically requires ~15 sec for training and ~2 min to extract glomeruli reproducibly from a WSI. Deploying a deep convolutional neural network trained for glomerular recognition in tandem with the SVM suffices to reduce false positives to below 3%. We also apply our LBP-based descriptor to successfully detect pathologic changes in a mouse model of diabetic nephropathy. We envision potential clinical and laboratory applications for this approach in the study and diagnosis of glomerular disease, and as a means of greatly accelerating the construction of feature sets to fuel deep learning studies into tissue structure and pathology.
Submitted 20 September, 2017; v1 submitted 6 September, 2017;
originally announced September 2017.
-
Duplication Distance to the Root for Binary Sequences
Authors:
Noga Alon,
Jehoshua Bruck,
Farzad Farnoud,
Siddharth Jain
Abstract:
We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form $\seq x = \seq a \seq b \seq c \to \seq y = \seq a \seq b \seq b \seq c$, where $\seq x$ and $\seq y$ are sequences and $\seq a$, $\seq b$, and $\seq c$ are their substrings, needed to generate a binary sequence of length $n$ starting from a square-free sequence from the set $\{0,1,01,10,010,101\}$. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence to the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length $n$ by $f(n)$, we prove that $f(n)=Θ(n)$. For the case of approximate duplication, where a $β$-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in $n$ to logarithmic at $β=1/2$. We also study the duplication distance to the root for sequences with a given root and for special classes of sequences, namely, the de Bruijn sequences, the Thue-Morse sequence, and the Fibonacci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence.
Submitted 16 November, 2016;
originally announced November 2016.
-
Inference of internal stress in a cell monolayer
Authors:
V. Nier,
S. Jain,
C. T. Lim,
S. Ishihara,
B. Ladoux,
P. Marcq
Abstract:
We combine traction force data with Bayesian inversion to obtain an absolute estimate of the internal stress field of a cell monolayer. The method, Bayesian inversion stress microscopy (BISM), is validated using numerical simulations performed in a wide range of conditions. It is robust to changes in each ingredient of the underlying statistical model. Importantly, its accuracy does not depend on the rheology of the tissue. We apply BISM to experimental traction force data measured in a narrow ring of cohesive epithelial cells, and check that the inferred stress field coincides with that obtained by direct spatial integration of the traction force data in this quasi-one-dimensional geometry.
Submitted 11 March, 2016;
originally announced March 2016.
-
Capacity and Expressiveness of Genomic Tandem Duplication
Authors:
Siddharth Jain,
Farzad Farnoud,
Jehoshua Bruck
Abstract:
The majority of the human genome consists of repeated sequences. An important type of repeated sequence common in the human genome is the tandem repeat, where identical copies appear next to each other. For example, in the sequence $AGTC\underline{TGTG}C$, $TGTG$ is a tandem repeat that may be generated from $AGTCTGC$ by a tandem duplication of length $2$. In this work, we investigate the possibility of generating a large number of sequences from a seed, i.e., a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include exact capacity values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the expressiveness of a tandem duplication system as the capability of expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. In particular, based on a celebrated result by Axel Thue from 1906, presenting a construction for ternary square-free sequences, we show that for alphabets of size 4 or larger, bounded tandem duplication systems, regardless of the seed and the bound on duplication length, are not fully expressive, i.e., they cannot generate all strings even as substrings of other strings. Note that the alphabet of size 4 is of particular interest as it pertains to the genomic alphabet. Building on this result, we also show that these systems do not have full capacity. In general, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems.
Submitted 20 September, 2015;
originally announced September 2015.
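The generating process described in the abstract above can be sketched in a few lines of Python. The helper names `tandem_duplicate` and `descendants` and the length cap `max_len` are assumptions made for illustration and do not come from the paper; the sketch only enumerates reachable strings by brute force and does not compute the capacity itself.

```python
def tandem_duplicate(s, start, length):
    """Duplicate the block s[start:start+length] in place, next to itself."""
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]


def descendants(seed, max_dup_len, max_len):
    """All strings of length <= max_len reachable from seed by tandem
    duplications of length at most max_dup_len (brute-force search)."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        s = frontier.pop()
        for length in range(1, max_dup_len + 1):
            for start in range(len(s) - length + 1):
                t = tandem_duplicate(s, start, length)
                if len(t) <= max_len and t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen
```

With these definitions, `tandem_duplicate("AGTCTGC", 4, 2)` reproduces the abstract's example by duplicating the block `TG` to give `AGTCTGTGC`. Counting `len(descendants(seed, k, n))` for growing `n` gives a rough feel for how the bound `k` on the duplication length limits the number of reachable strings.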
-
Flux-based classification of reactions reveals a functional bow-tie organization of complex metabolic networks
Authors:
Shalini Singh,
Areejit Samal,
Varun Giri,
Sandeep Krishna,
Nandula Raghuram,
Sanjay Jain
Abstract:
Unraveling the structure of complex biological networks and relating it to their functional role is an important task in systems biology. Here we attempt to characterize the functional organization of the large-scale metabolic networks of three microorganisms. We apply flux balance analysis to study the optimal growth states of these organisms in different environments. By investigating the differential usage of reactions across flux patterns for different environments, we observe a striking bimodal distribution in the activity of reactions. Motivated by this, we propose a simple algorithm to decompose the metabolic network into three sub-networks. It turns out that our reaction classifier which is blind to the biochemical role of pathways leads to three functionally relevant sub-networks that correspond to input, output and intermediate parts of the metabolic network with distinct structural characteristics. Our decomposition method unveils a functional bow-tie organization of metabolic networks that is different from the bow-tie structure determined by graph-theoretic methods that do not incorporate functionality.
Submitted 9 December, 2012;
originally announced December 2012.
-
The origin of large molecules in primordial autocatalytic reaction networks
Authors:
Varun Giri,
Sanjay Jain
Abstract:
Large molecules such as proteins and nucleic acids are crucial for life, yet their primordial origin remains a major puzzle. The production of large molecules, as we know it today, requires good catalysts, and the only good catalysts we know that can accomplish this task consist of large molecules. Thus the origin of large molecules is a chicken-and-egg problem in chemistry. Here we present a mechanism, based on autocatalytic sets (ACSs), that is a possible solution to this problem. We discuss a mathematical model describing the population dynamics of molecules in a stylized but prebiotically plausible chemistry. Large molecules can be produced in this chemistry by the coalescing of smaller ones, with the smallest molecules, the `food set', being buffered. Some of the reactions can be catalyzed by molecules within the chemistry with varying catalytic strengths. Normally the concentrations of large molecules in such a scenario are very small, diminishing exponentially with their size. ACSs, if present in the catalytic network, can focus the resources of the system into a sparse set of molecules. ACSs can produce a bistability in the population dynamics and, in particular, steady states wherein the ACS molecules dominate the population. However, reaching these steady states from initial conditions that contain only the food set typically requires very large catalytic strengths, growing exponentially with the size of the catalyst molecule. We present a solution to this problem by studying `nested ACSs', a structure in which a small ACS is connected to a larger one and reinforces it. We show that when the network contains a cascade of nested ACSs with the catalytic strengths of molecules increasing gradually with their size (e.g., as a power law), a sparse subset of molecules including some very large molecules can come to dominate the system.
Submitted 17 October, 2011;
originally announced October 2011.
-
Flows in complex biochemical networks: Role of low degree nodes
Authors:
Areejit Samal,
Sanjay Jain
Abstract:
Metabolic networks have two properties that are generally regarded as unrelated: One, they have metabolic reactions whose single knockout is lethal for the organism, and two, they have correlated sets of reactions forming functional modules. In this review we argue that both essentiality and modularity seem to arise as a consequence of the same structural property: the existence of low degree metabolites. This observation allows a prediction of (a) essential metabolic reactions which are potential drug targets in pathogenic microorganisms and (b) regulatory modules within biological networks, from purely structural information about the metabolic network.
Submitted 4 December, 2010;
originally announced December 2010.
-
The regulatory network of E. coli metabolism as a boolean dynamical system exhibits both homeostasis and flexibility of response
Authors:
Areejit Samal,
Sanjay Jain
Abstract:
Elucidating the architecture and dynamics of large-scale genetic regulatory networks of cells is an important goal in systems biology. We study the system-level dynamical properties of the genetic network of Escherichia coli that regulates its metabolism, and show how its design leads to biologically useful cellular properties. Our study uses the database (Covert et al., Nature 2004) containing 583 genes and 96 external metabolites, which describes not only the network connections but also the Boolean rule at each gene node that controls the switching on or off of the gene as a function of its inputs. We have studied how the attractors of the Boolean dynamical system constructed from this database depend on the initial condition of the genes and on various environmental conditions corresponding to buffered minimal media. We find that the system exhibits homeostasis in that its attractors, which turn out to be fixed points or low-period cycles, are highly insensitive to initial conditions or perturbations of gene configurations for any given fixed environment. At the same time, the attractors show a wide variation when external media are varied, implying that the system mounts a highly flexible response to changed environmental conditions. The regulatory dynamics acts to enhance the cellular growth rate under changed media. Our study shows that the reconstructed genetic network regulating metabolism in E. coli is hierarchical, modular, and largely acyclic, with environmental variables controlling the root of the hierarchy. This architecture makes the cell highly robust to perturbations of gene configurations as well as highly responsive to environmental changes. The twin properties of homeostasis and response flexibility are achieved by this dynamical system even though it is not close to the edge of chaos.
Submitted 5 October, 2007; v1 submitted 27 March, 2007;
originally announced March 2007.
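The attractor analysis described in the abstract above can be illustrated on a toy system. The Python sketch below uses a hypothetical three-gene acyclic network with a single environmental input; the rules in `step`, the synchronous update scheme, and the function names are all assumptions made for illustration and are not taken from the E. coli database.

```python
from itertools import product


def step(state, env):
    """One synchronous update of a hypothetical acyclic 3-gene network.

    Gene a is driven by the environment, gene b by a, and gene c by
    (a AND NOT b). These rules are illustrative only.
    """
    a, b, c = state
    return (env, a, a and not b)


def find_attractor(state, env):
    """Iterate from `state` until a state repeats; return the cycle
    (a tuple of states; length 1 means a fixed point)."""
    first_seen = {}
    trajectory = []
    while state not in first_seen:
        first_seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, env)
    return tuple(trajectory[first_seen[state]:])


def attractors_for_env(env):
    """Set of attractors reached from all 8 initial gene configurations."""
    return {find_attractor(s, env) for s in product([False, True], repeat=3)}
```

In this toy network every initial configuration reaches the same fixed point for a given `env`, while the two environments lead to different fixed points, which loosely mirrors the homeostasis and response flexibility reported for the much larger real network.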
-
Low Degree Metabolites Explain Essential Reactions and Enhance Modularity in Biological Networks
Authors:
Areejit Samal,
Shalini Singh,
Varun Giri,
Sandeep Krishna,
N. Raghuram,
Sanjay Jain
Abstract:
Recently there has been a lot of interest in identifying modules at the level of genetic and metabolic networks of organisms, as well as in identifying single genes and reactions that are essential for the organism. A goal of computational and systems biology is to go beyond identification towards an explanation of specific modules and essential genes and reactions in terms of specific structural or evolutionary constraints. In the metabolic networks of E. coli, S. cerevisiae and S. aureus, we identified metabolites with a low degree of connectivity, particularly those that are produced and/or consumed in just a single reaction. Using FBA we also determined reactions essential for growth in these metabolic networks. We find that most reactions identified as essential in these networks turn out to be those involving the production or consumption of low degree metabolites. Applying graph theoretic methods to these metabolic networks, we identified connected clusters of these low degree metabolites. The genes involved in several operons in E. coli are correctly predicted as those of enzymes catalyzing the reactions of these clusters. We independently identified clusters of reactions whose fluxes are perfectly correlated. We find that the composition of the latter `functional clusters' is also largely explained in terms of clusters of low degree metabolites in each of these organisms. Our findings mean that most metabolic reactions that are essential can be tagged by one or more low degree metabolites. Those reactions are essential because they are the only ways of producing or consuming their respective tagged metabolites. Furthermore, reactions whose fluxes are strongly correlated can be thought of as `glued together' by these low degree metabolites.
Submitted 21 October, 2005; v1 submitted 20 April, 2005;
originally announced April 2005.
-
Evidence of a universal power law characterizing the evolution of metabolic networks
Authors:
Shalini,
Areejit Samal,
Varun Giri,
Sandeep Krishna,
N. Raghuram,
Sanjay Jain
Abstract:
Metabolic networks are known to be scale-free, but the evolutionary origin of this structural property is not clearly understood. One way of studying the dynamical process is to compare the metabolic networks of species that have arisen at different points in evolution and hence are related to each other to varying extents. We have compared the reaction sets of each metabolite across and within 15 groups of species. For a given pair of species and a given metabolite, the number $Δk$ of reactions of the metabolite that appear in the metabolic network of only one species and not the other is a measure of the distance between the two networks. While $Δk$ is small within groups of related species and large across groups, we find its probability distribution to be $\sim (Δk)^{-γ'}$ where $γ'$ is a universal exponent that is the same within and across groups. This exponent equals, up to statistical uncertainties, the exponent $γ$ in the scale-free degree distribution $\sim k^{-γ}$. We argue that this, as well as our finding that $Δk$ is approximately linearly correlated with the degree $k$ of the metabolite, is evidence of a `proportionate change' process in evolution. We also discuss some molecular mechanisms that might be responsible for such an evolutionary process.
Submitted 11 April, 2005;
originally announced April 2005.
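The distance $Δk$ defined in the abstract above amounts to the size of a symmetric difference between two reaction sets. A minimal Python sketch follows; the function name `delta_k` and the reaction identifiers in the usage note are hypothetical, and the paper's actual data handling is not shown in this listing.

```python
def delta_k(reactions_a, reactions_b):
    """Number of reactions of one metabolite that occur in exactly one of
    the two species' metabolic networks (size of the symmetric difference)."""
    return len(set(reactions_a) ^ set(reactions_b))
```

For instance, if a metabolite takes part in reactions `{"R1", "R2", "R3"}` in one species and `{"R2", "R3", "R4", "R5"}` in another, the reactions `R1`, `R4`, and `R5` each occur in only one network, so `delta_k` returns 3.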
-
Large extinctions in an evolutionary model: The role of innovation and keystone species
Authors:
Sanjay Jain,
Sandeep Krishna
Abstract:
The causes of major and rapid transitions observed in biological macroevolution as well as in the evolution of social systems are a subject of much debate. Here we identify the proximate causes of crashes and recoveries that arise dynamically in a model system in which populations of (molecular) species co-evolve with their network of chemical interactions. Crashes are events that involve the rapid extinction of many species and recoveries the assimilation of new ones. These are analyzed and classified in terms of the structural properties of the network. We find that in the absence of large external perturbation, `innovation' is a major cause of large extinctions and the prime cause of recoveries. Another major cause of crashes is the extinction of a `keystone species'. Different classes of causes produce crashes of different characteristic sizes.
Submitted 17 July, 2001;
originally announced July 2001.