-
3D-based RNA function prediction tools in rnaglib
Authors:
Carlos Oliver,
Vincent Mallet,
Jérôme Waldispühl
Abstract:
Understanding the connection between complex structural features of RNA and biological function is a fundamental challenge in evolutionary studies and in RNA design. However, building datasets of RNA 3D structures and making appropriate modeling choices remain time-consuming and lack standardization. In this chapter, we describe the use of rnaglib to train supervised and unsupervised machine learning-based function prediction models on datasets of RNA 3D structures.
Submitted 3 May, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
RNAglib: A Python Package for RNA 2.5D Graphs
Authors:
Vincent Mallet,
Carlos Oliver,
Jonathan Broadbent,
William L. Hamilton,
Jérôme Waldispühl
Abstract:
RNA 3D architectures are stabilized by sophisticated networks of (non-canonical) base pair interactions, which can be conveniently encoded as multi-relational graphs and efficiently exploited by graph theoretical approaches and recent progress in machine learning techniques. RNAglib is a library that eases the use of this representation by providing clean data, methods to load it in machine learning pipelines, and graph-based deep learning models suited for this representation. RNAglib also offers other utilities to model RNA with 2.5D graphs, such as drawing tools, comparison functions, and baseline performances on RNA applications. The method and data are distributed as a fully documented pip package.
Availability: https://rnaglib.cs.mcgill.ca
Submitted 9 September, 2021;
originally announced September 2021.
-
VeRNAl: Mining RNA Structures for Fuzzy Base Pairing Network Motifs
Authors:
Carlos Oliver,
Vincent Mallet,
Pericles Philippopoulos,
William L. Hamilton,
Jerome Waldispuhl
Abstract:
RNA 3D motifs are recurrent substructures, modelled as networks of base pair interactions, which are crucial for understanding structure-function relationships. The task of automatically identifying such motifs is computationally hard, and remains a key challenge in the field of RNA structural biology and network analysis. State-of-the-art methods solve special cases of the motif problem by constraining the structural variability in occurrences of a motif, and narrowing the substructure search space. Here, we relax these constraints by posing the motif finding problem as a graph representation learning and clustering task. This framing takes advantage of the continuous nature of graph representations to model the flexibility and variability of RNA motifs in an efficient manner. We propose a set of node similarity functions, clustering methods, and motif construction algorithms to recover flexible RNA motifs. Our tool, VeRNAl, can be easily customized by users to desired levels of motif flexibility, abundance and size. We show that VeRNAl is able to retrieve and expand known classes of motifs, as well as to propose novel motifs.
Submitted 18 October, 2021; v1 submitted 1 September, 2020;
originally announced September 2020.
-
Leveraging binding-site structure for drug discovery with point-cloud methods
Authors:
Vincent Mallet,
Carlos G. Oliver,
Nicolas Moitessier,
Jerome Waldispuhl
Abstract:
Computational drug discovery strategies can be broadly placed in two categories: ligand-based methods which identify novel molecules by similarity with known ligands, and structure-based methods which predict molecules with high-affinity to a given 3D structure (e.g. a protein). However, ligand-based methods do not leverage information about the binding site, and structure-based approaches rely on the knowledge of a finite set of ligands binding the target. In this work, we introduce TarLig, a novel approach that aims to bridge the gap between ligand and structure-based approaches. We use the 3D structure of the binding site as input to a model which predicts the ligand preferences of the binding site. The resulting predictions could then offer promising seeds and constraints in the chemical space search, based on the binding site structure. TarLig outperforms standard models by introducing a data-alignment and augmentation technique. The recent popularity of Volumetric 3DCNN pipelines in structural bioinformatics suggests that this extra step could help a wide range of methods to improve their results with minimal modifications.
Submitted 28 May, 2019;
originally announced May 2019.
-
10 simple rules to create a serious game, illustrated with examples from structural biology
Authors:
Marc Baaden,
Olivier Delalande,
Nicolas Ferey,
Samuela Pasquali,
Jérôme Waldispühl,
Antoine Taly
Abstract:
Serious scientific games are games whose purpose is not only fun. In the field of science, the serious goals include crucial activities for scientists: outreach, teaching and research. The number of serious games is increasing rapidly, in particular citizen science games, games that allow people to produce and/or analyze scientific data. Interestingly, it is possible to build a set of rules providing a guideline to create or improve serious games. We present arguments gathered from our own experience (Phylo, DocMolecules, HiRE-RNA contest and Pangu) as well as examples from the growing literature on scientific serious games.
Submitted 9 March, 2018; v1 submitted 14 August, 2017;
originally announced August 2017.
-
The Topology of Biological Networks from a Complexity Perspective
Authors:
Ali Atiia,
François Major,
Jérôme Waldispühl
Abstract:
A complexity-theoretic approach to studying biological networks is proposed. A simple graph representation is used where molecules (DNA, RNA, proteins and chemicals) are vertices and relations between them are directed and signed (promotional (+) or inhibitory (-)) edges. Based on this model, the problem of network evolution (NE) is defined formally as an optimization problem and subsequently proven to be fundamentally hard (NP-hard) by means of reduction from the Knapsack problem (KP). Second, for empirical validation, various biological networks of experimentally-validated interactions are compared against randomly generated networks with varying degree distributions. An NE instance is created using a given real or synthetic (random) network. After being reverse-reduced to a KP instance, each NE instance is fed to a KP solver and the average achieved knapsack value-to-weight ratio is recorded from multiple rounds of simulated evolutionary pressure. The results show that biological networks (and synthetic networks of similar degree distribution) achieve the highest ratios at maximal evolutionary pressure and minimal error tolerance conditions. The more distant (in degree distribution) a synthetic network is from biological networks the lower its achieved ratio. The results shed light on how computational intractability has shaped the evolution of biological networks into their current topology.
Submitted 24 April, 2018; v1 submitted 10 May, 2015;
originally announced May 2015.
-
Using structural and evolutionary information to detect and correct pyrosequencing errors in non-coding RNAs
Authors:
Vladimir Reinharz,
Yann Ponty,
Jérôme Waldispühl
Abstract:
Analysis of the sequence-structure relationship in RNA molecules is essential to evolutionary studies but also to concrete applications such as error-correction methodologies in sequencing technologies. The prohibitive sizes of the mutational and conformational landscapes, combined with the volume of data to process, require efficient algorithms to compute sequence-structure properties. More specifically, here we aim to calculate which mutations most increase the likelihood of a sequence with respect to a given structure and RNA family. In this paper, we introduce RNApyro, an efficient linear-time and space inside-outside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base pair energies with novel isostericity scales, and apply our techniques to correct point-wise errors in 5S and 16S rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS error-correction pipeline.
Submitted 30 May, 2013;
originally announced May 2013.
-
Flexible RNA design under structure and sequence constraints using formal languages
Authors:
Yu Zhou,
Yann Ponty,
Stéphane Vialette,
Jérôme Waldispühl,
Yi Zhang,
Alain Denise
Abstract:
The problem of RNA secondary structure design (also called inverse folding) is the following: given a target secondary structure, one aims to create a sequence that folds into, or is compatible with, a given structure. In several practical applications in biology, additional constraints must be taken into account, such as the presence/absence of regulatory motifs, either at a specific location or anywhere in the sequence. In this study, we investigate the design of RNA sequences from their targeted secondary structure, given these additional sequence constraints. To this end, we develop a general framework based on concepts of language theory, namely context-free grammars and finite automata. We efficiently combine a comprehensive set of constraints into a unifying context-free grammar of moderate size. From there, we use generic algorithms to perform a (weighted) random generation, or an exhaustive enumeration, of candidate sequences. The resulting method, whose complexity scales linearly with the length of the RNA, was implemented as a standalone program. The resulting software was embedded into a publicly available dedicated web server. The applicability of the method is demonstrated on a concrete case study dedicated to Exon Splicing Enhancers, in which our approach was successfully used in the design of in vitro experiments.
Submitted 1 August, 2013; v1 submitted 16 May, 2013;
originally announced May 2013.