-
RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks
Authors:
Rafael Josip Penić,
Tin Vlašić,
Roland G. Huber,
Yue Wan,
Mile Šikić
Abstract:
Ribonucleic acid (RNA) plays a variety of crucial roles in fundamental biological processes. Recently, RNA has become an interesting drug target, emphasizing the need to improve our understanding of its structures and functions. Over the years, sequencing technologies have produced an enormous amount of unlabeled RNA data, which hides important knowledge and potential. Motivated by the successes of protein language models, we introduce RiboNucleic Acid Language Model (RiNALMo) to help unveil the hidden code of RNA. RiNALMo is the largest RNA language model to date with 650 million parameters pre-trained on 36 million non-coding RNA sequences from several available databases. RiNALMo is able to extract hidden knowledge and capture the underlying structure information implicitly embedded within the RNA sequences. RiNALMo achieves state-of-the-art results on several downstream tasks. Notably, we show that its generalization capabilities can overcome the inability of other deep learning methods for secondary structure prediction to generalize on unseen RNA families. The code has been made publicly available on https://github.com/lbcb-sci/RiNALMo.
Submitted 29 February, 2024;
originally announced March 2024.
-
Classifying Organizations for Food System Ontologies using Natural Language Processing
Authors:
Tianyu Jiang,
Sonia Vinogradova,
Nathan Stringham,
E. Louise Earl,
Allan D. Hollander,
Patrick R. Huber,
Ellen Riloff,
R. Sandra Schillo,
Giorgio A. Ubbiali,
Matthew Lange
Abstract:
Our research explores the use of natural language processing (NLP) methods to automatically classify entities for the purpose of knowledge graph population and integration with food system ontologies. We have created NLP models that can automatically classify organizations with respect to categories associated with environmental issues as well as Standard Industrial Classification (SIC) codes, which are used by the U.S. government to characterize business activities. As input, the NLP models are provided with text snippets retrieved by the Google search engine for each organization, which serve as a textual description of the organization that is used for learning. Our experimental results show that NLP models can achieve reasonably good performance for these two classification tasks, and they rely on a general framework that could be applied to many other classification problems as well. We believe that NLP models represent a promising approach for automatically harvesting information to populate knowledge graphs and aligning the information with existing ontologies through shared categories and concepts.
Submitted 19 September, 2023;
originally announced September 2023.
-
Reduction of Subjective Listening Effort for TV Broadcast Signals with Recurrent Neural Networks
Authors:
Nils L. Westhausen,
Rainer Huber,
Hannah Baumgartner,
Ragini Sinha,
Jan Rennies,
Bernd T. Meyer
Abstract:
Listening to the audio of TV broadcast signals can be challenging for hearing-impaired as well as normal-hearing listeners, especially when background sounds are prominent or too loud compared to the speech signal. This can result in reduced satisfaction and increased listening effort for listeners. Since the broadcast sound is usually premixed, we perform a subjective evaluation for quantifying the potential of speech enhancement systems based on audio source separation and recurrent neural networks (RNN). Recently, RNNs have shown promising results in the context of sound source separation and real-time signal processing. In this paper, we separate the speech from the background signals and remix the separated sounds at a higher signal-to-noise ratio. This differs from classic speech enhancement, where usually only the extracted speech signal is exploited. The subjective evaluation with 20 normal-hearing subjects on real TV-broadcast material shows that our proposed enhancement system is able to reduce the listening effort by around 2 points on a 13-point listening effort rating scale and increases the perceived sound quality compared to the original mixture.
Submitted 2 November, 2021;
originally announced November 2021.
-
The I-ADOPT Interoperability Framework for FAIRer data descriptions of biodiversity
Authors:
Barbara Magagna,
Ilaria Rosati,
Maria Stoica,
Sirko Schindler,
Gwenaelle Moncoiffe,
Anusuriya Devaraju,
Johannes Peterseil,
Robert Huber
Abstract:
Biodiversity, the variation within and between species and ecosystems, is essential for human well-being and the equilibrium of the planet. It is critical for the sustainable development of human society and is an important global challenge. Biodiversity research has become increasingly data-intensive and it deals with heterogeneous and distributed data made available by global and regional initiatives, such as GBIF, ILTER, LifeWatch, BODC, PANGAEA, and TERN, that apply different data management practices. In particular, a variety of metadata and semantic resources have been produced by these initiatives to describe biodiversity observations, introducing interoperability issues across data management systems. To address these challenges, the InteroperAble Descriptions of Observable Property Terminology WG (I-ADOPT WG) was formed by a group of international terminology providers and data center managers in 2019 with the aim of building a common approach to describe what is observed, measured, calculated, or derived. Based on an extensive analysis of existing semantic representations of variables, the WG has recently published the I-ADOPT framework ontology to facilitate interoperability between existing semantic resources and support the provision of machine-readable variable descriptions whose components are mapped to FAIR vocabulary terms. The I-ADOPT framework ontology defines a set of high-level semantic components that can be used to describe a variety of patterns commonly found in scientific observations. This contribution will focus on how the I-ADOPT framework can be applied to represent variables commonly used in the biodiversity domain.
Submitted 14 July, 2021;
originally announced July 2021.
-
Data-driven Evolutions of Critical Points
Authors:
Stefano Almi,
Massimo Fornasier,
Richard Huber
Abstract:
In this paper we are concerned with the learnability of energies from data obtained by observing time evolutions of their critical points starting at random initial equilibria. As a byproduct of our theoretical framework we introduce the novel concept of mean-field limit of critical point evolutions and of their energy balance as a new form of transport. We formulate the energy learning as a variational problem, minimizing the discrepancy of energy competitors from fulfilling the equilibrium condition along any trajectory of critical points originating at random initial equilibria. By Gamma-convergence arguments we prove the convergence of minimal solutions obtained from a finite number of observations to the exact energy in a suitable sense. The abstract framework is actually fully constructive and numerically implementable. Hence, the approximation of the energy from a finite number of observations of past evolutions makes it possible to simulate further evolutions, which are fully data-driven. As we aim at a precise quantitative analysis, and to provide concrete examples of tractable solutions, we present analytic and numerical results on the reconstruction of an elastic energy for a one-dimensional model of a thin nonlinear-elastic rod.
Submitted 1 November, 2019;
originally announced November 2019.
-
Use of semantic technologies for the development of a dynamic trajectories generator in a Semantic Chemistry eLearning platform
Authors:
Richard Huber,
Kirsten Hantelmann,
Alexandru Todor,
Sebastian Krebs,
Ralf Heese,
Adrian Paschke
Abstract:
ChemgaPedia is a multimedia, web-based eLearning service platform that currently contains about 18,000 pages organized in 1,700 chapters covering the complete bachelor studies in chemistry and related topics of chemistry, pharmacy, and life sciences. The eLearning encyclopedia contains some 25,000 media objects and the eLearning platform provides services such as virtual and remote labs for experiments. With up to 350,000 users per month the platform is the most frequently used scientific educational service in the German-speaking Internet. In this demo we show the benefit of mapping the static eLearning contents of ChemgaPedia to a Linked Data representation for Semantic Chemistry which allows for generating dynamic eLearning paths tailored to the semantic profiles of the users.
Submitted 7 December, 2010;
originally announced December 2010.