-
Embed-Search-Align: DNA Sequence Alignment using Transformer Models
Authors:
Pavan Holur,
K. C. Enevoldsen,
Shreyas Rajesh,
Lajoyce Mboning,
Thalia Georgiou,
Louis-S. Bouchard,
Matteo Pellegrini,
Vwani Roychowdhury
Abstract:
DNA sequence alignment involves assigning short DNA reads to the most probable locations on an extensive reference genome. This process is crucial for various genomic analyses, including variant calling, transcriptomics, and epigenomics. Conventional methods, refined over decades, tackle this challenge in two steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models (LLMs) in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have explored whether the same Transformer architecture can produce numerical representations for DNA sequences. Such models have shown early promise in tasks involving classification of short DNA sequences, such as the detection of coding vs non-coding regions, as well as the identification of enhancer and promoter sequences. Performance on sequence classification tasks does not, however, translate to sequence alignment, where it is necessary to conduct a genome-wide search to successfully align every read. We address this open problem by framing it as an Embed-Search-Align task. In this framework, a novel encoder model, DNA-ESA, generates representations of reads and fragments of the reference, which are projected into a shared vector space where the read-fragment distance is used as a surrogate for alignment. In particular, DNA-ESA introduces: (1) a contrastive loss for self-supervised training of DNA sequence representations, facilitating rich sequence-level embeddings, and (2) a DNA vector store to enable search across fragments on a global scale. DNA-ESA is >97% accurate when aligning 250-length reads onto a human reference genome of 3 gigabases (single-haploid), far exceeds the performance of 6 recent DNA-Transformer model baselines, and shows task transfer across chromosomes and species.
Submitted 23 April, 2024; v1 submitted 20 September, 2023;
originally announced September 2023.
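The embed, search, align pipeline described in this abstract can be illustrated with a toy sketch. Everything here is an assumption for illustration and not the authors' implementation: a normalised k-mer count vector stands in for the learned DNA-ESA encoder (no contrastive training is performed), a plain Python list scanned by brute force stands in for the DNA vector store, and the names `embed`, `build_store`, `search`, and `align` are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product

K = 4  # k-mer size for the stand-in embedding (assumption, not from the paper)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def embed(seq):
    """Stand-in encoder: L2-normalised k-mer count vector (NOT the DNA-ESA model)."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = [float(counts.get(k, 0)) for k in KMERS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_store(reference, frag_len, stride):
    """Embed overlapping reference fragments; returns (start, vector) pairs."""
    starts = range(0, len(reference) - frag_len + 1, stride)
    return [(s, embed(reference[s:s + frag_len])) for s in starts]

def search(store, read, top_k=3):
    """Brute-force search: fragment starts ranked by cosine similarity to the read."""
    q = embed(read)
    scored = [(sum(a * b for a, b in zip(q, v)), s) for s, v in store]
    scored.sort(reverse=True)
    return [s for _, s in scored[:top_k]]

def align(reference, read, candidates, frag_len):
    """Refine: offset inside the candidate fragments with the fewest mismatches."""
    best = (len(read) + 1, -1)
    for s in candidates:
        last = min(s + frag_len, len(reference)) - len(read)
        for off in range(s, last + 1):
            window = reference[off:off + len(read)]
            mism = sum(a != b for a, b in zip(window, read))
            best = min(best, (mism, off))
    return best[1], best[0]

# Synthetic data: a random reference and a read copied from a known position.
random.seed(0)
reference = "".join(random.choice("ACGT") for _ in range(5000))
true_pos = 1234
read = reference[true_pos:true_pos + 100]

store = build_store(reference, frag_len=200, stride=50)
candidates = search(store, read)
position, mismatches = align(reference, read, candidates, frag_len=200)
print(position, mismatches)
```

The point of the sketch is the division of labour: the embedding search narrows a genome-wide problem to a few candidate fragments, and only those fragments are compared base by base. A learned encoder and an approximate nearest-neighbour index would replace the two stand-ins at realistic scale.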
-
When no news is bad news -- Detection of negative events from news media content
Authors:
Kristoffer L. Nielbo,
Frida Haestrup,
Kenneth C. Enevoldsen,
Peter B. Vahlstrup,
Rebekah B. Baglini,
Andreas Roepstorff
Abstract:
During the first wave of Covid-19, information decoupling could be observed in the flow of news media content. The corollary of the content alignment within and between news sources experienced by readers (i.e., all news transformed into Corona-news) was that the novelty of news content went down as media focused monotonically on the pandemic event. This all-important Covid-19 news theme turned out to be quite persistent as the pandemic continued, resulting in a situation that is paradoxical from a news media perspective: the same news was repeated over and over. This information phenomenon, where novelty decreases and persistence increases, has previously been used to track change in news media, but in this study we specifically test the claim that the news information decoupling behavior of media can be used to reliably detect change in news media content originating in a negative event, using a Bayesian approach to change point detection.
Submitted 12 February, 2021;
originally announced February 2021.
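The abstract does not say which Bayesian change point model was used, so the following is only a minimal sketch of the general idea under stated assumptions: a single change point in the mean of a Gaussian signal with known noise level `sigma`, a Gaussian prior `N(m0, s0^2)` on each segment mean, and a uniform prior over the change location. The series is synthetic (a novelty-like signal that drops at a known index), and the function names are hypothetical.

```python
import math
import random

def log_marginal(xs, sigma, m0, s0):
    """Log marginal likelihood of one segment, with its mean integrated out.

    Model (assumption): x_i ~ N(mu, sigma^2), mu ~ N(m0, s0^2).
    """
    n = len(xs)
    d = [x - m0 for x in xs]
    sq = sum(v * v for v in d)
    tot = sum(d)
    quad = sq - (s0 ** 2) * tot ** 2 / (sigma ** 2 + n * s0 ** 2)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - 0.5 * math.log(1 + n * s0 ** 2 / sigma ** 2)
            - quad / (2 * sigma ** 2))

def changepoint_posterior(series, sigma, m0, s0):
    """Posterior over tau, the first index of the second segment (uniform prior)."""
    taus = list(range(1, len(series)))
    logp = [log_marginal(series[:t], sigma, m0, s0)
            + log_marginal(series[t:], sigma, m0, s0) for t in taus]
    peak = max(logp)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - peak) for v in logp]
    total = sum(weights)
    return {t: w / total for t, w in zip(taus, weights)}

# Synthetic novelty-like signal: the mean drops from 1.0 to 0.3 at index 60.
random.seed(1)
true_tau = 60
series = ([random.gauss(1.0, 0.1) for _ in range(true_tau)]
          + [random.gauss(0.3, 0.1) for _ in range(40)])

posterior = changepoint_posterior(series, sigma=0.1, m0=0.5, s0=1.0)
map_tau = max(posterior, key=posterior.get)
print(map_tau, round(posterior[map_tau], 3))
```

The output of such a model is a full posterior over change locations rather than a single estimate, which is what makes it possible to ask how reliably an event is detected.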
-
News Information Decoupling: An Information Signature of Catastrophes in Legacy News Media
Authors:
Kristoffer L. Nielbo,
Rebekah B. Baglini,
Peter B. Vahlstrup,
Kenneth C. Enevoldsen,
Anja Bechmann,
Andreas Roepstorff
Abstract:
Content alignment in news media was an observable information effect of Covid-19's initial phase. During the first half of 2020, legacy news media became "corona news" following national outbreak and crisis management patterns. While news media are neither unbiased nor infallible as sources of events, they do provide a window into socio-cultural responses to events. In this paper, we use legacy print media to empirically derive the principle of News Information Decoupling (NID), which functions as an information signature of culturally significant catastrophic events. Formally, NID can provide input to change detection algorithms and points to several unsolved research problems at the intersection of information theory and media studies.
Submitted 8 January, 2021;
originally announced January 2021.