-
Jaccard/Tanimoto similarity test and estimation methods
Authors:
Neo Christopher Chung,
Błażej Miasojedow,
Michał Startek,
Anna Gambin
Abstract:
Binary data are used across many areas of the biological sciences. Using binary presence-absence data, we can evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied.
We introduce a hypothesis test for similarity in biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. We derived the exact and asymptotic solutions and developed the bootstrap and measurement concentration algorithms to compute statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).
We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient that enable straightforward incorporation of probabilistic measures in analyses of species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
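As a rough illustration of the quantity discussed in this abstract, the sketch below computes the Jaccard/Tanimoto coefficient of two presence-absence vectors and attaches a simple permutation p-value. This is a generic permutation test chosen for illustration, not the exact, asymptotic, bootstrap, or measure concentration methods of the paper, and it does not use the jaccard R package. The function names `jaccard` and `permutation_pvalue`, and the convention of returning 0.0 when both vectors are empty, are assumptions made here.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient: |intersection| / |union| of two binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    intersection = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: two all-absent vectors have similarity 0.0.
    return intersection / union if union else 0.0


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """One-sided p-value from shuffling y (a generic test, not the paper's method)."""
    rng = random.Random(seed)
    observed = jaccard(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if jaccard(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    species_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
    species_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
    print(jaccard(species_a, species_b))
    print(permutation_pvalue(species_a, species_b))
```

Shuffling one vector keeps each species' number of occurrences fixed while breaking any association between the two, which is why the shuffled coefficients serve as a null reference here.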
Submitted 27 March, 2019;
originally announced March 2019.
-
Assigning peaks and modeling ETD in top-down mass spectrometry
Authors:
Mateusz Krzysztof Łącki,
Frederik Lermyte,
Błażej Miasojedow,
Mikołaj Olszański,
Michał Startek,
Frank Sobott,
Dirk Valkenborg,
Anna Gambin
Abstract:
Among the many techniques of modern mass spectrometry, top-down methods are becoming steadily more popular in the overall effort to describe the proteome. These techniques rely on fragmenting ions inside the mass spectrometer rather than on proteolytic digestion. In some of them, fragmentation is induced by electron transfer, which can trigger several concurrent reactions: electron transfer dissociation, electron transfer without dissociation, and proton transfer reaction. Evaluating the extent of these reactions is important for understanding how the instrument functions and, more importantly, for determining whether it can be used to reveal important structural information. We present a workflow for assigning peaks and interpreting the results of electron-transfer-driven reactions. We also present software written in Python and available under GNU v3 license.
Submitted 25 August, 2017; v1 submitted 1 August, 2017;
originally announced August 2017.
-
Modelling the proliferation of transposable elements in populations under environmental stress
Authors:
K. Gogolewski,
M. Startek,
A. Gambin,
A. Le Rouzic
Abstract:
In this article, we investigate the evolution of sexual diploid populations that host active transposable element (TE) families. Our purpose is to explore the relationship between environmental change affecting such populations and the activity of the TEs present in their genomes.
Submitted 15 November, 2016;
originally announced November 2016.
-
An asymptotically optimal, online algorithm for weighted random sampling with replacement
Authors:
Michał Startek
Abstract:
This paper presents a novel algorithm for the classic problem of generating a random sample of size s from a population of size n with non-uniform probabilities. Sampling is done with replacement. The algorithm requires constant additional memory and runs in O(n) time (even when s >> n, in which case it produces a list containing, for every population member, the number of times it has been selected for the sample). The algorithm works online and is therefore well suited to processing streams. In addition, a novel method of mass-sampling from any discrete distribution using the algorithm is presented.
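To make the "list of selection counts" output concrete, here is a minimal sketch of a standard technique for weighted sampling with replacement: decomposing the multinomial into a chain of conditional binomial draws, one per population member. It is not the algorithm from the paper. It assumes the weights are available as a list (so it is not online), and its naive binomial sampler costs O(s) per call, so the sketch does not achieve the O(n) bound; a constant-time binomial sampler would be needed for that. The names `binomial_draw` and `sample_counts` are hypothetical.

```python
import random


def binomial_draw(rng, trials, p):
    """Naive Binomial(trials, p) sampler; O(trials), used only for illustration."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def sample_counts(weights, s, seed=None):
    """Return, for each member, how many of the s draws selected it."""
    if any(w < 0 for w in weights) or not any(w > 0 for w in weights):
        raise ValueError("weights must be non-negative with at least one positive")
    rng = random.Random(seed)
    last_positive = max(i for i, w in enumerate(weights) if w > 0)
    remaining_weight = float(sum(weights))
    remaining_draws = s
    counts = []
    for i, w in enumerate(weights):
        if w <= 0:
            k = 0
        elif i == last_positive:
            # The last positive-weight member takes every draw still unassigned.
            k = remaining_draws
        else:
            # Conditional probability of this member given earlier ones were not chosen.
            p = 1.0 if remaining_weight <= w else w / remaining_weight
            k = binomial_draw(rng, remaining_draws, p)
        counts.append(k)
        remaining_draws -= k
        remaining_weight -= w
    return counts


if __name__ == "__main__":
    print(sample_counts([1, 2, 3], 1000, seed=0))
```

Returning counts rather than the individual draws keeps the output size at n even when s is much larger than n, which matches the output format described in the abstract.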
Submitted 2 November, 2016;
originally announced November 2016.