-
Constant-time edge label and leaf pointer maintenance on sliding suffix trees
Authors:
Laurentius Leonard,
Shunsuke Inenaga,
Hideo Bannai,
Takuya Mieno
Abstract:
Sliding suffix trees (Fiala & Greene, 1989) for an input text $T$ over an alphabet of size $σ$ and a sliding window $W$ of $T$ can be maintained in $O(|T| \log σ)$ time and $O(|W|)$ space. The two previous approaches that achieve this can be categorized into the credit-based approach of Fiala and Greene (1989) and Larsson (1996, 1999), and the batch-based approach proposed by Senft (2005). Brodnik and Jekovec (2018) showed that the sliding suffix tree can be supplemented with leaf pointers in order to find all occurrences of an online query pattern in the current window, and that leaf pointers can be maintained by credit-based arguments as well. The main difficulty in the credit-based approach is in the maintenance of index-pairs that represent each edge. In this paper, we show that valid edge index-pairs can be derived in constant time from leaf pointers, thus reducing the maintenance of edge index-pairs to the maintenance of leaf pointers. We further propose a new simple method that maintains leaf pointers without using credit-based arguments. The lack of credit-based arguments allows a simpler proof of correctness compared to the credit-based approach, whose analyses were initially flawed (Senft 2005). In addition, our method reduces the worst-case time of leaf pointer and edge label maintenance per leaf insertion and deletion from $Θ(|W|)$ time to $O(1)$ time.
Submitted 29 February, 2024; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Online algorithms for finding distinct substrings with length and multiple prefix and suffix conditions
Authors:
Laurentius Leonard,
Shunsuke Inenaga,
Hideo Bannai,
Takuya Mieno
Abstract:
Let two static sequences of strings $P$ and $S$, representing prefix and suffix conditions respectively, be given as input for preprocessing. For the query, let two positive integers $k_1$ and $k_2$ be given, as well as a string $T$ given in an online manner, such that $T_i$ represents the length-$i$ prefix of $T$ for $1 \leq i \leq |T|$. In this paper we are interested in computing the set $\mathit{ans_i}$ of distinct substrings $w$ of $T_i$ such that $k_1 \leq |w| \leq k_2$, and $w$ contains some $p \in P$ as a prefix and some $s \in S$ as a suffix. More specifically, the counting problem is to output $|\mathit{ans_i}|$, whereas the reporting problem is to output all elements of $\mathit{ans_i}$, for each iteration $i$. Let $σ$ denote the alphabet size, and for a sequence of strings $A$, $\Vert A\Vert=\sum_{u\in A}|u|$. Then, we show that after $O((\Vert P\Vert +\Vert S\Vert)\logσ)$-time preprocessing, the solutions for the counting and reporting problems for each iteration up to $i$ can be output in $O(|T_i| \logσ)$ and $O(|T_i| \logσ+ |\mathit{ans_i}|)$ total time. The preprocessing time can be reduced to $O(\Vert P\Vert +\Vert S\Vert)$ for integer alphabets of size polynomial with regard to $\Vert P\Vert +\Vert S\Vert$. Our algorithms have possible applications to network traffic classification.
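The counting and reporting problems defined in this abstract can be illustrated with a brute-force reference. The following is only a minimal sketch for checking small inputs, not the paper's algorithm: it runs in polynomial rather than near-linear time, the function names `distinct_substrings` and `online_counts` are hypothetical, and it assumes that the matched prefix and suffix are allowed to overlap inside $w$, which the abstract does not state.

```python
def distinct_substrings(t, prefixes, suffixes, k1, k2):
    """Brute-force reference (hypothetical helper, not the paper's method).

    Returns the set of distinct substrings w of t with k1 <= |w| <= k2
    that start with some p in prefixes and end with some s in suffixes.
    Assumption: the prefix and suffix occurrences may overlap within w.
    """
    found = set()
    n = len(t)
    for start in range(n):
        # Longest admissible length is bounded by k2 and by the end of t.
        for length in range(k1, min(k2, n - start) + 1):
            w = t[start:start + length]
            if any(w.startswith(p) for p in prefixes) and \
               any(w.endswith(s) for s in suffixes):
                found.add(w)
    return found


def online_counts(t, prefixes, suffixes, k1, k2):
    """Counting problem: |ans_i| for each prefix T_i of t, i = 1..|t|."""
    return [
        len(distinct_substrings(t[:i], prefixes, suffixes, k1, k2))
        for i in range(1, len(t) + 1)
    ]


if __name__ == "__main__":
    # Reporting problem for the full string, then per-iteration counts.
    print(sorted(distinct_substrings("abab", ["a"], ["b"], 1, 4)))
    print(online_counts("abab", ["a"], ["b"], 1, 4))
```

Recomputing the answer set from scratch for every prefix $T_i$ is exactly the cost the online algorithms in the paper are designed to avoid; the sketch is useful only as a correctness oracle on short strings.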
Submitted 30 October, 2022; v1 submitted 9 July, 2022;
originally announced July 2022.
-
Suffix tree-based linear algorithms for multiple prefixes, single suffix counting and listing problems
Authors:
Laurentius Leonard,
Ken Tanaka
Abstract:
Given two strings $T$ and $S$ and a set of strings $P$, for each string $p \in P$, consider the unique substrings of $T$ that have $p$ as their prefix and $S$ as their suffix. Two problems then come to mind; the first problem being the counting of such substrings, and the second problem being the problem of listing all such substrings. In this paper, we describe linear-time, linear-space suffix tree-based algorithms for both problems. More specifically, we describe an $O(|T| + |P|)$ time algorithm for the counting problem, and an $O(|T| + |P| + \#(ans))$ time algorithm for the listing problem, where $\#(ans)$ refers to the number of strings being listed in total, and $|P|$ refers to the total length of the strings in $P$. We also consider the reversed version of the problems, where one prefix condition string and multiple suffix condition strings are given instead, and similarly describe linear-time, linear-space algorithms to solve them.
Submitted 18 April, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Specular reflections removal in colposcopic images based on neural networks: Supervised training with no ground truth previous knowledge
Authors:
Lauren Jimenez-Martin,
Daniel A. Valdés Pérez,
Ana M. Solares Asteasuainzarra,
Ludwig Leonard,
Marta L. Baguer Díaz-Romañach
Abstract:
Cervical cancer is a malignant tumor that seriously threatens women's health, and is one of the most common cancers affecting women worldwide. For its early detection, colposcopic images of the cervix are used to search for possible injuries or abnormalities. An inherent characteristic of these images is the presence of specular reflections (brightness) that make it difficult to observe some regions, which might lead to misdiagnosis. In this paper, a new strategy based on neural networks is introduced for eliminating specular reflections and estimating the unobserved anatomical cervix portion under the bright zones. To overcome the fact that the ground truth corresponding to the specular reflection regions is always unknown, the new strategy proposes the supervised training of a neural network to learn how to restore any hidden regions of colposcopic images. Once the specular reflections are identified, they are removed from the image, and the previously trained network is used to fill in these deleted areas. The quality of the processed images was evaluated quantitatively and qualitatively. In 21 of the 22 evaluated images, the detected specular reflections were eliminated, whereas, in the remaining one, these reflections were almost completely eliminated. The distribution of the colors and the content of the restored images are similar to those of the originals. The evaluation carried out by a specialist in Cervix Pathology concluded that, after eliminating the specular reflections, the anatomical and physiological elements of the cervix are observable in the restored images, which facilitates the medical diagnosis of cervical pathologies. Our method has the potential to improve the early detection of cervical cancer.
Submitted 21 June, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
GenoML: Automated Machine Learning for Genomics
Authors:
Mary B. Makarious,
Hampton L. Leonard,
Dan Vitale,
Hirotaka Iwaki,
David Saffo,
Lana Sargent,
Anant Dadu,
Eduardo Salmerón Castaño,
John F. Carter,
Melina Maleknia,
Juan A. Botia,
Cornelis Blauwendraat,
Roy H. Campbell,
Sayed Hadi Hashemi,
Andrew B. Singleton,
Mike A. Nalls,
Faraz Faghri
Abstract:
GenoML is a Python package automating machine learning workflows for genomics (genetics and multi-omics) with an open science philosophy. Genomics data require significant domain expertise to clean, pre-process, harmonize and perform quality control of the data. Furthermore, tuning, validation, and interpretation involve taking into account the biology and possibly the limitations of the underlying data collection, protocols, and technology. GenoML's mission is to bring machine learning for genomics and clinical data to non-experts by developing an easy-to-use tool that automates the full development, evaluation, and deployment process. Emphasis is put on open science to make workflows easily accessible, replicable, and transferable within the scientific community. Source code and documentation are available at https://genoml.com.
Submitted 4 March, 2021;
originally announced March 2021.
-
Learning Multiple-Scattering Solutions for Sphere-Tracing of Volumetric Subsurface Effects
Authors:
Ludwig Leonard,
Kevin Hoehlein,
Ruediger Westermann
Abstract:
Accurate subsurface scattering solutions require the integration of optical material properties along many complicated light paths. We present a method that learns a simple geometric approximation of random paths in a homogeneous volume of translucent material. The generated representation allows determining the absorption along the path as well as a direct lighting contribution, which is representative of all scattering events along the path. A sequence of conditional variational auto-encoders (CVAEs) is trained to model the statistical distribution of the photon paths inside a spherical region in presence of multiple scattering events. A first CVAE learns to sample the number of scattering events, occurring on a ray path inside the sphere, which effectively determines the probability of the ray being absorbed. Conditioned on this, a second model predicts the exit position and direction of the light particle. Finally, a third model generates a representative sample of photon position and direction along the path, which is used to approximate the contribution of direct illumination due to in-scattering. To accelerate the tracing of the light path through the volumetric medium toward the solid boundary, we employ a sphere-tracing strategy that considers the light absorption and is able to perform statistically accurate next-event estimation. We demonstrate efficient learning using shallow networks of only three layers and no more than 16 nodes. In combination with a GPU shader that evaluates the CVAEs' predictions, performance gains can be demonstrated for a variety of different scenarios. A quality evaluation analyzes the approximation error that is introduced by the data-driven scattering simulation and sheds light on the major sources of error in the accelerated path tracing process.
Submitted 5 November, 2020;
originally announced November 2020.