Search | arXiv e-print repository

Sketching methods with small window guarantee using minimum decycling sets

Authors: Guillaume Marçais, Dan DeBlasio, Carl Kingsford

Abstract: Most sequence sketching methods work by selecting specific $k$-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Estimating sequence similarity is much faster using sketches than using sequence alignment, hence sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the $k$-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. In particular, the window guarantee ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee corresponds to a Decycling Set, aka an unavoidable set of $k$-mers. Any long enough sequence must contain a $k$-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set defines a sketching method by selecting the $k$-mers from the set. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their small size. Only two algorithms, by Mykkeltveit and Champarnaud, are known to generate two particular MDSs, although there is a vast number of alternative MDSs.
We provide a simple method that allows one to explore the space of MDSs and to find sets optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length.

Submitted 6 November, 2023; originally announced November 2023.

Comments: Code available at https://github.com/Kingsford-Group/mdsscope

arXiv:2305.10577 [pdf, other]

Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and Its Variants

Authors: Yutong Qiu, Yihang Shen, Carl Kingsford

Abstract: The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity results of existing string-to-graph matching problems. We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics. The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/.

Submitted 8 November, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

arXiv:2109.09264 [pdf, other]

Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection

Authors: Yihang Shen, Carl Kingsford

Abstract: Bayesian Optimization (BO) is a method for globally optimizing black-box functions. While BO has been successfully applied to many scenarios, developing effective BO algorithms that scale to functions with high-dimensional domains is still a challenge. Optimizing such functions by vanilla BO is extremely time-consuming. Alternative strategies for high-dimensional BO that are based on the idea of embedding the high-dimensional space to the one with low dimension are sensitive to the choice of the embedding dimension, which needs to be pre-specified. We develop a new computationally efficient high-dimensional BO method that exploits variable selection. Our method is able to automatically learn axis-aligned sub-spaces, i.e. spaces containing selected variables, without the demand of any pre-specified hyperparameters. We theoretically analyze the computational complexity of our algorithm and derive the regret bound. We empirically show the efficacy of our method on several synthetic and real problems.

Submitted 12 February, 2024; v1 submitted 19 September, 2021; originally announced September 2021.

Comments: This work has already been accepted in AutoML 2023

arXiv:2001.06550 [pdf, other]

Lower density selection schemes via small universal hitting sets with short remaining path length

Authors: Hongyu Zheng, Carl Kingsford, Guillaume Marçais

Abstract: Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes which can be used as replacement for minimizers scheme with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.

Submitted 16 January, 2020; originally announced January 2020.

Comments: 16+7 pages. Accepted to RECOMB 2020

arXiv:1908.02894 [pdf, other]

How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design

Authors: Maria-Florina Balcan, Dan DeBlasio, Travis Dick, Carl Kingsford, Tuomas Sandholm, Ellen Vitercik

Abstract: Algorithms often have tunable parameters that impact performance metrics such as runtime and solution quality. For many algorithms used in practice, no parameter settings admit meaningful worst-case bounds, so the parameters are made available for the user to tune. Alternatively, parameters may be tuned implicitly within the proof of a worst-case approximation ratio or runtime bound. Worst-case instances, however, may be rare or nonexistent in practice. A growing body of research has demonstrated that data-driven algorithm design can lead to significant improvements in performance. This approach uses a training set of problem instances sampled from an unknown, application-specific distribution and returns a parameter setting with strong average performance on the training set. We provide a broadly applicable theory for deriving generalization guarantees that bound the difference between the algorithm's average performance over the training set and its expected performance. Our results apply no matter how the parameters are tuned, be it via an automated or manual approach. The challenge is that for many types of algorithms, performance is a volatile function of the parameters: slightly perturbing the parameters can cause large changes in behavior. Prior research has proved generalization bounds by employing case-by-case analyses of greedy algorithms, clustering algorithms, integer programming algorithms, and selling mechanisms.
We uncover a unifying structure which we use to prove extremely general guarantees, yet we recover the bounds from prior research. Our guarantees apply whenever an algorithm's performance is a piecewise-constant, -linear, or -- more generally -- piecewise-structured function of its parameters. Our theory also implies novel bounds for voting mechanisms and dynamic programming algorithms from computational biology.

Submitted 25 April, 2021; v1 submitted 7 August, 2019; originally announced August 2019.

arXiv:1604.03132 [pdf]

Efficient Index Maintenance Under Dynamic Genome Modification

Authors: Nitish Gupta, Komal Sanjeev, Tim Wall, Carl Kingsford, Rob Patro

Abstract: Efficient text indexing data structures have enabled large-scale genomic sequence analysis and are used to help solve problems ranging from assembly to read mapping. However, these data structures typically assume that the underlying reference text is static and will not change over the course of the queries being made. Some progress has been made in exploring how certain text indices, like the suffix array, may be updated, rather than rebuilt from scratch, when the underlying reference changes. Yet, these update operations can be complex in practice, difficult to implement, and give fairly pessimistic worst-case bounds. We present a novel data structure, SkipPatch, for maintaining a k-mer-based index over a dynamically changing genome. SkipPatch pairs a hash-based k-mer index with an indexable skip list that is used to efficiently maintain the set of edits that have been applied to the original genome. SkipPatch is practically fast, significantly outperforming the dynamic extended suffix array in terms of update and query speed.

Submitted 11 April, 2016; originally announced April 2016.

Comments: paper accepted at the RECOMB-Seq 2016

arXiv:1506.08235 [pdf, other]

doi 10.1093/bioinformatics/btv670

Optimal Seed Solver: Optimizing Seed Selection in Read Mapping

Authors: Hongyi Xin, Richard Zhu, Sunny Nahar, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, Onur Mutlu

Abstract: Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the potential of the mapper to select less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds. Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-bp read in $O(x \times L)$ operations on average and in $O(x \times L^{2})$ operations in the worst case. We compared OSS against four state-of-the-art seed selection schemes and observed that OSS provides a 3-fold reduction of average seed frequency over the best previous seed selection optimizations.

Submitted 26 June, 2015; originally announced June 2015.

Comments: 10 pages of main text. 6 pages of supplementary materials. Under review by Oxford Bioinformatics

Journal ref: Bioinformatics, Jun 1;32(11):1632-42, 2016

arXiv:1308.3700 [pdf, other]

doi 10.1038/nbt.2862

Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms

Authors: Rob Patro, Stephen M. Mount, Carl Kingsford

Abstract: RNA-seq has rapidly become the de facto technique to measure gene expression. However, the time required for analysis has not kept up with the pace of data generation. Here we introduce Sailfish, a novel computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Sailfish entirely avoids mapping reads, which is a time-consuming step in all current methods. Sailfish provides quantification estimates much faster than existing approaches (typically 20-times faster) without loss of accuracy.

Submitted 16 August, 2013; originally announced August 2013.

Comments: 28 pages, 2 main figures, 2 algorithm displays, 5 supplementary figures and 2 supplementary notes. Accompanying software available at http://www.cs.cmu.edu/~ckingsf/software/sailfish

arXiv:1307.7862 [pdf, other]

Multiscale Identification of Topological Domains in Chromatin

Authors: Darya Filippova, Rob Patro, Geet Duggal, Carl Kingsford

Abstract: Recent chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabase-sized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionally-relevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone modifications at the boundaries.

Submitted 30 July, 2013; originally announced July 2013.

Comments: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

arXiv:1008.5166 [pdf, other]

doi 10.1371/journal.pcbi.1001119

Network Archaeology: Uncovering Ancient Networks from Present-day Interactions

Authors: Saket Navlakha, Carl Kingsford

Abstract: Often questions arise about old or extinct networks. What proteins interacted in a long-extinct ancestor species of yeast? Who were the central players in the Last.fm social network 3 years ago? Our ability to answer such questions has been limited by the unavailability of past versions of networks. To overcome these limitations, we propose several algorithms for reconstructing a network's history of growth given only the network as it exists today and a generative model by which the network is believed to have evolved. Our likelihood-based method finds a probable previous state of the network by reversing the forward growth model. This approach retains node identities so that the history of individual nodes can be tracked. We apply these algorithms to uncover older, non-extant biological and social networks believed to have grown via several models, including duplication-mutation with complementarity, forest fire, and preferential attachment. Through experiments on both synthetic and real-world data, we find that our algorithms can estimate node arrival times, identify anchor nodes from which new nodes copy links, and can reveal significant features of networks that have long since disappeared.

Submitted 30 August, 2010; originally announced August 2010.

Comments: 16 pages, 10 figures

ACM Class: G.2.2; G.3; H.2.8

arXiv:0905.1064 [pdf, other]

Vertices of degree k in edge-minimal, k-edge-connected graphs

Authors: Carl Kingsford, Guillaume Marçais

Abstract: Halin showed that every edge minimal, k-vertex connected graph has a vertex of degree k. In this note, we prove the analogue to Halin's theorem for edge-minimal, k-edge-connected graphs. We show there are two vertices of degree k in every edge-minimal, k-edge-connected graph.

Submitted 7 May, 2009; originally announced May 2009.

Comments: 3 pages

MSC Class: 05C40

arXiv:0905.1053 [pdf, other]

A synthesis for exactly 3-edge-connected graphs

Authors: Carl Kingsford, Guillaume Marçais

Abstract: A multigraph is exactly k-edge-connected if there are exactly k edge-disjoint paths between any pair of vertices. We characterize the class of exactly 3-edge-connected graphs, giving a synthesis involving two operations by which every exactly 3-edge-connected multigraph can be generated. Slightly modified syntheses give the planar exactly 3-edge-connected graphs and the exactly 3-edge-connected graphs with the fewest possible edges.

Submitted 7 May, 2009; originally announced May 2009.

Comments: 15 pages, 4 figures. Submitted to FOCS 2009

MSC Class: 05C40

Showing 1–12 of 12 results for author: Kingsford, C