-
Another virtue of wavelet forests?
Authors:
Christina Boucher,
Travis Gagie,
Aaron Hong,
Yansong Li,
Norbert Zeh
Abstract:
A wavelet forest for a text $T [1..n]$ over an alphabet of size $σ$ takes $n H_0 (T) + o (n \log σ)$ bits of space and supports access and rank on $T$ in $O (\log σ)$ time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when $T$ is the Burrows-Wheeler Transform (BWT) of a string $S$, a wavelet forest for $T$ occupies space bounded in terms of higher-order empirical entropies of $S$ even when the forest is implemented with uncompressed bitvectors. In this paper we show experimentally that wavelet forests also have better access locality than wavelet trees and are thus interesting even when higher-order compression is not effective on $S$, or when $T$ is not a BWT at all.
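To make the access and rank operations concrete, the following is a minimal sketch of a plain pointer-based wavelet tree, not the wavelet forest studied in the paper. It assumes a non-empty sequence of integer symbols and replaces succinct rank structures with uncompressed prefix-count arrays; the class name `WaveletTree` and its methods are illustrative choices, not taken from the paper.

```python
class WaveletTree:
    """Balanced wavelet tree over a non-empty sequence of integers (illustrative sketch)."""

    def __init__(self, seq, lo=None, hi=None):
        seq = list(seq)
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        # ones[i] = number of 1-bits among the first i positions of this node's bitvector.
        self.ones = [0]
        if self.lo == self.hi or not seq:
            return  # leaf (single symbol) or empty subtree
        mid = (self.lo + self.hi) // 2
        for c in seq:
            self.ones.append(self.ones[-1] + (c > mid))
        self.left = WaveletTree([c for c in seq if c <= mid], self.lo, mid)
        self.right = WaveletTree([c for c in seq if c > mid], mid + 1, self.hi)

    def access(self, i):
        """Return the symbol at 0-based position i."""
        node = self
        while node.left is not None:
            ones_before = node.ones[i]
            if node.ones[i + 1] - ones_before:  # bit i is 1: go right
                i = ones_before
                node = node.right
            else:                               # bit i is 0: go left
                i -= ones_before
                node = node.left
        return node.lo

    def rank(self, c, i):
        """Return the number of occurrences of c among the first i symbols."""
        if not self.lo <= c <= self.hi:
            return 0
        node = self
        while node.left is not None:
            mid = (node.lo + node.hi) // 2
            if c > mid:
                i = node.ones[i]
                node = node.right
            else:
                i -= node.ones[i]
                node = node.left
        return i
```

Both queries walk one root-to-leaf path, so they touch one bitvector per level, which is where the $O(\log σ)$ bound comes from. For example, building the tree over the character codes of "abracadabra" lets `access(i)` recover each character and `rank(ord("a"), 11)` count the occurrences of "a".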
Submitted 15 August, 2023;
originally announced August 2023.
-
Sum-of-Local-Effects Data Structures for Separable Graphs
Authors:
Xing Lyu,
Travis Gagie,
Meng He,
Yakov Nekrich,
Norbert Zeh
Abstract:
It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex's value at any time or the list of the most beneficial or most detrimental effects on a vertex. In this paper we show how, given an edge-weighted graph with constant-size separators, we can support the following operations on it in time polylogarithmic in the number of vertices and the number of facilities placed on the vertices, where distances between vertices are measured with respect to the edge weights:
Add (v, f, w, d) places a facility f of weight w and with effect radius d onto vertex v.
Remove (v, f) removes from v a facility f previously placed on v using Add.
Sum (v) or Sum (v, d) returns the total weight of all facilities affecting v or, with a distance parameter d, the total weight of all facilities whose effect region intersects the ``circle'' with radius d around v.
Top (v, k) or Top (v, k, d) returns the k facilities of greatest weight that affect v or, with a distance parameter d, whose effect region intersects the ``circle'' with radius d around v.
The weights of the facilities and the operation that Sum uses to ``sum'' them must form a semigroup. For Top queries, the weights must be drawn from a total order.
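The semantics of these operations can be pinned down with a naive reference implementation. The sketch below is not the polylogarithmic data structure from the paper: it runs Dijkstra's algorithm on every query, assumes an undirected graph with non-negative edge weights and mutually comparable vertex identifiers, uses ordinary addition as the semigroup operation, and omits the variants with a distance parameter. The class name `NaiveLocalEffects` and its method names are hypothetical.

```python
import heapq


class NaiveLocalEffects:
    """Brute-force Add/Remove/Sum/Top on an undirected weighted graph (illustrative sketch)."""

    def __init__(self, edges):
        # edges: iterable of (u, v, weight) triples
        self.adj = {}
        for u, v, w in edges:
            self.adj.setdefault(u, []).append((v, w))
            self.adj.setdefault(v, []).append((u, w))
        self.facilities = {}  # (vertex, facility id) -> (weight, effect radius)

    def add(self, v, f, w, d):
        self.facilities[(v, f)] = (w, d)

    def remove(self, v, f):
        del self.facilities[(v, f)]

    def _distances(self, source):
        # Dijkstra's algorithm from the query vertex.
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > dist[u]:
                continue  # stale heap entry
            for x, w in self.adj.get(u, []):
                if du + w < dist.get(x, float("inf")):
                    dist[x] = du + w
                    heapq.heappush(heap, (du + w, x))
        return dist

    def _affecting(self, v):
        # A facility at u with radius d affects v when dist(u, v) <= d.
        dist = self._distances(v)
        return [(w, f) for (u, f), (w, d) in self.facilities.items()
                if u in dist and dist[u] <= d]

    def sum(self, v):
        return sum(w for w, _ in self._affecting(v))

    def top(self, v, k):
        # (weight, facility id) pairs in order of decreasing weight
        return heapq.nlargest(k, self._affecting(v))
```

A reference implementation like this is mainly useful as a correctness oracle when testing a faster structure, since each query here costs a full shortest-path computation.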
Submitted 4 May, 2023;
originally announced May 2023.
-
A Near-Linear Kernel for Two-Parsimony Distance
Authors:
Elise Deen,
Leo van Iersel,
Remie Janssen,
Mark Jones,
Yuki Murakami,
Norbert Zeh
Abstract:
The maximum parsimony distance $d_{\textrm{MP}}(T_1,T_2)$ and the bounded-state maximum parsimony distance $d_{\textrm{MP}}^t(T_1,T_2)$ measure the difference between two phylogenetic trees $T_1,T_2$ in terms of the maximum difference between their parsimony scores for any character (with $t$ a bound on the number of states in the character, in the case of $d_{\textrm{MP}}^t(T_1,T_2)$). While computing $d_{\textrm{MP}}(T_1, T_2)$ was previously shown to be fixed-parameter tractable with a linear kernel, no such result was known for $d_{\textrm{MP}}^t(T_1,T_2)$. In this paper, we prove that computing $d_{\textrm{MP}}^t(T_1, T_2)$ is fixed-parameter tractable for all $t$. Specifically, we prove that this problem has a kernel of size $O(k \lg k)$, where $k = d_{\textrm{MP}}^t(T_1, T_2)$. As the primary analysis tool, we introduce the concept of leg-disjoint incompatible quartets, which may be of independent interest.
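The definition of $d_{\textrm{MP}}^t$ can be illustrated on very small inputs. The sketch below computes parsimony scores with Fitch's algorithm and then takes the maximum score difference over all characters with at most $t$ states by exhaustive enumeration. This brute force is exponential in the number of leaves and is unrelated to the kernelization in the paper; the nested-tuple encoding of rooted binary trees and the function names are assumptions made for illustration.

```python
from itertools import product


def fitch_score(tree, character):
    """Parsimony score of a character on a rooted binary tree (Fitch's algorithm).

    tree: a leaf label, or a pair (left subtree, right subtree).
    character: dict mapping each leaf label to its state.
    """
    def visit(node):
        if not isinstance(node, tuple):
            return {character[node]}, 0
        (left_states, left_cost) = visit(node[0])
        (right_states, right_cost) = visit(node[1])
        common = left_states & right_states
        if common:
            return common, left_cost + right_cost
        # Disjoint state sets force one extra state change below this node.
        return left_states | right_states, left_cost + right_cost + 1

    return visit(tree)[1]


def leaves(tree):
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def brute_force_dmp(t1, t2, t):
    """Maximum score difference over all characters with states in range(t)."""
    taxa = sorted(leaves(t1))
    best = 0
    for states in product(range(t), repeat=len(taxa)):
        character = dict(zip(taxa, states))
        diff = abs(fitch_score(t1, character) - fitch_score(t2, character))
        best = max(best, diff)
    return best
```

For the two quartet trees `(("a", "b"), ("c", "d"))` and `(("a", "c"), ("b", "d"))`, the character assigning state 0 to a and b and state 1 to c and d has score 1 on the first tree and 2 on the second, so the distance with $t = 2$ is at least 1.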
Submitted 1 November, 2022;
originally announced November 2022.
-
Enhancement of Short Text Clustering by Iterative Classification
Authors:
Md Rashadul Hasan Rakib,
Norbert Zeh,
Magdalena Jankowska,
Evangelos Milios
Abstract:
Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to boost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then it trains a classification algorithm using the non-outliers based on their cluster distributions. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. By repeating this several times, we obtain a much improved clustering of texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin.
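The loop described above (remove outliers, train on the rest, reclassify the outliers, repeat) can be sketched on numeric feature vectors. Everything specific in this sketch is a stand-in rather than the authors' method: outliers are taken to be the fraction of each cluster farthest from its centroid, the "classifier" is a nearest-centroid rule, and the parameters `rounds` and `outlier_fraction` are hypothetical.

```python
def centroid(points):
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def iterative_classification(vectors, labels, rounds=3, outlier_fraction=0.25):
    """Refine an initial clustering by repeatedly reclassifying outliers (illustrative sketch)."""
    labels = list(labels)
    for _ in range(rounds):
        clusters = {}
        for i, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(i)

        outliers = []
        trained = {}  # label -> centroid of that cluster's non-outliers
        for lab, members in clusters.items():
            c = centroid([vectors[i] for i in members])
            # Step 1: flag the members farthest from the cluster centroid as outliers.
            ranked = sorted(members, key=lambda i: sqdist(vectors[i], c))
            keep = max(1, len(ranked) - int(outlier_fraction * len(ranked)))
            # Step 2: "train" the stand-in classifier on the non-outliers only.
            trained[lab] = centroid([vectors[i] for i in ranked[:keep]])
            outliers.extend(ranked[keep:])

        # Step 3: reclassify each outlier with the trained model.
        for i in outliers:
            labels[i] = min(trained, key=lambda lab: sqdist(vectors[i], trained[lab]))
    return labels
```

In a real short-text setting the vectors would come from some text representation and the nearest-centroid rule would be replaced by a trained classifier; the control flow is the part this sketch is meant to show.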
Submitted 30 January, 2020;
originally announced January 2020.
-
A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees
Authors:
Leo van Iersel,
Remie Janssen,
Mark Jones,
Yukihiro Murakami,
Norbert Zeh
Abstract:
We present the first fixed-parameter algorithm for constructing a tree-child phylogenetic network that displays an arbitrary number of binary input trees and has the minimum number of reticulations among all such networks. The algorithm uses the recently introduced framework of cherry picking sequences and runs in $O((8k)^k \mathrm{poly}(n, t))$ time, where $n$ is the number of leaves of every tree, $t$ is the number of trees, and $k$ is the reticulation number of the constructed network. Moreover, we provide an efficient parallel implementation of the algorithm and show that it can deal with up to $100$ input trees on a standard desktop computer, thereby providing a major improvement over previous phylogenetic network construction methods.
Submitted 19 July, 2019;
originally announced July 2019.
-
Hybridization Number on Three Rooted Binary Trees is EPT
Authors:
Leo van Iersel,
Steven Kelk,
Nela Lekić,
Chris Whidden,
Norbert Zeh
Abstract:
Phylogenetic networks are leaf-labelled directed acyclic graphs that are used to describe non-treelike evolutionary histories and are thus a generalization of phylogenetic trees. The hybridization number of a phylogenetic network is the sum of all indegrees minus the number of nodes plus one. The Hybridization Number problem takes as input a collection of phylogenetic trees and asks to construct a phylogenetic network that contains an embedding of each of the input trees and has a smallest possible hybridization number. We present an algorithm for the Hybridization Number problem on three binary trees on $n$ leaves, which runs in time $O(c^k poly(n))$, with $k$ the hybridization number of an optimal network and $c$ a constant. For two trees, an algorithm with running time $O(3.18^k n)$ was proposed before, whereas an algorithm with running time $O(c^k poly(n))$ had, prior to this article, remained elusive for more than two trees. The algorithm for two trees uses the close connection to acyclic agreement forests to achieve a linear exponent in the running time, while previous algorithms for more than two trees (explicitly or implicitly) relied on a brute force search through all possible underlying network topologies, leading to running times that are not $O(c^k poly(n))$ for any $c$. The connection to acyclic agreement forests is much weaker for more than two trees, so even given the right agreement forest, reconstructing the network poses major challenges. We prove novel structural results that allow us to reconstruct a network without having to guess the underlying topology. Our techniques generalize to more than three input trees with the exception of one key lemma that maps nodes in the network to tree nodes and, thus, minimizes the amount of guessing involved in constructing the network. The main open problem therefore is to establish a similar mapping for more than three trees.
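The definition of the hybridization number given above translates directly into a few lines of code. The sketch below assumes the network is supplied as a list of directed edges `(parent, child)` in which every node appears in at least one edge; the function name is illustrative.

```python
def hybridization_number(edges):
    """Sum of all indegrees minus the number of nodes plus one."""
    nodes = set()
    indegree = {}
    for parent, child in edges:
        nodes.update((parent, child))
        indegree[child] = indegree.get(child, 0) + 1
    return sum(indegree.values()) - len(nodes) + 1
```

Since the indegrees sum to the number of edges, the value is the number of edges minus the number of nodes plus one. A rooted tree therefore has hybridization number 0, and each extra incoming edge at a reticulation node adds one.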
Submitted 31 May, 2016; v1 submitted 10 February, 2014;
originally announced February 2014.
-
QuPARA: Query-Driven Large-Scale Portfolio Aggregate Risk Analysis on MapReduce
Authors:
Andrew Rau-Chaplin,
Blesson Varghese,
Duane Wilson,
Zhimin Yao,
Norbert Zeh
Abstract:
Stochastic simulation techniques are used for portfolio risk analysis. Risk portfolios may consist of thousands of reinsurance contracts covering millions of insured locations. To quantify risk, each portfolio must be evaluated in up to a million simulation trials, each capturing a different possible sequence of catastrophic events over the course of a contractual year. In this paper, we explore the design of a flexible framework for portfolio risk analysis that facilitates answering a rich variety of catastrophic risk queries. Rather than aggregating simulation data in order to produce a small set of high-level risk metrics efficiently (as is often done in production risk management systems), the focus here is on allowing the user to pose queries on unaggregated or partially aggregated data. The goal is to provide a flexible framework that can be used by analysts to answer a wide variety of unanticipated but natural ad hoc queries. Such detailed queries can help actuaries or underwriters to better understand the multiple dimensions (e.g., spatial correlation, seasonality, peril features, construction features, and financial terms) that can impact portfolio risk. We implemented a prototype system, called QuPARA (Query-Driven Large-Scale Portfolio Aggregate Risk Analysis), using Hadoop, which is Apache's implementation of the MapReduce paradigm. This allows the user to take advantage of large parallel compute servers in order to answer ad hoc risk analysis queries efficiently even on the very large data sets typically encountered in practice. We describe the design and implementation of QuPARA and present experimental results that demonstrate its feasibility. A full portfolio risk analysis run consisting of a 1,000,000-trial simulation, with 1,000 events per trial and 3,200 risk transfer contracts, can be completed on a 16-node Hadoop cluster in just over 20 minutes.
Submitted 16 August, 2013;
originally announced August 2013.
-
Fixed-Parameter and Approximation Algorithms for Maximum Agreement Forests of Multifurcating Trees
Authors:
Chris Whidden,
Robert G. Beiko,
Norbert Zeh
Abstract:
We present efficient algorithms for computing a maximum agreement forest (MAF) of a pair of multifurcating (nonbinary) rooted trees. Our algorithms match the running times of the currently best algorithms for the binary case. The size of an MAF corresponds to the subtree prune-and-regraft (SPR) distance of the two trees and is intimately connected to their hybridization number. These distance measures are essential tools for understanding reticulate evolution, such as lateral gene transfer, recombination, and hybridization. Multifurcating trees arise naturally as a result of statistical uncertainty in current tree construction methods.
Submitted 2 May, 2013;
originally announced May 2013.
-
Fixed-Parameter and Approximation Algorithms for Maximum Agreement Forests
Authors:
Chris Whidden,
Robert G. Beiko,
Norbert Zeh
Abstract:
We present new and improved fixed-parameter algorithms for computing maximum agreement forests (MAFs) of pairs of rooted binary phylogenetic trees. The size of such a forest for two trees corresponds to their subtree prune-and-regraft distance and, if the agreement forest is acyclic, to their hybridization number. These distance measures are essential tools for understanding reticulate evolution. Our algorithm for computing maximum acyclic agreement forests is the first depth-bounded search algorithm for this problem. Our algorithms substantially outperform the best previous algorithms for these problems.
Submitted 2 May, 2013; v1 submitted 12 August, 2011;
originally announced August 2011.
-
NAPX: A Polynomial Time Approximation Scheme for the Noah's Ark Problem
Authors:
G. Hickey,
P. Carmi,
A. Maheshwari,
N. Zeh
Abstract:
The Noah's Ark Problem (NAP) is an NP-hard optimization problem with relevance to ecological conservation management. It asks to maximize the phylogenetic diversity (PD) of a set of taxa given a fixed budget, where each taxon is associated with a cost of conservation and a probability of extinction. NAP has received renewed interest with the rise in availability of genetic sequence data, allowing PD to be used as a practical measure of biodiversity. However, only simplified instances of the problem, where one or more parameters are fixed as constants, have so far been addressed in the literature. We present NAPX, the first algorithm for the general version of NAP that returns a $(1 - ε)$-approximation of the optimal solution. It runs in $O(\frac{n B^2 h^2 \log^2 n}{\log^2(1 - ε)})$ time, where $n$ is the number of species, $B$ is the total budget, and $h$ is the height of the input tree. We also provide improved bounds for its expected running time.
Submitted 27 October, 2008; v1 submitted 12 May, 2008;
originally announced May 2008.
-
Geometric Spanners With Small Chromatic Number
Authors:
Prosenjit Bose,
Paz Carmi,
Mathieu Couture,
Anil Maheshwari,
Michiel Smid,
Norbert Zeh
Abstract:
Given an integer $k \geq 2$, we consider the problem of computing the smallest real number $t(k)$ such that for each set $P$ of points in the plane, there exists a $t(k)$-spanner for $P$ that has chromatic number at most $k$. We prove that $t(2) = 3$, $t(3) = 2$, $t(4) = \sqrt{2}$, and give upper and lower bounds on $t(k)$ for $k>4$. We also show that for any $ε>0$, there exists a $(1+ε)t(k)$-spanner for $P$ that has $O(|P|)$ edges and chromatic number at most $k$. Finally, we consider an on-line variant of the problem where the points of $P$ are given one after another, and the color of a point must be assigned at the moment the point is given. In this setting, we prove that $t(2) = 3$, $t(3) = 1+ \sqrt{3}$, $t(4) = 1+ \sqrt{2}$, and give upper and lower bounds on $t(k)$ for $k>4$.
Submitted 1 November, 2007;
originally announced November 2007.