-
VectorTSP: A Traveling Salesperson Problem with Racetrack-like acceleration constraints
Authors:
Arnaud Casteigts,
Mathieu Raffinot,
Jason Schoeters
Abstract:
We study a new version of the Euclidean TSP called VectorTSP (VTSP for short) where a mobile entity is allowed to move according to a set of physical constraints inspired by the paper-and-pencil game Racetrack (also known as Vector Racer). In contrast to other versions of TSP accounting for physical constraints, such as Dubins TSP, the spirit of this model is that (1) no speed limitations apply, and (2) inertia depends on the current velocity. As such, this model is closer to typical models considered in path planning problems, although applied here to the visit of n cities in a non-predetermined order. We motivate and introduce the VectorTSP problem, discussing fundamental differences with previous versions of TSP. In particular, an optimal visit order for ETSP may not be optimal for VTSP. We show that VectorTSP is NP-hard, and in the other direction, that VectorTSP reduces to GroupTSP in polynomial time (although with a significant blow-up in size). On the algorithmic side, we formulate the search for a solution as an interactive scheme between a high-level algorithm and a trajectory oracle, the former being responsible for computing the visit order and the latter for computing the cost (or the trajectory) for a given visit order. We present algorithms for both, and we demonstrate and quantify through experiments that this approach frequently finds a better solution than the optimal trajectory realizing an optimal ETSP tour, which legitimizes the problem itself and (we hope) motivates further algorithmic developments.
Submitted 20 August, 2021; v1 submitted 5 June, 2020;
originally announced June 2020.
-
On improving the approximation ratio of the r-shortest common superstring problem
Authors:
Tristan Braquelaire,
Marie Gasparoux,
Mathieu Raffinot,
Raluca Uricaru
Abstract:
The Shortest Common Superstring problem (SCS) consists, for a set of strings S = {s_1,...,s_n}, in finding a minimum length string that contains all s_i, 1 <= i <= n, as substrings. While a 2+11/30 approximation ratio algorithm has recently been published, the general objective is now to break the conceptual lower bound barrier of 2. This paper is a step forward in this direction. Here we focus on a particular instance of the SCS problem, namely the r-SCS problem, which requires all input strings to be of the same length, r. Golovnev et al. proved an approximation ratio which is better than the general one for r <= 6. Here we extend their approach and improve their approximation ratio, which is now better than the general one for r <= 7, and less than or equal to 2 up to r = 6.
Submitted 30 April, 2018;
originally announced May 2018.
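To make the SCS definition in the abstract above concrete, here is a minimal sketch of the classical greedy merge heuristic in Python. This is not the algorithm of the paper and carries none of its approximation guarantees; the function names `overlap_len` and `greedy_superstring` are hypothetical and chosen only for illustration.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Classical greedy heuristic: repeatedly merge the pair with maximum overlap.

    Illustration only (assumption: this is NOT the method of the paper).
    """
    # Deduplicate and drop strings already contained in another string.
    S = sorted(set(strings))
    S = [s for s in S if not any(s != t and s in t for t in S)]
    while len(S) > 1:
        # Pick the ordered pair (i, j) with the largest suffix/prefix overlap.
        k, i, j = max(
            (overlap_len(S[i], S[j]), i, j)
            for i in range(len(S))
            for j in range(len(S))
            if i != j
        )
        merged = S[i] + S[j][k:]
        S = [s for idx, s in enumerate(S) if idx not in (i, j)] + [merged]
    return S[0] if S else ""


reads = ["abc", "bcd", "cde"]
superstring = greedy_superstring(reads)
print(superstring, all(r in superstring for r in reads))
```

On inputs where all strings have the same length r, as in r-SCS, the containment filter only removes duplicates.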
-
Optimizing Google Shopping Campaigns Structures With Query-Level Matching
Authors:
Mathieu Raffinot,
Romain Rivière
Abstract:
How to bid on a Google Shopping account (a set of Shopping campaigns) with query-level matching, as in Google AdWords.
Submitted 3 August, 2017;
originally announced August 2017.
-
Indexing and querying color sets of images
Authors:
Djamal Belazzougui,
Roman Kolpakov,
Mathieu Raffinot
Abstract:
We aim to study the set of color sets of continuous regions of an image given as a matrix of $m$ rows over $n\geq m$ columns where each element in the matrix is an integer from $[1,σ]$ named a {\em color}.
The set of distinct colors in a region is called its fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions, named rectangles. The set of all such fingerprints is denoted by ${\cal F}$. A rectangle is {\em maximal} if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by $\mathcal{L}$. We first explain how to determine all the $|\mathcal{L}|$ maximal locations with their fingerprints in expected time $O(nm^2σ)$ using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic $O(nm^2σ\log(\frac{|\mathcal{L}|}{nm^2}+2))$ time. We then show how to build a data structure which occupies $O(nm\log n+|\mathcal{L}|)$ space such that a query which asks for all the maximal locations with a given fingerprint $f$ can be answered in time $O(|f|+\log\log n+k)$, where $k$ is the number of maximal locations with fingerprint $f$. If the query asks only for the presence of the fingerprint, then the space usage becomes $O(nm\log n+|{\cal F}|)$ while the query time becomes $O(|f|+\log\log n)$. Finally, we consider the special case of square regions (squares).
Submitted 28 August, 2016;
originally announced August 2016.
-
A note on the shortest common superstring of NGS reads
Authors:
Tristan Braquelaire,
Marie Gasparoux,
Mathieu Raffinot,
Raluca Uricaru
Abstract:
The Shortest Superstring Problem (SSP) consists, for a set of strings S = {s_1,...,s_n}, in finding a minimum length string that contains all s_i, 1 <= i <= n, as substrings. This problem is proved to be NP-Complete and APX-hard. Guaranteed approximation algorithms have been proposed, the current best ratio being 2+11/23, which has been achieved following a long and difficult quest. However, SSP is widely used in practice on next generation sequencing (NGS) data, which plays an increasingly important role in sequencing. In this note, we show that the SSP approximation ratio can be improved on NGS reads by assuming specific characteristics of NGS data that are experimentally verified on a very large sampling set.
Submitted 18 May, 2016;
originally announced May 2016.
-
Practical combinations of repetition-aware data structures
Authors:
Djamal Belazzougui,
Fabio Cunial,
Travis Gagie,
Nicola Prezza,
Mathieu Raffinot
Abstract:
Highly-repetitive collections of strings are increasingly being amassed by genome sequencing and genetic variation experiments, as well as by storing all versions of human-generated files, like webpages and source code. Existing indexes for locating all the exact occurrences of a pattern in a highly-repetitive string take advantage of a single measure of repetition. However, multiple, distinct measures of repetition all grow sublinearly in the length of a highly-repetitive string. In this paper we explore the practical advantages of combining data structures whose size depends on distinct measures of repetition. The main ingredient of our structures is the run-length encoded BWT (RLBWT), which takes space proportional to the number of runs in the Burrows-Wheeler transform of a string. We describe a range of practical variants that combine RLBWT with the set of boundaries of the Lempel-Ziv 77 factors of a string, which take space proportional to the number of factors. Such variants use, respectively, the RLBWT of a string and the RLBWT of its reverse, or just one RLBWT inside a bidirectional index, or just one RLBWT with support for unidirectional extraction. We also study the practical advantages of combining RLBWT with the compact directed acyclic word graph of a string, a data structure that takes space proportional to the number of one-character extensions of maximal repeats. Our approaches are easy to implement, and provide competitive tradeoffs on significant datasets.
Submitted 21 April, 2016; v1 submitted 20 April, 2016;
originally announced April 2016.
-
Composite repetition-aware data structures
Authors:
Djamal Belazzougui,
Fabio Cunial,
Travis Gagie,
Nicola Prezza,
Mathieu Raffinot
Abstract:
In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size again depends on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.
Submitted 23 February, 2015; v1 submitted 20 February, 2015;
originally announced February 2015.
-
Easy identification of generalized common and conserved nested intervals
Authors:
Fabien de Montgolfier,
Mathieu Raffinot,
Irena Rusu
Abstract:
In this paper we explain how to easily compute gene clusters, formalized by classical or generalized nested common or conserved intervals, between a set of K genomes represented as K permutations. A b-nested common (resp. conserved) interval I of size |I| is either an interval of size 1 or a common (resp. conserved) interval that contains another b-nested common (resp. conserved) interval of size at least |I|-b. When b=1, this corresponds to the classical notion of nested interval. We exhibit two simple algorithms to output all b-nested common or conserved intervals between K permutations in O(Kn+nocc) time, where nocc is the total number of such intervals. We also explain how to count all b-nested intervals in O(Kn) time. New properties of the family of conserved intervals are proposed to do so.
Submitted 2 December, 2013; v1 submitted 21 May, 2013;
originally announced May 2013.
-
Single and multiple consecutive permutation motif search
Authors:
Djamal Belazzougui,
Adeline Pierrot,
Mathieu Raffinot,
Stéphane Vialette
Abstract:
Let $t$ be a permutation (that shall play the role of the {\em text}) on $[n]$ and let a pattern $p$ be a sequence of $m$ distinct integers of $[n]$, $m\leq n$. The pattern $p$ occurs in $t$ at position $i$ if and only if $p_1 \ldots p_m$ is order-isomorphic to $t_i \ldots t_{i+m-1}$, that is, for all $1 \leq k< \ell \leq m$, $p_k>p_\ell$ if and only if $t_{i+k-1}>t_{i+\ell-1}$. Searching for a pattern $p$ in a text $t$ consists in identifying all occurrences of $p$ in $t$. We first present a forward automaton which allows us to search for $p$ in $t$ in $O(m^2\log \log m +n)$ time. We then introduce a Morris-Pratt automaton representation of the forward automaton which allows us to reduce this complexity to $O(m\log \log m +n)$ at the price of an additional amortized constant term per integer of the text. Both automata occupy $O(m)$ space. We then extend the problem to search for a set of patterns and exhibit a specific Aho-Corasick-like algorithm. Next we present a sub-linear average-case search algorithm running in $O(\frac{m\log m}{\log\log m}+\frac{n\log m}{m\log\log m})$ time, which we finally prove to be optimal on average.
Submitted 25 April, 2013; v1 submitted 21 January, 2013;
originally announced January 2013.
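The occurrence definition in the abstract above can be checked directly with a brute-force sketch in Python. This is only a literal transcription of the order-isomorphism condition, running in roughly O(n m^2) time; it is not the forward automaton, the Morris-Pratt representation, or the average-case algorithm of the paper. The names `is_order_isomorphic` and `naive_occurrences` are hypothetical.

```python
def is_order_isomorphic(p, w):
    """True iff p[k] > p[l] exactly when w[k] > w[l], for all k < l."""
    m = len(p)
    return len(w) == m and all(
        (p[k] > p[l]) == (w[k] > w[l])
        for k in range(m)
        for l in range(k + 1, m)
    )


def naive_occurrences(t, p):
    """Return the 1-based positions i where p occurs in t (definition check only)."""
    m = len(p)
    return [
        i + 1
        for i in range(len(t) - m + 1)
        if is_order_isomorphic(p, t[i:i + m])
    ]


text = [3, 1, 4, 2, 5]
print(naive_occurrences(text, [1, 3, 2]))  # window 1 4 2 has the same relative order
print(naive_occurrences(text, [1, 2]))     # every ascent of the text
```

Positions are reported 1-based to match the indexing $t_i \ldots t_{i+m-1}$ used in the abstract.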
-
Various improvements to text fingerprinting
Authors:
Djamal Belazzougui,
Roman Kolpakov,
Mathieu Raffinot
Abstract:
Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet Σ of size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set {\cal F} of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s_i .. s_j is a maximal location for a fingerprint f in {\cal F} (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations in s is {\cal L} (it is easy to see that |{\cal L}| \leq n σ). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are named {\em copies}, and the quotient set of {\cal L} according to the copy relation is denoted by {\cal L}_C. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute {\cal F}; (2) given f as a set of distinct characters in Σ, to answer if f represents a fingerprint in {\cal F}; (3) given f, to find all maximal locations of f in s.
Submitted 15 January, 2013;
originally announced January 2013.
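The definitions of fingerprint and maximal location in the abstract above can be illustrated with a brute-force sketch in Python. It enumerates all substrings and is meant only to make the definitions concrete; it does not reflect the algorithms or data structures of the paper. The function names `fingerprints` and `maximal_locations` are hypothetical.

```python
def fingerprints(s):
    """Set F of all fingerprints (sets of distinct characters) of substrings of s."""
    F = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            F.add(frozenset(seen))
    return F


def maximal_locations(s):
    """All maximal locations <i,j> (1-based): the neighbours s_{i-1}, s_{j+1},
    when defined, must not belong to the fingerprint of s_i .. s_j."""
    n = len(s)
    locations = []
    for i in range(n):
        for j in range(i, n):
            f = set(s[i:j + 1])
            left_ok = i == 0 or s[i - 1] not in f
            right_ok = j == n - 1 or s[j + 1] not in f
            if left_ok and right_ok:
                locations.append((i + 1, j + 1))
    return locations


s = "abab"
print(len(fingerprints(s)))
print(maximal_locations(s))
```

For s = "abab" the sketch finds five maximal locations, within the bound |L| <= nσ = 8 mentioned in the abstract.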
-
Faster and Simpler Minimal Conflicting Set Identification
Authors:
Aida Ouangraoua,
Mathieu Raffinot
Abstract:
Let C be a finite set of N elements and R = r_1,r_2,..., r_m a family of M subsets of C. A subset X of R verifies the Consecutive Ones Property (C1P) if there exists a permutation P of C such that each r_i in X is an interval of P. A Minimal Conflicting Set (MCS) S is a subset of R that does not verify the C1P, but such that any of its proper subsets does. In this paper, we present a new simpler and faster algorithm to decide if a given element r in R belongs to at least one MCS. Our algorithm runs in O(N^2M^2 + NM^7), largely improving the current O(M^6N^5 (M+N)^2 log(M+N)) fastest algorithm of [Blin {\em et al}, CSR 2011]. The new algorithm is based on an alternative approach considering minimal forbidden induced subgraphs of interval graphs instead of Tucker matrices.
Submitted 26 January, 2012;
originally announced January 2012.
-
Consecutive ones property testing: cut or swap
Authors:
Mathieu Raffinot
Abstract:
Let C be a finite set of N elements and R = {R_1, R_2, ..., R_m} a family of m subsets of C. The family R verifies the consecutive ones property if there exists a permutation P of C such that each R_i in R is an interval of P. There already exist several algorithms to test this property in sum_{i=1}^m |R_i| time, all being involved. We present a simpler algorithm, based on a new partitioning scheme.
Submitted 23 August, 2010;
originally announced August 2010.
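As an illustration of the consecutive ones property defined in the abstract above, the following Python sketch tests it by exhaustive search over all permutations of C. It takes exponential time and is useful only for tiny instances; it is unrelated to the partitioning scheme of the paper. The names `is_interval` and `has_c1p` are hypothetical.

```python
from itertools import permutations


def is_interval(subset, order):
    """True iff the elements of subset occupy consecutive positions in order."""
    positions = sorted(order.index(x) for x in subset)
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def has_c1p(C, R):
    """Brute-force test: does some permutation of C make every set in R an interval?"""
    return any(
        all(is_interval(r, perm) for r in R)
        for perm in permutations(C)
    )


C = [1, 2, 3, 4]
print(has_c1p(C, [{1, 2}, {2, 3}, {3, 4}]))  # a path-like family
print(has_c1p(C, [{1, 2}, {1, 3}, {1, 4}]))  # element 1 would need three neighbours
```

The second family fails because no linear order can place 1 next to 2, 3 and 4 at once.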
-
Linear Time Split Decomposition Revisited
Authors:
Pierre Charbit,
Fabien de Montgolfier,
Mathieu Raffinot
Abstract:
Given a family F of subsets of a ground set V, its orthogonal is defined to be the family of subsets that do not overlap any element of F.
Using this tool we revisit the problem of designing a simple linear time algorithm for undirected graph split (also known as 1-join) decomposition.
Submitted 28 June, 2010; v1 submitted 10 February, 2009;
originally announced February 2009.
-
A Note On Computing Set Overlap Classes
Authors:
Pierre Charbit,
Michel Habib,
Vincent Limouzy,
Fabien De Montgolfier,
Mathieu Raffinot,
Michaël Rao
Abstract:
Let ${\cal V}$ be a finite set of $n$ elements and ${\cal F}=\{X_1,X_2, \ldots, X_m\}$ a family of $m$ subsets of ${\cal V}.$ Two sets $X_i$ and $X_j$ of ${\cal F}$ overlap if $X_i \cap X_j \neq \emptyset,$ $X_j \setminus X_i \neq \emptyset,$ and $X_i \setminus X_j \neq \emptyset.$ Two sets $X,Y\in {\cal F}$ are in the same overlap class if there is a series $X=X_1,X_2, \ldots, X_k=Y$ of sets of ${\cal F}$ in which each pair $X_i, X_{i+1}$ overlaps. In this note, we focus on efficiently identifying all overlap classes in $O(n+\sum_{i=1}^m |X_i|)$ time. We thus revisit the clever algorithm of Dahlhaus, of which we give a clear presentation and that we simplify to make it practical and implementable in its real worst case complexity. A useful variant of Dahlhaus's approach is also explained.
Submitted 28 November, 2007;
originally announced November 2007.
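The overlap relation and overlap classes defined in the abstract above can be computed naively by testing every pair of sets and merging with a union-find structure, as in the Python sketch below. This quadratic baseline is an assumption made for illustration; it is not Dahlhaus's algorithm and does not achieve the $O(n+\sum_{i=1}^m |X_i|)$ bound. The names `overlap` and `overlap_classes` are hypothetical.

```python
def overlap(x, y):
    """Two sets overlap if they intersect and neither contains the other."""
    return bool(x & y) and bool(x - y) and bool(y - x)


def overlap_classes(family):
    """Group the indices of family into overlap classes (naive pairwise version)."""
    parent = list(range(len(family)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(family)):
        for j in range(i + 1, len(family)):
            if overlap(family[i], family[j]):
                parent[find(i)] = find(j)

    classes = {}
    for i in range(len(family)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


family = [{1, 2}, {2, 3}, {3, 4}, {5}, {1, 2, 3, 4, 5}]
print(overlap_classes(family))
```

In this example the first three sets form one class through the chain {1,2}, {2,3}, {3,4}, while {5} and the full set overlap nothing and stay alone.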