-
Efficient Online String Matching through Linked Weak Factors
Authors:
Matthew N. Palmer,
Simone Faro,
Stefano Scafiti
Abstract:
Online string matching is a computational problem involving the search for patterns or substrings in a large text dataset, with the pattern and text being processed sequentially, without prior access to the entire text. Its relevance stems from applications in data compression, data mining, text editing, and bioinformatics, where rapid and efficient pattern matching is crucial. Various solutions h…
▽ More
Online string matching is a computational problem involving the search for patterns or substrings in a large text dataset, with the pattern and text being processed sequentially, without prior access to the entire text. Its relevance stems from applications in data compression, data mining, text editing, and bioinformatics, where rapid and efficient pattern matching is crucial. Various solutions have been proposed over the past few decades, employing diverse techniques. Recently, weak recognition approaches have attracted increasing attention. This paper presents Hash Chain, a new algorithm based on a robust weak factor recognition approach that connects adjacent factors through hashing. Despite its O(nm) complexity, the algorithm exhibits a sublinear behavior in practice and achieves superior performance compared to the most effective algorithms.
△ Less
Submitted 24 October, 2023;
originally announced October 2023.
-
Longest Common Substring and Longest Palindromic Substring in $\tilde{\mathcal{O}}(\sqrt{n})$ Time
Authors:
Domenico Cantone,
Simone Faro,
Arianna Pavone,
Caterina Viola
Abstract:
The Longest Common Substring (LCS) and Longest Palindromic Substring (LPS) are classical problems in computer science, representing fundamental challenges in string processing. Both problems can be solved in linear time using a classical model of computation, by means of very similar algorithms, both relying on the use of suffix trees. Very recently, two sublinear algorithms for LCS and LPS in the…
▽ More
The Longest Common Substring (LCS) and Longest Palindromic Substring (LPS) are classical problems in computer science, representing fundamental challenges in string processing. Both problems can be solved in linear time using a classical model of computation, by means of very similar algorithms, both relying on the use of suffix trees. Very recently, two sublinear algorithms for LCS and LPS in the quantum query model have been presented by Le Gall and Seddighin~\cite{GallS23}, requiring $\tilde{\mathcal{O}}(n^{5/6})$ and $\tilde{\mathcal{O}}(\sqrt{n})$ queries, respectively. However, while the query model is fascinating from a theoretical standpoint, its practical applicability becomes limited when it comes to crafting algorithms meant for actual execution on real hardware. In this paper we present, for the first time, a $\tilde{\mathcal{O}}(\sqrt{n})$ quantum algorithm for both LCS and LPS working in the circuit model of computation. Our solutions are simpler than previous ones and can be easily translated into quantum procedures. We also present actual implementations of the two algorithms as quantum circuits working in $\mathcal{O}(\sqrt{n}\log^5(n))$ and $\mathcal{O}(\sqrt{n}\log^4(n))$ time, respectively.
△ Less
Submitted 3 September, 2023;
originally announced September 2023.
-
Quantum Circuits for Fixed Substring Matching Problems
Authors:
Domenico Cantone,
Simone Faro,
Arianna Pavone,
Caterina Viola
Abstract:
Quantum computation represents a computational paradigm whose distinctive attributes confer the ability to devise algorithms with asymptotic performance levels significantly superior to those achievable via classical computation. Recent strides have been taken to apply this computational framework in tackling and resolving various issues related to text processing. The resultant solutions demonstr…
▽ More
Quantum computation represents a computational paradigm whose distinctive attributes confer the ability to devise algorithms with asymptotic performance levels significantly superior to those achievable via classical computation. Recent strides have been taken to apply this computational framework in tackling and resolving various issues related to text processing. The resultant solutions demonstrate marked advantages over their classical counterparts. This study employs quantum computation to efficaciously surmount text processing challenges, particularly those involving string comparison. The focus is on the alignment of fixed-length substrings within two input strings. Specifically, given two input strings, $x$ and $y$, both of length $n$, and a value $d \leq n$, we want to verify the following conditions: the existence of a common prefix of length $d$, the presence of a common substring of length $d$ beginning at position $j$ (with $0 \leq j < n$) and, the presence of any common substring of length $d$ beginning in both strings at the same position. Such problems find applications as sub-procedures in a variety of problems concerning text processing and sequence analysis. Notably, our approach furnishes polylogarithmic solutions, a stark contrast to the linear complexity inherent in the best classical alternatives.
△ Less
Submitted 22 August, 2023;
originally announced August 2023.
-
The Many Qualities of a New Directly Accessible Compression Scheme
Authors:
Domenico Cantone,
Simone Faro
Abstract:
We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either pr…
▽ More
We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. For a text of length $n$ over an alphabet of size $σ$ and a fixed parameter $λ$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{σ- λ+ 3} - 3)/F_{σ+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. In the overall it uses $N+\mathcal{O}\big(n \left(λ- (F_{σ+3}-3)/F_{σ+1}\big) \right) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with the performance of DACs and Wavelet Tees, which are among of the most efficient schemes. In addition our scheme is configured as a \emph{computation-friendly compression} scheme, as it counts several features that make it very effective in text processing tasks. In the string matching problem, that we take as a case study, we experimentally prove that the new scheme enables results that are up to 29 times faster than standard string-matching techniques on plain texts.
△ Less
Submitted 31 March, 2023;
originally announced March 2023.
-
Daml: A Smart Contract Language for Securely Automating Real-World Multi-Party Business Workflows
Authors:
Alexander Bernauer,
Sofia Faro,
Rémy Hämmerle,
Martin Huschenbett,
Moritz Kiefer,
Andreas Lochbihler,
Jussi Mäki,
Francesco Mazzoli,
Simon Meier,
Neil Mitchell,
Ratko G. Veprek
Abstract:
Distributed ledger technologies, also known as blockchains for enterprises, promise to significantly reduce the high cost of automating multi-party business workflows. We argue that a programming language for writing such on-ledger logic should satisfy three desiderata: (1) Provide concepts to capture the legal rules that govern real-world business workflows. (2) Include simple means for specifyin…
▽ More
Distributed ledger technologies, also known as blockchains for enterprises, promise to significantly reduce the high cost of automating multi-party business workflows. We argue that a programming language for writing such on-ledger logic should satisfy three desiderata: (1) Provide concepts to capture the legal rules that govern real-world business workflows. (2) Include simple means for specifying policies for access and authorization. (3) Support the composition of simple workflows into complex ones, even when the simple workflows have already been deployed.
We present the open-source smart contract language Daml based on Haskell with strict evaluation. Daml achieves these desiderata by offering novel primitives for representing, accessing, and modifying data on the ledger, which are mimicking the primitives of today's legal systems. Robust access and authorization policies are specified as part of these primitives, and Daml's built-in authorization rules enable delegation, which is key for workflow composability. These properties make Daml well-suited for orchestrating business workflows across multiple, otherwise heterogeneous parties.
Daml contracts run (1) on centralized ledgers backed by a database, (2) on distributed deployments with Byzantine fault tolerant consensus, and (3) on top of conventional blockchains, as a second layer via an atomic commit protocol.
△ Less
Submitted 7 March, 2023;
originally announced March 2023.
-
Text Searching Allowing for Non-Overlap** Adjacent Unbalanced Translocations
Authors:
Domenico Cantone,
Simone Faro,
Arianna Pavone
Abstract:
In this paper we investigate the \emph{approximate string matching problem} when the allowed edit operations are \emph{non-overlap** unbalanced translocations of adjacent factors}. Such kind of edit operations take place when two adjacent sub-strings of the text swap, resulting in a modified string. The two involved substrings are allowed to be of different lengths.
Such large-scale modificati…
▽ More
In this paper we investigate the \emph{approximate string matching problem} when the allowed edit operations are \emph{non-overlap** unbalanced translocations of adjacent factors}. Such kind of edit operations take place when two adjacent sub-strings of the text swap, resulting in a modified string. The two involved substrings are allowed to be of different lengths.
Such large-scale modifications on strings have various applications. They are among the most frequent chromosomal alterations, accounted for 30\% of all losses of heterozygosity, a major genetic event causing inactivation of cancer suppressor genes. In addition, among other applications, they are frequent modifications accounted in musical or in natural language information retrieval. However, despite of their central role in so many fields of text processing, little attention has been devoted to the problem of matching strings allowing for this kind of edit operation.
In this paper we present three algorithms for solving the problem, all of them with a $\bigO(nm^3)$ worst-case and a $\bigO(m^2)$-space complexity, where $m$ and $n$ are the length of the pattern and of the text, respectively. %
In particular, our first algorithm is based on the dynamic-programming approach. Our second solution improves the previous one by making use of the Directed Acyclic Word Graph of the pattern. Finally our third algorithm is based on an alignment procedure. We also show that under the assumptions of equiprobability and independence of characters, our second algorithm has a $\bigO(n\log^2_σ m)$ average time complexity, for an alphabet of size $σ\geq 4$.
△ Less
Submitted 3 January, 2021;
originally announced January 2021.
-
Fast Multiple Pattern Cartesian Tree Matching
Authors:
Geonmo Gu,
Siwoo Song,
Simone Faro,
Thierry Lecroq,
Kunsoo Park
Abstract:
Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. In this paper, we deal with Cartesian tree matching for the case of multiple patterns. We present two fingerprinting methods, i.e., the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method and a conventional mu…
▽ More
Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. In this paper, we deal with Cartesian tree matching for the case of multiple patterns. We present two fingerprinting methods, i.e., the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method and a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. We propose three practical algorithms for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm, the Rabin-Karp algorithm, and the Alpha Skip Search algorithm, respectively. In the experiments we compare our solutions against the previous algorithm [18]. Our solutions run faster than the previous algorithm as the pattern lengths increase. Especially, our algorithm based on Wu-Manber runs up to 33 times faster.
△ Less
Submitted 5 November, 2019;
originally announced November 2019.
-
Efficient Online String Matching Based on Characters Distance Text Sampling
Authors:
Simone Faro,
Arianna Pavone,
Francesco Pio Marino
Abstract:
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastic…
▽ More
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastically reduce searching time for the online solutions, on the other hand. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search online the sampled data for any occurrence of the sampled pattern, before verifying the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view it turns out that our solution shows a sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space whose amount goes from 11% to 2.8% of the text size, with a gain up to 50% if compared with previous solutions.
△ Less
Submitted 16 August, 2019;
originally announced August 2019.
-
Fast Cartesian Tree Matching
Authors:
Siwoo Song,
Cheol Ryu,
Simone Faro,
Thierry Lecroq,
Kunsoo Park
Abstract:
Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian trees as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and…
▽ More
Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian trees as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and an efficient verification technique for Cartesian tree matching. Any exact string matching algorithm can be used as a filtration for Cartesian tree matching on our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. By experiments we show that known string matching algorithms combined on our framework of binary filtration and efficient verification produce algorithms of good performances for Cartesian tree matching.
△ Less
Submitted 13 August, 2019;
originally announced August 2019.
-
Boundedness from below in the $U(1)\times U(1)$ three-Higgs-Doublet model
Authors:
Francisco S. Faro,
Igor P. Ivanov
Abstract:
Establishing if multi-Higgs potentials are bounded from below (BFB) can be rather challenging, and it may impede efficient investigation of all phenomenological consequences of such models. In this paper, we find the necessary and sufficient BFB conditions for the Three-Higgs-Doublet model (3HDM) with the global symmetry group $U(1)\times U(1)$. We observed an important role played by charge-break…
▽ More
Establishing if multi-Higgs potentials are bounded from below (BFB) can be rather challenging, and it may impede efficient investigation of all phenomenological consequences of such models. In this paper, we find the necessary and sufficient BFB conditions for the Three-Higgs-Doublet model (3HDM) with the global symmetry group $U(1)\times U(1)$. We observed an important role played by charge-breaking directions in the Higgs space, even for situations when a good-looking neutral minimum exists. This remark is not limited to the particular model we consider but represents a rather general feature of elaborate multi-Higgs potentials which must be carefully dealt with. Also, applying this method to Weinberg's model (the $\mathbb{Z}_2 \times \mathbb{Z}_2$ symmetric 3HDM) turned out to be more challenging than was believed in the literature. In particular, we have found that the approach taken in a paper from 2009 does not lead to the necessary and sufficient BFB conditions for this case.
△ Less
Submitted 1 September, 2019; v1 submitted 3 July, 2019;
originally announced July 2019.
-
Sequence Searching Allowing for Non-Overlap** Adjacent Unbalanced Translocations
Authors:
Domenico Cantone,
Simone Faro,
Arianna Pavone
Abstract:
Unbalanced translocations are among the most frequent chromosomal alterations, accounted for 30\% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite of their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration. In this paper we i…
▽ More
Unbalanced translocations are among the most frequent chromosomal alterations, accounted for 30\% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite of their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration. In this paper we investigate the \emph{approximate string matching} problem when the edit operations are non-overlap** unbalanced translocations of adjacent factors. In particular, we first present a $O(nm^3)$-time and $O(m^2)$-space algorithm based on the dynamic-programming approach. Then we improve our first result by designing a second solution which makes use of the Directed Acyclic Word Graph of the pattern. In particular, we show that under the assumptions of equiprobability and independence of characters, our algorithm has a $O(n\log^2_σ m)$ average time complexity, for an alphabet of size $σ$, still maintaining the $O(nm^3)$-time and the $O(m^2)$-space complexity in the worst case. To the best of our knowledge this is the first solution in literature for the approximate string matching problem allowing for unbalanced translocations of factors.
△ Less
Submitted 2 December, 2018;
originally announced December 2018.
-
Flexible and Efficient Algorithms for Abelian Matching in Strings
Authors:
Simone Faro,
Arianna Pavone
Abstract:
The abelian pattern matching problem consists in finding all substrings of a text which are permutations of a given pattern. This problem finds application in many areas and can be solved in linear time by a naive sliding window approach. In this short communication we present a new class of algorithms based on a new efficient fingerprint computation approach, called Heap-Counting, which turns out…
▽ More
The abelian pattern matching problem consists in finding all substrings of a text which are permutations of a given pattern. This problem finds application in many areas and can be solved in linear time by a naive sliding window approach. In this short communication we present a new class of algorithms based on a new efficient fingerprint computation approach, called Heap-Counting, which turns out to be fast, flexible and easy to be implemented. It can be proved that our solutions have a linear worst case time complexity and, in addition, we present an extensive experimental evaluation which shows that our newly presented algorithms are among the most efficient and flexible solutions in practice for the abelian matching problem in strings.
△ Less
Submitted 7 March, 2018;
originally announced March 2018.
-
Speeding Up String Matching by Weak Factor Recognition
Authors:
Domenico Cantone,
Simone Faro,
Arianna Pavone
Abstract:
String matching is the problem of finding all the substrings of a text which match a given pattern. It is one of the most investigated problems in computer science, mainly due to its very diverse applications in several fields. Recently, much research in the string matching field has focused on the efficiency and flexibility of the searching procedure and quite effective techniques have been propo…
▽ More
String matching is the problem of finding all the substrings of a text which match a given pattern. It is one of the most investigated problems in computer science, mainly due to its very diverse applications in several fields. Recently, much research in the string matching field has focused on the efficiency and flexibility of the searching procedure and quite effective techniques have been proposed for speeding up the existing solutions. In this context, algorithms based on factors recognition are among the best solutions. In this paper, we present a simple and very efficient algorithm for string matching based on a weak factor recognition and hashing. Our algorithm has a quadratic worst-case running time. However, despite its quadratic complexity, experimental results show that our algorithm obtains in most cases the best running times when compared, under various conditions, against the most effective algorithms present in literature. In the case of small alphabets and long patterns, the gain in running times reaches 28%. This makes our proposed algorithm one of the most flexible solutions in practical cases.
△ Less
Submitted 3 July, 2017;
originally announced July 2017.
-
Exact Online String Matching Bibliography
Authors:
Simone Faro
Abstract:
In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, informatio…
▽ More
In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 120 string matching algorithms have been proposed. In this note we present a comprehensive list of (almost) all string matching algorithms. The list is updated to May 2016.
△ Less
Submitted 17 May, 2016;
originally announced May 2016.
-
Prior Polarity Lexical Resources for the Italian Language
Authors:
Valeria Borzì,
Simone Faro,
Arianna Pavone,
Sabrina Sansone
Abstract:
In this paper we present SABRINA (Sentiment Analysis: a Broad Resource for Italian Natural language Applications) a manually annotated prior polarity lexical resource for Italian natural language applications in the field of opinion mining and sentiment induction. The resource consists in two different sets, an Italian dictionary of more than 277.000 words tagged with their prior polarity value, a…
▽ More
In this paper we present SABRINA (Sentiment Analysis: a Broad Resource for Italian Natural language Applications) a manually annotated prior polarity lexical resource for Italian natural language applications in the field of opinion mining and sentiment induction. The resource consists in two different sets, an Italian dictionary of more than 277.000 words tagged with their prior polarity value, and a set of polarity modifiers, containing more than 200 words, which can be used in combination with non neutral terms of the dictionary in order to induce the sentiment of Italian compound terms. To the best of our knowledge this is the first prior polarity manually annotated resource which has been developed for the Italian natural language.
△ Less
Submitted 1 July, 2015;
originally announced July 2015.
-
Efficient Algorithms for the Order Preserving Pattern Matching Problem
Authors:
Simone Faro,
Oğuzhan Külekci
Abstract:
Given a pattern x of length m and a text y of length n, both over an ordered alphabet, the order-preserving pattern matching problem consists in finding all substrings of the text with the same relative order as the pattern. It is an approximate variant of the well known exact pattern matching problem which has gained attention in recent years. This interesting problem finds applications in a lot…
▽ More
Given a pattern x of length m and a text y of length n, both over an ordered alphabet, the order-preserving pattern matching problem consists in finding all substrings of the text with the same relative order as the pattern. It is an approximate variant of the well known exact pattern matching problem which has gained attention in recent years. This interesting problem finds applications in a lot of fields as time series analysis, like share prices on stock markets, weather data analysis or to musical melody matching. In this paper we present two new filtering approaches which turn out to be much more effective in practice than the previously presented methods. From our experimental results it turns out that our proposed solutions are up to 2 times faster than the previous solutions reducing the number of false positives up to 99%
△ Less
Submitted 16 January, 2015;
originally announced January 2015.
-
Fast Packed String Matching for Short Patterns
Authors:
Simone Faro,
M. Oguzhan Külekci
Abstract:
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. In the last two decades a general trend has appeared trying to exploit the power of the word RAM model to speed-up the performances of classical string matching algorithms. In thi…
▽ More
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. In the last two decades a general trend has appeared trying to exploit the power of the word RAM model to speed-up the performances of classical string matching algorithms. In this model an algorithm operates on words of length w, grou** blocks of characters, and arithmetic and logic operations on the words take one unit of time. In this paper we use specialized word-size packed string matching instructions, based on the Intel streaming SIMD extensions (SSE) technology, to design very fast string matching algorithms in the case of short patterns. From our experimental results it turns out that, despite their quadratic worst case time complexity, the new presented algorithms become the clear winners on the average for short patterns, when compared against the most effective algorithms known in literature.
△ Less
Submitted 28 September, 2012;
originally announced September 2012.
-
The Exact String Matching Problem: a Comprehensive Experimental Evaluation
Authors:
Simone Faro,
Thierry Lecroq
Abstract:
This paper addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemis…
▽ More
This paper addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 80 string matching algorithms have been proposed, and more than 50% of them in the last ten years. In this note we present a comprehensive list of all string matching algorithms and present experimental results in order to compare them from a practical point of view. From our experimental evaluation it turns out that the performance of the algorithms are quite different for different alphabet sizes and pattern length.
△ Less
Submitted 12 December, 2010;
originally announced December 2010.
-
On Tuning the Bad-Character Rule: the Worst-Character Rule
Authors:
Domenico Cantone,
Simone Faro
Abstract:
In this note we present the worst-character rule, an efficient variation of the bad-character heuristic for the exact string matching problem, firstly introduced in the well-known Boyer-Moore algorithm. Our proposed rule selects a position relative to the current shift which yields the largest average advancement, according to the characters distribution in the text. Experimental results show that…
▽ More
In this note we present the worst-character rule, an efficient variation of the bad-character heuristic for the exact string matching problem, firstly introduced in the well-known Boyer-Moore algorithm. Our proposed rule selects a position relative to the current shift which yields the largest average advancement, according to the characters distribution in the text. Experimental results show that the worst-character rule achieves very good results especially in the case of long patterns or small alphabets in random texts and in the case of texts in natural languages.
△ Less
Submitted 6 December, 2010;
originally announced December 2010.
-
String Matching with Inversions and Translocations in Linear Average Time (Most of the Time)
Authors:
Szymon Grabowski,
Simone Faro,
Emanuele Giaquinta
Abstract:
We present an efficient algorithm for finding all approximate occurrences of a given pattern $p$ of length $m$ in a text $t$ of length $n$ allowing for translocations of equal length adjacent factors and inversions of factors. The algorithm is based on an efficient filtering method and has an $\bigO(nm\max(α, β))$-time complexity in the worst case and $\bigO(\max(α, β))$-space complexity, where…
▽ More
We present an efficient algorithm for finding all approximate occurrences of a given pattern $p$ of length $m$ in a text $t$ of length $n$ allowing for translocations of equal length adjacent factors and inversions of factors. The algorithm is based on an efficient filtering method and has an $\bigO(nm\max(α, β))$-time complexity in the worst case and $\bigO(\max(α, β))$-space complexity, where $α$ and $β$ are respectively the maximum length of the factors involved in any translocation and inversion. Moreover we show that under the assumptions of equiprobability and independence of characters our algorithm has a $\bigO(n)$ average time complexity, whenever $σ= Ω(\log m / \log\log^{1-ε} m)$, where $ε> 0$ and $σ$ is the dimension of the alphabet. Experiments show that the new proposed algorithm achieves very good results in practical cases.
△ Less
Submitted 1 December, 2010;
originally announced December 2010.
-
Efficient Pattern Matching on Binary Strings
Authors:
Simone Faro,
Thierry Lecroq
Abstract:
The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Moreover the problem finds applications also in the field of image processing and in pattern matching on compressed t…
▽ More
The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Moreover the problem finds applications also in the field of image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem adapted to completely avoid any reference to bits allowing to process pattern and text byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.
△ Less
Submitted 15 October, 2008; v1 submitted 14 October, 2008;
originally announced October 2008.