Efficient Online String Matching through Linked Weak Factors

Authors: Matthew N. Palmer, Simone Faro, Stefano Scafiti

Abstract: Online string matching is a computational problem involving the search for patterns or substrings in a large text dataset, with the pattern and text being processed sequentially, without prior access to the entire text. Its relevance stems from applications in data compression, data mining, text editing, and bioinformatics, where rapid and efficient pattern matching is crucial. Various solutions have been proposed over the past few decades, employing diverse techniques. Recently, weak recognition approaches have attracted increasing attention. This paper presents Hash Chain, a new algorithm based on a robust weak factor recognition approach that connects adjacent factors through hashing. Despite its O(nm) complexity, the algorithm exhibits a sublinear behavior in practice and achieves superior performance compared to the most effective algorithms.

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2309.01250 [pdf, ps, other]

Longest Common Substring and Longest Palindromic Substring in $\tilde{\mathcal{O}}(\sqrt{n})$ Time

Authors: Domenico Cantone, Simone Faro, Arianna Pavone, Caterina Viola

Abstract: The Longest Common Substring (LCS) and Longest Palindromic Substring (LPS) are classical problems in computer science, representing fundamental challenges in string processing. Both problems can be solved in linear time using a classical model of computation, by means of very similar algorithms, both relying on the use of suffix trees. Very recently, two sublinear algorithms for LCS and LPS in the quantum query model have been presented by Le Gall and Seddighin, requiring $\tilde{\mathcal{O}}(n^{5/6})$ and $\tilde{\mathcal{O}}(\sqrt{n})$ queries, respectively. However, while the query model is fascinating from a theoretical standpoint, its practical applicability becomes limited when it comes to crafting algorithms meant for actual execution on real hardware. In this paper we present, for the first time, a $\tilde{\mathcal{O}}(\sqrt{n})$ quantum algorithm for both LCS and LPS working in the circuit model of computation. Our solutions are simpler than previous ones and can be easily translated into quantum procedures. We also present actual implementations of the two algorithms as quantum circuits working in $\mathcal{O}(\sqrt{n}\log^5(n))$ and $\mathcal{O}(\sqrt{n}\log^4(n))$ time, respectively.

Submitted 3 September, 2023; originally announced September 2023.

arXiv:2308.11758 [pdf, ps, other]

Quantum Circuits for Fixed Substring Matching Problems

Authors: Domenico Cantone, Simone Faro, Arianna Pavone, Caterina Viola

Abstract: Quantum computation represents a computational paradigm whose distinctive attributes confer the ability to devise algorithms with asymptotic performance levels significantly superior to those achievable via classical computation. Recent strides have been taken to apply this computational framework in tackling and resolving various issues related to text processing. The resultant solutions demonstrate marked advantages over their classical counterparts. This study employs quantum computation to efficaciously surmount text processing challenges, particularly those involving string comparison. The focus is on the alignment of fixed-length substrings within two input strings. Specifically, given two input strings, $x$ and $y$, both of length $n$, and a value $d \leq n$, we want to verify the following conditions: the existence of a common prefix of length $d$, the presence of a common substring of length $d$ beginning at position $j$ (with $0 \leq j < n$), and the presence of any common substring of length $d$ beginning in both strings at the same position. Such problems find applications as sub-procedures in a variety of problems concerning text processing and sequence analysis. Notably, our approach furnishes polylogarithmic solutions, a stark contrast to the linear complexity inherent in the best classical alternatives.

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2303.18063 [pdf, ps, other]

The Many Qualities of a New Directly Accessible Compression Scheme

Authors: Domenico Cantone, Simone Faro

Abstract: We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports direct and fast accessibility to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratios, as required. For a text of length $n$ over an alphabet of size $σ$ and a fixed parameter $λ$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{σ- λ+ 3} - 3)/F_{σ+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall, it uses $N+\mathcal{O}\big(n \left(λ- (F_{σ+3}-3)/F_{σ+1}\right)\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with the performance of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is configured as a \emph{computation-friendly compression} scheme, as it counts several features that make it very effective in text processing tasks. In the string matching problem, which we take as a case study, we experimentally prove that the new scheme enables results that are up to 29 times faster than standard string-matching techniques on plain texts.

Submitted 31 March, 2023; originally announced March 2023.

Comments: 33 pages

arXiv:2303.03749 [pdf, ps, other]

Daml: A Smart Contract Language for Securely Automating Real-World Multi-Party Business Workflows

Authors: Alexander Bernauer, Sofia Faro, Rémy Hämmerle, Martin Huschenbett, Moritz Kiefer, Andreas Lochbihler, Jussi Mäki, Francesco Mazzoli, Simon Meier, Neil Mitchell, Ratko G. Veprek

Abstract: Distributed ledger technologies, also known as blockchains for enterprises, promise to significantly reduce the high cost of automating multi-party business workflows. We argue that a programming language for writing such on-ledger logic should satisfy three desiderata: (1) Provide concepts to capture the legal rules that govern real-world business workflows. (2) Include simple means for specifying policies for access and authorization. (3) Support the composition of simple workflows into complex ones, even when the simple workflows have already been deployed. We present the open-source smart contract language Daml based on Haskell with strict evaluation. Daml achieves these desiderata by offering novel primitives for representing, accessing, and modifying data on the ledger, which mimic the primitives of today's legal systems. Robust access and authorization policies are specified as part of these primitives, and Daml's built-in authorization rules enable delegation, which is key for workflow composability. These properties make Daml well-suited for orchestrating business workflows across multiple, otherwise heterogeneous parties. Daml contracts run (1) on centralized ledgers backed by a database, (2) on distributed deployments with Byzantine fault tolerant consensus, and (3) on top of conventional blockchains, as a second layer via an atomic commit protocol.

Submitted 7 March, 2023; originally announced March 2023.

ACM Class: D.3.1; F.3.2

arXiv:2101.00718 [pdf, other]

Text Searching Allowing for Non-Overlapping Adjacent Unbalanced Translocations

Authors: Domenico Cantone, Simone Faro, Arianna Pavone

Abstract: In this paper we investigate the \emph{approximate string matching problem} when the allowed edit operations are \emph{non-overlapping unbalanced translocations of adjacent factors}. Such edit operations take place when two adjacent substrings of the text swap, resulting in a modified string. The two involved substrings are allowed to be of different lengths. Such large-scale modifications on strings have various applications. They are among the most frequent chromosomal alterations, accounting for 30\% of all losses of heterozygosity, a major genetic event causing inactivation of cancer suppressor genes. In addition, among other applications, they are frequent modifications in musical or natural language information retrieval. However, despite their central role in so many fields of text processing, little attention has been devoted to the problem of matching strings allowing for this kind of edit operation. In this paper we present three algorithms for solving the problem, all of them with a $\mathcal{O}(nm^3)$ worst-case time and a $\mathcal{O}(m^2)$-space complexity, where $m$ and $n$ are the lengths of the pattern and of the text, respectively. In particular, our first algorithm is based on the dynamic-programming approach. Our second solution improves the previous one by making use of the Directed Acyclic Word Graph of the pattern. Finally, our third algorithm is based on an alignment procedure. We also show that under the assumptions of equiprobability and independence of characters, our second algorithm has a $\mathcal{O}(n\log^2_σ m)$ average time complexity, for an alphabet of size $σ\geq 4$.

Submitted 3 January, 2021; originally announced January 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:1812.00421

arXiv:1911.01644 [pdf, other]

Fast Multiple Pattern Cartesian Tree Matching

Authors: Geonmo Gu, Siwoo Song, Simone Faro, Thierry Lecroq, Kunsoo Park

Abstract: Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. In this paper, we deal with Cartesian tree matching for the case of multiple patterns. We present two fingerprinting methods, i.e., the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method and a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. We propose three practical algorithms for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm, the Rabin-Karp algorithm, and the Alpha Skip Search algorithm, respectively. In the experiments we compare our solutions against the previous algorithm [18]. Our solutions run faster than the previous algorithm as the pattern lengths increase. Especially, our algorithm based on Wu-Manber runs up to 33 times faster.

Submitted 5 November, 2019; originally announced November 2019.

Comments: Submitted to WALCOM 2020
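
The abstract above names a parent-distance encoding without defining it. As a rough illustration only, the sketch below assumes the parent-distance representation used in the Cartesian tree matching literature: for each position, the distance back to the nearest earlier element that is not larger, or 0 if there is none. Two sequences are then taken to have the same Cartesian tree when these representations coincide. The function names are hypothetical, the matcher is a naive single-pattern baseline, and none of this is the paper's fingerprinting or multi-pattern method.

```python
def parent_distance(seq):
    """Parent-distance representation (assumed definition, see lead-in).

    pd[i] = i - j, where j is the nearest index j < i with seq[j] <= seq[i],
    or 0 when no such index exists. Computed with a monotone stack.
    """
    pd, stack = [], []
    for i, value in enumerate(seq):
        # Discard earlier elements that are strictly larger than the current one.
        while stack and seq[stack[-1]] > value:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd


def cartesian_tree_matches(pattern, text):
    """Naive baseline: compare the representation of every text window."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    target = parent_distance(pattern)
    return [i for i in range(len(text) - m + 1)
            if parent_distance(text[i:i + m]) == target]
```

This baseline recomputes the representation for each window, so it costs O(nm) time; it is meant only to make the matching criterion concrete.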

arXiv:1908.05930 [pdf, ps, other]

Efficient Online String Matching Based on Characters Distance Text Sampling

Authors: Simone Faro, Arianna Pavone, Francesco Pio Marino

Abstract: Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string matching is an efficient approach recently introduced in order to overcome the prohibitive space requirements of an index construction, on the one hand, and drastically reduce searching time for the online solutions, on the other hand. In this paper we present a new algorithm for the sampled string matching problem, based on a characters distance sampling approach. The main idea is to sample the distances between consecutive occurrences of a given pivot character and then to search online the sampled data for any occurrence of the sampled pattern, before verifying the original text. From a theoretical point of view we prove that, under suitable conditions, our solution can achieve both linear worst-case time complexity and optimal average-time complexity. From a practical point of view it turns out that our solution shows a sub-linear behaviour in practice and speeds up online searching by a factor of up to 9, using limited additional space whose amount goes from 11% to 2.8% of the text size, with a gain up to 50% if compared with previous solutions.

Submitted 16 August, 2019; originally announced August 2019.
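
The main idea stated in the abstract above (sample the distances between consecutive occurrences of a pivot character, search the sampled data, then verify against the original text) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, the sampled search is a plain scan, and a pattern that does not contain the pivot falls back to a direct scan of the text.

```python
def pivot_positions(s, pivot):
    """Indices at which the pivot character occurs."""
    return [i for i, ch in enumerate(s) if ch == pivot]


def distances(positions):
    """Gaps between consecutive pivot occurrences (the sampled data)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def sampled_search(pattern, text, pivot):
    """Find all occurrences of pattern in text via distance sampling."""
    m, n = len(pattern), len(text)
    pat_pos = pivot_positions(pattern, pivot)
    if not pat_pos:
        # Assumption: with no pivot in the pattern, just scan the text directly.
        return [s for s in range(n - m + 1) if text.startswith(pattern, s)]
    txt_pos = pivot_positions(text, pivot)
    pat_d, txt_d = distances(pat_pos), distances(txt_pos)
    k = len(pat_d)
    hits = []
    for j in range(len(txt_pos) - k):
        # Candidate: the pattern's pivot gaps line up with the text's gaps here.
        if txt_d[j:j + k] == pat_d:
            s = txt_pos[j] - pat_pos[0]
            # Verification step against the original text.
            if 0 <= s <= n - m and text.startswith(pattern, s):
                hits.append(s)
    return hits
```

For example, `sampled_search("abra", "abracadabra", "a")` compares the pattern gaps `[3]` against the text gaps `[3, 2, 2, 3]` and verifies only the two candidate alignments.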

arXiv:1908.04937 [pdf, other]

Fast Cartesian Tree Matching

Authors: Siwoo Song, Cheol Ryu, Simone Faro, Thierry Lecroq, Kunsoo Park

Abstract: Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian trees as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and an efficient verification technique for Cartesian tree matching. Any exact string matching algorithm can be used as a filtration for Cartesian tree matching on our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. By experiments we show that known string matching algorithms combined on our framework of binary filtration and efficient verification produce algorithms with good performance for Cartesian tree matching.

Submitted 13 August, 2019; originally announced August 2019.

Comments: 14 pages, 3 figures, Submitted to SPIRE 2019

arXiv:1907.01963 [pdf, other]

doi 10.1103/PhysRevD.100.035038

Boundedness from below in the $U(1)\times U(1)$ three-Higgs-Doublet model

Authors: Francisco S. Faro, Igor P. Ivanov

Abstract: Establishing if multi-Higgs potentials are bounded from below (BFB) can be rather challenging, and it may impede efficient investigation of all phenomenological consequences of such models. In this paper, we find the necessary and sufficient BFB conditions for the Three-Higgs-Doublet model (3HDM) with the global symmetry group $U(1)\times U(1)$. We observed an important role played by charge-breaking directions in the Higgs space, even for situations when a good-looking neutral minimum exists. This remark is not limited to the particular model we consider but represents a rather general feature of elaborate multi-Higgs potentials which must be carefully dealt with. Also, applying this method to Weinberg's model (the $\mathbb{Z}_2 \times \mathbb{Z}_2$ symmetric 3HDM) turned out to be more challenging than was believed in the literature. In particular, we have found that the approach taken in a paper from 2009 does not lead to the necessary and sufficient BFB conditions for this case.

Submitted 1 September, 2019; v1 submitted 3 July, 2019; originally announced July 2019.

Comments: 8 pages, 1 figure; v2: extra clarifications, matches the published version

Report number: CFTP/19-022

Journal ref: Phys. Rev. D 100, 035038 (2019)

arXiv:1812.00421 [pdf, other]

Sequence Searching Allowing for Non-Overlapping Adjacent Unbalanced Translocations

Authors: Domenico Cantone, Simone Faro, Arianna Pavone

Abstract: Unbalanced translocations are among the most frequent chromosomal alterations, accounting for 30\% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration. In this paper we investigate the \emph{approximate string matching} problem when the edit operations are non-overlapping unbalanced translocations of adjacent factors. In particular, we first present a $O(nm^3)$-time and $O(m^2)$-space algorithm based on the dynamic-programming approach. Then we improve our first result by designing a second solution which makes use of the Directed Acyclic Word Graph of the pattern. In particular, we show that under the assumptions of equiprobability and independence of characters, our algorithm has a $O(n\log^2_σ m)$ average time complexity, for an alphabet of size $σ$, still maintaining the $O(nm^3)$-time and the $O(m^2)$-space complexity in the worst case. To the best of our knowledge this is the first solution in the literature for the approximate string matching problem allowing for unbalanced translocations of factors.

Submitted 2 December, 2018; originally announced December 2018.

arXiv:1803.02807 [pdf, ps, other]

Flexible and Efficient Algorithms for Abelian Matching in Strings

Authors: Simone Faro, Arianna Pavone

Abstract: The abelian pattern matching problem consists in finding all substrings of a text which are permutations of a given pattern. This problem finds application in many areas and can be solved in linear time by a naive sliding window approach. In this short communication we present a new class of algorithms based on a new efficient fingerprint computation approach, called Heap-Counting, which turns out to be fast, flexible and easy to implement. It can be proved that our solutions have a linear worst-case time complexity and, in addition, we present an extensive experimental evaluation which shows that our newly presented algorithms are among the most efficient and flexible solutions in practice for the abelian matching problem in strings.

Submitted 7 March, 2018; originally announced March 2018.

Comments: This is a short preliminary version of a full paper submitted to an international journal. Most examples, details, lemmas and theorems have been omitted
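
The naive sliding window approach mentioned in the abstract above can be sketched as follows. This is only the linear-time baseline, not the Heap-Counting fingerprint technique the paper introduces; the function name and the bookkeeping scheme are illustrative assumptions.

```python
def abelian_matches(pattern, text):
    """Start positions of text windows that are permutations of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    diff = {}      # per character: (count in window) - (count in pattern)
    nonzero = 0    # number of characters whose difference is not zero

    def bump(ch, delta):
        nonlocal nonzero
        old = diff.get(ch, 0)
        new = old + delta
        diff[ch] = new
        if old == 0 and new != 0:
            nonzero += 1
        elif old != 0 and new == 0:
            nonzero -= 1

    for ch in pattern:
        bump(ch, -1)
    for ch in text[:m]:
        bump(ch, +1)
    hits = [0] if nonzero == 0 else []
    for i in range(m, n):
        bump(text[i], +1)        # character entering the window
        bump(text[i - m], -1)    # character leaving the window
        if nonzero == 0:
            hits.append(i - m + 1)
    return hits
```

Each text character enters and leaves the window once, and each update takes expected constant time, so the scan is linear in the text length.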

arXiv:1707.00469 [pdf, ps, other]

Speeding Up String Matching by Weak Factor Recognition

Authors: Domenico Cantone, Simone Faro, Arianna Pavone

Abstract: String matching is the problem of finding all the substrings of a text which match a given pattern. It is one of the most investigated problems in computer science, mainly due to its very diverse applications in several fields. Recently, much research in the string matching field has focused on the efficiency and flexibility of the searching procedure and quite effective techniques have been proposed for speeding up the existing solutions. In this context, algorithms based on factors recognition are among the best solutions. In this paper, we present a simple and very efficient algorithm for string matching based on a weak factor recognition and hashing. Our algorithm has a quadratic worst-case running time. However, despite its quadratic complexity, experimental results show that our algorithm obtains in most cases the best running times when compared, under various conditions, against the most effective algorithms present in literature. In the case of small alphabets and long patterns, the gain in running times reaches 28%. This makes our proposed algorithm one of the most flexible solutions in practical cases.

Submitted 3 July, 2017; originally announced July 2017.

Comments: 11 pages, appeared in proceedings of the Prague Stringology Conference 2017

arXiv:1605.05067 [pdf, other]

Exact Online String Matching Bibliography

Authors: Simone Faro

Abstract: In this short note we present a comprehensive bibliography for the online exact string matching problem. The problem consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 120 string matching algorithms have been proposed. In this note we present a comprehensive list of (almost) all string matching algorithms. The list is updated to May 2016.

Submitted 17 May, 2016; originally announced May 2016.

Comments: 23 pages

arXiv:1507.00133 [pdf, other]

Prior Polarity Lexical Resources for the Italian Language

Authors: Valeria Borzì, Simone Faro, Arianna Pavone, Sabrina Sansone

Abstract: In this paper we present SABRINA (Sentiment Analysis: a Broad Resource for Italian Natural language Applications), a manually annotated prior polarity lexical resource for Italian natural language applications in the field of opinion mining and sentiment induction. The resource consists of two different sets: an Italian dictionary of more than 277,000 words tagged with their prior polarity value, and a set of polarity modifiers, containing more than 200 words, which can be used in combination with non-neutral terms of the dictionary in order to induce the sentiment of Italian compound terms. To the best of our knowledge this is the first manually annotated prior polarity resource which has been developed for the Italian natural language.

Submitted 1 July, 2015; originally announced July 2015.

Comments: 10 pages, Accepted to NLPCS 2015, the 12th International Workshop on Natural Language Processing and Cognitive Science

arXiv:1501.04001 [pdf, other]

Efficient Algorithms for the Order Preserving Pattern Matching Problem

Authors: Simone Faro, Oğuzhan Külekci

Abstract: Given a pattern x of length m and a text y of length n, both over an ordered alphabet, the order-preserving pattern matching problem consists in finding all substrings of the text with the same relative order as the pattern. It is an approximate variant of the well-known exact pattern matching problem which has gained attention in recent years. This interesting problem finds applications in many fields, such as time series analysis (for example, share prices on stock markets), weather data analysis, and musical melody matching. In this paper we present two new filtering approaches which turn out to be much more effective in practice than the previously presented methods. From our experimental results it turns out that our proposed solutions are up to 2 times faster than the previous solutions, reducing the number of false positives by up to 99%.

Submitted 16 January, 2015; originally announced January 2015.

Comments: 16 pages, 3 figures, submitted to SEA 2015 conference
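
To make the notion of "same relative order" in the abstract above concrete, the sketch below checks each text window by comparing rank signatures. It is a naive quadratic baseline written for illustration, not one of the filtering approaches the paper proposes; the function names are hypothetical, and the handling of equal values (equal elements receive equal ranks) is an assumption.

```python
def rank_signature(seq):
    """Replace each value by its rank among the distinct values of seq."""
    order = {v: r for r, v in enumerate(sorted(set(seq)))}
    return tuple(order[v] for v in seq)


def order_preserving_matches(pattern, text):
    """Start positions of windows with the same relative order as pattern."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    target = rank_signature(pattern)
    return [i for i in range(len(text) - m + 1)
            if rank_signature(text[i:i + m]) == target]
```

For instance, the pattern `[1, 3, 2]` (low, high, middle) matches the window `[10, 20, 15]` even though no values are shared.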

arXiv:1209.6449 [pdf, ps, other]

Fast Packed String Matching for Short Patterns

Authors: Simone Faro, M. Oguzhan Külekci

Abstract: Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. In the last two decades a general trend has appeared trying to exploit the power of the word RAM model to speed up the performance of classical string matching algorithms. In this model an algorithm operates on words of length w, grouping blocks of characters, and arithmetic and logic operations on the words take one unit of time. In this paper we use specialized word-size packed string matching instructions, based on the Intel streaming SIMD extensions (SSE) technology, to design very fast string matching algorithms in the case of short patterns. From our experimental results it turns out that, despite their quadratic worst-case time complexity, the newly presented algorithms become the clear winners on average for short patterns, when compared against the most effective algorithms known in the literature.

Submitted 28 September, 2012; originally announced September 2012.

Comments: 15 pages

arXiv:1012.2547 [pdf, ps, other]

The Exact String Matching Problem: a Comprehensive Experimental Evaluation

Authors: Simone Faro, Thierry Lecroq

Abstract: This paper addresses the online exact string matching problem which consists in finding all occurrences of a given pattern p in a text t. It is an extensively studied problem in computer science, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Since 1970 more than 80 string matching algorithms have been proposed, and more than 50% of them in the last ten years. In this note we present a comprehensive list of all string matching algorithms and present experimental results in order to compare them from a practical point of view. From our experimental evaluation it turns out that the performance of the algorithms is quite different for different alphabet sizes and pattern lengths.

Submitted 12 December, 2010; originally announced December 2010.

Comments: 22 pages

arXiv:1012.1338 [pdf, ps, other]

On Tuning the Bad-Character Rule: the Worst-Character Rule

Authors: Domenico Cantone, Simone Faro

Abstract: In this note we present the worst-character rule, an efficient variation of the bad-character heuristic for the exact string matching problem, firstly introduced in the well-known Boyer-Moore algorithm. Our proposed rule selects a position relative to the current shift which yields the largest average advancement, according to the characters distribution in the text. Experimental results show that the worst-character rule achieves very good results especially in the case of long patterns or small alphabets in random texts and in the case of texts in natural languages.

Submitted 6 December, 2010; originally announced December 2010.

Comments: 10 pages

arXiv:1012.0280 [pdf, other]

doi 10.1016/j.ipl.2011.02.015

String Matching with Inversions and Translocations in Linear Average Time (Most of the Time)

Authors: Szymon Grabowski, Simone Faro, Emanuele Giaquinta

Abstract: We present an efficient algorithm for finding all approximate occurrences of a given pattern $p$ of length $m$ in a text $t$ of length $n$ allowing for translocations of equal length adjacent factors and inversions of factors. The algorithm is based on an efficient filtering method and has an $\bigO(nm\max(α, β))$-time complexity in the worst case and $\bigO(\max(α, β))$-space complexity, where $α$ and $β$ are respectively the maximum length of the factors involved in any translocation and inversion. Moreover we show that under the assumptions of equiprobability and independence of characters our algorithm has a $\bigO(n)$ average time complexity, whenever $σ= Ω(\log m / \log\log^{1-ε} m)$, where $ε> 0$ and $σ$ is the dimension of the alphabet. Experiments show that the new proposed algorithm achieves very good results in practical cases.

Submitted 1 December, 2010; originally announced December 2010.

Comments: 9 pages. A slightly shorter version of this manuscript was submitted to Information Processing Letters

arXiv:0810.2390 [pdf, ps, other]

Efficient Pattern Matching on Binary Strings

Authors: Simone Faro, Thierry Lecroq

Abstract: The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in telecom and computer network applications. Moreover the problem finds applications also in the field of image processing and in pattern matching on compressed texts. Recently it has been shown that adaptations of classical exact string matching algorithms are not very efficient on binary data. In this paper we present two efficient algorithms for the problem adapted to completely avoid any reference to bits allowing to process pattern and text byte by byte. Experimental results show that the new algorithms outperform existing solutions in most cases.

Submitted 15 October, 2008; v1 submitted 14 October, 2008; originally announced October 2008.

Comments: 12 pages

ACM Class: F.2.2; H.3.3; E.4

Showing 1–21 of 21 results for author: Faro, S