-
An Algorithm for Reordering Buffer Management Problem and Experimental Evaluations on Discrete Distributions
Authors:
Gözde Filiz,
M. Oğuzhan Külekci
Abstract:
In the reordering buffer management problem, a sequence of requests must be executed by a service station, where a cost occurs for each pair of consecutive requests with different attributes. A reordering buffer management algorithm aims to permute the input sequence using the buffer to minimize the total cost. Reordering buffers have many potential applications in computer science and economics. In this article, we proved the minimum buffer length for the optimal solution to the reordering buffer management problem in the offline setting. With the assumption that color selection is always made when the buffer is full, selecting the most frequent color from the buffer given the smallest buffer size $k$ that satisfies either $o_1 < 2 \cdot \lceil \frac{k}{σ} \rceil$ or $o_2 < \lceil \frac{k}{σ} \rceil$ guarantees the optimal solution, where $o_1$ and $o_2$ represent, respectively, the frequencies of the most and the second most frequent colors in the input sequence $\mathcal{X}$, and $σ$ is the number of distinct colors appearing in $\mathcal{X}$. We proposed a new algorithm for the online setting of the problem that uses the results of the proof made on the minimum buffer length required for the optimal solution. Moreover, we presented the results of the first experimental setup that uses input sequences following discrete distributions to evaluate the performance of algorithms. Out of 432 cases, the new algorithm showed the best performance in 409, which is approximately $95\%$ of all cases.
Submitted 22 May, 2021;
originally announced May 2021.
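The "select the most frequent color when the buffer is full" rule mentioned in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the authors' algorithm: the function name `most_frequent_color_cost` is hypothetical, ties are broken by first arrival, requests of the active color are served immediately as they arrive, and the returned value is the number of color blocks in the output (one more than the number of color changes for a non-empty input).

```python
from collections import Counter, deque


def most_frequent_color_cost(requests, k):
    """Simulate a size-k reordering buffer; return the number of color blocks served.

    Hypothetical helper for illustration; the tie-breaking and cost model are assumptions.
    """
    if k < 1:
        raise ValueError("buffer size k must be at least 1")
    pending = deque(requests)   # requests that have not yet entered the buffer
    buffer = Counter()          # color -> number of buffered requests
    size = 0                    # total number of buffered requests
    blocks = 0                  # number of color blocks in the output
    while pending or size:
        # Fill the buffer until it is full or the input is exhausted.
        while pending and size < k:
            buffer[pending.popleft()] += 1
            size += 1
        # Select the most frequent buffered color and serve all of its requests.
        color, count = buffer.most_common(1)[0]
        del buffer[color]
        size -= count
        blocks += 1
        # Refill: arrivals of the active color are served at once, others are stored.
        while pending and size < k:
            item = pending.popleft()
            if item != color:
                buffer[item] += 1
                size += 1
    return blocks


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, most_frequent_color_cost("abababab", k))
```

Running the sketch for several buffer sizes on the same input gives a feel for how the buffer length affects the number of color changes.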
-
Enumerative Data Compression with Non-Uniquely Decodable Codes
Authors:
M. Oğuzhan Külekci,
Yasin Öztürk,
Elif Altunok,
Can Altıniğne
Abstract:
Non-uniquely decodable codes can be defined as the codes that cannot be uniquely decoded without additional disambiguation information. These are mainly the class of non-prefix-free codes, where a codeword can be a prefix of other(s), and thus, the codeword boundary information is essential for correct decoding. Although the codeword bit stream consumes significantly less space when compared to prefix-free codes, the additional disambiguation information makes it difficult to match the overall performance of prefix-free codes. Previous studies considered compression with non-prefix-free codes by integrating rank/select dictionaries or wavelet trees to mark the codeword boundaries. In this study we focus on another dimension with a block-wise enumeration scheme that improves the compression ratios of the previous studies significantly. Experiments conducted on a known corpus showed that the proposed scheme successfully represents a source within its entropy, even performing better than Huffman and arithmetic coding in some cases. Non-uniquely decodable codes also provide an intrinsic security feature due to the lack of unique decodability. We investigate this dimension as an opportunity to provide compressed data security without (or with less) encryption, and discuss various possible practical advantages supported by such codes.
Submitted 13 November, 2019;
originally announced November 2019.
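The need for codeword boundary information in a non-prefix-free code can be seen on a toy example. The sketch below is an illustration only, not the scheme from the paper: the code table `TOY_CODE` and the helper names are hypothetical, and codewords are assumed to be non-empty bit strings.

```python
# Hypothetical toy code in which "0" is a prefix of "01", so it is not prefix-free.
TOY_CODE = {"a": "0", "b": "1", "c": "01"}


def all_decodings(bits, code):
    """Return every symbol string whose concatenated codewords equal `bits`."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            # Consume one codeword and decode the remainder recursively.
            for rest in all_decodings(bits[len(word):], code):
                results.append(symbol + rest)
    return results


def decode_with_boundaries(bits, lengths, code):
    """Decode `bits` when the length of each codeword is known."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols = []
    position = 0
    for length in lengths:
        symbols.append(inverse[bits[position:position + length]])
        position += length
    return "".join(symbols)


if __name__ == "__main__":
    print(all_decodings("01", TOY_CODE))                # more than one parse
    print(decode_with_boundaries("01", [2], TOY_CODE))  # boundaries remove the ambiguity
```

Without the list of codeword lengths, the bit string admits several parses; with it, exactly one remains. This extra boundary data is the disambiguation information whose cost the abstract discusses.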
-
On Longest Repeat Queries
Authors:
Atalay Mert İleri,
M. Oğuzhan Külekci,
Bojian Xu
Abstract:
Repeat finding in strings has important applications in subfields such as computational biology. Surprisingly, no prior work on repeat finding has considered constraints on the locality of repeats. In this paper, we propose and study the problem of finding the longest repetitive substrings covering particular string positions. We propose an $O(n)$ time and space algorithm for finding the longest repeat covering every position of a string of size $n$. Our work is optimal since the reading and the storage of an input string of size $n$ takes $O(n)$ time and space. Because any substring of a repeat is also a repeat, our solution to longest repeat queries effectively provides a "stabbing" tool for practitioners for finding most of the repeats that cover particular string positions.
Submitted 26 January, 2015;
originally announced January 2015.
-
Shortest Unique Substring Query Revisited
Authors:
Atalay Mert İleri,
M. Oğuzhan Külekci,
Bojian Xu
Abstract:
We revisit the problem of finding a shortest unique substring (SUS), proposed recently by [6]. We propose an optimal $O(n)$ time and space algorithm that can find an SUS for every location of a string of size $n$. Our algorithm significantly improves the $O(n^2)$ time complexity needed by [6]. We also support finding all the SUSes covering every location, whereas the solution in [6] can find only one SUS for every location. Further, our solution is simpler and easier to implement and can also be more space efficient in practice, since we only use the inverse suffix array and longest common prefix array of the string, while the algorithm in [6] uses the suffix tree of the string and other auxiliary data structures. Our theoretical results are validated by an empirical study that shows our algorithm is much faster and more space-saving than the one in [6].
Submitted 10 January, 2014; v1 submitted 10 December, 2013;
originally announced December 2013.
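To make the query itself concrete, the following brute-force reference shows what "a shortest unique substring covering a location" means. It is deliberately naive (far slower than linear time) and is not the suffix-array-based algorithm described in the abstract; the function names and the choice of returning the leftmost shortest answer are assumptions made for illustration.

```python
def count_occurrences(text, sub):
    """Count occurrences of `sub` in `text`, allowing overlaps."""
    return sum(text.startswith(sub, j) for j in range(len(text) - len(sub) + 1))


def naive_sus(text, p):
    """Return (start, length) of a shortest unique substring covering position p.

    Brute-force reference for illustration only; positions are 0-based.
    """
    n = len(text)
    if not 0 <= p < n:
        raise IndexError("position p must lie inside the text")
    for length in range(1, n + 1):
        # Candidate substrings of this length must start in [first, last] to cover p.
        first = max(0, p - length + 1)
        last = min(p, n - length)
        for start in range(first, last + 1):
            if count_occurrences(text, text[start:start + length]) == 1:
                return start, length
    return None  # not reached for non-empty text: the whole string occurs exactly once


if __name__ == "__main__":
    text = "abcab"
    for p in range(len(text)):
        start, length = naive_sus(text, p)
        print(p, text[start:start + length])
```

A reference of this kind is mainly useful for checking a faster implementation on small inputs.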
-
Enumeration of sequences with large alphabets
Authors:
M. Oguzhan Kulekci
Abstract:
This study focuses on efficient schemes for enumerative coding of $σ$-ary sequences by mainly borrowing ideas from Öktem & Astola's \cite{Oktem99} hierarchical enumerative coding and Schalkwijk's \cite{Schalkwijk72} asymptotically optimal combinatorial code on binary sequences. By observing that the number of distinct $σ$-dimensional vectors having an inner sum of $n$, where the values in each dimension are in the range $[0 \ldots n]$, is $K(σ,n) = \sum_{i=0}^{σ-1} {{n-1} \choose {σ-1-i}} {σ \choose {i}}$, we propose representing $C$ vector via enumeration, and present necessary algorithms to perform this task. We prove $\log K(σ,n)$ requires approximately $(σ-1) \log (σ-1)$ fewer bits than the naive $(σ-1)\lceil \log (n+1) \rceil$ representation for relatively large $n$, and examine the results for varying alphabet sizes experimentally. We extend the basic scheme for the enumerative coding of $σ$-ary sequences by introducing a new method for large alphabets. We experimentally show that the newly introduced technique is superior to the basic scheme by providing experiments on DNA sequences.
Submitted 13 November, 2012;
originally announced November 2012.
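The counting formula $K(σ,n)$ quoted in the abstract can be checked numerically on small inputs. The sketch below is an illustration under stated assumptions, not code from the paper: the function names are hypothetical, $n \geq 1$ is assumed, and the comparison against the stars-and-bars count $\binom{n+σ-1}{σ-1}$ rests on Vandermonde's identity rather than on anything stated in the source.

```python
from itertools import product
from math import comb


def K(sigma, n):
    """Evaluate the sum from the abstract; assumes sigma >= 1 and n >= 1."""
    return sum(comb(n - 1, sigma - 1 - i) * comb(sigma, i) for i in range(sigma))


def brute_force_count(sigma, n):
    """Count sigma-dimensional vectors over 0..n whose entries sum to n."""
    return sum(1 for v in product(range(n + 1), repeat=sigma) if sum(v) == n)


def bit_costs(sigma, n):
    """Return (enumerative bits, naive bits) for one vector, both rounded up."""
    enumerative = (K(sigma, n) - 1).bit_length()   # ceil(log2 K)
    naive = (sigma - 1) * n.bit_length()           # (sigma - 1) * ceil(log2(n + 1))
    return enumerative, naive


if __name__ == "__main__":
    for sigma, n in [(2, 5), (3, 4), (4, 10)]:
        print(sigma, n, K(sigma, n), comb(n + sigma - 1, sigma - 1), bit_costs(sigma, n))
```

The brute-force counter is exponential in $σ$ and is only meant for cross-checking the closed form on tiny cases.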
-
A memory versus compression ratio trade-off in PPM via compressed context modeling
Authors:
M. Oguzhan Kulekci
Abstract:
Since its introduction, prediction by partial matching (PPM) has been a de facto gold standard in lossless text compression, and many variants improving the compression ratio and speed have been proposed. However, reducing the high space requirement of PPM schemes has received little attention. This study focuses on reducing the memory consumption of PPM via the recently proposed compressed context modeling (CCM), which uses the compressed representations of contexts in the statistical model. Unlike the classical definition of context as the string of preceding characters at a particular position, CCM considers context as the amount of preceding information, that is, the bit stream produced by compressing the previous symbols. We observe that by using CCM, the data structures, particularly the context trees, can be implemented in smaller space, and present a trade-off between the compression ratio and the space requirement. The experiments conducted showed that this trade-off is especially beneficial in low orders, with a gain of approximately 20-25 percent in memory at the cost of a loss of up to nearly 7 percent in compression ratio.
Submitted 12 November, 2012;
originally announced November 2012.
-
Fast Packed String Matching for Short Patterns
Authors:
Simone Faro,
M. Oguzhan Külekci
Abstract:
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. In the last two decades a general trend has appeared of trying to exploit the power of the word RAM model to speed up classical string matching algorithms. In this model an algorithm operates on words of length w, grouping blocks of characters, and arithmetic and logic operations on the words take one unit of time. In this paper we use specialized word-size packed string matching instructions, based on the Intel streaming SIMD extensions (SSE) technology, to design very fast string matching algorithms in the case of short patterns. Our experimental results show that, despite their quadratic worst-case time complexity, the newly presented algorithms are the clear winners on average for short patterns when compared against the most effective algorithms known in the literature.
Submitted 28 September, 2012;
originally announced September 2012.