Blitzcrank: Fast Semantic Compression for In-memory Online Transaction Processing

Yiming Qiao Tsinghua University [email protected] Yihan Gao [email protected]  and  Huanchen Zhang Tsinghua University [email protected]
Abstract.

We present Blitzcrank, a high-speed semantic compressor designed for OLTP databases. Previous solutions are inadequate for compressing row-stores: they suffer from either a low compression factor due to a coarse compression granularity or suboptimal performance due to the inefficiency in handling dynamic data sets. To solve these problems, we first propose novel semantic models that support fast inferences and dynamic value sets for both discrete and continuous data types. We then introduce a new entropy encoding algorithm, called delayed coding, that achieves significant improvement in the decoding speed compared to modern arithmetic coding implementations. We evaluate Blitzcrank in both standalone microbenchmarks and a multicore in-memory row-store using the TPC-C benchmark. Our results show that Blitzcrank achieves a sub-microsecond latency for decompressing a random tuple while obtaining high compression factors. This leads to an 85% memory reduction in the TPC-C evaluation with a moderate (19%) throughput degradation. For data sets larger than the available physical memory, Blitzcrank helps the database sustain a high throughput for more transactions before the I/O overhead dominates.

PVLDB Reference Format:
PVLDB, 17(10): 2528 - 2540, 2024.
doi:10.14778/3675034.3675044 This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 10 ISSN 2150-8097.

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/YimingQiao/Blitzcrank.

1. Introduction

In-memory database management systems (DBMSs) offer low latency and high throughput for online transaction processing (OLTP) workloads when the working set fits in memory (van Renen et al., 2018; Sudhir et al., 2021; Tu et al., 2013). For data sets beyond physical memory, their performance degrades quickly because of expensive random I/Os to fetch tuples. Although DRAM price has been decreasing, memory is still a limiting resource because of the increasing price gap between DRAM and SSDs (Ziegler et al., 2022; Haas et al., 2020).

Applying compression can increase the capacity of in-memory DBMSs with the same hardware cost, thus reducing or even eliminating disk accesses to improve performance (Zhang et al., 2016, 2020). However, most existing compression techniques are designed for column-stores: they target read-mostly workloads with large batched processing (Kuschewski et al., 2023; Abadi et al., 2013, 2006a; Boncz et al., 2020; Damme et al., 2017; Foufoulas et al., 2021; Polychroniou and Ross, 2015; Barbarioli et al., 2023; Lasch et al., 2020). To compress in-memory row-stores efficiently, the compression schemes must satisfy additional requirements. First, random access to tuples must be fast (e.g., sub-microsecond) because OLTP applications demand low query latencies. For example, Amazon found that a 100 ms increase in latency would lead to a 1% drop in sales (Heger et al., 2017; Lersch et al., 2020). Second, the compression algorithms must handle newly inserted/updated tuples efficiently because OLTP workloads are typically write-heavy (Sinha, 2021). A frequent reconstruction of the compression model is usually unacceptable because it brings too much performance overhead (Pöss and Potapov, 2003; Raman and Swart, 2006). Unfortunately, existing compression schemes are inadequate when serving OLTP workloads: they suffer from either low compression factor (i.e. the original size divided by the compressed size) due to a coarse compression granularity or suboptimal performance due to the inefficiency in handling dynamic data sets.

Figure 1. DB Size vs. Latency - Blitzcrank makes the size-latency trade-offs more attractive compared to other tools in TPC-C.

Coarse Retrieval Granularity. Modern general-purpose block compression algorithms such as Zstandard have high decompression throughput (up to 500 MB/s). They are widely used in operating systems, databases, file systems, and computer networks (zst, 2023; Yang et al., 2018; Li et al., 2021). However, to access a single tuple, they must decompress the entire compression block (Ziv and Lempel, 1977), causing high random-access latency. One solution is to compress each tuple individually, but the compression factor suffers because the algorithms prefer a longer context to create an effective dictionary in the sliding window (Ziv and Lempel, 1977, 1978).

Inefficient Handling of New Tuples. The classic Raman’s approach (Raman and Swart, 2006) concatenates Huffman-coded values into variable-length tuples and then reorders the rows and columns so that it achieves a better compression factor using delta encoding. Although this solution compresses row-oriented data well, it cannot compress unseen values unless it initiates an expensive model reconstruction because such a solution relies on static dictionaries.

The above approaches are considered “syntactic” because they treat the uncompressed data simply as consecutive bytes (MacKay, 2003). Semantic compression, on the other hand, leverages the high-level semantics in a relational table, such as value distributions and functional dependencies between columns to achieve better compression (Raman and Swart, 2006). Unlike the above syntactic methods that rely on static dictionaries, the semantic approach can use the same probability models to compress new tuples effectively as long as the attribute values follow the modeled distributions (Babu et al., 2001). Existing semantic compression methods, however, provide limited support for different data types, and their model inferences are slow. For example, Squish and the more recent DeepSqueeze take 324 and 127 seconds, respectively, to compress a 75 MB relational table (Ilkhechi et al., 2020; Gao and Parameswaran, 2016).

In this paper, we show that semantic compression can be fast, and it has potential beyond large archive compression. We present Blitzcrank, a high-speed semantic compressor designed for OLTP databases. Blitzcrank improves compression through both data modeling and data encoding (Cleary and Witten, 1984).

  • For data modeling, Blitzcrank introduces novel semantic models that allow fast encoding/decoding for both discrete- and continuous-value columns. It takes Blitzcrank less than one second to compress the aforementioned 75 MB table.

  • For data encoding, we propose delayed coding, a novel fine-grained encoding algorithm that offers near-entropy compression as with Arithmetic Coding (Langdon, 1984) while achieving a faster decompression speed compared to the modern asymmetric number system (ANS) (Duda, 2021).

We first compared Blitzcrank against state-of-the-art compressors that apply to row-stores, including Zstandard (Collet and Kucherawy, 2021) and Raman’s approach (Raman and Swart, 2006), in standalone microbenchmarks based on real data sets. Blitzcrank achieves a sub-microsecond latency (fastest among the baselines) for decompressing a random tuple while obtaining high compression factors. We then integrated Blitzcrank, along with baseline compressors, into the in-memory OLTP database, Silo (Tu et al., 2013) and evaluated it using the TPC-C benchmark (Council, 2007). The results are summarized in Figure 1. Compared to uncompressed tables, Blitzcrank reduces memory usage by 85% with a moderate (19%) throughput degradation. Compared to using Zstandard, Blitzcrank achieves a $2.4\times$ higher compression factor and is 76% faster. When the data set exceeds the physical memory limit, Blitzcrank greatly helps the database sustain high throughput and execute $4\times$ more transactions within the same amount of time.

The paper makes the following contributions. First, we identify the inefficiency of existing compression algorithms for OLTP databases from both data modeling and data encoding perspectives. Second, we introduce novel semantic models designed for fast inferences for discrete and continuous data types. Third, we propose the new delayed coding that is significantly faster than variants of arithmetic coding while achieving near-entropy compression. Finally, we build Blitzcrank based on the above technologies and show that semantic compression can be fast enough to make trade-offs between performance and space much more attractive than previous solutions when integrated into an in-memory OLTP database, such as Silo (Tu et al., 2013).

2. Preliminaries

This section provides the necessary background information to understand the design of Blitzcrank. Section 2.1 describes the classic arithmetic coding, which is the basis of our proposed delayed coding. Section 2.2 introduces the existing structure learning techniques adopted in Blitzcrank to leverage functional dependencies between columns for compression.

2.1. Arithmetic Coding

Figure 2. An Example of Arithmetic Coding - Arithmetic coding maps each possible string to disjoint probability intervals.
Figure 3. An Example of Column Correlation - Probabilities of column “gender” depend on the “name” column value.
Figure 4. Blitzcrank - Semantic Learner (SL), Attribute Encoder (AE), and Tuple Encoder (TE) are three components of Blitzcrank.

Arithmetic coding is one of the most widely used entropy codings for lossless compression (Langdon, 1984; Cleary and Witten, 1984). Unlike Huffman coding (Moffat, 2019) that encodes symbols individually, arithmetic coding compresses the entire message into a single fraction $0 \leq q < 1$ with arbitrary precision. Compared to Huffman coding, arithmetic coding can achieve a higher compression factor. Arithmetic coding represents the current information as an interval, defined by two numbers (initially $[0, 1)$). Each encoding step in arithmetic coding divides the current interval into smaller sub-intervals according to the probability distribution of the alphabet and selects the one that represents the next symbol to be encoded. For example, as shown in Figure 2, the probability distribution of alphabet $\{a, b, c\}$ is 0.2, 0.5, and 0.3, respectively. To encode a message “$bab$”, we divide the initial interval $[0, 1)$ into three sub-intervals $[0, 0.2)$, $[0.2, 0.7)$, and $[0.7, 1)$ and select $[0.2, 0.7)$ to represent “$b$”. To encode the next symbol “$a$”, we further divide $[0.2, 0.7)$ into $[0.2, 0.3)$, $[0.3, 0.55)$, and $[0.55, 0.7)$ based on the symbol probabilities and update the current interval to $[0.2, 0.3)$, which now represents “$ba$”. This process continues until we reach the end of the message and obtain the final interval $[0.22, 0.27)$. We then select a fraction $q$ within the final interval that has the shortest binary representation (e.g., $q = (.001111)_2$) as the message’s code.
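The interval narrowing in this example can be reproduced with a few lines of Python. The following is a minimal sketch rather than a production coder: the function name `narrow_interval` is our own, exact fractions stand in for the arbitrary-precision arithmetic, and the final step of choosing the fraction $q$ is left out.

```python
from fractions import Fraction

def narrow_interval(message, probs):
    """Return the final [low, high) interval that arithmetic coding assigns to message.

    probs maps each symbol to its probability, listed in alphabet order.
    """
    # Lower bound of each symbol's sub-interval on [0, 1).
    cumulative, starts = Fraction(0), {}
    for symbol, p in probs.items():
        starts[symbol] = cumulative
        cumulative += p

    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        # Shrink the current interval to the sub-interval of this symbol.
        low += width * starts[symbol]
        width *= probs[symbol]
    return low, low + width

probs = {"a": Fraction(2, 10), "b": Fraction(5, 10), "c": Fraction(3, 10)}
low, high = narrow_interval("bab", probs)
print(float(low), float(high))  # 0.22 0.27
```

Each symbol multiplies the interval width by its probability, so a message of probability $P$ ends in an interval of width $P$ and needs roughly $-\log_2 P$ bits.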

2.2. Structure Learning

Structure learning refers to the process of identifying correlations between columns to achieve better compression (Gao and Parameswaran, 2016; Ilkhechi et al., 2020; Babu et al., 2001; Davies and Moore, 1999). For example, the gender column is often highly correlated with the name column in a relation. As depicted in Figure 3, $80\%$ of Taylors are female, while $50\%$ of Alexes are male. Instead of using static probabilities (e.g., $50\%$ male and $50\%$ female) for the gender column, we model its distribution using probabilities conditioned on the name column: $P_{\text{gender}}(\text{Female} \mid \text{Name} = \text{Taylor}) = 0.8$, $P_{\text{gender}}(\text{Female} \mid \text{Name} = \text{Alex}) = 0.5$. Then, the more common tuple (Female, Taylor) is mapped to a larger interval $[0, 0.56)$ with a short binary code $(.0)_2$, thus achieving better compression.
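The benefit of conditioning can be quantified as an expected code length. The sketch below compares the static 50/50 model with the conditional model from the example. The name distribution (70% Taylor, 30% Alex) is an assumption on our part, chosen so that the tuple (Female, Taylor) has probability $0.7 \cdot 0.8 = 0.56$; the helper name `entropy_bits` is also ours.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Assumed name distribution (not stated in the text).
p_name = {"Taylor": 0.7, "Alex": 0.3}
# [P(Female | name), P(Male | name)] from the example.
p_gender_given_name = {"Taylor": [0.8, 0.2], "Alex": [0.5, 0.5]}

# Coding gender with a fixed 50/50 model costs one bit per value.
static_bits = entropy_bits([0.5, 0.5])

# Coding gender with the model conditioned on the name column.
conditional_bits = sum(
    p_name[name] * entropy_bits(p_gender_given_name[name]) for name in p_name
)

print(static_bits, round(conditional_bits, 3))
```

Under these assumptions the conditional model needs about 0.81 bits per gender value instead of 1 bit.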

We use a Bayesian network (BN) to learn the best column ordering $S$ (e.g., {name, gender}) for compression. The output also includes a model set $M$, where each model is a probability distribution $P_x$ for column $x$ conditioned on the values of all the columns preceding $x$ in $S$. Determining the optimal ordering $S$ is an NP-hard problem (Koller and Friedman, 2009). We, therefore, use a greedy algorithm (Gao and Parameswaran, 2016) that selects the column that produces the smallest compressed size conditioned on the existing columns in $S$ for each iteration.

3. Overview

The goal of Blitzcrank is to reduce the memory footprint of an in-memory OLTP database while imposing as little performance overhead as possible. To achieve this, Blitzcrank must be able (1) to handle newly inserted tuples with unseen values efficiently and (2) to deliver low latency and a high compression factor for individual tuples. For (1), we build semantic models that describe the values’ (conditional) probability distributions instead of using static value dictionaries (Section 4). For (2), we propose delayed coding that offers fast and fine-grained encoding/decoding with near-entropy compression (Section 5). Blitzcrank is optimized for single-tuple retrieval. A larger compression granularity may improve the overall compression factor, but it introduces decompression overhead for point accesses common in OLTP workloads.

Blitzcrank sits above the table storage to compress and decompress tuples while remaining transparent to the execution engine of an OLTP database. When the execution engine inserts a tuple into a relation, Blitzcrank compresses that tuple before sending it to the table storage. Upon receiving a tuple-fetching request, Blitzcrank retrieves the compressed tuple from storage and decompresses it. The execution engine then consumes the tuple and executes the query without being aware of Blitzcrank.

As shown in Figure 4, Blitzcrank consists of three components: Semantic Learner (SL), Attribute Encoder (AE), and Tuple Encoder (TE). Specifically, the SL determines the compression ordering for the columns using structure learning and generates conditional probability models for the AE. When the tables are small, Blitzcrank leaves them uncompressed. SL is triggered when the size of a table reaches a predefined threshold (default: $2^{16}$ rows). For compression, the AE takes in a tuple and translates the value of each attribute into an interval in $[0, 1)$ according to the models from SL. The AE then sends the sequence of intervals to the Tuple Encoder, which uses delayed coding to produce a compressed record with a near-optimal size. For decompression, the tuple is first decoded into 16-bit codes at TE. The AE then invokes the Inv-Translate (which refers to the probability models) to recover each tuple value.

Semantic Learner approximates the optimal column compression ordering $S$ and a set of models $M$ through a greedy algorithm of structure learning, as described in Section 2.2. To speed up the structure learning on large tables, we perform the algorithm on a set of randomly selected tuples from the table. Once the column ordering $S$ is obtained, the SL further scans the full table to generate accurate conditional probability models $P_x \in M$. $P_x$ is implemented as an unordered map from each value combination of the preceding attributes to a probability distribution $p_i$ of attribute $x$. We refer to $p_i$ as a semantic model. Semantic models can compress values unseen before because they estimate the value distribution rather than statically mapping values to codes in a dictionary. We introduce two fundamental types of semantic models optimized for decompression speed in Section 4.

Attribute Encoder converts each value into an interval and vice versa according to the semantic models. Translate maps an attribute value $v$ to an interval $[l, r)$, $0 \leq l < r \leq 1$ (symbol-to-interval), while Inv-Translate takes in a code $s \in [l, r)$ and recovers the attribute value $v$ (code-to-symbol). Inv-Translate is critical to the decompression performance. Classic arithmetic coding performs a binary search to determine the matching interval $[l, r)$ for a code $s$ with a time complexity of $O(\log N)$, where $N$ is the number of unique values in a column (MacKay, 2003). We optimize this procedure to constant time in Blitzcrank, as detailed in Section 4.1.

Tuple Encoder receives a sequence of intervals representing each value within a tuple and compresses them into a block of 16-bit integers using delayed coding. At a high level, delayed coding uses a 16-bit unsigned integer $s_{\text{int}}$ to encode each interval $[l, r)$ such that $s_{\text{int}} / 2^{16} \in [l, r)$. These integers are selected judiciously so that some integer codes are stored implicitly using the redundant information of the other integer codes. In this way, delayed coding not only supports fast decoding (because of the fixed-length codes) but also achieves near-entropy compression. We introduce delayed coding in detail in Section 5.
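The containment condition $s_{\text{int}} / 2^{16} \in [l, r)$ can be illustrated as follows. This sketch only enumerates the 16-bit integers that are valid codes for a given interval; how delayed coding chooses among them to store other codes implicitly is the subject of Section 5 and is not shown. The function name `candidate_codes` is hypothetical.

```python
import math
from fractions import Fraction

SCALE = 1 << 16  # 16-bit codes cover [0, 1) in steps of 2^-16

def candidate_codes(l, r):
    """Return the range of integers s with l <= s / 2^16 < r."""
    first = math.ceil(l * SCALE)
    stop = math.ceil(r * SCALE)  # exclusive upper bound
    return range(first, min(stop, SCALE))

codes = candidate_codes(Fraction(1, 4), Fraction(7, 8))
print(codes[0], len(codes))  # 16384 40960
```

A wide interval admits many valid integers, which is the redundancy that the text refers to, while an interval narrower than $2^{-16}$ may contain none.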

4. Semantic Models

A semantic model maps a value to an interval based on the estimated (conditional) probability distribution of the values within a column. In this section, we first introduce two fundamental models for discrete/categorical columns and continuous/numeric columns, respectively. We then show in Section 4.3 how to construct models for other data types (e.g., string) using the fundamental models.

4.1. Discrete/Categorical Model

As in classic entropy encodings, we construct the semantic model for a discrete/categorical column by counting the frequency of each symbol (i.e., value) and computing their cumulative distribution function (CDF). Each symbol is then mapped to its corresponding probability interval on the CDF. For example, the semantic model for column $\{a, b, b, a, c, b, b, b\}$ is $\{a \leftrightarrow [0, 0.25), b \leftrightarrow [0.25, 0.875), c \leftrightarrow [0.875, 1)\}$. Inv-Translating a code (e.g., $(.01)_2$) back to its symbol (e.g., “$b$”) requires a binary search to find the interval that contains the code. The logarithmic complexity slows down the decompression, especially when the number of distinct values of a categorical column is large. We, therefore, propose a constant-time algorithm for the Inv-Translate function, inspired by the alias method (Kronmal and Peterson Jr, 1979).
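The CDF-based model and its binary-search lookup can be sketched as follows. This is the baseline that the constant-time algorithm replaces, not Blitzcrank's implementation; the function names are ours, and symbols are laid out in sorted order as an assumption.

```python
from bisect import bisect_right
from collections import Counter
from fractions import Fraction

def build_categorical_model(column):
    """Return (symbols, bounds); symbol k owns the interval [bounds[k], bounds[k + 1])."""
    counts = Counter(column)
    total = len(column)
    symbols, bounds, low = [], [], Fraction(0)
    for symbol in sorted(counts):
        symbols.append(symbol)
        bounds.append(low)
        low += Fraction(counts[symbol], total)
    bounds.append(low)  # final upper bound, equal to 1
    return symbols, bounds

def translate(symbols, bounds, symbol):
    """Symbol-to-interval (linear scan kept for brevity)."""
    k = symbols.index(symbol)
    return bounds[k], bounds[k + 1]

def inv_translate_binary_search(symbols, bounds, s):
    """Code-to-symbol for s in [0, 1) with an O(log N) binary search."""
    return symbols[bisect_right(bounds, s) - 1]

symbols, bounds = build_categorical_model("abbacbbb")
print(translate(symbols, bounds, "b"))                    # (Fraction(1, 4), Fraction(7, 8))
print(inv_translate_binary_search(symbols, bounds, 0.25))  # b
```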

Figure 5. Interval Allocation By Pairing Symbols - There are three interval pairs $\{Y^{(1)}, Y^{(2)}, Y^{(3)}\}$. In each pair, two symbols $\{(\alpha_N, \beta_N)\}$ and the symbol boundary $\{w_N\}$ are saved.

Constant-Time Inv-Translate. Let $\pi_1, \pi_2, \cdots, \pi_N$ be the interval length (i.e., probability) of each of the $N$ symbols. Given a code $0 \leq s < 1$, if $\pi_1 = \pi_2 = \cdots = \pi_N$, then $s$ belongs to the $\lfloor s \cdot N \rfloor$th interval. Therefore, the key to achieving constant-time Inv-Translate is to “create” a uniform distribution. The core idea of the algorithm is to pair the intervals so that the combined probability of each pair forms a uniform distribution.

Suppose we want to create $N$ pairs, each having a combined interval length of $1/N$. At each iteration of the algorithm, we pair the shortest interval with the longest one in the remaining intervals. When their combined interval length is greater than $1/N$, we split the longer interval in two and put the excess part back into the interval collection for further pairing. The algorithm terminates when there is at most one interval left in the collection. Figure 5 shows an example where the interval for symbol “$b$” is divided and mapped to three pairs. The following theorem proves the validity of this algorithm for any discrete probability distribution.

Theorem 1.

Every probability vector $\pi_1, \cdots, \pi_N$ can be expressed as an equiprobable mixture of $N$ two-point distributions. That is, there are $N$ pairs of integers $(\alpha_1, \beta_1), \cdots, (\alpha_N, \beta_N)$ and probabilities $w_1, \cdots, w_N$ such that

$$\pi_i = 1/N \cdot \sum_{j=1}^{N} \left( w_j \mathds{1}_{\{\alpha_j = i\}} + (1 - w_j) \mathds{1}_{\{\beta_j = i\}} \right) = 1/N \cdot \sum_{j=1}^{N} Y^{(j)}_i$$

for $1 \leq i \leq N$, where $Y^{(1)}, \cdots, Y^{(N)}$ are two-point distributions.

Proof.

See Appendix B. ∎
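The pairing procedure behind Theorem 1 can be sketched in Python. This is an illustrative version under stated assumptions: all probabilities are strictly positive and sum to one, exact fractions avoid rounding issues, and the shortest and longest intervals are found by a linear scan rather than with a priority structure. The names `build_pairs` and `mixture_probability` are ours.

```python
from fractions import Fraction

def build_pairs(probs):
    """Split a distribution into N pairs (alpha, beta, w), each of total mass 1/N.

    w is the share of the pair owned by alpha; beta owns the remaining 1 - w.
    Assumes strictly positive probabilities that sum to 1.
    """
    target = Fraction(1, len(probs))
    remaining = {symbol: Fraction(p) for symbol, p in probs.items()}
    pairs = []
    while len(remaining) > 1:
        shortest = min(remaining, key=remaining.get)
        short_mass = remaining.pop(shortest)
        longest = max(remaining, key=remaining.get)
        # The longest interval fills the rest of the pair; its excess stays in the pool.
        remaining[longest] -= target - short_mass
        pairs.append((shortest, longest, short_mass / target))
    # The last interval has mass exactly 1/N and fills a pair on its own.
    for symbol in remaining:
        pairs.append((symbol, symbol, Fraction(1)))
    return pairs

def mixture_probability(pairs, symbol):
    """Evaluate the right-hand side of Theorem 1 for one symbol."""
    total = sum(w * (a == symbol) + (1 - w) * (b == symbol) for a, b, w in pairs)
    return total / len(pairs)

probs = {"a": Fraction(1, 4), "b": Fraction(5, 8), "c": Fraction(1, 8)}
pairs = build_pairs(probs)
for alpha, beta, w in pairs:
    print(alpha, beta, w)
```

For the column from Section 4.1 this yields the pairs (c, b, 3/8), (a, b, 3/4), and (b, b, 1), so the interval of “b” is spread over three pairs, and `mixture_probability` recovers the original probabilities.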

The constant-time Inv-Translate function is presented in Algorithm 1. Given the binary mixtures $\{Y^{(j)}\}$ with parameters $\{(\alpha_N, \beta_N)\}$ and $\{w_N\}$ defined in Theorem 1, Inv-Translate first computes the index (i.e., $j = \lfloor s \cdot N \rfloor + 1$) of the binary mixture $Y^{(j)}$ that contains the input code $s$. Then the function determines the code’s position within $Y^{(j)}$ and returns the corresponding symbol.

Given $\{(\alpha_N, \beta_N)\}$ and $\{w_N\}$ defined in Theorem 1.
Function Inv-Translate($s$):
       $c = s \cdot N$                       /* $s \in [0, 1)$ */
       $j = \lfloor c \rfloor + 1$               /* Determine the index of $Y$ */
       $q = c - \lfloor c \rfloor$               /* Get the position in $Y^{(j)}$ */
       return ($q < w_j$) ? $\alpha_j$ : $\beta_j$
Algorithm 1 Inv-Translate
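Algorithm 1 translates almost line for line into Python. In the sketch below, the tables `ALPHA`, `BETA`, and `W` are hand-derived by us for the distribution $\{a: 0.25, b: 0.625, c: 0.125\}$ from Section 4.1 and are purely illustrative; indices are 0-based, so `j` here corresponds to $j - 1$ in the algorithm.

```python
import math

# Illustrative pair tables for {a: 0.25, b: 0.625, c: 0.125} with N = 3.
ALPHA = ["c", "a", "b"]   # first symbol of each pair
BETA = ["b", "b", "b"]    # second symbol of each pair
W = [0.375, 0.75, 1.0]    # share of each pair owned by its first symbol

def inv_translate(s):
    """Constant-time code-to-symbol lookup for a code s in [0, 1)."""
    c = s * len(W)
    j = math.floor(c)  # 0-based index of the pair that contains s
    q = c - j          # position of the code inside that pair
    return ALPHA[j] if q < W[j] else BETA[j]

print(inv_translate(0.0), inv_translate(0.2), inv_translate(0.5), inv_translate(0.9))
# c b a b
```

The lookup needs one multiplication, one floor, and one comparison regardless of $N$, at the cost of a symbol no longer owning a single contiguous interval.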

4.2. Continuous/Numeric Model

Previous semantic compression algorithms use a bisection method (Witten et al., 1987; Gao and Parameswaran, 2016) to compress continuous values such as floating-point numbers. This approach, however, generates many low-entropy intervals, thus affecting the efficiency of the subsequent Tuple Encoding. Blitzcrank, therefore, introduces a novel two-level quantization model for continuous-value columns. This model not only supports arbitrary precision to guarantee lossless compression but also leverages the distribution skew in the column for better compression.

Two-Level Quantization Model. The first-level quantization is based on an equi-width histogram of the values in a column. The goal at this level is to maximize compression by assigning larger intervals (i.e., shorter codes) to more frequent value ranges. Specifically, we divide the values into a predefined $T$ (e.g., $T = 512$) disjoint value ranges, each having a bucket width of $w = (\hat{v}_{\text{max}} - \hat{v}_{\text{min}}) / T$, where $\hat{v}_{\text{max}}$/$\hat{v}_{\text{min}}$ is the estimated (or obtained directly from table statistics) maximum/minimum value of the column. We then obtain the frequency of each bucket by scanning the column once, and we assign each bucket $i$ an interval $[l_i, r_i)$ proportional to its frequency, similar to the categorical model.

To guarantee a lossless compression (i.e., to distinguish between the values within a bucket), we apply a second-level quantization where we divide the value range of the bucket equally into $G$ segments so that the width of each segment is smaller than or equal to the column’s required precision $p$. We set $p = 10^{-7}$ and $p = 10^{-17}$ for the float and double types, respectively. Besides the bucket’s interval $[l_i, r_i)$, each segment $j$ in the bucket is assigned another interval $[l_{\epsilon_j}, r_{\epsilon_j})$ with an equal length of $1/G$. If the user specifies a precision requirement for a float/double column (e.g., 2 decimal places), we enable lossy compression by adjusting $p$ accordingly (e.g., $p = 10^{-2}$) to achieve better compression.

Given a value $\hat{v}_{\text{min}} \leq v < \hat{v}_{\text{max}}$, the Translate function computes its first-level bucket index $i$ and its second-level offset $j$ according to the value-range division because $v$ can be uniquely decomposed as $i \cdot w + j \cdot p \leq v - \hat{v}_{\text{min}} < i \cdot w + (j+1) \cdot p$, where $j \cdot p < w$. The value $v$ is then converted into two intervals: $[l_i, r_i)$ and $[l_{\epsilon_j}, r_{\epsilon_j})$. In cases where $v$ is an outlier (i.e., $v < \hat{v}_{\text{min}}$ or $v \geq \hat{v}_{\text{max}}$), the algorithm falls back to the slow traditional bisection method. Such a fallback has negligible impact on performance because outliers are usually rare. To recover a value (more precisely, a value range no wider than the column precision $p$), we invoke the Inv-Translate function twice, on $[l_i, r_i)$ and $[l_{\epsilon_j}, r_{\epsilon_j})$, respectively.
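The two-level decomposition can be illustrated with a minimal sketch. It assumes equal-width buckets of width `w` that are an exact multiple of the precision `p`, returns plain indices instead of probability intervals, and omits outlier handling; the function names and the sample parameters are illustrative and not taken from the Blitzcrank code base. Exact rationals are used so that the floor operations are not disturbed by floating-point rounding.

```python
from fractions import Fraction


def translate(v, v_min, w, p):
    """Split v into a bucket index i and a segment index j such that
    i*w + j*p <= v - v_min < i*w + (j+1)*p."""
    offset = v - v_min
    i = offset // w                # first-level bucket index
    j = (offset - i * w) // p      # second-level segment index inside bucket i
    return int(i), int(j)


def inv_translate(i, j, v_min, w, p):
    """Recover the lower end of the segment, i.e. v up to precision p."""
    return v_min + i * w + j * p


# Hypothetical column: values start at 0, buckets of width 10, precision 0.01.
v_min, w, p = Fraction(0), Fraction(10), Fraction(1, 100)
i, j = translate(Fraction("23.457"), v_min, w, p)
recovered = inv_translate(i, j, v_min, w, p)
```

With these parameters the value 23.457 is recovered as 23.45, which is the lossy behaviour described for a two-decimal precision requirement.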

4.3. Composite Models

Figure 6. String Model - The URL sample is from the DBLP.

Our foundational models can be combined to compress complex attributes. We implement a string model as an example, as shown in Figure 6. It includes a prefix dictionary and a global dictionary. For words not covered by either dictionary, we use a Markov model to encode them letter-by-letter (Matthew, 2005).

Efficient Blitzcrank Integration with Intervals. The prefix dictionary compresses strings with similar prefixes and keeps a queue of the latest $K$ (e.g., $K = 4$) strings. Each string is analyzed to find the index $i$ (ranging from 0 to $K - 1$) of a previous string in the queue that shares the longest common prefix, and the count $h$ (an unbounded integer) of identical characters. We use a categorical model and a numeric model to estimate the distributions of $i$ and $h$, respectively. Given a new string, the two models output intervals representing $i$ and $h$. This interval-based representation integrates smoothly with the Blitzcrank framework.
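The search for $(i, h)$ can be sketched as follows. This is an illustration only: the helper names are hypothetical, the URLs are made-up strings in the style of Figure 6 rather than actual DBLP records, ties are resolved in favour of the oldest queue entry, and the step that turns $i$ and $h$ into probability intervals is left out.

```python
from collections import deque


def longest_common_prefix(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match(queue, s):
    """Return (i, h): the index of the queued string that shares the longest
    prefix with s, and the length h of that prefix. Returns (0, 0) when the
    queue is empty or nothing matches."""
    best_i, best_h = 0, 0
    for i, prev in enumerate(queue):
        h = longest_common_prefix(prev, s)
        if h > best_h:
            best_i, best_h = i, h
    return best_i, best_h


K = 4
queue = deque(maxlen=K)  # keeps only the latest K strings
for url in ["db/conf/vldb/a.html", "db/journals/tods/b.html", "db/conf/sigmod/c.html"]:
    queue.append(url)

i, h = prefix_match(queue, "db/conf/sigmod/d.html")
```

Here the new string shares the 15-character prefix `db/conf/sigmod/` with the third queue entry, so only the pair $(2, 15)$ and the remaining suffix need to be encoded.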

Adaptive Base Models for Enhanced Compression. Following the prefix dictionary, the remaining substring is split into words using delimiters like spaces and commas. This process involves two models: (1) using a numerical model to count the words in the given substring, and (2) using a categorical model to identify each delimiter. Our model can leverage the skewed distribution of word count and delimiter usage patterns for compression. For example, most sentences have 3-10 words, and the space character is the most frequent delimiter. The final technique is the global dictionary, implemented as a categorical model. It stores words that frequently occur in sentences and is used for dictionary encoding.

Besides the string model, we also design two models: one for encoding JSON collections and another for time-series column encoding, with the latter utilizing the Autoregressive Moving Average (ARMA) (Box et al., 2015). When applied to the data set Jena Climate (Mnassri, 2020), our time-series model achieved a 38% better compression factor than our standard numeric model.

5. Delayed Coding

The attribute encoder generates a series of intervals, which can be encoded into a bit stream by arithmetic coding. However, arithmetic coding is slow due to its variable-length codes and extensive floating-point calculations. We propose delayed coding to address these issues. For ease of understanding, we first assume that every symbol can be represented by a single, continuous interval. We relax this constraint in Section 5.6.

5.1. Probability Representation

We begin by investigating fast algorithms to represent and compute with probability intervals. Arithmetic coding is slow due to the many interval product operations required. Recall the example in Section 2.1: when entering a sub-interval of the current interval based on the next symbol’s probability, we need to compute a product of intervals. If we use a floating-point number to represent the probability, the interval product $\boxtimes$ is defined as:

$$[l_a, r_a) \boxtimes [l_b, r_b) \coloneqq [\,l_a + (r_a - l_a) \cdot l_b,\ l_a + (r_a - l_a) \cdot r_b\,), \tag{1}$$

where $0 \leq l_a < r_a \leq 1$ and $0 \leq l_b < r_b \leq 1$. However, this design leads to floating-point calculations and the risk of floating-point underflow. An alternative is to use two integers $U, d$ to represent the probability $U / 2^d$. In this way, checking for underflow amounts to inspecting the exponent $d$ of the denominator.
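As a small worked illustration of Equation 1, the sketch below computes the interval product with exact rationals, which sidesteps the floating-point issues mentioned above. The function name and the sample intervals are chosen for illustration only.

```python
from fractions import Fraction


def interval_product(a, b):
    """Equation 1: enter the sub-interval b of the current interval a."""
    la, ra = a
    lb, rb = b
    width = ra - la
    return (la + width * lb, la + width * rb)


outer = (Fraction(1, 5), Fraction(1, 2))   # current interval [0.2, 0.5)
inner = (Fraction(0), Fraction(1, 2))      # next symbol occupies [0, 0.5)
result = interval_product(outer, inner)    # [0.2, 0.35)
```

The product keeps the lower bound of the outer interval and shrinks its width from 0.3 to 0.15, which is the narrowing step that arithmetic coding repeats for every symbol.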

Integer-based Probability Intervals. In Blitzcrank, a probability is represented as a 16-bit integer $U$, which logically represents the probability $U / 2^{16}$. In this paper, we use $[L, R)$, where $0 \leq L, R \leq 2^{16}$ are integers, to logically represent the interval $[L / 2^{16}, R / 2^{16})$. We choose 16 bits because shorter integers increase the decoding overhead, while 32-bit or 64-bit integers must handle integer overflow during multiplication. For an interval whose length is smaller than $1 / 2^{16}$ (e.g., $[0, 1/2^{17})$), we use the product of two or more intervals to represent it, i.e.,

$$[0, 1/2^{17}) = [0, 1/2^{16}) \boxtimes [0, 32768/2^{16}) \rightarrow [0, 1), [0, 32768).$$
Given $m$, $\{(\alpha_N, \beta_N)\}$, $\{w_N\}$ defined in Algorithm 1, where $N = 2^m$, $m \in \mathbb{N}_+$, and $w_i$ is a 16-bit integer, for $i = 1, \cdots, N$.
Function Inv-Translate($s$):
    $j \leftarrow s \gg (16 - m)$    /* The higher $m$ bits */
    $Q \leftarrow s \mathbin{\&} (2^{16-m} - 1)$    /* The lower $16 - m$ bits */
    return $(Q < (w_j \gg m))$ ? $\alpha_j$ : $\beta_j$
Algorithm 2 Inv-Translate with the Integer Probability

Inv-Translate Becomes Faster. Although the code-to-symbol function in Algorithm 1 has constant time complexity, the “floor” operation slows it down. The integer-based probability makes Inv-Translate faster by replacing the “floor” operation with bitwise operations. Algorithm 2 gives the new Inv-Translate algorithm with integer-based probability. Note that both $\{w_N\}$ and the input $s$ in Algorithm 2 are 16-bit unsigned integers under the integer-based probability. Also, $N$ is always a power of two, a condition that can be satisfied by adding placeholder symbols with a frequency of 0 if necessary.
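A direct transcription of Algorithm 2 might look like the sketch below. The table contents are hypothetical and built by hand for a two-symbol model (they are not produced by Algorithm 1, which lies outside this section), and slots are indexed from 0 here because $j$ is taken from the top $m$ bits of $s$.

```python
def inv_translate(s, m, alpha, beta, w):
    """Integer-only code-to-symbol lookup following Algorithm 2.
    s and every w[j] are 16-bit unsigned integers; len(alpha) == 2**m."""
    j = s >> (16 - m)                  # higher m bits select the table slot
    q = s & ((1 << (16 - m)) - 1)      # lower 16-m bits locate s inside the slot
    return alpha[j] if q < (w[j] >> m) else beta[j]


# Hypothetical table with N = 2 slots (m = 1). Slot 0 is split in the middle
# between "a" and "b"; slot 1 belongs to "b" entirely. This gives "a" the
# codes [0, 16384) and "b" the codes [16384, 65536).
m = 1
alpha = ["a", "b"]
beta = ["b", "b"]
w = [32768, 65535]

first = inv_translate(0x1000, m, alpha, beta, w)
second = inv_translate(0x8040, m, alpha, beta, w)
```

Only a shift, a mask, and one comparison are needed per lookup, which is the point of the integer representation.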

5.2. Options of Fixed-length Codes

We then introduce an algorithm to extract the code from an interval based on our integer-based probability representation and analyze the bits wasted in terms of code options. In Section 2.1, we explained that for the final interval $[0.22, 0.27)$ in arithmetic coding, using just enough fractional digits to keep the number within this interval is sufficient for encoding. However, this results in variable-length codes, which can slow down decoding because the number of digits is inconsistent across intervals, requiring more branch predictions and checks (Said, 2004).

One way to simplify the decoding process is to use a fixed number of bits, such as 16 bits, for encoding each interval. In this scenario, adopting the integer-based probability, we define the bidirectional mapping of a semantic model as follows:

$$\begin{aligned} \text{symbol-to-interval}&:\ \{v_1, \cdots, v_N\} \rightarrow \{[L_i, R_i),\ 1 \leq i \leq N\}, \\ \text{code-to-symbol}&:\ \{s \in \mathbb{N},\ 0 \leq s < 2^{16}\} \rightarrow \{v_1, \cdots, v_N\}, \end{aligned}$$

where $v_i$ denotes a unique symbol in this model, allocated a disjoint interval $[L_i, R_i)$, with $L_i$ being the lower bound and $R_i$ being the upper bound of the interval, for $1 \leq i \leq N$. By converting the interval $[0.22, 0.27)$ to $[14418, 17694)$ using the integer-based probability, we can encode it using any 16-bit integer within this range. However, this method requires more than the 6 bits we use in the example of Section 2.1. In fact, for an interval $[L, R)$, the probability of the event it represents is $(R - L) / 2^{16}$. Thus, its entropy is $16 - \log_2 (R - L)$ bits, and we waste $\log_2 (R - L)$ bits if we use 16 bits to encode it. Notice that we have $R - L$ code options for this interval. The selection of a specific code option itself carries information. Since any of these options can represent the interval $[L, R)$, we can use the options to partially represent another interval.
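The arithmetic above can be checked with a short sketch. The conversion rounds the lower bound up and the upper bound down; this rounding rule is an assumption that happens to reproduce the numbers in the text, and the function names are illustrative.

```python
import math


def to_integer_interval(lo, hi, bits=16):
    """Map a real interval [lo, hi) to integers [L, R) on a 2**bits grid,
    rounding inward (assumed rule)."""
    scale = 1 << bits
    return math.ceil(lo * scale), math.floor(hi * scale)


def code_stats(L, R, bits=16):
    """Return (options, entropy, wasted) for a fixed-length code of `bits` bits."""
    options = R - L                  # number of valid fixed-length codes
    wasted = math.log2(options)      # bits carried by the free choice of code
    entropy = bits - wasted          # bits needed to identify the interval
    return options, entropy, wasted


L, R = to_integer_interval(0.22, 0.27)
options, entropy, wasted = code_stats(L, R)
```

For this interval there are 3276 code options, so a 16-bit code spends between 11 and 12 bits on the choice of option; these are the bits that delayed coding reuses.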

5.3. Problem Formulation

Leveraging the concept of code options, we formalize the code extraction problem as follows. Given a series of intervals $[L_1, R_1), \cdots, [L_n, R_n)$ with $n > 0$ and $0 \leq L_i < R_i \leq 2^{16}$, we can select a 16-bit integer $s_i \in [L_i, R_i)$ to encode each interval. Then, can we represent a distinct interval $[L^*, R^*)$ using a sequence $(s_1, \cdots, s_n)$?

Consider a special case where each interval is of length 2 (i.e., $R_i - L_i = 2$). We then have two encoding options per interval: either $L_i$ or $L_i + 1$. Either option uniquely represents its interval. Using this binary decision for each interval, we can represent parts of the distinct interval $[L^*, R^*)$. To fully encode $[L^*, R^*)$, we need a sufficient number of intervals. If we have at least 16 intervals, matching the 16 bits of $L^*$, complete representation is possible. The encoding rule is simple: let the code $s_i = L_i$ if the $i$-th bit of $L^*$ is 0; otherwise, $s_i$ takes the value $L_i + 1$. This way, the sequence $(s_1, \cdots, s_n)$ encodes both the series of intervals and the interval $[L^*, R^*)$, leveraging the binary encoding options of each interval.

Consider a general case where each interval’s length $k_i = R_i - L_i$ varies within the range $[1, 2^{16})$. This means each interval $[L_i, R_i)$ has $k_i$ coding options: $\{L_i, L_i + 1, \cdots, L_i + (k_i - 1)\}$. Any of these codes can uniquely represent its interval. Moreover, these codes can partially represent the distinct interval $[L^*, R^*)$ as well. In this case, the $i$-th interval offers a digit in a base-$k_i$ system. With $n$ such intervals, they collectively form a mixed radix (base) numeral system (Fraenkel, 1985) with bases $\{k_n\}$. Assuming $\prod_{i=1}^{n} k_i \geq 2^{16}$, the 16-bit number $L^*$ can be converted into this mixed-base system as follows:

$$a_i = L^* \mathbin{\%} k_i, \qquad L^* = L^* / k_i, \tag{2}$$

for $i = n, \cdots, 1$ in a loop, where $/$ denotes integer division. The resulting value is $a_1 a_2 \cdots a_n$. Setting $s_i$ to $L_i + a_i$ gives a valid code, as $a_i < k_i$ ensures $L_i + a_i < R_i$. Hence, the sequence $(s_1, \cdots, s_n)$ effectively encodes both the series of intervals and the interval $[L^*, R^*)$.

Example: 3-Digit Mixed Radix Numeral System. The intervals $[1, 4)$, $[2, 6)$, $[3, 10)$ form a 3-digit mixed radix numeral system with bases $(3, 4, 7)$. This system can encode up to $3 \times 4 \times 7 = 84$ distinct states. Assume we use 4 bits to encode each interval. To encode the interval $[13, 14)$, we select $x = 13$ from this range. Decomposing 13 with the bases $(3, 4, 7)$ using Equation 2 gives $13 = \underline{0} \times (4 \times 7) + \underline{1} \times 7 + \underline{6}$, leading to the indices $(0, 1, 6)$. To determine $s_1$, we select the value at index 0 in the first interval $[1, 4)$, resulting in $s_1 = 1$. Similarly, $s_2 = 3$ and $s_3 = 9$. Therefore, the encoded bit stream for the four intervals is $(0001\ 0011\ 1001)_2$.
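The example can be reproduced with a few lines of code. The sketch below applies Equation 2 from the last base to the first and then rebuilds the value from the digits; the function names are illustrative.

```python
def to_mixed_radix(value, bases):
    """Equation 2: peel off digits from the last base to the first."""
    digits = [0] * len(bases)
    for i in reversed(range(len(bases))):
        digits[i] = value % bases[i]
        value //= bases[i]
    if value:
        raise ValueError("value does not fit into the given bases")
    return digits


def from_mixed_radix(digits, bases):
    """Inverse conversion: rebuild the value, most significant digit first."""
    value = 0
    for a, k in zip(digits, bases):
        value = value * k + a
    return value


intervals = [(1, 4), (2, 6), (3, 10)]
bases = [r - l for l, r in intervals]                      # (3, 4, 7)
digits = to_mixed_radix(13, bases)                         # indices per interval
codes = [l + a for (l, _), a in zip(intervals, digits)]    # s_i = L_i + a_i
bitstream = " ".join(format(c, "04b") for c in codes)
```

Decoding reads each 4-bit code, identifies its interval, subtracts the lower bound to obtain the digit, and feeds the digits to `from_mixed_radix` to recover 13.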

Figure 7. Delayed Coding - Given a tuple (“b”, 1, “@”, 3), the attribute encoder translates it into intervals: [32768, 65536), [10011, 10027), [3, 32772), [1023, 1028). These intervals are then encoded and decoded as shown above.

5.4. Encoding Procedure

Figure 8. Recursive Encoding - First, the last interval is encoded by the numeral system with bases $(k_1, k_2, k_3)$. Then, the second-to-last interval is encoded by the numeral system with bases $(k_1, k_2)$.

Inspired by the mixed radix numeral system introduced above, the encoding procedure of delayed coding essentially transforms decimal numbers into mixed radix numbers. The encoding has two steps. First, we mark all intervals that can be represented by the options of their preceding intervals. An interval can be marked if and only if the current option number is larger than $\lambda$ ($2^{16}$ by default). Second, for each marked interval, we convert its 16-bit code into a mixed-radix number, represented using the code options of the intervals that precede it.

Step 1: Mark Intervals. In Figure 7, we illustrate the encoding process of a tuple (“b”, 1, “@”, 3) using four categorical models that convert the tuple into intervals. We use an option counter $k$, initially set to one, to track redundant information. The 1st interval provides $2^{15}$ options, updating $k$ to $2^{15}$. Next, the 2nd interval cannot be marked because the current option counter $k \leq \lambda = 2^{16}$: we use a fixed-length (i.e., 16-bit) code to encode each interval, and the current numeral system cannot yet represent a 16-bit integer. The 2nd interval increases $k$ to $2^{19}$ because it offers 16 options. This is now enough to mark the 3rd interval, which consumes $2^{16}$ options and updates $k = k / 2^{16} = 2^3$. The 3rd interval then contributes $32769$ options, updating $k = 2^3 (2^{15} + 1)$. The marking process continues; finally, the last two intervals are marked and will be transformed into mixed radix numbers, represented by their preceding intervals.
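The marking pass can be sketched as a single loop over the intervals. This is an illustration, not the authors' code: the function name is hypothetical, and dividing the counter with integer (floor) division is an assumption that keeps the counter a safe lower bound on the available options.

```python
CODE_BITS = 16
LAMBDA = 1 << CODE_BITS  # default threshold, 2**16


def mark_intervals(intervals, lam=LAMBDA):
    """Return one flag per interval: True if its 16-bit code can be stored
    in the code options of the intervals that precede it."""
    k = 1                      # option counter
    marked = []
    for L, R in intervals:
        if k > lam:            # enough spare options to carry a 16-bit code
            marked.append(True)
            k //= 1 << CODE_BITS   # the marked interval consumes 2**16 options
        else:
            marked.append(False)
        k *= R - L             # this interval contributes its own options
    return marked


# Intervals of the tuple ("b", 1, "@", 3) from Figure 7.
fig7 = [(32768, 65536), (10011, 10027), (3, 32772), (1023, 1028)]
marks = mark_intervals(fig7)
```

For the Figure 7 intervals the counter takes the values $2^{15}$, $2^{19}$, and $2^3 (2^{15} + 1)$ before the second, third, and fourth interval, so the last two intervals are marked.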

Step 2: Convert Intervals From the End. We convert each marked interval into a mixed-radix number in a recursive manner, as illustrated in Figure 8. Specifically, the last two intervals are marked in the marking step. First, we convert the rightmost interval into a mixed-radix number using bases $(k_1, k_2, k_3)$. Then, the second-to-last interval holds the partial code of the last interval. Since the second-to-last interval is also marked, it is converted into a mixed-radix number with bases $(k_1, k_2)$. This dependency is why intervals must be processed from the end.

We use a 64-bit integer $V_{\text{info}}$ to store the codes for marked intervals temporarily, starting with $V_{\text{info}} = 0$. We use a loop to process each interval $[L, R)$ from the end. For each step, the decimal number to be converted is $V_{\text{info}}$, and the current interval $[L, R)$ provides $k = R - L$ options, acting as a base-$k$ digit. We compute the digit value $a$ and the remaining decimal number using Equation 2:

$$a = V_{\text{info}} \mathbin{\%} k, \qquad V_{\text{info}} = V_{\text{info}} / k. \tag{3}$$

Therefore, the 16-bit code for the interval $[L, R)$ is computed as $c = L + a$, i.e., we use its code options to store a digit value $a$. For the first processed interval, we get $a = 0$ because $V_{\text{info}} = 0$, but $V_{\text{info}}$ can be updated. If the interval $[L, R)$ is marked, $V_{\text{info}} = V_{\text{info}} \cdot k + c$. Otherwise, we output the 16-bit code $c$ to the bit stream. The loop continues; finally, we encode the four intervals into 4 bytes.
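The two encoding steps can be combined into one sketch that reproduces the 4-byte result of Figure 7. Two points are assumptions of this sketch rather than statements from the text: the marking counter uses floor division, and a marked code is folded back into $V_{\text{info}}$ by shifting $V_{\text{info}}$ left by 16 bits, so that a decoder can later peel the code off as the low 16 bits. Function names are illustrative.

```python
CODE_BITS = 16
LAMBDA = 1 << CODE_BITS


def mark_intervals(intervals):
    """Step 1: flag intervals whose code fits into preceding code options."""
    k, marked = 1, []
    for L, R in intervals:
        if k > LAMBDA:
            marked.append(True)
            k //= 1 << CODE_BITS
        else:
            marked.append(False)
        k *= R - L
    return marked


def delayed_encode(intervals):
    """Step 2: walk the intervals from the end and emit 16-bit codes."""
    marked = mark_intervals(intervals)
    v_info = 0            # pending information to hide in earlier intervals
    out = []
    for (L, R), is_marked in zip(reversed(intervals), reversed(marked)):
        k = R - L
        a = v_info % k    # Equation 3: digit stored in this interval
        v_info //= k
        c = L + a         # 16-bit code for this interval
        if is_marked:
            # Assumption: keep the code as the low 16 bits of v_info.
            v_info = (v_info << CODE_BITS) + c
        else:
            out.append(c)
    out.reverse()         # codes were produced back to front
    return out


fig7 = [(32768, 65536), (10011, 10027), (3, 32772), (1023, 1028)]
codes = delayed_encode(fig7)
stream = b"".join(c.to_bytes(2, "big") for c in codes)
```

Only the two unmarked intervals emit codes, so the four attribute values occupy 4 bytes, matching the bit stream 0x8040 271D used in the decoding example.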

5.5. Decoding Procedure

Conversely, the decoding procedure transforms the mixed radix numerals back into decimal numbers. There are two sources of bits to decode a tuple: the bit stream and the virtual input $V_{\text{info}}$. The decoding of each symbol has three steps: (1) retrieve a 16-bit code from $V_{\text{info}}$ if the current option number $V_{\text{size}}$ is larger than $2^{16}$, and otherwise from the bit stream; (2) obtain the desired symbol and its interval $[L, R)$ by calling the Inv-Translate function; (3) update $V_{\text{info}}$ and $V_{\text{size}}$ accordingly.

The bottom part of Figure 7 shows the decoding process. We want to decode a tuple from the bit stream 0x8040 271D. At first, $V_{\text{info}} = 0$ and $V_{\text{size}} = 1$. For the first attribute value, we fetch 16 bits from the bit stream, getting 0x8040. The function Inv-Translate receives it and returns the symbol “b”. We use the symbol-to-interval mapping to determine its interval, resulting in $[L, R) = [2^{15}, 2^{16})$. Next, we compute the digit base $k = R - L = 2^{15}$ and the digit value $a = \text{0x8040} - L = 64$. Using them, we update $V_{\text{info}}$ and $V_{\text{size}}$:

$$V_{\text{info}} = k \cdot V_{\text{info}} + a, \qquad V_{\text{size}} = k \cdot V_{\text{size}}.$$

This is just the inverse of Equation 3. It is necessary to use $V_{\text{size}}$ to record the amount of information in $V_{\text{info}}$. For instance, with $V_{\text{info}} = 1$, a $V_{\text{size}}$ of 4 results in two virtual bits $01_2$, while a $V_{\text{size}}$ of 8 yields three virtual bits $001_2$. The second interval decodes to the symbol 1. Since $V_{\text{size}} = 2^{19}$ at this point ($\geq \lambda = 2^{16}$), we fetch the next 16-bit code from the virtual input, resulting in 0x0402 for the third symbol, as shown in Loop 3 of the decoding table in Figure 7. This decoding process repeats for all symbols, yielding (“b”, 1, “@”, 3).
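The decoding loop can be sketched as follows and checked against the Figure 7 bit stream. Several parts are assumptions made for illustration: a linear scan over a per-attribute list stands in for Inv-Translate, each hypothetical model lists only the one interval known from Figure 7, the virtual code is taken from the low 16 bits of $V_{\text{info}}$, and $V_{\text{size}}$ is multiplied by $k$ after every symbol, which is what the running values $2^{15}$ and $2^{19}$ in the example imply.

```python
CODE_BITS = 16
LAMBDA = 1 << CODE_BITS


def lookup(model, code):
    """Stand-in for Inv-Translate: find the symbol whose interval holds code."""
    for symbol, L, R in model:
        if L <= code < R:
            return symbol, L, R
    raise ValueError("code not covered by the model")


def delayed_decode(stream, models):
    """Decode one tuple; `stream` yields 16-bit codes, one model per attribute."""
    stream = iter(stream)
    v_info, v_size = 0, 1
    symbols = []
    for model in models:
        if v_size > LAMBDA:                 # enough virtual bits available
            code = v_info & (LAMBDA - 1)    # low 16 bits of the virtual input
            v_info >>= CODE_BITS
            v_size >>= CODE_BITS
        else:
            code = next(stream)             # fall back to the real bit stream
        symbol, L, R = lookup(model, code)
        k, a = R - L, code - L              # digit base and digit value
        v_info = k * v_info + a
        v_size = k * v_size
        symbols.append(symbol)
    return symbols


# Hypothetical partial models: one known (symbol, L, R) entry per attribute.
models = [
    [("b", 32768, 65536)],
    [(1, 10011, 10027)],
    [("@", 3, 32772)],
    [(3, 1023, 1028)],
]
decoded = delayed_decode([0x8040, 0x271D], models)
```

The third code comes out of the virtual input as 0x0402 and the fourth as 1023, so no further reads from the bit stream are needed after the first two codes.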

5.6. Modification for Non-Continuous Intervals

Up to this point, we have assumed that each symbol is represented by a single continuous interval. In this section, we modify our algorithm by relaxing this constraint to allow a symbol to be represented by the union of multiple non-continuous intervals. A non-continuous interval example is shown in Figure 5, where the symbol “$b$” is assigned to the interval $[1/8, 1/3) \cup [7/12, 1)$, or its integer representation $[8192, 21845) \cup [38229, 65536)$.

Non-continuous intervals offer the same number of options as continuous ones, only at different positions, so they require just a slight modification to delayed coding. Take a non-continuous interval with two segments $[L^{(1)}, R^{(1)}) \cup [L^{(2)}, R^{(2)})$. It provides $k = (R^{(1)} - L^{(1)}) + (R^{(2)} - L^{(2)})$ options. To store a number $a \in [0, k)$ using these options, we modify the selection of the 16-bit code $c$ as follows:

$$c = \begin{cases} L^{(1)} + a & \text{if } 0 \leq a < R^{(1)} - L^{(1)}, \\ L^{(2)} + a - (R^{(1)} - L^{(1)}) & \text{if } R^{(1)} - L^{(1)} \leq a < k. \end{cases}$$

In other words, we choose the $a$-th optional code of the symbol, regardless of its integer value. Decoding involves the reverse process. For a 16-bit code within this non-continuous interval, we retrieve the stored number $a$ as follows: if $\text{16bits} \in [L^{(1)}, R^{(1)})$, then $a = \text{16bits} - L^{(1)}$. Otherwise, $a = \text{16bits} - L^{(2)} + (R^{(1)} - L^{(1)})$. In other words, the 16-bit code is the $a$-th item in this non-continuous interval. This method can be extended to manage intervals with more than two segments, generating two piecewise linear functions for the computation of $c$ and $a$. Importantly, this modification has no effect on the correctness and efficiency of delayed coding, as shown in the appendix.
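The code selection and its inverse can be sketched in a few lines of Python. This is only an illustration of the piecewise mapping described above, not the Blitzcrank implementation: the function names `option_count`, `select_code`, and `recover_number` are hypothetical, segments are assumed to be disjoint half-open ranges listed in ascending order, and $a$ is counted from zero.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open range [L, R) of 16-bit codes


def option_count(segments: List[Segment]) -> int:
    """Total number of optional codes k offered by all segments."""
    return sum(right - left for left, right in segments)


def select_code(segments: List[Segment], a: int) -> int:
    """Return the a-th (0-based) optional code, walking segments in order."""
    if not 0 <= a < option_count(segments):
        raise ValueError("a must lie in [0, k)")
    for left, right in segments:
        width = right - left
        if a < width:
            return left + a
        a -= width  # skip the options provided by this segment
    raise AssertionError("unreachable: a was checked against k")


def recover_number(segments: List[Segment], code: int) -> int:
    """Inverse of select_code: the rank of `code` among the optional codes."""
    offset = 0
    for left, right in segments:
        if left <= code < right:
            return offset + (code - left)
        offset += right - left
    raise ValueError("code does not belong to this symbol")


# Integer representation of the interval assigned to symbol "b" in the text.
B_SEGMENTS = [(8192, 21845), (38229, 65536)]
```

Because both functions loop over the segment list, the same sketch covers intervals with more than two segments; for the two-segment case it reduces to the two branches given in the text.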

5.7. Fine Granularity Compression Effectiveness

We show the effectiveness and optimality of delayed coding in this section. In Figure 7, there are 20 unused options (i.e., $k = 20$) after the encoding, resulting in a waste of $\log_2 20 = 4.32$ bits. The number of wasted bits can be bounded by $\log_2 \lambda$ (note that we mark an interval once the number of options is larger than $\lambda$). Theorem 2 shows that as the number of intervals grows, the effectiveness of delayed coding improves, approaching the entropy.

Theorem 2.

Given a series of intervals $[L_1, R_1), \cdots, [L_n, R_n)$, where $n \geq 1$, and $L_i, R_i$ are 16-bit integers less than or equal to $2^{16}$ for all $i$. Suppose delayed coding:

  (1) Marks an interval if and only if the current option number is larger than or equal to $\lambda$, where $\lambda \geq 2^{16}$.

  (2) Encodes every $\zeta$ intervals as a bit stream, where $0 < \zeta \leq n$.

Thus, the number of used bits $L^{n,\lambda,\zeta}$ is bounded by

$$L^{n,\lambda,\zeta} \leq n \cdot C + (\lfloor n/\zeta \rfloor + 1) \cdot \log_2 \lambda + n \cdot \log_2 (1 - 65535/\lambda)^{-1},$$

where $n \cdot C$ is the entropy of all intervals. Further, for a sufficiently large $n$, by setting $\lambda = \zeta = n$, we have $L^{n,\lambda,\zeta} / (n \cdot C) \to 1$ as $n \to +\infty$.

Proof.

See Appendix D.2. ∎
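To get a feel for the bound in Theorem 2, the right-hand side can be evaluated numerically. The sketch below is only arithmetic on the stated inequality; the function name `delayed_coding_bound_bits` and the choice of $C = 8$ bits per interval in the usage note are assumptions made for illustration.

```python
import math


def delayed_coding_bound_bits(n: int, entropy_per_interval: float,
                              lam: int, zeta: int) -> float:
    """Evaluate n*C + (floor(n/zeta) + 1)*log2(lam) + n*log2((1 - 65535/lam)^-1)."""
    if lam < 2 ** 16:
        raise ValueError("Theorem 2 assumes lambda >= 2^16")
    if not 0 < zeta <= n:
        raise ValueError("Theorem 2 assumes 0 < zeta <= n")
    entropy_bits = n * entropy_per_interval
    per_stream_bits = (n // zeta + 1) * math.log2(lam)
    per_interval_bits = n * -math.log2(1 - 65535 / lam)
    return entropy_bits + per_stream_bits + per_interval_bits


def bound_to_entropy_ratio(n: int, entropy_per_interval: float) -> float:
    """Ratio of the bound to the entropy when lambda = zeta = n."""
    bound = delayed_coding_bound_bits(n, entropy_per_interval, lam=n, zeta=n)
    return bound / (n * entropy_per_interval)
```

For example, with an assumed $C = 8$, `bound_to_entropy_ratio(2**20, 8.0)` is larger than `bound_to_entropy_ratio(2**30, 8.0)`, and both exceed 1, which matches the limit stated in the theorem.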

Summary: Delayed coding uses a fixed number of bits to encode each interval. It is based on the insight that altering the redundant information in an interval does not affect its symbol retrieval. Theorem 2 reveals that delayed coding has a near-entropy compression factor with fine compression granularity.

6. Compression Microbenchmarks

Table 1. Data sets - Unmarked data sets come from Public BI Benchmark (Vogelsgesang et al., 2018) or earlier semantic compression research (Gao and Parameswaran, 2016; Ilkhechi et al., 2020).
| Group | Data set | #Rows | #Cols | Row Length (bytes) |
|---|---|---|---|---|
| Numeric | Corel | 68,040 | 93 | 820 |
| Numeric | Jena Climate (Mnassri, 2020) | 420,551 | 14 | 138 |
| Numeric | Cars | 344,287 | 155 | 393 |
| Categorical | Forest Cover | 581,012 | 55 | 127 |
| Categorical | US Census 1990 | 2,458,285 | 69 | 145 |
| Categorical | Food | 5,216,593 | 5 | 22 |
| Categorical | Bimbo | 20,259,279 | 12 | 54 |
| String | Yale Languages | 5,762,082 | 30 | 284 |
| String | Medicare | 8,645,072 | 26 | 229 |
| String | Arade | 9,888,775 | 11 | 88 |
Figure 9. Compression Evaluation - We report the compression factor, insert/access latency, and training time of all compressors.

We evaluate Blitzcrank in the next two sections. First, using 10 real tables, we compare Blitzcrank with modern compressors. This comparison focuses on compression factors and fast random tuple access from compressed storage (Section 6.1). Then, we provide a breakdown of the Blitzcrank structure learner (Section 6.2). Following this, we compare delayed coding with asymmetric numeral systems (Section 6.3). Finally, we optimize the random access performance by analyzing the compression block size (Section 6.4).

Baselines. We evaluate Blitzcrank against Zstandard (Collet and Kucherawy, 2021) and Raman’s approach (Raman and Swart, 2006): (1) Zstandard is a real-time compression system. It has a training mode, designed for compressing many small files. This mode creates a “zstd-dictionary” from all files and uses it to compress each file independently. We use the open-source Zstandard (v1.5.1) in C++, setting the “zstd-dictionary” capacity to the recommended 110 KB and using the default compression level. (2) Raman’s method (Raman and Swart, 2006) focuses on tuple compression. It considers correlations between columns and combines Huffman coding and delta encoding to achieve a high compression factor. We implemented Raman’s approach in C++ using the default column ordering of an input table.

We exclude DeepSqueeze (Ilkhechi et al., 2020) because it does not support high cardinality columns and is not open-sourced. We do not include FSST (Boncz et al., 2020) and other lightweight techniques (Damme et al., 2017; Abadi et al., 2006a, 2013), because they are not for row-stores. In the appendix, we also evaluate Blitzcrank against the open-sourced (in C++) Squish (Gao and Parameswaran, 2016) and Gzip (Ziv and Lempel, 1977) for the table archive task. Our method is $20\times$ faster than Squish and offers $2\times$ higher compression factors compared to Gzip.

Blitzcrank Setting. Blitzcrank samples $2^{15}$ tuples for structure learning, with detailed sensitivity analysis provided in Section 6.2. For delayed coding, each tuple is individually encoded for the optimal access latency. We set $\lambda = 2^{16}$ to maximize the compression factor, as detailed in Theorem 2. We evaluate two Blitzcrank variants: one utilizes column correlation for compression (Blitzcrank w/ Correlation), while the other does not (Blitzcrank w/o Correlation).

Data Sets. Table 1 shows the data sets. Besides the data sets from previous studies (Vogelsgesang et al., 2018; Gao and Parameswaran, 2016; Ilkhechi et al., 2020), we also use Jena Climate, which consists of 14 time-series columns (Mnassri, 2020). We classify each data set into categorical, numeric, or string types. Specifically, we calculate the proportion of each attribute type in the total data set size and select the type with the highest proportion as the representative group for that data set.
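The grouping rule can be written down as a small calculation. The sketch below assumes each column has already been labeled with an attribute type and a total byte size; the function name `representative_group` and the column names and sizes in the usage note are hypothetical and do not come from the paper's data sets.

```python
from collections import defaultdict
from typing import Dict, Tuple


def representative_group(column_types: Dict[str, str],
                         column_bytes: Dict[str, int]) -> Tuple[str, Dict[str, float]]:
    """Return the attribute type with the largest share of the data set size."""
    bytes_per_type: Dict[str, int] = defaultdict(int)
    for column, attribute_type in column_types.items():
        bytes_per_type[attribute_type] += column_bytes[column]
    total = sum(bytes_per_type.values())
    shares = {t: size / total for t, size in bytes_per_type.items()}
    return max(shares, key=shares.get), shares
```

For instance, a table with a 60-byte string column, a 30-byte categorical column, and a 10-byte numeric column would be assigned to the string group with a share of 0.6.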

Experimental Setup. We use three metrics to measure compression performance: compression factor, throughput, and random access latency. The throughput represents the amount of data processed per unit of time for a given compression/decompression task. Random access latency is the time required to retrieve a random record. We conduct our experiments on a machine equipped with two Intel® Xeon 8375C (32 × 2 cores) and 512 GB RAM. The disk we use is an Intel® SSD D5-P5530 (1 TB). We use Debian GNU/Linux 11 and GCC 10.2 with -O3 enabled. All microbenchmarks are conducted with a single thread.

6.1. Compression Evaluation

We evaluate Blitzcrank on in-memory tables (constructed using the data sets above) with each tuple compressed separately. The compressed tuples are organized using a primary-key index (implemented using a simple C++ vector) where the primary keys are monotonically increasing integers. We use YCSB (workload C) with a Zipf distribution to generate the random-access workloads (Cooper et al., 2010). Specifically, for each data set, we first compress and insert 5 million tuples into the in-memory table and then execute 1 million point queries, each involving decompressing a particular tuple. We report the average latencies for compression-insertion and random access separately. We also record the size of each in-memory table after insertion to calculate the compression factor. For each compressor, we first train its model over the corresponding data set if required by the algorithm.

Figure 9 shows the results, including compression factor, latency, and training time, across various data sets on the x-axis. Blitzcrank has the highest compression factor for 7/10 tables and offers the lowest latency for 9/10 tables among all compressors. This is because Blitzcrank models columns semantically and uses fixed-length codes for encoding. Raman’s approach has the highest compression factor for the remaining 3/10 tables because the tuples of these tables have low entropy; each tuple requires on average just a few bytes for encoding (e.g., 2.6 bytes for a Bimbo tuple). In this case, using fixed-length codes is less efficient. However, Raman’s approach is slow for accessing tuples, because its variable-length code for each attribute leads to additional checks when decoding. Zstandard falls short in both compression factor and latency, because it relies on long contexts of at least 4 KB to build an effective dictionary, and a single tuple is too short to meet this requirement.

We advise using Blitzcrank w/o Correlation in most cases. Capturing column correlations improves the compression factor at the cost of access latency and model training time. The training time is often exponential in the number of categorical columns, but a longer training time does not necessarily guarantee better performance. A detailed analysis of the semantic learner is given in Section 6.2.

(a) No Correlation Found Example: Bimbo
(b) Correlation Found Example: Census
Figure 10. Breakdown of Blitzcrank Distribution Learner - Vary the #samples and evaluate the performance.
(a) Compression
(b) Decompression
Figure 11. Entropy Coding Running Time - Vary the #column of tables and record the processing time.

6.2. Sensitivity to Sampling Number

Blitzcrank randomly selects a subset of samples for structure learning. We now investigate the sensitivity to the sampling number. We use Blitzcrank w/ Correlation to compress and decompress the whole data sets with the compression granularity being a single tuple (i.e., delayed coding encodes each tuple into a separate compressed block). We keep this tuple-level compression granularity for the remaining experiments unless specified otherwise. We select two representative data sets, Bimbo and Census, for the analysis. Because structure learning influences the complexity of the generated model, which in turn affects the compression speed, we report the duration of each stage within Blitzcrank: structure learning (Structuring), model generation (Generation), compression, and decompression.

Figure 10 shows the results. We vary the #samples in structure learning on the x-axis (log-scaled) and record the compression factor and the running time of each stage. For Bimbo in Figure 10a, the sample number has little effect on the compression factor and running time – the learner cannot learn many dependencies. Structure learning time increases slightly with sample number; this is expected since more samples need scanning. Census in Figure 10b shows a different pattern: the compression factor increases with the #samples – the learner finds interesting dependencies and generates more complex models. Therefore, we need more time to generate models. The disparity between the two patterns is due to different column counts: Census has 69 columns compared to Bimbo’s 12. More columns typically indicate more complex dependencies, leading to higher access latency and longer training time. Considering the running time and performance, we set the default sample number to $2^{15}$ for Blitzcrank. Although considering column correlation may improve the compression factor of Blitzcrank on a few data sets (e.g., Yale) with a small impact on the access latency, we opt to use Blitzcrank w/o Correlation for the remaining experiments unless otherwise specified.

6.3. Entropy Coding Running Time

We compare delayed coding with arithmetic coding and asymmetric numeral systems (ANS) (Duda, 2021). We implement arithmetic coding with an integer-based probability representation and use finite-state entropy (Collet, 2022) as the ANS implementation. We integrate them into Blitzcrank and use the same models for distribution estimation. In this experiment, we create 64 MB relational data sets with different numbers of columns. Each column has a uniform distribution of cardinality 255 with values sampled from ASCII codes. We vary the column number from 2 to 1024 and record the compression and decompression times of all algorithms.

In Figure 11, we compare delayed coding, arithmetic coding, and ANS, represented by the solid lines. Delayed coding is $2\times$ faster than ANS for the decompression speed, with arithmetic coding being the slowest. This is because delayed coding has a constant-time decoding complexity. Arithmetic coding, on the other hand, relies on binary search for code-to-symbol mapping, operating in $O(\log N)$. An improvement for ANS involves using an unordered map from codes to symbols to accelerate decoding (Duda, 2021). We then implement such a decoding map for both ANS and delayed coding, denoted by the dotted lines. Figure 11 shows that the delayed coding is still faster than ANS. Decompression time for ANS is similar to delayed coding with few columns but slows as column numbers increase due to cache burden. Storing decoding maps increases the cache miss rate from 0.04% to 0.132% when columns exceed 108.
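The difference between the two code-to-symbol lookup strategies can be illustrated with a toy model. The sketch below is a generic illustration, not the code used in the experiment: the three-symbol alphabet, its interval widths, and all function names are assumptions, and a plain Python list stands in for the decoding map.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List

CODE_SPACE = 1 << 16  # number of distinct 16-bit codes


def interval_starts(counts: List[int]) -> List[int]:
    """Left endpoints of consecutive intervals whose widths are `counts`."""
    return [0] + list(accumulate(counts))[:-1]


def decode_by_binary_search(symbols: List[str], starts: List[int], code: int) -> str:
    """O(log N) lookup: find the last interval starting at or before `code`."""
    return symbols[bisect_right(starts, code) - 1]


def build_decoding_table(symbols: List[str], counts: List[int]) -> List[str]:
    """Materialize one entry per code so that decoding is a single index, O(1)."""
    table: List[str] = []
    for symbol, count in zip(symbols, counts):
        table.extend([symbol] * count)
    return table


# Hypothetical model: three symbols whose interval widths sum to 2^16.
SYMBOLS = ["a", "b", "c"]
COUNTS = [8192, 13653, 43691]
STARTS = interval_starts(COUNTS)
TABLE = build_decoding_table(SYMBOLS, COUNTS)
```

The table trades memory for speed: it holds one entry per code for every model, which is consistent with the cache pressure described above once many columns each carry their own map.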

(a) Zoom-in
(b) Latency vs. Cpr. Factor
Figure 12. Compression Granularity - Vary the block size from 1 to 128 tuples, and write it next to the marker for each trial.

6.4. Sensitivity to Compression Granularity

We claim that delayed coding has near-entropy performance with fine compression granularity in Section 5.7. In this part, we investigate the effect of compression block size in practice. We present three data sets for this experiment: Arade, Cover Type, and Yale Language. We omit other data sets because they produce similar results. In each trial, we vary the block size (#tuple) to see how it affects the compression factor and the random access latency. We measure the access latency by repeating the process one million times. This experiment is conducted in memory.

Figure 12 shows that delayed coding has a high compression factor with fine compression granularity. For each table, the compression factor reaches a plateau when the compression block size exceeds 8 tuples. This indicates a trade-off between compression factor and latency within the 1 to 8 tuple range, allowing users to select their preferred block sizes. Since OLTP databases usually prioritize low latency and Blitzcrank has a high compression factor even at a block size of one tuple, we set this as the default.

(a) Uniform Distribution
(b) Zipfian Distribution
Figure 13. Effect of a Fast-Path LRU Cache - The workloads are based on YCSB Workload F (read-modify-write). Dashed lines are Blitzcrank without caching, while the solid lines are Blitzcrank with fast-path cache enabled.

6.5. Fast Path for Tuple Updates

We evaluate a fast path for tuple updates in this section. Specifically, we implemented an LRU write-back cache to buffer the most recently accessed tuples in their decompressed form. Normally, a tuple update involves loading the compressed tuple, decompressing it, modifying the tuple, and re-compressing the updated tuple. With the cache, the workload flow starts by looking up the cache. If the target (decompressed) tuple is already in the cache, we modify the tuple directly. Otherwise, we first decompress the target tuple and insert it into the cache. We evaluate the cache using YCSB workload F (read-modify-write) under Uniform and Zipfian query distributions on data sets Census and Bimbo. The table initially contains five million tuples, and we execute one million read-modify-write queries on the table.

As shown in Figure 13, adding the fast-path cache slows down uniformly distributed queries because cache hits are rare, and cache lookup and maintenance bring overhead. For Zipf-distributed queries, having the fast-path cache improves the query performance because as the cache size increases, more and more tuple updates are performed directly in the cache without decompression. We conclude that a fast-path cache can benefit skewed workloads when using Blitzcrank for compression.
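The read-modify-write flow with a fast-path cache can be sketched as follows. This is a minimal single-threaded illustration under several assumptions: `zlib` over JSON stands in for the Blitzcrank compressor, the class name `FastPathStore` and its methods are hypothetical, the capacity is at least one tuple, and every evicted tuple is re-compressed without tracking whether it was actually modified.

```python
import json
import zlib
from collections import OrderedDict
from typing import Any, Dict


class FastPathStore:
    """Compressed table fronted by an LRU write-back cache of decompressed tuples."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.compressed: Dict[int, bytes] = {}
        self.cache: "OrderedDict[int, Dict[str, Any]]" = OrderedDict()
        self.decompressions = 0  # counts slow-path accesses

    @staticmethod
    def _compress(tup: Dict[str, Any]) -> bytes:
        return zlib.compress(json.dumps(tup).encode())  # stand-in compressor

    @staticmethod
    def _decompress(blob: bytes) -> Dict[str, Any]:
        return json.loads(zlib.decompress(blob).decode())

    def insert(self, key: int, tup: Dict[str, Any]) -> None:
        self.compressed[key] = self._compress(tup)

    def _load(self, key: int) -> Dict[str, Any]:
        if key in self.cache:  # fast path: tuple is already decompressed
            self.cache.move_to_end(key)
            return self.cache[key]
        tup = self._decompress(self.compressed[key])  # slow path
        self.decompressions += 1
        self.cache[key] = tup
        if len(self.cache) > self.capacity:
            # Evict the least recently used tuple and write it back compressed.
            old_key, old_tup = self.cache.popitem(last=False)
            self.compressed[old_key] = self._compress(old_tup)
        return tup

    def read_modify_write(self, key: int, field: str, value: Any) -> None:
        self._load(key)[field] = value
```

Under a skewed workload, repeated updates to the same hot keys hit the fast path and skip both decompression and re-compression; under a uniform workload, most calls take the slow path and additionally pay for cache maintenance.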

7. System Evaluation

We integrated Blitzcrank (w/o Correlation) into Silo (Tu et al., 2013) and measured the end-to-end performance using the TPC-C benchmark (Council, 2007). Silo is an OCC-based serializable database designed for excellent performance at scale on large multi-core machines. Silo uses the Masstree (Mao et al., 2012) for its underlying indexes, and has a very high transaction throughput, achieving more than 1 million txns/s on the standard TPC-C workload in our experiments.

Each record in a table is compressed separately. The compressors under test are Uncompressed, Zstandard, Raman, and Blitzcrank with the corresponding row-stores named Silo, ZstdDB, RamanDB, and BlitzDB, respectively. To access a record by primary key, Silo walks the index tree using that key to find the compressed record and then decompresses it into a list of attributes.

Table 2. Data Generation Methods - Addresses are generated by ZIP code conventions (us_, 2023b); Phone and district are produced by populating a predefined format with random numbers.
| Column | Method | Source/Format |
|---|---|---|
| C_FIRST | Sampling | US Baby Names (us_, 2023a) |
| C_STREET | Sampling | Open Addresses (rea, 2023) |
| C_DATA | Sampling | City Max Capita (Vogelsgesang et al., 2018) |
| S_DATA | Sampling | Corporations (Vogelsgesang et al., 2018) |
| C_STATE | Sampling | List of US States |
| C_CITY | Conditional | Cities within C_STATE |
| C_ZIP | Conditional | ZIP Codes within C_CITY |
| C_PHONE | Format-Based | “(XXX) XXX-XXXX” |
| S_DIST | Format-Based | “dist-str#XX#XX#XXXX” |

According to the TPC-C specification, some columns are filled with random bytes which are incompressible. We substitute these bytes with data that either follows real-world patterns or is sampled from the collected corpus. Table 2 details our data generation approach. The compression factors for the new Customer and Stock tables are 3.44 and 5.57, respectively.
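The three generation methods in Table 2 can be sketched as follows. This is an illustration only: the tiny `CITIES_BY_STATE` and `ZIPS_BY_CITY` dictionaries are hypothetical stand-ins for the collected corpora, and the function names are assumptions rather than the generator used in the evaluation.

```python
import random
from typing import Dict, List

# Hypothetical miniature corpora standing in for the sources listed in Table 2.
CITIES_BY_STATE: Dict[str, List[str]] = {
    "CA": ["Los Angeles", "San Diego"],
    "NY": ["New York", "Buffalo"],
}
ZIPS_BY_CITY: Dict[str, List[str]] = {
    "Los Angeles": ["90001", "90012"],
    "San Diego": ["92101"],
    "New York": ["10001", "10002"],
    "Buffalo": ["14201"],
}


def fill_format(pattern: str, rng: random.Random) -> str:
    """Format-based: replace every 'X' in the pattern with a random digit."""
    return "".join(str(rng.randrange(10)) if ch == "X" else ch for ch in pattern)


def generate_customer_fields(rng: random.Random) -> Dict[str, str]:
    """Sample a state, then a city within it, then a ZIP code within that city."""
    state = rng.choice(sorted(CITIES_BY_STATE))   # sampling
    city = rng.choice(CITIES_BY_STATE[state])     # conditional on C_STATE
    zip_code = rng.choice(ZIPS_BY_CITY[city])     # conditional on C_CITY
    return {
        "C_STATE": state,
        "C_CITY": city,
        "C_ZIP": zip_code,
        "C_PHONE": fill_format("(XXX) XXX-XXXX", rng),
    }


def generate_stock_district(rng: random.Random) -> str:
    """Format-based value for S_DIST."""
    return fill_format("dist-str#XX#XX#XXXX", rng)
```

The conditional columns are what give a semantic compressor something to learn: C_CITY depends on C_STATE and C_ZIP depends on C_CITY, whereas uniformly random bytes would carry no such structure.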

(a) Throughput
(b) DB Size
Figure 14. TPC-C Workload - In each trial, we use 16 threads and each thread executes 1 million transactions.
(a) Training Time
(b) Models Size
Figure 15. Models in TPC-C - RamanDB uses full data for training, and the others only sample 16 warehouses for training.

7.1. In-Memory Workloads

In this section, we investigate the performance-space trade-offs of BlitzDB compared to the other baselines when the entire database fits in memory. We vary the number of warehouses in TPC-C from 64 to 896, in increments of 64. In each trial, we use 16 threads. Each database executes 16 million transactions, and we report the average throughput. The training time is measured before the transactions start, while the database size and model size are measured after the transactions. RamanDB uses the entire data set for training, while BlitzDB and ZstdDB sample data from 16 warehouses. Because Raman’s approach uses a static dictionary, it cannot compress new records. We, therefore, use a buffer (size = 64K tuples) to batch the newly inserted and updated records temporarily. When the buffer is full, we create a new dictionary to compress these buffered records. Before adding or using these dictionaries, each thread acquires a mutex lock.

As shown in Figure 14, Blitzcrank compresses the data to 14.8% of the original (i.e., Silo) with a throughput decrease of around 21% due to the compression overhead. This performance-space trade-off is considerably better than that of ZstdDB and RamanDB. Moreover, Figure 15 shows that Blitzcrank has the smallest model size and requires orders-of-magnitude shorter time for training compared to the baselines.

(a) Throughput
(b) Memory Consumption
Figure 16. TPC-C Larger-Than-Memory Workload - Start with 16 warehouses, and each trial runs 20 minutes using 16 threads.

Figure 17 shows the TPC-C throughput as the number of threads grows, with each thread corresponding to a warehouse. Both BlitzDB and Silo demonstrate impressive thread scalability. However, throughput reaches a clear limit after 64 threads, which is particularly noticeable in Raman’s approach. This limit can be ascribed to several factors: hyperthreading, the increasing size of the database, shared resources such as the L3 cache, and direct thread contention.

7.2. Larger-Than-Memory Workloads

We then evaluate Blitzcrank under the case when the working set does not fit in physical memory. The tuples are stored on disk with the memory acting as a cache. The memory (i.e., the buffer pool) adopts an LRU replacement policy, and we set the memory limit to 5 GB (excluding the memory occupied by indexes). We start with 16 warehouses (around 1 GB) and execute TPC-C transactions for 20 minutes using 16 threads.

Figure 16 shows the throughput and memory consumption for the experiments. Note that the x-axis represents the number of executed transactions, as in (Zhang et al., 2016). After 20 minutes of execution, BlitzDB completed $5\times$ as many transactions as Silo. This is because Blitzcrank not only achieves an exceptional compression factor but also adds only moderate compression/decompression overhead to the system. BlitzDB can sustain a high throughput for a longer time because the memory saved by Blitzcrank allows the database to keep a larger working set in memory.

In comparison, neither ZstdDB nor RamanDB significantly improves transaction execution. Zstandard suffers from a low compression factor, especially on short tuples. For example, despite using the zstd-dictionary, Zstandard only achieves a compression factor of around 1.3 for the TPC-C table OrderLine. On the other hand, Raman’s method is limited by its large dictionary size. It uses a buffer to temporarily hold tuples, compressing them once the buffer is full and then clearing the buffer. This leads to the generation of large compression dictionaries and unstable throughput, with a notable decrease in speed while the buffered tuples are being compressed.

8. Related Work

Lightweight encoding, such as bit-packed, delta, run-length, dictionary, and bit vector encoding, has become popular recently (Chen et al., 2001; Shi, 2020; Abadi et al., 2006b; par, 2013; Abadi et al., 2013; Boncz et al., 2020; Abadi et al., 2006a). These are used in column-store databases as they can quickly process large chunks of data using SIMD instructions (Jiang and Elmore, 2018). However, delta and run-length encoding methods can be slow when we need to quickly fetch just a single tuple/value, as they have to decode an entire data block. Therefore, these lightweight encoding methods are unsuitable for processing data in OLTP databases.

General-purpose block compression methods such as Gzip (Ziv and Lempel, 1977), Snappy (sna, 2019), and Zstandard (Collet and Kucherawy, 2021) effectively save disk space by using a sliding window technique to identify word repetitions, thus minimizing data transfer between disk and memory (par, 2013). However, the static dictionary makes them inflexible for insert/update scenarios. Zstandard provides a special mode for small files. It improves compression by training on all small files to generate a dictionary. This dictionary must be loaded before compression and decompression. However, this mode is less effective at compressing files that differ even slightly from the prior data.

Semantic compression needs to estimate probability distributions for each column in a relational table. Babu et al. proposed the first lossy semantic compression method, SPARTAN, for table compression (Babu et al., 2001). Subsequently, Gao et al. introduced Squish, which uses a Bayesian network and arithmetic coding (Gao and Parameswaran, 2016). Later, DeepSqueeze was conceived, using auto-encoders (Ilkhechi et al., 2020). However, being lossy, these techniques are suited for data archiving and less so for low-latency transaction processing. For example, Squish sorts each table column to make delta encoding more efficient, which slows compression. DeepSqueeze does not support columns with high cardinality, because it uses one-hot encoding for each column in its network architecture, and high cardinality introduces numerous parameters in the fully connected layer (Ilkhechi et al., 2020). In contrast, Blitzcrank supports all common column types in databases and provides a faster compression speed. Also, DeepSqueeze uses deep learning techniques for structure learning (Ilkhechi et al., 2020). However, this approach lacks explainability and has a slow inference speed. Blitzcrank, therefore, uses the Bayesian network to capture column correlations for its simplicity. The effectiveness of the Bayesian network approach has been demonstrated in (Gao and Parameswaran, 2016).

Figure 17. Scalability of Compression - Vary the thread number from 1 to 128, and each trial runs for 1 minute.

9. Conclusions

We introduce Blitzcrank, a high-speed semantic compressor for OLTP databases. We first propose novel semantic models that support fast inferences and dynamic value sets for both discrete and continuous data types; we then introduce a new entropy encoding algorithm, called delayed coding, that achieves significant improvement in the decoding speed compared to modern arithmetic coding implementations. Blitzcrank has high compression factors and fast decompression speed. We integrate Blitzcrank into an in-memory OLTP database, Silo. The TPC-C benchmark shows that, for data sets larger than the available physical memory, Blitzcrank can help the database sustain a high throughput and execute four times more transactions before the I/O overhead dominates.

References

  • par (2013) 2013. parquet. https://parquet.apache.org/
  • sna (2019) 2019. Snappy. https://github.com/google/snappy.
  • dbl (2022) 2022. DBLP Dataset. https://dblp.org
  • rea (2023) 2023. The free and open global address collection. https://openaddresses.io/
  • us_ (2023a) 2023a. Popularity of Names by US State from the Social Security Website. https://www.ssa.gov/oact/babynames/limits.html
  • us_ (2023b) 2023b. US Zip Codes Database. https://simplemaps.com/data/us-zips
  • zst (2023) 2023. Zstandard - Real-time data compression algorithm. http://facebook.github.io/zstd/
  • Abadi et al. (2013) Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, Samuel Madden, et al. 2013. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases (2013), 197–280.
  • Abadi et al. (2006a) Daniel Abadi, Samuel Madden, and Miguel Ferreira. 2006a. Integrating compression and execution in column-oriented database systems. In Proceedings of SIGMOD’06. 671–682.
  • Abadi et al. (2006b) Daniel J. Abadi, Samuel Madden, and Miguel Ferreira. 2006b. Integrating compression and execution in column-oriented database systems. In Proceedings of SIGMOD’06. ACM, 671–682.
  • Babu et al. (2001) Shivnath Babu, Minos N. Garofalakis, and Rajeev Rastogi. 2001. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. In Proceedings of SIGMOD’01. ACM, 283–294.
  • Barbarioli et al. (2023) Bruno Barbarioli, Gabriel Mersy, Stavros Sintos, and Sanjay Krishnan. 2023. Hierarchical Residual Encoding for Multiresolution Time Series Compression. Proceedings of SIGMOD’23 (2023), 1–26.
  • Boncz et al. (2020) Peter Boncz, Thomas Neumann, and Viktor Leis. 2020. FSST: fast random access string compression. Proceedings of VLDB’20 (2020), 2649–2661.
  • Box et al. (2015) George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
  • Chen et al. (2001) Zhiyuan Chen, Johannes Gehrke, and Flip Korn. 2001. Query optimization in compressed database systems. In Proceedings of SIGMOD’01. 271–282.
  • Cleary and Witten (1984) John G. Cleary and Ian H. Witten. 1984. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans. Commun. 32, 4 (1984), 396–402.
  • Collet (2022) Yann Collet. 2022. Finite State Entropy. https://github.com/Cyan4973/FiniteStateEntropy
  • Collet and Kucherawy (2021) Yann Collet and Murray S. Kucherawy. 2021. Zstandard Compression and the ’application/zstd’ Media Type. RFC 8878 (2021), 1–45.
  • Cooper et al. (2010) Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of SOCC’10. 143–154.
  • Council (2007) The Transaction Processing Council. 2007. TPC-C Benchmark (Revision 5.9.0). https://www.tpc.org/tpcc/
  • Damme et al. (2017) Patrick Damme, Dirk Habich, Juliana Hildebrandt, and Wolfgang Lehner. 2017. Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses). In Proceedings of EDBT’17. 72–83.
  • Davies and Moore (1999) Scott Davies and Andrew W. Moore. 1999. Bayesian Networks for Lossless Dataset Compression. In Proceedings of SIGKDD’99. ACM, 387–391.
  • Duda (2021) Jarek Duda. 2021. Encoding of probability distributions for Asymmetric Numeral Systems. CoRR abs/2106.06438 (2021).
  • Foufoulas et al. (2021) Yannis Foufoulas, Lefteris Sidirourgos, Elefterios Stamatogiannakis, and Yannis E. Ioannidis. 2021. Adaptive Compression for Fast Scans on String Columns. In Proceedings of SIGMOD’21. ACM, 554–562.
  • Fraenkel (1985) Aviezri S Fraenkel. 1985. Systems of numeration. The American Mathematical Monthly 92, 2 (1985), 105–114.
  • Gao and Parameswaran (2016) Yihan Gao and Aditya G. Parameswaran. 2016. Squish: Near-Optimal Compression for Archival of Relational Datasets. In Proceedings of SIGKDD’16. ACM, 1575–1584.
  • Haas et al. (2020) Gabriel Haas, Michael Haubenschild, and Viktor Leis. 2020. Exploiting Directly-Attached NVMe Arrays in DBMS. In CIDR.
  • Heger et al. (2017) Christoph Heger, André van Hoorn, Mario Mann, and Dusan Okanovic. 2017. Application Performance Management: State of the Art and Challenges for the Future. In Proceedings of ICPE’17. ACM, 429–432.
  • Ilkhechi et al. (2020) Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, and Ugur Çetintemel. 2020. DeepSqueeze: Deep Semantic Compression for Tabular Data. In Proceedings of SIGMOD’20. ACM, 1733–1746.
  • Jiang and Elmore (2018) Hao Jiang and Aaron J Elmore. 2018. Boosting data filtering on columnar encoding with simd. In Workshop on Data Management on New Hardware, Proceedings of SIGMOD’18. 1–10.
  • Koller and Friedman (2009) Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models - Principles and Techniques. MIT Press.
  • Kronmal and Peterson Jr (1979) Richard A Kronmal and Arthur V Peterson Jr. 1979. On the alias method for generating random variables from a discrete distribution. The American Statistician 33, 4 (1979), 214–218.
  • Kuschewski et al. (2023) Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. 2023. BtrBlocks: Efficient Columnar Compression for Data Lakes. Proceedings of SIGMOD’23 (2023), 1–26.
  • Langdon (1984) Glen G Langdon. 1984. An introduction to arithmetic coding. IBM Journal of Research and Development 28, 2 (1984), 135–149.
  • Lasch et al. (2020) Robert Lasch, Ismail Oukid, Roman Dementiev, Norman May, Suleyman S Demirsoy, and Kai-Uwe Sattler. 2020. Faster & strong: string dictionary compression using sampling and fast vectorized decompression. The VLDB Journal 29, 6 (2020), 1263–1285.
  • Lersch et al. (2020) Lucas Lersch, Ivan Schreter, Ismail Oukid, and Wolfgang Lehner. 2020. Enabling Low Tail Latency on Multicore Key-Value Stores. Proceedings of VLDB’20 13, 7 (2020), 1091–1104.
  • Li et al. (2021) Jiguo Li, Chuanmin Jia, Xinfeng Zhang, Siwei Ma, and Wen Gao. 2021. Cross Modal Compression: Towards Human-comprehensible Semantic Compression. In Proceedings of ACM MM’21. ACM, 4230–4238.
  • MacKay (2003) David J. C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
  • Mao et al. (2012) Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM european conference on Computer Systems. 183–196.
  • Matthew (2005) V Mahoney Matthew. 2005. Adaptive weighing of context models for lossless data compression. Florida Institute of Technology CS Dept, Technical Report (2005).
  • Mnassri (2020) Baligh Mnassri. 2020. Jena Climate Dataset. https://www.kaggle.com/datasets/mnassrib/jena-climate
  • Moffat (2019) Alistair Moffat. 2019. Huffman coding. ACM Computing Surveys (CSUR) 52, 4 (2019), 1–35.
  • Pezoa et al. (2016) Felipe Pezoa, Juan L. Reutter, Fernando Suárez, Martín Ugarte, and Domagoj Vrgoc. 2016. Foundations of JSON Schema. In Proceedings of WWW’16. ACM, 263–273.
  • Polychroniou and Ross (2015) Orestis Polychroniou and Kenneth A Ross. 2015. Efficient lightweight compression alongside fast scans. In Proceedings of DaMoN@SIGMOD’15. 1–6.
  • Pöss and Potapov (2003) Meikel Pöss and Dmitry Potapov. 2003. Data Compression in Oracle. In Proceedings of VLDB’03. VLDB Endowment, 937–947.
  • Raman and Swart (2006) Vijayshankar Raman and Garret Swart. 2006. How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations. In Proceedings of VLDB’06. VLDB Endowment, 858–869.
  • Said (2004) Amir Said. 2004. Comparative Analysis of Arithmetic Coding Computational Complexity. In Data compression conference. Citeseer, 562.
  • Shi (2020) Jia Shi. 2020. Column partition and permutation for run length encoding in columnar databases. In Proceedings of SIGMOD’20. 2873–2874.
  • Sinha (2021) Tanmay Sinha. 2021. OLAP vs. OLTP: What’s the Difference? https://www.ibm.com/cloud/blog/olap-vs-oltp
  • Sudhir et al. (2021) Sivaprasad Sudhir, Michael Cafarella, and Samuel Madden. 2021. Replicated layout for in-memory database systems. Proceedings of VLDB’21 (2021), 984–997.
  • Tu et al. (2013) Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy transactions in multicore in-memory databases. In Proceedings of SOSP’13. 18–32.
  • van Renen et al. (2018) Alexander van Renen, Viktor Leis, Alfons Kemper, Thomas Neumann, Takushi Hashida, Kazuichi Oe, Yoshiyasu Doi, Lilian Harada, and Mitsuru Sato. 2018. Managing non-volatile memory in database systems. In Proceedings of SIGMOD’18. 1541–1555.
  • Vogelsgesang et al. (2018) Adrian Vogelsgesang, Michael Haubenschild, Jan Finis, Alfons Kemper, Viktor Leis, Tobias Mühlbauer, Thomas Neumann, and Manuel Then. 2018. Get Real: How Benchmarks Fail to Represent the Real World. In Proceedings of DBTest@SIGMOD’18. 1:1–1:6.
  • Witten et al. (1987) Ian H Witten, Radford M Neal, and John G Cleary. 1987. Arithmetic coding for data compression. Commun. ACM 30, 6 (1987), 520–540.
  • Yang et al. (2018) Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of SIGCOMM’18. 561–575.
  • Zhang et al. (2016) Huanchen Zhang, David G. Andersen, Andrew Pavlo, Michael Kaminsky, Lin Ma, and Rui Shen. 2016. Reducing the Storage Overhead of Main-Memory OLTP Databases with Hybrid Indexes. In Proceedings of SIGMOD’16. ACM, 1567–1581.
  • Zhang et al. (2020) Huanchen Zhang, Xiaoxuan Liu, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2020. Order-Preserving Key Compression for In-Memory Search Trees. In Proceedings of SIGMOD’20. ACM, 1601–1615.
  • Ziegler et al. (2022) Tobias Ziegler, Carsten Binnig, and Viktor Leis. 2022. ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. In Proceedings of SIGMOD’22. 685–699.
  • Ziv and Lempel (1977) Jacob Ziv and Abraham Lempel. 1977. A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory 23, 3 (1977), 337–343.
  • Ziv and Lempel (1978) Jacob Ziv and Abraham Lempel. 1978. Compression of Individual Sequences via Variable-Rate Coding. IEEE Trans. Inf. Theory 24, 5 (1978), 530–536.

Appendix A Arithmetic Coding

In this section, we give the pseudo-code of arithmetic coding with an integer-based probability representation, as shown in Algorithm 3. All algorithms in the appendix are written in C++ style. The input of function Encode is a sequence of probability intervals. We first compute the product of these intervals and then find a suitable integer $M$ to represent the result. Note that some codes are emitted while the product is being computed, because we rely on early bits emission to avoid precision underflow.

    Function Encode([L_1, R_1], ⋯, [L_s, R_s]):
        codes ← ∅
        L ← 0
        R ← 65536
        for i ← 1 to s do
            code, [L, R] ← GetPIProduct([L, R], [L_i, R_i])
            codes ← codes + code
        Find the smallest k such that ∃M, [2^(-k)·M, 2^(-k)·(M+1)] ⊆ [L, R]
        return codes + M

    Function GetPIProduct([L_a, R_a], [l_b, r_b]):
        range ← R_a − L_a
        L_32 ← (L_a << 16) + range · l_b
        R_32 ← (L_a << 16) + range · r_b
        L ← L_32 >> 16
        R ← R_32 >> 16
        /* Underflow Check */
        if R > L + 1 then
            return ∅, [L, R]
        else if R = L + 1 then
            /* Truncate [L, R] containing 32768 if necessary. */
            if (R << 16) − L_32 ≥ R_32 − (R << 16) then
                return L & 0xffff, [L_32 & 0xffff, 65536]
            else
                return R & 0xffff, [0, R_32 & 0xffff]
        else
            /* Underflow happens, i.e., L = R. */
            return L & 0xffff, [L_32 & 0xffff, R_32 & 0xffff]

Algorithm 3 Encoding of Arithmetic Coding
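The last step of Encode, finding the smallest $k$ for which some $M$ satisfies $[2^{-k}M, 2^{-k}(M+1)] \subseteq [L, R]$, can be illustrated with a short Python sketch. It assumes, as in Algorithm 3, that $L$ and $R$ are integers on the 16-bit scale (65536 stands for probability 1). The function name `shortest_code` and its `bits` parameter are hypothetical and are not taken from the Blitzcrank implementation.

```python
def shortest_code(lo, hi, bits=16):
    """Return (k, M) with the smallest k such that the cell
    [M * 2^(bits-k), (M+1) * 2^(bits-k)] lies inside [lo, hi].

    Assumption: lo and hi are integers with 0 <= lo < hi <= 2^bits.
    """
    for k in range(bits + 1):
        step = 1 << (bits - k)      # width of one k-bit cell on the integer scale
        m = -(-lo // step)          # smallest M whose cell starts at or after lo
        if (m + 1) * step <= hi:    # the cell must also end at or before hi
            return k, m
    raise ValueError("interval is empty at this precision")
```

For instance, `shortest_code(16384, 32768)` gives `(2, 1)`: the two-bit code `01` denotes the interval $[1/4, 1/2]$.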

Precision Underflow Is Tackled by Early Bits Emission. In practice, a record can have a large number of probability intervals, so the product can easily exceed the precision limit, i.e., $L' \geq R'$ (in fact, only $L' = R'$ is possible), where $[L', R']$ is the product result. Arithmetic coding with integer-based probability intervals handles this with early bits emission. When underflow happens, i.e., $L' = R'$, we emit the 16-bit value $L_{32} \texttt{>>} 16$ and let

$$L' = L_{32} \mathbin{\&} \texttt{0xffff}, \quad R' = R_{32} \mathbin{\&} \texttt{0xffff}.$$

The technique works because underflow happens exactly when the high 16 bits of $L_{32}$ and $R_{32}$ are identical (we use 32 bits to temporarily represent the product of probability intervals), which means that the next two bytes of the code have already been determined. Therefore, we can output the high 16 bits directly and continue with the low 16 bits as the updated product.
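A minimal Python sketch of this step is given below. It covers only the 32-bit product and the case $L' = R'$; the $R = L + 1$ truncation branch of Algorithm 3 is deliberately left out. The function names `pi_product_32` and `early_emit` are hypothetical.

```python
def pi_product_32(la, ra, lb, rb):
    """32-bit product of [la, ra] with [lb, rb], both on the 16-bit scale."""
    rng = ra - la
    l32 = (la << 16) + rng * lb
    r32 = (la << 16) + rng * rb
    return l32, r32


def early_emit(l32, r32):
    """Return (code, [L', R']).

    If the high 16 bits of l32 and r32 agree (underflow), they are emitted
    as the code and the low 16 bits become the new interval. Otherwise no
    code is emitted and the interval is the pair of high 16 bits.
    """
    if (l32 >> 16) != (r32 >> 16):
        return None, (l32 >> 16, r32 >> 16)
    return l32 >> 16, (l32 & 0xFFFF, r32 & 0xFFFF)
```

With the illustrative inputs `pi_product_32(0x1234, 0x1238, 0x1000, 0x2000)`, both bounds share the high half `0x1234`, so `early_emit` outputs `0x1234` and continues with the interval `(0x4000, 0x8000)`.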

Bits Prefetch. With integer-based probability representation, decoding of arithmetic coding is similar to the original version (MacKay, 2003), except that we know the next symbol can be determined by reading at most 16 bits. Thus, we can read 16 bits each time since the extra bits read will not disturb decoding. In this way, we do not need to check whether the current information is enough to decode the next symbol.

Appendix B The Constant Time Complexity of Inv-Translate

Every probability vector $\pi_1, \cdots, \pi_N$ can be expressed as an equiprobable mixture of $N$ two-point distributions. That is, there are $N$ pairs of integers $(\alpha_1, \beta_1), \cdots, (\alpha_N, \beta_N)$ and $N$ probabilities $w_1, \cdots, w_N$ such that

$$\pi_i = 1/N \cdot \sum_{j=1}^{N} \left( w_j \mathbb{1}_{\{\alpha_j = i\}} + (1 - w_j) \mathbb{1}_{\{\beta_j = i\}} \right) = 1/N \cdot \sum_{j=1}^{N} Y^{(j)}_i$$

for $1 \leq i \leq N$, where $Y^{(1)}, \cdots, Y^{(N)}$ are two-point distributions.

This can be shown by induction. It is true when $N = 1$. Assuming that it is true for $N < k$, we can show it is true for $N = k$ as follows. Choose the minimal $\pi_i$. Since it is at most equal to $1/N$, we can take $\alpha_1$ equal to the index of this minimum and set $w_1$ equal to $N\pi_{\alpha_1}$. Then choose the index $\beta_1$ which corresponds to the largest $\pi_i$. This defines the first function $Y^{(1)}$. Note that we used the fact that $(1 - w_1)/N \leq \pi_{\beta_1}$ because $1/N \leq \pi_{\beta_1}$. The other $N - 1$ functions have to be constructed from the leftover probabilities

$$\pi_1, \cdots, \pi_{\alpha_1} - \pi_{\alpha_1}, \cdots, \pi_{\beta_1} - (1 - w_1)/N, \cdots, \pi_N$$

which, after deletion of the $\alpha_1$-th entry, is easily seen to be a vector of $N - 1$ non-negative numbers summing to $(N - 1)/N$. Rescaled by $N/(N - 1)$, this is a probability vector of length $N - 1$, so the remaining $Y^{(j)}$ can be found by our induction hypothesis.
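As a worked illustration of the induction step (the numbers are chosen only for this example), take $N = 3$ and $\pi = (1/2, 1/3, 1/6)$. The minimum is $\pi_3 = 1/6 \leq 1/3$, so $\alpha_1 = 3$ and $w_1 = N\pi_3 = 1/2$; the maximum is $\pi_1 = 1/2$, so $\beta_1 = 1$. Subtracting $Y^{(1)}/N$ leaves $(1/3, 1/3)$ on indices 1 and 2, which sums to $(N - 1)/N = 2/3$. Repeating the step on this leftover gives $w_2 = w_3 = 1$, and altogether

$$Y^{(1)} = (\tfrac{1}{2}, 0, \tfrac{1}{2}), \quad Y^{(2)} = (1, 0, 0), \quad Y^{(3)} = (0, 1, 0),$$

$$\tfrac{1}{3} \left( Y^{(1)} + Y^{(2)} + Y^{(3)} \right) = (\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = \pi.$$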

Appendix C Delayed Coding

In this section, we give the pseudo-code of delayed coding in detail.

C.1. Encoding Procedure

The pseudo-code of encoding is shown in Algorithm 4. Encoding of delayed coding has two stages: (1) Planning; (2) Filling.

     1  Function Encode([L_1, R_1], ⋯, [L_s, R_s]):
            /* 1. Planning. */
     2      size ← 1
     3      isVirtual ← false
     4      for i ← 1 to s do
     5          Mark the interval [L_i, R_i] with isVirtual.
     6          isVirtual ← false
     7          k ← R_i − L_i
     8          size ← size · k
     9          if size ≥ 2^16 then
    10              isVirtual ← true
    11              size ← size >> 16
            /* 2. Filling. */
    12      data ← 0
    13      bitStream ← ∅
    14      for i ← s to 1 do
    15          k ← R_i − L_i
    16          a ← data mod k
    17          data ← data / k
    18          16bits ← L_i + a
    19          if [L_i, R_i] is virtual then
    20              data ← (data << 16) + 16bits
    21          else
    22              bitStream ← 16bits + bitStream
    23      return bitStream

Algorithm 4 Encoding Procedure of Delayed Coding

Planning. The planning stage determines how many and which intervals can be virtual. It is necessary because the virtual bits must be filled from end to start, as proved in Section 5.3. Note that even when a probability interval is marked as virtual, we still compute how much space it can contribute to the virtual bits, so that every virtual bit is put to good use. Every time the number of virtual bits is larger than 40 bits, we state that one more probability interval can be virtual, as shown in lines 9-11.

Filling. The filling stage fills the virtual bits with the information we have collected. Following Section 5.3, we compute $k_i$ and $a_i$ ($k$ and $a$ in the pseudo-code) for each probability interval. If an interval is virtual, we add it to $d_i$ ($data$ in the pseudo-code); otherwise, we add it to $bitStream$. Note that the filling stage traverses the probability intervals from back to front.
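The two stages of Algorithm 4 can be sketched in Python as follows. This is an illustrative transcription, not the Blitzcrank implementation: the name `delayed_encode` is hypothetical, each interval is assumed to be half-open on the 16-bit scale (so an interval of length $k$ offers the $k$ words $L_i, \ldots, L_i + k - 1$), and the bit stream is modelled as a Python list of 16-bit words.

```python
def delayed_encode(intervals):
    """Encode intervals [(L_1, R_1), ..., (L_s, R_s)] into a list of 16-bit words."""
    # 1. Planning: decide which intervals can be carried by virtual bits.
    virtual = []
    size, is_virtual = 1, False
    for lo, hi in intervals:
        virtual.append(is_virtual)      # flag decided by the previous intervals
        is_virtual = False
        size *= hi - lo
        if size >= 1 << 16:             # enough free choices for one full word
            is_virtual = True
            size >>= 16

    # 2. Filling: walk from back to front, hiding virtual words in the
    #    free choice of earlier intervals.
    data = 0
    stream = []
    for (lo, hi), virt in zip(reversed(intervals), reversed(virtual)):
        k = hi - lo
        a = data % k                    # offset chosen inside [lo, hi)
        data //= k
        word = lo + a
        if virt:
            data = (data << 16) + word  # stored in virtual bits, not emitted
        else:
            stream.insert(0, word)      # emitted to the physical stream
    return stream
```

With the hypothetical intervals `A = (0, 32768)` and `B = (32768, 49152)`, `delayed_encode([A, A, B])` returns `[1, 0]`: the third interval is virtual, so three symbols occupy only two 16-bit words.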

C.2. Decoding Procedure

The pseudo-code of decoding is shown in Algorithm 5. Decoding is the inverse of encoding in delayed coding. Each time, we read 16 bits from either the bit stream or the virtual bits (depending on the information collected in $data$). The function Inv-Translate is then called to obtain the symbol corresponding to these 16 bits, together with the additional information $a$ and $k$. We update the virtual bits with this additional information and check whether the next virtual probability interval can be obtained.
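The decoding loop can be sketched in Python under a few stated assumptions. The three-symbol table `MODEL`, the linear-scan `inv_translate`, and the name `delayed_decode` are hypothetical stand-ins (the actual Inv-Translate is a constant-time lookup), and a single model is used for every attribute. The sketch also assumes that, once a virtual word has been extracted, `data` and `size` are both shifted right by 16 bits, mirroring the planning stage of Algorithm 4.

```python
# Hypothetical model on the 16-bit scale: symbol -> half-open interval [L, R).
MODEL = {"A": (0, 32768), "B": (32768, 49152), "C": (49152, 65536)}


def inv_translate(word):
    """Linear-scan stand-in for Inv-Translate: returns (symbol, a, k)."""
    for symbol, (lo, hi) in MODEL.items():
        if lo <= word < hi:
            return symbol, word - lo, hi - lo
    raise ValueError("word outside the 16-bit range")


def delayed_decode(stream, count):
    """Decode `count` symbols from a list of 16-bit words."""
    pending = list(stream)              # words still to be read, front first
    data, size = 0, 1
    record = []
    while len(record) < count:
        word = pending.pop(0)
        value, a, k = inv_translate(word)
        data = data * k + a             # collect the information hidden in a
        size *= k
        record.append(value)
        if size >= 1 << 16:
            # A full virtual word is available; it is read next.
            pending.insert(0, data & 0xFFFF)
            # Assumption: the consumed 16 bits are dropped from both counters.
            data >>= 16
            size >>= 16
    return record
```

Under this model, the two-word stream `[1, 0]` is what Algorithm 4 produces for the symbols A, A, B, and `delayed_decode([1, 0], 3)` recovers `["A", "A", "B"]`: the word for B is rebuilt from the offsets of the first two words.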

The function Inv-Translate is inspired by the alias method, a constant-time algorithm for sampling from a categorical distribution (Kronmal and Peterson Jr, 1979). The alias method needs a simple preprocessing step with time complexity $O(n)$. The intuition is that we can partition the probability interval $[0, 1)$ into a series of buckets such that when we pick a random value (16 bits) in the range, it ends up in some bucket with probability equal to the size of the bucket.

Alias Method. Suppose the finite alphabet has $n$ characters with weights $k_0, \cdots, k_{n-1}$. Because we use integer-based probabilities with a fixed denominator, the length of the probability interval $[l_i, r_i]$ of each character must have the form $k_i \cdot 2^{-16}$. The lengths sum to one, i.e., $\sum_{i=0}^{n-1} k_i = 65536$. Given two bytes as input, we want to find the corresponding character and the value of $j$ very efficiently. This can be done through a simple trick: let $m$ be such that $2^{m-1} < n \leq 2^m = M$, and set up $2M$ numbers $a_0, b_0, \cdots, a_{M-1}, b_{M-1}$ such that:

  1. $a_i + b_i = 2^{16-m}$;

  2. each $a_i, b_i$ is associated with one of the $n$ characters, $u_i \in [n]$ and $v_i \in [n]$, respectively;

  3. $\forall j, \sum_{u_i = j} a_i + \sum_{v_i = j} b_i = k_j$, i.e., the numbers associated with a character sum to its weight.

This decomposition is possible for any discrete distribution with a finite number of outcomes. The 16 input bits can be regarded as a value in $[0, 65536)$, and the first $m$ bits represent the index $P$ of a bucket. Since each bucket contains two characters, we find the correct character by comparing the last $16 - m$ bits with the character boundary $a_P$. Once the character is identified, we can output it along with extra information. The function Translate has a constant time complexity. The latter two terms of line 7 and line 10 can be calculated in advance, making this function faster.
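The bucket construction and the constant-time lookup can be sketched in Python as follows. This is a sketch under stated assumptions rather than the paper's Inv-Translate: it uses the same greedy min/max pairing as the induction in Appendix B, pads the alphabet with zero-weight dummy characters up to $M$ entries, and returns only the character (not the extra information). The names `build_alias_table` and `lookup` are hypothetical.

```python
def build_alias_table(weights):
    """Split the 2^16 code words into M = 2^m buckets of two characters each.

    weights: positive integers k_0..k_{n-1} summing to 65536.
    Returns (m, table) with table[i] = (a_i, u_i, v_i); b_i = 2^(16-m) - a_i.
    """
    assert sum(weights) == 1 << 16
    n = len(weights)
    m = (n - 1).bit_length()            # smallest m with n <= 2^m
    cap = 1 << (16 - m)                 # code words per bucket
    remaining = dict(enumerate(weights))
    for dummy in range(n, 1 << m):      # pad to exactly M entries
        remaining[dummy] = 0
    table = []
    while remaining:
        u = min(remaining, key=remaining.get)
        a = remaining.pop(u)            # a <= cap: the minimum is at most the mean
        if remaining:
            v = max(remaining, key=remaining.get)
            remaining[v] -= cap - a     # the maximum can absorb the rest
        else:
            v = u                       # last bucket: a == cap, so b == 0
        table.append((a, u, v))
    return m, table


def lookup(m, table, word):
    """Map a 16-bit word to its character with one index and one comparison."""
    low_bits = 16 - m
    a, u, v = table[word >> low_bits]   # the first m bits select the bucket
    return u if (word & ((1 << low_bits) - 1)) < a else v
```

For the illustrative weights `[32768, 16384, 16384]`, the table has four buckets of 16384 words each, and counting `lookup` results over all 65536 words reproduces the weights exactly.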

1 Function Decode(bitStream, $record$):
2       $data \leftarrow 0$
3       $size \leftarrow 1$
5       while $record$ has unfilled attributes do
6             $16bits \leftarrow$ next 16 bits from $bitStream$
7             $value, a, k \leftarrow \texttt{Inv-Translate}(16bits)$
              /* Inv-Translate can simply be a map. */
8             $data \leftarrow data \cdot k + a$
9             $size \leftarrow size \cdot k$
10            Fill $record$ with $value$.
12            if $size \geq 2^{16}$ then
13                  $interval \leftarrow data\ \&\ \texttt{0xffff}$
14                  $bitStream \leftarrow interval + bitStream$
15                  $data \leftarrow data \gg 16$
16                  $size \leftarrow size \gg 16$
20      return $record$
Algorithm 5 Decoding Procedure of Delayed Coding
1 Find $m, \{a_M\}, \{b_M\}, \{u_M\}, \{v_M\}$ by performing decomposition on $\{k_n\}$.
2 Function Inv-Translate(16bits):
3       $P \leftarrow 16bits \gg (16 - m)$
4       $Q \leftarrow 16bits\ \&\ (2^{16-m} - 1)$
5       if $Q < a_P$ then
6             $c \leftarrow u_P$
7             $j \leftarrow 16bits - \sum_{z < P} a_z \mathds{1}_{\{u_z \neq c\}} - \sum_{z < P} b_z \mathds{1}_{\{v_z \neq c\}}$
8       else
9             $c \leftarrow v_P$
10            $j \leftarrow 16bits - \sum_{z \leq P} a_z \mathds{1}_{\{u_z \neq c\}} - \sum_{z < P} b_z \mathds{1}_{\{v_z \neq c\}}$
12      return $c, j, k_c$
Algorithm 6 Inv-Translate of Blitzcrank
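To make the constant-time lookup concrete, the following is a minimal Python sketch of Inv-Translate with the subtracted terms folded into per-bucket constants. It is an illustration under stated assumptions, not the authors' implementation: the names `build_skip_tables`, `inv_translate`, `SKIP_A`, and `SKIP_B` are hypothetical, the two-bucket table is hand-built for the weights $k = [49152, 16384]$, and the subtracted terms are read here as the probability mass that precedes the matched segment and belongs to other characters.

```python
M = 1                                  # number of bucket-index bits (chosen for this example)
BUCKET = 1 << (16 - M)                 # 2^(16-m) slots per bucket

# Hand-built decomposition for weights k = [49152, 16384] (characters 0 and 1).
A = [16384, 0]                         # a_P: size of the first segment in bucket P
B = [16384, 32768]                     # b_P: size of the second segment in bucket P
U = [1, 0]                             # u_P: character owning the first segment
V = [0, 0]                             # v_P: character owning the second segment


def build_skip_tables(a, b, u, v, n_chars):
    """Precompute, per segment, the amount to subtract from the 16-bit value."""
    seen = [0] * n_chars               # mass of each character in earlier segments
    pos = 0                            # start position of the current segment
    skip_a, skip_b = [], []
    for p in range(len(a)):
        assert a[p] + b[p] == BUCKET
        skip_a.append(pos - seen[u[p]])   # mass before this segment not owned by u_P
        seen[u[p]] += a[p]
        pos += a[p]
        skip_b.append(pos - seen[v[p]])   # mass before this segment not owned by v_P
        seen[v[p]] += b[p]
        pos += b[p]
    return skip_a, skip_b, seen        # seen now equals the weights k_c


SKIP_A, SKIP_B, K = build_skip_tables(A, B, U, V, n_chars=2)


def inv_translate(bits16):
    """Return (character c, offset j inside c's interval, weight k_c)."""
    p = bits16 >> (16 - M)             # bucket index: first m bits
    q = bits16 & (BUCKET - 1)          # position inside the bucket: last 16-m bits
    if q < A[p]:
        c, j = U[p], bits16 - SKIP_A[p]
    else:
        c, j = V[p], bits16 - SKIP_B[p]
    return c, j, K[c]
```

Each lookup costs one shift, one mask, one comparison, and one subtraction, regardless of the number of characters.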

Decomposition. It remains to show that such a decomposition is indeed possible. It can be obtained through the following procedure:

  1. (1)

    Initially, let $S = \{i : k_i < 2^{16-m}\}$ and $L = \{i : k_i \geq 2^{16-m}\}$.

  2. (2)

    If $S \neq \varnothing$, choose $k_s \in S, k_l \in L$, associate $a_i, b_i$ to $k_s$ and $2^{16-m} - k_s$, respectively, and add $k_l - (2^{16-m} - k_s)$ to $S$ or $L$ accordingly.

  3. (3)

    Otherwise, choose $k_l \in L$, associate $a_i, b_i$ to $0, 2^{16-m}$, respectively, and add $k_l - 2^{16-m}$ to $S$ or $L$ accordingly.

  4. (4)

    If $S \neq \varnothing$ or $L \neq \varnothing$, return to (2).

It can be proved via induction that at the end of step (3), $|S| + |L| < |S'| + |L'|$ always holds, where $S', L'$ are the sets in the previous loop after step (3), so the correctness of the procedure can be guaranteed.
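The procedure above can be sketched in a few lines of Python. This is a hedged illustration rather than Blitzcrank's code: the function name `decompose` and the example weights are hypothetical, and the sketch assumes that every weight is positive, that the weights sum to $2^{16}$, and that $2^m$ is at least the number of characters (so that $L$ is never empty while $S$ is not).

```python
def decompose(weights, m):
    """Split weights (summing to 2**16) into 2**m buckets of two segments each."""
    bucket = 1 << (16 - m)
    assert sum(weights) == 1 << 16 and all(k > 0 for k in weights)
    assert len(weights) <= 1 << m          # assumption of this sketch
    small = [(k, c) for c, k in enumerate(weights) if k < bucket]    # set S
    large = [(k, c) for c, k in enumerate(weights) if k >= bucket]   # set L
    a, b, u, v = [], [], [], []

    def put_back(k, c):
        # Return the remaining weight of character c to S or L; drop it if zero.
        if k >= bucket:
            large.append((k, c))
        elif k > 0:
            small.append((k, c))

    while small or large:
        kl, cl = large.pop()
        if small:                          # step (2): pair a small weight with a large one
            ks, cs = small.pop()
            a.append(ks)
            u.append(cs)
            b.append(bucket - ks)
            v.append(cl)
            put_back(kl - (bucket - ks), cl)
        else:                              # step (3): a large weight fills a whole bucket
            a.append(0)
            u.append(cl)
            b.append(bucket)
            v.append(cl)
            put_back(kl - bucket, cl)
    return a, b, u, v


weights = [40000, 20000, 5000, 536]        # hypothetical example, sums to 65536
a, b, u, v = decompose(weights, m=2)
```

The result can be checked against conditions (1) to (3): every bucket sums to $2^{16-m}$, and the segments owned by each character add up to its weight.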

Discussion: The Alias method tells us that we can sample from a discrete distribution in $O(1)$ time. This process has two parts: (1) get a random value; (2) find the corresponding symbol in the distribution. When decompressing delayed codes, we read the “random value” from the bit stream and then generate the corresponding symbols. Such symbols are deliberately prepared and stored during compression.

Appendix D Uniqueness and Efficiency

D.1. Uniqueness of Delayed Coding

We now show that the delayed code is uniquely decodable. We first prove that such a virtual-bits input can be constructed. Recall that we use 16 bits to represent a probability interval and collect the extra bits in each probability interval as virtual bits. Suppose there are probability intervals $[L_i, R_i)$, where $k_i = R_i - L_i$, $i = 1, \cdots, n$, and $\prod_{i=1}^{n} k_i \geq 2^{16}$. We want to store a probability interval (a 16-bit integer) $D$ with the virtual bits provided by its former intervals, where $D \in [1, 2^{16})$. Note that for each $[L_i, R_i)$, we can store an integer $a_i$ if and only if $0 \leq a_i / k_i < 1$. Thus, let $d_{n+1} = D$, and

$$a_i = d_{i+1} \ \%\ k_i, \qquad d_i = \lfloor d_{i+1} / k_i \rfloor, \tag{4}$$

for $i = 1, \cdots, n$. Then the following decomposition holds:

$$D = d_{n+1} = d_n \cdot k_n + a_n = \cdots = \sum_{i=1}^{n} a_i \prod_{j=i+1}^{n} k_j,$$

which means that the information $D$ is divided into $\{a_n\}$ and stored by these probability intervals. Then, the code corresponding to probability interval $[L_i, R_i)$ is $L_i + a_i$. Intuitively, $d_i$ denotes the information we have gained from the first $i - 1$ probability intervals. For example, $d_{n+1} = D$ and $d_1 = 0$, since

$$0 \leq d_1 = \lfloor d_2 / k_1 \rfloor \leq \cdots \leq \lfloor d_{n+1} / \prod_{i=1}^{n} k_i \rfloor = \lfloor D / \prod_{i=1}^{n} k_i \rfloor = 0.$$

Moreover, we can recover $D$ from $\{a_n\}$; notice that

$$d_{i+1} = d_i \cdot k_i + a_i. \tag{5}$$

The virtual bits can be updated according to Equation 5 since $d_i$ is known and $a_i$, $k_i$ are given by translating the $i$th probability interval.
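The construction is a mixed-radix number system, which the following Python sketch makes concrete. The function names `split` and `recover` and the example widths are hypothetical; the sketch assumes $D < \prod_i k_i$ and follows the recurrence of Equation 5 (and its inverse, one `divmod` per interval).

```python
def split(D, ks):
    """Split D into digits a_i with 0 <= a_i < k_i, last interval first."""
    digits = [0] * len(ks)
    d = D                                  # d_{n+1} = D
    for i in reversed(range(len(ks))):
        d, digits[i] = divmod(d, ks[i])    # inverse of d_{i+1} = d_i * k_i + a_i
    assert d == 0                          # d_1 = 0 because D < prod(ks)
    return digits


def recover(digits, ks):
    """Rebuild D by applying d_{i+1} = d_i * k_i + a_i from the first interval on."""
    d = 0                                  # d_1 = 0
    for a, k in zip(digits, ks):
        d = d * k + a
    return d


ks = [50, 40, 60]                          # hypothetical interval widths, product 120000 >= 2**16
digits = split(54321, ks)                  # -> [22, 25, 21]
restored = recover(digits, ks)             # -> 54321
```

Because every digit satisfies $0 \leq a_i < k_i$, each $L_i + a_i$ stays inside its own interval, which is what the uniqueness argument relies on.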

Since the virtual-bits input has been proven to be correct, it suffices to show that the obtained code $L_i + a_i$, $i = 1, \cdots, n$, is uniquely decodable, because its uniqueness is equivalent to that of a simple concatenation of probability interval codes. Recall that each outcome is assigned a disjoint probability interval. Therefore, any number in the interval is a unique identifier. Note that

$$L_i + a_i \in [L_i, R_i)$$

for $i = 1, \cdots, n$, since $a_i$ is a remainder modulo $k_i$ (Equation 4) and hence $0 \leq a_i < k_i$, so $L_i + a_i$ is obviously a unique representation. Also, since each codeword is 16 bits long, it is a prefix code. A prefix code is always uniquely decodable, and so is the delayed code.

D.2. Efficiency of Delayed Coding

Suppose $[L_1, R_1), \cdots, [L_n, R_n)$ are $n$ intervals, and $R_i - L_i$ are random variables such that

$$P(R_i - L_i = x) = \frac{x}{\sum_{j=1}^{65536} j}, \quad x \in [0, 65536]; \text{ otherwise } 0.$$

Let $\mu = \mathbb{E}(R_i - L_i)$. Suppose delayed coding generates one virtual 16-bit value once the number of virtual bits is larger than $\alpha$, $\alpha \geq 16$. Delayed coding encodes every $m$ intervals as a block, where $n + 1 \geq m \gg \mu$.

Compared to entropy, delayed coding may use more bits for two reasons: (1) integer division; and (2) unused virtual bits when coding ends.

Integer Division Loss. Let $D_i$ be the value of the virtual bits at the $i^{th}$ time a virtual 16-bit value is generated, and let $I_i$ be the index of the current interval at that time. We have $D_i \geq 2^{\alpha}$, and $E(I_1) = \alpha / (1 - \log_2(\mu))$. Also, let $\prod_i = \prod_{j=I_i}^{n} (R_j - L_j)$. Notice that

$$D_i / 2^{16} = \lfloor D_i / 2^{16} \rfloor + r_i / 2^{16},$$

where $r_i = D_i \ \%\ 2^{16}$. According to entropy, the total number of virtual bits available from interval $I_i$ to $n$ is $16 + \log_2(D_i / 2^{16} \cdot \prod_i)$. But due to integer division, we only have

$$16 + \log_2\left(\left\lfloor \frac{D_i}{2^{16}} \right\rfloor \cdot \prod\nolimits_i\right) = 16 + \log_2\left(\left(\frac{D_i}{2^{16}} - \frac{r_i}{2^{16}}\right) \cdot \prod\nolimits_i\right)$$

bits for encoding intervals $I_i$ to $n$. Thus, the bit loss due to the $i^{th}$ generation of virtual bytes is

$$\log_2\left(\frac{D_i}{D_i - r_i}\right).$$
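The loss term follows by subtracting the bits actually available from the bits that entropy would allow; the common factor $\prod_i / 2^{16}$ cancels. Spelled out with the symbols defined above:

```latex
\begin{aligned}
&\Bigl(16 + \log_2\bigl(\tfrac{D_i}{2^{16}} \cdot \textstyle\prod_i\bigr)\Bigr)
 - \Bigl(16 + \log_2\bigl(\tfrac{D_i - r_i}{2^{16}} \cdot \textstyle\prod_i\bigr)\Bigr) \\
&\qquad = \log_2 \frac{D_i \cdot \prod_i / 2^{16}}{(D_i - r_i) \cdot \prod_i / 2^{16}}
 = \log_2 \frac{D_i}{D_i - r_i}.
\end{aligned}
```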

Note that $D_i \geq 2^{\alpha}$ and $\alpha \geq 16$, so the total bit loss due to integer division satisfies

$$\sum_i \log_2\left(\frac{D_i}{D_i - r_i}\right) \leq \sum_i \log_2\left(\frac{2^{\alpha}}{2^{\alpha} - r_i}\right).$$

Notice that there are at most $\log_2 \mu^n = n \log_2 \mu$ redundant bits in these intervals, so the number of virtual-byte generations is less than $(n \log_2 \mu - (\alpha - 16)) / 16$. We also have $r_i = D_i \ \%\ 2^{16} < 2^{16}$, therefore

$$\sum_i \log_2\left(\frac{2^{\alpha}}{2^{\alpha} - r_i}\right) \leq \frac{n \log_2 \mu - (\alpha - 16)}{16} \cdot \log_2 \frac{2^{\alpha}}{2^{\alpha} - 65535}.$$

Unused Bits Loss. On the other hand, we consider the unused virtual bits when coding ends. Since there are $n$ intervals and every $m$ intervals are encoded as a block, we have $\lfloor n/m \rfloor + 1$ blocks. Therefore, the total bit loss due to unused virtual bits at the end is

$$\sum_{k=1}^{\lfloor n/m \rfloor + 1} \log_2 D_{\infty}^{k} \leq (\lfloor n/m \rfloor + 1) \cdot \alpha,$$

where $D_{\infty}^{k}$ denotes the value of the virtual bits when the $k^{th}$ block ends, and we have $2^{\alpha - 16} \leq D_{\infty}^{k} < 2^{\alpha}$.

Length of Delayed Codes. Thus, the length of delayed codes is

$$L^{n,\alpha,m} \leq n \cdot C + (\lfloor n/m \rfloor + 1) \cdot \alpha + \frac{n \log_2 \mu}{16} \cdot \log_2 \frac{2^{\alpha}}{2^{\alpha} - 65535},$$

where $C$ is the entropy of an interval with the given distribution. In fact, $\mu \approx 43690.33$, so $\log_2 \mu < 16$. Simplifying, we get

$$L^{n,\alpha,m} \leq n \cdot C + (\lfloor n/m \rfloor + 1) \cdot \alpha + n \cdot \log_2 \frac{2^{\alpha}}{2^{\alpha} - 65535}. \tag{6}$$

Letting $m = n + 1$ and $\alpha = 24$, we get the archive mode setting of delayed coding and

$$L^{n,24,n+1} \leq n \cdot C + 24 + \cdot 10^{-8} \cdot n.$$

Since an interval size is $\log_2(\mu) \gg 10^{-8}$ on average, this overhead is negligible. Letting $\alpha = 16$, we get the random access mode setting of delayed coding and

$$L^{n,16,m} \leq n \cdot C + 16 \cdot (\lfloor n/m \rfloor + 1 + n).$$
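As a check on the random access bound, the substitution $\alpha = 16$ into Equation 6 can be worked out explicitly; the logarithmic factor evaluates to exactly 16:

```latex
\log_2 \frac{2^{16}}{2^{16} - 65535} = \log_2 \frac{65536}{1} = 16,
\qquad\text{so}\qquad
L^{n,16,m} \leq n \cdot C + (\lfloor n/m \rfloor + 1) \cdot 16 + n \cdot 16
           = n \cdot C + 16 \cdot (\lfloor n/m \rfloor + 1 + n).
```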

Appendix E Integrated Semantic Models

In this section, we introduce two integrated semantic models: the JSON node model and the time-series model.

E.1. JSON Node Model

Blitzcrank can compress collections of JSON objects. Each object follows a JSON schema, where optional nodes and multi-type nodes are allowed (Pezoa et al., 2016). For example, consider $A = \{\text{Name}: \text{``John''}, \text{Age}: 18, \text{Job}: \text{``student''}\}$ and $B = \{\text{Name}: \text{``Mary''}, \text{Age}: \text{``Eighteen''}\}$: the absence of the Job node in $B$ and the differing types of Age make them challenging to compress.

A JSON node model consists of metadata, attributes, and sub-models, as shown in Figure 18. The metadata model consists of two categorical models: the existence model and the type model. The former records node presence using two unique values (exists or not), while the latter records the node type and has four possible outputs. A multi-type node has more than one attribute model. Sub-models handle arrays or objects with a vector of pointers to other JSON node models. Together, all JSON node models form the JSON tree model for nested calling. For example, the root node “Person” is an object with three children. The optional “Job” node has an additional existence model, while the multi-type node “Age” requires a type model. Given a node, Blitzcrank verifies its existence, determines its type, and then processes it using the appropriate attribute models. If the node is an array or object, other JSON node models are invoked to handle it and generate intervals.

Figure 18. JSON Node Model - The JSON node model is shown on the left. Node “Job” is optional, and node “Age” has multiple types.
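The traversal described above can be sketched in Python as follows. This is an illustrative sketch only: the class `JsonNode`, the helpers `type_of` and `walk`, and the type labels are hypothetical, arrays are omitted for brevity, and the output is just the list of symbols (existence, type, value) that the corresponding models would have to encode, not actual probability intervals.

```python
from dataclasses import dataclass, field


@dataclass
class JsonNode:
    name: str
    optional: bool = False                        # optional nodes need an existence model
    types: tuple = ("string",)                    # more than one entry: needs a type model
    children: list = field(default_factory=list)  # sub-models for object nodes


def type_of(value):
    """Map a Python value to a (hypothetical) JSON type label."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "string"
    return "number"


def walk(node, parent, out):
    """Append the symbols that this node's models would have to encode."""
    present = node.name in parent
    if node.optional:
        out.append((node.name, "exists", present))
    if not present:
        return out
    value = parent[node.name]
    kind = type_of(value)
    if len(node.types) > 1:
        out.append((node.name, "type", kind))
    if kind == "object":
        for child in node.children:               # nested call into child node models
            walk(child, value, out)
    else:
        out.append((node.name, "value", value))
    return out


person = JsonNode("Person", types=("object",), children=[
    JsonNode("Name"),
    JsonNode("Age", types=("number", "string")),
    JsonNode("Job", optional=True),
])

symbols_a = walk(person, {"Person": {"Name": "John", "Age": 18, "Job": "student"}}, [])
symbols_b = walk(person, {"Person": {"Name": "Mary", "Age": "Eighteen"}}, [])
```

For object $B$, the walk emits a type symbol for “Age” and a single "not present" symbol for “Job”, so the missing node costs one categorical outcome instead of breaking the fixed schema.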

E.2. Time-series/Markov Model

Both the time-series and Markov models capture the first-order transition property in a continuous or discrete column. The time-series model uses the autoregressive moving average (ARMA) technique to decompose a series of numeric values into residuals and a regression model (Box et al., 2015). Compressing these residuals instead of the original values improves performance since the residuals have fewer outliers and a more symmetric distribution. Similarly, the Markov model is for a categorical column that forms a Markov chain. It has multiple categorical models, each corresponding to a unique value/state. The current value determines which categorical model is used to compress the next one, providing larger intervals for delayed coding. However, the time-series/Markov model in Blitzcrank cannot be used with the record index.
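A minimal Python sketch of the Markov idea follows: one frequency table per previous state, so that the next value is modeled conditionally. The function names `fit_markov` and `conditional_prob` and the toy column are hypothetical and only illustrate why conditioning can yield larger intervals.

```python
from collections import Counter, defaultdict


def fit_markov(column):
    """Build one categorical frequency table per previous state."""
    tables = defaultdict(Counter)
    for prev, cur in zip(column, column[1:]):
        tables[prev][cur] += 1
    return tables


def conditional_prob(tables, prev, cur):
    """Estimated probability of `cur` given that the previous value was `prev`."""
    total = sum(tables[prev].values())
    return tables[prev][cur] / total


column = ["A", "A", "B", "A", "A", "A", "B", "A"]   # toy categorical column
tables = fit_markov(column)

p_a_given_b = conditional_prob(tables, "B", "A")     # "B" is always followed by "A"
p_a_overall = column.count("A") / len(column)        # unconditional frequency of "A"
```

In this toy column, "A" after "B" gets the whole probability range, whereas a single unconditional model would give "A" only three quarters of it; a wider interval means fewer bits per value.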

Appendix F Experiments

In this section, we compare Blitzcrank with Squish (Gao and Parameswaran, 2016), Gzip (Ziv and Lempel, 1977), and Zstandard (level 9) on the table archive task. We also evaluate the time-series and JSON models.

F.1. Archive Compression

Figure 19. Compression for Archive - We compare the performance of archive-focused compressors, including Squish, Gzip, Zstandard and Blitzcrank on real tables. Zstandard and Blitzcrank are in the table archival mode.

We also evaluate Blitzcrank for the table archive task. Each compressor compresses the given table in memory and then decompresses it back to the original, which is the typical usage of general-purpose compressors. We record the compression factor, the compression time, and the decompression time. Archive compression focuses on the compression factor and throughput rather than latency. Squish is omitted for dblp because it does not support JSON.
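
The measurement procedure can be sketched as a small harness. The version below is illustrative only: it uses Python's standard `zlib` module (DEFLATE, the algorithm underlying Gzip) as a stand-in compressor, a synthetic table, and a hypothetical helper `archive_benchmark`. It also assumes that the compression factor is the uncompressed size divided by the compressed size.

```python
import time
import zlib


def archive_benchmark(data, compress, decompress):
    """Compress and decompress `data` in memory; report factor and timings."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    # An archive compressor must reproduce the original table exactly.
    assert restored == data, "round trip did not reproduce the input"
    return {
        "factor": len(data) / len(packed),  # assumed: original / compressed
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Synthetic CSV-like table; not one of the data sets used in the paper.
rows = [f"{i},{i % 7},student,{18 + i % 5}\n" for i in range(10_000)]
table = "".join(rows).encode()

result = archive_benchmark(table, lambda b: zlib.compress(b, 9), zlib.decompress)
print(result)
```

Any other compressor can be plugged in by passing a different pair of `compress` and `decompress` callables, which keeps the three recorded metrics comparable across compressors.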

Figure 19 shows the archive results. For the compression factor, Blitzcrank compresses categorical and numeric tables comparably to Squish, and outperforms it on string tables. This is because our semantic string model is more effective than the letter-by-letter method. Also, Blitzcrank-Archive doubles the compression factor in US Census 1990 compared to Blitzcrank in random access mode. For the compression speed, Blitzcrank is $20\times$ faster than Squish on average for both insertion and access latency, thanks to our semantic models and delayed coding. Zstandard excels at archiving, which is expected given its use of a dictionary for compression. In the table archiving task, Zstandard can build a dictionary based on a longer context than in the random access task. The use of longer dictionary entries allows for the output of more bytes per access, thus increasing throughput (Ziv and Lempel, 1978).

F.2. Semi-Structured Data and Time Series

Table 3. Compression Factors - Blitzcrank excels in compressing JSONs; Only Blitzcrank can benefit from the ARMA model.
| | Gzip | Zstd | Blitz. |
|---|---|---|---|
| Relation | 4.13 | 4.86 | 5.32 |
| JSON | 4.41 | 5.73 | 6.98 |
| $\uparrow$ | 6.78% | 17.9% | 31.2% |
| Original | 4.33 | 4.64 | 4.90 |
| Residual | 4.29 | 4.31 | 6.81 |
| $\uparrow$ | -0.92% | -7.11% | 39.0% |

dblp (dbl, 2022) is in JSON format. To investigate how Blitzcrank performs on different data formats, we extract each attribute from dblp into its own column. The extracted columns form a relational table with empty values, called relation-dblp. All compressors run in archive mode in this experiment. Table 3 shows that the compressors do not reach the same compression factors on relation-dblp as on dblp, although the factors do not decrease proportionally. The difference comes from the redundant, repeated property keys in JSON objects: including these keys raises the compression factor. Blitzcrank is the most effective compressor and gains the most from the JSON structure.
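
The construction of a relational table from JSON objects can be sketched as follows. The helper `to_relation` and the two sample records are hypothetical, and the sketch assumes flat objects (no nested arrays or objects); it is not the extraction script used for relation-dblp.

```python
def to_relation(objects):
    """Turn flat JSON objects into (columns, rows); absent keys become None."""
    columns = []
    for obj in objects:
        for key in obj:
            if key not in columns:
                columns.append(key)  # keep first-seen order of attributes
    rows = [[obj.get(key) for key in columns] for obj in objects]
    return columns, rows


# Hypothetical records; the second one lacks the "Job" attribute.
records = [
    {"Name": "John", "Age": 18, "Job": "student"},
    {"Name": "Mary", "Age": "Eighteen"},
]

columns, rows = to_relation(records)
print(columns)
for row in rows:
    print(row)
```

Each property key now appears once as a column name instead of once per object, which removes the repeated keys that the JSON representation carries.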

We then investigate whether the time-series model helps our numeric model improve performance. The time-series model converts the continuous data into residuals through the ARMA model. We compare the compression factor of each compression algorithm on the residuals and on the original data. We use the Jena-Climate data set in this experiment. Table 3 shows the results. Among the compression methods we tried, only Blitzcrank benefits from the residuals. Residuals typically follow a normal distribution, which matches the assumption of our numeric model. They also have fewer outliers and are easier to compress than the original data. We estimate the entropy of the time-series data and of its residuals by discretizing the values into 512 buckets. The results meet our expectations: the entropy is reduced by approximately 30%, which agrees with the improvement of 39% shown in Table 3.
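
The entropy estimate can be reproduced in spirit with the sketch below. Several parts are assumptions rather than details from the paper: the series is a synthetic AR(1) process (a special case of ARMA) instead of Jena-Climate, the coefficient is fitted by least squares, and the same bucket width (the range of the original series divided by 512) is applied to both the series and its residuals so that both are quantized at the same precision.

```python
import math
import random
from collections import Counter


def bucket_entropy(values, width):
    """Empirical entropy (bits per value) after quantizing with a fixed width."""
    counts = Counter(math.floor(v / width) for v in values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


random.seed(0)

# Synthetic AR(1) series: x[t] = 0.95 * x[t-1] + Gaussian noise.
series = [0.0]
for _ in range(20_000):
    series.append(0.95 * series[-1] + random.gauss(0.0, 1.0))

# Least-squares estimate of the AR(1) coefficient.
num = sum(cur * prev for prev, cur in zip(series, series[1:]))
den = sum(prev * prev for prev in series[:-1])
phi_hat = num / den

# Residuals are what remains after the regression model's prediction.
residuals = [cur - phi_hat * prev for prev, cur in zip(series, series[1:])]

# Shared quantization step: range of the original series split into 512 buckets.
width = (max(series) - min(series)) / 512

h_series = bucket_entropy(series, width)
h_resid = bucket_entropy(residuals, width)
print(f"phi_hat = {phi_hat:.3f}")
print(f"entropy of series:    {h_series:.2f} bits/value")
print(f"entropy of residuals: {h_resid:.2f} bits/value")
print(f"reduction: {100 * (1 - h_resid / h_series):.1f}%")
```

As a rough consistency check on the reported numbers, a 30% entropy reduction shrinks the ideal compressed size to 0.7 of its former value, which corresponds to a compression factor increase of up to $1/0.7 \approx 1.43\times$. The observed 39% improvement is close to this bound.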