Wavelets Are All You Need for Autoregressive Image Generation

Wael Mattar, Idan Levy, Nir Sharon and Shai Dekel
(28th June 2024)

 

Abstract. In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows us to tokenize the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.

 

1 Introduction

The generation of high-resolution visual information is certainly one of the most remarkable achievements of modern-age artificial intelligence. One of the prominent methods is diffusion-based models (1, 2, 3, 4, 5). In essence, diffusion models attempt to learn inversions of ill-posed operators, such as additive Gaussian noise, blurring, etc., so an image may be generated from random noisy or blurry seeds. One then enforces various conditions on the images created through the time steps of the inversion process so that the final generated image may correspond to a given text prompt.

Another line of research is designing autoregressive models, in an attempt to leverage the powerful Large Language Models (LLMs) (6, 7, 8). These models (9, 10) provide different methods to convert the image pixel representation to a series of visual tokens and then apply generative language techniques.

In this paper, we refine this line of research and provide a mathematically robust approach to the autoregressive image generation process. To this end, we reach out to a classic technique in image processing, specifically, wavelet image coding (11, 12, 13). Wavelets (14, 15, 16) are one of the main tools of modern approximation theory for nonlinear, adaptive approximation. The various wavelet transforms provide the means to transform an image into a representation that captures the essence of the visual information in a sparse way. Typically, the significant wavelet coefficients are a small fraction of the coefficients and represent important edge and texture information, while the insignificant coefficients with small absolute values are associated with smooth regions of the image. The goal of wavelet image compression is then to efficiently store the information of only the significant coefficients. In fact, the underlying method of the popular JPEG image compression algorithm (17), invented in the 80s, contains many elements of wavelet coding, where a local Discrete Cosine Transform, a precursor of wavelets, is used. However, in this paper, we leverage the progressive wavelet compression technique, a more advanced form of image compression. It creates a bit-stream where every bit corresponds to the next most important piece of visual information. Since we are generating images rather than decoding them from a compressed file, there is no need to create actual binary bit-streams, and using a ‘wavelet language’ of a limited number of tokens is sufficient.

Thus, our new approach to autoregressive image generation is based on two main ingredients. The first is progressive wavelet image coding, which allows us to tokenize the visual information of an image from coarse to fine details. This is achieved using only 7 tokens, by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second ingredient is a variant of an NLP decoder-only transformer (6, 7, 8) whose architecture was re-designed and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions (18, 19, 20). During inference, this allows the generation of visually meaningful images from an initial random seed sampled from the distribution of the scaling function coefficients at the lowest resolution.

Using the wavelet autoregressive approach, where the ‘wavelet language’ contains only 7 tokens, provides many attractive features. The length of the token sequences during training or inference can be flexible, where longer sequences imply more detailed or higher-resolution images. Guiding the generative process using a class affiliation or text prompting is easily achieved by concatenating the corresponding vector representations to the tokens’ vector representation of dimension 7. Stochastic control, using simple transformer inference techniques, allows one to create a diversity of corresponding images from a single textual prompt. Furthermore, since each token is associated with the local support in the image domain of the corresponding wavelet, one can switch the guidance during the generative process to allow different prompting for different regions.

Our paper is organized as follows. We begin with a review of related work in Section 2. In Section 3, we review wavelet image coding and explain how one may extract from the classical theory the ability to tokenize the visual information of images. In Section 4, we focus on the components of language transformers that we redesign to serve our special wavelet language. We further provide several methods that can be used to direct the generation process under certain conditions: class label and/or textual prompt. In Section 5, we provide experimental results. Finally, in Section 6, we discuss possible future applications of our method, such as multi-modality generation and compositions of blobs.

2 Related work

In this section, we first review the current state of the art in image generation. We then review some methods that apply wavelets as a frequency decomposition backbone for various aspects of style transfer, acceleration, and optimization of existing image generation methods, etc.

Currently, many commercial solutions apply diffusion-based models (1, 2, 3, 4, 5, 21). In essence, diffusion models learn inversions of ill-posed operators, such as additive Gaussian noise, blurring, etc., so images may be generated from random noisy or blurry seeds. One then enforces various conditions on images created through the time steps of the inversion process so that the final generated image may correspond to a given text prompt.

The methods of VQGAN (9) and DALL-E (10) along with (22, 23) utilize a visual tokenizer to discretize images into grids of 2D tokens, which are then flattened to a 1D sequence for autoregressive learning, mirroring the process of sequential language modeling. For example, in (10) a discrete variational autoencoder is trained to compress each $256\times 256$ RGB image into a $32\times 32$ grid of image tokens, where each such token can assume 8192 possible values. This creates a relatively short context sequence of 1024 tokens, but with a vocabulary of 8192 word tokens. The TiTok method, recently introduced in (24), shows how to compose Vision Transformers with the Vector-Quantization method to arrive at an autoregressive method that may use only 32 tokens. In comparison, our method uses only 7 tokens for any image resolution and any level of fine detail generation.

In contrast to the raster-scan “next-token prediction”, the method of (25) provides autoregressive learning on images as “next-scale prediction” or “next-resolution prediction”.

Some methods, such as (26, 27), use wavelets as means for frequency decomposition representations for image inpainting, style transfer, and generative adversarial network methods. Some works propose to use wavelets as part of diffusion methods (28, 29) to speed up the diffusion approach by applying the denoising process in the wavelet regime.

3 Elements of Wavelet Image Coding

In this section, we review some elements of wavelet image coding (11, 12, 13) that we use for our generative method. Essentially, we are interested in the process that takes an image in its raw pixel form as input and generates a sequence of tokens that capture its visual details. The structure of the sequence from coarse to fine details is achieved by ordering the information starting with the most significant bits of the most significant wavelet coefficients. As in wavelet coding, we also aim to create token sequences that are as short as possible. This creates shorter contexts for the transformer decoder and improves its performance. However, in our generative application, we are not concerned with efficient encoding of the token sequence to a compressed stream of bits. This is a subtopic of information theory and is typically implemented using arithmetic encoders.

3.1 Wavelet Transforms

A univariate wavelet system (14, 16) is a family of real functions $\{\psi_{j,k} : (j,k)\in\mathbb{Z}^{2}\}$ in $L_{2}(\mathbb{R})$, built by dilating and translating a unique mother wavelet function $\psi$

$$\psi_{j,k}(x) := 2^{-j/2}\,\psi(2^{-j}x-k),$$

where the mother wavelet typically has compact support (or fast decay) and has $r$ vanishing moments

$$\int_{\mathbb{R}} x^{k}\psi(x)\,dx = 0,\qquad k=0,1,\ldots,r-1. \tag{1}$$

Wavelet systems can be constructed to serve as a basis of $L_{2}(\mathbb{R})$. To facilitate applications, one then also constructs a dual $\tilde{\psi}$ of $\psi$, where $\langle\psi_{j,k},\tilde{\psi}_{j',k'}\rangle=\delta_{j,j'}\delta_{k,k'}$, so that for each $f\in L_{2}(\mathbb{R})$,

$$f=\sum_{j,k}\langle f,\tilde{\psi}_{j,k}\rangle\,\psi_{j,k}.$$

For special choices of $\psi$, the set $\{\psi_{j,k}\}$ forms an orthonormal basis for $L_{2}(\mathbb{R})$, and then $\psi=\tilde{\psi}$.

Usually, one starts the construction of a wavelet system from a Multi-Resolution Analysis (MRA) generated by a scaling function $\varphi\in L_{2}(\mathbb{R})$ that satisfies a two-scale equation

$$\varphi=\sum_{k}a_{k}\,\varphi(2\cdot-k).$$

One then sets

$$V_{j}=\overline{\mathrm{span}}\left\{\varphi_{j,k}:=2^{-j/2}\varphi(2^{-j}\cdot-k) : k\in\mathbb{Z}\right\},\quad j\in\mathbb{Z},$$

which implies (under certain mild conditions)

$$\cdots\subset V_{2}\subset V_{1}\subset V_{0}\subset V_{-1}\subset V_{-2}\subset\cdots,\quad \cap V_{j}=\{0\},\quad \cup_{j}V_{j}=L_{2}(\mathbb{R}).$$

Again, to facilitate applications, one may also construct a dual $\tilde{\varphi}$ of $\varphi$, where $\langle\varphi_{0,k},\tilde{\varphi}_{0,k'}\rangle=\delta_{k,k'}$, so that for each $f\in V_{j}$,

$$f=\sum_{k}\langle f,\tilde{\varphi}_{j,k}\rangle\,\varphi_{j,k}.$$

Equipped with the MRA, one then proceeds to construct the wavelet $\psi$ such that $W_{j}:=\overline{\mathrm{span}}\{\psi_{j,k} : k\in\mathbb{Z}\}$, with $V_{j+1}+W_{j+1}=V_{j}$. A classic example of an orthonormal MRA and wavelet system, where $\varphi=\tilde{\varphi}$ and $\psi=\tilde{\psi}$, is given by the Haar scaling function and Haar wavelet

$$\varphi(x):=\begin{cases}1, & x\in[0,1],\\ 0, & \text{else},\end{cases}\qquad \psi(x):=\begin{cases}1, & x\in\left[0,\frac{1}{2}\right),\\ -1, & x\in\left[\frac{1}{2},1\right],\\ 0, & \text{else}.\end{cases}$$
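As a small illustration of the Haar system, the following sketch applies one level of the orthonormal Haar transform to a discrete signal and inverts it. It is a minimal example under our own assumptions: the function names `haar_step` and `haar_inverse_step` and the sample signal are hypothetical and are not taken from the paper's implementation.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar normalization


def haar_step(signal):
    """One Haar level on an even-length list: pairwise averages and differences."""
    half = len(signal) // 2
    approx = [SCALE * (signal[2 * i] + signal[2 * i + 1]) for i in range(half)]
    detail = [SCALE * (signal[2 * i] - signal[2 * i + 1]) for i in range(half)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert haar_step: rebuild each pair from its average and difference."""
    out = []
    for a, d in zip(approx, detail):
        out.append(SCALE * (a + d))
        out.append(SCALE * (a - d))
    return out


# Hypothetical sample: constant pairs except one pair containing a jump.
signal = [4.0, 4.0, 4.0, 4.0, 9.0, 1.0, 2.0, 2.0]
approx, detail = haar_step(signal)
reconstructed = haar_inverse_step(approx, detail)
```

The detail (wavelet) coefficients vanish wherever the signal is locally constant and are nonzero only for the pair containing the jump, while the transform preserves the energy of the signal and is exactly invertible up to floating-point rounding.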

The bivariate Haar system (see below) is a good choice when working with piecewise constant images, such as the MNIST handwritten digits (30). For some of our experiments, we use a famous wavelet system from the Cohen–Daubechies–Feauveau (CDF) family of wavelets (14), which is sometimes termed bior4.4 ($r=4$ in (1)) or [9,7] in the signal processing community (the supports of the scaling functions and wavelets, as well as the lengths of the associated filters, are 9 and 7). The generating functions of the bior4.4 are depicted in Figure 1.

Figure 1: The CDF [9,7] wavelet system (figure reproduced from (31)).

The wavelet model can be easily generalized to any dimension, via a tensor product of the wavelet and the scaling functions. Assume that the univariate dual scaling functions $\varphi,\tilde{\varphi}$ and dual wavelets $\psi,\tilde{\psi}$ are given. Then, a bivariate wavelet basis is constructed using three types of basic wavelets

$$\psi^{1}(x_{1},x_{2}):=\varphi(x_{1})\psi(x_{2}),\quad \psi^{2}(x_{1},x_{2}):=\psi(x_{1})\varphi(x_{2}),\quad \psi^{3}(x_{1},x_{2}):=\psi(x_{1})\psi(x_{2}),$$

$$\tilde{\psi}^{1}(x_{1},x_{2}):=\tilde{\varphi}(x_{1})\tilde{\psi}(x_{2}),\quad \tilde{\psi}^{2}(x_{1},x_{2}):=\tilde{\psi}(x_{1})\tilde{\varphi}(x_{2}),\quad \tilde{\psi}^{3}(x_{1},x_{2}):=\tilde{\psi}(x_{1})\tilde{\psi}(x_{2}).$$

The bivariate wavelet transform of $f\in L_{2}(\mathbb{R}^{2})$, in terms of the bivariate wavelet tensor basis

$$\psi^{e}_{j,k}:=2^{-j}\psi^{e}(2^{-j}\cdot-k),\quad \tilde{\psi}^{e}_{j,k}:=2^{-j}\tilde{\psi}^{e}(2^{-j}\cdot-k),\quad e=1,2,3,\ j\in\mathbb{Z},\ k\in\mathbb{Z}^{2},$$

is then

$$f=\sum_{e=1,2,3,\ j\in\mathbb{Z},\ k\in\mathbb{Z}^{2}}\langle f,\tilde{\psi}^{e}_{j,k}\rangle\,\psi^{e}_{j,k}.$$

The bivariate wavelet decomposition can thus be interpreted as a signal decomposition in a set of three spatially oriented frequency subbands: $LH$ ($e=1$) detects horizontal edges, $HL$ ($e=2$) detects vertical edges, and $HH$ ($e=3$) detects diagonal edges.
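The orientation selectivity of the three subbands can be checked on toy images with a one-level bivariate Haar transform. The sketch below is illustrative only: it assumes that $x_1$ is the horizontal (column) coordinate and $x_2$ the vertical (row) coordinate, and the names `haar2d_level` and `is_zero` as well as the test images are hypothetical.

```python
def haar2d_level(img):
    """One level of the orthonormal 2D Haar transform of a square image.

    img is a list of rows; returns the subbands (LL, LH, HL, HH),
    each with half the side length of the input.
    """
    n = len(img) // 2
    LL = [[0.0] * n for _ in range(n)]
    LH = [[0.0] * n for _ in range(n)]
    HL = [[0.0] * n for _ in range(n)]
    HH = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            tl, tr = img[2 * r][2 * c], img[2 * r][2 * c + 1]
            bl, br = img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]
            LL[r][c] = (tl + tr + bl + br) / 2.0  # average in both directions
            LH[r][c] = (tl + tr - bl - br) / 2.0  # average along x1, difference along x2
            HL[r][c] = (tl - tr + bl - br) / 2.0  # difference along x1, average along x2
            HH[r][c] = (tl - tr - bl + br) / 2.0  # difference in both directions
    return LL, LH, HL, HH


def is_zero(band):
    """True if every coefficient in the subband is exactly zero."""
    return all(v == 0.0 for row in band for v in row)


# Toy images (assumed for illustration): a single horizontal edge, and its transpose.
horizontal_edge = [[8.0] * 4, [0.0] * 4, [0.0] * 4, [0.0] * 4]
vertical_edge = [list(col) for col in zip(*horizontal_edge)]

h_bands = haar2d_level(horizontal_edge)
v_bands = haar2d_level(vertical_edge)
```

For the horizontal edge only the $LH$ subband is nonzero, and for the vertical edge only the $HL$ subband is nonzero, in agreement with the interpretation above.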

Under the assumption that $\psi$ and $\tilde{\psi}$ are compactly supported (or have fast decay), a wavelet coefficient $\langle f,\tilde{\psi}^{e}_{j,k}\rangle$ at a scale $j$ represents the information about the function in the spatial region in the neighborhood of $2^{j}k$, $k\in\mathbb{Z}^{2}$. At the next finer scale $j-1$, the information about this region is represented by four wavelet coefficients, which are described as the children of $\langle f,\tilde{\psi}^{e}_{j,k}\rangle$. This leads to a natural quadtree structure for each of the three subband types, as shown in Figure 2. As $j$ decreases, the child coefficients add finer and finer details into the spatial regions occupied by their ancestors.

Figure 2: Wavelet coefficient tree structure across the subbands (MRA decomposition).
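The parent-child relation of the quadtree can be written down as simple index arithmetic. The sketch below assumes the standard dyadic convention, in which the coefficient at scale $j$ and position $k=(k_1,k_2)$ has its four children at scale $j-1$ and positions $2k+(a,b)$ with $a,b\in\{0,1\}$; the function names `children` and `parent` are hypothetical, and the paper does not spell out its own indexing.

```python
def children(j, k):
    """Indices (scale, position) of the four children of coefficient (j, k)."""
    k1, k2 = k
    return [(j - 1, (2 * k1 + a, 2 * k2 + b)) for a in (0, 1) for b in (0, 1)]


def parent(j, k):
    """Index (scale, position) of the parent of coefficient (j, k)."""
    return (j + 1, (k[0] // 2, k[1] // 2))


# Example: the children of the coefficient at scale 3, position (1, 2).
kids = children(3, (1, 2))
```

The same relation holds separately within each subband type $e=1,2,3$, so a coefficient has $4^{d}$ descendants $d$ scales below it.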

In image processing, one uses the Discrete Wavelet Transform (DWT). It works by initially assuming that the image pixels $\{f_{k}=f_{k_{1},k_{2}}\}_{k_{1},k_{2}=1}^{M}$ are good approximations of the inner products of the underlying function $f$ with the shifts of the dual scaling function

$$f_{k}\approx\langle f,\tilde{\varphi}_{0,k}\rangle.$$

With these coefficients as input, one uses the DWT to compute coefficients down to some predefined low resolution $m$. For simplicity, we may assume that $M=2^{m}$ and that we use the DWT to compute

$$\{\langle f,\tilde{\varphi}_{m,k}\rangle\},\quad \{\langle f,\tilde{\psi}^{e}_{j,k}\rangle\},\quad 1\leq j\leq m,\quad e=1,2,3. \tag{2}$$
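To make the bookkeeping in (2) concrete, the following sketch counts the coefficients produced by a critically sampled DWT of an $M\times M$ image. It assumes that each level halves the side length and produces three subbands; the function name `coefficient_counts` and its `levels` parameter are hypothetical and only serve the illustration.

```python
def coefficient_counts(M, levels):
    """Count DWT coefficients of an M x M image decomposed over `levels` scales.

    Returns (number of coarse scaling coefficients,
             dict mapping scale j -> number of wavelet coefficients at scale j).
    """
    wavelet_counts = {}
    for j in range(1, levels + 1):
        side = M // 2 ** j                   # side length of each subband at scale j
        wavelet_counts[j] = 3 * side * side  # three subbands: e = 1, 2, 3
    coarse_side = M // 2 ** levels
    return coarse_side * coarse_side, wavelet_counts


# Example with M = 2^3 = 8 and a full decomposition down to m = 3.
coarse, counts = coefficient_counts(8, 3)
total = coarse + sum(counts.values())
```

In this example the transform yields 48, 12, and 3 wavelet coefficients at scales 1, 2, and 3 and a single scaling coefficient, for a total of $M^{2}=64$, the same number of values as the input pixels. Stopping one level earlier would leave a $2\times 2$ block of scaling coefficients instead.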

Wavelet representations are considered very efficient for image compression (11, 12, 13). The edge information typically constitutes a small portion of a typical image, while the dual wavelet coefficients have a large absolute value only if edges intersect the support of the corresponding dual wavelets. Consequently, the image can be approximated well using a few significant wavelet coefficients. A clear statistical structure also follows: large/small values of wavelet coefficients tend to propagate through the scales of the quadtrees depicted in Figure 2. As an example, a sparse wavelet representation of a $512\times 512$ fishing boat image and a compressed version of it are shown in Figure 3, where the compression algorithm JPEG2000 is based on the sparse representation. The figure clearly shows that the significant wavelet coefficients (coefficients with relatively large absolute values) are located on strong edges of the image.

(a) Fishing boat image. (b) 15267 significant coefficients. (c) Compressed image 1:17.
Figure 3: Image compression based on sparse wavelet approximation.
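The idea of sparse approximation by the largest coefficients can be illustrated in one dimension. The sketch below uses a full orthonormal Haar transform rather than the wavelets used for Figure 3, and all names (`haar_forward`, `haar_inverse`, `keep_largest`) and the sample signal are hypothetical choices made for the illustration.

```python
import math

S = 1.0 / math.sqrt(2.0)  # orthonormal Haar normalization


def haar_forward(x):
    """Full orthonormal Haar transform of a list whose length is a power of two."""
    coeffs = list(x)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [S * (coeffs[2 * i] + coeffs[2 * i + 1]) for i in range(half)]
        detail = [S * (coeffs[2 * i] - coeffs[2 * i + 1]) for i in range(half)]
        coeffs[:n] = approx + detail  # coarse part first, details after it
        n = half
    return coeffs


def haar_inverse(c):
    """Inverse of haar_forward."""
    coeffs = list(c)
    n = 1
    while n < len(coeffs):
        merged = []
        for a, d in zip(coeffs[:n], coeffs[n:2 * n]):
            merged += [S * (a + d), S * (a - d)]
        coeffs[:2 * n] = merged
        n *= 2
    return coeffs


def keep_largest(coeffs, n_keep):
    """Zero all but the n_keep coefficients of largest absolute value."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:n_keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


# Piecewise-constant signal with a single jump (a one-dimensional "edge").
signal = [2.0] * 8 + [10.0] * 8
coeffs = haar_forward(signal)
approximation = haar_inverse(keep_largest(coeffs, 2))
max_error = max(abs(a - b) for a, b in zip(signal, approximation))
```

Here only two of the sixteen coefficients are nonzero (the coarse average and the coarsest wavelet coefficient, which straddles the jump), so keeping the two largest coefficients already reproduces the signal up to rounding. Natural images are not exactly sparse, but the same selection rule underlies the approximation shown in Figure 3.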

3.2 Embedded Wavelet Tokenization

The sparse wavelet representation (2) of an image provides the perfect infrastructure for the generation of embedded coding representations (11, 12, 13). Embedded coding is similar in spirit to binary finite precision representations of real numbers. For each digit added to the right, more precision is added. Yet, the “encoding” can cease at any time and provide the “best” representation of the real number achievable within the framework of the binary digit representation. Similarly, the embedded coder can cease at any time and provide the “best” representation of an image achievable within its framework. Embedded coding streams can be generated from wavelet representations by ordering the information on the wavelet representation starting with the most significant bits of the most significant coefficients, that is, the coefficients with the largest absolute value.
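The analogy with binary expansions can be made concrete with a few lines of code. The sketch below is only an illustration of the embedding property for a single real number in $[0,1)$, not of the wavelet coder itself; the function names `binary_digits` and `decode_prefix` and the sample value are hypothetical.

```python
def binary_digits(x, n_bits):
    """First n_bits binary digits of x in [0, 1), most significant first."""
    digits = []
    for _ in range(n_bits):
        x *= 2
        bit = int(x)  # next digit is the integer part after doubling
        digits.append(bit)
        x -= bit
    return digits


def decode_prefix(digits):
    """Value represented by a (possibly truncated) list of binary digits."""
    return sum(bit * 2.0 ** -(i + 1) for i, bit in enumerate(digits))


x = 0.8125  # equals 0.1101 in binary
digits = binary_digits(x, 4)
# Approximation error after reading only the first n digits, n = 0..4.
errors = [x - decode_prefix(digits[:n]) for n in range(5)]
```

Every prefix of the digit sequence is itself a valid representation, and the error after $n$ digits is below $2^{-n}$ and never increases as more digits are read. The embedded wavelet stream has the same property, with tokens playing the role of digits.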

First, for simplicity of notation, using (2), denote for $I=(i_1,i_2)$, $1\le i_1,i_2\le 2$,

$$\alpha_I=\langle f,\tilde{\varphi}_{m,I}\rangle.$$

We also map the coefficients $\{\langle f,\tilde{\psi}^e_{j,k}\rangle\}$ for $1\le j\le m$, $e=1,2,3$, based on their location $I=(i_1,i_2)$, $3\le i_1,i_2\le M$, in the coefficient matrix

$$\alpha_I\leftarrow\langle f,\tilde{\psi}^e_{j,k}\rangle.$$

We note in passing that one may assume that the low resolution scaling function coefficients from (2) are known during training and are randomly sampled from some distribution during image generation and therefore need not be part of the tokenization.

3.2.1 Encoding a wavelet representation into a token sequence

We now show how to process the numeric representation of the coefficients, from most significant to least significant and ‘encode’ it into a relatively compact series of tokens. The representation using the series of tokens should be invertible. That is, one can ‘decode’ the sequence back to the wavelet representation.

To this end, assuming the image pixels are normalized to the range $[0,1]$, one can show that for an image of dyadic dimension $[M,M]=[2^m,2^m]$,

$$\max_I|\alpha_I|\le 2^{m-1}. \tag{3}$$

Assuming for simplicity that all images of a given dataset have the same dyadic dimensions $[M,M]$, this bound holds for all of their wavelet representations. Our first option is to initialize a threshold $T=2^{m-2}$ and begin scanning the wavelet coefficients of the image for significance, in a predetermined order (see below), with the goal of reporting only those coefficients for which the following holds:

$$T\le|\alpha_I|<2T.$$

Our second option is to compute separately for each image in the dataset

$$\tilde{m}:=\lceil\log_2\max_I|\alpha_I|\rceil, \tag{4}$$

and then initialize, for the specific image, $T=2^{\tilde{m}-1}$. In this scenario, we store and use the parameter $\tilde{m}$ for each image in the training set along with its sequence of tokens.
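As a minimal sketch of the two initialization options, the following Python helpers compute the uniform threshold derived from (3) and the per-image threshold derived from (4). The function names and the nested-list layout of the coefficient matrix are assumptions made for illustration and are not taken from the original implementation.

```python
import math

def uniform_initial_threshold(m: int) -> float:
    """Option 1: one threshold T = 2^(m-2) for all 2^m x 2^m images, from bound (3)."""
    return 2.0 ** (m - 2)

def adaptive_initial_threshold(coeffs):
    """Option 2: per-image m_tilde = ceil(log2(max |alpha_I|)) and T = 2^(m_tilde - 1), as in (4).

    `coeffs` is a list of rows of wavelet coefficients (hypothetical layout).
    Returns the pair (m_tilde, T); m_tilde is what would be stored with the image.
    """
    max_abs = max(abs(a) for row in coeffs for a in row)
    if max_abs == 0:
        raise ValueError("all coefficients are zero, so m_tilde is undefined")
    m_tilde = math.ceil(math.log2(max_abs))
    return m_tilde, 2.0 ** (m_tilde - 1)

# Example: a 32 x 32 image has m = 5, so the uniform threshold is 2^3.
print(uniform_initial_threshold(5))                               # 8.0
print(adaptive_initial_threshold([[0.5, -17.45], [3.0, 1.0]]))    # (5, 16.0)
```

Note that when the largest magnitude is an exact power of two, (4) places it exactly at $2T$, the excluded upper end of the first bit-plane range. The text does not say how this case is handled, so the sketch leaves it untreated.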

We also maintain a matrix of approximated wavelet coefficients $\{\tilde{\alpha}_I\}$, which we initialize with zeros. Once we complete the processing at a given bit-plane, we update $T\leftarrow T/2$ and repeat the process. At each bit-plane we report the significant coefficients that were just uncovered in this bit-plane using a token ‘NowSignificantNeg’ if the coefficient is negative or a token ‘NowSignificantPos’ if it is positive. At the time of uncovering, we modify the approximation of the coefficient $\tilde{\alpha}_I$ to have the absolute value $3T/2$, with the reported sign. Next, we add a token to represent the coefficient’s next significant bit: ‘NextAccuracy0’ if the coefficient satisfies $|\alpha_I|\le|\tilde{\alpha}_I|$, or ‘NextAccuracy1’ if $|\alpha_I|>|\tilde{\alpha}_I|$. The approximation $\tilde{\alpha}_I$ is updated accordingly by subtracting or adding $T/4$ (depending on the sign of the coefficient and the accuracy bit type).

Let us demonstrate with an example. Assume $T=16$ and $\alpha_I=-17.45$. Since $T\le|\alpha_I|<2T$, the coefficient is first uncovered in the current bit-plane. When we arrive at the index $I$, we report a ‘NowSignificantNeg’ token for this coefficient, providing it with a temporary approximation $\tilde{\alpha}_I=-24$, which lies in the middle of the segment $[-2T,-T]=[-32,-16]$. Next, since in fact $|\alpha_I|\le 24$, we report a token ‘NextAccuracy0’ to represent the coefficient’s next significant bit, providing an updated approximation $\tilde{\alpha}_I=-20$, which lies in the middle of the segment $[-3T/2,-T]=[-24,-16]$, leading to a better approximation of the ground truth value.

In case a coefficient has been uncovered in any of the previous bit-planes and is already known to be significant, we only add one of the tokens ‘NextAccuracy0’ if $|\alpha_I|\le|\tilde{\alpha}_I|$ or ‘NextAccuracy1’ if $|\alpha_I|>|\tilde{\alpha}_I|$. We then update the approximation $\tilde{\alpha}_I$ by subtracting or adding $T/4$ (depending on the sign of the coefficient and the accuracy bit type).
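The successive-approximation rule for a single coefficient can be sketched in Python as follows. The helper names `uncover` and `refine` are hypothetical; the arithmetic follows the description above (magnitude $3T/2$ when uncovered, then a step of $T/4$ per accuracy token), and the usage lines replay the example with $T=16$ and $\alpha_I=-17.45$.

```python
def uncover(alpha: float, T: float):
    """First report of a coefficient with T <= |alpha| < 2T at the current bit-plane."""
    if not (T <= abs(alpha) < 2 * T):
        raise ValueError("coefficient is not newly significant at this threshold")
    sign = -1.0 if alpha < 0 else 1.0
    token = "NowSignificantNeg" if alpha < 0 else "NowSignificantPos"
    return token, sign * 1.5 * T          # midpoint of [T, 2T), with the reported sign

def refine(alpha: float, approx: float, T: float):
    """Emit one accuracy token and move the approximated magnitude by T/4."""
    sign = -1.0 if approx < 0 else 1.0
    if abs(alpha) <= abs(approx):
        return "NextAccuracy0", sign * (abs(approx) - T / 4)
    return "NextAccuracy1", sign * (abs(approx) + T / 4)

alpha = -17.45
token, approx = uncover(alpha, 16)        # ('NowSignificantNeg', -24.0)
token, approx = refine(alpha, approx, 16) # ('NextAccuracy0', -20.0)
token, approx = refine(alpha, approx, 8)  # next bit-plane: ('NextAccuracy0', -18.0)
print(token, approx)
```

Each call to `refine` halves the interval known to contain $|\alpha_I|$, which is why the same step $T/4$ applies both right after uncovering and in later bit-planes, where $T$ has already been halved.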

Assuming the bit-plane scanning order of the coefficients is fixed, one then only needs to add the token ‘Insignificant’ to provide a valid invertible tokenization process. One simply scans the coefficients in the fixed order and uses their true known value to test and apply one of three possibilities:

  (i) $|\alpha_I|\ge 2T$: The coefficient has already been reported as significant in a previous bit-plane. Therefore one reports the token ‘NextAccuracy0’ or ‘NextAccuracy1’ depending on the test $|\alpha_I|\le|\tilde{\alpha}_I|$.

  (ii) $T\le|\alpha_I|<2T$: First report the token ‘NowSignificantNeg’ or ‘NowSignificantPos’ depending on the sign and then report the token ‘NextAccuracy0’ or ‘NextAccuracy1’.

  (iii) $|\alpha_I|<T$: Report ‘Insignificant’.
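The three cases above can be sketched as a small Python tokenizer. This is an illustrative simplification under stated assumptions: the coefficients are given as a flat list that is already in the fixed scan order, the function name `tokenize_simple` is hypothetical, and no grouping of insignificant coefficients is attempted.

```python
def tokenize_simple(coeffs, T0, num_planes):
    """Tokenize a flat list of coefficients over `num_planes` bit-planes.

    Returns (tokens, approx), where approx is the approximation a decoder would hold.
    """
    if any(abs(a) >= 2 * T0 for a in coeffs):
        raise ValueError("T0 is too small: every |coefficient| must be below 2*T0")
    approx = [0.0] * len(coeffs)
    tokens = []
    T = float(T0)
    for _ in range(num_planes):
        for i, a in enumerate(coeffs):
            if abs(a) < T:                               # case (iii)
                tokens.append("Insignificant")
                continue
            sign = -1.0 if a < 0 else 1.0
            if abs(a) < 2 * T:                           # case (ii): newly significant
                tokens.append("NowSignificantNeg" if a < 0 else "NowSignificantPos")
                approx[i] = sign * 1.5 * T
            # cases (i) and (ii) both end with exactly one accuracy token
            if abs(a) <= abs(approx[i]):
                tokens.append("NextAccuracy0")
                approx[i] = sign * (abs(approx[i]) - T / 4)
            else:
                tokens.append("NextAccuracy1")
                approx[i] = sign * (abs(approx[i]) + T / 4)
        T /= 2                                           # move to the next bit-plane
    return tokens, approx

tokens, approx = tokenize_simple([-17.45, 3.0, 9.0], T0=16, num_planes=2)
print(tokens)
print(approx)   # [-18.0, 0.0, 10.0]
```

With two bit-planes the coefficient 3.0 is never uncovered and stays at zero, while 9.0 becomes significant only in the second bit-plane, when the threshold has dropped to 8.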

The process described above, although completely sufficient for invertible tokenization, potentially creates long sequences. Specifically, it does not take into consideration the local correlations among ‘neighboring’ insignificant wavelet coefficients. Due to the sparsity property of the wavelet transform, during the scanning process, many of the ‘Insignificant’ coefficients form local groups. Moreover, there are correlations between local groups of insignificant coefficients of the same subband type across resolutions in the manner of the quad-tree structure of Figure 2. Image compression algorithms such as EZW (11) or SPIHT (12) are based on statistical zero tree models that try to capture these correlations across resolutions. As we shall later see, for image generation, we actually rely on the powerful capabilities of the transformer models to learn correlation patterns of the ‘wavelet language’ of the given dataset. However, we do significantly ‘ease the burden’ on the transformers by utilizing the structure of the groups of insignificant coefficients to reduce the size of the token sequences, thereby creating shorter contexts.

To this end, we add two additional tokens for groups of insignificant coefficients, ‘Group4x4’ and ‘Group2x2’, and modify the scanning process to visit the coefficients based on groups of $4\times 4$. The first token is used in locations where the scan is at an index $(4l_1,4l_2)$, for some integers $l_1,l_2$. If at the current bit-plane all the 16 coefficients with indices $I=(i_1,i_2)$, $4l_1\le i_1<4(l_1+1)$, $4l_2\le i_2<4(l_2+1)$, are still insignificant, we issue the token ‘Group4x4’ and the tokenization process continues to the next group of $4\times 4$ coefficients. However, if any of the coefficients of the $4\times 4$ group becomes significant in the current bit-plane, the group breaks down into 4 groups of $2\times 2$. If a group of $2\times 2$ is still composed of insignificant coefficients at the current bit-plane, we add a token ‘Group2x2’. If a group of $2\times 2$ breaks down, then each coefficient from the group is reported individually as being ‘Insignificant’ or one of ‘NowSignificantNeg’, ‘NowSignificantPos’.
The scanning process keeps track of which groups broke up, so that only necessary and informative tokens are generated. We summarize the seven tokens and their roles below.

  (i) ‘Group4x4’ – At the index $(4l_1,4l_2)$, the group of 16 coefficients $\{\alpha_I\}$, $4l_1\le i_1<4l_1+4$, $4l_2\le i_2<4l_2+4$, are still insignificant, $|\alpha_I|<T$.

  (ii) ‘Group2x2’ – At the index $(2l_1,2l_2)$, the group of 4 coefficients $\{\alpha_I\}$, $2l_1\le i_1<2l_1+2$, $2l_2\le i_2<2l_2+2$, are still insignificant, $|\alpha_I|<T$.

  (iii) ‘NowSignificantNeg’, ‘NowSignificantPos’ – At the current location $I$, the coefficient satisfies $T\le|\alpha_I|<2T$. If the coefficient was part of a group of insignificant coefficients at the previous bit-plane, the group is now automatically dissolved.

  (iv) ‘Insignificant’ – At the current location $I$, the coefficient is still insignificant and satisfies $|\alpha_I|<T$. If the coefficient was part of a group of insignificant coefficients at the previous bit-plane, the group is now automatically dissolved.

  (v) ‘NextAccuracy0’, ‘NextAccuracy1’ – At the current location $I$, the coefficient has already been reported to be significant since it satisfies $|\alpha_I|\ge T$. Here, we improve the accuracy of its approximation using one of these tokens, depending on the test $|\alpha_I|\le|\tilde{\alpha}_I|$.
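The group tokens can be illustrated on a single $4\times 4$ block. The Python sketch below covers only the significance tokens of the first bit-plane, where no coefficient has been reported yet; accuracy tokens and the bookkeeping of groups that broke up in earlier bit-planes are deliberately omitted. The function name and the visiting order of the $2\times 2$ sub-blocks and of the cells inside them (row by row) are assumptions and may differ from the actual scan order of the method.

```python
def tokenize_block_first_plane(block, T):
    """Significance tokens for one 4x4 block (list of 4 rows) at the first bit-plane."""
    if all(abs(a) < T for row in block for a in row):
        return ["Group4x4"]                       # whole block is still insignificant
    tokens = []
    for r0 in (0, 2):                             # the block breaks into four 2x2 groups
        for c0 in (0, 2):
            cells = [block[r][c] for r in (r0, r0 + 1) for c in (c0, c0 + 1)]
            if all(abs(a) < T for a in cells):
                tokens.append("Group2x2")         # this 2x2 group is still insignificant
                continue
            for a in cells:                       # the 2x2 group breaks: report each cell
                if abs(a) < T:
                    tokens.append("Insignificant")
                elif a < 0:
                    tokens.append("NowSignificantNeg")
                else:
                    tokens.append("NowSignificantPos")
    return tokens

block = [[0.0] * 4 for _ in range(4)]
print(tokenize_block_first_plane(block, 16))      # ['Group4x4']
block[0][1] = 20.0
print(tokenize_block_first_plane(block, 16))
```

In the second call a single significant coefficient costs 7 tokens instead of 16, since three of the four $2\times 2$ groups remain intact.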

The bit-plane scan is carried out in two nested loops: the outer loop proceeds from low resolution to high resolution, each time traversing the three types of wavelet subbands. The inner loop traverses the $4\times 4$ blocks. Figure 4 illustrates the outer and inner scanning patterns.

Figure 4: A sketch illustrating the outer and inner scanning orders. (a) Subband scanning order. (b) Scanning order of $4\times 4$ blocks.

Figure 5 exemplifies the tokenization algorithm of an image from the MNIST dataset (30). The image was padded with zeros to be of dimensions $M\times M=32\times 32$, with $m=5$. The bottom row of the figure shows the tokens and their locations on the wavelet image for the first three bit planes. To make the process clearer, we explicitly write the resulting sequence of tokens for the first bit plane shown in Figure 5(d).

{ ‘Insignificant’, ‘Insignificant’, ‘NowSignificantPos’, ‘Insignificant’,
‘Insignificant’, ‘Insignificant’, ‘NowSignificantNeg’, ‘Insignificant’,
‘Group2x2’, ‘Group2x2’,
‘Insignificant’, ‘Insignificant’, ‘Insignificant’, ‘NowSignificantNeg’,
‘Group2x2’, ‘Group2x2’, ‘Group2x2’,
‘Group4x4’, …, ‘Group4x4’ }

The token sequences of the second and third bit-planes follow the same scanning pattern. Eventually, the three sequences are concatenated in the natural order to form the final sequence, which describes the three-bit-plane wavelet image appearing in Figure 5(c).

A very important hyper-parameter is the choice of the smallest threshold at the final bit-plane. Through this hyper-parameter, the wavelet representation provides us with a very robust and stable trade-off between fine detail generation and the length of the token sequences. Choosing a final threshold provides very consistent control over visual quality levels such as “Visually Lossless”, “High”, “Medium”, “Low”, etc. This is on par with the quality settings in digital cameras, which in turn lead to a selection of the corresponding quantization tables of the JPEG algorithm generating the compressed images.

Figure 5: Depiction of the tokenization process. (a) $32\times 32$ padded MNIST image. (b) Wavelet transform. (c) Three bit planes of coefficients. (d) First bit plane. (e) Second bit plane. (f) Third bit plane. On the top left and middle, a $32\times 32$ padded MNIST image and its wavelet transform. On the top right, the wavelet approximation generated by the first three bit-planes. The bottom row illustrates the tokens and their locations on the $32\times 32$ grid, where ‘NowSignificantNeg’ and ‘NowSignificantPos’ tokens are annotated with orange “−” and blue “+” signs, respectively. The tokens ‘NextAccuracy0’ and ‘NextAccuracy1’ are marked with green down and red up triangles. The purple dots represent ‘Insignificant’ coefficients and the brown and pink squares represent the ‘Group2x2’ and ‘Group4x4’ zero block tokens.

3.2.2 Decoding the token sequence into an approximate wavelet representation

The tokenization process described in the previous subsection can be easily inverted back to an approximate wavelet representation. Moreover, any initial sub-sequence can be inverted to provide a possibly coarser approximation. We initialize a matrix of size $M\times M$ of the approximated wavelet coefficients $\{\tilde{\alpha}_I\}$ with zeros and begin the scanning process with the first bit-plane. Based on (3) or (4), we know how to initialize the first bit-plane with the initial threshold $T=2^{m-2}$ or $T=2^{\tilde{m}-1}$. We then process the token sequence and update the approximated coefficients using the corresponding ‘significant’ and ‘bit accuracy’ tokens. If, for any given reason, the sequence of tokens terminates, we have the best possible approximated coefficients $\{\tilde{\alpha}_I\}$, from which we can obtain an approximated image by applying the inverse DWT. Our decoding process relies on the assumption that the token sequence is valid. For example, a ‘Group4x4’ token cannot appear while the decoder scan position is at a location of indices not divisible by 4. It is obvious how to achieve this in the context of image coding. However, during an image generation process, this needs to be enforced using the conditional next-token inference described in Subsection 4.6.
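A decoder for the ungrouped token stream can be sketched in a few lines of Python. The sketch makes the same simplifying assumptions as a flat-scan encoder would: `n` coefficients visited in a fixed order, no ‘Group4x4’ or ‘Group2x2’ tokens, and a hypothetical function name `decode_simple`. It also shows that any prefix of a valid sequence decodes to a coarser approximation.

```python
ACCURACY = ("NextAccuracy0", "NextAccuracy1")

def decode_simple(tokens, n, T0):
    """Invert an ungrouped token sequence into n approximated coefficients."""
    approx = [0.0] * n
    T, pos = float(T0), 0
    for tok in tokens:
        if approx[pos] != 0.0:
            # Coefficient is already significant: only an accuracy token is valid here.
            if tok not in ACCURACY:
                raise ValueError(f"expected an accuracy token, got {tok!r}")
            sign = 1.0 if approx[pos] > 0 else -1.0
            step = T / 4 if tok == "NextAccuracy1" else -T / 4
            approx[pos] = sign * (abs(approx[pos]) + step)
        elif tok == "NowSignificantPos":
            approx[pos] = 1.5 * T
            continue                      # its accuracy token follows at the same position
        elif tok == "NowSignificantNeg":
            approx[pos] = -1.5 * T
            continue
        elif tok != "Insignificant":
            raise ValueError(f"token {tok!r} is not valid at this position")
        pos += 1
        if pos == n:                      # bit-plane finished: halve the threshold
            pos, T = 0, T / 2
    return approx

seq = ["NowSignificantNeg", "NextAccuracy0", "Insignificant", "Insignificant",
       "NextAccuracy0", "Insignificant", "NowSignificantPos", "NextAccuracy0"]
print(decode_simple(seq, n=3, T0=16))      # [-18.0, 0.0, 10.0]
print(decode_simple(seq[:2], n=3, T0=16))  # [-20.0, 0.0, 0.0], a coarser approximation
```

The validity checks raise an error on an invalid sequence, which is acceptable for coding; for generation the invalid tokens would instead have to be excluded before a token is selected.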

4 The Generative Wavelet Transformer

Assume that for a certain dataset of images, we have established the translation of the visual information of each image to a sequence of tokens encapsulating the visual information from coarse to fine details as explained in Subsection 3.2. We assume that within the sequences, distinct patterns and relations exist between the tokens. For example, the wavelet coefficients $\{\langle f,\tilde{\psi}^e_{j,k}\rangle\}$ of wavelets $\{\psi^e_{j,k}\}$ whose support intersects with a certain portion of an edge of the image will be significant and aligned across scales in a tree-like structure as per Figures 2 and 3(b). At the same time, coefficients of wavelets whose support intersects with a smooth area of the image will be insignificant and they appear in local groups. This leads to the intuition that the powerful transformers created over the last few years (7) are able to learn the patterns of the ‘wavelet language’ and to generate them from some random seeds during inference.

Interestingly, as we experienced through empirical experimentation, one can actually take off-the-shelf pre-trained transformers such as T5 (32) or DistilBert (33) and fine-tune them to our purpose of learning the wavelet language with no additional modifications and with some reasonable generative results. That is, even though the pre-trained models were trained on a set of more than 20,000 tokens and on text datasets, one is able to fine-tune them on wavelet-based token sequences of images, containing only 7 tokens. However, some classical components of transformers are not aligned with the ‘wavelet language’ and should be replaced. At the same time, there are some very useful components and techniques that come with language models that can be also leveraged successfully in the context of image generation.

In this section, we describe how we modified the architecture of the DistilGPT2 model (33) (we used the code from HuggingFace (34) as a starting point) to optimize it to align with the wavelet-based image generation method. This obviously requires training the modified model from scratch.

4.1 Token vector representation

Typically, in the standard scenarios of spoken languages, transformers apply a ‘pre-processing’ learnable transform to tokens to convert them to vector representations. The idea is that similar words should be converted to vectors with some proximity, which intuitively serve as better input for the transformer’s neural network. However, as explained in Subsection 3.2.1, our wavelet dictionary includes only seven tokens that have very distinctive and different roles. Therefore, the simple transformation of the tokens to the one-hot encoding of the standard basis of dimension 7 is probably a better, if not optimal, choice. Thus, the initial vector representation of a token is: ‘Group4x4’ $\rightarrow(1,0,0,0,0,0,0)$, ‘Group2x2’ $\rightarrow(0,1,0,0,0,0,0)$, etc. This means that in the transformers we modified, we removed the ‘token $\rightarrow$ vector’ learnable transformation.
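A fixed token-to-vector map of this kind is straightforward to write down. In the Python sketch below, only the positions of ‘Group4x4’ and ‘Group2x2’ follow the text; the order of the remaining five tokens and the names `TOKENS` and `one_hot_token` are assumptions made for illustration.

```python
# Fixed vocabulary; the first two positions follow the text, the rest are assumed.
TOKENS = [
    "Group4x4", "Group2x2",
    "NowSignificantNeg", "NowSignificantPos",
    "Insignificant", "NextAccuracy0", "NextAccuracy1",
]

def one_hot_token(token: str):
    """Return the standard basis vector of dimension 7 for `token` (no learned embedding)."""
    vec = [0] * len(TOKENS)
    vec[TOKENS.index(token)] = 1
    return vec

print(one_hot_token("Group4x4"))   # [1, 0, 0, 0, 0, 0, 0]
print(one_hot_token("Group2x2"))   # [0, 1, 0, 0, 0, 0, 0]
```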

4.2 Initial bit-plane threshold

Recall that we have two options: to use a uniform initial bit-plane threshold for all images in the dataset derived from (3), or to use an adaptive initial threshold for each image of the training set using (4). In the latter case, we need to inform the transformer, per image, which initial threshold the token sequence is associated with. We do this as follows: assume a given dataset has $l$ possible values for $\tilde{m}$ in (4) (e.g., $l=4$ for the MNIST dataset, see Figure 6). Then, we concatenate a one-hot encoding of dimension $l$ of the initial threshold parameter of the given image to the vector representation of each token.

For image generation, one may sample randomly from the distribution of $l$ possible initial thresholds. In the case that the image generation is conditioned on a certain class (see Subsection 4.4), one can sample from the conditional distribution of the possible thresholds of the specific class.
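Sampling the initial threshold parameter at generation time could look as follows. This is a sketch under the assumption that the values of $\tilde{m}$ and the class labels of the training images are available as two parallel lists; the function names and the toy values are hypothetical.

```python
import random
from collections import Counter

def empirical_distribution(values):
    """Relative frequency of each distinct value, e.g. of m_tilde over a training set."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in sorted(counts.items())}

def sample_m_tilde(m_tildes, labels, rng, target_class=None):
    """Draw m_tilde from the empirical distribution, optionally conditioned on a class."""
    pool = [m for m, y in zip(m_tildes, labels)
            if target_class is None or y == target_class]
    if not pool:
        raise ValueError("no training image belongs to the requested class")
    dist = empirical_distribution(pool)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

# Toy data: per-image m_tilde values with their class labels (made up for illustration).
m_tildes = [3, 3, 4, 5, 4, 4]
labels   = [0, 0, 0, 0, 1, 1]
rng = random.Random(0)
print(empirical_distribution(m_tildes[:4]))                     # {3: 0.5, 4: 0.25, 5: 0.25}
print(sample_m_tilde(m_tildes, labels, rng, target_class=1))    # 4
```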

Figure 6: Distribution of $\log_2$ of the initial thresholds for the 70,000 MNIST images with the Haar wavelet transform.

4.3 Positional encoding

In classic transformer architectures (7), one adds the positional encoding $v_p(t)$ of the position $t$ to the token’s vector representation $v_e(x(t))$. Learned positional embeddings apply a learned transform $t\rightarrow v_p(t)$. Some transformers use a hard-coded mapping of the position. Assuming the vector embedding dimension is $d_e$ and the maximum length of a sequence is $l_{\max}$, then

$$v_p(t)(2i-1)=\sin\left(t/l_{\max}^{2i/d_e}\right),\quad v_p(t)(2i)=\cos\left(t/l_{\max}^{2i/d_e}\right),\quad 1\le i\le d_e/2.$$
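For reference, the hard-coded mapping can be evaluated directly. The Python sketch below follows the formula exactly as written here, with $l_{\max}$ as the base and 1-based component indices; the function name is hypothetical, and other transformer implementations may use a different base constant or indexing.

```python
import math

def sinusoidal_position(t: int, d_e: int, l_max: int):
    """Positional vector v_p(t) of even dimension d_e, following the formula above."""
    if d_e % 2 != 0:
        raise ValueError("d_e must be even")
    v = [0.0] * d_e
    for i in range(1, d_e // 2 + 1):
        angle = t / (l_max ** (2 * i / d_e))
        v[2 * i - 2] = math.sin(angle)   # component 2i-1 in the 1-based notation
        v[2 * i - 1] = math.cos(angle)   # component 2i in the 1-based notation
    return v

print(sinusoidal_position(0, 4, 100))    # [0.0, 1.0, 0.0, 1.0]
```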

In our scenario of the wavelet language, the position of a token in a sequence is $(bp,I)$, where $bp$ is the enumeration of the bit-plane and $I=(i_1,i_2)$ is the index of the current coefficient $\alpha_I$ in the scan order. Therefore, we concatenate to the vector representation of a token from Subsection 4.1 a vector component of dimension 3 with the location of the token $(bp,i_1,i_2)$.
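Putting Subsections 4.1 to 4.3 together, the input vector of one token could be assembled as in the following Python sketch. The concatenation order, the use of raw (unnormalized) position values, the token order beyond the first two entries, and all names are assumptions; the text only states which components are concatenated.

```python
TOKENS = [
    "Group4x4", "Group2x2",
    "NowSignificantNeg", "NowSignificantPos",
    "Insignificant", "NextAccuracy0", "NextAccuracy1",
]

def one_hot(index: int, size: int):
    """Standard basis vector of length `size` with a 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def token_input_vector(token, m_tilde, m_tilde_values, bp, i1, i2):
    """Concatenate token one-hot (7), threshold one-hot (l) and position (bp, i1, i2)."""
    token_part = one_hot(TOKENS.index(token), len(TOKENS))
    threshold_part = one_hot(m_tilde_values.index(m_tilde), len(m_tilde_values))
    position_part = [float(bp), float(i1), float(i2)]
    return token_part + threshold_part + position_part

# Hypothetical dataset with l = 4 possible values of m_tilde.
vec = token_input_vector("Group2x2", m_tilde=4, m_tilde_values=[2, 3, 4, 5],
                         bp=1, i1=4, i2=8)
print(len(vec))   # 14 = 7 + 4 + 3
print(vec)
```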

4.4 Generative guidance

It is obviously critical for any image generation method to allow guidance of the generative process by placing a condition on the class type of the generated image or a text prompt that describes it. Some image generation models apply a joint embedding space for text and images for this purpose. One such method is to use a pretrained model such as CLIP (35) that maps text and images to a joint embedding space. CLIP contains an image encoder $f$ and a caption encoder $g$ and, during training over pairs of images with captions $\{(x,c)\}$, optimizes a contrastive cross-entropy loss that encourages high dot-products $\langle f(x),g(c)\rangle$ in the joint embedding space. Thus, any image generation method can use the vector embedding of the given text prompt $c$ to guide the generative process by conditioning the image embedding $f(x)$ to be highly correlated with the embedding of the textual prompt.

In our case, since we converted the problem of image generation to a ‘wavelet-language’ generation, we can apply ‘text’-type prompting methods. Having access to a joint embedding text-image space allows us to train using the vector representation of the image training set. Then, at image generation, we use the vector representation of the given text prompt to guide the generative process. There are very simple ways of using these vector representations. We choose to concatenate them to the vector representation of each token and its position (as explained above). For example, as shown in Section 5, for the image datasets MNIST or FashionMNIST with 10 classes, it is easy to concatenate a vector of length 10 representing the class of the image. In the case where we wish to guide the generative process using a textual prompt, we may concatenate the CLIP vector embedding (35) of the textual prompt to each token vector representation. As we discuss in Subsection 6.2, we hope this approach to guiding the generative process can be generalized to composition of blobs (36), where a given guiding vector of a blob is used only at positions of the scan where the support of the corresponding wavelet intersects the blob.

4.5 Initialization of the generative process

Since the guidance of the generative process (Subsection 4.4) is applied through the concatenation of vector representations to each token vector representation, in some cases, the initialization becomes a minor issue. For example, when training on MNIST and generating digits, one can get away with a simple random choice from the subset ‘Insignificant’, ‘NowSignificantNeg’ or ‘NowSignificantPos’ for the first token, and from there the transformer will generate a valid token sequence which is converted to an adequate image of a digit from the pre-selected class.

A more robust method is as follows. Suppose we wish to generate a handwritten digit from a certain digit class. Let $\{f_{s}\}_{s\in S}$ be the subset of MNIST images from that specific digit class and let

$$\{\langle f_{s},\tilde{\varphi}_{m,k}\rangle\},\quad s\in S,\quad k=(k_{1},k_{2}),\quad 1\leq k_{1},k_{2}\leq 2, \tag{5}$$

be the subset of low-resolution coefficients of these images defined by (2). Let $N(v,\Sigma)$ be the four-dimensional normal distribution estimated from the subset (5). We then sample from $N(v,\Sigma)$ a random group of four low-resolution coefficients. Now, the token representation of these coefficients can serve as a basis for a robust initialization of the generative process of the required digit. In the case where the guidance is provided by a vector representation of some text-prompt, one can create the normal distribution using a subset of $K$-nearest neighbors in the image vector representation space.
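A minimal sketch of this sampling step is given below, under stated assumptions: the function names are hypothetical, the input is synthetic data standing in for the four low-resolution coefficients of each image in the class, and a small diagonal jitter is added for numerical stability. The subsequent conversion of the sampled coefficients into tokens is not shown.

```python
import math
import random


def mean_and_cov(samples):
    """Sample mean v and unbiased covariance Sigma of a list of vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a, jitter=1e-9):
    """Lower-triangular factor L with L L^T close to a + jitter * I."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(a[i][i] + jitter - s, jitter))
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_low_res(samples, rng):
    """Draw one vector from N(v, Sigma) fitted to `samples`."""
    mean, cov = mean_and_cov(samples)
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mean))]


# Synthetic stand-in for the coefficients in (5): 50 images, 4 values each.
rng = random.Random(0)
toy_coeffs = [[rng.uniform(0.0, 1.0) for _ in range(4)] for _ in range(50)]
seed_coeffs = sample_low_res(toy_coeffs, rng)
```

For text-prompt guidance, the same routine could be applied to the coefficients of the $K$ nearest neighbors instead of a whole class.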

Once some random seeding allows us to initialize the token sequence, we may introduce as much diversity as required using the methods of Subsection 4.7 so that even using the same seed may generate various images corresponding to the given guidance.

4.6 Conditional next token inference

In Greedy generative mode, the next selected token $x(t)$, $1\leq x(t)\leq 7$, at location $t$, is the token for which the transformer assigns the highest probability from $(p_{1}(t),...,p_{7}(t))$. As described in Subsection 4.7 below, there are various alternative methods to control the output of the transformer. However, since each generative token inference step is a statistical event, it may occur that the next predicted token is not valid at the current position of the wavelet bit-plane scan. To overcome this, we ensure any selected token satisfies the conditions below relating to the context and the current position in the scan. For example, when using the Greedy method, one simply picks from the subset of tokens satisfying the conditions below, the one with the highest probability.

  1. (i)

    ‘Group4x4’ - The scan is at an index $(4l_{1},4l_{2})$ and the group has not yet dissolved.

  2. (ii)

    ‘Group2x2’ - The scan is at an index $(2l_{1},2l_{2})$ and the group has not yet dissolved.

  3. (iii)

    ‘NowSignificantNeg’, ‘NowSignificantPos’ - At the current location $I$, the coefficient $\alpha_{I}$ is still insignificant, possibly as part of a group of insignificant coefficients.

  4. (iv)

    ‘Insignificant’ - At the current location $I$, the coefficient $\alpha_{I}$ is still insignificant, possibly as part of a group of insignificant coefficients.

  5. (v)

    ‘NextAccuracy0’, ‘NextAccuracy1’ - At the current location $I$, the coefficient $\alpha_{I}$ has already been reported to be significant.
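The conditions (i)-(v) can be read as a mask over the seven tokens, applied before the greedy choice. The sketch below is one possible interpretation, not the authors' code: the `ScanState` fields, the 0-based scan indices, and the ordering of tokens in `TOKENS` are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical ordering of the seven tokens of the 'wavelet language'.
TOKENS = ("Group4x4", "Group2x2", "Insignificant", "NowSignificantNeg",
          "NowSignificantPos", "NextAccuracy0", "NextAccuracy1")


@dataclass
class ScanState:
    row: int             # current scan index, assumed 0-based
    col: int
    significant: bool    # coefficient already reported significant
    group4_intact: bool  # enclosing 4x4 group has not yet dissolved
    group2_intact: bool  # enclosing 2x2 group has not yet dissolved


def valid_tokens(state):
    """Return the set of token names allowed at the current scan position."""
    if state.significant:
        return {"NextAccuracy0", "NextAccuracy1"}                # (v)
    allowed = {"Insignificant",                                   # (iv)
               "NowSignificantNeg", "NowSignificantPos"}          # (iii)
    if state.row % 4 == 0 and state.col % 4 == 0 and state.group4_intact:
        allowed.add("Group4x4")                                   # (i)
    if state.row % 2 == 0 and state.col % 2 == 0 and state.group2_intact:
        allowed.add("Group2x2")                                   # (ii)
    return allowed


def constrained_greedy(probs, state):
    """Pick the most probable token among those valid at `state`."""
    allowed = valid_tokens(state)
    candidates = [i for i, name in enumerate(TOKENS) if name in allowed]
    return TOKENS[max(candidates, key=lambda i: probs[i])]
```

With this design, an invalid token that the transformer happens to rank first is simply skipped, and the stochastic methods of Subsection 4.7 can be applied to the same masked subset.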

4.7 Controlling the degree of generative diversity during inference

Since we are applying a language transformer model we may use various simple stochastic mechanisms to control the generative process during inference and allow a diversity of possible images to be generated from a single prompt. Some of the available stochastic methods are: Beam search with multinomial sampling, Top-$k$ and Top-$p$. In our experiments, we tested the latter two:

  1. (i)

    Top-$k$ sampling - The Top-$k$ inference method (37) filters the $k$ most likely next words first and then samples from the probability mass that is redistributed among only those $k$ next words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation. In Figure 8 below, we see a diversity of sandals generated by guiding the model with the vector representation of the corresponding FashionMNIST ‘sandal’ class and using the Top-$2$ method. We see that using $k=2$ is sufficient to move the generative process from a deterministic process to a sufficiently diverse stochastic process, yet with output that fits the class description.

  2. (ii)

    Top-$p$ sampling - In Top-$p$ sampling, or nucleus sampling, the selection pool for the next token is determined by the cumulative probability of the most probable tokens. Setting a threshold $p$, the model includes just enough of the most probable tokens so that their combined probability reaches or exceeds this threshold. Again, the distribution mass is redistributed among these tokens and then the next token is sampled using this distribution. In Figure 7 we see different examples of the digits ‘3’ and ‘8’ generated using the Top-$0.6$ method.
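The two filtering rules above can be sketched in a few lines of plain Python. This is a generic illustration of Top-$k$ and Top-$p$ filtering over a seven-token distribution, with hypothetical function names and 0-based token indices; it is not taken from the authors' code.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their mass."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(dist, rng):
    """Draw one token index from a filtered distribution."""
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


# Toy next-token probabilities over the seven tokens.
probs = [0.5, 0.2, 0.15, 0.1, 0.03, 0.01, 0.01]
rng = random.Random(0)
token_k = sample(top_k_filter(probs, k=2), rng)
token_p = sample(top_p_filter(probs, p=0.6), rng)
```

For this toy distribution both filters keep only the two most probable tokens, since the first token alone carries probability 0.5, which is below the threshold 0.6.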

5 Experimental results

We conducted experiments on the MNIST and FashionMNIST datasets. Here are some details:

  • The images in both datasets were padded with zeros to $M\times M=32\times 32$ and normalized to have values within $[0,1]$.

  • We used the Haar wavelet basis for the MNIST images and the bior4.4 wavelet basis for the FashionMNIST.

  • The images were tokenized with a final threshold of $T=2^{-3}$ for MNIST and $T=2^{-4}$ for FashionMNIST.

  • The maximal token sequence lengths were 1742 for MNIST and 3098 for FashionMNIST.

  • We trained two separate distilgpt2 models from scratch on the two datasets. As for the training configurations, both training sessions had batch size 4, learning rate 0.0004, and weight decay 0.01.

  • Models were trained on an NVIDIA A100 GPU with 80GB; MNIST occupied around 22GB while FashionMNIST occupied 61GB. Both models were trained for a few days.
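For concreteness, the settings listed above can be collected as follows, together with a sketch of the padding and normalization step. The dictionary keys and the function name are hypothetical, and two choices are assumptions made for the example since the text does not specify them: the 28×28 image is centered inside the 32×32 frame, and the input is taken to be 8-bit so that normalization divides by 255.

```python
# Values taken from the list above; key names are illustrative only.
EXPERIMENT_CONFIG = {
    "MNIST": {"wavelet": "haar", "final_threshold": 2 ** -3,
              "max_sequence_length": 1742},
    "FashionMNIST": {"wavelet": "bior4.4", "final_threshold": 2 ** -4,
                     "max_sequence_length": 3098},
}
TRAINING_CONFIG = {"batch_size": 4, "learning_rate": 0.0004,
                   "weight_decay": 0.01}


def pad_and_normalize(image, size=32, max_value=255.0):
    """Zero-pad a 2D image to size x size and scale its values to [0, 1].

    Centering the image in the padded frame is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    top, left = (size - height) // 2, (size - width) // 2
    out = [[0.0] * size for _ in range(size)]
    for r in range(height):
        for c in range(width):
            out[top + r][left + c] = image[r][c] / max_value
    return out


# A constant 28 x 28 toy image standing in for an MNIST digit.
toy_image = [[255] * 28 for _ in range(28)]
padded = pad_and_normalize(toy_image)
```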

Results with different controlling methods appear below in Figures 7 and 8.

Figure 7: Digits generated with Top-$p=0.6$ along with a depiction of the generated wavelet coefficients.

Figure 8: Sandals generated with Top-$k=2$ along with a depiction of the generated wavelet coefficients.

More generated images for different classes of MNIST and FashionMNIST appear in the following figures.

Figure 9: More MNIST results.

Figure 10: More FashionMNIST results.

6 Discussion and future work

In this paper, we introduced a novel method for image generation that is based on elements of wavelet image coding and NLP transformers. Unfortunately, our research group does not have access to sufficient computational resources at the moment, so this work serves as a first modest proof of concept. Indeed, the wavelet representation is a powerful tool in image processing that can serve as a basis for many image generation functionalities. Here, we list some directions that we will consider for future work.

6.1 Generation of color images at high resolution and with fine details

In our experiments, we only generated small grayscale images. We provide here some details on how the method can be generalized:

  1. (i)

    Color images - For color images (or even spectral images), we may adopt a well-known paradigm from image compression. For improved performance, one may transform input images in the $RGB$ color space to the $YCbCr$ color space. The $Y$ component is the luminance component, essentially the image’s grayscale part. The other two components, $Cb$ and $Cr$, capture the color information of the image. Typically, the luminance component carries most of the visual information, and thus its encoding usually accounts for the largest part of an encoded image. In image coding, one usually encodes each of the three channels separately. Our method can then be generalized to color images by applying the DWT and the tokenization process separately to each color channel.

  2. (ii)

    Image resolution - Support of any image resolution simply translates to more wavelet coefficients and longer token sequences. Obviously, this requires larger transformer models that can support longer contexts and more computational resources. The method should also generalize well, just as wavelet image coding is being applied to Gigabyte images.

  3. (iii)

    Generating fine details - Using our wavelet model, finer details are captured at higher bit-planes. The choice of the final threshold of the final bit-plane provides excellent and very consistent control over the amount of detail one wishes to generate (see Subsection 3.2.1). This quantization technique is at the heart of the JPEG algorithms and translates to very specific modes in digital cameras that can be set to: “Visually Lossless”, “High”, “Medium”, etc. This exact form of control also applies to wavelets but, unfortunately, is not the default mode of operation in JPEG2000. Obviously, to generate finer details, one needs to train the transformer on longer token sequences, again requiring more computational resources.
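As an illustration of the color transform mentioned in item (i), the sketch below converts 8-bit $RGB$ values to $YCbCr$ using the full-range coefficients of the JPEG/JFIF convention and splits an image into three planes. The choice of this particular variant of the transform is an assumption, as are the function names; the output is left as floats without rounding or clamping, and the per-channel DWT and tokenization are not shown.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range JFIF-style conversion for 8-bit RGB values (floats out)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_channels(image):
    """Turn rows of (r, g, b) pixels into separate Y, Cb and Cr planes."""
    planes = ([], [], [])
    for row in image:
        converted = [rgb_to_ycbcr(*pixel) for pixel in row]
        for plane, channel in zip(planes, zip(*converted)):
            plane.append(list(channel))
    return planes


# A 1 x 2 toy image: one white pixel and one black pixel.
y_plane, cb_plane, cr_plane = split_channels([[(255, 255, 255), (0, 0, 0)]])
```

Each of the three planes could then be passed through the same transform and tokenization pipeline used for grayscale images.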

6.2 Support for generation of compositions of blobs

In many cases, one wishes to apply fine-grained control of compositional text-to-image generation, where certain locations in the image, marked perhaps with bounding boxes or ellipses, receive different textual descriptions (36). One possible method to accomplish this using the wavelet generative approach is to apply the transformer in evaluation mode and apply the vector representation of the blob’s textual prompt as described in Subsection 4.4 whenever the bit-plane scan is at indices of wavelet coefficients whose support intersects the blob.
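One way to picture the selection rule is the sketch below, which chooses a guidance vector per coefficient by testing whether the wavelet's support intersects a blob. Everything here is hypothetical: blobs are simplified to axis-aligned boxes `(row0, col0, row1, col1)` with exclusive upper bounds, and the support is modeled as a Haar-like square of side $2^{j}$ pixels at level $j$, whereas longer filters such as bior4.4 have wider supports.

```python
def wavelet_support(level, k1, k2):
    """Pixel box covered by a Haar-like wavelet at `level` with index (k1, k2)."""
    side = 2 ** level
    return (k1 * side, k2 * side, (k1 + 1) * side, (k2 + 1) * side)


def intersects(a, b):
    """True if two half-open boxes (row0, col0, row1, col1) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def guidance_at(level, k1, k2, blobs, default):
    """Guidance vector of the first blob meeting the support, else `default`."""
    support = wavelet_support(level, k1, k2)
    for box, vector in blobs:
        if intersects(support, box):
            return vector
    return default


# Two toy blobs on a 32 x 32 image, each with a toy guidance vector.
blobs = [((0, 0, 8, 8), [1.0, 0.0]), ((16, 16, 32, 32), [0.0, 1.0])]
global_guidance = [0.5, 0.5]
chosen = guidance_at(2, 0, 0, blobs, global_guidance)
```

How to combine guidance when a support meets several blobs is left open here; this sketch simply takes the first match.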

6.3 Multi-modal generation

The ability to represent an image’s visual information as a sequence of tokens presents an attractive possibility of merging the wavelet-based tokens with other language tokens to create a unified multi-modal transformer.

Funding

N. Sharon is partially supported by the NSF-BSF award 2019752. W. Mattar is partially supported by The Nehemia Levtzion Scholarship for Outstanding Doctoral Students from the Periphery (2023). N. Sharon and W. Mattar are partially supported by the DFG award 514588180.

References

  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Phuong and Hutter (2022) Mary Phuong and Marcus Hutter. Formal algorithms for transformers, 2022. URL https://arxiv.longhoe.net/abs/2207.09238.
  • Tay et al. (2022) Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
  • Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021.
  • Shapiro (1993) Jerome Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions in signal processing, 41(12):3445–3462, 1993.
  • Said and Pearlman (1996) Amir Said and William Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology, 6(3):243–250, 1996.
  • Taubman and Marcellin (2002) David Taubman and Michael Marcellin. JPEG2000: Image Compression Fundamentals, Standards and Practice, 2nd edition. Springer, 2002.
  • Daubechies (1992) Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992.
  • DeVore (1998) Ronald DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
  • Mallat (2009) Stephan Mallat. A Wavelet tour of signal processing, the sparse way. Academic Press, 2009.
  • Wallace (1992) G.K. Wallace. The jpeg still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992. doi: 10.1109/30.125072.
  • Mihcak et al. (1999) M Kivanc Mihcak, Igor Kozintsev, and Kannan Ramchandran. Spatially adaptive statistical modeling of wavelet image coefficients and its application to denoising. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), volume 6, pages 3253–3256. IEEE, 1999.
  • Kivanc Mihcak et al. (1999) M. Kivanc Mihcak, I. Kozintsev, K. Ramchandran, and P. Moulin. Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 6(12):300–303, 1999. doi: 10.1109/97.803428.
  • Buccigrossi and Simoncelli (1999) Robert W Buccigrossi and Eero P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE transactions on Image processing, 8(12):1688–1701, 1999.
  • Nichol et al. (2022) Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16784–16804. PMLR, July 2022. URL https://proceedings.mlr.press/v162/nichol22a.html.
  • Wang et al. (2023) Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
  • Lee et al. (2022) Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
  • Yu et al. (2024) Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. arXiv preprint arXiv:2406.07550, 2024.
  • Tian et al. (2024) Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
  • Zhu et al. (2023) Qing Zhu, Xiumei Li, Junmei Sun, and Huang Bai. Wdig: a wavelet domain image generation framework based on frequency domain optimization. EURASIP Journal on Advances in Signal Processing, 2023(1):66, 2023.
  • Yu et al. (2021) Yingchen Yu, Fangneng Zhan, Shijian Lu, Jianxiong Pan, Feiying Ma, Xuansong Xie, and Chunyan Miao. Wavefill: A wavelet-based generation network for image inpainting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14114–14123, 2021.
  • Phung et al. (2023) Hao Phung, Quan Dao, and Anh Tran. Wavelet diffusion models are fast and scalable image generators. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10199–10208, 2023. doi: 10.1109/CVPR52729.2023.00983.
  • Guth et al. (2022) Florentin Guth, Simon Coste, Valentin De Bortoli, and Stephane Mallat. Wavelet score-based generative modeling. Advances in Neural Information Processing Systems, 35:478–491, 2022.
  • Deng (2012) Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012.
  • Cerná and Finěk (2011) Dana Cerná and Václav Finěk. Discrete CDF 9 / 7 wavelet transform for finite-length signals, 2011. URL https://api.semanticscholar.org/CorpusID:208013335.
  • Chung et al. (2024) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, ** Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024. URL http://jmlr.org/papers/v25/23-0870.html.
  • Sanh (2019) V Sanh. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. In Proceedings of Thirty-third Conference on Neural Information Processing Systems (NIPS2019), 2019.
  • (34) Huggingface Repository. HuggingFace DistilGPT2. https://huggingface.co/distilbert/distilgpt2, 2019. Accessed: 2024-06-27.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Nie et al. (2024) Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, and Arash Vahdat. Compositional text-to-image generation with dense blob representations, 2024.
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.