Wavelets Are All You Need for Autoregressive Image Generation
Abstract. In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows us to tokenize the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is redesigned and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
1 Introduction
The generation of high-resolution visual information is certainly one of the most remarkable achievements of modern-age artificial intelligence. One of the prominent methods is diffusion-based models (1, 2, 3, 4, 5). In essence, diffusion models attempt to learn inversions of ill-posed operators, such as additive Gaussian noise, blurring, etc., so an image may be generated from random noisy or blurry seeds. One then enforces various conditions on images created through the time steps of the inversion process so that the final generated image may correspond to a given text prompt.
Another line of research is designing autoregressive models, in an attempt to leverage powerful Large Language Models (LLMs) (6, 7, 8). These models (9, 10) provide different methods to convert the image pixel representation to a series of visual tokens and then apply generative language techniques.
In this paper, we refine this line of research and provide a mathematically robust approach to the autoregressive image generation process. To this end, we reach out to a classic technique in image processing, specifically, wavelet image coding (11, 12, 13). Wavelets (14, 15, 16) are one of the main tools of modern approximation theory for nonlinear, adaptive approximation. The various wavelet transforms provide the means to transform an image into a representation that captures the essence of the visual information in a sparse way. Typically, the significant wavelet coefficients are a small fraction of the coefficients and represent important edge and texture information, while the insignificant coefficients with small absolute values are associated with smooth regions of the image. The goal of wavelet image compression is then to efficiently store the information of only the significant coefficients. In fact, the underlying method of the popular JPEG image compression algorithm (17), invented in the 80s, contains many elements of wavelet coding, where a local Discrete Cosine Transform, a precursor of wavelets, is used. However, in this paper, we leverage the progressive wavelet compression technique, a more advanced form of image compression. It creates a bit-stream where every bit corresponds to the next most important piece of visual information. Since we are generating images rather than decoding them from a compressed file, there is no need to create actual binary bit-streams, and using a ‘wavelet language’ of a limited number of tokens is sufficient.
Thus, our new approach to autoregressive image generation is based on two main ingredients. The first is progressive wavelet image coding, which allows us to tokenize the visual information of an image from coarse to fine details. This is achieved using only 7 tokens, by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second ingredient is a variant of an NLP decoder-only transformer (6, 7, 8) whose architecture was redesigned and optimized for token sequences in this ‘wavelet language’. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions (18, 19, 20). During inference, this allows the generation of visually meaningful images from an initial random seed sampled from the distribution of the scaling function coefficients at the lowest resolution.
Using the wavelet autoregressive approach, where the ‘wavelet language’ contains only 7 tokens, provides many attractive features. The length of the token sequences during training or inference can be flexible, where longer sequences imply more detailed or higher-resolution images. Guiding the generative process using a class affiliation or text prompting is easily achieved by concatenating the corresponding vector representations to the tokens’ vector representation of dimension 7. Stochastic control using simple transformer inference techniques allows us to create a diversity of corresponding images from one textual prompt. Furthermore, since each token is associated with the local support in the image domain of the corresponding wavelet, one can switch the guidance during the generative process to allow different prompting for different regions.
Our paper is organized as follows. We begin with a review of related work in Section 2. In Section 3, we review wavelet image coding and explain how one may extract from the classical theory the ability to tokenize the visual information of images. In Section 4, we focus on components of language transformers that we redesign to serve our special wavelet language. We further provide several methods that can be used to direct the generation process under certain conditions: class label and/or textual prompt. In Section 5, we provide experimental results. Finally, in Section 6, we discuss possible future applications of our method, such as multi-modality generation and compositions of blobs.
2 Related work
In this section, we first review the current state of the art in image generation. We then review some methods that apply wavelets as a frequency decomposition backbone for various aspects of style transfer, acceleration, and optimization of existing image generation methods, etc.
Currently, many commercial solutions apply diffusion-based models (1, 2, 3, 4, 5, 21). In essence, diffusion models learn inversions of ill-posed operators, such as additive Gaussian noise, blurring, etc., so images may be generated from random noisy or blurry seeds. One then enforces various conditions on images created through the time steps of the inversion process so that the final generated image may correspond to a given text prompt.
The methods of VQGAN (9) and DALL-E (10) along with (22, 23) utilize a visual tokenizer to discretize images into grids of 2D tokens, which are then flattened to a 1D sequence for autoregressive learning, mirroring the process of sequential language modeling. For example, in (10) a discrete variational autoencoder is trained to compress each RGB image into a grid of image tokens, where each such token can assume 8192 possible values. This creates a relatively short context sequence of tokens, but with a vocabulary of 8192 word tokens. The TiTok method, recently introduced in (24), shows how to compose Vision Transformers with the Vector-Quantization method to arrive at an autoregressive method that may use only 32 tokens. In comparison, our method uses only 7 tokens for any image resolution and any level of fine detail generation.
In contrast to the raster-scan “next-token prediction”, the method of (25) provides autoregressive learning on images as “next-scale prediction” or “next-resolution prediction”.
Some methods, such as (26, 27), use wavelets as means for frequency decomposition representations for image inpainting, style transfer, and generative adversarial network methods. Some works propose to use wavelets as part of diffusion methods (28, 29) to speed up the diffusion approach by applying the denoising process in the wavelet regime.
3 Elements of Wavelet Image Coding
In this section, we review some elements of wavelet image coding (11, 12, 13) that we use for our generative method. Essentially, we are interested in the process that takes an image in its raw pixel form as input and generates a sequence of tokens that capture its visual details. The structure of the sequence from coarse to fine details is achieved by ordering the information starting with the most significant bits of the most significant wavelet coefficients. As in wavelet coding, we also aim to create token sequences that are as short as possible. This creates shorter contexts for the transformer decoder and improves its performance. However, in our generative application, we are not concerned with efficient encoding of the token sequence to a compressed stream of bits. This is a subtopic of information theory and is typically implemented using arithmetic encoders.
3.1 Wavelet Transforms
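A univariate wavelet system (14, 16) is a family of real functions $\{\psi_{j,k}\}_{(j,k)\in\mathbb{Z}^2}$ in $L_2(\mathbb{R})$, built by dilating and translating a unique mother wavelet function $\psi$,
$$\psi_{j,k}(x) := 2^{-j/2}\,\psi\left(2^{-j}x - k\right),$$
where the mother wavelet typically has compact support (or fast decay) and has $r$ vanishing moments
$$\int_{\mathbb{R}} x^{m}\,\psi(x)\,dx = 0, \qquad m = 0,\dots,r-1. \tag{1}$$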
Wavelet systems can be constructed to serve a basis of . To facilitate applications, one then also constructs a dual of , where , so that for each ,
For special choices of , the set forms an orthonormal basis for and then, .
Usually, one starts the construction of a wavelet system from a Multi-Resolution Analysis (MRA) generated by a scaling function that satisfies a two-scale equation
One then sets
which implies (under certain mild conditions)
Again, to facilitate applications, one may also construct a dual of , where , so that for each ,
Equipped with the MRA, one then proceeds to construct the wavelet such that , with . A classic example for an orthonormal MRA and wavelet system where and , are the Haar scaling function and Haar wavelet
The bivariate Haar system (see below) is a good choice when working with piecewise constant images, such as the MNIST handwritten digits (30). For some of our experiments, we use a famous wavelet system from the Cohen–Daubechies–Feauveau (CDF) family of wavelets (14), which is sometimes termed bior4.4 ($r=4$ in (1)) or [9,7] in the signal processing community (the supports of the scaling functions and wavelets, as well as the lengths of the associated filters, are 9 and 7). The generating functions of the bior4.4 are depicted in Figure 1.
The wavelet model can be easily generalized to any dimension, via a tensor product of the wavelet and the scaling functions. Assume that the univariate dual scaling functions and dual wavelets , are given. Then, a wavelet bivariate basis is constructed using three types of basic wavelets
The bivariate wavelet transform of , in terms of the bivariate wavelet tensor basis
is then
The bivariate wavelet decomposition can thus be interpreted as a signal decomposition into a set of three spatially oriented frequency subbands: the first type of basic wavelet detects horizontal edges, the second detects vertical edges, and the third detects diagonal edges.
Under the assumption that and are compactly supported (or have fast decay), a wavelet coefficient at a scale represents the information about the function in the spatial region in the neighborhood of . At the next finer scale , the information about this region is represented by four wavelet coefficients, which are described as the children of . This leads to a natural tree structure organized in a quad tree structure of each of the three subband types as shown in Figure 2. As decreases, the child coefficients add finer and finer details into the spatial regions occupied by their ancestors.
In image processing, one uses the Discrete Wavelet Transform (DWT). It works by initially assuming that the image pixels are good approximants of the projections on the shifts of the dual scaling function with the underlying function
With these coefficients as input, one uses the DWT to compute coefficients down to some predefined low-resolution . For simplicity, we may assume that and that we use the DWT to compute
(2) |
Wavelet representations are considered very efficient for image compression (11, 12, 13). The edge information typically constitutes a small portion of a typical image, while the dual wavelet coefficients have a large absolute value only if edges intersect the support of the corresponding dual wavelets. Consequently, the image can be approximated well using a few significant wavelet coefficients. A clear statistical structure also follows: large/small values of wavelet coefficients tend to propagate through the scales of the quadtrees depicted in Figure 2. As an example, a sparse wavelet representation of a fishing boat image and a compressed version of it are shown in Figure 3, where the compression algorithm JPEG2000 is based on the sparse representation. The Figure clearly depicts that the significant wavelet coefficients (coefficients with relatively large absolute values) are located on strong edges of the image.
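The sparsity claim can be illustrated with a minimal sketch. It assumes a single-level orthonormal 2D Haar transform written directly in NumPy; the function names `haar_dwt2` and `haar_idwt2` and the synthetic step-edge image are illustrative choices, not the implementation used in our experiments.

```python
import numpy as np

def haar_dwt2(img):
    """One level of an orthonormal 2D Haar transform (both image sides must be even)."""
    a = np.asarray(img, dtype=float)
    s = np.sqrt(2.0)
    # Filter along each row: pairwise sums (low-pass) and differences (high-pass).
    lo = (a[:, 0::2] + a[:, 1::2]) / s
    hi = (a[:, 0::2] - a[:, 1::2]) / s
    # Filter along each column of both intermediate results.
    ll = (lo[0::2] + lo[1::2]) / s   # low-resolution (scaling function) coefficients
    lh = (lo[0::2] - lo[1::2]) / s   # detail: differences across rows
    hl = (hi[0::2] + hi[1::2]) / s   # detail: differences across columns
    hh = (hi[0::2] - hi[1::2]) / s   # detail: differences in both directions
    return ll, (lh, hl, hh)

def haar_idwt2(ll, details):
    """Exact inverse of haar_dwt2."""
    lh, hl, hh = details
    s = np.sqrt(2.0)
    lo = np.empty((2 * ll.shape[0], ll.shape[1]))
    hi = np.empty_like(lo)
    lo[0::2], lo[1::2] = (ll + lh) / s, (ll - lh) / s
    hi[0::2], hi[1::2] = (hl + hh) / s, (hl - hh) / s
    out = np.empty((lo.shape[0], 2 * lo.shape[1]))
    out[:, 0::2], out[:, 1::2] = (lo + hi) / s, (lo - hi) / s
    return out

# Synthetic piecewise-constant image with a single vertical step edge.
image = np.zeros((8, 8))
image[:, 3:] = 1.0
ll, details = haar_dwt2(image)
nonzero_details = sum(int(np.count_nonzero(np.abs(d) > 1e-12)) for d in details)
total_details = sum(d.size for d in details)
print(nonzero_details, "of", total_details, "detail coefficients are nonzero")
```

Only the detail coefficients whose support crosses the edge are nonzero; all others vanish, which is the structure the tokenization of Subsection 3.2 exploits.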
3.2 Embedded Wavelet Tokenization
The sparse wavelet representation (2) of an image provides the perfect infrastructure for the generation of embedded coding representations (11, 12, 13). Embedded coding is similar in spirit to binary finite precision representations of real numbers. For each digit added to the right, more precision is added. Yet, the “encoding” can cease at any time and provide the “best” representation of the real number achievable within the framework of the binary digit representation. Similarly, the embedded coder can cease at any time and provide the “best” representation of an image achievable within its framework. Embedded coding streams can be generated from wavelet representations by ordering the information on the wavelet representation starting with the most significant bits of the most significant coefficients, that is, the coefficients with the largest absolute value.
First, for simplicity of notation, using (2), denote for ,
We also map the coefficients for , , based on their location , , in the coefficient matrix
We note in passing that one may assume that the low resolution scaling function coefficients from (2) are known during training and are randomly sampled from some distribution during image generation and therefore need not be part of the tokenization.
3.2.1 Encoding a wavelet representation into a token sequence
We now show how to process the numeric representation of the coefficients, from most significant to least significant and ‘encode’ it into a relatively compact series of tokens. The representation using the series of tokens should be invertible. That is, one can ‘decode’ the sequence back to the wavelet representation.
To this end, assuming the image pixels are normalized to the range , one can show that for an image of dyadic dimension
(3) |
Assuming for simplicity that all images of a given dataset have the same dyadic dimensions , then this bound holds for all of their wavelet representations. Our first option is to initialize a threshold and begin scanning the wavelet coefficients of the image, in a predetermined order (see below) for significance, with the goal of reporting only those coefficients for which the following holds
Our second option is to compute separately for each image in the dataset
(4) |
and then initialize for the specific image . In this scenario, we store and use the parameter for each image in the training set along with its sequence of tokens.
We also maintain a matrix of approximated wavelet coefficients, which we initialize with zeros. Once we complete the processing at a given bit-plane with threshold $T$, we update $T \leftarrow T/2$ and repeat the process. At each bit-plane we report the significant coefficients that were just uncovered in this bit-plane using a token ‘NowSignificantNeg’ if the coefficient is negative or a token ‘NowSignificantPos’ if it is positive. At the time of uncovering, we modify the approximation of the coefficient to have the absolute value $3T/2$, with the reported sign. Next, we add a token to represent the coefficient’s next significant bit: ‘NextAccuracy0’ if the coefficient $c$ satisfies $|c| < 3T/2$, or ‘NextAccuracy1’ if $|c| \ge 3T/2$. The approximation is updated accordingly by subtracting or adding $T/4$ (depending on the sign of the coefficient and the accuracy bit type).
Let us demonstrate with an example. Assume $T=16$ and a negative coefficient $c$ with $16 \le |c| < 24$. Therefore, the coefficient is first uncovered in the current bit-plane. When the scan arrives at its index, we report a ‘NowSignificantNeg’ token for this coefficient, providing it with a temporary approximation $-24$, which lies in the middle of the segment $[-32,-16]$. Next, since in fact $|c| < 24$, we report a token ‘NextAccuracy0’ to represent the coefficient’s next significant bit, providing an updated approximation $-20$, which lies in the middle of the segment $[-24,-16]$, leading to a better approximation of the ground truth value.
In case a coefficient has been uncovered in any of the previous bit-planes and is already known to be significant, we only add one of the tokens: ‘NextAccuracy0’ if its absolute value is below that of its current approximation, or ‘NextAccuracy1’ otherwise. We then update the approximation by subtracting or adding $T/4$ (depending on the sign of the coefficient and the accuracy bit type).
Assuming the bit-plane scanning order of the coefficients is fixed, one then only needs to add the token ‘Insignificant’ to provide a valid invertible tokenization process. One simply scans the coefficients in the fixed order and uses their true known value to test and apply one of three possibilities:
(i) $|c| \ge 2T$, where $c$ denotes the coefficient and $T$ the threshold of the current bit-plane: The coefficient has already been reported as significant in a previous bit-plane. Therefore one reports the token ‘NextAccuracy0’ or ‘NextAccuracy1’, depending on whether its absolute value lies below its current approximation.
(ii) $T \le |c| < 2T$: First report the token ‘NowSignificantNeg’ or ‘NowSignificantPos’ depending on the sign and then report the token ‘NextAccuracy0’ or ‘NextAccuracy1’.
(iii) $|c| < T$: Report ‘Insignificant’.
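The three cases above can be sketched as a short encoder. This is a simplified illustration, not our implementation: it assumes a flat list of coefficients scanned in index order, a threshold that halves after each bit-plane, a first approximation of magnitude $3T/2$ refined by $\pm T/4$, and an initial threshold equal to the largest power of two not exceeding the largest absolute coefficient. The function names are hypothetical.

```python
import numpy as np

def initial_threshold(coeffs):
    """Largest power of two <= max |c| (assumes at least one nonzero coefficient)."""
    return 2.0 ** np.floor(np.log2(np.max(np.abs(coeffs))))

def encode_bitplanes(coeffs, t0, num_planes):
    """Tokenize coefficients with the five per-coefficient tokens (no group tokens)."""
    tokens = []
    significant = [False] * len(coeffs)
    magnitude = [0.0] * len(coeffs)  # absolute value of the running approximation
    t = t0
    for _ in range(num_planes):
        for i, c in enumerate(coeffs):
            if not significant[i]:
                if abs(c) < t:                       # case (iii)
                    tokens.append("Insignificant")
                    continue
                significant[i] = True                # case (ii)
                tokens.append("NowSignificantPos" if c > 0 else "NowSignificantNeg")
                magnitude[i] = 1.5 * t               # middle of [t, 2t)
            # Accuracy bit: cases (i) and (ii) both end with one refinement token.
            if abs(c) >= magnitude[i]:
                tokens.append("NextAccuracy1")
                magnitude[i] += t / 4
            else:
                tokens.append("NextAccuracy0")
                magnitude[i] -= t / 4
        t /= 2                                       # move to the next bit-plane
    approx = [m if c >= 0 else -m for m, c in zip(magnitude, coeffs)]
    return tokens, approx

coeffs = [-18.0, 3.0, 25.0]
t0 = initial_threshold(coeffs)
tokens, approx = encode_bitplanes(coeffs, t0, num_planes=2)
print(t0, tokens, approx)
```

Under these assumptions, after a bit-plane with threshold $T$ every significant coefficient is known to within $T/4$ and every insignificant coefficient has absolute value below $T$, so truncating the scan after any bit-plane still yields a usable approximation.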
The process described above, although completely sufficient for invertible tokenization, potentially creates long sequences. Specifically, it does not take into consideration the local correlations among ‘neighboring’ insignificant wavelet coefficients. Due to the sparsity property of the wavelet transform, during the scanning process, many of the ‘Insignificant’ coefficients form local groups. Moreover, there are correlations between local groups of insignificant coefficients of the same subband type across resolutions in the manner of the quad-tree structure of Figure 2. Image compression algorithms such as the EZW (11) or SPIHT (12), are based on statistical zero tree models that try to capture these correlations across resolutions. As we shall later see, for image generation, we actually rely on the powerful capabilities of the transformer models to learn correlation patterns of the ‘wavelet language’ of the given dataset. However, we do ‘ease the burden’ off the transformers significantly by utilizing the structure of the groups of insignificant coefficients to reduce the size of the token sequences, thereby creating shorter contexts.
To this end, we add two additional tokens for groups of insignificant coefficients, ‘Group4x4’ and ‘Group2x2’, and modify the scanning process to visit the coefficients based on groups of $4\times 4$. The first token is used in locations where the scan is at an index whose two coordinates are both divisible by 4. If at the current bit-plane all 16 coefficients of the $4\times 4$ group starting at that index are still insignificant, we issue the token ‘Group4x4’ and the tokenization process continues to the next group of coefficients. However, if any of the coefficients of the group becomes significant in the current bit-plane, the group breaks down into 4 groups of $2\times 2$. If a group of $2\times 2$ is still composed of insignificant coefficients at the current bit-plane, we add a token ‘Group2x2’. If a group of $2\times 2$ breaks down, then each coefficient from the group is reported individually as being ‘Insignificant’ or one of ‘NowSignificantNeg’, ‘NowSignificantPos’. The scanning process keeps track of which groups broke up, so that only necessary and informative tokens are generated. We summarize the seven tokens and their roles below.
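(i) ‘Group4x4’ – At an index whose two coordinates are both divisible by 4, the 16 coefficients of the $4\times 4$ group starting at that index are all still insignificant, that is, each has absolute value below the current threshold.
(ii) ‘Group2x2’ – At an index whose two coordinates are both divisible by 2, the 4 coefficients of the $2\times 2$ group starting at that index are all still insignificant, that is, each has absolute value below the current threshold.
(iii) ‘NowSignificantNeg’, ‘NowSignificantPos’ – At the current location, the coefficient becomes significant at the current bit-plane, that is, its absolute value is at least the current threshold $T$ and below $2T$. If the coefficient was part of a group of insignificant coefficients at the previous bit-plane, the group is now automatically dissolved.
(iv) ‘Insignificant’ – At the current location, the coefficient is still insignificant, that is, its absolute value is below the current threshold $T$. If the coefficient was part of a group of insignificant coefficients at the previous bit-plane, the group is now automatically dissolved.
(v) ‘NextAccuracy0’, ‘NextAccuracy1’ – At the current location, the coefficient has already been reported to be significant. Here, we improve the accuracy of its approximation using one of these tokens, depending on whether its absolute value lies below its current approximation.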
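The role of the group tokens can be sketched as follows. This is an illustrative simplification under stated assumptions: a single coefficient matrix whose sides are divisible by 4 stands in for the full multi-resolution subband scan, a dissolved group stays dissolved in all later bit-planes, and a newly significant coefficient is immediately followed by its accuracy token. The names `make_state`, `code_coefficient` and `scan_bitplane` are hypothetical.

```python
import numpy as np

def make_state(shape):
    """Bookkeeping carried across bit-planes for one coefficient matrix."""
    rows, cols = shape
    return {"sig": np.zeros(shape, dtype=bool),       # already significant?
            "approx": np.zeros(shape),                # magnitude approximations
            "open4": np.ones((rows // 4, cols // 4), dtype=bool),  # intact 4x4 groups
            "open2": np.ones((rows // 2, cols // 2), dtype=bool)}  # intact 2x2 groups

def code_coefficient(value, t, state, idx):
    """Tokens for a single coefficient at threshold t."""
    sig, approx = state["sig"], state["approx"]
    out = []
    if not sig[idx]:
        if abs(value) < t:
            return ["Insignificant"]
        sig[idx] = True
        approx[idx] = 1.5 * t
        out.append("NowSignificantPos" if value > 0 else "NowSignificantNeg")
    bit = abs(value) >= approx[idx]
    approx[idx] += t / 4 if bit else -t / 4
    out.append("NextAccuracy1" if bit else "NextAccuracy0")
    return out

def scan_bitplane(c, t, state):
    """Emit the tokens of one bit-plane, visiting the matrix in 4x4 blocks."""
    tokens = []
    for r4 in range(0, c.shape[0], 4):
        for c4 in range(0, c.shape[1], 4):
            block = np.abs(c[r4:r4 + 4, c4:c4 + 4])
            if state["open4"][r4 // 4, c4 // 4] and np.all(block < t):
                tokens.append("Group4x4")
                continue
            state["open4"][r4 // 4, c4 // 4] = False  # the 4x4 group dissolves
            for r2 in range(r4, r4 + 4, 2):
                for c2 in range(c4, c4 + 4, 2):
                    sub = np.abs(c[r2:r2 + 2, c2:c2 + 2])
                    if state["open2"][r2 // 2, c2 // 2] and np.all(sub < t):
                        tokens.append("Group2x2")
                        continue
                    state["open2"][r2 // 2, c2 // 2] = False  # the 2x2 group dissolves
                    for r in range(r2, r2 + 2):
                        for k in range(c2, c2 + 2):
                            tokens.extend(code_coefficient(c[r, k], t, state, (r, k)))
    return tokens

coeffs = np.ones((8, 8))
coeffs[0, 2] = 20.0
state = make_state(coeffs.shape)
first_plane = scan_bitplane(coeffs, 16.0, state)
print(len(first_plane), "tokens instead of", coeffs.size + 1)
```

In this toy case a single significant coefficient costs 11 tokens for the whole $8\times 8$ matrix, instead of one token per coefficient.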
The bit-plane scan is carried out in two nested loops; the outer loop proceeds from low resolution to high resolution, each time traversing the three types of wavelet subbands. The inner loop traverses the blocks. Figure 4 illustrates the outer and inner scanning patterns.
Figure 5 exemplifies the tokenization algorithm of an image from the MNIST dataset (30). The image was padded with zeros to be of dimensions , with . The bottom row of the figure shows the tokens and their locations on the wavelet image for the first three bit planes. To make the process clearer, we explicitly write the resulted sequence of tokens for the first bit plane shown in Figure 5(d).
‘Insignificant’, ‘Insignificant’, ‘NowSignificantPos’, ‘Insignificant’,
‘Insignificant’, ‘Insignificant’, ‘NowSignificantNeg’, ‘Insignificant’,
‘Group2x2’, ‘Group2x2’,
‘Insignificant’, ‘Insignificant’, ‘Insignificant’, ‘NowSignificantNeg’,
‘Group2x2’, ‘Group2x2’, ‘Group2x2’
The token sequences of the second and third bit-planes follow the same scanning pattern. Eventually, the three sequences are concatenated in the natural order to form the final sequence which describes the three bit planes wavelet image appearing in Figure 5(c).
One very important hyper-parameter is the choice of the smallest threshold at the final bit-plane. Through this hyper-parameter, the wavelet representation provides us with a very robust and stable trade-off between fine detail generation and the length of token sequences. Choosing a final threshold provides very consistent control over visual quality levels such as “Visually Lossless”, “High”, “Medium”, “Low”, etc. This is on par with the quality settings in digital cameras, which in turn lead to a selection of the corresponding quantization tables of the JPEG algorithm generating the compressed images.
3.2.2 Decoding the token sequence into an approximate wavelet representation
The tokenization process described in the previous subsection can be easily inverted back to an approximate wavelet representation. Moreover, any initial sub-sequence can be inverted to provide a possibly coarser approximation. We initialize a matrix of size of the approximated wavelet coefficients with zeros and begin the scanning process with the first bit-plane. Based on (3) or (4), we know how to initialize the first bit-plane with the initial threshold or . We then process the token sequence and update the approximated coefficients using the corresponding ‘significant’ and ‘bit accuracy’ tokens. If for any given reason, the sequence of tokens terminates, we have the best possible approximated coefficients from which we can obtain an approximated image by applying the inverse DWT. Our decoding process relies on the assumption that the token sequence is valid. For example, a ‘Group4x4’ token cannot appear while the decoder scan position is at a location of indices not divisible by 4. It is obvious how to achieve this in the context of image coding. However, during an image generation process, this needs to be enforced using the conditional next-token inference described in Subsection 4.6.
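A minimal sketch of the inversion, under simplifying assumptions: the stream uses only the five per-coefficient tokens (no group tokens), the coefficients form a flat list scanned in index order, the threshold halves after each bit-plane, and a newly significant coefficient starts at magnitude $3T/2$ and is refined by $\pm T/4$. The function name `decode_tokens` is hypothetical. Any prefix of a valid stream decodes to a coarser approximation, and an invalid token at the current scan position raises an error.

```python
def decode_tokens(tokens, num_coeffs, t0):
    """Invert a (possibly truncated) token stream into approximate coefficients."""
    approx = [0.0] * num_coeffs
    sign = [0] * num_coeffs        # 0 means "not yet significant"
    t, i = t0, 0                   # current threshold and scan position
    for tok in tokens:
        if sign[i] == 0:
            if tok in ("NowSignificantPos", "NowSignificantNeg"):
                sign[i] = 1 if tok == "NowSignificantPos" else -1
                approx[i] = sign[i] * 1.5 * t
                continue           # stay here: the accuracy token comes next
            if tok != "Insignificant":
                raise ValueError(f"token {tok!r} is not valid at position {i}")
        else:
            if tok == "NextAccuracy1":
                approx[i] += sign[i] * t / 4
            elif tok == "NextAccuracy0":
                approx[i] -= sign[i] * t / 4
            else:
                raise ValueError(f"token {tok!r} is not valid at position {i}")
        i += 1
        if i == num_coeffs:        # bit-plane finished: halve the threshold
            i, t = 0, t / 2
    return approx

stream = ["NowSignificantNeg", "NextAccuracy0", "Insignificant",
          "NowSignificantPos", "NextAccuracy1",
          "NextAccuracy0", "Insignificant", "NextAccuracy0"]
print(decode_tokens(stream[:2], num_coeffs=3, t0=16.0))  # coarse: one coefficient known
print(decode_tokens(stream, num_coeffs=3, t0=16.0))      # two full bit-planes
```

The approximate image is then obtained by applying the inverse DWT to the decoded coefficients.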
4 The Generative Wavelet Transformer
Assume that for a certain dataset of images, we have established the translation of the visual information of each image to a sequence of tokens encapsulating the visual information from coarse to fine details as explained in Subsection 3.2. We assume that within the sequences, distinct patterns and relations exist between the tokens. For example, the wavelet coefficients of wavelets whose support intersects with a certain portion of an edge of the image, will be significant and aligned across scales in a tree-like structure as per Figures 2 and 3(b). At the same time, coefficients of wavelets whose support intersects with a smooth area of the image will be insignificant and they appear in local groups. This leads to the intuition that the powerful transformers created over the last few years (7) are able to learn the patterns of the ‘wavelet language’ and to generate them from some random seeds during inference.
Interestingly, as we experienced through empirical experimentation, one can actually take off-the-shelf pre-trained transformers such as T5 (32) or DistilBert (33) and fine-tune them to our purpose of learning the wavelet language with no additional modifications and with some reasonable generative results. That is, even though the pre-trained models were trained on a set of more than 20,000 tokens and on text datasets, one is able to fine-tune them on wavelet-based token sequences of images, containing only 7 tokens. However, some classical components of transformers are not aligned with the ‘wavelet language’ and should be replaced. At the same time, there are some very useful components and techniques that come with language models that can be also leveraged successfully in the context of image generation.
In this section, we describe how we modified the architecture of the DistilGPT2 model (33) (we used the code from HuggingFace (34) as a starting point) to optimize it to align with the wavelet-based image generation method. This obviously requires training the modified model from scratch.
4.1 Token vector representation
Typically, in the standard scenarios of spoken languages, transformers apply a ‘pre-processing’ learnable transform to tokens to convert them to vector representations. The idea is that similar words should be converted to vectors with some proximity, which intuitively serve as better input for the transformer’s neural network. However, as explained in Subsection 3.2.1, our wavelet dictionary includes only seven tokens that have very distinctive and different roles. Therefore, the simple transformation of the tokens to the one-hot encoding of the standard basis of dimension 7 is probably a better, if not optimal, choice. Thus, the initial vector representation of each token is the corresponding standard basis vector of $\mathbb{R}^7$. This means that in the transformers we modified, we removed the ‘token vector’ learnable transformation.
4.2 Initial bit-plane threshold
Recall that we have two options: to use a uniform initial bit-plane threshold for all images in the dataset derived from (3), or to use an adaptive initial threshold for each image of the training set using (4). In the latter case, we need to inform the transformer, per image, which initial threshold the token sequence is associated with. We do this as follows: assume a given dataset has a finite number of possible values for the initial threshold in (4) (for the MNIST dataset, see Figure 6). Then, we concatenate a one-hot encoding of the initial threshold of the given image, whose dimension equals that number of possible values, to the vector representation of each token.
For image generation, one may sample randomly from the distribution of possible initial thresholds. In the case that the image generation is conditioned on a certain class (see Subsection 4.4), one can sample from the conditional distribution of the possible thresholds of the specific class.
4.3 Positional encoding
In classic transformer architectures (7), one adds a positional encoding of the token’s position to the token’s vector representation. Learned positional embeddings apply a learned transform to the position, while some transformers use a hard-coded mapping of the position. Assuming the vector embedding dimension is and the maximum length of a sequence is , then
In our scenario of the wavelet language, the position of a token in a sequence is , where is the enumeration of the bit-plane and is the index of the current coefficient in the scan order. Therefore, we concatenate to the vector representation of a token from Subsection 4.1, a vector component of dimension 3 with the location of the token .
4.4 Generative guidance
It is obviously critical for any image generation method to allow guidance of the generative process by placing a condition on the class type of the generated image or a text prompt that describes it. Some image generation models apply a joint embedding space for text and images for this purpose. One such method is to use a pretrained model such as CLIP (35) that maps text and images to a joint embedding space. CLIP contains an image encoder and a caption encoder that, during training over pairs of images with captions, optimize a contrastive cross-entropy loss that encourages high dot-products in the joint embedding space. Thus, any image generation method can use the vector embedding of the given text prompt to guide the generative process by conditioning the image embedding to be highly correlated with the embedding of the textual prompt.
In our case, since we converted the problem of image generation to a ‘wavelet-language’ generation, we can apply ‘text’-type prompting methods. Having access to a joint embedding text-image space allows us to train using the vector representation of the image training set. Then, at image generation, we use the vector representation of the given text prompt to guide the generative process. There are very simple ways of using these vector representations. We choose to concatenate them to the vector representation of each token and its position (as explained above). For example, as shown in Section 5, for the image datasets MNIST or FashionMNIST with 10 classes, it is easy to concatenate a vector of length 10 representing the class of the image. In the case where we wish to guide the generative process using a textual prompt, we may concatenate the CLIP vector embedding (35) of the textual prompt to each token vector representation. As we discuss in Subsection 6.2, we hope this approach to guiding the generative process can be generalized to composition of blobs (36), where a given guiding vector of a blob is used only at positions of the scan where the support of the corresponding wavelet intersects the blob.
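The concatenation scheme can be sketched as follows for class guidance with 10 classes. This is a minimal illustration, assuming the position is given as the triple (bit-plane, row, column) and that the token and class identifiers are integer indices in an arbitrary fixed order; the function name `token_input` and the ordering of the concatenated parts are hypothetical.

```python
import numpy as np

def token_input(token_id, bitplane, row, col, class_id,
                num_tokens=7, num_classes=10):
    """Build one transformer input vector by concatenation."""
    token_vec = np.eye(num_tokens)[token_id]                 # one-hot token, dimension 7
    position = np.array([bitplane, row, col], dtype=float)   # location, dimension 3
    guidance = np.eye(num_classes)[class_id]                 # one-hot class, dimension 10
    return np.concatenate([token_vec, position, guidance])

vec = token_input(token_id=2, bitplane=0, row=4, col=8, class_id=3)
print(vec.shape)
```

For text guidance, the class vector would be replaced by the embedding of the prompt in the joint text-image space, which changes only the dimension of the last part.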
4.5 Initialization of the generative process
Since the guidance of the generative process (Subsection 4.4) is applied through the concatenation of vector representations to each token vector representation, in some cases, the initialization becomes a minor issue. For example, when training on MNIST and generating digits, one can get away with a simple random choice from the subset ‘Insignificant’, ‘NowSignificantNeg’ or ‘NowSignificantPos’ for the first token, and from there the transformer will generate a valid token sequence which is converted to an adequate image of a digit from the pre-selected class.
A more robust method is as follows. Suppose we wish to generate a handwritten digit from a certain digit class. Let be the subset of MNIST images from that specific digit class and let
(5) |
be the subset of low-resolution coefficients of these images defined by (2). We fit a four-dimensional normal distribution to the subset (5) and then sample from it a random group of four low-resolution coefficients. Now, the token representation of these coefficients can serve as a basis for a robust initialization of the generative process of the required digit. In the case where the guidance is provided by a vector representation of some text prompt, one can create the normal distribution using a subset of $k$-nearest neighbors in the image vector representation space.
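A minimal sketch of this seeding step, assuming the low-resolution coefficients of the images of one class have been collected into an array with one row of four values per image. The synthetic array below is a stand-in for real data, and the function name `sample_seed` is hypothetical.

```python
import numpy as np

def sample_seed(low_res_coeffs, rng, num_samples=1):
    """Fit a normal distribution to rows of low-resolution coefficients and sample from it."""
    mean = low_res_coeffs.mean(axis=0)
    cov = np.cov(low_res_coeffs, rowvar=False)   # 4x4 empirical covariance
    return rng.multivariate_normal(mean, cov, size=num_samples)

rng = np.random.default_rng(0)
# Stand-in for the four low-resolution coefficients of 500 images of one class.
class_coeffs = rng.normal(loc=[2.0, 1.0, 0.5, 1.5], scale=0.3, size=(500, 4))
seeds = sample_seed(class_coeffs, rng, num_samples=3)
print(seeds.shape)
```

Each sampled row is one candidate group of four low-resolution coefficients from which a generation run can start.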
Once some random seeding allows us to initialize the token sequence, we may introduce as much diversity as required using the methods of Subsection 4.7 so that even using the same seed may generate various images corresponding to the given guidance.
4.6 Conditional next token inference
In Greedy generative mode, the next token selected at the current location of the scan is the token to which the transformer assigns the highest probability. As described in Subsection 4.7 below, there are various alternative methods to control the output of the transformer. However, since each generative token inference step is a statistical event, the next predicted token may not be valid at the current position of the wavelet bit-plane scan. To overcome this, we ensure that any selected token satisfies the conditions below, which relate to the context and the current position in the scan. For example, when using the Greedy method, one simply picks, from the subset of tokens satisfying the conditions below, the one with the highest probability.
(i) ‘Group4x4’ - The scan is at an index and the group has not yet dissolved.
(ii) ‘Group2x2’ - The scan is at an index and the group has not yet dissolved.
(iii) ‘NowSignificantNeg’, ‘NowSignificantPos’ - At the current location, the coefficient is still insignificant, possibly as part of a group of insignificant coefficients.
(iv) ‘Insignificant’ - At the current location, the coefficient is still insignificant, possibly as part of a group of insignificant coefficients.
(v) ‘NextAccuracy0’, ‘NextAccuracy1’ - At the current location, the coefficient has already been reported to be significant.
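A minimal sketch of greedy selection restricted to valid tokens is given below. It assumes the transformer output is available as a dictionary from token name to probability, and that the scan state is summarized by three booleans supplied by the caller. The helper names, and the reduction of conditions (i) and (ii) to the flags `group4_open` and `group2_open`, are simplifying assumptions made here for illustration.

```python
def valid_tokens(is_significant, group4_open=False, group2_open=False):
    """Return the set of tokens allowed at the current scan position."""
    if is_significant:
        # Condition (v): only refinement bits are allowed.
        return {"NextAccuracy0", "NextAccuracy1"}
    # Conditions (iii) and (iv): the coefficient is still insignificant.
    valid = {"NowSignificantNeg", "NowSignificantPos", "Insignificant"}
    if group4_open:
        valid.add("Group4x4")  # condition (i)
    if group2_open:
        valid.add("Group2x2")  # condition (ii)
    return valid


def greedy_valid(probabilities, valid):
    """Pick the most probable token among those that are valid."""
    candidates = {t: p for t, p in probabilities.items() if t in valid}
    if not candidates:
        raise ValueError("no valid token has a probability assigned")
    return max(candidates, key=candidates.get)


probs = {"NextAccuracy1": 0.5, "Insignificant": 0.3, "NowSignificantPos": 0.2}
# The unconstrained argmax would be 'NextAccuracy1', which is invalid
# for a coefficient that has not yet been reported significant.
print(greedy_valid(probs, valid_tokens(is_significant=False)))
```

The same filtering can be applied before any of the stochastic sampling methods of Subsection 4.7, by sampling from the valid subset instead of taking its maximum.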
4.7 Controlling the degree of generative diversity during inference
Since we are applying a language transformer model, we may use various simple stochastic mechanisms to control the generative process during inference and allow a diversity of possible images to be generated from a single prompt. Some of the available stochastic methods are beam search with multinomial sampling, Top-k and Top-p. In our experiments, we tested the latter two:
(i) Top-k sampling - The Top-k inference method (37) first filters the k most likely next words and then samples from the probability mass that is redistributed among only those k words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation. In Figure 8 below, we see a diversity of sandals generated by guiding the model with the vector representation of the corresponding FashionMNIST ‘sandal’ class and using the Top-k method. We see that the value of k used is sufficient to move the generative process from a deterministic process to a sufficiently diverse stochastic process, yet with output that fits the class description.
(ii) Top-p sampling - In Top-p sampling, or nucleus sampling, the selection pool for the next token is determined by the cumulative probability of the most probable tokens. Setting a threshold p, the model includes just enough of the most probable tokens so that their combined probability reaches or exceeds this threshold. Again, the probability mass is redistributed among these tokens, and the next token is then sampled from this distribution. In Figure 7 we see different examples of the digits ‘3’ and ‘8’ generated using the Top-p method.
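The two filtering schemes can be sketched in a few lines. The example assumes the next-token distribution is given as a dictionary from token to probability. The function names are hypothetical, and the values of k and p below are arbitrary illustrations, not the values used in the experiments.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {t: p / total for t, p in ranked}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {t: prob / total for t, prob in kept}


def sample(distribution, rng):
    """Draw one token according to the given (renormalized) distribution."""
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"Insignificant": 0.5, "NowSignificantPos": 0.3, "NowSignificantNeg": 0.2}
rng = random.Random(0)
token_k = sample(top_k_filter(probs, k=2), rng)
token_p = sample(top_p_filter(probs, p=0.7), rng)
```

With k = 1, Top-k sampling reduces to the Greedy method, so increasing k (or p) moves the generation gradually from deterministic to diverse.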
5 Experimental results
We conducted experiments on the MNIST and FashionMNIST datasets. Here are some details:
• The images in both datasets were padded with zeros and normalized to a fixed range of values.
• We used the Haar wavelet basis for the MNIST images and the bior4.4 wavelet basis for the FashionMNIST images.
• The images were tokenized with a dataset-specific final threshold for MNIST and for FashionMNIST.
• The maximal token sequence lengths were 1742 for MNIST and 3098 for FashionMNIST.
• We trained two separate DistilGPT2 models from scratch on the two datasets. Both training sessions used batch size 4, learning rate 0.0004, and weight decay 0.01.
• Models were trained on an NVIDIA A100 GPU with 80GB of memory; training on MNIST occupied around 22GB, while training on FashionMNIST occupied 61GB. Both models were trained for a few days.
More generated images for different classes of MNIST and FashionMNIST appear in the following figures.
6 Discussion and future work
In this paper, we introduced a novel method for image generation that is based on elements of wavelet image coding and NLP transformers. Unfortunately, our research group does not have access to sufficient computational resources at the moment, so this work serves as a first modest proof of concept. Indeed, the wavelet representation is a powerful tool in image processing that can serve as a basis for many image generation functionalities. Here, we list some directions that we will consider for future work.
6.1 Generation of color images at high resolution and with fine details
In our experiments, we only generated small grayscale images. We provide here some details on how the method can be generalized:
(i) Color images - For color images (or even spectral images), we may adopt a well-known paradigm from image compression. For improved performance, one may transform input images from the RGB color space to a luminance-chrominance color space such as YCbCr. The luminance component is essentially the image’s grayscale part. The other two components capture the color information of the image. Typically, the luminance component carries most of the visual information, and thus its encoding is usually the significant part of an encoded image. In image coding, one usually encodes each of the three channels separately. Our method can then be generalized to color images by applying the DWT and the tokenization process separately to each color channel.
(ii) Image resolution - Support of any image resolution simply translates to more wavelet coefficients and longer token sequences. Obviously, this requires larger transformer models that can support longer contexts, as well as more computational resources. The method should also generalize well, just as wavelet image coding is being applied to Gigabyte images.
(iii) Generating fine details - Using our wavelet model, finer details are captured at higher bit-planes. The choice of the final threshold of the final bit-plane provides excellent and very consistent control over the amount of detail one wishes to generate (see Subsection 3.2.1). This quantization technique is at the heart of the JPEG algorithms and translates to very specific modes in digital cameras that can be set to “Visually Lossless”, “High”, “Medium”, etc. This exact form of control also applies to wavelets but, unfortunately, is not the default mode of operation in JPEG2000. Obviously, to generate finer details, one needs to train the transformer on longer token sequences, again requiring more computational resources.
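The color transform of item (i) can be sketched as follows. The example assumes 8-bit RGB input and uses the full-range YCbCr conversion familiar from JPEG (JFIF). This particular choice of transform and the helper names are assumptions made for illustration, since no color experiments were carried out. Each of the three resulting channels would then go through the DWT and the tokenization separately.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (JFIF convention)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_channels(image):
    """Turn rows of (r, g, b) pixels into three separate 2D channels Y, Cb, Cr."""
    converted = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in image]
    luma = [[p[0] for p in row] for row in converted]
    chroma_b = [[p[1] for p in row] for row in converted]
    chroma_r = [[p[2] for p in row] for row in converted]
    return luma, chroma_b, chroma_r


# A 1x2 toy image: one gray pixel and one pure red pixel.
image = [[(100, 100, 100), (255, 0, 0)]]
luma, chroma_b, chroma_r = split_channels(image)
```

For a gray pixel the two chrominance values sit at the neutral level 128, which reflects the remark above that the luminance channel carries most of the visual information.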
6.2 Support for generation of compositions of blobs
In many cases, one wishes to apply fine-grained control of compositional text-to-image generation, where certain locations in the image, marked perhaps with bounding boxes or ellipses, receive different textual descriptions (36). One possible method to accomplish this using the wavelet generative approach is to apply the transformer in evaluation mode and apply the vector representation of the blob’s textual prompt as described in Subsection 4.4 whenever the bit-plane scan is at indices of wavelet coefficients whose support intersects the blob.
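One way to picture the intersection test is sketched below. It assumes Haar wavelets, for which the support of a coefficient is an exact dyadic pixel block, and blobs given as axis-aligned bounding boxes. Level 1 denotes the finest scale. The function names, the box convention (top, left, bottom, right, with exclusive bottom and right), and the rule of taking the first matching blob are all hypothetical choices made for this illustration.

```python
def haar_support(level, row, col):
    """Pixel block covered by a Haar coefficient at (row, col) of the given level."""
    size = 2 ** level
    return (row * size, col * size, (row + 1) * size, (col + 1) * size)


def intersects(box_a, box_b):
    """True if two half-open boxes (top, left, bottom, right) overlap."""
    top_a, left_a, bottom_a, right_a = box_a
    top_b, left_b, bottom_b, right_b = box_b
    return (top_a < bottom_b and top_b < bottom_a
            and left_a < right_b and left_b < right_a)


def guidance_for(level, row, col, blobs, default):
    """Return the guidance vector of the first blob met by the coefficient's support."""
    support = haar_support(level, row, col)
    for box, vector in blobs:
        if intersects(support, box):
            return vector
    return default


# One blob in the top-left 8x8 corner of the image, with a toy guidance vector.
blobs = [((0, 0, 8, 8), [1.0, 0.0])]
background = [0.0, 0.0]
print(guidance_for(2, 1, 0, blobs, background))  # support (4, 0, 8, 4) meets the blob
print(guidance_for(2, 3, 3, blobs, background))  # support (12, 12, 16, 16) does not
```

For wavelets with longer filters, such as bior4.4, the supports are wider than these dyadic blocks and overlap one another, so `haar_support` would have to be replaced by the corresponding support computation.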
6.3 Multi-modal generation
The ability to represent an image’s visual information as a sequence of tokens presents an attractive possibility of merging the wavelet-based tokens with other language tokens to create a unified multi-modal transformer.
Funding
N. Sharon is partially supported by the NSF-BSF award 2019752. W. Mattar is partially supported by The Nehemia Levtzion Scholarship for Outstanding Doctoral Students from the Periphery (2023). N. Sharon and W. Mattar are partially supported by the DFG award 514588180.
References
- Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Phuong and Hutter (2022) Mary Phuong and Marcus Hutter. Formal algorithms for transformers, 2022. URL https://arxiv.org/abs/2207.09238.
- Tay et al. (2022) Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
- Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
- Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021.
- Shapiro (1993) Jerome Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.
- Said and Pearlman (1996) Amir Said and William Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology, 6(3):243–250, 1996.
- Taubman and Marcellin (2002) David Taubman and Michael Marcellin. JPEG2000: Image Compression Fundamentals, Standards and Practice, 2nd edition. Springer, 2002.
- Daubechies (1992) Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992.
- DeVore (1998) Ronald DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
- Mallat (2009) Stephan Mallat. A Wavelet tour of signal processing, the sparse way. Academic Press, 2009.
- Wallace (1992) G.K. Wallace. The jpeg still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992. doi: 10.1109/30.125072.
- Mihcak et al. (1999) M Kivanc Mihcak, Igor Kozintsev, and Kannan Ramchandran. Spatially adaptive statistical modeling of wavelet image coefficients and its application to denoising. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), volume 6, pages 3253–3256. IEEE, 1999.
- Kivanc Mihcak et al. (1999) M. Kivanc Mihcak, I. Kozintsev, K. Ramchandran, and P. Moulin. Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 6(12):300–303, 1999. doi: 10.1109/97.803428.
- Buccigrossi and Simoncelli (1999) Robert W Buccigrossi and Eero P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE transactions on Image processing, 8(12):1688–1701, 1999.
- Nichol et al. (2022) Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16784–16804. PMLR, July 2022. URL https://proceedings.mlr.press/v162/nichol22a.html.
- Wang et al. (2023) Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
- Lee et al. (2022) Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.
- Yu et al. (2024) Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. arXiv preprint arXiv:2406.07550, 2024.
- Tian et al. (2024) Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
- Zhu et al. (2023) Qing Zhu, Xiumei Li, Junmei Sun, and Huang Bai. Wdig: a wavelet domain image generation framework based on frequency domain optimization. EURASIP Journal on Advances in Signal Processing, 2023(1):66, 2023.
- Yu et al. (2021) Yingchen Yu, Fangneng Zhan, Shijian Lu, Jianxiong Pan, Feiying Ma, Xuansong Xie, and Chunyan Miao. Wavefill: A wavelet-based generation network for image inpainting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14114–14123, 2021.
- Phung et al. (2023) Hao Phung, Quan Dao, and Anh Tran. Wavelet diffusion models are fast and scalable image generators. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10199–10208, 2023. doi: 10.1109/CVPR52729.2023.00983.
- Guth et al. (2022) Florentin Guth, Simon Coste, Valentin De Bortoli, and Stephane Mallat. Wavelet score-based generative modeling. Advances in Neural Information Processing Systems, 35:478–491, 2022.
- Deng (2012) Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012.
- Černá and Finěk (2011) Dana Černá and Václav Finěk. Discrete CDF 9/7 wavelet transform for finite-length signals, 2011. URL https://api.semanticscholar.org/CorpusID:208013335.
- Chung et al. (2024) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024. URL http://jmlr.org/papers/v25/23-0870.html.
- Sanh (2019) V Sanh. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. In Proceedings of Thirty-third Conference on Neural Information Processing Systems (NIPS2019), 2019.
- (34) Huggingface Repository. HuggingFace DistilGPT2. https://huggingface.co/distilbert/distilgpt2, 2019. Accessed: 2024-06-27.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Nie et al. (2024) Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, and Arash Vahdat. Compositional text-to-image generation with dense blob representations, 2024.
- Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.