Toward a Diffusion-Based Generalist for Dense Vision Tasks

Yue Fan1,2, Yongqin Xian2, Xiaohua Zhai3, Alexander Kolesnikov3,
Muhammad Ferjad Naeem4, Bernt Schiele1, Federico Tombari2,5
1Max Planck Institute for Informatics, Saarland Informatics,
2Google, 3Google DeepMind, 4ETH Zurich, 5TU Munich
Intern at Google during the project.
Abstract

Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show performance competitive with other vision generalists.

1 Introduction

The field of artificial intelligence has made significant progress in building generalized model frameworks. In particular, autoregressive transformers [27] have become a prominent unified approach in Natural Language Processing (NLP), effectively addressing a wide range of tasks with a single model architecture [25, 10, 19, 21]. However, in computer vision (CV), building a unified framework remains challenging due to the inherent diversity of tasks and output formats. Consequently, state-of-the-art computer vision models still rely on many complex task-specific designs [3, 8, 9, 15, 31], which makes it difficult to share features across tasks and thus limits knowledge transfer.

The stark contrast between NLP and CV has given rise to a growing interest in developing unified approaches for vision tasks [18, 6, 7, 29, 30, 35]. Recently, [29, 30] have shown that the image itself can be used as a robust interface for unifying different vision tasks and demonstrated good performance. In this paper, we propose a multi-task diffusion generalist for dense vision tasks by reformulating dense prediction tasks as conditional image generation and re-purposing pre-trained latent diffusion models for it. The teaser figure visualizes the output of our model on semantic segmentation, panoptic segmentation, depth estimation, and image restoration. Based on text prompts, our model can perform different tasks with one set of parameters. However, directly finetuning pre-trained latent diffusion models (e.g. Stable Diffusion [22]) leads to quantization errors for segmentation tasks (see Table 2). To address this, we propose to perform diffusion in pixel space, which effectively improves the generation quality and does not suffer from quantization errors. Moreover, our exploration into training diffusion models as vision generalists reveals the following findings:

  • Diffusion-based generalists show superior performance over the non-diffusion-based generalists on tasks involving semantics or global understanding of the scene.

  • We find conditioning on the image feature extracted from powerful pre-trained image encoders results in better performance than directly conditioning on the raw image.

  • Pixel diffusion is better than latent diffusion as it does not have the quantization issue while upsampling.

  • We observe that text-to-image generation pre-training stabilizes the training and leads to better performance.

In experiments, we demonstrate the model’s versatility across six different dense prediction tasks: depth estimation, semantic segmentation, panoptic segmentation, image denoising, image deraining, and light enhancement. Our method achieves performance competitive with the current state-of-the-art in many settings.

2 Related Work

Unified framework & Unified model: Efforts have been made to unify various vision tasks with a single model, resulting in several vision generalists [18, 6, 7, 29, 30, 14]. Inspired by the success of sequence-to-sequence modeling in NLP, Pix2Seq [6, 7] leverages a plain autoregressive transformer and tackles many vision tasks with next-token prediction. For example, bounding boxes in object detection are cast as sequences of discrete tokens, and masks in semantic segmentation are encoded with the coordinates of object polygons [4]. The idea was further developed in Unified-IO [18], where dense prediction tasks such as segmentation, depth estimation, and image restoration are also unified as tokens by using the corresponding image features from a vector quantization variational auto-encoder (VQ-VAE) [26]. On the output side, the predicted image tokens are then decoded into masks and depth maps as the final prediction. Similarly, OFA [28] unified a diverse set of cross-modal and unimodal tasks in a simple sequence-to-sequence learning framework and achieved competitive performance while being pretrained on only 20M publicly available image-text pairs. Painter [29] and SegGPT [30], on the other hand, reformulate different vision tasks as an image inpainting problem and perform in-context learning following [2]. Unlike the previous work, our method unifies different vision tasks under a conditional image generation framework and introduces a diffusion-based vision generalist for it.

Unified framework & Task-specific model: Besides the aforementioned literature, there is another line of related work that pursues a unified architecture but task-specific models. UViM [14] addressed the high-dimensional output space of vision tasks via a learned guiding code: a short sequence, modeled by an additional language model, encodes task-specific information and guides the prediction of the base model. Separate models are trained for different tasks as the guiding code is task-specific. XDecoder [36] unified pixel-level image segmentation, image-level retrieval, and vision-language tasks with a generic decoding procedure that predicts pixel-level masks and token-level semantics; different combinations of the two outputs are used for different tasks. Despite their good performance, the task/modality-specific customization makes knowledge sharing among tasks difficult and is not well suited to supporting unseen tasks.

3 Toward a Diffusion-Based Generalist

Figure 1: The training pipeline of the diffusion-based vision generalist consists of two parts: Left: Redefining the output space of different vision tasks as RGB images so that they can be unified under a conditional image generation framework. Right: We finetune a pre-trained diffusion model on the reformatted data from the first step. Diffusion is performed in the pixel space to mitigate the quantization error of the latent diffusion (see Table 2). The image and text conditionings are fed into the model via the corresponding encoders, where only the image encoder is tuned during the training.

3.1 Unification with Conditional Image Generation

Since the output of most vision tasks can be visualized as an image, we redefine the output space of different vision tasks as RGB images and unify them as conditional image generation, which tackles the inherent difference in output formats across tasks. Given an input image $x$ and the corresponding ground truth $y$, we first transform $y$ into an RGB image and then pair it with a text task descriptor. In this way, the training sets of different tasks are combined into a single holistic training set, and training the model jointly on it enables knowledge transfer between tasks. At test time, given a new image, the model can perform different tasks by following the text instructions (examples in the teaser figure).

In this paper, we consider four types of dense prediction tasks: depth estimation, semantic segmentation, panoptic segmentation, and image restoration.

Depth estimation outputs a real-valued depth for each pixel of $x$. Given the minimum and maximum depth values, we map depths linearly into $[0, 255]$ and discretize them into integers; the result is then repeated and stacked along the channel dimension to form the ground-truth RGB label.
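The depth encoding above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the nested-list image layout, the clipping, the round-to-nearest discretization, and the channel-averaging decoder are all assumptions made for the example.

```python
def depth_to_rgb(depth, d_min, d_max):
    """Encode a 2D depth map (list of rows of floats) as an RGB label.

    Depths are mapped linearly from [d_min, d_max] to [0, 255], rounded to
    integers (rounding mode is an assumption), and repeated over 3 channels.
    """
    scale = 255.0 / (d_max - d_min)
    rgb = []
    for row in depth:
        out_row = []
        for d in row:
            d = min(max(d, d_min), d_max)  # clip to the valid range (assumption)
            v = int(round((d - d_min) * scale))
            out_row.append((v, v, v))  # same value in R, G, and B
        rgb.append(out_row)
    return rgb


def rgb_to_depth(rgb, d_min, d_max):
    """Decode a predicted RGB image back to depth.

    Averaging the three channels before inverting the linear map is an
    assumption; the text does not specify the decoding step.
    """
    scale = (d_max - d_min) / 255.0
    return [[d_min + (sum(px) / 3.0) * scale for px in row] for row in rgb]


depth = [[0.0, 2.5], [5.0, 10.0]]
label = depth_to_rgb(depth, d_min=0.0, d_max=10.0)
recovered = rgb_to_depth(label, d_min=0.0, d_max=10.0)
```

With 256 levels, the round trip loses at most about half a quantization step, i.e. roughly (d_max - d_min) / 510 per pixel.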

Semantic segmentation predicts a class label for each pixel. We use a pre-defined injective class-to-color mapping to transform the segmentation mask into an RGB image. Given a task with $C$ categories, we define $C$ colors that are evenly distributed in the 3-dimensional RGB space. Specifically, following [29], the class index is represented as a 3-digit number in base $b$, where $b = \lceil C^{\frac{1}{3}} \rceil$. The margin between two adjacent colors is then $m = \text{int}(\frac{256}{b})$, and the color for the $i$-th class is $[\text{int}(\frac{i}{b^2}) \times m,\ \text{int}(\frac{i}{b}) \% b \times m,\ i \% b \times m]$. At test time, we find the nearest neighbor of the predicted color in the predefined class-to-color mapping and predict the corresponding category.
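As an illustration of the class-to-color mapping and the nearest-neighbor decoding, the sketch below builds the palette for 150 classes (the number of classes in ADE20K) and decodes two slightly perturbed colors. The function names and the squared Euclidean distance are assumptions made for this example.

```python
def build_palette(num_classes):
    """Return one RGB color per class index using a 3-digit base-b code."""
    # Smallest integer b with b**3 >= num_classes, i.e. ceil(C ** (1/3)),
    # computed with integers to avoid floating-point rounding issues.
    b = 1
    while b ** 3 < num_classes:
        b += 1
    m = 256 // b  # margin between adjacent colors along each channel
    return [
        ((i // (b * b)) * m, ((i // b) % b) * m, (i % b) * m)
        for i in range(num_classes)
    ]


def decode_color(pixel, palette):
    """Return the class index whose palette color is closest to `pixel`."""
    def sq_dist(color):
        return sum((p - c) ** 2 for p, c in zip(pixel, color))
    return min(range(len(palette)), key=lambda i: sq_dist(palette[i]))


palette = build_palette(150)              # b = 6, m = 42
clean = decode_color((3, 40, 45), palette)    # close to palette[7] = (0, 42, 42)
drifted = decode_color((0, 42, 64), palette)  # blue channel drifted by 22
```

Here `clean` is class 7, while `drifted` is class 8: a drift of just over half the margin in a single channel is enough to change the predicted class. This is the sensitivity behind the quantization issue discussed for latent diffusion.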

Panoptic segmentation is solved as a combination of semantic and instance segmentation. Semantic segmentation labels are constructed as stated above. For instance segmentation, we set $N$ as the maximum number of instances a single training image can contain. Then, we define $N$ colors that are evenly distributed in the 3-dimensional RGB space, as in semantic segmentation. Finally, we assign colors to objects based on their spatial location to form the RGB ground-truth label. For example, the instance whose center is at the upper leftmost corner obtains the first color and the one at the lower rightmost corner gets the last color. At test time, the model makes two predictions with different text instructions and merges the results for panoptic segmentation.

Image restoration aims to predict the clean image from a corrupted image. Thus, the output space is inherently an RGB image and does not need further transformation to fit in the framework.

3.2 A Diffusion Multi-Task Generalist Framework

By reformatting the output space of different vision tasks into images, it is natural to solve them together under a conditional image generation framework. To this end, we leverage powerful diffusion models pre-trained for image generation and re-purpose them for our use case.

Fig. 1 shows the overall pipeline of the method, which is a conditional image generation framework with pixel-space diffusion. We are given $M$ tasks with datasets $\{\textbf{I}^i, \textbf{Y}^i\}_{i=1}^{M}$, where $\textbf{I}^i$ are the input images of task $i$ and $\textbf{Y}^i$ are the corresponding ground-truth labels. We first transform the labels into RGB image format $\textbf{X}^i$ and augment each task with a text instruction $T^i$. At each training step, we randomly sample a subset of tasks and then sample data from each task. For each input sample $\{I^i, X^i, T^i\}$, we first compute the multi-scale feature map of the original image $I^i$ with the image encoder. The feature map is then concatenated with the noised target image $X_t^i$ before being fed into the UNet, which is trained with the reconstruction loss. Note that the image feature can have a different spatial resolution than the target image $X_t^i$, in which case the concatenation is performed on the interpolated image feature. In experiments, we find that both the image feature resolution and the target resolution are important for the final performance, but the target resolution matters more. The text conditioning $T^i$ is fed into the UNet via cross-attention [22]. The whole pipeline is trained end-to-end except for the text encoder, which is frozen throughout training. Compared to the standard diffusion model for conditional image generation, there are three main differences:

| Method | Target image resolution | Depth Estimation, RMSE ↓ (NYUv2) | Semantic Seg., mIoU ↑ (ADE-20K) | Panoptic Seg., PQ ↑ (COCO) | Denoising, SSIM ↑ (SIDD) | Deraining, SSIM ↑ (5 datasets) | Light Enhance., SSIM ↑ (LoL) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Generalist framework, task-specific models | | | | | | | |
| UViM [14] | 512×512 | 0.467 | - | 45.8% | - | - | - |
| Generalist models | | | | | | | |
| Unified-IO [18] | 256×256 | 0.385 | 25.7% | - | - | - | - |
| InstructCV [11] | 256×256 | 0.297 | 47.2% | - | - | - | - |
| Painter [29] | 448×448 | 0.288 | 49.9% | 43.4% | 0.954 | 0.868 | 0.872 |
| Painter [29] | 128×128 | 0.435† | 28.4%† | 22.6%† | 0.922† | 0.626† | 0.773† |
| Ours | 128×128 | 0.448 | 48.7% | 40.3% | 0.954 | 0.815 | 0.758 |

Table 1: Our method achieves competitive performance in most of the tasks while trained at a much smaller target resolution of 128×128. When compared at the same resolution, our method shows superior performance over the previous best method (Painter [29]), especially on semantic segmentation and panoptic segmentation. The best number is in bold and the second best number is underscored. † indicates numbers from our reproduction.
[Uncaptioned image]
| Method | Semantic Seg., mIoU ↑ (ADE-20K) | Panoptic Seg., PQ ↑ (COCO) |
| --- | --- | --- |
| Latent Diffusion | 17.1% | 11.7% |
| Pixel Diffusion | 48.0% | 35.5% |

Table 2: Upper: Semantic segmentation output of the latent diffusion model. The perceptually same colored regions have different pixel values and, therefore, are mapped to different class labels, leading to bad final performance. While the red box contains only one ground-truth class, sky, in the generated RGB image, the final class prediction has four classes after the quantization. Lower: Latent diffusion suffers from the quantization issue while pixel diffusion achieves good performance.
  • We propose to directly perform diffusion in the pixel space. As shown in Table 2, when mapping from the latent space to the pixel space, visually uniform regions actually have pixels of many different RGB values. This variance can lead to inaccurate class mappings and, consequently, suboptimal performance for semantic and panoptic segmentation.

  • The image conditioning is provided via a feature extractor (we use ConvNeXt [17]) and is concatenated to the target image $X_0$. Compared to the widely adopted method of directly concatenating the raw image as the condition, this brings significant performance improvement, especially for semantic and panoptic segmentation (see Table 3 for ablation).

  • We remove the self-attention layers in the outermost layers of UNet. This is because the pixel space diffusion at large target image resolutions induces considerable memory costs. Removing them alleviates the issue without compromising the performance.
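
The conditioning step of this section, where the image feature is interpolated to the target resolution and concatenated with the noised target, can be sketched as follows. This is a simplified illustration under several assumptions: tensors are plain nested lists in [channel][row][column] layout, only a single feature scale is handled, nearest-neighbor interpolation stands in for the unspecified interpolation method, and the function names are hypothetical.

```python
def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbor resize of a [C][h][w] nested-list feature map."""
    in_h, in_w = len(feat[0]), len(feat[0][0])
    return [
        [
            [ch[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)
        ]
        for ch in feat
    ]


def build_unet_input(noised_target, image_feature):
    """Concatenate the noised target and the image feature along channels."""
    out_h, out_w = len(noised_target[0]), len(noised_target[0][0])
    feat_h, feat_w = len(image_feature[0]), len(image_feature[0][0])
    if (feat_h, feat_w) != (out_h, out_w):
        # Resolutions differ, so interpolate the feature map first.
        image_feature = resize_nearest(image_feature, out_h, out_w)
    # Both operands are lists of channels, so list concatenation is
    # channel-wise concatenation.
    return noised_target + image_feature


# A 3-channel 4x4 noised target and a 2-channel 2x2 feature map.
noised_target = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
image_feature = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
unet_input = build_unet_input(noised_target, image_feature)
```

The result has 3 + 2 = 5 channels at the target resolution; in a real model the first UNet convolution would need a matching number of input channels.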

4 Experimental Results

Here, we first explain experimental settings in Section 4.1. Then, we highlight important design choices in diffusion-based multi-task generalists in Section 4.2 before comparing our method with previous approaches in Section 4.3.

4.1 Datasets and Implementation Details

Datasets: We evaluate our method on six different dense prediction tasks with various output formats. For depth estimation, we use NYUv2 [24] and report the Root Mean Square Error (RMSE). For semantic segmentation, we evaluate on ADE20K [34] and adopt the widely used metric of mean IoU (mIoU). For panoptic segmentation, we use MS-COCO [16] and report panoptic quality as the measure. During inference, the model is forwarded twice for each validation image with different instructions to obtain the results of semantic and instance segmentation respectively. The outputs are then merged together into the panoptic segmentation. Image restoration tasks are evaluated on several popular benchmarks, including SIDD [1] for image denoising, LoL [32] for low-light image enhancement, and 5 merged datasets [33] for deraining.

Implementation details. As mentioned above, we take the Stable Diffusion v1.4 [22] checkpoint and finetune it jointly on the six tasks. The image feature extractor is an ImageNet-21K [23] pre-trained ConvNeXt-Large [17]. The text encoder is Open-CLIP [20], which is used in Stable Diffusion [22]. We sample tasks uniformly, except for panoptic segmentation, whose weight is twice that of the other tasks (as it is a combination of semantic and instance segmentation). Following [5], we also scale the input by a constant factor $b$ in the forward noising process of diffusion. We use the AdamW optimizer [13] with a constant learning rate of 0.0001, linearly warmed up over the first 20,000 iterations. The target image resolution is 128×128 while the conditioning image resolution is 512×512. We train our model for 180,000 steps in total with a batch size of 1024.
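
The task sampling weights and the learning-rate schedule described above can be written down compactly. The sketch below is only illustrative: the task names, the per-sample (rather than per-batch) sampling, and a warmup that starts from zero are assumptions, and the function names are hypothetical.

```python
import random

# Relative sampling weights: uniform, except panoptic segmentation at 2x.
TASK_WEIGHTS = {
    "depth estimation": 1.0,
    "semantic segmentation": 1.0,
    "panoptic segmentation": 2.0,
    "denoising": 1.0,
    "deraining": 1.0,
    "light enhancement": 1.0,
}


def sample_task(rng):
    """Draw one task name according to TASK_WEIGHTS."""
    names = list(TASK_WEIGHTS)
    weights = [TASK_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def learning_rate(step, base_lr=1e-4, warmup_steps=20_000):
    """Linear warmup to base_lr over warmup_steps, then constant.

    Starting the warmup at zero is an assumption.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


rng = random.Random(0)
draws = [sample_task(rng) for _ in range(20_000)]
panoptic_share = draws.count("panoptic segmentation") / len(draws)
```

Under these weights, panoptic segmentation is drawn with probability 2/7 and every other task with probability 1/7.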

| Method | Depth Estimation, RMSE ↓ (NYUv2) | Semantic Seg., mIoU ↑ (ADE-20K) | Panoptic Seg., PQ ↑ (COCO) | Denoising, SSIM ↑ (SIDD) | Deraining, SSIM ↑ (5 datasets) | Light Enhance., SSIM ↑ (LoL) |
| --- | --- | --- | --- | --- | --- | --- |
| Ours | 0.511 | 48.0% | 35.5% | 0.949 | 0.772 | 0.704 |
| Non-diffusion | 0.443 | 42.4% | 19.8% | 0.951 | 0.773 | 0.703 |
| Train from scratch | 0.528 | 46.6% | 33.6% | 0.948 | 0.764 | 0.704 |
| Direct concat. | 0.476 | 37.6% | 27.1% | 0.941 | 0.772 | 0.687 |

Table 3: We analyze the important design choices of our method and aim to provide a recipe for training diffusion-based generalists: 1. diffusion models greatly outperform non-diffusion models on panoptic segmentation; 2. text-to-image generation pre-training leads to an overall better performance; 3. conditioning on image features extracted from an encoder gives significant improvement over the raw image.
Figure 2: Qualitative results on images from the validation sets of ADE20K, MS-COCO, NYU-V2, SIDD, Deraining, and LOL. Following a raster scan order, the text prompts are "Performance semantic segmentation", "Performance instance segmentation", "Performance depth estimation", "Performance image restoration denoising", "Performance image restoration deraining", and "Performance image restoration light enhancement", respectively. The images are not cherry-picked.

4.2 Recipes for Diffusion-Based Generalists

In this section, we analyze the design choices of our method and show their importance through ablation experiments. Specifically, we show the importance of diffusion by training the same model as in Fig. 1 to directly generate target images without using diffusion (non-diffusion). We study the significance of image generation pre-training and of the image encoder by training models without them (train from scratch and direct concat.). Unless otherwise specified, we train all models at a target resolution of 64×64 for 50,000 steps.

We attribute the success of our method to four aspects. (1) While achieving similar results on image restoration tasks, the diffusion-based generalist outperforms the non-diffusion model on segmentation tasks, which require a global understanding of the scene and its semantics. For example, the diffusion model reaches 35.5% PQ on panoptic segmentation while the non-diffusion model reaches only 19.8% (Table 3, ours vs. non-diffusion). (2) Image generation pre-training on a large-scale dataset transfers useful knowledge to the downstream tasks. The model finetuned from Stable Diffusion v1.4 [22] matches or outperforms the one trained from scratch on every task (Table 3, ours vs. train from scratch). (3) The image conditioning can take advantage of powerful pre-trained image encoders by conditioning on image features rather than the raw image, in contrast to the standard practice for image generation tasks. On semantic segmentation and panoptic segmentation, extracting features improves performance by 10.4 and 8.4 percentage points, respectively (Table 3, ours vs. direct concat.). (4) Pixel diffusion is better than latent diffusion as it does not suffer from the quantization issue while upsampling (see Table 2 for an example).

4.3 Comparisons with Prior Art

We compare our model with recent vision generalists in Table 1. With a much smaller target image resolution of 128×128, our method achieves competitive performance across the tasks. In particular, when compared with the previous best model, Painter [29], at the same target resolution, our method is better on four of the six tasks, with large margins on semantic segmentation (48.7% vs. 28.4% mIoU) and panoptic segmentation (40.3% vs. 22.6% PQ), which highlights the potential of our method at higher resolutions.

4.4 Qualitative Results

In this section, we visualize the output of our method on six different tasks in Fig. 2. We use DDIM at inference time with 50 steps. Each figure shows the output of the denoising process at the 0-th, 25-th, and 50-th steps.

4.5 Ablation Study

In this section, we analyze the effect of other important hyper-parameters of our method, such as batch size, target image resolution, and noise-signal ratio. As in Section 4.2, we train all models at a target resolution of 64×64 for 50,000 steps by default.

Effect of batch size. Here, we discuss the effect of different batch sizes for our method. As shown in Table 4, the performance of most of the tasks improves with the increase of the batch size. In particular, panoptic segmentation greatly benefits from the large batch size.

| Batch size | Depth, RMSE ↓ (NYUv2) | Sem. Seg., mIoU ↑ (ADE-20K) | Pan. Seg., PQ ↑ (COCO) | Denoise, SSIM ↑ (SIDD) | Derain, SSIM ↑ (5 datasets) | Enhance., SSIM ↑ (LoL) |
| --- | --- | --- | --- | --- | --- | --- |
| 128 | 0.548 | 35.5% | 26.2% | 0.941 | 0.754 | 0.701 |
| 256 | 0.495 | 44.3% | 30.0% | 0.945 | 0.766 | 0.703 |
| 512 | 0.491 | 47.1% | 33.5% | 0.948 | 0.770 | 0.702 |
| 1024 | 0.511 | 48.0% | 35.5% | 0.949 | 0.772 | 0.704 |

Table 4: Large batch size improves the performance for all the tasks except depth estimation.

Effect of target resolution. Table 5 studies the effect of different target image resolutions. Since our method performs diffusion in the pixel space, increasing the target image resolution is important for good performance. Despite the increased memory cost, our method achieves its best performance at the largest resolution tested, 128×128, which suggests that it could be further improved with even larger target images.

| Target resolution | Depth, RMSE ↓ (NYUv2) | Sem. Seg., mIoU ↑ (ADE-20K) | Pan. Seg., PQ ↑ (COCO) | Denoise, SSIM ↑ (SIDD) | Derain, SSIM ↑ (5 datasets) | Enhance., SSIM ↑ (LoL) |
| --- | --- | --- | --- | --- | --- | --- |
| 32×32 | 0.514 | 44.4% | 32.1% | 0.940 | 0.743 | 0.653 |
| 64×64 | 0.511 | 48.0% | 35.5% | 0.949 | 0.772 | 0.704 |
| 128×128 | 0.467 | 49.2% | 36.7% | 0.953 | 0.810 | 0.762 |

Table 5: Effect of output resolution. Increasing the target image resolution significantly improves the performance across tasks.

Importance of noise-signal ratio. In DDPM [12], the forward diffusion process is defined as $x_t = \sqrt{\gamma_t}\, x_0 + \sqrt{1 - \gamma_t}\, \epsilon$, where $x_0$ is the input image, $\epsilon$ is Gaussian noise, and $t$ is the diffusion step. As shown in [5], the denoising task at the same noise level (i.e. the same $t$) becomes simpler as the image size increases. To compensate for this, [5] proposed to scale the input with a constant $b$ to explicitly control the noise-signal ratio, which results in the forward diffusion process $x_t = \sqrt{\gamma_t}\, b\, x_0 + \sqrt{1 - \gamma_t}\, \epsilon$. Reducing $b$ increases the effective noise level. Table 6 shows the effect of the noise-signal ratio $b$, where $b = 0.5$ gives the best performance on all tasks except depth estimation.
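
To make the role of $b$ concrete, the sketch below samples from the scaled forward process and computes the resulting signal-to-noise power ratio. It is a minimal illustration on a flat list of pixel values; the function names are hypothetical, and the ratio assumes that $x_0$ has unit variance.

```python
import math
import random


def forward_diffuse(x0, gamma_t, b, rng):
    """Sample x_t = sqrt(gamma_t) * b * x0 + sqrt(1 - gamma_t) * eps."""
    signal_scale = math.sqrt(gamma_t) * b
    noise_scale = math.sqrt(1.0 - gamma_t)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


def signal_to_noise(gamma_t, b):
    """Signal power over noise power, assuming unit-variance x0."""
    return (b * b * gamma_t) / (1.0 - gamma_t)


rng = random.Random(0)
x0 = [1.0, -1.0]
x_clean = forward_diffuse(x0, gamma_t=1.0, b=0.5, rng=rng)  # no noise at gamma_t = 1
snr_unscaled = signal_to_noise(gamma_t=0.5, b=1.0)
snr_scaled = signal_to_noise(gamma_t=0.5, b=0.5)
```

At a fixed $\gamma_t$, the signal-to-noise ratio scales with $b^2$, so halving $b$ reduces it by a factor of four, which is why a smaller $b$ corresponds to a harder denoising task at the same step.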

| $b$ | Depth, RMSE ↓ (NYUv2) | Sem. Seg., mIoU ↑ (ADE-20K) | Pan. Seg., PQ ↑ (COCO) | Denoise, SSIM ↑ (SIDD) | Derain, SSIM ↑ (5 datasets) | Enhance., SSIM ↑ (LoL) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.1 | 0.497 | 46.9% | 33.1% | 0.948 | 0.770 | 0.702 |
| 0.3 | 0.511 | 48.0% | 35.5% | 0.949 | 0.772 | 0.704 |
| 0.5 | 0.514 | 49.3% | 35.9% | 0.949 | 0.774 | 0.708 |
| 0.7 | 0.533 | 48.2% | 34.4% | 0.949 | 0.773 | 0.707 |
| 1.0 | 0.572 | 40.3% | 31.1% | 0.948 | 0.770 | 0.706 |

Table 6: Importance of the noise-signal ratio $b$ in the forward diffusion process $x_t = \sqrt{\gamma_t}\, b\, x_0 + \sqrt{1 - \gamma_t}\, \epsilon$.

5 Conclusion and Limitations

In this work, we explore a diffusion-based vision generalist, where different dense prediction tasks are unified as conditional image generation and we re-purpose pre-trained diffusion models for it. Furthermore, we analyze different design choices of diffusion-based generalists and provide a recipe for training such a model. In experiments, we demonstrate the model’s versatility across six different dense prediction tasks and achieve competitive performance to the current state-of-the-art. This work, however, is also subject to limitations. For example, full fine-tuning of the pre-trained diffusion model at a larger target image resolution is memory intensive due to the pixel space diffusion. Thus, exploring parameter-efficient tuning for such a model would be an interesting future direction.

References

  • Abdelhamed et al. [2018] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In CVPR, 2018.
  • Bar et al. [2022] Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. Visual prompting via image inpainting. NeurIPS, 2022.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • Castrejon et al. [2017] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In CVPR, 2017.
  • Chen [2023] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint, 2023.
  • Chen et al. [2021] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. arXiv preprint, 2021.
  • Chen et al. [2022] Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. NeurIPS, 2022.
  • Cheng et al. [2021] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021.
  • Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, 2018.
  • Gan et al. [2023] Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, and Ahmed M Alaa. Instructcv: Instruction-tuned text-to-image diffusion models as vision generalists. arXiv preprint, 2023.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint, 2014.
  • Kolesnikov et al. [2022] Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, and Neil Houlsby. Uvim: A unified modeling approach for vision with learned guiding codes. NeurIPS, 2022.
  • Li et al. [2022] Zhenyu Li, Xuyang Wang, Xianming Liu, and Junjun Jiang. Binsformer: Revisiting adaptive bins for monocular depth estimation. arXiv preprint, 2022.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022.
  • Lu et al. [2022] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In ICLR, 2022.
  • Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICLR, 2021.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • Silberman et al. [2012] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint, 2023.
  • Van Den Oord et al. [2017] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. NeurIPS, 2017.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
  • Wang et al. [2022a] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022a.
  • Wang et al. [2023a] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023a.
  • Wang et al. [2023b] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context. arXiv preprint, 2023b.
  • Wang et al. [2022b] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general u-shaped transformer for image restoration. In CVPR, 2022b.
  • Wei et al. [2018] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep retinex decomposition for low-light enhancement. arXiv preprint, 2018.
  • Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for fast image restoration and enhancement. PAMI, 2022.
  • Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
  • Zhu et al. [2022] Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai. Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In CVPR, 2022.
  • Zou et al. [2023] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In CVPR, 2023.