License: CC BY-NC-SA 4.0
arXiv:2312.01985v1 [cs.CV] 04 Dec 2023

UniGS: Unified Representation for Image Generation and Segmentation

Lu Qi$^{1}$, Lehan Yang$^{2*}$, Weidong Guo$^{3\dagger}$, Yu Xu$^{3}$, Bo Du$^{4}$, Varun Jampani$^{5}$, Ming-Hsuan Yang$^{1,6}$
$^{1}$The University of California, Merced   $^{2}$The University of Sydney   $^{3}$QQ Browser Lab, Tencent   $^{4}$Wuhan University   $^{5}$Stability AI   $^{6}$Google Research
$^{*}$Equal contribution. $^{\dagger}$Corresponding author.
Abstract

This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation. On the one hand, a location-aware palette guarantees the colors’ consistency to entities’ locations. On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data, we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating comparable segmentation mask quality to state-of-the-art and adaptability to multiple tasks. The code will be released at https://github.com/qqlu/Entity.

1 Introduction

Deep learning has propelled the performance of several tasks to new heights, marking substantial progress within the computer vision community. Image generation [17, 27, 74, 29] and segmentation [6, 75, 18, 32, 43, 45], two typical dense prediction tasks within this field, are widely used in a plethora of applications such as autonomous driving [40], video surveillance [49], medical imaging [52], robotics [12], photography [56], and intelligent creation [61, 62].

Figure 1: Visualization results of a single UniGS model on image generation and segmentation. We present four tasks: multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation. We note that the generation of colormaps shares a similar pipeline with images without needing any explicit segmentation loss.

The innovative usage of latent codes [46] in diffusion models has recently demonstrated remarkable capabilities in producing high-quality images, opening a new era of AI-generated content (AIGC). Nevertheless, using a similar design for segmentation remains relatively unexplored in diffusion-based works, despite evidence from specific studies [5, 3, 33, 54, 36] that highlight the potential of attention blocks to group pixels. Realizing such a capability with a unified representation for image and entity-level segmentation masks could potentially refine image generation, achieving greater coherence between the synthesized entities and their masks. Moreover, this unified representation offers significant potential for performing various dense prediction tasks, including both generation and segmentation in a single representation, as shown in Figure 1.

The intuitive solution is to represent segmentation masks simply as a colormap, as in Painter [59] and InstructDiffusion [16]. However, the implementation is far from straightforward for three main reasons. First, the colormap design should be consistent with the latent space, which is not explored in Painter [59]. Second, it should be able to differentiate entities in the same category. This challenge is not addressed in InstructDiffusion [16], which can only detect one entity. Finally, the mask quality is not guaranteed, and the masks are usually noisy without regular segmentation loss functions like cross-entropy or dice loss. Even though the colormap design effectively achieves a unified representation, the large-scale dataset requirements for training diffusion models are at odds with the sparse segmentation annotations at hand, resulting in a critical bottleneck in our exploration.

To tackle these challenges, our first step is to validate that the variational autoencoder (VAE) [23] used in Stable Diffusion [46] can effectively encode and decode colormaps in the same way as images. Based on the colormap representation and the latent diffusion model, we introduce the UniGS framework to simultaneously generate images and multi-class entity-level segmentation masks. UniGS has a UNet architecture augmented with dual branches: one for image generation and another for mask generation. In the mask branch, we propose two modules: a location-aware palette and a progressive dichotomy module. The former assigns each entity area a fixed color determined by the entity's center-of-mass location, enabling UniGS to discriminate entities within the same category. The latter efficiently decodes the generated noisy colormap into explicit masks without knowing the number of entities.

Then, we train our diffusion model under the inpainting protocol, addressing the scarcity of large-scale mask annotations. In this way, the diffusion model is primed to home in on specific regions rather than the entire image. This flexibility facilitates using multiple segmentation datasets for training our diffusion model. Combining the unified image and mask representation with an inpainting pipeline further integrates various tasks within a single representation with minor modifications. Figure 1 shows the effectiveness of UniGS on four tasks: multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation.

The main contributions of this work are as follows:

  • We are the first to propose a unified diffusion model (UniGS) for image generation and segmentation within a unified representation by treating the entity-level segmentation masks the same as images.

  • Two novel modules, including a location-aware palette and progressive dichotomy module, can make efficient transformations between the entity-level segmentation masks and colormap representations.

  • The inpainting-based protocol addresses the scarcity of large-scale segmentation data and affords the versatility to employ a unified representation across multiple tasks.

  • The extensive experiments show our UniGS framework’s effectiveness on image generation and segmentation. In particular, UniGS can obtain segmentation performance comparable to state-of-the-art methods without any standard segmentation loss design. Our work can inspire foundation models with a unified representation for two mainstream dense prediction tasks.

2 Related Work

Diffusion Model for Generation. The diffusion models were initially introduced in the context of generation tasks [14] and have undergone significant evolution through latent design [46]. Diffusion models have been applied to a wide variety of generation [46, 39, 50, 15], image super-resolution [1, 13, 63], image inpainting [34, 68, 53, 71], image editing [22, 73, 64], image-to-image translation [57, 28, 11, 72], among others. We note that all current methods utilize the latent code to generate high-resolution images and have been extended to 3D [70, 25, 37] or video generation [21, 65, 35, 4]. Instead of those methods focusing on content generation, we endow the diffusion model with perception and segmentation ability by using similar representations for the images.

Figure 2: Overview of the UniGS framework within the inpainting pipeline. Similar to Stable Diffusion, UniGS denoises features in the latent space with an encoder and decoder. We note that the UNet predictions $\hat{z}_i$ and $\hat{z}_m$ are unified representations that can be decoded into images and colormaps by a similar latent decoder.

Diffusion Model for Segmentation. Several studies have delved into pixel-level segmentation [6, 41, 75, 42, 18, 44, 48, 32, 43, 45] using diffusion models through three distinct pipelines. The first two pipelines leverage pre-trained Stable Diffusion [46] to simultaneously generate segmentation masks and images. Specifically, the first pipeline, as discussed in studies like [5, 3, 33, 54, 36], employs both self- and cross-attention maps in Stable Diffusion for shape grouping. However, these approaches demonstrate limited capabilities in instance- or entity-level discrimination [45, 43]. Conversely, the second pipeline [66, 67, 30, 69] integrates a segmentation branch to produce precise masks, at the expense of substantial computational cost. The third pipeline [8, 7] is instead conditioned on the input image and diffuses the image features into masks or bounding boxes. Furthermore, a prevalent issue with these methods is the inconsistency between the image generation and segmentation mask generation processes. In contrast, we develop a unified representation for both tasks by converting the segmentation mask into a colormap.

Unified Representation. Some foundation models [59, 60, 16] have explored a unified representation for both generation and perception tasks. Our work is most similar to Painter [59] and InstructDiffusion [16] but differs in design. Rather than reproducing the original color through MAE's [19] regression as in Painter [59], our approach gradually diffuses the latent code of the colormap over several time steps. Compared to InstructDiffusion [16], our framework offers greater flexibility by decoupling the image and colormap with distinct latent codes. As a result, there is no need to employ a lightweight segmentation branch for mask generation in our approach.

3 Method

Based on the latent diffusion model [46], the proposed UniGS framework aims to progressively and simultaneously denoise images and segmentation masks given a text prompt. In Figure 2, we show an overview of the UniGS model within the inpainting pipeline. Such a pipeline can address the challenge of insufficient segmentation datasets and unify multiple tasks in a single representation.

Specifically, the input of our UNet has four parts: the latent codes of the noised image, the noised colormap, and the context, plus a resized coarse mask. They are denoted by $z_t^i$, $z_t^s$, $z_t^c$, and $m^c$, respectively. Based on the text prompt, we use a UNet to denoise $z_t^i$ and $z_t^s$ to $\hat{z}_i$ and $\hat{z}_m$. During inference, $z_t^i$ and $z_t^s$ start as pure Gaussian noise. Compared to Stable Diffusion [46], there is no obvious structural difference except for the number of input and output channels.

| Notation | Definition | Notation | Definition |
|---|---|---|---|
| $\Upsilon$ | AutoEncoder (VAE) | $\Omega$ | Coarse Mask Generator |
| $\Psi$ | Colormap Encoder | $\Phi$ | Colormap Decoder |
| $M$ | Entity-level Masks | $M_c$ | Colormap |
| $I_0$ | Original Image | $m_c$ | Coarse Mask |

Table 1: Illustration of some notations in the Method section.

In the following, we begin with an overview of latent diffusion techniques for high-resolution image synthesis. Then, we introduce our novel representation for entity masks. Lastly, we present our whole inpainting pipeline and its extension to multiple tasks. Table 1 lists the essential notation used in this section.

3.1 Review of Latent Diffusion

Diffusion models [20] are a class of likelihood-based models that define a Markov chain of forward and backward processes, gradually adding noise to and removing noise from sample data. The forward process is defined as

$$q(z_t \mid z_0) = \mathcal{N}\left(z_t \mid \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right), \tag{1}$$

which transforms a data sample $z_0$ to a latent noisy sample $z_t$ for $t \in \{0, 1, \dots, T\}$ by adding Gaussian noise $\epsilon$ to $z_0$. Here $\bar{\alpha}_t \coloneqq \prod_{s=0}^{t} \alpha_s = \prod_{s=0}^{t} (1-\beta_s)$, where $\beta_s$ represents the noise variance schedule [20]. During training, a neural network (usually a UNet) $f_\theta(z_t, t)$ is trained to predict $\epsilon$ to recover $z_0$ from $z_t$ by minimizing the training objective with an $\ell_2$ loss [20]:

$$\mathcal{L}_{\text{train}} = \frac{1}{2}\,\|f_\theta(z_t, t) - \epsilon\|^2, \tag{2}$$

where $\theta$ denotes the parameters of the neural network. At the inference stage, the data sample $z_0$ is reconstructed from $z_T$ with the model $f_\theta$ and an updating rule [20, 51] in an iterative way, i.e., $z_T \rightarrow z_{T-\Delta} \rightarrow \dots \rightarrow z_0$. For a clear illustration, we omit the updating rule and regard the output of $f_\theta(z_t, t)$ as $z_0$. In the context of generating high-resolution images $I_0 \in \mathcal{R}^{h \times w \times 3}$ given a text prompt, diffusion models incur substantial computational cost at large image sizes such as $h = w = 512$, where $h$ and $w$ are the image height and width. To tackle this issue, latent diffusion models (LDMs) use the latent code of the image, $z_0^i \in \mathcal{R}^{\frac{h}{4} \times \frac{w}{4} \times 4}$ [58]:

$$z_0^i = \Upsilon(I) \quad \text{and} \quad \hat{I}_0 = \Upsilon^{-1}(\hat{z}_0^i), \tag{3}$$

where $\Upsilon$ and $\Upsilon^{-1}$ represent the encoding and decoding processes of the AutoEncoder (VAE) applied to $I$, and $\hat{*}$ indicates a predicted result. As such, latent diffusion reduces computational demands and maintains good generation ability. We base our UniGS model on Stable Diffusion [46], a popular LDM variant.
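The forward process and the training objective reviewed above can be illustrated with a small numerical sketch. It assumes the standard Gaussian forward step with identity covariance, a linear $\beta$ schedule, and a random array standing in for a latent code; the function names `make_alpha_bar`, `q_sample`, and `denoise_loss` are hypothetical and are not taken from the UniGS code.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s); the linear schedule is an assumption."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

def denoise_loss(pred_eps, eps):
    """Half the squared L2 norm of the noise-prediction error (Eq. 2)."""
    return 0.5 * np.sum((pred_eps - eps) ** 2)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((16, 16, 4))      # stand-in latent code, not a real VAE output
zt, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# A perfect noise predictor has zero loss; an all-zero predictor does not.
print(denoise_loss(eps, eps), denoise_loss(np.zeros_like(eps), eps))
```

In a real model, `pred_eps` would be the UNet output $f_\theta(z_t, t)$ rather than a hand-constructed array.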

3.2 Colormap-based Entity Mask Representation

We represent segmentation masks with a colormap, which aligns the mask representation with the image format while supporting a variable number of entities. However, designing a colormap encoder and decoder is non-trivial because each entity within the same category must remain distinguishable. Moreover, this representation lacks a standard segmentation loss in latent space, such as binary cross-entropy or dice loss. Using only the denoising loss in Eq. 2 for the colormap leads to the difficult cases shown in Figure 3. We therefore describe our location-aware palette and progressive dichotomy module, used in colormap encoding and decoding respectively, to address these problems.

Colormap Encoding. The colormap encoder $\Psi$ converts several entity-level binary segmentation masks $M \in \{0,1\}^{n \times h \times w}$ to a colormap $M_c \in [0, 255]^{h \times w \times 3}$ as

$$M_c = \Psi(M), \tag{4}$$

where $n$ denotes the number of sampled entities. $M_c$ is initialized with zeros, and each entity area is then assigned a color by our location-aware palette. Specifically, we partition an image into $b \times b$ grids where each grid has a unique color. Each entity area is assigned the fixed color of the grid that contains its center of gravity. Each RGB channel has five candidate color values $\{0, 64, 128, 192, 255\}$ in our location-aware palette. Thus, the overall number of colors is $124 = 5^3 - 1$, with color $(0, 0, 0)$ indicating the background. The number of grids $b^2 = |b \times b|$ should be less than 124.

The location-aware palette design proves simple yet effective, covering nearly all labeled entities (97.4% coverage ratio across the COCO, ADE20K, OpenImages, and EntitySeg datasets). This is because the UNet has a position encoding design that helps predict the corresponding colors. In contrast, random color assignment often struggles to distinguish between entities of the same category because it provides too large a color space.
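A minimal sketch of the location-aware palette is given below. The five candidate values per channel and the reserved background color follow the text; the row-major assignment of palette entries to grid cells, the default $b = 11$, and the name `encode_colormap` are assumptions made for illustration. In this sketch, two entities whose centers fall in the same grid cell would receive the same color.

```python
import itertools
import numpy as np

LEVELS = (0, 64, 128, 192, 255)
# 5^3 = 125 RGB triples; (0, 0, 0) is reserved for the background, leaving 124.
PALETTE = [c for c in itertools.product(LEVELS, repeat=3) if c != (0, 0, 0)]

def encode_colormap(masks, b=11):
    """Convert (n, h, w) binary masks into an (h, w, 3) uint8 colormap."""
    assert b * b < len(PALETTE), "grid count must stay below the palette size"
    _, h, w = masks.shape
    colormap = np.zeros((h, w, 3), dtype=np.uint8)   # background stays (0, 0, 0)
    for mask in masks:
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            continue                                  # skip empty masks
        # Grid cell that contains the mask's center of gravity.
        row = min(int(ys.mean() * b / h), b - 1)
        col = min(int(xs.mean() * b / w), b - 1)
        # Row-major cell-to-color mapping (an assumption of this sketch).
        colormap[mask.astype(bool)] = PALETTE[row * b + col]
    return colormap

masks = np.zeros((2, 64, 64), dtype=np.uint8)
masks[0, 0:10, 0:10] = 1       # entity near the top-left corner
masks[1, 40:60, 40:60] = 1     # entity near the bottom-right corner
colormap = encode_colormap(masks)
print(len(PALETTE), colormap[0, 0], colormap[50, 50])
```

Because the color depends only on where the entity sits, two entities of the same category at different locations receive different colors.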

Figure 3: Illustration of several difficult cases in the decoded colormap. We group these cases into three kinds of problems: boundary, similar color, and background.

Colormap Decoding. While the generated colormap effectively differentiates between entities visually, converting it into accurate entity-level masks presents several challenges. A primary issue is that the number of entities is not known in advance, so k-means clustering with a preset cluster number is impractical. To tackle this issue, we propose a progressive dichotomy module $\Phi$ that groups areas of identical color by pixel-level features $p$ without prior knowledge of the cluster number:

$$\hat{M} = \Phi(\hat{M_c}) = \Phi(\Upsilon^{-1}(\hat{z}_0^s)), \tag{5}$$

where $\hat{M_c}$ is the predicted colormap decoded by the VAE, and $\hat{M} \in \{0,1\}^{n \times H \times W}$ has $n$ binary masks.

Specifically, the progressive dichotomy module (PDM) is a depth-first cascaded clustering method in which we further split the $j^{\text{th}}$ entity mask $\hat{m}_{v-1}^{j}$ generated at the $(v-1)^{\text{th}}$ iteration into two sub-masks $\hat{m}_{v}^{2j}$ and $\hat{m}_{v}^{2j+1}$ at the $v^{\text{th}}$ iteration,

$$\{\hat{m}_{v}^{2j}, \hat{m}_{v}^{2j+1}\} = \mathcal{BK}(\hat{m}_{v-1}^{j}), \tag{6}$$

where $\mathcal{BK}$ denotes two-cluster k-means and each $\hat{m} \in \{0,1\}^{H \times W}$. Splitting of $\hat{m}_{v-1}^{j}$ stops once the average squared L2 distance of the mask pixels to their mean is less than $\delta$:

$$\frac{\sum_{o \in \hat{m}_{v-1}^{j}} (p_o - c_{\hat{m}_{v-1}^{j}})^2}{|\hat{m}_{v-1}^{j}|} < \delta, \tag{7}$$

where $c_{\hat{m}_{v-1}^{j}} = \frac{\sum_{o \in \hat{m}_{v-1}^{j}} p_o}{|\hat{m}_{v-1}^{j}|}$, with $|\hat{m}_{v-1}^{j}|$ denoting the number of pixels in mask $\hat{m}_{v-1}^{j}$.

The pixel feature $p_o$, with $o \in [0, \dots, h \times w)$, is designed in light of three critical observations in Figure 3. First, it is not trivial to discern whether a gradient color signifies one or multiple entities. Second, the foreground colors can be degraded by the background. Third, some black holes are hard to classify as true or false positives. Thus, we design $p_o \in \mathcal{R}^{1 \times 6}$ using both the RGB and LAB color spaces. Including the LAB color space is pivotal because of its perceptual uniformity, which ensures that minor variations in LAB values translate to approximately uniform changes in color as perceived by the human eye, thereby providing enhanced contrast.
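The recursive splitting in Eqs. 6 and 7 can be sketched as follows. This is an illustrative simplification rather than the authors' implementation: it uses RGB features only (the LAB half of $p_o$ is omitted), a hand-written two-cluster k-means, an arbitrary threshold `delta=100.0`, and it returns the background as one of the groups. The names `two_means` and `progressive_dichotomy` are hypothetical.

```python
import numpy as np

def two_means(feats, iters=10):
    """Plain 2-cluster k-means; returns one boolean label per feature row."""
    # Initialise with the extreme points along the highest-variance channel.
    axis = np.argmax(feats.var(axis=0))
    centers = np.stack([feats[feats[:, axis].argmin()],
                        feats[feats[:, axis].argmax()]]).astype(float)
    labels = np.zeros(len(feats), dtype=bool)
    for _ in range(iters):
        dist = ((feats[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = dist[:, 1] < dist[:, 0]
        for k in (0, 1):
            selected = labels == bool(k)
            if selected.any():
                centers[k] = feats[selected].mean(axis=0)
    return labels

def progressive_dichotomy(colormap, delta=100.0):
    """Split an (h, w, 3) colormap into binary masks by depth-first 2-means."""
    h, w, _ = colormap.shape
    feats = colormap.reshape(-1, 3).astype(float)    # RGB only in this sketch
    stack = [np.arange(h * w)]                       # start from all pixels
    masks = []
    while stack:
        idx = stack.pop()                            # LIFO order gives depth-first
        f = feats[idx]
        spread = ((f - f.mean(axis=0)) ** 2).sum(axis=1).mean()   # Eq. 7
        labels = None if spread < delta else two_means(f)
        if labels is None or labels.all() or not labels.any():
            mask = np.zeros(h * w, dtype=bool)       # homogeneous group: emit a mask
            mask[idx] = True
            masks.append(mask.reshape(h, w))
        else:
            stack.extend([idx[labels], idx[~labels]])             # Eq. 6
    return np.stack(masks)

colormap = np.zeros((32, 32, 3), dtype=np.uint8)
colormap[2:10, 2:10] = (255, 0, 0)       # first entity
colormap[20:30, 20:30] = (0, 0, 255)     # second entity
masks = progressive_dichotomy(colormap)
print(masks.shape, sorted(int(m.sum()) for m in masks))
```

The number of output masks is never passed in; it emerges from where the stopping criterion fires, which is the property the module relies on.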

| Task | Condition: coarse mask ($m^c$) | Condition: control factor ($z_i^c$) | Condition: text prompt template | Output: image ($\hat{z}^i_0$) | Output: mask ($\hat{z}^s_0$) |
|---|---|---|---|---|---|
| Inpainting | $\Omega(M)$ | $\Upsilon(I_0 \odot (1 - m^c))$ | ‘inpainting: generate dog.’ | | |
| Image Synthesis | $\mathcal{J}$ | $\Upsilon(M_c)$ | ‘synthesis: generate dog, ground, and sky.’ | | |
| Referring Segmentation | $\Omega(M)$ | $\Upsilon(I_0)$ | ‘referring: find dog.’ | | |
| Entity Segmentation | $\mathcal{J}$ | $\Upsilon(I_0)$ | ‘panoptic: all entities.’ | | |

Table 2: Illustration of the condition signal’s design on training task in our framework. The ✓ and ✗ indicate whether we expect the two output tensors to be the same as our condition $z_i^c$.

3.3 Inpainting Pipeline

We adopt an inpainting pipeline for training and inference to reconcile the generative model’s requirements for large-scale segmentation datasets. For example, the Open-Images dataset [2] with mask annotations encompasses approximately 1.8 million images but only contains about three entity-level labels per image. Directly training the latent diffusion model results in too many ambiguities due to unlabeled areas. Instead, our inpainting pipeline enables the generative model to concentrate on the valid areas regardless of the partial segmentation labels.

In the training period, the UNet input is zuH4×W4×13superscript𝑧𝑢superscript𝐻4𝑊413z^{u}\in\mathcal{R}^{\frac{H}{4}\times\frac{W}{4}\times 13}italic_z start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT ∈ caligraphic_R start_POSTSUPERSCRIPT divide start_ARG italic_H end_ARG start_ARG 4 end_ARG × divide start_ARG italic_W end_ARG start_ARG 4 end_ARG × 13 end_POSTSUPERSCRIPT that concatenated by

$$z_t^u = \text{CONCAT}(z_t^i, z_t^s, m^c, z_t^c) \tag{8}$$

$z_t^i$ and $z_t^s$ are the latent codes of the noised image $I_t$ and colormap $S_t$ at time step $t$, where both $z_t^i$ and $z_t^s$ lie in $\mathcal{R}^{\frac{H}{4}\times\frac{W}{4}\times 4}$. $m^c \in \{0,1\}^{\frac{H}{4}\times\frac{W}{4}\times 1}$ is a coarse mask where $1$ indicates a rectangular or irregular area in which our UniGS framework should fill in entities and their masks,

$$m^c = \Omega(M) \tag{9}$$

More details regarding $\Omega$ are available in the supplementary material. Next, $z_t^c$ is the latent code of the image masked by $m^c$,

$$z_t^c = \Upsilon(I_0 \odot (1 - m^c)) \tag{10}$$

The UNet output is

$$\hat{z}_0 = f_\theta(z_t^u, t) \in \mathcal{R}^{\frac{H}{4}\times\frac{W}{4}\times 8} \tag{11}$$

where the first four and the last four channels of $\hat{z}_0$ are decoded by the latent decoder into the final image and the colormap, respectively.
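The channel bookkeeping of Eqs. (8) and (11) can be checked with a small NumPy sketch. It only illustrates tensor shapes: the random arrays stand in for real latent codes, and the function names `assemble_unet_input` and `split_unet_output` are hypothetical, not taken from the released code.

```python
import numpy as np

def assemble_unet_input(z_img, z_seg, coarse_mask, z_cond):
    """Channel-wise concatenation as in Eq. (8): 4 + 4 + 1 + 4 = 13 channels."""
    return np.concatenate([z_img, z_seg, coarse_mask, z_cond], axis=-1)

def split_unet_output(z_hat):
    """Split the 8-channel prediction of Eq. (11) into image and colormap latents."""
    return z_hat[..., :4], z_hat[..., 4:]

H, W = 512, 512
h, w = H // 4, W // 4                      # latent resolution H/4 x W/4
rng = np.random.default_rng(0)

z_img = rng.standard_normal((h, w, 4))     # stand-in for z_t^i (noised image latent)
z_seg = rng.standard_normal((h, w, 4))     # stand-in for z_t^s (noised colormap latent)
coarse_mask = np.zeros((h, w, 1))          # m^c, 1 marks the area to fill
coarse_mask[32:96, 32:96] = 1.0
z_cond = rng.standard_normal((h, w, 4))    # stand-in for z_t^c (condition latent)

z_u = assemble_unet_input(z_img, z_seg, coarse_mask, z_cond)
z_hat = rng.standard_normal((h, w, 8))     # stand-in for the UNet output f_theta(z_u, t)
z_hat_img, z_hat_seg = split_unet_output(z_hat)

print(z_u.shape, z_hat_img.shape, z_hat_seg.shape)
```

The coarse mask occupies the single channel at index 8 of the input, between the colormap latent and the condition latent.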

3.4 One-to-Many Tasks

The inpainting pipeline with colormap representation allows for integrating various tasks within a single model. We use the UniGS model for four vision tasks: inpainting, image synthesis, referring segmentation, and entity segmentation. The configuration of each task is presented in Table 2.

Multi-class Multi-region Inpainting. This is our baseline task, detailed in Section 3.3.

Image Synthesis. $z_t^c$ is the latent code of the colormap $M_c$ containing the sampled entities. Meanwhile, the coarse mask $m^c$ is the all-ones matrix $\mathcal{J}$, covering the entire image area. For the output, we maintain $\hat{z}^s_0 = \Upsilon(M_c)$ and predict the entities’ appearance $\hat{z}^i_0$.

Referring Segmentation. This task aims to segment certain classes based on instructions. Thus, we preserve the image information by setting $z_t^c = \Upsilon(I_0)$. Because negative samples are required to ensure alignment between the entity and the text prompt, we define $\lambda$ as the probability that each sampled entity is treated as a negative during training. For negative samples, the category names in the text prompts are replaced with others that do not appear in the coarse mask.

Entity Segmentation. All the entities should be predicted in $\hat{z}^s_0$, with the coarse mask set to the all-ones matrix $\mathcal{J}$.
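The four task configurations can be summarized as a small lookup, sketched below in Python. The prompt templates are transcribed from Table 2. The dictionary layout, the helper names (`join_names`, `build_prompt`, `referring_names`), and the argument `neg_prob` (playing the role of $\lambda$) are assumptions made for illustration. The sketch covers only the text side of negative sampling, not how the target colormap is built for a negative.

```python
import random

# Coarse mask, condition source, and prompt template per task (from Table 2).
TASKS = {
    "inpainting": {"coarse_mask": "Omega(M)", "condition": "I0 * (1 - m^c)",
                   "template": "inpainting: generate {}."},
    "synthesis":  {"coarse_mask": "all-ones J", "condition": "M_c",
                   "template": "synthesis: generate {}."},
    "referring":  {"coarse_mask": "Omega(M)", "condition": "I0",
                   "template": "referring: find {}."},
    "entity":     {"coarse_mask": "all-ones J", "condition": "I0",
                   "template": "panoptic: all entities."},
}

def join_names(names):
    """Join category names as 'a', 'a and b', or 'a, b, and c'."""
    if not names:
        raise ValueError("at least one category name is required")
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return f"{names[0]} and {names[1]}"
    return ", ".join(names[:-1]) + f", and {names[-1]}"

def build_prompt(task, names=()):
    """Fill the task template; the entity task has no name slot."""
    template = TASKS[task]["template"]
    return template.format(join_names(list(names))) if "{}" in template else template

def referring_names(present, vocabulary, neg_prob, rng):
    """With probability neg_prob, replace each sampled name by a category
    that does not appear in the coarse mask (a negative sample)."""
    absent = [c for c in vocabulary if c not in present]
    return [rng.choice(absent) if absent and rng.random() < neg_prob else name
            for name in present]

rng = random.Random(0)
print(build_prompt("synthesis", ["dog", "ground", "sky"]))
print(build_prompt("referring", referring_names(["dog"], ["dog", "cat", "car"], 1.0, rng)))
```

Keeping the task choice in the text prompt and the condition tensor lets the same network input layout serve all four tasks.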

Figure 4: Qualitative comparison of inpainting results between Stable Diffusion and our UniGS. The coarse masks follow the same pattern as those used in our training phase to eliminate the pattern gap. Furthermore, we showcase multi-class, multi-region inpainting with multiple entities within the same category, moving beyond the conventional approach of inserting a single entity.

4 Experiments

In this section, we first evaluate our proposed UniGS on four individual tasks: multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation. We then ablate key module designs and hyper-parameters that affect mask quality on referring segmentation. Following other works [9, 47, 45], we use the intersection over union (IoU) and recall for mask evaluation, and the Fréchet inception distance (FID) and CLIP score (CS) for image generation.

4.1 Experiment Setting

For each single-task model, we exclusively use the COCO [31], Open Images [26], and EntitySeg [45] datasets as our training data. Because about 10% of the area in the COCO panoptic data is ignored, we use only EntitySeg for the entity segmentation task to avoid performance degradation.

In our training process, we randomly sample up to four objects per sampled area for tasks such as inpainting, image synthesis, and referring segmentation. On the other hand, entity segmentation should include all the entities that can cover the whole sampled area. During the inference period, we sample 1000 images in COCO validation data as our test set, where each image has a coarse mask and various control factors for different tasks, as shown in Table 2.

We initialize our model from the Stable Diffusion v1.5 inpainting model and set the weights of the newly added channels to zero. The image size and latent factor reduction ratio are set to 512$\times$512.
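The zero initialization of the newly added channels can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: it assumes the pretrained inpainting UNet has a 9-channel first convolution ordered as noisy latent (4), mask (1), masked-image latent (4) with 320 output filters, that the new 13-channel input follows the order of Eq. (8), and that the four extra output channels are also zero-initialized. The helper names are hypothetical.

```python
import numpy as np

def expand_conv_in(weight, new_in, copy_map):
    """Build a zero-filled (out, new_in, k, k) kernel and copy each pretrained
    input channel to the position given in copy_map."""
    out_ch, old_in, kh, kw = weight.shape
    assert len(copy_map) == old_in
    expanded = np.zeros((out_ch, new_in, kh, kw), dtype=weight.dtype)
    expanded[:, copy_map] = weight
    return expanded

def expand_conv_out(weight, bias, new_out):
    """Keep pretrained output filters first; the added filters and biases are zero."""
    out_ch = weight.shape[0]
    new_w = np.zeros((new_out,) + weight.shape[1:], dtype=weight.dtype)
    new_b = np.zeros((new_out,), dtype=bias.dtype)
    new_w[:out_ch] = weight
    new_b[:out_ch] = bias
    return new_w, new_b

rng = np.random.default_rng(0)
w_in = rng.standard_normal((320, 9, 3, 3)).astype(np.float32)   # assumed pretrained shape
w_out = rng.standard_normal((4, 320, 3, 3)).astype(np.float32)  # assumed pretrained shape
b_out = rng.standard_normal(4).astype(np.float32)

# Assumed new layout: z^i -> 0..3, z^s -> 4..7 (new), m^c -> 8, z^c -> 9..12.
copy_map = [0, 1, 2, 3, 8, 9, 10, 11, 12]
w_in_new = expand_conv_in(w_in, 13, copy_map)
w_out_new, b_out_new = expand_conv_out(w_out, b_out, 8)

print(w_in_new.shape, w_out_new.shape)
```

With this initialization, the expanded first convolution produces exactly the pretrained activations at the start of training, because the colormap channels contribute nothing until their weights are updated.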

4.2 Multi-class Multi-region Inpainting

| Method | FID (↓), single object | CLIP Score (↑), single object | FID (↓), multiple objects | CLIP Score (↑), multiple objects |
|---|---|---|---|---|
| SD$_{\text{I}}^{1.5}$ | 4.95 | 88.86 | 7.82 | 83.80 |
| UniGS$^{1.4}$ | 4.39 | 88.92 | 6.19 | 84.43 |
| UniGS$_{\text{I}}^{1.5}$ | 3.78 | 90.22 | 5.89 | 85.87 |

Table 3: Quantitative results on the image inpainting task. SD$_{\text{I}}^{1.5}$ denotes the Stable Diffusion inpainting model, version 1.5. UniGS$^{1.4}$ is UniGS initialized from the Stable Diffusion model, version 1.4, and UniGS$_{\text{I}}^{1.5}$ is UniGS initialized from the Stable Diffusion inpainting model, version 1.5.

We evaluate the inpainting model by inserting one or multiple objects into the coarse mask area generated from the entity masks. The model’s output in this task includes the generated image and colormap. As shown in Table 3, our method outperforms the original model in both FID and CLIP score. This improvement holds even when our model is initialized from the Stable Diffusion 1.4 pre-trained model, highlighting the effectiveness of integrating object mask guidance into image generation. This is because our unified representation constrains the model to maintain consistency between the visual appearance of the objects and their corresponding masks. As a result, the object masks impart a robust shape prior, guiding and refining the image generation process to ensure alignment and coherence in the final output.

Figure 4 shows a visual comparison between Stable Diffusion and our UniGS, with coarse masks generated by the same code used during training. In other words, our testing keeps a similar pattern of inpainting areas. These results show that the generated objects agree closely with the high-quality masks and are integrated seamlessly into the overall image composition. We attribute this coherence between generated objects and their masks to our model’s unified representation of images and segmentation masks.

Figure 5: Qualitative comparison among Stable Diffusion, ControlNet, T2I-Adapter, and our UniGS on the image synthesis task. Compared to those methods, UniGS maintains more coherence between the generated entities and their corresponding segmentation masks.

Figure 6 presents visualization results with real-world irregular coarse masks generated from brush strokes. Our UniGS model retains considerable object-insertion ability without suffering from a domain gap. We also make some interesting observations. For example, our model can generate realistic shadows of the inserted objects in the last two columns, which are rarely observed in other generation models. This capability emerges despite the absence of explicit shadow supervision during training. A possible reason is that the mask generation target forces our UniGS model to learn the grouping behavior of pixels with similar textures. Thus, it is easier for UniGS to recognize and replicate shadow patterns from a context that contains shadows. This attribute enhances the overall realism and visual coherence of the inserted objects.

4.3 Image Synthesis

In this task, the model takes a colormap together with a text-based image synthesis prompt as input and outputs a synthesized image. In addition to the conventional Fréchet inception distance (FID) and CLIP score, we use the mean Intersection over Union (mIoU) to evaluate how well the generated image aligns with the specified mask shapes. For this external evaluation, we use the Mask2Former model with a Swin-Large backbone to segment the images generated by our model. The mIoU is then calculated by comparing these segmentation masks against the original colormap. In Figure 5, we compare the visual consistency between the synthesized objects and the provided color masks among four methods: Stable Diffusion, ControlNet, T2I-Adapter, and UniGS. We modify the pipelines of the compared approaches for a fair comparison. For Stable Diffusion, which is not designed for mask-conditioned synthesis, we use only text prompts as conditions. For ControlNet and T2I-Adapter, we follow the default settings and input the segmentation map and text prompt to obtain the synthesized image. The figure highlights our model’s ability to align the generated objects closely with the specified mask constraints. Moreover, the visualizations show that these objects are integrated seamlessly into their backgrounds, further highlighting the benefits of our unified representation for synthesizing contextually coherent and visually harmonious images. As shown in Table 4, our method is favorable on all three metrics: FID, CLIP score, and mIoU. This comparison indicates that a unified representation for both image and segmentation mask helps the image synthesis network achieve higher perceptual quality and more precise mask-to-object alignment.
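For reference, the mask-level IoU and its mean over a set of mask pairs can be computed as in the NumPy sketch below. How predicted segments are matched to colormap entities is not specified here, so the sketch assumes the pairs are already matched; the function names are illustrative.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(pairs):
    """Average IoU over already-matched (prediction, target) mask pairs."""
    return float(np.mean([mask_iou(p, t) for p, t in pairs]))

a = np.zeros((4, 4), dtype=bool)
a[:2, :] = True            # top half: 8 pixels
b = np.zeros((4, 4), dtype=bool)
b[:, :2] = True            # left half: 8 pixels, overlapping a in 4 pixels

print(mask_iou(a, b))      # 4 / 12
print(mean_iou([(a, a), (a, b)]))
```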

| Method | FID (↓), single object | CS (↑), single object | mIoU (↑), single object | FID (↓), multiple objects | CS (↑), multiple objects | mIoU (↑), multiple objects |
|---|---|---|---|---|---|---|
| SD [70] | 36.502 | 54.708 | 0.196 | 34.770 | 56.511 | 0.191 |
| CN [72] | 35.111 | 55.230 | 0.277 | 30.108 | 58.709 | 0.326 |
| T2I [38] | 34.434 | 59.024 | 0.306 | 24.898 | 62.910 | 0.379 |
| UniGS | 15.272 | 65.015 | 0.781 | 14.271 | 69.504 | 0.777 |

Table 4: Quantitative results on the image synthesis task. ‘SD’, ‘CN’, and ‘T2I’ indicate Stable Diffusion, ControlNet, and T2I-Adapter. ‘CS’ is the CLIP Score.

4.4 Referring Segmentation

We evaluate the quality of the generated masks with the mIoU and recall metrics for the referring segmentation task. As shown in Table 5, our generative method achieves segmentation quality close to that of the state-of-the-art segmentation method Mask2Former [10]. These results are noteworthy because we do not use any explicit segmentation loss, demonstrating the potential of the UniGS model.

| Method | Backbone | mIoU (↑) | Recall (↑) |
|---|---|---|---|
| Mask2Former [10] | Swin-Large | 0.815 | 0.887 |
| UniGS (Ours) | - | 0.808 | 0.872 |

Table 5: Quantitative results on the referring segmentation task. We choose Mask2Former with a Swin-Large backbone as our baseline for comparison with SOTA segmentation methods.

| Method | Backbone | mIoU (↑) | Recall (↑) | AP$^e$ (↑) |
|---|---|---|---|---|
| CondInst-Entity [55, 43] | Swin-Large | 0.621 | 0.685 | 0.397 |
| SAM [24] | ViT-Huge | 0.653 | 0.714 | 0.432 |
| CropFormer [45] | Swin-Large | 0.664 | 0.727 | 0.449 |
| UniGS (Ours) | - | 0.631 | 0.692 | 0.407 |

Table 6: Quantitative results on the entity segmentation task. AP$^e$ is AP with the non-overlapping constraint used in entity segmentation.

4.5 Entity Segmentation

The entity segmentation aims at splitting an input image into several semantically coherent regions.

Figure 6: Visualization results of our UniGS in the real world. We generate the coarse masks used in inpainting by brush strokes from Gradio. We identified some interesting observations, particularly the appearance of shadows from the third to sixth columns.

Thus, the generated colormap should cover the whole image. After decoding the UNet output with the latent decoder, we use the progressive dichotomy module to transform the colormap into explicit segmentation masks. Table 6 shows that there is still a significant performance gap between UniGS and the state-of-the-art entity segmentation model. However, we note that the entity segmentation performance of UniGS is acceptable and better than that of kernel-based methods like CondInst.

| Method | mIoU (↑) | Recall (↑) |
|---|---|---|
| Random Color Assignment | 0.493 | 0.563 |
| Location-aware Palette (Ours) | 0.808 | 0.872 |

Table 7: Ablation study on various color assignments in the mask encoder. ‘Random Color Assignment’ indicates assigning each entity a random color.

4.6 Ablation Study

In the following, we ablate different color assignment criteria and progressive dichotomy modules with various hyper-parameters. All the ablation studies are conducted in referring segmentation tasks to measure the mask quality.

Location-aware Palette. To evaluate the effectiveness of our color mapping over the random color assignment for the object, we have individually trained the referring segmentation models for both color mapping methods, as shown in Table 7. Our color mapping method lets the model easily learn the pattern of the object color mask.
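To make the idea of a location-dependent color concrete, the sketch below shows one hypothetical palette in NumPy: the image is divided into a grid, each cell has a fixed color, and an entity takes the color of the cell that contains its center of mass. The grid size, the color values, and the function names are all assumptions for illustration and should not be read as the palette used in UniGS. In particular, the sketch does not resolve two entities whose centers fall into the same cell.

```python
import numpy as np

def center_of_mass(mask):
    """Mean row and column of the foreground pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    return ys.mean(), xs.mean()

def encode_colormap(masks, grid=4):
    """Paint each entity with the fixed color of the grid cell holding its center.
    Hypothetical palette: red encodes the cell row, green the cell column."""
    h, w = masks[0].shape
    levels = np.linspace(32, 255, grid).round().astype(np.uint8)
    colormap = np.zeros((h, w, 3), dtype=np.uint8)   # black = no entity
    for mask in masks:
        cy, cx = center_of_mass(mask)
        row = min(int(cy * grid / h), grid - 1)
        col = min(int(cx * grid / w), grid - 1)
        colormap[np.asarray(mask, dtype=bool)] = (levels[row], levels[col], 128)
    return colormap

m1 = np.zeros((8, 8), dtype=bool)
m1[0:2, 0:2] = True        # entity near the top-left corner
m2 = np.zeros((8, 8), dtype=bool)
m2[6:8, 6:8] = True        # entity near the bottom-right corner
cm = encode_colormap([m1, m2])

print(cm[0, 0].tolist(), cm[7, 7].tolist())
```

Because the color depends only on position, the same entity receives the same color every time it is encoded, whereas a random assignment gives the network no stable target.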

Progressive Dichotomy Module. Compared to k-means, which requires a fixed cluster number, our proposed progressive dichotomy module adapts the cluster number automatically. We verify our method in Table 8 by comparing it with K-Means. Our progressive dichotomy module shows no noticeable performance degradation compared to K-Means, even when K-Means is given the ground-truth cluster number, demonstrating the effectiveness and robustness of our progressive dichotomy module.

Furthermore, the distance threshold $\delta$ and the pixel feature used in the progressive dichotomy module are ablated in Table 9. Table 9(a) shows that the segmentation performance is robust to the distance threshold in the range from 0 to 20. Table 9(b) shows that decoding the generated colormap with both RGB and LAB pixel features yields the best mask quality, because the LAB space offers more contrast between colors that are similar in RGB space.
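The role of the distance threshold can be illustrated with a simplified, depth-first bisection in NumPy. This is a sketch under assumptions rather than the module itself: it splits a set of pixel features with plain 2-means, keeps splitting while the two sub-cluster centers are farther apart than `delta`, and otherwise treats the set as one entity. It uses RGB features only, and the names `two_means` and `dichotomy` are hypothetical.

```python
import numpy as np

def two_means(x, iters=10):
    """Plain 2-means with a deterministic farthest-point initialization."""
    c0 = x[np.argmax(np.linalg.norm(x - x.mean(axis=0), axis=1))]
    c1 = x[np.argmax(np.linalg.norm(x - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(iters):
        assign = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = x[assign == k].mean(axis=0)
    assign = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(axis=1)
    return assign, centers

def dichotomy(features, delta):
    """Depth-first bisection; the number of clusters is not given in advance."""
    labels = np.zeros(len(features), dtype=int)
    next_label = 0
    stack = [np.arange(len(features))]       # explicit stack = depth-first order
    while stack:
        idx = stack.pop()
        assign, centers = two_means(features[idx])
        far_apart = np.linalg.norm(centers[0] - centers[1]) > delta
        if far_apart and assign.min() != assign.max():
            stack.append(idx[assign == 1])
            stack.append(idx[assign == 0])
        else:                                 # centers within delta: one entity
            labels[idx] = next_label
            next_label += 1
    return labels

# Three noisy colors, 50 pixels each, as a stand-in for a decoded colormap.
rng = np.random.default_rng(0)
colors = np.array([[20, 20, 20], [200, 240, 60], [240, 230, 80]], dtype=float)
features = np.repeat(colors, 50, axis=0) + rng.uniform(-2, 2, size=(150, 3))
labels = dichotomy(features, delta=10.0)

print(len(np.unique(labels)))
```

In this toy example, a threshold larger than the gap between the second and third colors (about 46) would merge them into one cluster, which is consistent with the drop in mask quality reported for a large $\delta$ in Table 9(a).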

| Method | Cluster Numbers | mIoU (↑) | Recall (↑) |
|---|---|---|---|
| Native K-Means | Fixed (3) | 0.520 | 0.641 |
| Native K-Means | Adaptive (GT) | 0.810 | 0.874 |
| PDM | Adaptive | 0.808 | 0.872 |

Table 8: Comparison between native K-Means and our progressive dichotomy module (PDM). ‘Fixed (3)’ indicates that native K-Means is run with three clusters. ‘Adaptive (GT)’ indicates that the cluster number is set to the ground-truth number.

(a)

| $\delta$ | mIoU (↑) | Recall (↑) |
|---|---|---|
| 1 | 0.804 | 0.868 |
| 10 | 0.808 | 0.872 |
| 20 | 0.791 | 0.857 |
| 50 | 0.705 | 0.789 |

(b)

| Pixel feature | mIoU (↑) | Recall (↑) |
|---|---|---|
| RGB | 0.796 | 0.860 |
| LAB | 0.787 | 0.856 |
| RGB + LAB | 0.808 | 0.872 |

Table 9: Ablation study on the progressive dichotomy module. We ablate the distance threshold $\delta$ (a) and the pixel feature (b) in the colormap decoding process.

5 Conclusion

This paper introduces a novel, effective unified representation in image generation and segmentation tasks. The key to our approach is regarding entity-level segmentation masks as a colormap generation problem. To distinguish entities within the same category, we employ a location-aware palette where each entity is distinctly colored based on its center-of-mass location. Furthermore, our progressive dichotomy module can efficiently transform a generated, albeit noisy, colormap into high-quality segmentation masks. Our extensive experiments on four diverse tasks demonstrate the robustness and versatility of our unified representation in image generation and segmentation. In the future, we will explore the multi-task training of our unified representation in a single model. We hope our work can foster the development of a foundation model with a unified representation for various tasks.

References

  • Batzolis et al. [2021] Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb, and Christian Etmann. Conditional image generation with score-based diffusion models. arXiv preprint arXiv:2111.13606, 2021.
  • Benenson et al. [2019] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive object segmentation with human annotators. In CVPR, 2019.
  • Burgert et al. [2022] Ryan Burgert, Kanchana Ranasinghe, Xiang Li, and Michael S Ryoo. Peekaboo: Text to image diffusion models are zero-shot segmentors. arXiv preprint arXiv:2211.13224, 2022.
  • Ceylan et al. [2023] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In ICCV, 2023.
  • Chefer et al. [2023] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. arXiv preprint arXiv:2301.13826, 2023.
  • Chen et al. [2017] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
  • Chen et al. [2023a] Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Diffusiondet: Diffusion model for object detection. In ICCV, 2023a.
  • Chen et al. [2023b] Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, and David J Fleet. A generalist framework for panoptic segmentation of images and videos. In ICCV, 2023b.
  • Chen et al. [2023c] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023c.
  • Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022.
  • Cheng et al. [2023] Bin Cheng, Zuhao Liu, Yunbo Peng, and Yue Lin. General image-to-image translation with one-shot image guidance. In ICCV, 2023.
  • Chu et al. [2021] Ruihang Chu, Yukang Chen, Tao Kong, Lu Qi, and Lei Li. Icm-3d: Instantiated category modeling for 3d instance segmentation. RAL, 2021.
  • Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In CVPR, 2022.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
  • Dockhorn et al. [2022] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped langevin diffusion. In ICLR, 2022.
  • Geng et al. [2023] Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, et al. Instructdiffusion: A generalist modeling interface for vision tasks. arXiv preprint arXiv:2309.03895, 2023.
  • Gregor et al. [2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
  • He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • Kawar et al. [2023] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
  • Kingma et al. [2019] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In ICCV, 2023.
  • Koo et al. [2023] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Minhyuk Sung. Salad: Part-level latent diffusion for 3d shape generation and manipulation. In ICCV, 2023.
  • Kuznetsova et al. [2020] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020.
  • Li et al. [2019] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. NeurIPS, 2019.
  • Li et al. [2023a] Bo Li, Kaitao Xue, Bin Liu, and Yu-Kun Lai. Bbdm: Image-to-image translation with brownian bridge diffusion models. In CVPR, 2023a.
  • Li et al. [2022] Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia. Mat: Mask-aware transformer for large hole image inpainting. In CVPR, 2022.
  • Li et al. [2023b] Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Guiding text-to-image diffusion model towards grounded generation. In ICCV, 2023b.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
  • Liu et al. [2018] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • Liu et al. [2023] Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, and Donglin Wang. Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders. arXiv preprint arXiv:2309.01141, 2023.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
  • Luo et al. [2023] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR, 2023.
  • Ma et al. [2023] Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, and Yanfeng Wang. Diffusionseg: Adapting diffusion towards unsupervised object discovery. arXiv preprint arXiv:2303.09813, 2023.
  • Metzer et al. [2023] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, 2023.
  • Mou et al. [2023] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2021.
  • Qi et al. [2019] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with kins dataset. In CVPR, 2019.
  • Qi et al. [2021a] Lu Qi, Jason Kuen, Jiuxiang Gu, Zhe Lin, Yi Wang, Yukang Chen, Yanwei Li, and Jiaya Jia. Multi-scale aligned distillation for low-resolution detection. In CVPR, 2021a.
  • Qi et al. [2021b] Lu Qi, Yi Wang, Yukang Chen, Ying-Cong Chen, Xiangyu Zhang, Jian Sun, and Jiaya Jia. Pointins: Point-based instance segmentation. TPAMI, 2021b.
  • Qi et al. [2022] Lu Qi, Jason Kuen, Yi Wang, Jiuxiang Gu, Hengshuang Zhao, Philip Torr, Zhe Lin, and Jiaya Jia. Open world entity segmentation. TPAMI, 2022.
  • Qi et al. [2023a] Lu Qi, Jason Kuen, Weidong Guo, Jiuxiang Gu, Zhe Lin, Bo Du, Yu Xu, and Ming-Hsuan Yang. Aims: All-inclusive multi-level segmentation for anything. In NeurIPS, 2023a.
  • Qi et al. [2023b] Lu Qi, Jason Kuen, Tiancheng Shen, Jiuxiang Gu, Wenbo Li, Weidong Guo, Jiaya Jia, Zhe Lin, and Ming-Hsuan Yang. High quality entity segmentation. In ICCV, 2023b.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  • Shen et al. [2022] Tiancheng Shen, Yuechen Zhang, Lu Qi, Jason Kuen, Xingyu Xie, Jianlong Wu, Zhe Lin, and Jiaya Jia. High quality segmentation for ultra high-resolution images. In CVPR, 2022.
  • Shu [2014] Guang Shu. Human detection, tracking and segmentation in surveillance video. 2014.
  • Sinha et al. [2021] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-decoding models for few-shot conditional generation. In NeurIPS, 2021.
  • Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  • Suetens [2017] Paul Suetens. Fundamentals of medical imaging. Cambridge university press, 2017.
  • Svitov et al. [2023] David Svitov, Dmitrii Gudkov, Renat Bashirov, and Victor Lempitsky. Dinar: Diffusion inpainting of neural textures for one-shot human avatars. In ICCV, 2023.
  • Tian et al. [2023] Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar Gonzalez-Franco. Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion. arXiv preprint arXiv:2308.12469, 2023.
  • Tian et al. [2020] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
  • Tsai et al. [2023] Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin CK Chan, and Ming-Hsuan Yang. Dual associated encoder for face restoration. arXiv preprint arXiv:2308.07314, 2023.
  • Tumanyan et al. [2023] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
  • Van Den Oord et al. [2017] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
  • Wang et al. [2023a] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023a.
  • Wang et al. [2023b] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context. In ICCV, 2023b.
  • Wang et al. [2021] Yi Wang, Lu Qi, Ying-Cong Chen, Xiangyu Zhang, and Jiaya Jia. Image synthesis via semantic composition. In ICCV, 2021.
  • Wang et al. [2022] Yi Wang, Menghan Xia, Lu Qi, Jing Shao, and Yu Qiao. Palgan: Image colorization with palette generative adversarial networks. In ECCV, 2022.
  • Wu et al. [2023a] Chanyue Wu, Dong Wang, Yunpeng Bai, Hanyu Mao, Ying Li, and Qiang Shen. Hsr-diff: hyperspectral image super-resolution via conditional diffusion models. In ICCV, 2023a.
  • Wu and De la Torre [2023] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In ICCV, 2023.
  • Wu et al. [2023b] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, 2023b.
  • Wu et al. [2023c] Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, and Chunhua Shen. Datasetdm: Synthesizing data with perception annotations using diffusion models. In NeurIPS, 2023c.
  • Wu et al. [2023d] Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, and Chunhua Shen. Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models. In ICCV, 2023d.
  • Xie et al. [2023] Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. Smartbrush: Text and shape guided object inpainting with diffusion model. In CVPR, 2023.
  • Xu et al. [2023a] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, 2023a.
  • Xu et al. [2023b] Minkai Xu, Alexander S Powers, Ron O Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3d molecule generation. In ICML, 2023b.
  • Yang et al. [2023] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In CVPR, 2023.
  • Zhang et al. [2023a] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023a.
  • Zhang et al. [2023b] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. In CVPR, 2023b.
  • Zhao et al. [2019] Bo Zhao, Lili Meng, Weidong Yin, and Leonid Sigal. Image generation from layout. In CVPR, 2019.
  • Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
UniGS: Unified Representation for Image Generation and Segmentation

Supplementary Material

This supplementary material provides more visualization results and training/inference details for multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation. It is organized as follows:

  • Training/inference details on four tasks.

  • More visualization results, including an illustration of colormap decoding and generation results for three tasks.

Please also see the recorded video for a brief overview of our paper.

6 Training/inference Details

Multi-class multi-region inpainting

We train our model on the COCO and Open-Images datasets, initializing it from the Stable Diffusion inpainting model v1.5. For each image, we sample at most four objects. Based on the ground-truth mask, we build the coarse mask with one of two schemes: (1) simulate a coarser mask with curves drawn around the bounding box, or (2) directly use the bounding box. The procedure is given in Algorithm 1 and follows the one used in Paint-by-Example. The region covered by the coarse mask is then removed from the original image. The masked image and the coarse mask are concatenated and fed into a UNet, whose expected output is an inpainted image and a separate color mask. Under this setting, we train the model for 48 epochs. During inference, we sample at most three objects and their coarse masks in the same way, then run DDIM denoising for 200 steps starting from randomly initialized noise.
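As a rough illustration of the 200-step DDIM setting, the sketch below picks a uniformly spaced subsequence of timesteps. It assumes 1000 training timesteps (the usual Stable Diffusion setting) and uniform spacing; the helper name `ddim_timesteps` is hypothetical and is not taken from the released code.

```python
def ddim_timesteps(num_train_steps=1000, num_ddim_steps=200):
    """Uniformly spaced DDIM timesteps, ordered from noisiest to cleanest.

    Assumes num_train_steps is divisible by num_ddim_steps.
    """
    stride = num_train_steps // num_ddim_steps
    steps = list(range(0, num_train_steps, stride))
    return steps[::-1]


timesteps = ddim_timesteps()
# Denoising would start from random noise at timesteps[0] and visit each
# entry once, so the sampler performs len(timesteps) network evaluations.
```

With these assumed values the sampler visits every fifth training timestep.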

Table 10 compares the number of parameters and the inference speed with the original Stable Diffusion inpainting model. We only add a few channels to the input and output, keeping the parameter count almost identical to that of Stable Diffusion. The inference speed also remains nearly the same, so the better output results come at almost no additional computational cost.
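A back-of-the-envelope check shows why a few extra channels barely change the parameter count. The sketch assumes the Stable Diffusion v1.5 inpainting UNet layout (9 input channels, 4 output channels, 320 base channels, 3x3 input and output convolutions) and a hypothetical 4 extra channels on each side; the exact number of channels added in UniGS is not stated here.

```python
def conv2d_params(in_ch, out_ch, kernel=3, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


BASE_WIDTH = 320      # first/last feature width of the SD v1.5 UNet
EXTRA_CHANNELS = 4    # hypothetical number of added colormap channels

# Growth of the input convolution (more input channels).
added_in = conv2d_params(9 + EXTRA_CHANNELS, BASE_WIDTH) - conv2d_params(9, BASE_WIDTH)
# Growth of the output convolution (more output channels).
added_out = conv2d_params(BASE_WIDTH, 4 + EXTRA_CHANNELS) - conv2d_params(BASE_WIDTH, 4)

added_total = added_in + added_out
added_millions = added_total / 1e6
```

Under these assumptions the increase is on the order of 0.02 M parameters, the same order as the gap between 859.54 M and 859.56 M in Table 10.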

Image Synthesis

For image synthesis, we use a colormap as the conditional input instead of an image, and the coarse mask is filled entirely with ones, indicating that the area of interest is the entire image. We train for 48 epochs on the COCO and Open-Images datasets, with the maximum number of sampled entities set to four. Inference is consistent with training, except that at most three entities are sampled. The input consists of a text condition, a colormap, and an all-one coarse mask, from which the model produces the synthesized image.

Method | Parameters (M) | Speed (s/image)
Stable Diffusion | 859.54 | 14.48
UniGS (Ours) | 859.56 | 14.40
Table 10: Comparison of parameters and inference speed between Stable Diffusion and our UniGS. The inference speed is measured in seconds per image. We use the DDIM sampling strategy for both methods and, for a fair comparison, no acceleration techniques. The acceleration techniques used with Stable Diffusion also work with our UniGS.
Algorithm 1 Pseudocode (Python-like) of the Coarse Mask Sampling Method: Curve and Bounding Box
import copy
import random

import bezier
import numpy as np
from PIL import Image, ImageDraw

# Assumed to be in scope, as in Paint-by-Example: W, H (image size), bbox and
# extended_bbox as [x0, y0, x1, y1], get_tensor, and self.arbitrary_mask_percent.
prob = random.uniform(0, 1)
# Choose between a curve-based (arbitrary) mask and a bounding-box mask.
if prob < self.arbitrary_mask_percent:
    mask_img = Image.new('RGB', (W, H), (255, 255, 255))
    bbox_mask = copy.copy(bbox)
    extended_bbox_mask = copy.copy(extended_bbox)
    # Control points of one quadratic Bezier curve per side of the box; the
    # middle control point is pulled out to the extended bounding box.
    top_nodes = np.asfortranarray([
        [bbox_mask[0], (bbox_mask[0] + bbox_mask[2]) / 2, bbox_mask[2]],
        [bbox_mask[1], extended_bbox_mask[1], bbox_mask[1]],
    ])
    down_nodes = np.asfortranarray([
        [bbox_mask[2], (bbox_mask[0] + bbox_mask[2]) / 2, bbox_mask[0]],
        [bbox_mask[3], extended_bbox_mask[3], bbox_mask[3]],
    ])
    left_nodes = np.asfortranarray([
        [bbox_mask[0], extended_bbox_mask[0], bbox_mask[0]],
        [bbox_mask[3], (bbox_mask[1] + bbox_mask[3]) / 2, bbox_mask[1]],
    ])
    right_nodes = np.asfortranarray([
        [bbox_mask[2], extended_bbox_mask[2], bbox_mask[2]],
        [bbox_mask[1], (bbox_mask[1] + bbox_mask[3]) / 2, bbox_mask[3]],
    ])
    top_curve = bezier.Curve(top_nodes, degree=2)
    right_curve = bezier.Curve(right_nodes, degree=2)
    down_curve = bezier.Curve(down_nodes, degree=2)
    left_curve = bezier.Curve(left_nodes, degree=2)
    # Traverse the sides clockwise so that the points form a closed polygon.
    curve_list = [top_curve, right_curve, down_curve, left_curve]
    pt_list = []
    random_width = 5
    for curve in curve_list:
        x_list = []
        y_list = []
        for i in range(1, 19):
            # Sample the curve once per step, then jitter the point.
            point = curve.evaluate(i * 0.05)
            x, y = point[0][0], point[1][0]
            if x not in x_list and y not in y_list:
                pt_list.append((x + random.randint(-random_width, random_width),
                                y + random.randint(-random_width, random_width)))
                x_list.append(x)
                y_list.append(y)
    # Black polygon on a white canvas: 0 inside the coarse mask, 1 outside.
    mask_img_draw = ImageDraw.Draw(mask_img)
    mask_img_draw.polygon(pt_list, fill=(0, 0, 0))
    mask_tensor = get_tensor(normalize=False, toTensor=True)(mask_img)[0].unsqueeze(0)
else:
    # Bounding-box mask: fill the extended box, then invert so that the
    # masked region is 0 and the rest of the image is 1.
    mask_img = np.zeros((H, W))
    mask_img[extended_bbox[1]:extended_bbox[3], extended_bbox[0]:extended_bbox[2]] = 1
    mask_img = Image.fromarray(mask_img)
    mask_tensor = 1 - get_tensor(normalize=False, toTensor=True)(mask_img)

Referring segmentation

For referring segmentation, the coarse mask is sampled in the same way as for inpainting, but the coarse mask region is not removed from the original image. Instead, the original image and the coarse mask are concatenated and jointly input into the model. As before, the model is trained on the COCO and Open-Images datasets for 48 epochs. During inference, the original image, coarse mask, and text prompt are given as input. Post-processing is then performed on the output colormap to obtain the final segmentation mask.

Entity segmentation

For entity segmentation, unprocessed original images are input along with an all-one coarse mask, indicating that the entire image area is subject to segmentation. During training, we no longer sample entities but directly use all entities from the COCO and EntitySeg datasets, encoding them into a colormap as input for the framework. In inference, the output colormap undergoes post-processing, but the background is not removed. Instead, all clusters are retained as individual entities.

One-to-Many Joint Model

We train a single model for the four tasks on the COCO, Open-Images, and EntitySeg datasets. Unlike single-task training, we add a task embedding for multi-task training. The sampling ratios of the four tasks, namely multi-class multi-region inpainting, image synthesis, referring segmentation, and entity segmentation, are 0.3, 0.3, 0.2, and 0.2, respectively. We train this multi-task model for 96 epochs. The resulting model performs comparably to each single-task model, indicating the potential of our unified representation for image generation and segmentation.
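The task mixing described above can be sketched as weighted sampling per training example. This is a minimal illustration only: the task names, the `sample_task` helper, and the use of the list index as a task-embedding id are assumptions, not details from the paper.

```python
import random

# Task names are illustrative labels; the ratios are those stated in the text.
TASKS = ["inpainting", "image_synthesis", "referring_segmentation", "entity_segmentation"]
RATIOS = [0.3, 0.3, 0.2, 0.2]


def sample_task(rng):
    """Draw one task and its (hypothetical) task-embedding index."""
    task = rng.choices(TASKS, weights=RATIOS, k=1)[0]
    return task, TASKS.index(task)


rng = random.Random(0)
counts = {task: 0 for task in TASKS}
for _ in range(10000):
    task, task_id = sample_task(rng)
    counts[task] += 1
```

Over many draws, the empirical frequencies in `counts` approach the stated ratios.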

7 More Visualization Results

Progressive Dichotomy Module.

Figure 7 shows an example of how our progressive dichotomy module decodes the generated colormap into several explicit entity masks. The decoding proceeds in a depth-first manner and does not require the number of clusters to be specified in advance.

Figure 7: An example illustration of our progressive dichotomy module at each clustering iteration. The red color indicates the average distance to its cluster center for all the pixels in the cluster.
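The idea of decoding a colormap by repeated binary splits can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a plain 2-means split, a fixed threshold on the average distance to the cluster center as the stopping rule, and hypothetical function names.

```python
import math


def mean_color(colors, idx):
    """Mean RGB of the pixels selected by idx."""
    n = len(idx)
    return tuple(sum(colors[i][c] for i in idx) / n for c in range(3))


def avg_distance(colors, idx):
    """Average Euclidean distance of the selected pixels to their mean color."""
    center = mean_color(colors, idx)
    return sum(math.dist(colors[i], center) for i in idx) / len(idx)


def split_in_two(colors, idx, iters=10):
    """Plain 2-means split, seeded with two mutually distant pixels."""
    center = mean_color(colors, idx)
    a = colors[max(idx, key=lambda i: math.dist(colors[i], center))]
    b = colors[max(idx, key=lambda i: math.dist(colors[i], a))]
    left, right = list(idx), []
    for _ in range(iters):
        left = [i for i in idx if math.dist(colors[i], a) <= math.dist(colors[i], b)]
        right = [i for i in idx if math.dist(colors[i], a) > math.dist(colors[i], b)]
        if not left or not right:
            break
        a, b = mean_color(colors, left), mean_color(colors, right)
    return left, right


def progressive_dichotomy(colors, idx=None, threshold=10.0):
    """Depth-first binary splitting; returns a list of pixel-index clusters."""
    if idx is None:
        idx = list(range(len(colors)))
    # A cluster is accepted as one entity once its pixels are tightly grouped.
    if avg_distance(colors, idx) <= threshold:
        return [idx]
    left, right = split_in_two(colors, idx)
    if not left or not right:
        return [idx]
    return (progressive_dichotomy(colors, left, threshold)
            + progressive_dichotomy(colors, right, threshold))


# Flattened toy colormap: three noisy colors, cluster count not given.
toy_colors = [(250, 0, 0), (255, 5, 0), (252, 2, 3),
              (0, 250, 0), (3, 255, 2),
              (0, 0, 250), (2, 3, 255), (0, 1, 252)]
clusters = progressive_dichotomy(toy_colors)
```

Each returned cluster is a list of pixel indices that can be turned into a binary mask; the number of clusters follows from the threshold rather than being set beforehand.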

Visualization Results

Figures 8, 9, and 10 show more visualization results of multi-class multi-region inpainting and image synthesis with our UniGS framework. Figure 11 shows the entity segmentation results of our UniGS.

Figure 8: More visualization results of our UniGS in multi-class multi-region inpainting.
Figure 9: More visualization results of our UniGS in the real world. We generate the coarse masks used in inpainting by brush strokes from Gradio. We identified some interesting observations, particularly the appearance of shadows from the third to sixth columns.
Figure 10: More visualization results of our UniGS in image synthesis.
Figure 11: More visualization results of our UniGS in Entity Segmentation.