DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition

Parul Gupta, Monash University
Tuan Nguyen, Monash University
Abhinav Dhall, Indian Institute of Technology Ropar
Munawar Hayat, Monash University
Trung Le, Monash University
Thanh-Toan Do, Monash University

Abstract

The task of Visual Relationship Recognition (VRR) aims to identify relationships between two interacting objects in an image and is particularly challenging due to the widely spread and highly imbalanced distribution of $<$ subject, relation, object $>$ triplets. To overcome the resultant performance bias in existing VRR approaches, we introduce DiffAugment, a method which first augments the tail classes in the linguistic space by making use of WordNet and then utilizes the generative prowess of Diffusion Models to expand the visual space for minority classes. We propose a novel hardness-aware component in diffusion which is based upon the hardness of each $<$ S,R,O $>$ triplet and demonstrate the effectiveness of hardness-aware diffusion in generating visual embeddings for the tail classes. We also propose a novel subject and object based seeding strategy for diffusion sampling which improves the discriminative capability of the generated visual embeddings. Extensive experimentation on the GQA-LT dataset shows favorable gains in the subject/object and relation average per-class accuracy using Diffusion augmented samples. Our code will be released.

1 Introduction

The Long-tailed Visual Relationship Recognition (LTVRR) task aims at understanding the relationships among interacting objects in an image when both the objects and the relationships follow a long-tailed distribution. Since Diffusion Models (DM) [7] have proven to be quite effective in modelling the large-scale, complex visual-textual space, which is naturally long-tailed [16], this work uses the generative capabilities of Diffusion Models to overcome the performance bias that the long tail causes in Visual Relationship Recognition.

The goal of the VRR task is to recognize the categories of two interacting objects and their relation, e.g. recognizing triplets like $<$ bananas, hang from, ceiling $>$ . Due to the enriched scene understanding provided by VRR, it benefits various other vision tasks such as image captioning, Visual Question-Answering and 3D scene synthesis. However, due to the imbalanced class distribution in many VRR datasets, the predictions of most existing models are dominated by the head/frequent relations and generalize poorly to tail/low-shot relationships. In the LTVRR setup, which was first introduced in [2], subjects, objects and relations follow a long-tailed distribution. The authors of [2] introduced two benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome [12] and GQA [9] datasets respectively. These benchmarks are particularly challenging because not only can the combination (S, R, O) be rare, but so can any of the interacting subjects/objects (S/O) and/or the relation (R). Moreover, they have extremely high imbalance ratios for both objects and relationships: for GQA-LT, the ratio of the number of samples in the most populated to the least populated object class is 300,000+, and for relation classes the ratio is around 1.7 million. Similarly, for VG8K-LT, the imbalance ratio is 14,000 for object classes and 34,000 for relation classes.

A general VRR approach consists of three steps: (1) Extracting general visual features of cropped image regions of the subject, object and relation using pretrained object detection models such as Faster-RCNN [24]. These features can be considered as the standard input to any VRR approach. (2) Refining the visual features by message-passing among objects/relations. (3) Predicting the categories of subject, object and corresponding relation using the refined embeddings. The standard strategies to tackle long-tailed imbalance are based upon data re-sampling [28], weight adjustment based loss functions such as weighted cross-entropy, focal loss [15] and equalization loss [31], or decoupling [10] the representation learning from classifier learning. Another direction for alleviating imbalance involves augmentation of the tail classes, wherein features learnt from the abundantly populated head classes are transferred to the underrepresented tail classes [16]. Building upon these approaches, for Visual Relationship Recognition in the long-tail setting, [2] introduces a Visio-Lingual Hubless (VilHub) loss and RelMix augmentation. Both RelMix and the VilHub loss can be integrated with the class-balancing loss functions and decoupling to produce good results.

Complementary to these strategies, we propose DiffAugment: an approach which utilizes Diffusion Models [7] to augment the tail classes and overcome the imbalance. Unlike RelMix augmentation, which operates on the refined visual embeddings, DiffAugment works at the VRR input level. To do so, we first augment the triplets containing objects/relations from tail classes in the training data. For this, we take the triplets containing tail objects/relations from the training data and replace their subjects/objects with similar object classes from the dataset. We make use of WordNet [18] based LCH synset similarity [1] to get the similar object classes. In parallel, we also train a Diffusion Model that generates the visual features of step 1 (general VRR input), conditioned upon the CLIP [23] textual embedding of the triplets. Next, we sample the visual features corresponding to the augmented triplets from this trained DM, which in turn can be used to fine-tune any existing LTVRR model. In order to improve the quality of the generated visual features, we further propose two enhancements in the Diffusion Model: (1) Motivated by [27], rather than starting from a purely random Gaussian distribution (seed) while sampling, we extract the visual features of the augmented triplet's subject and object from the training data to obtain a better seed. (2) We model the hardness of a triplet and use it as an extra condition while training the Diffusion Model. This hardness can be interpreted as a constraint on the region in which the generated visual embedding of a triplet can occur, thereby improving the discriminative capability of the generated visual features. Fine-tuning using our generated visual features shows consistent improvement in the per-class accuracy for different LTVRR approaches on the GQA-LT dataset.
Thus our contributions in this work can be summarized as:

• To the best of our knowledge, this work is the first attempt to employ Diffusion Models in the domain of Visual Relationship Recognition.

• Our approach successfully models and generates the visual embeddings of captions (triplets) involving tail classes and uses them to overcome the bias in the training of existing Visual Relationship Recognition approaches. Hence, it acts as a data augmentation strategy which can be used on top of any existing Visual Relationship Recognition algorithm to improve the classification performance on the tail classes.

• We introduce two novel components: a Subject/Object based seeding strategy, and hardness-aware Diffusion wherein we define the hardness of each triplet and add it as a condition to the Diffusion Model. Both of these components improve the discriminative capability of the DM generated samples.

2 Related Work

Long-tailed Visual Relationship Recognition There are several works pertaining to Visual Relationship Recognition in long-tailed data distribution. The first work in this direction, [2], introduced a novel augmentation strategy called RelMix and a visio-lingual hubless (VilHub) loss to adapt a base approach for Visual Relationship Recognition (LSVRU [37]) to the long-tail scenario. RelMix is inspired by Manifold Mixup [34] and combines the visual embeddings of different triplets belonging to tail/medium classes along with their expected predictions to augment the training data. VilHub loss encourages the average probability of each class being predicted in a batch to be close to uniform, i.e. equally preferred across head and tail. While our approach is essentially an augmentation strategy too, it employs a Diffusion Model to generate the foundation Faster-RCNN [24] based visual features rather than the approach-specific refined features, and it can be used on top of any model trained using RelMix and VilHub. RelMix is an end-to-end augmentation strategy and needs a VRR model to be trained from scratch, whereas our diffusion based augmentation needs only fine-tuning of already trained VRR models.
Another state-of-the-art work for LTVRR titled RelTransformer [5] uses two Transformer [33] based encoders - one to represent the global scene context and another for the relation. The global scene context encoder guides the relation encoder through meshed attention. Further, it also uses a novel memory attention module to allow the relation representation information to be shared across the entire dataset to alleviate the frequency bias. ViTSCG [35] uses Vision Transformer [6] and Masking with overlapped area (MOA) module [19] to first extract the subject and object features. Now, to obtain the relation features, it uses spatially conditioned graph [36] which is a graph neural network designed to jointly reason about the appearance and spatial information of an image. Finally, the subject, object and relation features are refined using the RelTransformer module [5] as explained earlier and passed on to the MLP classifiers for prediction. Our DiffAugment strategy can be utilised for both these approaches by augmenting the features that are input to the RelTransformer module.
Diffusion Models (DM) [7] are a recently introduced class of generative models which have shown promising results in numerous generative applications including image super-resolution [26], image editing [4, 17], speech synthesis [11, 13, 8], voice conversion [21] and text-to-speech [20]. The basic DM architecture has been improved in various ways recently, with attempts to improve the diversity and quality of the generated images while reducing the training and sampling time [38, 39, 14, 32]. Some recent works such as the Class-Balancing Diffusion Model (CBDM) [22] also venture towards maintaining the sample quality and diversity when training Diffusion Models on imbalanced datasets. In this paper, we make use of the simplest version of Diffusion Models, called Latent Diffusion Models, with Cross-Attention based conditioning [25]. Improving LTVRR with more advanced Diffusion Models is left for future work.

3 Method

Figure 1: For all stages: X) Given a triplet $<$ S,R,O $>$ , its linguistic embedding is obtained using the CLIP [23] text encoder. Y) For any bounding box region from an image, its visual embedding is obtained from the VGG-16 [29] architecture based Faster-RCNN [24] backbone. Z) A K-means cluster model is trained on the CLIP linguistic embeddings of all the original triplets. Using this model, the Hardness based conditional of any triplet is the normalised vector of distances between the K-means cluster centres and the triplet's CLIP language embedding. DiffAugment Pipeline: A) In stage 1, we augment the triplets containing s/r/o from tail classes using WordNet [18] based LCH Synset similarity [1]. B) In stage 2, the Diffusion Model is trained in the visual embedding space, conditioned upon the linguistic embedding and optionally concatenated with the hardness conditional. C) The trained Diffusion Model is used to sample visual embeddings corresponding to the augmented triplets of stage 1, optionally using a subject and object based starting seed. D) The augmented visual features can be used to fine-tune any pre-trained Visual Relationship Detection model, such as LSVRU [37] or RelTransformer [5].

3.1 Problem Definition

In the Visual Relationship Recognition task, each image $I$ is represented as a scene graph $G=(N,E)$ where each node $n_{i}\in N$ represents an object in the image and each edge $e_{i}\in E$ represents the spatial or semantic relation between two interacting objects. The visual relationship between a subject $s$ and an object $o$ is denoted by $r$ . The goal is to predict the labels $y_{r}$ , $y_{s}$ and $y_{o}$ (corresponding to the relation, subject and object respectively), given the image $I$ and the bounding boxes $b_{s}$ and $b_{o}$ corresponding to the subject and object respectively.

$$y_{s},y_{r},y_{o}=f(b_{s},b_{o},b_{r},I)$$

The bounding box corresponding to the relation $b_{r}$ is obtained by the minimum enclosing region of $b_{s}$ and $b_{o}$ . $I$ is the raw RGB pixels of the image and $f$ denotes the inference function.
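As a minimal sketch of how $b_{r}$ follows from $b_{s}$ and $b_{o}$, the snippet below computes the minimum enclosing axis-aligned box. The `(x_min, y_min, x_max, y_max)` box convention and the function name `enclosing_box` are assumptions made for illustration; the paper does not specify a coordinate format.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def enclosing_box(b_s: Box, b_o: Box) -> Box:
    """Smallest axis-aligned box containing both the subject and object boxes."""
    return (
        min(b_s[0], b_o[0]),
        min(b_s[1], b_o[1]),
        max(b_s[2], b_o[2]),
        max(b_s[3], b_o[3]),
    )


b_s = (10.0, 20.0, 50.0, 80.0)  # hypothetical subject box
b_o = (40.0, 10.0, 90.0, 60.0)  # hypothetical object box
b_r = enclosing_box(b_s, b_o)
print(b_r)  # (10.0, 10.0, 90.0, 80.0)
```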

3.2 Diffusion-based Augmentation – DiffAugment

Our base strategy has four stages, which are described here. Figure 1(A, B, C, D) shows the overall pipeline.

1. Augmenting the Triplets.

(Shown in Figure 1(A)) For each original triplet involving a tail relation, we replace the subject/object with similar classes from the training data. Likewise, for each original triplet involving a tail subject/object, we replace the corresponding object/subject respectively with similar classes from the training data. In order to obtain the similar classes, we use the WordNet [18] based Leacock-Chodorow (LCH) [1] Synset similarity metric, which ensures that the augmented triplets are plausible examples. For example, in Figure 1(A), the triplet $<$ spoons, to the right of, cake $>$ involves the tail class spoons as the subject. Therefore we replace the object, i.e. cake, with other object classes from the dataset that are similar to cake, such as waffles, cookies, biscuits and brownies.

2. Diffusion Model Training.

(Shown in Figure 1(B)) We choose the VGG-16 [29] backbone based Faster-RCNN [24] visual features (also used in LSVRU [37], RelTransformer [5]) as the space for Diffusion (also shown in Figure 1(Y)). First, we extract the Faster-RCNN features of the bounding boxes corresponding to the relations (i.e. $b_{r}$ ) for all the triplets in the training data. Then, we use these visual features as the input and target for the forward and backward processes of Diffusion. The conditioning of the Diffusion Model is done upon the CLIP textual encoding of the triplets (denoted by (X) in Figure 1). To condition the Diffusion Model on the textual embeddings, cross-attention mechanism [25] is applied. As an enhancement, we also use a hardness based conditional for the Diffusion Model, which is explained in Section 3.4.

3. Diffusion Model Sampling.

(Shown in Figure 1(C)) The trained Diffusion Model is used to sample the Faster-RCNN [24] based visual features for the triplets augmented in the first stage. For example, in Figure 1(C), the CLIP embedding of the augmented triplet $<$ spoons, to the right of, cookies $>$ is used as a conditional input to sample the Faster-RCNN visual embedding corresponding to it. Instead of using a random Gaussian seed for sampling, we can also use a seed obtained using the augmented triplet’s subject and object, as explained in Section 3.3.

4. VRD Approach Fine-tuning.

(Shown in Figure 1(D)) The generated visual features can be used to fine-tune any existing, pre-trained Visual Relationship detection model since most of the existing VRR approaches use Faster-RCNN [24] based features as an input. We perform fine-tuning on two such VRD methods - LSVRU and RelTransformer and observe an improvement in the average per class accuracy for both.
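Stage 1 (triplet augmentation) can be illustrated with a small sketch. It uses the usual form of the Leacock-Chodorow score, $-\log(p/(2D))$, where $p$ is the number of nodes on the shortest path between two classes and $D$ is the taxonomy depth. To stay self-contained it runs on a hand-made toy taxonomy rather than WordNet, so the taxonomy, the class names, the helper names and the threshold of 0.9 are all assumptions; scores on the real WordNet hierarchy are on a different scale (the paper uses a threshold of 2.26 there).

```python
import math
from typing import Dict, List, Tuple

# Toy hypernym taxonomy (child -> parent). Illustrative only, not WordNet.
PARENT: Dict[str, str] = {
    "food": "entity",
    "baked_goods": "food",
    "cake": "baked_goods",
    "cookie": "baked_goods",
    "waffle": "baked_goods",
    "fruit": "food",
    "banana": "fruit",
    "artifact": "entity",
    "vehicle": "artifact",
    "car": "vehicle",
}


def ancestors(node: str) -> List[str]:
    """Chain of nodes from `node` up to the root, inclusive."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# Taxonomy depth D: number of nodes on the longest root-to-leaf chain.
DEPTH = max(len(ancestors(n)) for n in PARENT)


def path_nodes(a: str, b: str) -> int:
    """Number of nodes on the shortest path between a and b in the tree."""
    up_b = {n: i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if n in up_b:
            return i + up_b[n] + 1
    raise ValueError("no common ancestor")


def lch_similarity(a: str, b: str) -> float:
    """Leacock-Chodorow score: -log(path length / (2 * depth))."""
    return -math.log(path_nodes(a, b) / (2.0 * DEPTH))


def augment_object(triplet: Tuple[str, str, str], vocabulary: List[str],
                   threshold: float) -> List[Tuple[str, str, str]]:
    """Replace the object of a tail-subject triplet with similar classes."""
    s, r, o = triplet
    return [(s, r, c) for c in vocabulary
            if c != o and lch_similarity(o, c) >= threshold]


vocab = ["cookie", "waffle", "banana", "car"]
print(augment_object(("spoons", "to the right of", "cake"), vocab, threshold=0.9))
```

With this toy taxonomy only `cookie` and `waffle` pass the threshold, mirroring the cake example in Figure 1(A). With a real WordNet interface the same filtering logic would apply, with synsets in place of the toy nodes.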

3.3 Enhancement-1: Subject-Object based seed

Many image-editing based applications of Diffusion Models make use of noisy versions of the original images when trying to generate the modified image [3]. Inspired by this idea, while generating the visual features of any augmented triplet, we propose to make use of the visual features of its subject and object. To do so, for each augmented triplet, we choose random bounding boxes corresponding to its subject and object from the training data and take an average of their Faster-RCNN [24] features. Then we add Gaussian noise to it in the same manner as the forward diffusion process and use the result as the starting seed while sampling from the trained Diffusion Model. This is shown in Figure 1(C), where bounding boxes corresponding to the subject spoons and object cookies are taken randomly from the training dataset, following which their Faster-RCNN features are combined to produce the Subject-Object based seed. As observed in Section 4, samples generated in this manner have a better discriminative quality.
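The seeding step can be sketched as follows, assuming the standard closed-form forward process of [7], $x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$. The linear noise schedule, its end points, the choice of noising all the way to the final step, and the function names are assumptions for illustration; the text above only states that noise is added in the same manner as the forward process.

```python
import math
import random
from typing import List, Sequence


def alpha_bar(t: int, num_steps: int,
              beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_i) over the first t steps.

    Assumes a linear beta schedule (requires num_steps > 1); the schedule
    actually used for DiffAugment is not stated in the text.
    """
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def subject_object_seed(subj_feat: Sequence[float], obj_feat: Sequence[float],
                        t: int, num_steps: int, rng: random.Random) -> List[float]:
    """Average the subject and object features, then noise them to step t."""
    a_bar = alpha_bar(t, num_steps)
    mean_feat = [(s + o) / 2.0 for s, o in zip(subj_feat, obj_feat)]
    return [math.sqrt(a_bar) * m + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for m in mean_feat]


rng = random.Random(0)
dim = 4096  # Faster-RCNN feature size reported in the implementation details
subj = [rng.random() for _ in range(dim)]  # stand-in for a subject box feature
obj = [rng.random() for _ in range(dim)]   # stand-in for an object box feature
seed = subject_object_seed(subj, obj, t=1000, num_steps=1000, rng=rng)
print(len(seed))  # 4096
```

At large `t` the seed is close to pure Gaussian noise, while a smaller `t` keeps more of the subject and object information; which step is used is a design choice that this sketch leaves as a parameter.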

3.4 Enhancement-2: Hardness aware Diffusion

As the vocabulary of the data on which the CLIP [23] textual encoder is trained is much larger than the limited number of classes in the GQA-LT dataset (1703 objects and 310 relations), the linguistic embeddings used to condition the Diffusion Model are not as discriminative as desirable. Therefore, we propose to generate an additional, hardness based conditional for the Diffusion Model. For this, we take the CLIP [23] linguistic embeddings of all the triplets in the training dataset and perform K-means clustering. To obtain the hardness of any triplet, we calculate the distance of its CLIP [23] textual embedding from all the cluster centres of the K-means model and normalize the resulting vector of distances. This process is displayed in Figure 1(Z). Just as the memory augmentation vectors in RelTransformer [5] encode the global relation information for the whole dataset, the cluster centres of the CLIP textual embeddings act as a compressed representation of the entire dataset. Intuitively, classes that are similar to each other are expected to have similar visual features. Therefore, the cluster centres in the CLIP textual space can be assumed to correspond to the cluster centres in the Faster-RCNN visual space. So we expect that the generated visual embedding of any triplet should be closer to the visual embeddings of those triplets whose linguistic embeddings are similar to that of the current triplet. To encourage that, we use the hardness vector as a condition in the Diffusion Model, in addition to the CLIP textual embedding. Specifically, our hardness vector encodes the distances of a triplet with respect to the textual cluster centres. Hence, at sampling time (Figure 1(C)), this vector explicitly encourages the Diffusion Model to generate its visual embedding at a similar relative distance from the corresponding visual cluster centres. Ablations in Section 4 demonstrate the effectiveness of introducing this hardness based conditional to the Diffusion Model.
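A minimal sketch of the hardness vector, given cluster centres that have already been fitted, is shown below. The Euclidean distance and the uniform fallback for an all-zero distance vector are assumptions; the text specifies only that distances to the K-means centres are computed and normalized (L1 normalization, per the implementation details). The tiny 2-D vectors stand in for 768-D CLIP embeddings.

```python
import math
from typing import List, Sequence


def hardness_vector(text_emb: Sequence[float],
                    centres: Sequence[Sequence[float]]) -> List[float]:
    """L1-normalized distances from one text embedding to every cluster centre."""
    # Euclidean distance is an assumption; the metric is not stated in the text.
    dists = [math.dist(text_emb, c) for c in centres]
    total = sum(dists)
    if total == 0.0:
        # Degenerate case (embedding coincides with every centre): fall back
        # to a uniform vector so the output still sums to 1.
        return [1.0 / len(dists)] * len(dists)
    return [d / total for d in dists]


# Hypothetical 2-D embedding and three centres, for illustration only.
emb = (0.0, 0.0)
centres = [(3.0, 4.0), (0.0, 5.0), (6.0, 8.0)]
h = hardness_vector(emb, centres)
print(h, sum(h))
```

With 1200 centres, as in the implementation details, the same function yields the 1200-D vector that is used alongside the CLIP embedding as the Diffusion Model condition.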

4 Experiments

4.1 Experimental settings

In this work, we perform experiments on the GQA-LT dataset built on top of Visual Genome by [2]. It has $72,580$ training images, $2,573$ validation images and $7,722$ test images, with $1,703$ objects and $310$ relations. The most frequent object and relation have $374,282$ and $1,692,068$ examples, and the least frequent have $1$ and $2$ respectively. Following the strategy used by [2], the dataset is split into three parts (Many, Medium and Few), and the selection ratio of each split is based on the frequency of each class: Many (top $5\%$ - 86 classes for S/O and 16 classes for R), Medium (middle $15\%$ - 255 classes for S/O and 46 classes for R), Few (remaining $80\%$ - 1362 classes for S/O and 248 classes for R).
Baseline Models We evaluate the effect of our diffusion based augmentation on two state-of-the-art Visual Relationship Recognition approaches: LSVRU [37] and RelTransformer [5]. For both baselines, we also consider models trained using class-balancing loss functions such as weighted cross-entropy loss (WCE) for fine-tuning. Since trained model checkpoints are not publicly available for either method, we train the models from scratch using their official implementations (https://github.com/Vision-CAIR/RelTransformer, https://github.com/Vision-CAIR/LTVRR/tree/ltvrd-challenge-2023) to get baseline results, and further fine-tune them on our generated samples. We are unable to reproduce the performance reported for RelTransformer with WCE loss using the official implementation; hence we report the fine-tuning results based upon the performance that we obtain in Table 1.
Evaluation metrics The main metric used is the average per-class accuracy, which is the accuracy of each class calculated separately, then averaged.
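The metric can be sketched as below, together with the combined score that the table captions define as the average of the overall subject/object accuracy and the overall relation accuracy. The function names and the toy labels are illustrative, and skipping classes that have no ground-truth samples is an assumption.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def average_per_class_accuracy(y_true: Sequence[Hashable],
                               y_pred: Sequence[Hashable]) -> float:
    """Accuracy computed separately for each ground-truth class, then averaged."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # Only classes that occur in y_true contribute (assumption).
    return sum(correct[c] / total[c] for c in total) / len(total)


def combined_accuracy(so_all: float, rel_all: float) -> float:
    """Average of the overall subject/object and relation accuracies."""
    return (so_all + rel_all) / 2.0


# A head class ("on") dominates the plain accuracy but not the per-class average.
y_true = ["on"] * 8 + ["hang from"] * 2
y_pred = ["on"] * 8 + ["on", "hang from"]
print(average_per_class_accuracy(y_true, y_pred))  # 0.75
```

In this toy example the plain accuracy is 0.9, whereas the per-class average is 0.75 because the tail class is only half correct; this is why the metric suits the long-tailed setting.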
Implementation details For augmenting the triplets in stage 1 (Section 3.2), we consider the triplets containing either few or medium classes and augment with synsets having LCH similarity greater than or equal to 2.26 (empirical value). The CLIP text encoder produces a 768-D linguistic embedding while the Faster-RCNN output is a 4096-D visual embedding. In order to calculate the hardness of each triplet, K-means clustering is performed with 1200 cluster centres, thus giving a 1200-D hardness vector for each triplet after L1 normalization. A total of 48K augmented triplets from few classes and 48K augmented triplets from medium classes are used for the experiments. While the training of the baselines required 8 Nvidia V100 GPUs with a batch size of 8 and is done for 12 epochs, finetuning can be performed on a single GPU. We use a batch size of 256 for the augmented visual embeddings’ data loading and fine-tune for 10 epochs on the augmented data.

Learning Methods	Subject/Object				Relation				Combined
	many	medium	few	all	many	medium	few	all	all
	86	255	1362	1703	16	46	248	310	2013
Architecture: LSVRU [37]
CE	68.24	36.34	6.74	14.28	60.24	13.70	6.34	10.22	12.25
CE + DiffAugment	62.23	41.77	10.14	17.51	35.21	23.94	8.04	11.80	14.66
CE + VilHub + RelMix	68.80	44.20	10.26	18.30	63.86	12.00	6.83	10.55	14.42
CE + VilHub + RelMix + DiffAugment	62.66	44.08	12.79	19.99	48.94	19.20	7.89	11.69	15.84
WCE	54.83	43.32	12.25	19.05	52.75	35.17	13.03	18.37	18.71
WCE + DiffAugment	39.83	33.90	17.33	20.95	37.37	37.72	18.13	22.03	21.49
Architecture: RelTransformer [5]
CE	72.45	50.06	11.69	20.50	62.48	16.83	7.45	11.69	16.10
CE + DiffAugment	70.55	50.47	13.23	21.70	47.89	30.13	7.09	12.62	17.16
WCE	53.67	47.35	19.47	25.37	54.86	38.96	15.10	20.70	23.04
WCE + DiffAugment	39.16	38.04	22.68	25.81	45.19	40.19	18.01	22.71	24.26

Table 1: Average per-class accuracy on GQA-LT dataset. The baseline results (non-gray) are as per our reproduction using the official code released by the authors [5, 2]. The DiffAugment results include both the enhancements, i.e. using Subject-Object based seed and Hardness aware diffusion. The overall best results are in bold, while category-wise best results are underlined. Combined accuracy refers to the average of all subjects/objects accuracy and all relations accuracy. Diffusion based augmentation improves the average per class accuracy for both LSVRU and RelTransformer; with or without class-balancing weighted cross entropy loss, VilHub loss and RelMix augmentation.

Learning Method	Seed	Subject/Object				Relation				Combined
		many	medium	few	all	many	medium	few	all	all
		86	255	1362	1703	16	46	248	310	2013
LSVRU [37]
CE + DiffAugment	Random	62.12	42.42	10.18	17.63	34.67	23.73	6.04	10.14	13.88
CE + DiffAugment	S-O	62.23	41.77	10.14	17.51	35.21	23.94	8.04	11.80	14.66
RelTransformer [5]
WCE + DiffAugment	Random	39.49	38.02	22.92	26.02	46.79	39.59	17.27	22.10	24.06
WCE + DiffAugment	S-O	39.16	38.04	22.68	25.81	45.19	40.19	18.01	22.71	24.26

Table 2: Effect of using Subject-Object based (S-O) seed rather than Random Gaussian seed for diffusion sampling. Combined accuracy refers to the average of all subjects/objects accuracy and all relations accuracy. For all the experiments, Hardness aware diffusion has been used. The overall better results are in bold. Category-wise better results are underlined. Sampling using subject-object based seed gives better combined performance.

Learning Method	Hardness	Subject/Object				Relation				Combined
	Aware	many	medium	few	all	many	medium	few	all	all
	Diffusion	86	255	1362	1703	16	46	248	310	2013
LSVRU [37]
CE + DiffAugment	✗	63.18	41.96	9.58	17.14	36.05	23.52	6.13	10.26	13.70
CE + DiffAugment	✓	62.12	42.42	10.18	17.63	34.67	23.73	6.04	10.14	13.88
RelTransformer [5]
CE + DiffAugment	✗	70.14	51.28	14.27	22.64	51.56	25.91	6.94	12.06	17.35
CE + DiffAugment	✓	70.21	51.20	14.23	22.59	53.15	25.56	7.52	12.56	17.58

Table 3: Effect of using Hardness aware diffusion. Combined accuracy refers to the average of all subjects/objects accuracy and all relations accuracy. For all the experiments, Random Gaussian seed has been used while sampling. The overall best results are in bold. Category-wise best results are underlined. For both the VRR approaches, Hardness aware diffusion gives better combined performance.

Learning Method	Fine-tuning	Subject/Object				Relation				Combined
	strategy	many	medium	few	all	many	medium	few	all	all
		86	255	1362	1703	16	46	248	310	2013
LSVRU [37]
WCE + DiffAugment	Random	39.83	33.90	17.33	20.95	37.37	37.72	18.13	22.03	21.49
WCE + DiffAugment	Easy then Hard	37.34	31.43	17.74	20.78	38.39	37.56	18.64	22.47	21.62
RelTransformer [5]
CE + DiffAugment	Random	70.55	50.47	13.23	21.70	47.89	30.13	7.09	12.62	17.16
CE + DiffAugment	Easy then Hard	70.05	51.10	14.00	22.39	55.38	24.65	6.97	12.09	17.24

Table 4: Effect of using Curriculum based fine-tuning i.e. initially using easy samples and later followed by hard samples. Combined accuracy refers to the average of all subjects/objects accuracy and all relations accuracy. For all the experiments, Subject-Object based seed has been used while sampling. The overall best results are in bold. Category-wise best results are underlined. For both the settings, curriculum based fine-tuning gives better combined performance.

LSVRU[37] + CE loss : 12.25%
Sbj/Obj	Hardness	Curriculum	Combined
based	Aware	based	accuracy
seed	Diffusion	Fine-tuning
✗	✗	✗	13.7
✗	✓	✗	13.88
✗	✓	✓	13.99
✓	✗	✗	13.62
✓	✓	✗	14.66
✓	✓	✓	14.2

Table 5: Effect of the DiffAugment enhancements in different combinations on LSVRU [37] trained with Cross Entropy loss on the GQA-LT dataset. The best enhancement combination is highlighted in gray. Combined accuracy refers to the average of all subjects/objects accuracy and all relations accuracy.

LSVRU[37] + CE loss : 12.25%
#samples	Subject/Object	Relation	Combined
	all	all	all
	1703	310	2013
20K	17.16	10.72	13.94
60K	17.38	11.02	14.20
96K	17.51	11.80	14.66

Table 6: Effect of using different number of augmented samples to fine-tune LSVRU [37] trained with Cross-Entropy Loss on GQA-LT dataset. The best performance is highlighted in gray.

4.2 Quantitative Results

Table 1 presents the results of our approach, including both the enhancements, on the GQA-LT dataset. For LSVRU as well as RelTransformer, irrespective of the loss function used, fine-tuning using Diffusion generated samples improves the overall average per-class accuracy for both subjects/objects and relations. Further, Diffusion-augmented samples improve the performance of models trained using RelMix augmentation as well. As the table shows, when an LSVRU model trained with Cross Entropy and VilHub losses and RelMix augmentation is fine-tuned with DiffAugment, the overall combined performance increases by $1.42\%$ . Even though the many category performance drops, it is compensated by the considerable rise in the joint performance of the medium and few categories. The combined subject/object and relation performance increases by $\approx 2.5\%$ for LSVRU [37] and $\approx 1.1\%$ for RelTransformer [5].
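The gains quoted above can be checked against the Combined column of Table 1. The differences below are absolute changes in average per-class accuracy, with each DiffAugment row compared against the row it fine-tunes; the approximate figures of $2.5\%$ and $1.1\%$ are consistent with the CE and WCE rows.

```latex
\begin{align*}
\text{LSVRU, CE:} \quad & 14.66 - 12.25 = 2.41 \\
\text{LSVRU, CE + VilHub + RelMix:} \quad & 15.84 - 14.42 = 1.42 \\
\text{LSVRU, WCE:} \quad & 21.49 - 18.71 = 2.78 \\
\text{RelTransformer, CE:} \quad & 17.16 - 16.10 = 1.06 \\
\text{RelTransformer, WCE:} \quad & 24.26 - 23.04 = 1.22
\end{align*}
```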

4.3 Ablation Study

To validate the advantage of each of the enhancements, we perform extensive ablations and describe their takeaways as follows.
Effect of using Subject-Object based seed Table 2 shows the result of using a subject-object based seed rather than a random Gaussian seed while sampling from the Diffusion Model. Even though the overall subject/object performance decreases slightly, there is a consistent improvement in the overall relation performance for both the architectures (LSVRU [37] and RelTransformer [5]) irrespective of the loss function used (CE/WCE). The improvement in the relation accuracy outweighs the slight reduction in subject/object accuracy, as shown by the better combined performance. For LSVRU, the Subject-Object based seed improves the combined accuracy by $0.78\%$ and for RelTransformer, it increases the accuracy by $0.2\%$ .
Effect of Hardness aware diffusion Table 3 reports the change in accuracy across different categories on adding hardness as an extra condition to the Diffusion Model. Even though the category-wise performance for both relations and subjects/objects does not improve consistently, the combined average per-class accuracy across relations and objects improves for both LSVRU and RelTransformer.
Effect of Curriculum based fine-tuning Since we use hardness as a condition while generating the visual embeddings, and inspired by the principles of curriculum learning [30], we examine whether ordering the visual embeddings according to their hardness and using the easier samples first for fine-tuning the VRR model has any impact on the performance. To do so, we divide the triplets into easy and hard by calculating the entropy of the hardness vectors and choosing the median as the threshold. As observed in Table 4, there is no consistent rise/decline in category-wise accuracy for subjects/objects or relations, even though the overall performance combined across all relations and objects improves. However, as seen in Table 5, for LSVRU with CE loss, the optimal performance is achieved without curriculum based fine-tuning.
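The easy/hard split can be sketched as follows. Shannon entropy of the normalized hardness vector and the median threshold follow the description above. Treating lower-entropy triplets as the easy ones, and placing ties at the median in the easy group, are assumptions, since the text does not state the direction.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a non-negative vector that sums to 1."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def split_easy_hard(hardness_vectors: Sequence[Sequence[float]]
                    ) -> Tuple[List[int], List[int]]:
    """Indices of easy and hard triplets, split at the median entropy."""
    ents = [entropy(h) for h in hardness_vectors]
    threshold = statistics.median(ents)
    # Assumption: lower entropy counts as easy; ties go to the easy group.
    easy = [i for i, e in enumerate(ents) if e <= threshold]
    hard = [i for i, e in enumerate(ents) if e > threshold]
    return easy, hard


# Hypothetical 3-D hardness vectors (the paper uses 1200-D vectors).
vectors = [
    [0.90, 0.05, 0.05],
    [1 / 3, 1 / 3, 1 / 3],
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]
easy, hard = split_easy_hard(vectors)
print(easy, hard)  # [0, 3] [1, 2]
```

Fine-tuning would then iterate over the `easy` indices before the `hard` ones, which corresponds to the "Easy then Hard" rows of Table 4.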

Combined Effect of all the enhancements In order to understand the outcome of using all these enhancements together, we take the LSVRU [37] model trained using Cross-entropy loss as baseline and apply the Subject-Object seed based sampling, Hardness conditioned Diffusion and Curriculum based fine-tuning in all possible combinations. The overall relation + subject/object average per-class accuracy for each configuration is reported in Table 5. As the table shows, adding each of the enhancements generally improves the overall accuracy; however, the optimal strategy may involve using only some of the enhancements (like not using curriculum based fine-tuning in this case).
Effect of number of augmented samples used in fine-tuning To observe how the number of DiffAugment generated visual embeddings affects the average per-class accuracy after fine-tuning, we create subsets of $20,000$ and $60,000$ samples from the entire set of $96,000$ embeddings (sampled using Subject-Object based seed and with Hardness conditional DM). Then we fine-tune the LSVRU model using Cross-Entropy loss on each of the subsets. The average per-class accuracy for relations, subjects/objects and both combined is displayed in Table 6. From the table, it is evident that a larger number of augmented samples results in better classification performance upon fine-tuning. However, the required Diffusion sampling time also increases with the number of samples.

4.4 Qualitative Results

We examine how the predicted S/R/O changes after fine-tuning an LSVRU + CE [37] model using DiffAugment and show six such results in Figure 2. It can be observed that the fine-tuned model is able to predict more informative relation labels such as perched on instead of above, kicking rather than playing with, chained to instead of leaning on, skiing on rather than playing in, feeding instead of looking at and flying rather than holding. All of these informative relations are more towards the tail of dataset distribution as compared to the initial predictions. DiffAugment can also make the subject/object more specific, as in log changed to branch, ball changed to soccer ball and child changed to girl.

5 Conclusion

We present DiffAugment, a pioneering approach that utilizes generative Diffusion Models to overcome the class imbalance of Long-tailed Visual Relationship Recognition. It is a streamlined approach that augments the tail classes by first generating viable tail class triplets with the help of WordNet [18], and then generating visual embeddings corresponding to those triplets through Diffusion. We further define the hardness of each triplet, utilize it as a diffusion conditional, and introduce a new subject-object based seed for diffusion sampling, all of which improve the discriminative performance of VRR approaches upon fine-tuning, as shown in the experiments. The offline design makes this method suitable for use on top of any existing VRR model.

References

Leacock and Chodorow [1998] Claudia Leacock and Martin Chodorow. Combining Local Context and WordNet Similarity for Word Sense Identification. In WordNet: An Electronic Lexical Database. The MIT Press, 1998.
Abdelkarim et al. [2021] Sherif Abdelkarim, Aniket Agarwal, Panos Achlioptas, Jun Chen, Jiaji Huang, Boyang Li, Kenneth Church, and Mohamed Elhoseiny. Exploring long tail visual relationship recognition with large vocabulary, 2021.
Avrahami et al. [2023] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM Transactions on Graphics, 42(4):1–11, 2023.
Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
Chen et al. [2022] Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, and Mohamed Elhoseiny. Reltransformer: A transformer-based long-tail visual relationship recognition, 2022.
Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
Huang et al. [2022] Rongjie Huang, Max WY Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis. arXiv preprint arXiv:2204.09934, 2022.
Hudson and Manning [2019] Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019.
Kang et al. [2020] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition, 2020.
Kong et al. [2020] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv preprint arXiv:2009.09761, 2020.
Krishna et al. [2016] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
Lee et al. [2021] Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and Tie-Yan Liu. Priorgrad: Improving conditional denoising diffusion models with data-driven adaptive prior. arXiv preprint arXiv:2106.06406, 2021.
Li et al. [2023] Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models, 2023.
Lin et al. [2018] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection, 2018.
Liu et al. [2019] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world, 2019.
Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
Miller [1995] George A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, 1995.
Park et al. [2023] Jeeseung Park, Jin-Woo Park, and Jong-Seok Lee. Viplo: Vision transformer based pose-conditioned self-loop graph for human-object interaction detection, 2023.
Popov et al. [2021a] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR, 2021a.
Popov et al. [2021b] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, and Jiansheng Wei. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. arXiv preprint arXiv:2109.13821, 2021b.
Qin et al. [2023] Yiming Qin, Huangjie Zheng, Jiangchao Yao, Mingyuan Zhou, and Ya Zhang. Class-balancing diffusion models, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
Ren et al. [2016] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks, 2016.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
Saharia et al. [2021] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement, 2021.
Samuel et al. [2023] Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare concepts using pre-trained diffusion models, 2023.
Shen et al. [2016] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks, 2016.
Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
Soviany et al. [2022] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey, 2022.
Tan et al. [2020] Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss for long-tailed object recognition, 2020.
Tang et al. [2023] Zhicong Tang, Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Improved vector quantized diffusion models, 2023.
Vaswani et al. [2023] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
Verma et al. [2019] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states, 2019.
Wang et al. [2023] Chenyu Wang, Shuo Wang, and Shenghua Gao. Vision transformer-based spatially conditioned graphs for long tail visual relationship recognition cvpr 2023 ltvrr challenge. arXiv preprint arXiv, 2023.
Zhang et al. [2021] Frederic Z. Zhang, Dylan Campbell, and Stephen Gould. Spatially conditioned graphs for detecting human-object interactions, 2021.
Zhang et al. [2019] Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, and Mohamed Elhoseiny. Large-scale visual relationship understanding, 2019.
Zhang and Chen [2023] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In The Eleventh International Conference on Learning Representations, 2023.
Zheng et al. [2023] Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning, 2023.