
University of Stuttgart, Germany
{adnen.abdessaied, lei.shi, andreas.bulling}@vis.uni-stuttgart.de
https://perceptualui.org/publications/abdessaied24_eccv

Multi-Modal Video Dialog State Tracking in the Wild

Adnen Abdessaied (ORCID: 0000-0002-9489-6340), Lei Shi (ORCID: 0000-0003-1628-1559), Andreas Bulling (ORCID: 0000-0001-6317-7303)
Abstract

We present $\mathbb{MST}_{\mathbb{MIXER}}$, a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major aspects: (1) they either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, $\mathbb{MST}_{\mathbb{MIXER}}$ first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). $\mathbb{MST}_{\mathbb{MIXER}}$ achieves new state-of-the-art results on five challenging benchmarks.

Keywords: Video Dialog · Vision & Language · Multi-Modal Learning

Figure 1: $\mathbb{MST}_{\mathbb{MIXER}}$ achieves SOTA results on a broad range of video-language tasks.

1 Introduction

Multi-modal tasks at the intersection of computer vision and natural language processing were introduced to develop intelligent agents capable of assisting humans in understanding a visual premise through language. Among these tasks, video dialog is considered to be one of the most challenging. In contrast to visual [8] and video [68] question answering, which only require reasoning about a single question, video dialog models have to reason over the entire dialog history in addition to the current question. Furthermore, in contrast to visual dialog [15], video dialog involves reasoning over a video instead of a static image. Thus, a crucial part of a video dialog model is Dialog State Tracking (DST), which was originally introduced to track and update users' goals in the form of dialog states [42, 64]. Nowadays, the term is used more broadly for a model keeping track of what it believes to be relevant for answering the question at hand. Until now, research on DST has been predominantly uni-modal in the form of slot-filling tasks [51, 70, 39] where the slots and slot values are constrained by a knowledge domain (e.g. hotel domain) and database schema (e.g. tabular data). However, the current landscape of the field necessitates extending to a multi-modal framework. Current models that claim to perform multi-modal state tracking fall short in two major aspects: (1) Some works track the constituents of only one modality to help the model focus on the most salient ones within a multi-modal context (e.g. video dialog [50], visual dialog [52], image retrieval [20], recommender systems [66]), rendering their state tracking approach uni-modal. More recently, Le et al. [34] proposed VDTN, which extended the slot-filling paradigm to predict the visual attributes of CATER objects [19] from a pool of pre-defined textual values, resulting in the same aforementioned limitation.
(2) Other works [31, 49] moved closer to performing multi-modal state tracking but were limited to synthetic datasets that do not reflect the complexity of real-world scenarios.

To this end, we present $\mathbb{MST}_{\mathbb{MIXER}}$ as an attempt to address the aforementioned shortcomings. Specifically, $\mathbb{MST}_{\mathbb{MIXER}}$ uses a backbone VLM and attention-based modality-specific tracking blocks to identify the most relevant constituents of each modality. Then, it uses a multi-modal GNN-based approach to learn the missing underlying structure between the mix of modalities in the form of latent graphs. Finally, it uses the fine-grained GNN features to enhance the hidden states of the backbone VLM in order to answer the question at hand more efficiently. To summarize, the contributions of our work are three-fold: (1) We propose $\mathbb{MST}_{\mathbb{MIXER}}$, a novel video dialog model that, unlike previous works, performs multi-modal state tracking on each input modality separately. Our model is generic by nature and could be easily adapted to deal with a wide range of tasks and datasets. (2) We equip our model with a novel divide-and-conquer GNN-based mechanism that dynamically learns the missing underlying structure of the mix of all modalities. First, it selects the most important constituents of each modality and learns their respective local structures in the form of latent graphs. Then, it parses all individual graphs and features into a global modality-agnostic graph to further refine its structure and node features, which we use to enhance the hidden states of the backbone VLM.
(3) As seen in Figure 1, $\mathbb{MST}_{\mathbb{MIXER}}$ sets new state-of-the-art results across a broad range of video-language tasks.

2 Related Work

2.0.1 Video Dialog.

Video dialog emerged as a natural extension to visual question answering [8], video question answering [69], and visual dialog [15]. Alamri et al. [4] proposed AVSD, one of the first video dialog datasets, based on the Charades videos [59], which became the default dataset for the task. Later works [35, 45] leveraged the advantages of pre-trained large language models [43, 56] and fine-tuned them on the downstream video dialog task, achieving new state-of-the-art results. Others used GNNs to perform reasoning on the dialog history [32] or on the visual scene [27] in an attempt to improve performance. Pham et al. [55] proposed an object-centric model that tracks object-associated dialog states upon receiving new questions. Inspired by the success of neural module networks [6, 7], Le et al. [33] introduced VGNMN to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. More recently, Yoon et al. [73] introduced a text hallucination mitigation framework based on a hallucination regularization loss. Despite the high multi-modality of the task in general and the AVSD dataset in particular, the aforementioned works did not perform explicit multi-modal dialog state tracking. Instead, they relied on vanilla attention methods that tracked only one modality (mostly the visual input) at the expense of the others. $\mathbb{MST}_{\mathbb{MIXER}}$ moves towards closing this research gap by performing multi-modal state tracking on each input modality separately.

2.0.2 Dialog State Tracking.

Traditional state tracking approaches consisted of predicting slot values (e.g. meals offered by a restaurant) from a pre-defined set at each dialog turn conditioned on some context. As a result, these approaches remained predominantly uni-modal even though they were applied within a multi-modal context (e.g. video dialog [50], visual dialog [52], image retrieval [20], recommender systems [66]). However, the current landscape of dialog research necessitates the transition to multi-modal dialog state tracking to cope with the complexity of recent datasets. Some works [1] have already been proposed to address this problem. For example, SIMMC [49, 31] was introduced to develop agents capable of helping a human in a shopping scenario; such agents therefore need to track the multi-modal state of the dialog to fulfill their task efficiently. More recently, Le et al. [34] suggested performing video dialog state tracking by extending the slot-filling task to predict predefined attributes of CATER [19] objects, limiting their approach to only the DVD dataset [38].

That said, these works focused only on synthetic and automatically generated datasets. To the best of our knowledge, $\mathbb{MST}_{\mathbb{MIXER}}$ is the first model to perform genuine multi-modal state tracking in the wild for video dialog by being able to deal with complex real-world scenarios.

2.0.3 Graph Structure Learning.

Early works on graph structure learning leveraged bilevel programming [14] to simultaneously learn GNN parameters and topology [17]. Yu et al. [75] proposed applying the linear structure equation model in conjunction with a variational autoencoder [57] in order to learn directed acyclic graphs. Subsequently, Elinas et al. [16] suggested using a stochastic variational inference model to jointly estimate the graph posterior and the GNN parameters. Chen et al. [11] proposed to iteratively refine the graph topology in an end-to-end manner using graph similarity metric learning. Wu et al. [65] suggested an all-pair message passing method to efficiently propagate signals between arbitrary nodes for classification.

Our method differs from the aforementioned works in three distinct aspects: (1) We propose a novel multi-modal graph structure learning method that relies on a two-stage divide-and-conquer procedure that first predicts local modality-specific latent graphs before tackling the global graph consisting of the mix of all available modalities. (2) We use our graph learning approach to enhance the hidden states of a backbone VLM. (3) Instead of dealing with uni-modal graph-based tasks (node, edge, or graph classification), we investigate the effect of our method on the multi-modal, non-graph related downstream task of video dialog.

Figure 2: $\mathbb{MST}_{\mathbb{MIXER}}$ takes a video, a dialog history, and a question as input and autoregressively generates an answer as output. It uses a BART backbone adapted to deal with multi-modal input features and enhanced via our graph-based mixing approach.

3 Method

3.1 Problem Formulation

Given a question $\texttt{Q}_{\texttt{t}}$ grounded on a video $\texttt{V}$ at the $\texttt{t}$-th dialog turn and a dialog history $\texttt{H}_{\texttt{t}} = \{\texttt{C}, (\texttt{Q}_{\texttt{1}}, \texttt{A}_{\texttt{1}}), \dots, (\texttt{Q}_{\texttt{t-1}}, \texttt{A}_{\texttt{t-1}})\}$ composed of previous question-answer pairs and a video caption $\texttt{C}$, a video dialog model is tasked with autoregressively generating a free-form answer $\texttt{A}_{\texttt{t}}$ to the question at hand, i.e. each answer token $\texttt{a}_{\texttt{t}}^{\texttt{i}}$ satisfies

$$\texttt{a}_{\texttt{t}}^{\texttt{i}} = \operatorname*{arg\,max}_{\texttt{a} \in \mathcal{V}} \left[ P\left(\texttt{a} \mid \texttt{V}, \texttt{Q}_{\texttt{t}}, \texttt{H}_{\texttt{t}}, \texttt{A}_{\texttt{t}}^{<\texttt{i}}\right) \right], \tag{1}$$

where $\texttt{A}_{\texttt{t}}^{<\texttt{i}}$ and $\mathcal{V}$ denote the previously predicted answer tokens and the vocabulary, respectively.
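The token-by-token rule in Equation 1 amounts to greedy decoding. The following is a minimal sketch of that loop, not the authors' implementation: `greedy_decode`, `toy_model`, the `eos` symbol, and `max_len` are hypothetical names introduced only for illustration, and the toy distribution stands in for the conditional probability produced by a real video dialog model.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_probs: Callable[[dict, List[str]], Dict[str, float]],
    context: dict,
    eos: str = "</s>",
    max_len: int = 20,
) -> List[str]:
    """Pick the arg-max token at every step, conditioned on the tokens so far."""
    answer: List[str] = []
    for _ in range(max_len):
        # P(a | V, Q_t, H_t, A_t^{<i}) over the vocabulary (hypothetical callable).
        probs = next_token_probs(context, answer)
        token = max(probs, key=probs.get)
        if token == eos:
            break
        answer.append(token)
    return answer


def toy_model(context: dict, prefix: List[str]) -> Dict[str, float]:
    """Stand-in model (assumption): always prefers the next token of a fixed script."""
    script = ["a", "man", "</s>"]
    target = script[min(len(prefix), len(script) - 1)]
    vocab = ["a", "man", "dog", "</s>"]
    return {word: (0.7 if word == target else 0.1) for word in vocab}


if __name__ == "__main__":
    print(greedy_decode(toy_model, context={}))
```

In practice, beam search is a common alternative to the pure arg-max rule; the sketch only mirrors the formulation as written.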

3.2 Input Representation Learning

As can be seen from Figure 2, $\mathbb{MST}_{\mathbb{MIXER}}$ is based on BART [43] and adapted to handle data from multiple input modalities.

3.2.1 Visual Representations.

As is standard for this task, the visual representations are extracted for a given video using I3D-rgb and I3D-flow models [10] pre-trained on YouTube videos and the Kinetics dataset [26]. Formally, a video V is first split into $l_{\textrm{v}}$ segments using a sliding window of $n$ frames. Then, each segment $S = \{f_1, f_2, \dots, f_n\}$, where $f_i$ represents one video frame, is fed to the pre-trained I3D models to extract the $d_{\textrm{v}}$-dimensional video features $V_{\textrm{rgb}}, V_{\textrm{flow}} \in \mathbb{R}^{l_{\textrm{v}} \times d_{\textrm{v}}}$. Finally, we extracted object features $V_{\textrm{sam}} \in \mathbb{R}^{l_{\textrm{v}} \times d_{\textrm{s}}}$ from the middle frame of the video using SAM [30].
We mapped these features to match the hidden dimension $d$ of BART using linear projections with weight matrices $W_{\textrm{rgb}}, W_{\textrm{flow}}, W_{\textrm{sam}}$.

3.2.2 Audio Representations.

Similar to previous works [73, 32, 45], we used audio features extracted from a pre-trained VGGish model [60]. Since video and audio are synchronous, the same splits were used to generate the $d_a$-dimensional audio features $A_{\textrm{vggish}} \in \mathbb{R}^{l_{\textrm{v}} \times d_a}$. As with the video features, we mapped the audio features to the BART embedding space using a linear projection with a weight matrix $W_a \in \mathbb{R}^{d \times d_a}$. We refer to [22] for further details about feature extraction.

3.2.3 Textual Representations.

We used the dialog history composed of the video caption and the previous question-answer pairs as well as the current question as additional input to the encoder. We separated each segment with the special token </s>. Subsequently, we embedded their concatenation into a dense representation $T = [T_{\textrm{H}}, T_{\textrm{Q}}] \in \mathbb{R}^{l_{\textrm{txt}} \times d}$ using a word embedding matrix $W_{\textrm{txt}} \in \mathbb{R}^{|\mathcal{V}| \times d}$, where $l_{\textrm{txt}}$, $\mathcal{V}$, $T_{\textrm{H}}$, and $T_{\textrm{Q}}$ are the length of the textual input, the vocabulary, the dense representation of the history, and the dense representation of the question, respectively. Finally, we input a shifted ground-truth answer to the decoder and embedded it using the same word embedding matrix.

3.2.4 State Tokens.

We inserted special state tokens $\texttt{<s}_{\texttt{i}}\texttt{>}$ at the beginning of each modality ($V_{\textrm{rgb}}, V_{\textrm{flow}}, V_{\textrm{sam}}, A_{\textrm{vggish}}, T_{\textrm{H}}, T_{\textrm{Q}}$) and used them to keep track of the most relevant constituents.
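To make the layout of the encoder input concrete, the sketch below prepends one state token per modality and records where each token and each modality span ends up. It is an illustrative assumption rather than the paper's code: `build_encoder_input`, the `<s1>`, `<s2>`, ... token strings, and the use of string placeholders instead of embedding vectors are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_encoder_input(
    modalities: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, int], Dict[str, Tuple[int, int]]]:
    """Concatenate modalities, placing a state token in front of each one.

    Returns the flat sequence, the position of every state token, and the
    half-open [start, end) span of the regular tokens of every modality.
    """
    sequence: List[str] = []
    state_positions: Dict[str, int] = {}
    spans: Dict[str, Tuple[int, int]] = {}
    for i, (name, tokens) in enumerate(modalities.items(), start=1):
        state_positions[name] = len(sequence)
        sequence.append(f"<s{i}>")  # hypothetical state-token spelling
        start = len(sequence)
        sequence.extend(tokens)
        spans[name] = (start, len(sequence))
    return sequence, state_positions, spans


if __name__ == "__main__":
    seq, states, spans = build_encoder_input(
        {"rgb": ["r1", "r2"], "audio": ["a1"], "question": ["q1", "q2"]}
    )
    print(seq)
    print(states)
    print(spans)
```

Keeping the spans around makes it straightforward to look up which tokens a given state token should attend to later on.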

Figure 3: In Stage I, $\mathbb{MST}_{\mathbb{MIXER}}$ first gathers multi-modal features $\{X_i\}$ from the previous BART layer and computes their respective initial local structures $\{\tilde{A}_i\}$. Then, it simultaneously learns the local latent multi-modal graphs and refines the features using a two-stream framework, i.e. $\{A'_{i,j}, A''_{i,j}\}_j$ and $\{Z'_{i,j}, Z''_{i,j}\}_j$, respectively.
Finally, it outputs the final multi-modal latent graph $A_i$ used to compute the local ELBO loss $\mathcal{L}_{\textrm{ELBO}}^{\textrm{local}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\textrm{ELBO}}^{\textrm{local}, i}$.

3.3 $\mathbb{MST}_{\mathbb{MIXER}}$: Multi-Modal Feature Mixing

The main idea of $\mathbb{MST}_{\mathbb{MIXER}}$ is to keep track of the most relevant constituents at different semantic levels (e.g. across modalities and encoder layers) and use them to refine the multi-modal state of the model. Specifically, we insert a $\mathbb{MIXER}$ layer after every $\Delta$ encoder layers. Our approach follows a two-stage divide-and-conquer scheme where we first learn the underlying local structures of the individual modalities before learning the global inter-modal structure of the mix of all available modalities. We posit that directly learning the latter might be daunting for such a highly multi-modal task.

3.3.1 Multi-Modal Feature Tracking.

We take advantage of the special state tokens $\texttt{<s}_{\texttt{i}}\texttt{>}$ to keep track of the most relevant modality-specific features at different embedding levels of the encoder. Specifically, for each modality, we select the $K$ tokens with the highest attention values with respect to the respective state token, i.e.

$$X_i = \mathrm{top}_K\left(\alpha_{\textrm{avg}}\left(h_{\texttt{<s}_{\texttt{i}}\texttt{>}}, H_i\right)\right) \in \mathbb{R}^{K \times d}, \tag{2}$$

where $\alpha_{\textrm{avg}}(h_{\texttt{<s}_{\texttt{i}}\texttt{>}}, H_i)$ denotes the attention values between the state embedding and the remaining token embeddings $H_i$ of the $i$-th modality, averaged across heads.
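The selection in Equation 2 can be sketched in a few lines: average the state token's attention over heads, then keep the $K$ highest-scoring tokens of that modality. The function name `select_top_k`, the list-based data layout, and the choice to return the selected tokens in their original sequence order are assumptions made for this illustration; the source does not specify them.

```python
from typing import List, Tuple


def select_top_k(
    attn_per_head: List[List[float]],
    hidden: List[List[float]],
    k: int,
) -> Tuple[List[int], List[List[float]]]:
    """Select the k tokens with the highest head-averaged attention.

    attn_per_head[h][j]: attention from the state token to token j in head h.
    hidden[j]: embedding of token j (one row of H_i).
    """
    num_heads = len(attn_per_head)
    length = len(hidden)
    # alpha_avg: mean attention value per token across all heads.
    avg = [sum(head[j] for head in attn_per_head) / num_heads for j in range(length)]
    top = sorted(range(length), key=lambda j: avg[j], reverse=True)[:k]
    top.sort()  # assumption: keep the selected tokens in sequence order
    return top, [hidden[j] for j in top]


if __name__ == "__main__":
    attention = [[0.1, 0.6, 0.2, 0.1], [0.3, 0.2, 0.4, 0.1]]
    embeddings = [[0.0], [1.0], [2.0], [3.0]]
    indices, selected = select_top_k(attention, embeddings, k=2)
    print(indices, selected)
```

The returned indices are what a later step would need in order to write refined features back to their original positions.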

3.3.2 Mixing Stage I (Divide).

We posit that the selected features $\{X_i\}$ of each modality encapsulate rich information that could be leveraged to improve the learning capabilities of our model. A viable approach is to take advantage of the power of GNNs to refine these features based on their local structures, as prior works highlighted the merit of integrating GNNs with transformer-based models [2, 71, 72]. However, in our case, the underlying structures that govern $\{X_i\}$ are missing. To this end, we propose a novel multi-modal graph structure learning approach that simultaneously learns the graph weights and the adjacency matrix in the form of latent graphs. We posit that we can split the adjacency matrix $A_i$ of the $i$-th modality into an initial (observable) part $\tilde{A}_i$ and a missing (sought-after) part $A_i'$, where $\tilde{A}_i$ is a binary matrix constructed using a $k$NN ($k=4$) approach based on $X_i$. Thus,

\begin{align}
P(X_i, A_i) &= P(A_i \mid X_i)\, P(X_i) \tag{3} \\
&= P(A_i', \tilde{A}_i \mid X_i)\, P(X_i). \tag{4}
\end{align}
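The observable part $\tilde{A}_i$ described above is a binary $k$NN graph over the selected features. The sketch below shows one plausible construction. The similarity measure (cosine), the symmetrization, and the exclusion of self-loops are assumptions on our part, since the text only states that a $k$NN approach with $k=4$ is used; `cosine` and `knn_adjacency` are hypothetical helper names.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_adjacency(features: List[List[float]], k: int = 4) -> List[List[int]]:
    """Binary adjacency connecting every node to its k most similar nodes."""
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        sims.sort(key=lambda pair: pair[0], reverse=True)
        for _, j in sims[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # assumption: treat the graph as undirected
    return adj


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    for row in knn_adjacency(x, k=1):
        print(row)
```

Because of the symmetrization, a node may end up with more than $k$ neighbors; a directed variant would simply drop the mirrored assignment.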
Figure 4: Overview of mixing stage II. (a) We use the predicted local latent graphs $\{A_i\}$ to initialize $\tilde{A} = \mathrm{diag}([A_1, \dots, A_N], 0)$ in order to learn the final global latent graph $A$. The updated node features $Z$ are scattered back to their initial positions in the BART layer. (b) We update the state embeddings $h_{\texttt{<s}_{\texttt{i}}\texttt{>}}$ by averaging the corresponding features from $Z$.
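The two operations named in the Figure 4 caption can be illustrated with a small sketch: assembling the local graphs into a block-diagonal initial global adjacency with zeros between modalities, and averaging refined node features to obtain an updated state embedding. This is a reading of the caption only, not the authors' code; `block_diagonal` and `mean_rows` are hypothetical helpers, and plain nested lists stand in for tensors.

```python
from typing import List


def block_diagonal(blocks: List[List[List[float]]]) -> List[List[float]]:
    """Place square blocks on the diagonal; all cross-block entries stay 0."""
    total = sum(len(block) for block in blocks)
    out = [[0.0] * total for _ in range(total)]
    offset = 0
    for block in blocks:
        size = len(block)
        for r in range(size):
            for c in range(size):
                out[offset + r][offset + c] = block[r][c]
        offset += size
    return out


def mean_rows(rows: List[List[float]]) -> List[float]:
    """Average a set of node feature vectors into a single vector."""
    count = len(rows)
    return [sum(row[d] for row in rows) / count for d in range(len(rows[0]))]


if __name__ == "__main__":
    a1 = [[0.0, 1.0], [1.0, 0.0]]  # local graph of a first modality
    a2 = [[1.0]]                   # local graph of a second modality
    print(block_diagonal([a1, a2]))
    # Updated state embedding from two refined node features of one modality.
    print(mean_rows([[1.0, 3.0], [3.0, 5.0]]))
```

Starting from zero cross-modal entries means every inter-modal edge in the final global graph has to be learned rather than inherited from the local graphs.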

Although the conditional distribution $P(A_i', \tilde{A}_i \mid X_i)$ can be modeled by a parametric family of distributions $p^i_\theta(A_i', \tilde{A}_i \mid X_i)$, the optimal parameter set $\bar{\theta}$ is not known, making the computation of the marginal

$$p^i_\theta(\tilde{A}_i \mid X_i) = \int p^i_\theta(A_i', \tilde{A}_i \mid X_i)\, d(A_i') \tag{5}$$

and therefore the posterior of each modality

$$p^i_\theta(A_i' \mid \tilde{A}_i, X_i) = \frac{p^i_\theta(A_i', \tilde{A}_i \mid X_i)}{p^i_\theta(\tilde{A}_i \mid X_i)} \tag{6}$$

intractable. To infer the missing part of the local adjacency matrix, we take advantage of Variational Inference (VI) to learn an approximation $q^{i}_{\phi}(A_{i}^{\prime}|\tilde{A}_{i},X_{i})$ of the posterior. We postulate that the missing adjacency matrix of modality $i$ depends not only on its own features $X_{i}$ but also on the features of the other modalities $X_{j\neq i}$. Therefore, we propose a multi-modal conditioning (MMC) of Equation 6 on all $X_{j\neq i}$ in addition to $X_{i}$. We also follow the idea of [11] that better graph structures lead to better features and better features lead to better graph structures. Therefore, as shown in Figure 3, we use a two-stream approach to learn both $q_{\phi}^{i}$ and $p_{\theta}^{i}$ for each modality: one stream uses enhanced features to learn the latent multi-modal graphs, and the other uses the predicted graphs to infer fine-grained features.
Specifically, in the purple module of the upper stream, we estimate an edge of the latent graph $A^{\prime}_{i,j}$ using cosine similarity as

$$a^{\prime}_{mn}=\frac{1}{K}\sum_{k=1}^{K}\mathrm{cos}(w_{j}^{k}\odot x_{m},w_{j}^{k}\odot x_{n}), \tag{7}$$

where $x_{m},x_{n}\in X_{i}$, $\{w_{j}^{k}\}$ are learnable weights for each modality, and $\odot$ denotes element-wise multiplication. Then, in the green module, we update the multi-modal node features using an APPNP [18] module and the predicted latent graphs for modality $i$ to get $\{Z^{\prime}_{i,j}\}_{j}$. For the lower stream, we start by updating the node features as in the upper stream, but using the initial graphs $\{\tilde{A}_{i}\}$, to get $\{Z^{\prime\prime}_{i,j}\}_{j}$.
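For readers unfamiliar with APPNP [18], its propagation step can be illustrated with the following minimal sketch in plain Python. It is an illustration only: the function names, the teleport probability `alpha`, the number of steps, and the restriction to a non-negative adjacency matrix are assumptions made for this example and are not taken from the model's implementation.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a non-negative square matrix."""
    n = len(adj)
    # Add self-loops so that every node keeps part of its own signal.
    a_hat = [[adj[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
             for r in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[inv_sqrt_deg[r] * a_hat[r][c] * inv_sqrt_deg[c] for c in range(n)]
            for r in range(n)]


def appnp_propagate(adj, feats, alpha=0.1, steps=10):
    """Iterate Z <- (1 - alpha) * A_norm @ Z + alpha * Z0, starting from Z0 = feats."""
    a_norm = normalize_adjacency(adj)
    n, d = len(feats), len(feats[0])
    z = [row[:] for row in feats]
    for _ in range(steps):
        z = [[(1.0 - alpha) * sum(a_norm[r][k] * z[k][c] for k in range(n))
              + alpha * feats[r][c]
              for c in range(d)]
             for r in range(n)]
    return z


if __name__ == "__main__":
    # Two connected nodes: features are smoothed towards each other.
    print(appnp_propagate([[0.0, 1.0], [1.0, 0.0]], [[1.0], [3.0]]))
```

The teleport term `alpha * feats` keeps each node anchored to its input features, so repeated propagation smooths features along the graph without collapsing them to a single value.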
Then, we use the enhanced node features $\{[Z^{\prime}_{i,j},Z^{\prime\prime}_{i,j}]\}_{j}$ to predict the second set of local latent graphs $\{A^{\prime\prime}_{i,j}\}_{j}$. At the end, we output the final local latent graph of modality $i$ as

$$A_{i}=\underbrace{\frac{1}{2}\tilde{A}_{i}}_{\textrm{Initialization Bias (IB)}}+\underbrace{\frac{1}{2}\sum_{j=1}^{N}\frac{1}{N}(A^{\prime}_{i,j}+A^{\prime\prime}_{i,j})}_{\textrm{VI approximation via MMC}}\in\mathbb{R}^{K\times K}. \tag{8}$$
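Equations 7 and 8 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: node features are lists of floats, `heads` stands for the learnable weight vectors of Equation 7 (fixed here rather than learned), and all function and argument names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def latent_edges(feats, heads):
    """Eq. 7: average the cosine similarity over re-weighted views of the features."""
    n = len(feats)
    graph = [[0.0] * n for _ in range(n)]
    for w in heads:
        # Element-wise product of one weight vector with every node feature.
        views = [[wi * xi for wi, xi in zip(w, x)] for x in feats]
        for m in range(n):
            for k in range(n):
                graph[m][k] += cosine(views[m], views[k]) / len(heads)
    return graph


def final_local_graph(a_init, upper, lower):
    """Eq. 8: 0.5 * A_init + 0.5 * (1/N) * sum_j (A'_j + A''_j)."""
    n_mod = len(upper)  # number of conditioning modalities N
    size = len(a_init)
    return [[0.5 * a_init[r][c]
             + 0.5 * sum(upper[j][r][c] + lower[j][r][c] for j in range(n_mod)) / n_mod
             for c in range(size)]
            for r in range(size)]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(latent_edges(feats, heads=[[1.0, 1.0], [0.5, 2.0]]))
```

In this reading, the initial graph acts as a fixed prior that contributes half of the final structure, while the other half comes from the graphs predicted by the two streams.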

3.3.3 Mixing Stage II (Conquer).

This stage tries to infer the global latent graph structure governing the mix of all modalities $\{X_{i}\}$. As seen in Figure 4(a), it depends on the previously predicted local latent graphs to build the initial global graph as

$$\tilde{A}=\mathrm{diag}([A_{1},\dots,A_{N}],0)\in\mathbb{R}^{NK\times NK}. \tag{9}$$
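The initialization in Equation 9 can be sketched as a block-diagonal stacking of the local graphs. The snippet below reads $\mathrm{diag}(\cdot,0)$ as placing the $N$ local $K\times K$ graphs on the main diagonal and zeros elsewhere, which is an interpretation of the notation; the function name is hypothetical.

```python
def block_diag(local_graphs):
    """Stack square matrices along the main diagonal; off-diagonal blocks are zero."""
    sizes = [len(g) for g in local_graphs]
    total = sum(sizes)
    out = [[0.0] * total for _ in range(total)]
    offset = 0
    for g, size in zip(local_graphs, sizes):
        for r in range(size):
            for c in range(size):
                out[offset + r][offset + c] = g[r][c]
        offset += size
    return out


if __name__ == "__main__":
    a1 = [[1.0, 0.5], [0.5, 1.0]]
    a2 = [[1.0, 0.2], [0.2, 1.0]]
    for row in block_diag([a1, a2]):
        print(row)
```

Under this reading, the initial global graph contains no cross-modal edges, so any such edges have to be inferred by the global graph learner.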

Similar to Stage I, we use a two-stream approach to learn the global $p_{\theta}$ and $q_{\phi}$ and thus the global latent graph $A$ and node features

$$Z=\frac{1}{2}(Z^{\prime}+Z^{\prime\prime}), \tag{10}$$

where $Z^{\prime}$ and $Z^{\prime\prime}$ are obtained from the upper and lower streams, respectively. Finally, we update the state token embeddings $h_{<s_{i}>}$ by averaging the corresponding features from $Z$ (see Figure 4(b)) and integrate the latter back into the hidden state of the corresponding BART layer following

$$H=(1-\lambda)(H\oslash(Z,\mathrm{Idx}))+\lambda H, \tag{11}$$

where $\lambda\in(0,1)$ is a hyper-parameter and $\oslash$, $H$, and $\mathrm{Idx}$ denote the scatter operation, the hidden state of the BART layer, and the indices of the node features $Z$ relative to $H$, respectively.
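Equations 10 and 11 can be illustrated with a small sketch. It assumes that the scatter operation overwrites the rows of a copy of $H$ at the positions $\mathrm{Idx}$ with the rows of $Z$ and leaves all other rows untouched; this is one plausible reading, and the function names are hypothetical.

```python
def average_streams(z_upper, z_lower):
    """Eq. 10: element-wise mean of the node features from the two streams."""
    return [[0.5 * (a + b) for a, b in zip(ru, rl)]
            for ru, rl in zip(z_upper, z_lower)]


def fuse_into_hidden(hidden, z, idx, lam):
    """Eq. 11: scatter Z into a copy of H at rows idx, then blend with the original H."""
    scattered = [row[:] for row in hidden]
    for k, pos in enumerate(idx):
        scattered[pos] = z[k][:]
    return [[(1.0 - lam) * s + lam * h for s, h in zip(rs, rh)]
            for rs, rh in zip(scattered, hidden)]


if __name__ == "__main__":
    hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    z = average_streams([[8.0, 8.0]], [[12.0, 12.0]])
    print(fuse_into_hidden(hidden, z, idx=[1], lam=0.9))
```

Under this reading, rows that are not indexed by $\mathrm{Idx}$ are left unchanged, and setting $\lambda=1$ returns the original hidden state, i.e. the graph features are ignored.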

Loss Function.

Since we rely on VI to infer the local and global latent graphs, we used two ELBO losses to optimize (1) the local multi-modal graph learners $\{q_{\phi}^{i},p_{\theta}^{i}\}$ and (2) the global learners $\{q_{\phi},p_{\theta}\}$. Please refer to the supplementary material for the derivation of these losses. We trained our model end-to-end using a combination of the generative loss of the video dialog task $\mathcal{L}_{\mathrm{gen}}$ and both ELBO losses, i.e.

$$\mathcal{L}=\alpha_{1}\mathcal{L}_{\mathrm{gen}}-\alpha_{2}\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}-\alpha_{3}\mathcal{L}_{\textrm{ELBO}}^{\mathrm{global}}, \tag{12}$$
$$\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{\textrm{ELBO}}^{\textrm{local},i}, \tag{13}$$

where $\{\alpha_{k}\}$ are hyper-parameters and $\mathcal{L}_{\textrm{ELBO}}^{\textrm{local},i}$ is the local ELBO loss for the $i$-th modality.
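How the terms of Equations 12 and 13 combine can be illustrated in a few lines of Python. This is a minimal sketch: the function name, argument names, and default weights are hypothetical, and the individual loss values are assumed to be computed elsewhere.

```python
def total_loss(l_gen, local_elbos, global_elbo, alphas=(1.0, 1.0, 1.0)):
    """Combine the generative loss with the local and global ELBO terms.

    The ELBO terms are subtracted because an ELBO is maximized
    while the total loss is minimized.
    """
    a1, a2, a3 = alphas
    local = sum(local_elbos) / len(local_elbos)  # Eq. 13: mean over modalities
    return a1 * l_gen - a2 * local - a3 * global_elbo  # Eq. 12


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    print(total_loss(2.0, [-1.0, -3.0], -0.5, alphas=(1.0, 0.5, 0.25)))
```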

4 Experiments

4.1 Datasets

We mainly evaluated our model on the popular and challenging Audio-Visual Scene Aware Dialog (AVSD) dataset [4]. Each of its dialogs comes with 10 question-answer pairs as well as a short description/caption based on a video. Each video is collected from the Charades dataset [59] and the dialogs are generated by human annotators. We considered all three benchmarks of the dataset, i.e. AVSD-DSTC7 [74], AVSD-DSTC8 [28], and AVSD-DSTC10 [58], which were released for the respective editions of the Dialog System Technology Challenge (DSTC). To assess the generalizability of our model, we experimented not only with the generative task of SIMMC 2.0 [31] but also with the recent and challenging open-ended video question answering NExT-QA dataset [67]. We refer to the supplementary material for more details about all five benchmarks.

4.2 Metrics

We used the established official metrics for each dataset in order to fairly compare $\mathbb{MST}_{\mathbb{MIXER}}$ with the previous models. Specifically, for all three AVSD datasets, we used BLEU (B-n) [53], ROUGE-L (R) [46], METEOR (M) [9], and CIDEr (C) [62]. For SIMMC and NExT-QA, we used B-4 and WUPS [48] scores, respectively.

Table 1: Results on AVSD-DSTC7 and AVSD-DSTC8. Best and second best performances are in bold and underlined, respectively. ♠ = Two-stage training.

| Model | Venue | DSTC7 B-1 | DSTC7 B-2 | DSTC7 B-3 | DSTC7 B-4 | DSTC7 M | DSTC7 R | DSTC7 C | DSTC8 B-1 | DSTC8 B-2 | DSTC8 B-3 | DSTC8 B-4 | DSTC8 M | DSTC8 R | DSTC8 C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline [22] | ICASSP’19 | 62.1 | 48.0 | 37.9 | 30.5 | 21.7 | 48.1 | 73.3 | 61.4 | 46.7 | 36.5 | 28.9 | 21.0 | 48.0 | 65.1 |
| MTN [36] | ACL’19 | 71.5 | 58.1 | 47.6 | 39.2 | 26.9 | 55.9 | 106.6 | -- | -- | -- | -- | -- | -- | -- |
| JMAN [13] | AAAI’20 | 66.7 | 52.1 | 41.3 | 33.4 | 23.9 | 53.3 | 94.1 | 64.5 | 50.4 | 40.2 | 32.4 | 23.2 | 52.1 | 87.5 |
| VGD [35] | ACL’20 | 74.9 | 62.0 | 52.0 | 43.6 | 28.2 | 58.2 | 119.4 | -- | -- | -- | -- | -- | -- | -- |
| BiST [37] | EMNLP’20 | 75.5 | 61.9 | 51.0 | 42.9 | 28.4 | 58.1 | 119.2 | 68.4 | 54.8 | 45.7 | 37.6 | 27.3 | 56.3 | 101.7 |
| SCGA [27] | AAAI’21 | 74.5 | 62.2 | 51.7 | 43.0 | 28.5 | 57.8 | 120.1 | 71.1 | 59.3 | 49.7 | 41.6 | 27.6 | 56.6 | 112.3 |
| RLM [45] | TASLP’21 | 76.5 | 64.3 | 54.3 | 45.9 | 29.4 | 60.6 | 130.8 | 74.6 | 62.6 | 52.8 | 44.5 | 28.6 | 59.8 | 124.0 |
| PDC [32] | ICLR’21 | 77.0 | 65.3 | 53.9 | 44.9 | 29.2 | 60.6 | 129.5 | 74.9 | 62.9 | 52.8 | 43.9 | 28.5 | 59.2 | 120.1 |
| AV-TRN [58] | ICASSP’22 | -- | -- | -- | 40.6 | 26.2 | 55.4 | 107.9 | -- | -- | -- | 39.4 | 25.0 | 54.5 | 99.7 |
| VGNMN [33] | NAACL’22 | -- | -- | -- | 42.9 | 27.8 | 57.8 | 118.8 | -- | -- | -- | -- | -- | -- | -- |
| COST [55] | ECCV’22 | 72.3 | 58.9 | 48.3 | 40.0 | 26.6 | 56.1 | 108.5 | 69.5 | 55.9 | 46.5 | 3.82 | 27.8 | 57.4 | 105.1 |
| MRLV [3] | NeurIPS’22 | -- | 59.2 | 49.3 | 41.5 | 26.9 | 56.9 | 115.9 | -- | -- | -- | -- | -- | -- | -- |
| THAM [73] | EMNLP’22 | 77.8 | 65.4 | 54.9 | 46.8 | 30.8 | 61.9 | 133.5 | 76.4 | 64.1 | 53.8 | 45.5 | 30.1 | 61.0 | 130.4 |
| DialogMCF [12] | TASLP’23 | 77.7 | 65.3 | 54.7 | 45.7 | 30.6 | 61.3 | 135.2 | 75.6 | 63.3 | 53.2 | 44.9 | 29.3 | 60.1 | 125.3 |
| ITR [76] | PAMI’23 | 78.2 | 65.5 | 55.2 | 46.9 | 30.5 | 61.9 | 133.1 | 76.2 | 64.1 | 54.3 | 46.0 | 29.8 | 60.7 | 128.5 |
| $\mathbb{MST}_{\mathbb{MIXER}}$ | ECCV’24 | $\mathbf{78.7}$ | $\mathbf{66.5}$ | $\mathbf{56.3}$ | $\mathbf{47.6}$ | $\mathbf{31.3}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{77.5}$ | $\mathbf{66.0}$ | $\mathbf{56.1}$ | $\mathbf{47.7}$ | $\mathbf{30.6}$ | $\mathbf{62.4}$ | $\mathbf{135.4}$ |
| w/o $V_{\textrm{sam}}$ | ECCV’24 | 78.6 | 66.3 | 56.0 | 47.4 | 31.2 | 62.2 | 137.3 | 77.4 | 65.8 | 56.0 | 47.3 | 30.6 | 62.1 | 134.8 |
| $\quad$ w/o $A_{\textrm{vggish}}$ | ECCV’24 | 78.4 | 66.0 | 55.8 | 47.1 | 31.0 | 62.0 | 136.5 | 77.1 | 65.6 | 55.7 | 47.1 | 30.2 | 61.8 | 133.6 |

Table 2: Results on AVSD-DSTC10.

| Model | Venue | B-1 | B-2 | B-3 | B-4 | M | R | C |
|---|---|---|---|---|---|---|---|---|
| AV-TRN [58] | ICASSP’22 | -- | -- | -- | 24.7 | 19.1 | 43.7 | 56.6 |
| $\quad$ + Ext. [58] | ICASSP’22 | -- | -- | -- | 37.1 | 24.5 | 53.5 | 86.9 |
| DSTC10 [23] | AAAI’22 | 67.3 | 54.5 | 44.8 | 37.2 | 24.3 | 53.0 | 91.2 |
| DialogMCF [12] | TASLP’23 | 69.3 | 55.6 | 45.0 | 36.9 | 24.9 | 53.6 | 91.2 |
| $\mathbb{MST}_{\mathbb{MIXER}}$ | ECCV’24 | $\mathbf{70.0}$ | $\mathbf{57.4}$ | $\mathbf{47.6}$ | $\mathbf{40.0}$ | $\mathbf{25.7}$ | $\mathbf{54.5}$ | $\mathbf{99.8}$ |
| w/o $V_{\textrm{sam}}$ | ECCV’24 | 69.8 | 57.4 | 47.5 | 39.8 | 25.6 | 54.3 | 97.6 |
| $\quad$ w/o $A_{\textrm{vggish}}$ | ECCV’24 | 69.7 | 57.1 | 47.2 | 39.5 | 25.1 | 54.0 | 96.9 |

Table 3: Results on SIMMC.

| Model | Venue | B-4 |
|---|---|---|
| MTN [36] | ACL’19 | 21.7 |
| GPT-2 [31] | EMNLP’21 | 19.2 |
| BART [41] | NAACL’22 | 33.1 |
| PaCE [44] | ACL’23 | 34.1 |
| $\mathbb{MST}_{\mathbb{MIXER}}$ | ECCV’24 | $\mathbf{44.7}$ |

Table 4: Results on open-ended NExT-QA.

| Model | Venue | WUPS$_{C}$ | WUPS$_{T}$ | WUPS$_{D}$ | WUPS |
|---|---|---|---|---|---|
| HCRN [40] | CVPR’20 | 16.05 | 17.68 | 49.78 | 23.92 |
| HGA [24] | AAAI’20 | 17.98 | 17.95 | 50.84 | 24.06 |
| Flamingo [5] | NeurIPS’22 | -- | -- | -- | 28.40 |
| KcGA [25] | AAAI’23 | -- | -- | -- | 28.20 |
| EMU [61] | arXiv’23 | -- | -- | -- | 23.40 |
| $\mathbb{MST}_{\mathbb{MIXER}}$ | ECCV’24 | $\mathbf{22.12}$ | $\mathbf{22.20}$ | $\mathbf{55.64}$ | $\mathbf{29.50}$ |

4.3 Main Results

4.3.1 AVSD-DSTC7.

As can be seen in Table 1, our model achieved new SOTA results across all evaluation metrics, thereby outperforming the latest baselines including PDC [32], DialogMCF [12], THAM [73], and ITR [76]. Specifically, $\mathbb{MST}_{\mathbb{MIXER}}$ outperformed the latest ITR [76] model by roughly 1.5% or more (relative improvement) on the B-2, B-3, B-4, and M scores. Since some of the previous models did not use SAM [30] and audio features, we trained two additional versions of our model: in the first, we removed only the SAM features, and in the second, we additionally removed the audio features. These versions are denoted by “w/o $V_{\textrm{sam}}$” and “w/o $A_{\textrm{vggish}}$”, respectively. As can be seen from Table 1, both versions still outperformed all previous models across all evaluation metrics.
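The relative improvements over ITR [76] can be recomputed from the AVSD-DSTC7 scores in Table 1. The short script below is only a convenience for checking the arithmetic; the helper name `relative_gain` is hypothetical and the numbers are copied from the table.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# AVSD-DSTC7 scores copied from Table 1.
ours = {"B-2": 66.5, "B-3": 56.3, "B-4": 47.6, "M": 31.3}
itr = {"B-2": 65.5, "B-3": 55.2, "B-4": 46.9, "M": 30.5}

gains = {name: round(relative_gain(ours[name], itr[name]), 2) for name in ours}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain:.2f}%")
```

The gains range from about 1.5% on B-4 to about 2.6% on M.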

4.3.2 AVSD-DSTC8.

As depicted in Table 1, models tend to struggle more on this more recent benchmark. However, $\mathbb{MST}_{\mathbb{MIXER}}$ scored new SOTA results with higher relative improvements compared to DSTC7, thereby lifting the B-2, B-3, B-4, and C scores by roughly 3% or more relative to the second-best models ITR [76] and THAM [73]. Similarly to AVSD-DSTC7, both our ablated versions surpassed these models on all evaluation metrics and only marginally underperformed our full model.

In Table 4, the subscripts C, T, and D denote causal, temporal, and descriptive questions, respectively.

4.3.3 AVSD-DSTC10.

We then evaluated $\mathbb{MST}_{\mathbb{MIXER}}$ on the latest AVSD-DSTC10 benchmark. In contrast to the previous versions, AVSD-DSTC10 does not include human-generated video descriptions during inference since these are unavailable in real-world applications. As depicted in Table 2, models struggle the most on this version of the challenge. However, both our full $\mathbb{MST}_{\mathbb{MIXER}}$ model and its two ablated versions outperformed the latest models on all evaluation metrics.

4.3.4 SIMMC.

♣: Models trained with optimal hyperparameters from AVSD and without $V_{\textrm{sam}}$.

To assess the generalizability of our model, we additionally tested it on the generative task of SIMMC 2.0 [49]. As can be seen from Table 3, $\mathbb{MST}_{\mathbb{MIXER}}$ outperformed the latest published models such as PaCE [44] by achieving a B-4 score of 44.7.

4.3.5 NExT-QA.

Finally, we tested our model on the recent open-ended NExT-QA benchmark [67]. As depicted in Table 4, $\mathbb{MST}_{\mathbb{MIXER}}$ not only outperformed HCRN [40] and HGA [24] on all WUPS scores [48] but also surpassed the latest models such as Flamingo [5], KcGA [25], and EMU [61]. Specifically, it lifted the overall WUPS score by 1.1 absolute points compared to the seminal Flamingo-9B model, which has roughly 18× more parameters.
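The two figures quoted above follow from simple arithmetic. The parameter ratio assumes that "9B" in the model name denotes roughly $9\times10^{9}$ parameters and uses the approximately 511M parameters listed for our full model in Table 7:

$$29.50-28.40=1.10,\qquad \frac{9\times10^{9}}{511\times10^{6}}\approx 17.6\approx 18.$$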

4.4 Ablation Study

4.4.1 Effect of $\lambda$ and $\Delta$.

We independently optimized these hyper-parameters based on the validation perplexity (PPL). First, we fixed $\Delta=4$ to guarantee a reasonable training time on our hardware setup and varied $\lambda\in\{0,0.1,0.5,0.9,1\}$. As seen in Table 5, the best performance was achieved when using $\lambda=0.9$. Thereafter, we varied $\Delta\in\{2,3,4,5\}$ while keeping $\lambda=0.9$ and achieved the best results for $\Delta=4$, as can be seen from Table 6.

Table 5: Influence of the value of $\lambda$.

| $\lambda$ | PPL (val) | DSTC7 B-4 | DSTC7 R | DSTC7 C | DSTC8 B-4 | DSTC8 R | DSTC8 C |
|---|---|---|---|---|---|---|---|
| 0.0 | Training unstable | | | | | | |
| 0.1 | 11.03 | 17.3 | 29.0 | 35.1 | 11.4 | 24.3 | 21.2 |
| 0.5 | 5.48 | 44.6 | 60.3 | 126.4 | 44.7 | 59.4 | 123.8 |
| 0.9 | $\mathbf{5.16}$ | $\mathbf{47.6}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{47.7}$ | $\mathbf{62.4}$ | $\mathbf{135.4}$ |
| 1.0 | 5.30 | 45.1 | 60.8 | 131.3 | 42.3 | 61.1 | 126.9 |

Table 6: Influence of the value of $\Delta$.

| $\Delta$ | PPL (val) | DSTC7 B-4 | DSTC7 R | DSTC7 C | DSTC8 B-4 | DSTC8 R | DSTC8 C |
|---|---|---|---|---|---|---|---|
| $\leq 2$ | Training too long | | | | | | |
| 3 | 5.19 | 45.7 | 61.5 | 134.1 | 46.7 | 61.5 | 131.8 |
| 4 | $\mathbf{5.16}$ | $\mathbf{47.6}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{47.7}$ | $\mathbf{62.4}$ | $\mathbf{135.4}$ |
| 5 | 5.21 | 45.0 | 61.1 | 133.6 | 44.6 | 60.5 | 129.1 |
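The selection rule used for both hyper-parameters, i.e. picking the setting with the lowest validation perplexity, can be written down as a short sketch. The dictionaries copy the PPL columns of Tables 5 and 6; settings without a reported PPL (unstable or overly long training) are left out, and the helper name is hypothetical.

```python
# Validation perplexities copied from Tables 5 and 6.
lambda_ppl = {0.1: 11.03, 0.5: 5.48, 0.9: 5.16, 1.0: 5.30}
delta_ppl = {3: 5.19, 4: 5.16, 5: 5.21}


def best_setting(ppl_by_value):
    """Return the hyper-parameter value with the lowest validation perplexity."""
    return min(ppl_by_value, key=ppl_by_value.get)


if __name__ == "__main__":
    print("best lambda:", best_setting(lambda_ppl))
    print("best delta:", best_setting(delta_ppl))
```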

4.4.2 Latent Graph Size $K$.

As illustrated in the first section of Table 7, we varied $K$ from 7 to 16 in steps of three. The overall performance of $\mathbb{MST}_{\mathbb{MIXER}}$ peaked when using $K=10$ tokens from each modality as the graphs’ node features. Using higher values of $K$ rendered the learning of the global latent graphs with $K\times N$ nodes more difficult and thus hurt the overall performance of our model. This is underlined by the behavior of the global ELBO loss $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{global}}$ as illustrated in Figure 5a. Using $K=7$ also hurt the performance of our model across all reported metrics. We posit that low values of $K$ are not sufficient to capture the most influential constituents of each modality. Therefore, we set $K=10$ in the rest of the experiments.

4.4.3 Multi-Modal State Tracking GNNs.

In each row of the middle section of Table 7, we ablated one GNN-based tracking module and kept the remaining ones unchanged. Our full model outperformed all these ablated versions despite them having access to the same input features. The comparable results of all these ablated versions validate the use of a uniform graph size $K$ for all different modalities. Finally, we replaced all GNNs (local and global) with vanilla transformer layers. As can be seen from the last row of the middle section, this version was outperformed by our full model as well, which underlines the efficacy of our proposed multi-modal graph learning approach.

Figure 5: a) Larger values of $K$ make the learning of the global latent graphs more challenging. b) The local ELBO loss $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}$ facilitates the learning of the global latent graphs. c) The global ELBO loss $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{global}}$ facilitates the learning of the local latent graphs. All models use SAM and audio features.
Table 7: Comparison between different ablated versions of our model. All ablations use SAM and audio features. TRN means that the model replaces the global and local multi-modal GNNs with vanilla transformer layers and RAND denotes that it uses random latent graphs instead of learning them. Our full model is highlighted in blue.

| $K$ | GNNs | $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}$ | $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{global}}$ | # Params. | DSTC7 B-1 | DSTC7 B-4 | DSTC7 R | DSTC7 C | DSTC8 B-1 | DSTC8 B-4 | DSTC8 R | DSTC8 C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | All | | | $\sim$511M | 77.8 | 47.0 | 61.8 | 136.2 | 76.6 | 47.0 | 61.5 | 131.8 |
| 10 | All | | | $\sim$511M | $\mathbf{78.7}$ | $\mathbf{47.6}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{77.5}$ | $\mathbf{47.7}$ | $\mathbf{62.4}$ | $\mathbf{135.4}$ |
| 13 | All | | | $\sim$511M | 77.0 | 45.4 | 60.6 | 131.9 | 75.7 | 45.2 | 60.4 | 127.0 |
| 16 | All | | | $\sim$511M | 76.6 | 45.4 | 60.7 | 132.6 | 75.8 | 45.9 | 60.5 | 128.4 |
| 10 | w/o GNN$_{\textrm{rgb}}$ | | | $\sim$495M | 78.4 | 47.2 | 62.4 | 137.2 | 77.3 | 47.4 | 62.0 | 133.2 |
| 10 | w/o GNN$_{\textrm{flow}}$ | | | $\sim$495M | 78.5 | 47.1 | 62.5 | 138.5 | 76.9 | 47.2 | 61.9 | $\underline{134.1}$ |
| 10 | w/o GNN$_{\textrm{sam}}$ | | | $\sim$495M | 78.1 | 46.1 | 62.2 | 137.2 | 77.5 | 46.5 | 61.7 | 132.7 |
| 10 | w/o GNN$_{\textrm{vggish}}$ | | | $\sim$495M | 78.0 | 45.8 | 61.4 | 134.9 | 76.8 | 46.5 | 61.0 | 131.0 |
| 10 | w/o GNN$_{\textrm{H}}$ | | | $\sim$495M | 78.1 | 45.7 | 61.8 | 134.1 | 77.4 | 46.7 | 62.2 | 134.0 |
| 10 | w/o GNN$_{\textrm{Q}}$ | | | $\sim$495M | 78.2 | 47.1 | 62.1 | 138.5 | 77.0 | 47.0 | 61.8 | 133.6 |
| 10 | TRN | | | $\sim$500M | 77.8 | 46.9 | 61.8 | 136.6 | 76.8 | 46.7 | 61.4 | 131.8 |
| -- | -- | | | $\sim$411M | 76.6 | 45.1 | 60.8 | 131.3 | 74.2 | 42.3 | 61.1 | 126.9 |
| -- | w/ only $\tilde{A}_{i}$ | | | $\sim$413M | 76.5 | 45.4 | 60.9 | 131.7 | 75.2 | 45.5 | 60.7 | 130.3 |
| 10 | All | | | $\sim$416M | 75.9 | 44.5 | 59.8 | 127.8 | 74.3 | 44.2 | 59.2 | 122.8 |
| 10 | All | | | $\sim$506M | 77.5 | 46.4 | 61.4 | 134.9 | 76.2 | 46.6 | 60.9 | 130.6 |
10101010 All RAND RAND 448similar-toabsent448\sim 448∼ 448M 73.073.073.073.0 42.142.142.142.1 57.357.357.357.3 119.2119.2119.2119.2 71.471.471.471.4 41.641.641.641.6 57.157.157.157.1 114.2114.2114.2114.2
Table 8: Comparison between different ablated versions of our model. All ablations were trained with SAM and audio features and with the same optimal hyper-parameters as the full model. IB = Initialization Bias, MMC = Multi-Modal Conditioning.
| $\mathbb{MST}_{\mathbb{MIXER}}$ | # Params. | AVSD-DSTC7 B-1 | AVSD-DSTC7 B-4 | AVSD-DSTC7 R | AVSD-DSTC7 C | AVSD-DSTC8 B-1 | AVSD-DSTC8 B-4 | AVSD-DSTC8 R | AVSD-DSTC8 C |
|---|---|---|---|---|---|---|---|---|---|
| w/o MMC | $\sim 500$M | 76.9 | 46.6 | 61.4 | 135.5 | 75.8 | 46.1 | 60.5 | 130.9 |
| w/o IB | $\sim 511$M | 77.6 | 47.0 | 61.8 | 136.2 | 76.3 | 46.2 | 61.2 | 131.1 |
| Full | $\sim 511$M | $\mathbf{78.7}$ | $\mathbf{47.6}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{77.5}$ | $\mathbf{47.7}$ | $\mathbf{62.4}$ | $\mathbf{135.4}$ |

4.4.4 ELBO Losses.

As can be seen in the third section of Table 7, we conducted extensive experiments with different combinations of the ELBO losses: (1) We first ablated the learning of both global and local latent graphs, and therefore both ELBO losses, resulting in a plain BART model [43]. (2) We then used only the initial graphs $\tilde{A}_i$ as the final latent graph approximations in both training stages I and II, which led to improvements over plain BART. (3) Thereafter, we ablated the local ELBO loss and directly learned the global latent graphs. This version of our model underperformed BART, which is in accordance with our hypothesis that directly learning the global latent graphs is a daunting task and might therefore lead to performance drops. As illustrated in Figure 5b, $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{global}}$ converged faster and reached lower values when optimized jointly with $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}$. (4) We then ablated the global ELBO loss and learned only the local latent graphs, which led to performance increases compared to the previous versions.
This underlines that learning the local latent graphs is less sensitive to $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{global}}$ than learning the global latent graphs is to $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}$, as can be seen in Figure 5c. (5) We finally evaluated a version that has a computational complexity comparable to that of our full model but uses random latent graphs instead of learning them. As can be seen in Figure 5b, Figure 5c, and the last row of Table 7, both ELBO losses remained constant and the model reached the worst results among all ablated versions, empirically showcasing the importance of our latent graph learning approach.

4.4.5 Latent Graph Learning.

Lastly, we considered two additional ablations of $\mathbb{MST}_{\mathbb{MIXER}}$. Specifically, we first ablated the multi-modal conditioning (MMC) of Equation 6 and learned the local latent graphs of modality $i$ based only on its features $X_i$. This reduces Equation 8 to

$$A_i = \frac{1}{2}\tilde{A}_i + \frac{1}{2}\left(A'_i + A''_i\right). \tag{14}$$

Then, we trained a version without the initialization bias (IB) of Equation 8. As can be seen in Table 8, MMC is essential for high performance. Without it, $\mathbb{MST}_{\mathbb{MIXER}}$ achieved the lowest performance across all metrics. The same applies to IB, since not incorporating $\tilde{A}_i$ and only using the posterior approximation impeded the performance across all evaluation metrics.
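
To make the arithmetic of Equation 14 concrete, the following is a minimal plain-Python sketch of the element-wise combination. The function name `combine_graphs` and the toy $2 \times 2$ matrices are hypothetical and only serve as an illustration; $A'_i$ and $A''_i$ simply stand for the two learned graph estimates that appear in the equation, and the snippet is not taken from our implementation.

```python
def combine_graphs(a_init, a_prime, a_double_prime):
    """Element-wise A_i = 0.5 * A~_i + 0.5 * (A'_i + A''_i), following Equation 14."""
    return [
        [0.5 * init + 0.5 * (p + pp) for init, p, pp in zip(r0, r1, r2)]
        for r0, r1, r2 in zip(a_init, a_prime, a_double_prime)
    ]


# Toy example with two nodes (all values are made up for illustration).
a_init = [[1.0, 0.0], [0.0, 1.0]]          # initial graph A~_i
a_prime = [[0.2, 0.4], [0.4, 0.2]]         # first learned estimate A'_i
a_double_prime = [[0.0, 0.2], [0.2, 0.0]]  # second learned estimate A''_i

a_final = combine_graphs(a_init, a_prime, a_double_prime)
print(a_final)
```

The initial graph and the learned estimates enter with equal weight, so an edge that is absent from $\tilde{A}_i$ can still receive a non-zero value from the learned terms.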

Figure 6: Qualitative comparison of different model ablations on response generation and latent global graph inference of $q_\phi$ obtained from the last encoder layer. The diagonal blocks (from upper left to lower right) correspond to $V_{\textrm{rgb}}$, $V_{\textrm{flow}}$, $V_{\textrm{sam}}$, $A_{\textrm{vggish}}$, $T_{\textrm{H}}$, and $T_{\textrm{Q}}$, respectively.

4.5 Qualitative Results

Finally, in Figure 6 we give a qualitative comparison of $\mathbb{MST}_{\mathbb{MIXER}}$ with different ablated versions on response generation and global latent graph inference: Our full model managed to accurately answer the question, whereas both ablated versions failed to generate reliable responses. Furthermore, we can see how our full model better captured the local interactions within each modality (more structured diagonal blocks) as well as the global ones across modalities: Whereas the off-diagonal region (bordered in red) of the version “w/o $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}$” showed a clear divide between the modalities (dotted line), the full model mitigated this by producing more homogeneous values, indicating better inter-modal interactions. We provide more examples and failure cases in the supplementary material.
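
The block structure discussed above can be indexed as in the following sketch. It assumes that the nodes of the global graph are ordered modality by modality, in the order given in the caption of Figure 6, and that each modality contributes $K = 10$ nodes; the helper names and the all-zero placeholder graph are hypothetical.

```python
MODALITIES = ["V_rgb", "V_flow", "V_sam", "A_vggish", "T_H", "T_Q"]


def diagonal_block_ranges(modalities, k):
    """Map each modality to the half-open node index range [start, stop) of its block."""
    return {name: (idx * k, (idx + 1) * k) for idx, name in enumerate(modalities)}


def extract_block(global_graph, row_range, col_range):
    """Slice a sub-matrix out of a dense global graph stored as a list of rows."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [row[c0:c1] for row in global_graph[r0:r1]]


k = 10                                             # assumed number of nodes per modality
ranges = diagonal_block_ranges(MODALITIES, k)
n = len(MODALITIES) * k                            # NK = 60 nodes in total
global_graph = [[0.0] * n for _ in range(n)]       # placeholder for an inferred graph

# Diagonal block: interactions within the RGB modality.
rgb_block = extract_block(global_graph, ranges["V_rgb"], ranges["V_rgb"])
# Off-diagonal block: interactions between RGB nodes and question nodes.
rgb_to_question = extract_block(global_graph, ranges["V_rgb"], ranges["T_Q"])
```

Under these assumptions, the diagonal blocks hold the intra-modal edges and every other block holds inter-modal edges.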

5 Conclusion

We proposed $\mathbb{MST}_{\mathbb{MIXER}}$, a novel multi-modal state tracking model specifically geared towards video dialog. $\mathbb{MST}_{\mathbb{MIXER}}$ first identifies the most influential constituents at different semantic levels (e.g. across modalities and encoder layers). Then, it relies on a two-stage divide-and-conquer approach to infer the missing underlying structure of the mix of all modalities and leverages it to augment the hidden states of the backbone VLM using GNNs. Through extensive ablation experiments and evaluations on five video-and-language benchmarks, we show the effectiveness and generalization capabilities of our approach.
Acknowledgments. A. Bulling was funded by the European Research Council (ERC; grant agreement 801708) and L. Shi was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2075–390740016.

Appendix A ELBO Derivation & Implementation

In this section, we derive the ELBO loss and show how it can be used as an optimization term in our total loss. Without loss of generality, we only consider the ELBO in the global setting. Given the intractable posterior $p_\theta(A'|\tilde{A},X)$ and its approximation $q_\phi(A'|\tilde{A},X)$, it holds that

\begin{align}
\mathcal{D}_{\mathrm{KL}}\left(q_\phi(A'|\tilde{A},X)\,\|\,p_\theta(A'|\tilde{A},X)\right) &= \mathbb{E}_{q_\phi(A'|\tilde{A},X)}\left[\log\frac{q_\phi(A'|\tilde{A},X)}{p_\theta(A'|\tilde{A},X)}\right] \tag{15}\\
&= \mathbb{E}_{q_\phi(A'|\tilde{A},X)}\left[\log\frac{q_\phi(A'|\tilde{A},X)\,p_\theta(\tilde{A}|X)}{p_\theta(A',\tilde{A}|X)}\right] \tag{16}\\
&= \mathbb{E}_{q_\phi(A'|\tilde{A},X)}\left[\log\frac{q_\phi(A'|\tilde{A},X)}{p_\theta(A',\tilde{A}|X)}\right] + \log p_\theta(\tilde{A}|X) \tag{17}\\
&= \underbrace{\log p_\theta(\tilde{A}|X)}_{\textrm{Log-evidence}} - \underbrace{\mathbb{E}_{q_\phi(A'|\tilde{A},X)}\left[\log\frac{p_\theta(A',\tilde{A}|X)}{q_\phi(A'|\tilde{A},X)}\right]}_{=:\,\mathcal{L}_{\mathrm{ELBO}}^{\mathrm{global}}} \geq 0 \tag{18}
\end{align}

Thus, as its name suggests, the ELBO serves as a lower bound of the log-evidence. As a result, VI tries to maximize the ELBO, which is equivalent to minimizing the Kullback-Leibler divergence between $q_\phi(A'|\tilde{A},X)$ and the intractable posterior $p_\theta(A'|\tilde{A},X)$, leading to a better estimation of the latter. Since we used the ELBOs as terms in the total loss $\mathcal{L}$ to be minimized, we had to use the opposite value of each one of them. This explains the minus sign in Equation 12 in the main text. Since $q_\phi$ and $p_\theta$ only output normalized scores as the prediction for each edge, we appended zero vectors to both predictions in order to convert the raw scores to a two-value probability before applying the log-softmax function. We provide in Listing 1 a code snippet of our implementation of the ELBO loss.
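
The identity behind Equations 15-18, namely that the gap between the log-evidence and the ELBO equals the KL divergence between the approximation and the exact posterior, can be checked numerically on a toy problem. The sketch below assumes a single binary latent variable with made-up probabilities; it is a sanity check of the algebra and not part of our model.

```python
import math

# Toy setting (all numbers are made up): a binary latent variable a and one fixed observation.
joint = [0.3, 0.1]   # p(a, observation) for a = 0 and a = 1
q = [0.6, 0.4]       # variational approximation q(a | observation)

evidence = sum(joint)                       # p(observation)
posterior = [p / evidence for p in joint]   # exact posterior p(a | observation)

# ELBO = E_q[log p(a, observation) - log q(a)]
elbo = sum(qa * math.log(pa / qa) for qa, pa in zip(q, joint))
# KL(q || posterior) = E_q[log q(a) - log p(a | observation)]
kl = sum(qa * math.log(qa / pa) for qa, pa in zip(q, posterior))
# Gap between the log-evidence and the ELBO.
gap = math.log(evidence) - elbo

print(f"log-evidence = {math.log(evidence):.4f}, ELBO = {elbo:.4f}, KL = {kl:.4f}, gap = {gap:.4f}")
```

The gap and the KL divergence coincide up to floating-point error, and both are non-negative, so maximizing the ELBO tightens the bound on the log-evidence.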

Appendix B Datasets

B.1 AVSD

The AVSD dataset [4] was released in the 7th Dialogue System Technology Challenge (DSTC7) [74]. As can be seen from Table 9, it contains 7,659, 1,787, and 1,710 dialogs for training, validation, and testing, respectively. The data for DSTC8 [28] and DSTC10 [58] were only released with 1,710 and 1,804 dialogs for testing, respectively. For all testing splits, six human-generated reference answers were provided for each dialog in order to compute the generation metrics.

B.2 SIMMC 2.0

SIMMC 2.0 [31] is a task-oriented dataset that was proposed for virtual assistance scenarios and contains 11k dialogs with 52,044 unique questions grounded in 5,440 videos from the shopping domain. Its visual and textual data were automatically generated in constrained and pre-defined settings, resulting in less complex and challenging scenes compared to AVSD. As can be seen in Figure 7, AVSD features a larger variety of objects that humans interact with daily, more complex dynamics, and more challenging illumination conditions. On the other hand, SIMMC 2.0 only comes with simple items linked to the shopping domain.

Table 9: Summary of the AVSD dataset with all test splits from DSTC7, DSTC8, and DSTC10.
| | Train | Val | Test (DSTC7) | Test (DSTC8) | Test (DSTC10) |
|---|---|---|---|---|---|
| # Dialogs/Videos | 7,659 | 1,787 | 1,710 | 1,710 | 1,804 |
| # Questions/Answers | 153,180 | 35,740 | 13,490 | 18,810 | 28,406 |
| # Words | 1,450,754 | 339,006 | 110,252 | 162,226 | 272,606 |
Table 10: Summary of the open-ended NExT-QA dataset.
| | Train | Val | Test |
|---|---|---|---|
| # Videos | 3,870 | 570 | 1,000 |
| # Questions | 37,523 | 5,343 | 9,178 |

B.3 NExT-QA

NExT-QA [67] was recently introduced as a next-generation video question answering benchmark to advance video understanding from describing to explaining temporal actions. Table 10 gives more insight into the statistics of the dataset.

Figure 7: Comparison between the visual complexity of AVSD (a) and SIMMC 2.0 (b). For ethical reasons, we blurred the faces of people appearing in the video frames.

Appendix C Experimental Setup

C.1 Hardware & Environment

We implemented our model in PyTorch [54] and trained it on a cluster consisting of 8 Nvidia Tesla V100 (32GB) GPUs, 2 Intel(R) Xeon(R) Platinum 8160 CPUs, and 1.5TB of RAM.

C.2 Training

We trained $\mathbb{MST}_{\mathbb{MIXER}}$ end-to-end using AdamW [47] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ and a linear learning rate schedule with warm-up for a maximum of 12 epochs. We utilized a learning rate $\mathrm{lr}_{\mathrm{BART}} = 10^{-5}$ for the weights of the BART model and a learning rate $\mathrm{lr}_{\mathrm{rest}} = 10^{-4}$ for the rest of the parameters of our model. Similarly to $\lambda$ and $\Delta$, we validated the choice of the ELBO loss coefficients $\alpha_2$ and $\alpha_3$ based on the validation perplexity. Specifically, we performed a grid search using the value set $\{1, 10, 100, 1000\}$ while keeping $\lambda = 0.9$, $\Delta = 4$, and $K = 10$. The training of our full model takes approximately 20 hours to finish. Complete details about the hyperparameter values are listed in Table 12.
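
The schedule and the search space described above can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the learning rate is taken to decay linearly to zero after warm-up, the two coefficients $\alpha_2$ and $\alpha_3$ are taken to be searched jointly, and the step counts are hypothetical since the number of warm-up steps is not stated here.

```python
from itertools import product

BASE_LRS = {"bart": 1e-5, "rest": 1e-4}   # learning rates of the two parameter groups
ALPHA_GRID = [1, 10, 100, 1000]           # candidate values for the ELBO loss coefficients


def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier that rises linearly to 1 during warm-up, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def learning_rates(step, warmup_steps, total_steps):
    """Scale both base learning rates by the same schedule multiplier."""
    scale = linear_warmup_decay(step, warmup_steps, total_steps)
    return {group: lr * scale for group, lr in BASE_LRS.items()}


# All (alpha_2, alpha_3) pairs of a joint grid search: 4 x 4 = 16 configurations.
alpha_pairs = list(product(ALPHA_GRID, repeat=2))

# Hypothetical step counts, chosen only to make the example concrete.
lrs_at_peak = learning_rates(step=100, warmup_steps=100, total_steps=1000)
print(len(alpha_pairs), lrs_at_peak)
```

Both parameter groups follow the same schedule shape and differ only in their peak learning rate.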

C.3 Inference

Similar to previous works, we utilized beam search with a depth of 5 and a length penalty of 0.3 to generate the answers. Each answer is composed of at most 20 tokens. Our model takes about 2s to answer one question.
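
To illustrate how a length penalty affects the choice among finished beam hypotheses, the sketch below ranks candidates by a length-normalized score. It assumes the common convention of dividing the summed log-probability by the hypothesis length raised to the penalty; whether our decoding uses exactly this formula is not stated above. The candidate answers and their scores are made up.

```python
def length_normalized_score(sum_logprob, num_tokens, length_penalty=0.3):
    """Score a finished hypothesis as sum_logprob / num_tokens ** length_penalty."""
    return sum_logprob / (num_tokens ** length_penalty)


def pick_best(hypotheses, length_penalty=0.3, max_tokens=20):
    """Return the highest-scoring hypothesis among those within the token budget."""
    valid = [h for h in hypotheses if len(h["tokens"]) <= max_tokens]
    return max(
        valid,
        key=lambda h: length_normalized_score(
            h["sum_logprob"], len(h["tokens"]), length_penalty
        ),
    )


# Made-up finished hypotheses (a full beam would hold up to five of them).
beam = [
    {"tokens": ["yes"], "sum_logprob": -1.2},
    {"tokens": ["yes", ",", "he", "leaves", "the", "room"], "sum_logprob": -1.9},
]

best = pick_best(beam)                                   # penalty of 0.3
best_without_penalty = pick_best(beam, length_penalty=0.0)
print(" ".join(best["tokens"]))
```

With the penalty set to 0.3 the longer answer wins in this toy example, whereas without normalization the one-token answer would be preferred.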

Appendix D Additional Ablations

GNN Types.

We experimented with different types of GNNs within our full model. As depicted in Table 11(a), the combination of $\mathbb{MST}_{\mathbb{MIXER}}$ with APPNP [18] led to the best overall performance compared to other GNNs such as GAT [63], GCN [29], and SAGE [21].
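
For readers unfamiliar with APPNP, its propagation rule is $Z^{(0)} = H$ and $Z^{(k+1)} = (1 - \alpha)\,\hat{A} Z^{(k)} + \alpha H$, where $\hat{A}$ is the symmetrically normalized adjacency matrix with self-loops. The following dense, pure-Python sketch illustrates this rule with the values $K = 2$ and $\alpha = 0.1$ listed in Table 12. It is an illustration on a toy graph with hypothetical helper names and made-up features, not our implementation.

```python
def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense, non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / ((deg[i] ** 0.5) * (deg[j] ** 0.5)) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def appnp(adj, h, num_steps=2, alpha=0.1):
    """Run num_steps rounds of Z <- (1 - alpha) * A_hat @ Z + alpha * H, starting from Z = H."""
    a_norm = normalize_adjacency(adj)
    z = h
    for _ in range(num_steps):
        prop = matmul(a_norm, z)
        z = [[(1.0 - alpha) * prop[i][j] + alpha * h[i][j] for j in range(len(h[0]))]
             for i in range(len(h))]
    return z


# Toy path graph 0 - 1 - 2 with two-dimensional node features (values are made up).
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = appnp(adj, h, num_steps=2, alpha=0.1)
print(z)
```

The teleport term $\alpha H$ re-injects the input features at every step, which limits over-smoothing even when several propagation steps are used.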

Model Size.

Moreover, we experimented with different sizes of our model. As depicted in Table 11(b), the variant of $\mathbb{MST}_{\mathbb{MIXER}}$ that is based on BART-base substantially under-performed the large variant across all evaluation metrics of both datasets.

Table 11: Additional ablations of $\mathbb{MST}_{\mathbb{MIXER}}$.
(a) Performance comparison of our best model using different GNN types.
| $\mathbb{MST}_{\mathbb{MIXER}}$ | AVSD-DSTC7 B-4 | AVSD-DSTC7 R | AVSD-DSTC7 C | AVSD-DSTC8 B-4 | AVSD-DSTC8 R | AVSD-DSTC8 C |
|---|---|---|---|---|---|---|
| w/ GAT | $\underline{46.7}$ | 61.5 | 135.4 | 46.5 | 60.9 | 129.4 |
| w/ GCN | 46.6 | $\underline{61.9}$ | $\underline{136.7}$ | $\underline{46.7}$ | $\underline{61.6}$ | $\underline{131.6}$ |
| w/ SAGE | 46.0 | 61.2 | 133.4 | 45.8 | 60.9 | 129.3 |
| w/ APPNP | $\mathbf{47.6}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{47.7}$ | $\mathbf{62.3}$ | $\mathbf{134.9}$ |

(b) Performance comparison between different model sizes. “Base” and “Large” mean that $\mathbb{MST}_{\mathbb{MIXER}}$ uses a base or a large backbone, respectively.
| $\mathbb{MST}_{\mathbb{MIXER}}$ | AVSD-DSTC7 B-4 | AVSD-DSTC7 R | AVSD-DSTC7 C | AVSD-DSTC8 B-4 | AVSD-DSTC8 R | AVSD-DSTC8 C |
|---|---|---|---|---|---|---|
| Base ($\Delta = 2$) | 39.8 | 60.0 | 113.9 | 40.1 | 55.4 | 110.2 |
| Large ($\Delta = 4$) | $\mathbf{47.6}$ | $\mathbf{62.5}$ | $\mathbf{138.8}$ | $\mathbf{46.7}$ | $\mathbf{61.6}$ | $\mathbf{131.6}$ |

Appendix E Qualitative Results

We provide additional extensive qualitative examples of our best model and some of its ablated versions for comparison in Figure 8. Finally, we give some failure cases in Figure 9.

```python
# ---------------------------------
# Implementation of the ELBO loss
# ---------------------------------
import torch
import torch.nn as nn
import torch.nn.functional as F


class ELBO(nn.Module):
    def __init__(self):
        super(ELBO, self).__init__()

    def forward(self, Aq, Ap):
        """
        Args:
            Aq: The predicted latent graph of q_phi
                shape = (batch_size, K, K)   -- local graphs
                shape = (batch_size, NK, NK) -- global graphs

            Ap: The predicted latent graph of p_theta
                shape = (batch_size, K, K)   -- local graphs
                shape = (batch_size, NK, NK) -- global graphs

        Returns:
            The ELBO loss
        """
        # Flatten to one score per edge: shape (num_edges, 1).
        Aq_flat = Aq.reshape(-1).unsqueeze(-1)
        Ap_flat = Ap.reshape(-1).unsqueeze(-1)

        # Prepend a zero score so that every edge has two values.
        Aq_flat = torch.cat(
            [torch.zeros_like(Aq_flat), Aq_flat], dim=-1)
        Ap_flat = torch.cat(
            [torch.zeros_like(Ap_flat), Ap_flat], dim=-1)

        log_Aq = F.log_softmax(Aq_flat, dim=1)
        log_Ap = F.log_softmax(Ap_flat, dim=1)

        Aq_dist = torch.exp(log_Aq)

        # Mean of q * (log q - log p) over all edges and both values.
        loss_Aq = torch.mean(log_Aq * Aq_dist)
        loss_Ap = torch.mean(log_Ap * Aq_dist)

        elbo_loss = loss_Aq - loss_Ap

        return elbo_loss
```
Listing 1: PyTorch implementation of the ELBO loss. Since $q_\phi$ and $p_\theta$ only output normalized scores as the prediction for each edge, we append zero vectors to both predictions (the two `torch.cat` calls) to convert the raw scores to a two-value probability before applying softmax.
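
The effect of prepending a zero score in Listing 1 can be verified in isolation: a softmax over the pair $[0, s]$ assigns probability $\sigma(s) = 1 / (1 + e^{-s})$ to the second entry. The sketch below checks this with the standard library only; interpreting the second entry as the probability that the edge exists is our reading for illustration, and the score value is arbitrary.

```python
import math


def two_class_log_probs(score):
    """Log-softmax over the pair [0, score], mirroring the zero-prepending step."""
    log_norm = math.log(1.0 + math.exp(score))   # log(exp(0) + exp(score))
    return [-log_norm, score - log_norm]


def sigmoid(score):
    """Logistic function 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))


score = 0.8   # arbitrary edge score, for illustration only
log_first, log_second = two_class_log_probs(score)
print(math.exp(log_first), math.exp(log_second), sigmoid(score))
```

The two probabilities sum to one, which is what allows the per-edge KL term in the loss to be computed between proper two-value distributions.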
Table 12: Detailed hyperparameter setting of the training and inference of our best $\mathbb{MST}_{\mathbb{MIXER}}$ model.
| Category | Hyperparameter | Value |
|---|---|---|
| Model Architecture | Dimension of I3D rgb / I3D flow features $d_v$ | 2048 |
| | Dimension of SAM features $d_s$ | 512 |
| | Maximum length of I3D rgb / I3D flow / SAM features $l_v$ | 36 |
| | Dimension of audio features $d_a$ | 128 |
| | Maximum length of audio features $l_a = l_v$ | 36 |
| | Maximum total length of multi-modal input | 1024 |
| | Dimension of hidden features $d$ | 1024/768 |
| | Number of node features in local GNNs $K$ | 10 |
| | Number of nearest neighbors (kNN) in $\{\tilde{A}_i\}$ | 4 |
| | Number of learnable weights of Equation 7 | 8 |
| | Input dimension of GNNs in Table 11(a) | 1024 |
| | Output dimension of GNNs in Table 11(a) | 1024 |
| | $K$ value of APPNP | 2 |
| | $\alpha$ value of APPNP | 0.1 |
| | Number of attention heads in local GATs | 2 |
| | Number of attention heads in global GATs | 4 |
| | $\lambda$ value | 0.9 |
| | $\Delta$ value | 4 |
| Optimization | Optimizer | AdamW |
| | Learning rate of parameters in the VLM backbone $\mathrm{lr}_{\mathrm{BART}}$ | 1e-5 |
| | Learning rate of other parameters $\mathrm{lr}_{\mathrm{rest}}$ | 1e-4 |
| | Values of $\{\alpha_1, \alpha_2, \alpha_3\}$ | $\{1, 100, 100\}$ |
| | Learning rate schedule | linear |
| | Dropout rate | 0.1 |
| | Value of gradient clipping | 1.0 |
| | Effective batch size | 96 |
| | Number of epochs | 12 |
| Hardware | GPU model | Tesla V100-32GB |
| | Number of GPUs | 8 |
| | Distributed training | PyTorch DDP |
| Inference | Maximum number of response tokens | 20 |
| | Depth of beam search | 5 |
| | Length penalty in beam search | 0.3 |
| | Batch size | 1 |
(a) Dialog with video-id = 3A9IC.
(b) Dialog with video-id = 2EW71.
(c) Dialog with video-id = K2XKT.
(d) Dialog with video-id = UO7PC. Although the ablated version ($\mathbb{MST}_{\mathbb{MIXER}}$ w/o $\mathcal{L}_{\textrm{ELBO}}^{\mathrm{local}}$) reached a BLEU-4 score of 41.11, it incorrectly answered the question since the person did leave the filming area, as can be seen from the last frames of the video.
Figure 8: Qualitative results on data samples from the test split of AVSD-DSTC7. For ethical reasons, we blurred the faces of people appearing in the video frames.
(a) Dialog with video-id = 3DR7T.
(b) Dialog with video-id = HKGAX.
(c) Dialog with video-id = 1K4NH.
(d) Dialog with video-id = 36QP8.
Figure 9: Negative qualitative results on data samples from the test split of AVSD-DSTC7. For ethical reasons, we blurred the faces of people appearing in the video frames.

References

  • [1] Abdessaied, A., Hochmeister, M., Bulling, A.: OLViT: Multi-modal state tracking via attention-based embeddings for video-grounded dialog. In: LREC-COLING (2024)
  • [2] Abdessaied, A., Shi, L., Bulling, A.: VD-GR: Boosting Visual Dialog With Cascaded Spatial-Temporal Multi-Modal Graphs. In: WACV (2024)
  • [3] Alamri, H., Bilic, A., Hu, M., Beedu, A., Essa, I.: End-to-end multimodal representation learning for video dialog. In: NeurIPS (2022)
  • [4] Alamri, H., Cartillier, V., Das, A., Wang, J., Cherian, A., Essa, I., Batra, D., Marks, T.K., Hori, C., Anderson, P., et al.: Audio visual scene-aware dialog. In: CVPR (2019)
  • [5] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)
  • [6] Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Deep compositional question answering with neural module networks. In: CVPR (2016)
  • [7] Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: NAACL (2016)
  • [8] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual Question Answering. In: ICCV (2015)
  • [9] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005)
  • [10] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)
  • [11] Chen, Y., Wu, L., Zaki, M.: Iterative deep graph learning for graph neural networks: Better and robust node embeddings. NeurIPS (2020)
  • [12] Chen, Z., Liu, H., Wang, Y.: DialogMCF: Multimodal Context Flow for Audio Visual Scene-Aware Dialog. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023)
  • [13] Chu, Y.W., Lin, K.Y., Hsu, C.C., Ku, L.W.: Multi-step joint-modality attention network for scene-aware dialogue system. In: DSTC Workshop @ AAAI (2020)
  • [14] Colson, B., Marcotte, P., Savard, G.: An overview of bilevel optimization. Annals of operations research (2007)
  • [15] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M., Parikh, D., Batra, D.: Visual Dialog. In: CVPR (2017)
  • [16] Elinas, P., Bonilla, E.V., Tiao, L.: Variational inference for graph convolutional networks in the absence of graph data and adversarial settings. NeurIPS (2020)
  • [17] Franceschi, L., Niepert, M., Pontil, M., He, X.: Learning discrete structures for graph neural networks. In: ICML (2019)
  • [18] Gasteiger, J., Bojchevski, A., Günnemann, S.: Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In: ICLR (2019)
  • [19] Girdhar, R., Ramanan, D.: CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning. In: ICLR (2020)
  • [20] Guo, X., Wu, H., Cheng, Y., Rennie, S., Tesauro, G., Feris, R.: Dialog-based interactive image retrieval. NeurIPS 31 (2018)
  • [21] Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs (2017)
  • [22] Hori, C., Alamri, H., Wang, J., Wichern, G., Hori, T., Cherian, A., Marks, T.K., Cartillier, V., Lopes, R.G., Das, A., Essa, I., Batra, D., Parikh, D.: End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In: ICASSP (2019)
  • [23] Huang, X., Tan, H.L., Leong, M.C., Sun, Y., Li, L., Jiang, R., Kim, J.: Investigation on transformer-based multi-modal fusion for audio-visual scene-aware dialog. In: DSTC10 Workshop @ AAAI (2022)
  • [24] Jiang, P., Han, Y.: Reasoning with heterogeneous graph alignment for video question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
  • [25] **, Y., Niu, G., Xiao, X., Zhang, J., Peng, X., Yu, J.: Knowledge-Constrained Answer Generation for Open-Ended Video Question Answering. In: AAAI (2023)
  • [26] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  • [27] Kim, J., Yoon, S., Kim, D., Yoo, C.D.: Structured co-reference graph attention for video-grounded dialogue. In: AAAI (2021)
  • [28] Kim, S., Galley, M., Gunasekara, C., Lee, S., Atkinson, A., Peng, B., Schulz, H., Gao, J., Li, J., Adada, M., et al.: The eighth dialog system technology challenge. arXiv preprint arXiv:1911.06394 (2019)
  • [29] Kipf, T.N., Welling, M.: Semi-Supervised Classification with Graph Convolutional Networks. In: ICLR (2017)
  • [30] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [31] Kottur, S., Moon, S., Geramifard, A., Damavandi, B.: SIMMC 2.0: A task-oriented dialog dataset for immersive multimodal conversations. In: EMNLP (2021)
  • [32] Le, H., Chen, N.F., Hoi, S.: Learning reasoning paths over semantic graphs for video-grounded dialogues. In: ICLR (2021)
  • [33] Le, H., Chen, N.F., Hoi, S.C.H.: VGNMN: video-grounded neural module network to video-grounded language tasks. In: NAACL (2022)
  • [34] Le, H., Chen, N.F., Hoi, S.C.: Multimodal Dialogue State Tracking. In: NAACL (2022)
  • [35] Le, H., Hoi, S.C.: Video-Grounded Dialogues with Pretrained Generation Language Models. In: ACL (2020)
  • [36] Le, H., Sahoo, D., Chen, N., Hoi, S.: Multimodal transformer networks for end-to-end video-grounded dialogue systems. In: ACL (2019)
  • [37] Le, H., Sahoo, D., Chen, N., Hoi, S.C.: BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues. In: EMNLP (2020)
  • [38] Le, H., Sankar, C., Moon, S., Beirami, A., Geramifard, A., Kottur, S.: DVD: A diagnostic dataset for multi-step reasoning in video grounded dialogue. In: ACL (2021)
  • [39] Le, H., Socher, R., Hoi, S.C.: Non-autoregressive dialog state tracking. In: ICLR (2020)
  • [40] Le, T.M., Le, V., Venkatesh, S., Tran, T.: Hierarchical conditional relation networks for video question answering. In: CVPR (2020)
  • [41] Lee, H., Kwon, O.J., Choi, Y., Park, M., Han, R., Kim, Y., Kim, J., Lee, Y., Shin, H., Lee, K., Kim, K.E.: Learning to embed multi-modal contexts for situated conversational agents. In: NAACL-Findings (Jul 2022)
  • [42] Lee, H., Lee, J., Kim, T.Y.: SUMBT: Slot-utterance matching for universal and scalable belief tracking. In: ACL (2019)
  • [43] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: ACL (2020)
  • [44] Li, Y., Hui, B., Yin, Z., Yang, M., Huang, F., Li, Y.: PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts. In: ACL (2023)
  • [45] Li, Z., Li, Z., Zhang, J., Feng, Y., Zhou, J.: Bridging text and video: A universal multimodal transformer for audio-visual scene-aware dialog. Transactions on Audio, Speech, and Language Processing (2021)
  • [46] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
  • [47] Loshchilov, I., Hutter, F.: Fixing weight decay regularization in adam. In: ICLR (2019)
  • [48] Malinowski, M., Fritz, M.: A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In: NeurIPS (2014)
  • [49] Moon, S., Kottur, S., Crook, P.A., De, A., Poddar, S., Levin, T., Whitney, D., Difranco, D., Beirami, A., Cho, E., Subba, R., Geramifard, A.: Situated and interactive multimodal conversations. In: COLING (2020)
  • [50] Mou, X., Sigouin, B., Steenstra, I., Su, H.: Multimodal dialogue state tracking by QA approach with data augmentation. In: DSTC8 Workshop @ AAAI (2020)
  • [51] Mrkšić, N., Ó Séaghdha, D., Wen, T.H., Thomson, B., Young, S.: Neural belief tracker: Data-driven dialogue state tracking. In: ACL (2017)
  • [52] Pang, W., Wang, X.: Visual dialogue state tracking for question generation. In: AAAI (2020)
  • [53] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: ACL (2002)
  • [54] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: NeurIPS (2019)
  • [55] Pham, H.A., Le, T.M., Le, V., Phuong, T.M., Tran, T.: Video Dialog as Conversation about Objects Living in Space-Time. In: ECCV (2022)
  • [56] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog (2019)
  • [57] Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: ICML (2014)
  • [58] Shah, A., Geng, S., Gao, P., Cherian, A., Hori, T., Marks, T.K., Le Roux, J., Hori, C.: Audio-visual scene-aware dialog and reasoning using audio-visual transformers with joint student-teacher learning. In: ICASSP (2022)
  • [59] Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: Crowdsourcing data collection for activity understanding. In: ECCV (2016)
  • [60] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
  • [61] Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Generative Pretraining in Multimodality. arXiv preprint arXiv:2307.05222 (2023)
  • [62] Vedantam, R., Zitnick, C.L., Parikh, D.: CIDEr: Consensus-based Image Description Evaluation. In: CVPR (2015)
  • [63] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph Attention Networks. In: ICLR (2018)
  • [64] Wu, C.S., Madotto, A., Hosseini-Asl, E., Xiong, C., Socher, R., Fung, P.: Transferable multi-domain state generator for task-oriented dialogue systems. In: ACL (2019)
  • [65] Wu, Q., Zhao, W., Li, Z., Wipf, D., Yan, J.: Nodeformer: A scalable graph structure learning transformer for node classification. In: NeurIPS (2022)
  • [66] Wu, Y., Macdonald, C., Ounis, I.: Multi-modal dialog state tracking for interactive fashion recommendation. In: ACM RecSys (2022)
  • [67] Xiao, J., Shang, X., Yao, A., Chua, T.: Next-qa: Next phase of question-answering to explaining temporal actions. In: CVPR (2021)
  • [68] Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y.: Video question answering via gradually refined attention over appearance and motion. In: ACM MM (2017)
  • [69] Xu, J., Mei, T., Yao, T., Rui, Y.: Msr-vtt: A large video description dataset for bridging video and language. In: CVPR (2016)
  • [70] Xu, P., Hu, Q.: An end-to-end approach for handling unknown slot values in dialogue state tracking. In: ACL (2018)
  • [71] Yang, J., Liu, Z., Xiao, S., Li, C., Lian, D., Agrawal, S., Singh, A., Sun, G., Xie, X.: GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph. In: NeurIPS (2021)
  • [72] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., Liu, T.Y.: Do transformers really perform badly for graph representation? In: NeurIPS (2021)
  • [73] Yoon, S., Yoon, E., Yoon, H.S., Kim, J., Yoo, C.: Information-theoretic text hallucination reduction for video-grounded dialogue. In: EMNLP (2022)
  • [74] Yoshino, K., Hori, C., Perez, J., D’Haro, L.F., Polymenakos, L., Gunasekara, C., Lasecki, W.S., Kummerfeld, J.K., Galley, M., Brockett, C., et al.: Dialog system technology challenge 7. arXiv preprint arXiv:1901.03461 (2019)
  • [75] Yu, Y., Chen, J., Gao, T., Yu, M.: DAG-GNN: DAG structure learning with graph neural networks. In: ICML (2019)
  • [76] Zhang, H., Liu, M., Wang, Y., Cao, D., Guan, W., Nie, L.: Uncovering hidden connections: Iterative tracking and reasoning for video-grounded dialog. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)