ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation

Yuting Liu, Enneng Yang (Northeastern University, China), Yizhou Dang (Northeastern University, China), Guibing Guo (Northeastern University, China), Qiang Liu (Chinese Academy of Sciences, China), Yuliang Liang (Northeastern University, China), Linying Jiang (Northeastern University, China), and Xingwei Wang (Northeastern University, China)
Abstract.

Multimodal recommendation aims to model user and item representations comprehensively with the involvement of multimedia content for effective recommendations. Existing research has shown that it is beneficial for recommendation performance to combine (user- and item-) ID embeddings with multimodal salient features, indicating the value of IDs. However, there is a lack of a thorough analysis of the ID embeddings in terms of feature semantics in the literature. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study regarding its semantics, which we recognize as subtle features of content and structure. Based on our findings, we propose a novel recommendation model by incorporating ID embeddings to enhance the salient features of both content and structure. Specifically, we put forward a hierarchical attention mechanism to incorporate ID embeddings in modality fusing, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolution network for each modality to amalgamate neighborhood and ID embeddings for improving structural representations. Finally, the content and structure representations are combined to form the ultimate item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings. Our code is available at https://anonymous.4open.science/r/IDSF-code/.

Keywords: Multimodal Recommendation, Subtle Features, Salient Features, Content and Structure
CCS Concepts: Information systems → Recommender systems

1. Introduction

Research on multimodal recommendation has flourished, primarily thanks to its capability to improve item representations by integrating various sources of multimedia information (Zhou et al., 2020; He et al., 2018; Cao et al., 2019; Liu et al., 2022), with text and image being the two most widely adopted modalities. A prevalent viewpoint in existing research is that the key to enhancing item representations is to extract salient semantic features from multiple modalities. In contrast, ID embedding has not attracted sufficient attention and has not been thoroughly explored in multimodal recommendation, despite its well-demonstrated value in both traditional (Koren et al., 2009; Guo et al., 2015) and multimodal (He and McAuley, 2016; Wei et al., 2019; Tao et al., 2022) recommendation. The specific information captured by ID embeddings remains unclear, resulting in suboptimal strategies for their utilization and leaving ample room for further improvement in multimodal recommendation.

In this paper, we revisit the value of ID embeddings and conduct case studies to explore the intricate details within them, which we refer to as subtle features. We argue that ID embeddings should be regarded as a mixture of subtle features of both content and structure, as shown in Section 2. Based on this analysis, we classify the use of ID embeddings in recommendation into two lines of research. In the first, ID embeddings are regarded as vectors of content features that contain some semantic attributes of an item. They are learned from the historical interactions between users and items via approaches such as matrix factorization (Koren et al., 2009; Guo et al., 2019), and can be concatenated with multimodal features to enrich item representations in multimodal recommendation (He and McAuley, 2016; Chen et al., 2017; Liu et al., 2017). In the second, ID embeddings are interpreted as vectors of structural features that capture the features of all neighboring items. They are trained on the user-item interaction bipartite graph, where each item is linked with a set of users (i.e., neighbors), through Graph Convolutional Network (GCN) approaches (Gori and Pucci, 2007; Bruna et al., 2014; Defferrard et al., 2016). They can enhance item representations by being aggregated with each modal feature obtained from modal-specific interaction graphs in multimodal recommendation (Wei et al., 2019, 2020; Tao et al., 2022; Wu et al., 2019; Mao et al., 2021; Liu et al., 2021; Wu et al., 2021).

Figure 1. An item consists of textual and visual modalities and ID information. Features in terms of content and structure can be enhanced in each modality separately. The item representation can be obtained by fusing content and structural features. Note that traditional ID embeddings are treated as content or structural features, while our approach takes ID embeddings as a mixture of subtle features of both content and structure.

However, these approaches mainly focus on the salient features within multimodal sources and treat the ID as a whole, combining it with them in a rudimentary manner. Consequently, most research concentrates on extracting and integrating multimodal information without a detailed exploration of the content and structural features within the ID. In our view, ID embeddings reflect the subtle features of items, and their usage should be carefully considered from the standpoint of content and structure, as an independent information source for enhancing item representations.

On this basis, we propose a novel multimodal recommender called IDSF that interprets ID embeddings as Subtle Features of both content and structure. ID embeddings provide additional signals that enhance the semantics of the salient features extracted under each modality, in terms of both content and structure, leading to improved item representations. Specifically, IDSF consists of two main modules that learn item content and structural features from all modalities. Figure 1 illustrates an intuitive example derived from the Clothing dataset (https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html), where an item consists of content and structural features learned from three sources of information (i.e., textual, visual, and item ID). We hierarchically integrate all the information to obtain a better comprehension of the item's semantics. For content features, we first enhance the salient features from textual and visual content by fusing them with the subtle features from modal-specific ID embeddings. Then, the enhanced modalities are fused through an attention mechanism and contrastive learning. Similarly, we obtain the structural features by merging salient neighborhood features and subtle features from a modality-specific interaction graph. Note that we introduce modal-specific ID embeddings to help avoid mutual interference among different modalities, which is shown to be necessary in Section 2. Finally, the item representation is obtained by combining the content and structural features.

It is worth mentioning that our method has the additional benefit of alleviating the modal missing problem in multimodal recommendation. That is, some modalities may be unavailable in real applications, which degrades the performance of existing multimodal recommenders (Cai et al., 2018; Tsai et al., 2019; Li et al., 2014; Cui et al., 2020; Wang et al., 2018; Zhou et al., 2019). Our work alleviates this problem by making use of the subtle features of content and structure derived from ID embeddings, thus maintaining relatively high performance.

Contributions. To sum up, the main contributions of our work are presented as follows:

  • We highlight the significance of ID embeddings and undertake a thorough analysis of how best to leverage them in multimodal recommendation. To the best of our knowledge, this is the first attempt to thoroughly explore the detailed information within ID embeddings and to utilize them in a fine-grained manner.

  • We propose a novel framework that considers both item ID and modality in terms of content and structure to acquire comprehensive item representations. For content, we design a hierarchical attention mechanism and employ contrastive learning to enhance salient modal features with subtle content features within ID embeddings. For structure, we develop a modal-specific lightweight graph convolution network to involve subtle structural features in salient features. Eventually, we combine the content and structural features to obtain final item representations.

  • We perform extensive experiments on three real-world datasets (Baby, Sports, and Clothing) to demonstrate that our method outperforms the state-of-the-art recommendation methods. In addition, we find that our approach can effectively mitigate the modal missing problem.

Figure 2. Distributions of pre-trained ID embeddings. The same color represents items that interacted with the same user.
Figure 3. Semantic similarity of ID embeddings. Each cell in the heat map represents the normalized similarity of two ID embeddings, with horizontal and vertical coordinates denoting their mapped ID. Darker hues indicate a higher similarity.

2. ID Embedding Contains More

Yuan et al. (2023) have observed that, although modality-based recommendation has achieved performance comparable to ID-based recommendation thanks to advances in the multimodal domain, ID still dominates recommendation under typical architectures. However, the underlying reasons for the remarkable effectiveness of IDs are still not fully understood. To delve deeper into this matter, we use visualizations to analyze the ID embeddings and modal features, which leads us to two preliminary conclusions: the subtle features conveyed by ID embeddings, which consist of content and structural features, are expected to be modal-specific, and they can enhance the corresponding original modal features.

We first map the pre-trained item ID embeddings (see Appendix A for details) to 2-dimensional normalized vectors using UMAP (McInnes et al., 2018) and plot their distribution (Figure 2). Points of the same color correspond to items that interacted with the same user. In Figure 2, the items of different users form distinct clusters that are clearly visible and distinguished by color. This observation indicates that the ID embeddings of items interacted with by the same user lie close together, consistent with the edges of the user-item interaction graph. This suggests that item ID embeddings capture structural features of the user-item interaction graph. We also generate a heat map to visualize the semantic similarity among these item ID embeddings (Figure 3). Each cell in the heat map represents the normalized similarity of two items, where the horizontal and vertical coordinates indicate their respective re-indexed IDs. Note that we assign adjacent coordinates to items that interacted with the same user (e.g., indexes 0-4 correspond to items that interacted with the same user). To improve clarity, we set all values in the similarity matrix to 0 except for the top 10 values in each row. In Figure 3, items that interacted with the same user exhibit higher similarity, while items that interacted with different users also display a certain degree of semantic similarity, owing to the inherent content features of the items themselves. Based on this observation, we posit that ID embeddings also capture content features of items. Therefore, to make better use of ID embeddings in multimodal recommendation, we suggest a comprehensive approach that considers content and structure jointly.
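The row-wise top-10 masking used for the heat map can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes cosine similarity as the similarity measure (the text only says "normalized similarity"), it counts an item's self-similarity among the kept values, and the function names `cosine` and `topk_similarity` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def topk_similarity(embeddings, k=10):
    """Item-item similarity matrix in which each row keeps only its k largest
    values and every other entry is set to 0 (assumed reading of the masking)."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    masked = []
    for row in sim:
        # Indices of the k most similar items in this row (self included).
        keep = set(sorted(range(n), key=lambda j: row[j], reverse=True)[:k])
        masked.append([row[j] if j in keep else 0.0 for j in range(n)])
    return masked

# Toy usage with three 2-dimensional embeddings and k=2.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
masked = topk_similarity(toy, k=2)
```

Ordering the items so that those sharing a user are adjacent, as described above, would then make same-user blocks appear along the diagonal of the resulting matrix.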

Moreover, we visualize the item-item similarity matrix of pre-extracted multimodal features in the same way. We conduct analyses on both text and image features to ensure the generalizability of the results. As depicted in Figure 4, the similarity patterns of different modalities are partially inconsistent, which indicates that there are subtle differences in semantics between modalities. Therefore, directly aligning or fusing different modalities for recommendation can lead to a performance drop. To mitigate this issue, we propose to enhance multimodal features with modal-specific ID embeddings (e.g., $tid$ and $vid$), thereby increasing their adaptability for recommendation. To confirm the effectiveness of this enhancement, we plot the distributions of visual and textual features before and after it (Figure 5). Our findings show that including modal-specific ID embeddings effectively enhances these multimodal features, making their distributions markedly more discriminable.

Consequently, to obtain optimal item representations, modal-specific ID embeddings are suggested to enhance multimodal features in terms of content and structure, respectively.

Figure 4. Semantic similarity of text and image features extracted by universal encoders. Each cell in the heat map represents the normalized similarity of two items, with horizontal and vertical coordinates denoting their mapped ID respectively. Darker hues indicate a higher similarity.
Figure 5. Distributions of image and text features before and after the enhancement. Items interacted with the same user are represented by the same color.

3. Our IDSF Model

Based on the findings in Section 2, we speculate that utilizing ID embeddings in a fine-grained manner can optimize item representations for better performance. In this section, we introduce our method that treats ID embeddings as subtle features of both content and structure, to enhance and fuse multimodal salient features.

3.1. Preliminary

We represent the interaction data as a bipartite user-item graph $\mathcal{G}=\{(u,i)\,|\,u\in\mathcal{U},i\in\mathcal{I}\}$, where $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items, respectively. An edge $y_{ui}=1$ indicates a positive interaction between user $u$ and item $i$; otherwise $y_{ui}=0$. We denote the original features, before any graph convolution, as $\textbf{e}_{i}^{m,(0)}$ and $\textbf{e}_{u}^{m,(0)}$ for a specific kind of information $m\in\mathcal{M}$, where $\mathcal{M}$ is the set of all kinds of information. Without loss of generality, in this work we refer to an item's textual ($t$) and visual ($v$) modalities as its salient features and to its associated item IDs ($tid$, $vid$) as its subtle features, so that $\mathcal{M}=\{t,v,tid,vid\}$. Multimodal information provided by users is left to future studies. In addition, the higher-order representations of items and users after $k$ layers of graph convolution are denoted by $\textbf{e}_{i}^{m,(k)}$ and $\textbf{e}_{u}^{m,(k)}$, respectively. We derive the modal-specific bipartite graphs $\mathcal{G}_{t}$ and $\mathcal{G}_{v}$ from $\mathcal{G}$ in order to describe the structural information of each modality individually.

3.2. Model Overview

Figure 6. An illustration of our proposed IDSF framework. IDSF consists of three key modules: content representation module for items (left part), structural representation module for items (middle part), and users (right part). Specifically, the item content module adopts a hierarchical attention mechanism to enhance multiple salient features (e.g., textual and visual in this figure) with subtle features in ID embeddings of items. The item- and user-structural modules design a lightweight modality-specific propagation mechanism to capture higher-order preference representations of items and users, respectively. The recommendation prediction generates a set of recommended items for a specific user.

As illustrated in Figure 6, our IDSF model consists of three key components: the item content module, the item structural module, and the user structural module. Specifically, in the item content module, we aim to enhance the representation of each modality and their fusion by designing a hierarchical attention mechanism. That is, we improve the salient features (such as the textual and visual content in Figure 6) with the associated subtle features in the item ID embeddings of each modality, and fuse multiple modalities to generate the content representation $\textbf{e}_{i}^{c}$ of item $i$. Then, in the item structural module, we adopt a similar idea to extract an item's high-order structural representation $\textbf{e}_{i}^{s}$ from the various modalities of item $i$'s neighbors. We implement it with a lightweight multi-layer graph convolution (textual modality on the left and visual modality on the right). The same process also holds for the user structural module, which obtains user $u$'s structural representation $\textbf{e}_{u}^{s}$. Finally, we obtain item $i$'s overall embedding by fusing the content representation $\textbf{e}_{i}^{c}$ with the structural representation $\textbf{e}_{i}^{s}$, and take user $u$'s structural representation $\textbf{e}_{u}^{s}$ as his/her final embedding. The prediction $\hat{y}_{ui}$ is computed as the inner product of user $u$'s and item $i$'s embeddings.

3.3. Item Content Module

Different from the existing works that mainly focus on salient features from multiple modalities, we further take into account subtle features from item ID embeddings. Specifically, we design a hierarchical attention mechanism in Section 3.3.1 to enhance salient content features (i.e., textual and visual) with their corresponding subtle features captured by modal-specific ID embeddings. Then, we introduce a contrastive loss to encourage the consistency among multi-modal information for better fusion in Section 3.3.2.

3.3.1. Modality Enhancement and Fusion

For each modality, we propose a hierarchical attention method to combine salient and subtle information. As shown in the left part of Figure 6, we first enhance the textual salient features $\textbf{e}^{t}_{i}$ with the textual subtle features $\textbf{e}_{i}^{tid}$ in the textual space through Text Attention to get $\textbf{e}_{i}^{t^{\prime}}$. Similarly, we obtain the visual features $\textbf{e}_{i}^{v^{\prime}}$ by Visual Attention, which integrates the visual salient features $\textbf{e}^{v}_{i}$ with the visual subtle features $\textbf{e}_{i}^{vid}$ in the visual space. Then, we further fuse $\textbf{e}_{i}^{t^{\prime}}$ and $\textbf{e}_{i}^{v^{\prime}}$ through VT Attention to get the comprehensive item content representation $\textbf{e}_{i}^{c}$.

All three attention mechanisms (Text Attention, Visual Attention, and VT Attention) follow the same procedure. We take the textual content and item ID in the textual space as an example. The modality fusion operation based on the attention mechanism is defined as follows:

(1) $\textbf{e}^{t^{\prime}}_{i}=\sum_{m\in\{t,\;tid\}}\alpha^{m}_{i}\textbf{e}_{i}^{m},$

where $\textbf{e}_{i}^{t}$ represents the textual features extracted by Sentence-BERT (Reimers and Gurevych, 2019), $\textbf{e}_{i}^{tid}$ represents the learnable item ID embeddings in the textual space, and $\alpha^{m}_{i}$ indicates the significance of each element $\textbf{e}_{i}^{m}$, calculated as follows:

(2) $\alpha_{i}^{m}=\text{softmax}([\textbf{q}^{\top}\tanh(\mathbf{W}\textbf{e}^{m}_{i}+\textbf{b})],\;m\in\{t,tid\}),$

where $\textbf{q}\in\mathbb{R}^{d}$ denotes the attention vector, and $\textbf{W}\in\mathbb{R}^{d\times d}$ and $\textbf{b}\in\mathbb{R}^{d}$ denote the trainable weight matrix and bias vector, respectively. We can compute $\textbf{e}_{i}^{v^{\prime}}$ and $\textbf{e}_{i}^{c}$ in the same way. In this way, we are able to enhance the salient (textual and visual) features with the subtle features of the item.
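To make Eqs. (1) and (2) concrete, the following is a minimal pure-Python sketch of one attention block for a single item. It is an illustration under stated assumptions, not the authors' implementation: the function names `matvec` and `attention_fuse` are hypothetical, the parameters are passed in as plain lists rather than learned, and the toy values are arbitrary.

```python
import math

def matvec(W, x):
    """Multiply a d x d matrix (list of rows) by a d-dimensional vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attention_fuse(features, W, b, q):
    """Fuse feature vectors as in Eqs. (1)-(2).

    features: list of d-dim vectors, e.g. [e_t, e_tid] for Text Attention.
    Returns (fused vector, attention weights).
    """
    # Eq. (2): score each input as q^T tanh(W e + b).
    scores = []
    for e in features:
        hidden = [math.tanh(h + bias) for h, bias in zip(matvec(W, e), b)]
        scores.append(sum(qi * hi for qi, hi in zip(q, hidden)))
    # Softmax over the inputs (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Eq. (1): attention-weighted sum of the inputs.
    d = len(features[0])
    fused = [sum(w * e[k] for w, e in zip(weights, features)) for k in range(d)]
    return fused, weights

# Toy usage with d = 2 and hand-picked (hypothetical) parameters.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
q = [1.0, 1.0]
e_t, e_tid = [1.0, 0.0], [0.0, 1.0]
e_t_prime, alpha = attention_fuse([e_t, e_tid], W, b, q)
```

Under this reading, the same function would be applied three times: to $(t, tid)$ for Text Attention, to $(v, vid)$ for Visual Attention, and to the two enhanced outputs for VT Attention. Whether the three blocks share parameters is not stated in the text; separate parameters per block are assumed here.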

3.3.2. Contrastive Learning Constraints

In this paper, we construct a self-supervised contrastive learning task by maximizing the agreement between the item representations before and after fusion. Specifically, we separately maximize the degree of consistency between the representations of different information (e.g., $\textbf{e}_{i}^{t}$ and $\textbf{e}_{i}^{tid}$) and the fused representation (e.g., $\textbf{e}_{i}^{t^{\prime}}$) of the subtle and salient features. Similar comparisons are made between $\textbf{e}_{i}^{v}$, $\textbf{e}_{i}^{vid}$ and $\textbf{e}_{i}^{v^{\prime}}$, and also between $\textbf{e}_{i}^{t^{\prime}}$, $\textbf{e}_{i}^{v^{\prime}}$ and $\textbf{e}_{i}^{c}$.

Here, we take the text content and item ID in the textual space as an example to formulate the contrastive learning loss as follows:

(3) $\bar{\mathcal{L}}_{C}(\textbf{e}_{i}^{t},\textbf{e}_{i}^{tid},\textbf{e}_{i}^{t^{\prime}})=-\frac{1}{4|\mathcal{I}|}\sum\limits_{i\in\mathcal{I}}\sum\limits_{m\in\{t,\;tid\}}I(\textbf{e}_{i}^{m},\textbf{e}_{i}^{t^{\prime}})+I(\textbf{e}_{i}^{t^{\prime}},\textbf{e}_{i}^{m}),$

where $|\mathcal{I}|$ represents the number of items, and the mutual information $I(\cdot)$ between two representations can be mathematically expressed as:

(4) $I(\textbf{e}_{i}^{m},\textbf{e}_{i}^{t^{\prime}})=\log\frac{I_{ii}}{I_{ii}+I_{ij}},\;\;\;m\in\{t,\;tid\},$

where

(5) $I_{ii}=\exp(f(\textbf{e}_{i}^{m},\textbf{e}_{i}^{t^{\prime}}))/\tau,\quad I_{ij}=\sum_{j\neq i}\left(\exp(f(\textbf{e}_{i}^{m},\textbf{e}_{j}^{t^{\prime}}))/\tau+\exp(f(\textbf{e}_{i}^{m},\textbf{e}_{j}^{m}))/\tau\right),$

where $i, j \in \mathcal{I}$, $\tau \in \mathbb{R}$ is a temperature hyper-parameter, and $f(\cdot,\cdot)$ is the similarity function, implemented as cosine similarity.

The overall contrastive learning loss is given by:

(6) $\mathcal{L}_C = \bar{\mathcal{L}}_C(\mathbf{e}^t, \mathbf{e}^{tid}, \mathbf{e}^{t'}) + \bar{\mathcal{L}}_C(\mathbf{e}^v, \mathbf{e}^{vid}, \mathbf{e}^{v'}) + \bar{\mathcal{L}}_C(\mathbf{e}^{t'}, \mathbf{e}^{v'}, \mathbf{e}^c)$

In this way, we will get a more complete representation of a given item by aligning the representations of the various modalities with the contrastive learning loss.
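
The per-item term of Eq. 4 and Eq. 5 can be illustrated with a minimal sketch in plain Python. The function names are hypothetical, and the sketch rests on one assumption that departs from the printed form of Eq. 5: the temperature divides the similarity inside the exponential, as in the usual InfoNCE formulation, because a factor of $1/\tau$ outside the exponential would cancel in the ratio of Eq. 4.

```python
import math

def cosine(a, b):
    """Cosine similarity f(a, b) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_score(anchor, target, i, tau=0.5):
    """Eq. 4 for item i: log(I_ii / (I_ii + I_ij)).

    anchor[j] plays the role of e_j^m and target[j] the role of e_j^{t'}.
    Assumption: tau is applied inside exp(), unlike the printed Eq. 5.
    """
    pos = math.exp(cosine(anchor[i], target[i]) / tau)
    neg = 0.0
    for j in range(len(anchor)):
        if j == i:
            continue
        neg += math.exp(cosine(anchor[i], target[j]) / tau)  # cross-view negatives
        neg += math.exp(cosine(anchor[i], anchor[j]) / tau)  # same-view negatives
    return math.log(pos / (pos + neg))
```

The score is always negative and moves toward zero as the two views of item $i$ agree with each other and disagree with the other items, so a loss built from it would minimize its negative.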

3.4. Item- and User- Structural Modules

It is well known that the user-item interaction bipartite graph $\mathcal{G}$ in recommender systems contains rich structural information. The higher-order connectivity of the interaction graph depicts how preferences propagate across users and items. Graph neural networks have proven effective in capturing such higher-order structural information. In this section, we describe a lightweight graph convolution that aggregates higher-order neighbor information, and then fuse the higher-order information of multiple modalities.

Inspired by LightGCN (He et al., 2020) and MMGCN (Wei et al., 2019), for a user (or item) node in a modality-specific interaction graph $\mathcal{G}_m$, we use a weighted-sum aggregator to gather information from its neighbors, and enhance the result by retaining the userID embedding $\mathbf{e}_u^{id}$ or the modality-specific itemID embedding $\mathbf{e}_i^{(m)id}$ of the target node at a ratio $\gamma$. The lightweight message propagation over $k$-hop neighbors is expressed as follows:

(7) $\mathbf{e}_i^{m,(k)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}} \mathbf{e}_u^{m,(k-1)} + \gamma \mathbf{e}_i^{(m)id},$
(8) $\mathbf{e}_u^{m,(k)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}} \mathbf{e}_i^{m,(k-1)} + \gamma \mathbf{e}_u^{id},$

where $m \in \mathcal{M} = \{t, v\}$, $\mathbf{e}_i^{m,(0)} = \mathbf{e}_i^m$, and $\mathbf{e}_i^{(m)id}$ is the itemID embedding that reuses the corresponding textual or visual space of the multimodal fusion module, i.e., $\mathbf{e}_i^{(m)id} \in \{\mathbf{e}_i^{vid}, \mathbf{e}_i^{tid}\}$. $\mathbf{e}_u^{id}$ is the userID embedding, a learnable variable initialized at random. $\mathcal{N}_u$ and $\mathcal{N}_i$ denote the sets of neighbors of user $u$ and item $i$, respectively, and $\gamma$ is a hyper-parameter that controls the amount of subtle information from the target node.
The symmetric normalization term $\frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}}$ follows the design of standard GCN and prevents the scale of the embeddings from growing with graph convolution operations.
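
One propagation layer in the spirit of Eq. 7 and Eq. 8 can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: embeddings are plain Python lists stored in dictionaries, the interaction list contains no duplicate pairs, and the function name `propagate` is hypothetical.

```python
import math

def propagate(item_emb, user_emb, item_id, user_id, interactions, gamma=0.3):
    """Compute layer-k embeddings from layer-(k-1) embeddings.

    item_emb, user_emb: dict node -> layer-(k-1) embedding (list of floats)
    item_id, user_id:   dict node -> ID embedding, added with ratio gamma
    interactions:       list of unique (user, item) pairs
    """
    item_nb, user_nb = {}, {}
    for u, i in interactions:
        item_nb.setdefault(i, []).append(u)
        user_nb.setdefault(u, []).append(i)

    new_item = {}
    for i, id_vec in item_id.items():
        out = [gamma * x for x in id_vec]  # gamma * e_i^{(m)id}
        for u in item_nb.get(i, []):
            # symmetric normalization 1 / (sqrt|N_i| * sqrt|N_u|)
            w = 1.0 / math.sqrt(len(item_nb[i]) * len(user_nb[u]))
            out = [o + w * x for o, x in zip(out, user_emb[u])]
        new_item[i] = out

    new_user = {}
    for u, id_vec in user_id.items():
        out = [gamma * x for x in id_vec]  # gamma * e_u^{id}
        for i in user_nb.get(u, []):
            w = 1.0 / math.sqrt(len(user_nb[u]) * len(item_nb[i]))
            out = [o + w * x for o, x in zip(out, item_emb[i])]
        new_user[u] = out
    return new_item, new_user
```

Calling the function $K$ times, each time feeding the previous output back in as `item_emb` and `user_emb`, would yield the per-layer embeddings for one modality.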

After computing the high-order preference embeddings at each of the $K$ layers via Eq. 7 and Eq. 8, we stack the embeddings of all layers and take their unweighted arithmetic mean to obtain the modality-specific representations of a user (or item). Finally, we derive the structural representations of item $i$ and user $u$ by an attentional combination of the multimodal representations, defined in Eq. 9 and Eq. 10, respectively:

(9) $\mathbf{e}_i^s = \sum_{m \in \{t,v\}} \alpha_i^m \cdot \left( \frac{1}{K+1} \sum_{k=0}^{K} \mathbf{e}_i^{m,(k)} \right),$
(10) $\mathbf{e}_u^s = \sum_{m \in \{t,v\}} \alpha_i^m \cdot \left( \frac{1}{K+1} \sum_{k=0}^{K} \mathbf{e}_u^{m,(k)} \right),$

where $\alpha_i^m$ represents the importance of modality $m$, computed as in Eq. 2.
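
The readout of Eq. 9 and Eq. 10 amounts to a layer-wise mean followed by a weighted sum over modalities. A small sketch, assuming the attention weights $\alpha^m$ are already available from Eq. 2 and are passed in as a dictionary (the helper names are hypothetical):

```python
def layer_mean(layers):
    """Unweighted mean over the K+1 layer embeddings (k = 0..K)."""
    dim = len(layers[0])
    return [sum(layer[d] for layer in layers) / len(layers) for d in range(dim)]

def combine_modalities(layers_by_modality, alpha):
    """Attention-weighted sum of the per-modality layer means.

    layers_by_modality: dict modality -> list of K+1 embeddings
    alpha:              dict modality -> attention weight (assumed given)
    """
    dim = len(next(iter(layers_by_modality.values()))[0])
    out = [0.0] * dim
    for m, layers in layers_by_modality.items():
        mean = layer_mean(layers)
        out = [o + alpha[m] * x for o, x in zip(out, mean)]
    return out
```

For instance, with $K = 2$ each modality contributes three layer embeddings, and the result has the same dimension as a single embedding.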

3.5. Recommendation Prediction and Optimization

The three modules above allow us to extract the content representation $\mathbf{e}_i^c$ of item $i$, as well as the high-order structural representations $\mathbf{e}_i^s$ and $\mathbf{e}_u^s$ of item $i$ and user $u$, respectively. We can therefore predict user $u$'s preference for item $i$ as follows:

(11) $\hat{y}_{ui} = {\mathbf{e}_u^s}^{\top} (\mathbf{e}_i^c + \mathbf{e}_i^s),$

where $\mathbf{e}_i^c + \mathbf{e}_i^s$ is the final representation of item $i$, combining its content and structure. (Footnote 3: We do not introduce MLPs to map them, since the structure and content features reside in the same latent space.)

To train our model, we use the Bayesian Personalized Ranking (BPR) loss (Rendle et al., 2009). BPR is a well-known pairwise loss that encourages the prediction for an observed user-item pair $(u, i)$ to be higher than that for its unobserved counterparts $(u, j)$:

(12) $\mathcal{L}_{BPR} = -\sum_{u \in \mathcal{U}} \sum_{i \in \mathcal{N}_u} \sum_{j \notin \mathcal{N}_u} \ln \sigma(\hat{y}_{ui} - \hat{y}_{uj}),$

where $\sigma(\cdot)$ is the sigmoid function and $\mathcal{N}_u$ is the set of all items that user $u$ has interacted with.
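
Eq. 11 and Eq. 12 translate directly into a few lines of Python. The sketch below assumes that the triple sum of Eq. 12 is approximated by a list of sampled $(u, i, j)$ triples whose scores have already been computed, which is common practice but is not stated in the text; the function names are hypothetical.

```python
import math

def score(user_s, item_c, item_s):
    """Eq. 11: inner product of e_u^s with (e_i^c + e_i^s)."""
    return sum(u * (c + s) for u, c, s in zip(user_s, item_c, item_s))

def bpr_loss(pos_scores, neg_scores):
    """Eq. 12 over sampled triples: -sum ln sigmoid(y_ui - y_uj)."""
    total = 0.0
    for y_ui, y_uj in zip(pos_scores, neg_scores):
        diff = y_ui - y_uj
        # -ln(sigmoid(diff)) = ln(1 + exp(-diff)), written to avoid overflow
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total
```

The loss for one triple equals $\ln 2$ when both items receive the same score and shrinks toward zero as the observed item is ranked further above the unobserved one.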

We simultaneously optimize the multimodal contrastive loss and the BPR loss, that is, the overall objective function can be written as follows:

(13) $\mathcal{L} = \mathcal{L}_{BPR} + \beta \mathcal{L}_C + \lambda \|\Theta\|_2^2,$

where $\beta$, $\lambda$, and $\Theta$ denote the strength of the contrastive loss, the strength of the $L_2$ regularization, and the learnable parameters of the model, respectively.
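
As a small illustration of Eq. 13, the overall objective can be assembled from the two loss values and the parameter vectors. The function name is hypothetical, and the default values of `beta` and `lam` are the ones reported for Baby in the hyperparameter settings of Section 4.1.

```python
def total_loss(bpr, contrastive, params, beta=0.3, lam=1e-4):
    """Eq. 13: L = L_BPR + beta * L_C + lambda * ||Theta||_2^2.

    params: iterable of parameter vectors (lists of floats) forming Theta.
    """
    l2 = sum(w * w for vec in params for w in vec)  # squared L2 norm of Theta
    return bpr + beta * contrastive + lam * l2
```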

Compared with other GCN-based multimodal recommendation algorithms, our method does not increase the time complexity yet achieves better results, as verified by the experiments in Section 4.

4. Experiments and Analysis

4.1. Experimental Settings

Datasets. Three widely used Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/links.html) are used in our experiments (McAuley et al., 2015): (a) Baby, (b) Sports and Outdoors, and (c) Clothing, Shoes and Jewelry, which we refer to as Baby, Sports, and Clothing, respectively. Table 1 summarizes the statistics of the three datasets. In addition to user-item interactions, these datasets contain both visual and textual modalities of items. We extract visual features with a CNN, and textual features with sentence-transformers (Reimers and Gurevych, 2019) applied to the concatenation of the title, description, category, and brand of each item. All datasets are split into training, validation, and testing subsets with a ratio of 8:1:1. Three common metrics, Precision@K, Recall@K, and NDCG@K, are used for evaluation, with $K$ set to 10 and 20 by default.
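
The three evaluation metrics are not spelled out in the text. The following sketch uses their standard binary-relevance definitions for a single user, which is an assumption about the evaluation protocol; the function name is hypothetical.

```python
import math

def metrics_at_k(ranked, relevant, k):
    """Return (precision, recall, ndcg) at cutoff k for one user.

    ranked:   list of item ids, best first
    relevant: set of held-out items the user interacted with
    """
    top = ranked[:k]
    hits = [1 if item in relevant else 0 for item in top]
    precision = sum(hits) / k
    recall = sum(hits) / len(relevant) if relevant else 0.0
    # DCG discounts a hit at 0-based position pos by log2(pos + 2)
    dcg = sum(h / math.log2(pos + 2) for pos, h in enumerate(hits))
    ideal = sum(1 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    ndcg = dcg / ideal if ideal > 0 else 0.0
    return precision, recall, ndcg
```

Dataset-level scores would then be the averages of these per-user values over all test users.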

Table 1. Statistics of the datasets.

| Dataset | Modalities | # Items | # Users | # Interactions |
|---|---|---|---|---|
| Baby | visual, textual | 7,050 | 19,445 | 139,110 |
| Sports | visual, textual | 18,357 | 35,598 | 256,308 |
| Clothing | visual, textual | 23,033 | 39,387 | 237,488 |

Table 2. Performance of all comparison methods. Each row's second-best score is underlined and the top score is highlighted in bold. BPR, NGCF, and LightGCN are CF methods; VBPR, MMGCN, GRCN, SLMREC, MICRO, BM3, and IDSF are multimodal methods. The final column indicates the percentage of performance improvement relative to the second-best method.

| Dataset | Metric (%) | BPR | NGCF | LightGCN | VBPR | MMGCN | GRCN | SLMREC | MICRO | BM3 | IDSF | Improve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baby | Recall@10 | 2.935 | 3.567 | 4.509 | 3.103 | 3.861 | 4.443 | 5.233 | 5.807 | 4.481 | 6.143 | 5.8% |
| Baby | Precision@10 | 0.308 | 0.378 | 0.478 | 0.332 | 0.408 | 0.473 | 0.585 | 0.608 | 0.470 | 0.645 | 6.0% |
| Baby | NDCG@10 | 1.588 | 1.964 | 2.498 | 1.749 | 1.996 | 2.392 | 2.885 | 3.218 | 2.343 | 3.431 | 6.6% |
| Baby | Recall@20 | 4.566 | 5.751 | 7.248 | 4.863 | 6.240 | 7.895 | 7.775 | 8.884 | 6.688 | 9.505 | 6.9% |
| Baby | Precision@20 | 0.243 | 0.307 | 0.383 | 0.262 | 0.332 | 0.422 | 0.445 | 0.466 | 0.354 | 0.499 | 7.1% |
| Baby | NDCG@20 | 2.033 | 2.552 | 3.212 | 2.330 | 2.605 | 3.552 | 3.541 | 4.024 | 2.931 | 4.311 | 7.1% |
| Sports | Recall@10 | 2.977 | 4.925 | 5.786 | 3.560 | 3.352 | 5.879 | 6.559 | 6.671 | 5.947 | 7.001 | 5.0% |
| Sports | Precision@10 | 0.318 | 0.521 | 0.611 | 0.382 | 0.355 | 0.627 | 0.693 | 0.701 | 0.659 | 0.734 | 4.7% |
| Sports | NDCG@10 | 1.712 | 2.693 | 3.376 | 2.086 | 1.763 | 3.310 | 3.726 | 3.728 | 3.158 | 4.006 | 7.5% |
| Sports | Recall@20 | 4.386 | 7.471 | 8.502 | 5.211 | 5.466 | 8.752 | 9.834 | 9.951 | 9.153 | 10.41 | 4.6% |
| Sports | Precision@20 | 0.236 | 0.397 | 0.451 | 0.281 | 0.291 | 0.467 | 0.518 | 0.525 | 0.517 | 0.548 | 4.4% |
| Sports | NDCG@20 | 2.095 | 3.371 | 4.102 | 2.533 | 2.284 | 4.078 | 4.588 | 4.599 | 3.978 | 4.914 | 6.8% |
| Clothing | Recall@10 | 1.322 | 2.693 | 3.684 | 2.199 | 1.996 | 4.006 | 4.964 | 5.047 | 4.144 | 5.381 | 6.6% |
| Clothing | Precision@10 | 0.135 | 0.273 | 0.373 | 0.223 | 0.202 | 0.407 | 0.502 | 0.436 | 0.656 | 0.545 | 6.5% |
| Clothing | NDCG@10 | 0.715 | 1.441 | 2.032 | 1.236 | 1.014 | 2.164 | 2.711 | 2.226 | 3.152 | 2.966 | 6.5% |
| Clothing | Recall@20 | 1.938 | 4.095 | 5.407 | 3.175 | 3.231 | 6.193 | 7.471 | 7.674 | 6.388 | 7.915 | 3.1% |
| Clothing | Precision@20 | 0.099 | 0.208 | 0.274 | 0.161 | 0.164 | 0.315 | 0.378 | 0.388 | 0.333 | 0.401 | 3.2% |
| Clothing | NDCG@20 | 0.875 | 1.801 | 2.471 | 1.456 | 1.325 | 2.721 | 3.347 | 3.451 | 2.834 | 3.611 | 4.6% |

Baselines. We compare IDSF with nine competing methods, which fall into two categories: CF methods and multimodal methods. The first category contains three classic methods that generate recommendations from user-item interactions alone, without modality information: BPR (Rendle et al., 2009), NGCF (Wang et al., 2019), and LightGCN (He et al., 2020). The second category contains methods that additionally exploit modality information for item representations: VBPR (He and McAuley, 2016), MMGCN (Wei et al., 2019), GRCN (Wei et al., 2020), SLMREC (Tao et al., 2022), MICRO (Zhang et al., 2023), and BM3 (Zhou et al., 2023). MICRO is an extended version of LATTICE (Zhang et al., 2021), so we omit the performance of LATTICE. Details about the baselines are provided in Appendix B.

Hyperparameters. For a fair comparison, we set the embedding dimension to 128 and the batch size to 1024, initialize all model parameters with the Xavier initializer (Glorot and Bengio, 2010), and optimize them with the Adam optimizer (Kingma and Ba, 2015). Other hyper-parameters are determined via grid search on the validation set. The learning rate is set to 0.0005, the coefficient of $\ell_2$ regularization to $10^{-4}$, and the temperature $\tau$ to 0.5. The coefficient $\beta$, which controls the effect of the contrastive multimodal fusion task, is set to 0.3 for Baby and 1.0 for Sports and Clothing. The coefficient $\gamma$, which controls the amount of information from cross-modal representations of the target node, is set to 0.3 for Baby and Sports and 1.0 for Clothing. We set the number of GCN layers to $K = 2$, the same as LightGCN (He et al., 2020). In addition, to avoid overfitting, we stop training early if Recall@20 on the validation set does not increase for 5 consecutive epochs, and we follow the original settings of the comparison methods to achieve their best performance. Due to space limitations, the analysis of hyperparameters is given in Appendix C.
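
The early-stop rule can be made concrete with a short sketch. It assumes that "no longer increases after 5 epochs" means five consecutive epochs without a strict improvement in validation Recall@20; the function name and the list-based interface are hypothetical.

```python
def early_stop_epoch(val_recalls, patience=5):
    """Return the 1-based epoch at which training stops, or None.

    val_recalls: validation Recall@20 after each epoch, in order.
    Training stops once `patience` consecutive epochs bring no improvement.
    """
    best = float("-inf")
    since_best = 0
    for epoch, recall in enumerate(val_recalls, start=1):
        if recall > best:
            best, since_best = recall, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None
```

Under this reading, a tie with the best value so far counts as no improvement.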

Table 3. Performance of different variants of our IDSF (all metrics in %).

| Variant | Baby R@10 | Baby P@10 | Baby N@10 | Baby R@20 | Baby P@20 | Baby N@20 | Sports R@10 | Sports P@10 | Sports N@10 | Sports R@20 | Sports P@20 | Sports N@20 | Clothing R@10 | Clothing P@10 | Clothing N@10 | Clothing R@20 | Clothing P@20 | Clothing N@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o content | 5.211 | 0.551 | 2.865 | 8.258 | 0.437 | 3.671 | 5.539 | 0.584 | 3.071 | 8.383 | 0.444 | 3.831 | 3.963 | 0.401 | 2.181 | 5.899 | 0.298 | 2.673 |
| content w/o contrast | 5.551 | 0.582 | 3.089 | 8.654 | 0.455 | 3.886 | 5.754 | 0.608 | 3.181 | 8.763 | 0.464 | 3.981 | 4.175 | 0.422 | 2.285 | 6.229 | 0.315 | 2.811 |
| content w/o ID | 5.923 | 0.625 | 3.236 | 9.276 | 0.488 | 4.114 | 6.857 | 0.721 | 3.897 | 10.21 | 0.538 | 4.783 | 4.927 | 0.499 | 2.668 | 7.298 | 0.369 | 3.271 |
| structure w/o ID | 5.629 | 0.591 | 3.152 | 8.611 | 0.452 | 3.931 | 6.498 | 0.682 | 3.573 | 9.653 | 0.507 | 4.403 | 5.151 | 0.521 | 2.781 | 7.698 | 0.391 | 3.433 |
| IDSF | 6.143 | 0.645 | 3.431 | 9.505 | 0.499 | 4.311 | 7.001 | 0.734 | 4.006 | 10.41 | 0.548 | 4.914 | 5.381 | 0.545 | 2.966 | 7.915 | 0.401 | 3.611 |

Table 4. Performance comparison over enhanced modalities and original modalities (all metrics in %). Original visual, original textual, and original fused denote that the content and structure features are obtained from the original modalities without the subtle features, while enhanced visual, enhanced textual, and enhanced fused denote that the original features are enhanced by the subtle features.

| Modality | Baby R@10 | Baby P@10 | Baby N@10 | Baby R@20 | Baby P@20 | Baby N@20 | Sports R@10 | Sports P@10 | Sports N@10 | Sports R@20 | Sports P@20 | Sports N@20 | Clothing R@10 | Clothing P@10 | Clothing N@10 | Clothing R@20 | Clothing P@20 | Clothing N@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| original visual | 3.068 | 0.327 | 1.607 | 5.356 | 0.285 | 2.217 | 2.046 | 0.214 | 1.012 | 3.523 | 0.186 | 1.406 | 1.284 | 0.131 | 0.611 | 2.361 | 0.119 | 0.882 |
| original textual | 3.164 | 0.333 | 1.735 | 5.628 | 0.292 | 2.392 | 2.075 | 0.218 | 1.046 | 3.724 | 0.197 | 1.482 | 1.378 | 0.139 | 0.942 | 2.493 | 0.127 | 0.942 |
| original fused | 5.048 | 0.529 | 2.619 | 8.221 | 0.431 | 3.451 | 5.983 | 0.628 | 3.287 | 9.113 | 0.479 | 4.111 | 4.614 | 0.466 | 2.437 | 7.203 | 0.365 | 3.097 |
| enhanced visual | 5.639 | 0.593 | 3.112 | 8.866 | 0.466 | 3.958 | 6.831 | 0.719 | 3.861 | 10.11 | 0.535 | 4.738 | 4.798 | 0.485 | 2.627 | 7.041 | 0.357 | 3.199 |
| enhanced textual | 5.956 | 0.625 | 3.224 | 9.251 | 0.481 | 4.085 | 6.925 | 0.729 | 3.911 | 10.32 | 0.539 | 4.806 | 5.224 | 0.527 | 2.826 | 7.799 | 0.394 | 3.508 |
| enhanced fused | 6.143 | 0.645 | 3.431 | 9.505 | 0.499 | 4.311 | 7.001 | 0.734 | 4.006 | 10.41 | 0.548 | 4.914 | 5.381 | 0.545 | 2.966 | 7.915 | 0.401 | 3.611 |

4.2. Performance Comparison

The comparative results are summarized in Table 2, from which we can find that our proposed IDSF method outperforms both CF methods and multimodal methods. Generally, multimodal methods perform better than CF methods, implying the value of extracting salient features from multimodal information for item representation.

For the CF methods without considering multimodal information, NGCF performs better than BPR since the former method captures high-order structural features from the interactions between users and items, implying that it is significant to model structural features through graphs. In comparison, LightGCN performs the best in CF methods thanks to its adoption of an optimized lightweight aggregation method (rather than the aggregation in NGCF). This aggregation method obtains more precise structural features, demonstrating that optimized structural features can help improve recommendation performance.

Among the multimodal methods, VBPR is among the weakest because it considers only visual content features and no structural features. Nevertheless, VBPR consistently beats BPR, which implies the effectiveness of content features. MMGCN additionally extracts structural features of the textual and visual modalities, and on Baby it outperforms both VBPR and NGCF, which uses the same convolution method, suggesting that it is important to improve structural features with multimodal information. Furthermore, GRCN refines the structural features by identifying and fixing false-positive feedback, thereby obtaining better performance, which suggests that strengthening structural features benefits multimodal recommendation. However, LightGCN achieves performance comparable to GRCN merely by optimizing the aggregation method, which emphasizes the effect of optimized aggregation on structural features. MICRO attains the second-best performance on most metrics, owing to its fine-grained multimodal fusion with contrastive learning, which highlights the significance of contrastive learning in multimodal fusion.

Finally, our proposed IDSF method beats the second-best comparison method on the three metrics by around 5.8-7.1% on Baby, 4.4-7.5% on Sports, and 3.1-6.6% on Clothing. We attribute these improvements mainly to the fine-grained use of ID embeddings for multimodal recommendation, which provides valuable subtle features (extracted from ID embeddings) to complement the salient features (from multimodal information). This complementation helps to learn better item representations from the perspectives of both content and structure.
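
The improvement percentages are relative gains over the runner-up. As a worked example, the snippet below recomputes the Baby Recall@10 entry from the Table 2 values for IDSF (6.143) and MICRO (5.807); the helper name is hypothetical.

```python
def relative_improvement(ours, runner_up):
    """Percentage gain of `ours` over the second-best score."""
    return 100.0 * (ours - runner_up) / runner_up

baby_recall10 = relative_improvement(6.143, 5.807)
print(f"{baby_recall10:.1f}%")  # prints 5.8%
```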

4.3. Ablation Studies

To explore the effects of content features, contrastive learning, and ID embeddings, we compare four variants: w/o content, which discards item content features; content w/o contrast, which omits the contrastive loss; content w/o ID, which skips fusing subtle features into content features; and structure w/o ID, which leaves out ID embeddings when obtaining structural features. Table 3 summarizes the performance of these variants, from which we make the following observations.

Without content features, the performance of IDSF (w/o content) decreases significantly, indicating that content features are essential, as they most explicitly reveal the multimodal information of items. Comparing content w/o contrast with content w/o ID shows that contrastive learning is necessary when enhancing content with subtle features: without contrastive learning, ID embeddings tend to generate noisier subtle features, which degrade performance. Content w/o ID surpasses structure w/o ID on Baby and Sports, which suggests that structural features are more sensitive to ID embeddings. Therefore, for structural features, we control the amount of ID information injected in the convolution operation with the hyperparameter $\gamma$.

Finally, IDSF outperforms all variants on the three datasets across all three evaluation metrics, further validating the significance of the enhancement with subtle features. Subtle features complement salient features and improve the content and structural representations, yielding more accurate recommendations.

4.4. The Contribution of Enhancement in Modality Missing

The missing-modality problem remains a prominent challenge in multimodal recommendation. ID embeddings provide rich subtle information about content and structure, which can be exploited to enhance the expressiveness of salient features. In this subsection, we conduct incomplete-modality experiments to explore the contribution of the proposed enhancement when modalities are missing. Table 4 reports the performance comparison between modalities enhanced with subtle features and the original modalities.

Comparing the original and enhanced single modalities shows that the proposed enhancement yields substantial improvements. For example, on Baby, Recall@10 rises from 3.068 to 5.639 for the visual modality and from 3.164 to 5.956 for the textual modality. Meanwhile, fusing the enhanced textual and visual features outperforms fusing the original features, demonstrating the value of subtle features for better item representations. In other words, our proposed enhancement can alleviate the degradation of recommendation performance when certain modalities are missing.

Moreover, the textual modality is generally more useful than the visual modality for modeling items, which can be explained by the fact that the textual modality provides finer-grained attributes such as the categories and descriptions of items. Using multiple enhanced features performs consistently better than using a single modality, since the multimodal combination makes item representations more comprehensive. The enhanced modalities follow the same trend as the original ones, which implies that our proposed enhancement by subtle features extracted from ID embeddings does not interfere with the advantages of multiple modalities over a single modality.

5. Related Work

5.1. ID-based Recommendation

ID embeddings contain abundant and important internal characteristics and are essential in the existing recommendation literature (Yuan et al., 2023). For non-sequential tasks, recommendation models have evolved from early item-item collaborative filtering (Linden et al., 2003) and various matrix factorization-based approaches (Koren et al., 2009; Guo et al., 2019, 2015; He et al., 2017; Guo et al., 2017) to graph-based methods (Wei et al., 2020; He et al., 2020; Mao et al., 2021; Wu et al., 2021; Xia et al., 2022), which take user-item pairs as input to predict matching scores between users and items. For sequential tasks, recommendation models use a sequential prediction model, such as an RNN, LSTM, GRU, or Transformer, as the backbone and take a user's historical sequence of items as input to generate the next interactions (Kang and McAuley, 2018; Cui et al., 2020; Wu et al., 2019; Sun et al., 2019; Dang et al., 2023). These recommendation methods treat the user/item ID as a complete representation when making predictions.

5.2. ID in Multimodal Recommendation

Multimodal recommendation methods use ID embeddings as a component, yet do not assign significant importance to them. Matrix factorization-based approaches incorporate ID embeddings derived from historical interactions as content features for modeling user preferences. In this context, multimodal features are concatenated with ID embeddings as side information to enrich item and user representations. For instance, VBPR (He and McAuley, 2016) extends ID embeddings by adding to the prediction the score obtained from the inner product of item image features and user visual preference. DeepStyle (Liu et al., 2017) employs a shared user latent factor to interact with both image features and item ID embeddings. ACF (Chen et al., 2017) utilizes attention mechanisms to encode user preferences while incorporating user and item IDs as latent factors, which are subsequently aggregated.

In some GCN-based methods (Wei et al., 2019, 2020), ID embeddings are integrated into modality-specific structural features as a unique cross-modal global feature. These methods create modality-aware interaction graphs and perform information propagation and fusion to aggregate neighbors' information for target nodes, with ID embeddings serving as connections between modalities. SLMREC (Tao et al., 2022) recognizes the underlying semantic role of ID embeddings in recommendation and treats ID embeddings as a general modality, performing graph convolution on an "ID-modality".

Moreover, some works use ID embeddings just as in conventional recommendation. For example, MICRO (Zhang et al., 2023), LATTICE (Zhang et al., 2021), BM3 (Zhou et al., 2023), and FREEDOM (Zhou and Shen, 2023) use ID embeddings only for the collaborative filtering task and treat the multimodal module as an auxiliary task, without considering the connections between ID embeddings and the modalities. CMFB (Chen et al., 2021), DualGNN (Wang et al., 2023), and HUIGN (Wei et al., 2022) abandon IDs and simply concatenate multimodal content to replace item IDs for matrix factorization or graph convolution.

Recently, Xiao et al. (2022) propose a generative multimodal fusion framework (GMMF) for the CTR prediction task, which improves multimodal features by generating new visual and textual representations with a Difference Set Network (DSN). It maps the item ID into the modal space and treats it as a "special modality" to model the differences and connections between modalities, as in PAMD (Han et al., 2022).

Although the usage of ID varies, these methods all regard the ID as a whole without further explanation. In contrast, we contend that a more nuanced exploration of ID utilization is warranted. Our method goes a step further by uncovering the underlying information in IDs and using it in a fine-grained way to model item representations more comprehensively.

6. Conclusion and Future Work

In this paper, we revisit the significance of ID embeddings and analyze ID embeddings and multimodal features in the context of multimodal recommendation. Leveraging the subtle features within ID embeddings and recognizing the distinctions between modalities, we propose a novel multimodal recommendation method called IDSF, which makes effective use of the subtle content and structural features in ID embeddings and outperforms state-of-the-art multimodal recommendation methods in our experiments on three real-world datasets.

For future work, we intend to further consider the subtle features in sequential recommendation and find appropriate methods to integrate them. For users, explicit salient content features are unavailable in some datasets, so we plan to derive this information from users' interactive behaviors, such as historical reviews. Moreover, in an era dominated by large-scale models, the role of ID embeddings in pre-training and fine-tuning is worth further investigation.

References

  • Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur D. Szlam, and Yann LeCun. 2014. Spectral Networks and Locally Connected Networks on Graphs. CoRR (2014).
  • Cai et al. (2018) Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. 2018. Deep Adversarial Learning for Multi-Modality Missing Data Completion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1158–1166.
  • Cao et al. (2019) Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying Knowledge Graph Learning and Recommendation: Towards a Better Understanding of User Preferences. In The World Wide Web Conference. 151–161.
  • Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–344.
  • Chen et al. (2021) Xi Chen, Yangsiyi Lu, Yuehai Wang, and Jianyi Yang. 2021. CMBF: Cross-Modal-Based Fusion Recommendation Algorithm. Sensors (2021), 1–14.
  • Cui et al. (2020) Qiang Cui, Shu Wu, Q. Liu, Wen Zhong, and Liang Wang. 2020. MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation. IEEE Transactions on Knowledge and Data Engineering (2020), 317–331.
  • Dang et al. (2023) Yizhou Dang, Enneng Yang, Guibing Guo, Linying Jiang, Xingwei Wang, Xiaoxiao Xu, Qinghui Sun, and Hong Liu. 2023. Uniform sequence better: time interval aware data augmentation for sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence. 4225–4232.
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 34. 1–9.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics. 1–8.
  • Gori and Pucci (2007) Marco Gori and Augusto Pucci. 2007. ItemRank: A Random-Walk Based Scoring Algorithm for Recommender Engines. In International Joint Conference on Artificial Intelligence. 2766–2771.
  • Guo et al. (2019) Guibing Guo, Enneng Yang, Li Shen, Xiaochun Yang, and Xiaodong He. 2019. Discrete Trust-aware Matrix Factorization for Fast Recommendation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019. 1380–1386.
  • Guo et al. (2015) Guibing Guo, Jie Zhang, and N. Yorke-Smith. 2015. TrustSVD: Collaborative Filtering with Both the Explicit and Implicit Influence of User Trust and of Item Ratings. In Association for the Advancement of Artificial Intelligence. 123–129.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
  • Han et al. (2022) Tengyue Han, Pengfei Wang, Shaozhang Niu, and Chenliang Li. 2022. Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation. In Proceedings of the ACM Web Conference 2022. 2058–2066.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 144–150.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
  • He et al. (2018) Xiangnan He, Zhankui He, Xiaoyu Du, and Tat-Seng Chua. 2018. Adversarial Personalized Ranking for Recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 355–364.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. Proceedings of the 26th International Conference on World Wide Web (2017), 1–10.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. 2018 IEEE International Conference on Data Mining (ICDM) (2018), 197–206.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR (2015).
  • Koren et al. (2009) Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer (2009), 30–37.
  • Li et al. (2014) Rongjian Li, Wenlu Zhang, Heung-Il Suk, Li Wang, Jiang Li, Dinggang Shen, and Shuiwang Ji. 2014. Deep Learning Based Imaging Data Completion for Improved Brain Disease Diagnosis. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014. 305–312.
  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Distributed Syst. Online (2003), 76–80.
  • Liu et al. (2022) Hao Liu, Yongxin Tong, Jindong Han, Panpan Zhang, Xinjiang Lu, and Hui Xiong. 2022. Incorporating Multi-Source Urban Data for Personalized and Context-Aware Multi-Modal Transportation Recommendation. IEEE Transactions on Knowledge and Data Engineering (2022), 723–735.
  • Liu et al. (2017) Q. Liu, Shu Wu, and Liang Wang. 2017. DeepStyle: Learning User Preferences for Visual Recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 841–844.
  • Liu et al. (2021) Yong Liu, Susen Yang, Chenyi Lei, Guoxin Wang, Haihong Tang, Juyong Zhang, Aixin Sun, and Chunyan Miao. 2021. Pre-training Graph Transformer with Multimodal Side Information for Recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 2853–2861.
  • Mao et al. (2021) Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, and Xiuqiang He. 2021. UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1253–1262.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (2015), 43–52.
  • McInnes et al. (2018) Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. 2018. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. (2018), 861.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. ArXiv abs/1908.10084 (2019).
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019), 1441–1450.
  • Tao et al. (2022) Zhulin Tao, Xiaohao Liu, Yewei Xia, Xiang Wang, Lifang Yang, Xianglin Huang, and Tat-Seng Chua. 2022. Self-supervised Learning for Multimedia Recommendation. IEEE Transactions on Multimedia (2022), 1–10.
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Learning Factorized Multimodal Representations. In International Conference on Learning Representations. 1–20.
  • Wang et al. (2018) Cheng Wang, Mathias Niepert, and Hui Li. 2018. LRMM: Learning to Recommend with Missing Modalities. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3360–3370.
  • Wang et al. (2023) Qifan Wang, Yinwei Wei, Jianhua Yin, Jianlong Wu, Xuemeng Song, Liqiang Nie, and Min Zhang. 2023. DualGNN: Dual Graph Neural Network for Multimedia Recommendation. IEEE Transactions on Multimedia (2023), 1074–1084.
  • Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
  • Wei et al. (2022) Yinwei Wei, Xiang Wang, Xiangnan He, Liqiang Nie, Yong Rui, and Tat-Seng Chua. 2022. Hierarchical User Intent Graph Network for Multimedia Recommendation. IEEE Transactions on Multimedia (2022), 2701–2712.
  • Wei et al. (2020) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, and Tat-Seng Chua. 2020. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. In Proceedings of the 28th ACM International Conference on Multimedia. 3541–3549.
  • Wei et al. (2019) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In Proceedings of the 27th ACM International Conference on Multimedia. 1437–1445.
  • Wu et al. (2021) Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-supervised Graph Learning for Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 726–735.
  • Wu et al. (2019) Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based Recommendation with Graph Neural Networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. 346–353.
  • Xia et al. (2022) Lianghao Xia, Chao Huang, Yong Xu, Jiashu Zhao, Dawei Yin, and Jimmy Huang. 2022. Hypergraph Contrastive Collaborative Filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1–10.
  • Xiao et al. (2022) Fangxiong Xiao, Lixi Deng, Jingjing Chen, Houye Ji, Xiaorui Yang, Zhuoye Ding, and Bo Long. 2022. From Abstract to Details: A Generative Multimodal Fusion Framework for Recommendation. In Proceedings of the 30th ACM International Conference on Multimedia. 258–267.
  • Yuan et al. (2023) Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 1–11.
  • Zhang et al. (2021) Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Shu Wu, Shuhui Wang, and Liang Wang. 2021. Mining Latent Structures for Multimedia Recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 3872–3880.
  • Zhang et al. (2023) Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Mengqi Zhang, Shu Wu, and Liang Wang. 2023. Latent Structures Mining with Contrastive Modality Fusion for Multimedia Recommendation. IEEE Transactions on Knowledge and Data Engineering (2023), 9154–9167.
  • Zhou et al. (2020) Kun Zhou, Haibo Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1893–1902.
  • Zhou et al. (2019) Xiaokang Zhou, Wei Liang, Kevin I-Kai Wang, and Shohei Shimizu. 2019. Multi-Modality Behavioral Influence Analysis for Personalized Recommendations in Health Social Media Environment. IEEE Transactions on Computational Social Systems (2019), 888–897.
  • Zhou and Shen (2023) Xin Zhou and Zhiqi Shen. 2023. A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia. 935–943.
  • Zhou et al. (2023) Xin Zhou, Hongyu Zhou, Yong Liu, Zhiwei Zeng, Chunyan Miao, Pengwei Wang, Yuan You, and Feijun Jiang. 2023. Bootstrap Latent Representations for Multi-modal Recommendation. In Proceedings of the ACM Web Conference 2023. 845–854.

Appendix A Experimental Details

In Section 2, we conducted a visualization analysis of the pre-trained ID embeddings (referred to as IDs) and multimodal features. To ensure the generalizability of our experiments, we randomly selected 10 users from the training set and created a sample set that included all the items they interacted with for analysis. After sampling, we checked for any overlap in the interaction items between users to avoid excessive complexity in the experiment.
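The sampling step can be sketched as follows. This is a minimal illustration, not the authors' script: the `interactions` mapping (user ID to the set of interacted item IDs), the function names, and the seed are all hypothetical.

```python
import random

def sample_user_items(interactions, num_users=10, seed=0):
    """Randomly pick `num_users` users and keep every item each of them interacted with.

    `interactions` is an assumed dict: user_id -> iterable of item ids.
    """
    rng = random.Random(seed)
    users = rng.sample(sorted(interactions), num_users)
    return {u: set(interactions[u]) for u in users}

def overlapping_items(sampled):
    """Return the items that appear in more than one sampled user's history."""
    seen, shared = set(), set()
    for items in sampled.values():
        shared |= seen & items   # items already seen for an earlier user
        seen |= items
    return shared

# Toy data: 20 users with disjoint item histories (illustrative only).
toy = {u: {u * 10 + k for k in range(3)} for u in range(20)}
sample = sample_user_items(toy, num_users=10, seed=0)
overlap = overlapping_items(sample)
```

An empty `overlap` means each plotted item belongs to exactly one sampled user, which keeps the color coding in the visualizations unambiguous.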

For the scatter plots (Figure 2 and Figure 5), we used UMAP (McInnes et al., 2018) to reduce the dimensions of the IDs to 2-D and plotted each ID using the matplotlib package. Points that are closer in the plot indicate that they are also closer in the latent space, revealing structural similarity. Points of the same color (i.e., item IDs interacted with by the same user) that are closer together indicate that the IDs can reflect structural similarity.

For the heatmaps (Figure 3 and Figure 4), we directly computed the cosine similarity between each pair of item IDs, resulting in a similarity matrix of size $43 \times 43$ (where 43 is the number of items), which was visualized as a heatmap. To enhance clarity, we retained only the top 10 values of each row in the similarity matrix and set the rest to 0. The heatmaps directly display the semantic similarity between IDs, and it is evident that IDs of items interacted with by the same user exhibit higher semantic similarity, indicating that IDs can reflect content similarity.
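The heatmap preprocessing described above amounts to a pairwise cosine similarity followed by row-wise top-k pruning. The sketch below uses plain Python lists so that it is self-contained; the function names are ours, and the toy vectors stand in for the pre-trained item ID embeddings.

```python
import math

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between all rows of `vectors`."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            denom = norms[i] * norms[j]
            sim[i][j] = dot / denom if denom > 0 else 0.0
    return sim

def keep_top_k_per_row(matrix, k=10):
    """Keep the k largest entries of each row and set the others to 0."""
    pruned = []
    for row in matrix:
        # Indices of the k largest values; ties keep the lower index first.
        top = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        pruned.append([v if j in top else 0.0 for j, v in enumerate(row)])
    return pruned

# Toy embeddings (illustrative only); the paper uses 43 item ID embeddings.
item_ids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heatmap = keep_top_k_per_row(cosine_similarity_matrix(item_ids), k=2)
```

The pruned matrix can then be passed to any heatmap routine, for example matplotlib's `imshow`.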

Based on these findings, we believe that the IDs can be categorized as content features and structure features.

Appendix B Baselines

We compare IDSF with nine competing methods, which can be divided into two categories: CF methods and multimodal methods. The first category contains three classic recommendation methods that generate recommendations based solely on user-item interactions, with no consideration of modality information.

  • BPR (Rendle et al., 2009) is a classic item ranking method built upon the assumption that a user prefers an interacted item to an unknown one.

  • NGCF (Wang et al., 2019) explicitly models the user-item interactions by a bipartite graph, uses graph convolution operation to learn the topology, and effectively harvests the high-order connectivity and collaborative signals for item recommendation.

  • LightGCN (He et al., 2020) abandons the use of feature transformation and nonlinear activation, and only retains the most important neighbor aggregation module in GCNs for collaborative filtering.
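The pairwise assumption behind BPR, which also serves as the main ranking objective referred to in Appendix C, can be sketched as a loss over (positive, negative) score pairs. This is a simplified illustration: it averages over pairs and omits the regularization term of the original formulation, and the function name is ours.

```python
import math

def bpr_loss(pos_scores, neg_scores):
    """Mean of -ln(sigmoid(pos - neg)) over paired scores.

    pos_scores[k] is the predicted score of an interacted item and
    neg_scores[k] the score of a sampled non-interacted item for the same user.
    """
    total = 0.0
    for p, n in zip(pos_scores, neg_scores):
        diff = p - n
        # -ln(sigmoid(d)) == ln(1 + exp(-d)); written in a numerically stable form.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pos_scores)
```

The loss shrinks as interacted items are scored further above non-interacted ones, which is the ranking assumption stated for BPR above.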

Multimodal models are the second category of our comparison methods that take into account additional modality information for item recommendation.

  • VBPR (He and McAuley, 2016) integrates the visual features and ID embedding of each item as its representations for the matrix factorization.

  • MMGCN (Wei et al., 2019) is one of the state-of-the-art multimodal recommendation models, which constructs a graph convolution network from multiple modalities to improve the structural representation.

  • GRCN (Wei et al., 2020) is also one of the state-of-the-art multimodal recommendation methods. It refines the user-item interaction graph by identifying the false-positive feedback and fixing it.

  • SLMREC (Tao et al., 2022) incorporates contrastive learning into modal-specific graph neural network to promote multimodal fusion.

  • MICRO (Zhang et al., 2023) mines latent item relations and conducts fine-grained multimodal fusion with contrastive learning before collaborative filtering as an auxiliary task.

  • BM3 (Zhou et al., 2023) is a self-supervised multi-modal recommendation model, which requires neither augmentations from auxiliary graphs nor negative samples.

Appendix C Hyperparameters Analysis

In this section, we conduct a sensitivity analysis of the hyperparameters of the graph convolution and the contrastive task, because they are crucial components of our method. We examine the performance of IDSF under various values of $\gamma$ and $\beta$. The value of $\gamma$ controls how much of the target node's multimodal subtle information is retained. We then discuss how the weight $\beta$ of the contrastive task affects performance.

Figure 7. Performance evaluation across various gamma ($\gamma$) values.
Figure 8. Performance evaluation across various beta ($\beta$) values.

Figure 7 reports the performance comparison. $\gamma = 0$ means that no subtle features are retained during graph convolution, since the model then aggregates only the linked neighbors and ignores the self-connections (i.e., structure w/o ID). Moreover, we can observe the following:

  • Performance on Baby, Sports, and Clothing is best when $\gamma$ is set to 0.3, 0.3, and 1.0, respectively, which validates the significance of modality-specific subtle features in structural representations. An appropriate $\gamma$ yields better item representations by aggregating subtle and salient structural features, which boosts recommendation performance.

  • The values of $\gamma$ at which performance peaks differ across the three datasets, because the importance of salient features in revealing item attributes varies with the type of item. Performance declines once $\gamma$ exceeds its best value, since an excessive proportion of subtle features overshadows the effect of the salient features.

  • Furthermore, performance on Clothing improves as $\gamma$ increases, because Clothing is sparser than the other datasets and therefore needs more multimodal item information to provide good recommendations.
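The role of $\gamma$ discussed above can be illustrated with a single propagation step in which each node combines the mean of its neighbors with $\gamma$ times its own embedding. This is only a sketch under that assumed update rule; the exact normalization and layer combination used by IDSF are defined in the method section and are not reproduced here, and the function and variable names are hypothetical.

```python
def propagate(embeddings, neighbors, gamma):
    """One propagation step: neighbor mean plus gamma times the node's own embedding.

    embeddings: dict node -> list of floats (all of the same length)
    neighbors:  dict node -> list of neighbor nodes (missing key = no neighbors)
    gamma:      weight of the self-connection; gamma = 0 drops the node's own features
    """
    out = {}
    for node, vec in embeddings.items():
        nbrs = neighbors.get(node, [])
        agg = [0.0] * len(vec)
        for nb in nbrs:
            for d, value in enumerate(embeddings[nb]):
                agg[d] += value / len(nbrs)
        out[node] = [agg[d] + gamma * vec[d] for d in range(len(vec))]
    return out

# Toy graph (illustrative only): item "i" is linked to nodes "a" and "b".
emb = {"i": [1.0, 0.0], "a": [0.0, 2.0], "b": [0.0, 4.0]}
nbrs = {"i": ["a", "b"]}
without_self = propagate(emb, nbrs, gamma=0.0)["i"]
with_self = propagate(emb, nbrs, gamma=0.3)["i"]
```

With `gamma=0.0` the output for item "i" depends only on its neighbors, matching the "structure w/o ID" setting; larger values retain more of the item's own features.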

The effect of the coefficient $\beta$ on performance is shown in Figure 8. $\beta = 0$ denotes IDSF w/o contrast, which discards the contrastive learning task. We can observe the following:

  • Performance on all datasets first improves as $\beta$ increases and is always better than with $\beta = 0$. With a small $\beta$, the primary task benefits from being jointly optimized with the contrastive self-supervised auxiliary task. This implies that it is important to take the semantic consistency of multimodal fusion into account.

  • As $\beta$ continues to increase, performance starts to drop, indicating that a large $\beta$ interferes with the main recommendation task and that the gain brought by the self-supervised task can be counteracted when $\beta$ exceeds the weight of the BPR task.

  • Overall, there are no sharp rises or falls when $\beta \neq 0$, which indicates that our method is not very sensitive to the choice of the auxiliary task weight.
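The trade-off discussed above can be summarized by a joint objective of the following general form. This is a sketch that assumes the BPR term has a fixed weight of 1 and omits any regularization terms; $\mathcal{L}_{\mathrm{BPR}}$ and $\mathcal{L}_{\mathrm{CL}}$ are our shorthand for the ranking loss and the contrastive loss.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{BPR}} + \beta \, \mathcal{L}_{\mathrm{CL}}
```

Setting $\beta = 0$ recovers the variant without contrastive learning, while $\beta > 1$ makes the auxiliary term weigh more than the recommendation loss, which corresponds to the regime where performance starts to drop.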