ID Embedding as Subtle Features of Content and Structure for Multimodal Recommendation

Yuting Liu [email protected]; Enneng Yang [email protected] Northeastern University, China; Yizhou Dang, Northeastern University, China, [email protected]; Guibing Guo, Northeastern University, China, [email protected]; Qiang Liu, Chinese Academy of Sciences, China, [email protected]; Yuliang Liang, Northeastern University, China, [email protected]; Linying Jiang, Northeastern University, China, [email protected]; and Xingwei Wang, Northeastern University, China, [email protected]


Abstract.

Multimodal recommendation aims to model user and item representations comprehensively with the involvement of multimedia content for effective recommendations. Existing research has shown that it is beneficial for recommendation performance to combine (user- and item-) ID embeddings with multimodal salient features, indicating the value of IDs. However, there is a lack of a thorough analysis of the ID embeddings in terms of feature semantics in the literature. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study regarding its semantics, which we recognize as subtle features of content and structure. Based on our findings, we propose a novel recommendation model by incorporating ID embeddings to enhance the salient features of both content and structure. Specifically, we put forward a hierarchical attention mechanism to incorporate ID embeddings in modality fusing, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolution network for each modality to amalgamate neighborhood and ID embeddings for improving structural representations. Finally, the content and structure representations are combined to form the ultimate item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings. Our code is available at https://anonymous.4open.science/r/IDSF-code/.

Multimodal Recommendation, Subtle Features, Salient Features, Content and Structure


1. Introduction

Research on multimodal recommendation has flourished, primarily thanks to its capability to improve item representations by integrating various sources of multimedia information (Zhou et al., 2020; He et al., 2018; Cao et al., 2019; Liu et al., 2022), with text and image being the two most widely adopted modalities. A prevalent viewpoint in existing research is that the key to enhancing item representations is to extract salient semantic features from multiple modalities. In contrast, ID embedding has not attracted sufficient attention and has not been thoroughly explored in multimodal recommendation, despite its well-demonstrated value in both traditional (Koren et al., 2009; Guo et al., 2015) and multimodal (He and McAuley, 2016; Wei et al., 2019; Tao et al., 2022) recommendation. The specific information captured by ID embeddings remains unclear, resulting in suboptimal strategies for their utilization and leaving ample room for further improvement in multimodal recommendation.

In this paper, we revisit the value of ID embeddings and conduct case studies to explore the intricate details within them, which we refer to as subtle features. We argue that ID embeddings should be regarded as a mixture of subtle features of both content and structure, as shown in Section 2. Based on this analysis, we classify the use of ID embeddings in recommendation into two lines of research. In the first, ID embeddings are regarded as vectors of content features that contain some semantic attributes of an item. They are learned from the historical interactions between users and items via approaches such as matrix factorization (Koren et al., 2009; Guo et al., 2019), and can be concatenated with multimodal features to enrich item representations in multimodal recommendation (He and McAuley, 2016; Chen et al., 2017; Liu et al., 2017). In the second, ID embeddings are interpreted as vectors of structural features that capture the features of all neighboring nodes. Here, ID embeddings are trained on the user-item interaction bipartite graph, where each item is linked with a set of users (i.e., neighbors), through Graph Convolutional Network (GCN) approaches (Gori and Pucci, 2007; Bruna et al., 2014; Defferrard et al., 2016). They can enhance item representations by being aggregated with each modal feature obtained from modal-specific interaction graphs in multimodal recommendation (Wei et al., 2019, 2020; Tao et al., 2022; Wu et al., 2019; Mao et al., 2021; Liu et al., 2021; Wu et al., 2021).

Figure 1. An item consists of textual and visual modalities and ID information. Features in terms of content and structure can be enhanced in each modality separately. The item representation can be obtained by fusing content and structural features. Note that traditional ID embeddings are treated as content or structural features, while our approach takes ID embeddings as a mixture of subtle features of both content and structure.

However, these approaches mainly focus on the salient features within multimodal sources and treat the ID as a whole, combining it with them in a rudimentary manner. Consequently, most research concentrates on extracting and integrating multimodal information without a detailed exploration of the content and structural features within the ID. In our view, ID embeddings reflect the subtle features of items, and their usage should be carefully considered from the standpoint of content and structure, as an independent information source to enhance item representations.

On this basis, we propose a novel multimodal recommender called IDSF that treats ID embeddings as Subtle Features of both content and structure. The ID embeddings provide additional signals to enhance the semantics of the salient features extracted under each modality with regard to content and structure, leading to an improved item representation. Specifically, IDSF consists of two main modules that learn item content and structural features from all modalities. Figure 1 illustrates an intuitive example derived from the Clothing dataset (available at https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html), where an item consists of content and structural features learned from three sources of information (i.e., textual, visual, and item ID). We hierarchically integrate all the information to obtain a better comprehension of the item’s semantics. For content features, we first enhance the salient features from textual and visual content by fusing them with the subtle features from modal-specific ID embeddings. Then, the enhanced modalities are fused through an attention mechanism and contrastive learning. Similarly, we obtain the structural features by merging salient neighborhood features and subtle features from a modality-specific interaction graph. Note that we introduce modal-specific ID embeddings to help avoid mutual interference among different modalities, which is shown to be necessary in Section 2. Finally, the item representation is obtained by combining the content and structural features.

It is worth mentioning that our method enjoys an additional benefit in alleviating the modal missing problem in the multimodal recommendation. That is, some modalities may be unavailable in real applications, leading to a decline in the performance of the existing multimodal recommenders (Cai et al., 2018; Tsai et al., 2019; Li et al., 2014; Cui et al., 2020; Wang et al., 2018; Zhou et al., 2019). Our work can alleviate this problem by making use of subtle features of content and structure derived from ID embeddings and thus maintaining relatively high performance.

Contributions. To sum up, the main contributions of our work are presented as follows:

• We highlight the significance of ID embeddings and undertake a thorough analysis to explore the optimal approach to leveraging ID embeddings in multimodal recommendation. To the best of our knowledge, this is the first attempt to thoroughly explore the detailed information within ID embeddings and to utilize them in a fine-grained manner.

• We propose a novel framework that considers both item ID and modality in terms of content and structure to acquire comprehensive item representations. For content, we design a hierarchical attention mechanism and employ contrastive learning to enhance salient modal features with subtle content features within ID embeddings. For structure, we develop a modal-specific lightweight graph convolution network to incorporate subtle structural features into salient features. Finally, we combine the content and structural features to obtain the final item representations.

• We perform extensive experiments on three real-world datasets (Baby, Sports, and Clothing) to demonstrate that our method outperforms the state-of-the-art recommendation methods. In addition, we find that our approach can effectively mitigate the modal missing problem.

2. ID Embedding Contains More

Yuan et al. (2023) have shown that, although modality-based recommendation has achieved performance comparable to ID-based recommendation thanks to advances in the multimodal domain, ID-based recommendation still dominates with typical architectures. However, the underlying reasons for the remarkable effectiveness of IDs are still not fully understood. To delve deeper into this matter, we use visualizations to analyze ID embeddings and modal features, which lead us to two preliminary conclusions: the subtle features conveyed by ID embeddings, which consist of content and structural features, are expected to be modal-specific, and they can enhance the corresponding original modal features.

We first map the pre-trained item ID embeddings (more details are provided in Appendix A) to 2-dimensional normalized vectors using UMAP (McInnes et al., 2018) and plot their distributions (shown in Figure 2). Note that points of the same color correspond to items that interacted with the same user. In Figure 2, the points of items that interacted with different users form distinct clusters that are clearly visible and distinguished by their colors. This observation indicates that the ID embeddings of items that interacted with the same user are distributed more closely together, which is consistent with the edges in the user-item interaction graph. This suggests that item ID embeddings capture structural features of the user-item interaction graph. In addition, we generate a heat map to visualize the semantic similarity among these item ID embeddings (shown in Figure 3). Each cell in the heat map represents the normalized similarity of two items, where the horizontal and vertical coordinates indicate their respective re-indexes. Note that we assign adjacent coordinates to items that interacted with the same user (e.g., indexes 0-4 correspond to items that interacted with the same user). To improve clarity, we set all values in the similarity matrix to 0, except for the top 10 values in each row. In Figure 3, items that interacted with the same user exhibit a higher similarity, and items that interacted with different users also display a certain semantic similarity due to the inherent content features of the items themselves. Based on this observation, we posit that ID embeddings also capture content features of items. Therefore, to improve the utilization of ID embeddings in multimodal recommendation, we suggest adopting a comprehensive approach that integrates content and structure concurrently.
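The top-10 masking step used for the heat map can be illustrated with a minimal sketch in plain Python. It assumes cosine similarity as the similarity measure; the function names `cosine` and `topk_masked_similarity` and the toy vectors are hypothetical, and the preprocessing actually used for Figure 3 may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def topk_masked_similarity(embeddings, k=10):
    """Item-item similarity matrix in which every row keeps only its
    k largest entries; all other entries are set to 0."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
           for i in range(n)]
    masked = []
    for row in sim:
        keep = set(sorted(range(n), key=lambda j: row[j], reverse=True)[:k])
        masked.append([row[j] if j in keep else 0.0 for j in range(n)])
    return masked

# Toy example: three 2-d vectors standing in for item ID embeddings.
toy = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
masked = topk_masked_similarity(toy, k=2)
```

With `k=2`, each row of `masked` keeps the item itself and its single most similar neighbor, which mirrors how the sparse heat map makes the strongest similarities stand out.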

Moreover, we visualize the item-item similarity matrix of pre-extracted multimodal features in the same way. We conduct analyses on both text and image features to ensure the generalizability of the results. As depicted in Figure 4, the similarity patterns of different modalities are partially inconsistent, which indicates that there are subtle differences in semantics between modalities. Therefore, directly aligning or fusing different modalities for recommendation will lead to a performance drop. To mitigate this issue, we propose to enhance multimodal features with modal-specific ID embeddings (e.g., $tid$ and $vid$), thereby increasing their adaptability for recommendation. To confirm the effectiveness of this design, we plot the distributions of visual and textual features before and after enhancement (shown in Figure 5). The results show that including modal-specific ID embeddings effectively enhances these multimodal features and makes their distributions considerably more discriminable.

Consequently, to obtain optimal item representations, modal-specific ID embeddings are suggested to enhance multimodal features in terms of content and structure, respectively.

3. Our IDSF Model

Based on the findings in Section 2, we speculate that utilizing ID embeddings in a fine-grained manner can optimize item representations for better performance. In this section, we introduce our method that treats ID embeddings as subtle features of both content and structure, to enhance and fuse multimodal salient features.

3.1. Preliminary

We represent the interaction data as a bipartite user-item graph $\mathcal{G}=\{(u,i)|u\in\mathcal{U},i\in\mathcal{I}\}$, where $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items, respectively. An edge $y_{ui}=1$ indicates a positive interaction between user $u$ and item $i$; otherwise $y_{ui}=0$. We denote the original features that have not yet passed through graph convolution as $\textbf{e}_{i}^{m,(0)}$ and $\textbf{e}_{u}^{m,(0)}$ for a specific kind of information $m\in\mathcal{M}$, where $\mathcal{M}$ is the set of all kinds of information. Without loss of generality, we refer to an item’s textual ($t$) and visual ($v$) modalities as its salient features and to its associated item IDs ($tid,vid$) as its subtle features in this work, so that $\mathcal{M}=\{t,v,tid,vid\}$. Future studies will also take into account the multimodal information provided by users. In addition, the higher-order representations of items and users obtained after $k$ layers of graph convolution are denoted by $\textbf{e}_{i}^{m,(k)}$ and $\textbf{e}_{u}^{m,(k)}$, respectively. We derive the modal-specific bipartite graphs $\mathcal{G}_{t}$ and $\mathcal{G}_{v}$ from $\mathcal{G}$ in order to describe the structural information of each modality individually.

3.2. Model Overview

As illustrated in Figure 6, our IDSF model consists of three key components: the item content module, the item structural module, and the user structural module. In the item content module, we aim to enhance the representation of each modality and their fusion by designing a hierarchical attention mechanism. That is, we improve the salient features (such as the textual and visual content in Figure 6) with the associated subtle features in the item ID embeddings of each modality, and fuse the modalities to generate the content representation $\textbf{e}_{i}^{c}$ of item $i$. In the item structural module, we adopt a similar idea to extract an item’s high-order structural representation $\textbf{e}_{i}^{s}$ from the various modalities of item $i$’s neighbors. We implement it with a lightweight multi-layer graph convolution (textual modality on the left and visual modality on the right). The same process holds for the user structural module, which obtains user $u$’s structural representation $\textbf{e}_{u}^{s}$. Finally, we obtain item $i$’s overall embedding by fusing the content representation $\textbf{e}_{i}^{c}$ with the structural representation $\textbf{e}_{i}^{s}$, and take user $u$’s structural representation $\textbf{e}_{u}^{s}$ as the user’s final embedding. The prediction $\hat{y}_{ui}$ is computed as the inner product of user $u$’s and item $i$’s embeddings.

3.3. Item Content Module

Different from the existing works that mainly focus on salient features from multiple modalities, we further take into account subtle features from item ID embeddings. Specifically, we design a hierarchical attention mechanism in Section 3.3.1 to enhance salient content features (i.e., textual and visual) with their corresponding subtle features captured by modal-specific ID embeddings. Then, we introduce a contrastive loss to encourage the consistency among multi-modal information for better fusion in Section 3.3.2.

3.3.1. Modality Enhancement and Fusion

For each modality, we propose a hierarchical attention method to combine salient and subtle information. As shown in the left part of Figure 6, we first enhance the textual salient features $\textbf{e}^{t}_{i}$ with textual subtle features $\textbf{e}_{i}^{tid}$ in textual space through Text Attention to get $\textbf{e}_{i}^{t^{\prime}}$ . Similarly, we obtain visual features $\textbf{e}_{i}^{v^{\prime}}$ by Visual Attention, which integrates the visual salient features $\textbf{e}^{v}_{i}$ with the visual subtle features $\textbf{e}_{i}^{vid}$ in the visual space. Then, we further fuse $\textbf{e}_{i}^{t^{\prime}}$ and $\textbf{e}_{i}^{v^{\prime}}$ through VT Attention to get the comprehensive item content representation $\textbf{e}_{i}^{c}$ .

All three attention mechanisms (Text Attention, Visual Attention, and VT Attention) follow the same procedure. We take the textual content and item ID in the textual space as an example. The modality fusion operation based on the attention mechanism is defined as follows:

$$\textbf{e}^{t^{\prime}}_{i}=\sum_{m\in\{t,\,tid\}}\alpha^{m}_{i}\,\textbf{e}_{i}^{m}, \tag{1}$$

where $\textbf{e}_{i}^{t}$ represents the textual features extracted by Sentence-BERT (Reimers and Gurevych, 2019), $\textbf{e}_{i}^{tid}$ represents the learnable item ID embeddings in the textual space, and $\alpha^{m}_{i}$ indicates the importance of each element $\textbf{e}_{i}^{m}$, calculated as follows:

$$\alpha_{i}^{m}=\text{softmax}\left(\left[\textbf{q}^{\top}\tanh(\textbf{W}\textbf{e}^{m}_{i}+\textbf{b})\right],\;m\in\{t,\,tid\}\right), \tag{2}$$

where $\textbf{q}\in\mathbb{R}^{d}$ denotes the attention vector, and $\textbf{W}\in\mathbb{R}^{d\times d}$ and $\textbf{b}\in\mathbb{R}^{d}$ denote the trainable weight matrix and bias vector, respectively. We compute $\textbf{e}_{i}^{v^{\prime}}$ and $\textbf{e}_{i}^{c}$ in the same way. In this way, we are able to enhance the salient (textual and visual) features with the subtle features of the item.
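The attention-based fusion of Eq. 1 and Eq. 2 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `attention_fuse` is hypothetical, and the parameters `W`, `b`, and `q` are hand-picked toy values rather than learned ones.

```python
import math

def attention_fuse(features, W, b, q):
    """Fuse a list of d-dim vectors (e.g. [e_t, e_tid]).

    score_m = q^T tanh(W e_m + b); alpha = softmax over the inputs;
    output  = sum_m alpha_m * e_m.
    """
    scores = []
    for e in features:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, e)) + bias)
                  for row, bias in zip(W, b)]
        scores.append(sum(qk * hk for qk, hk in zip(q, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [v / total for v in exps]
    d = len(features[0])
    fused = [sum(a * e[k] for a, e in zip(alphas, features)) for k in range(d)]
    return fused, alphas

# Toy 2-d example with hand-picked (not learned) parameters.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
q = [1.0, 1.0]
e_t, e_tid = [1.0, 0.0], [0.0, 1.0]
fused, alphas = attention_fuse([e_t, e_tid], W, b, q)
```

In this symmetric toy case both inputs receive the same score, so the weights are equal and the fused vector is the average of the two inputs. The same routine would apply to the visual pair and to the final fusion of the two enhanced modalities, each with its own parameters.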

3.3.2. Contrastive Learning Constraints

In this paper, we construct a self-supervised contrastive learning task by maximizing the agreement between the item representations before and after fusion. Specifically, we separately maximize the degree of consistency between the representation of different information (e.g., $\textbf{e}_{i}^{t}$ and $\textbf{e}_{i}^{tid}$ ) and the fused representation (e.g., $\textbf{e}_{i}^{t^{\prime}}$ ) of the subtle and salient features. Similar comparisons are made between $\textbf{e}_{i}^{v}$ , $\textbf{e}_{i}^{vid}$ and $\textbf{e}_{i}^{v^{\prime}}$ , and also between $\textbf{e}_{i}^{t^{\prime}}$ , $\textbf{e}_{i}^{v^{\prime}}$ and $\textbf{e}_{i}^{c}$ .

Here, we take the text content and item ID in the textual space as an example to formulate the contrastive learning loss as follows:

$$\bar{\mathcal{L}}_{C}(\textbf{e}^{t},\textbf{e}^{tid},\textbf{e}^{t^{\prime}})=-\frac{1}{4|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{m\in\{t,\,tid\}}\left[I(\textbf{e}_{i}^{m},\textbf{e}_{i}^{t^{\prime}})+I(\textbf{e}_{i}^{t^{\prime}},\textbf{e}_{i}^{m})\right], \tag{3}$$

where $|\mathcal{I}|$ represents the number of items, and the mutual information $I(\cdot,\cdot)$ between two representations is estimated as:

$$I(\textbf{e}_{i}^{m},\textbf{e}_{i}^{t^{\prime}})=\log\frac{I_{ii}}{I_{ii}+I_{ij}},\quad m\in\{t,\,tid\}, \tag{4}$$

where

$$\begin{aligned}I_{ii}&=\exp\left(f(\textbf{e}_{i}^{m},\textbf{e}_{i}^{t^{\prime}})/\tau\right),\\ I_{ij}&=\sum_{j\neq i}\left(\exp\left(f(\textbf{e}_{i}^{m},\textbf{e}_{j}^{t^{\prime}})/\tau\right)+\exp\left(f(\textbf{e}_{i}^{m},\textbf{e}_{j}^{m})/\tau\right)\right),\end{aligned} \tag{5}$$

where $i,j\in\mathcal{I}$, $\tau\in\mathbb{R}$ is a temperature hyper-parameter, and $f(\cdot,\cdot)$ is the similarity function, implemented as cosine similarity.

The overall contrastive learning loss is given by:

$$\mathcal{L}_{C}=\bar{\mathcal{L}}_{C}(\textbf{e}^{t},\textbf{e}^{tid},\textbf{e}^{t^{\prime}})+\bar{\mathcal{L}}_{C}(\textbf{e}^{v},\textbf{e}^{vid},\textbf{e}^{v^{\prime}})+\bar{\mathcal{L}}_{C}(\textbf{e}^{t^{\prime}},\textbf{e}^{v^{\prime}},\textbf{e}^{c}). \tag{6}$$

In this way, we will get a more complete representation of a given item by aligning the representations of the various modalities with the contrastive learning loss.
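One term of the contrastive loss (Eq. 3 to Eq. 5) can be sketched in plain Python as below. The sketch assumes that the temperature divides the similarity inside the exponential, i.e. $\exp(f(\cdot,\cdot)/\tau)$, as in the common InfoNCE form; the function names `info_nce` and `contrastive_loss` and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity f(., .) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def info_nce(A, B, tau=0.5):
    """Mean over items i of log(I_ii / (I_ii + I_ij)) for views A and B."""
    n = len(A)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(A[i], B[i]) / tau)
        # Negatives: other items in the second view and in the first view.
        neg = sum(math.exp(cosine(A[i], B[j]) / tau) +
                  math.exp(cosine(A[i], A[j]) / tau)
                  for j in range(n) if j != i)
        total += math.log(pos / (pos + neg))
    return total / n

def contrastive_loss(views, fused, tau=0.5):
    """Symmetric loss between each input view and the fused view.
    With two views this matches the 1/(4|I|) normalisation of Eq. 3."""
    terms = [info_nce(v, fused, tau) + info_nce(fused, v, tau) for v in views]
    return -sum(terms) / (2 * len(views))

# Toy check: a fused view aligned with its inputs gives a lower loss
# than one whose items are swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss([aligned, aligned], aligned, tau=0.5)
loss_swapped = contrastive_loss([aligned, aligned], swapped, tau=0.5)
```

The overall loss of Eq. 6 would be the sum of three such terms, one for each of the textual, visual, and final fusion steps.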

3.4. Item- and User- Structural Modules

It is well known that the user-item interaction bipartite graph $\mathcal{G}$ in recommender systems contains rich structural information. The higher-order connectivity of the interaction graph depicts how preferences propagate across users and items. Graph neural networks have been widely shown to be effective in capturing higher-order graph structural information. In this section, we elaborate on a new method that aggregates higher-order neighbor information with a lightweight graph convolution, and then fuses the higher-order information of multiple modalities.

Inspired by LightGCN (He et al., 2020) and MMGCN (Wei et al., 2019), for a user (an item) node in a modality-specific interaction graph $\mathcal{G}_{m}$, we use a weighted sum aggregator to gather information from its neighbors, and enhance the result by retaining the user ID embedding $\textbf{e}_{u}^{id}$ or the modality-specific item ID embedding $\textbf{e}_{i}^{(m)id}$ of the target node with a ratio $\gamma$. The lightweight message propagation over $k$-hop neighbors is expressed as follows:

$$\textbf{e}_{i}^{m,(k)}=\sum_{u\in\mathcal{N}_{i}}\frac{1}{\sqrt{|\mathcal{N}_{i}|}\sqrt{|\mathcal{N}_{u}|}}\textbf{e}_{u}^{m,(k-1)}+\gamma\,\textbf{e}^{(m)id}_{i}, \tag{7}$$

$$\textbf{e}_{u}^{m,(k)}=\sum_{i\in\mathcal{N}_{u}}\frac{1}{\sqrt{|\mathcal{N}_{u}|}\sqrt{|\mathcal{N}_{i}|}}\textbf{e}_{i}^{m,(k-1)}+\gamma\,\textbf{e}^{id}_{u}, \tag{8}$$

where $m\in\{t,v\}$, $\textbf{e}_{i}^{m,(0)}=\textbf{e}_{i}^{m}$, and $\textbf{e}_{i}^{(m)id}$ is the item ID embedding reused from the corresponding textual or visual space of the multimodal fusion module, i.e., $\textbf{e}_{i}^{(m)id}\in\{\textbf{e}_{i}^{vid},\textbf{e}_{i}^{tid}\}$. $\textbf{e}_{u}^{id}$ represents the user ID embedding, which is a learnable variable initialized at random. $\mathcal{N}_{u}$ and $\mathcal{N}_{i}$ denote the sets of neighbors of user $u$ and item $i$, respectively, and $\gamma$ is a hyper-parameter that controls the amount of subtle information retained from the target node. The symmetric normalization term $\frac{1}{\sqrt{|\mathcal{N}_{i}|}\sqrt{|\mathcal{N}_{u}|}}$ follows the design of the standard GCN and prevents the scale of the embeddings from increasing with graph convolution operations.
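A single propagation layer of Eq. 7 and Eq. 8 on one modality-specific graph can be sketched in plain Python as follows. The dictionary-based graph representation, the function name `propagate`, and the toy values are assumptions made for illustration; a practical implementation would more likely use sparse matrix operations.

```python
import math

def propagate(item_emb, user_emb, item_id_emb, user_id_emb, edges, gamma):
    """One layer: symmetric-normalised neighbour sum plus gamma times
    the target node's ID embedding. `edges` is a list of (user, item)."""
    n_i = {i: [u for (u, j) in edges if j == i] for i in item_emb}  # N_i
    n_u = {u: [i for (v, i) in edges if v == u] for u in user_emb}  # N_u

    def norm(u, i):
        return 1.0 / (math.sqrt(len(n_i[i])) * math.sqrt(len(n_u[u])))

    d = len(next(iter(item_emb.values())))
    new_item = {
        i: [sum(norm(u, i) * user_emb[u][k] for u in n_i[i])
            + gamma * item_id_emb[i][k] for k in range(d)]
        for i in item_emb
    }
    new_user = {
        u: [sum(norm(u, i) * item_emb[i][k] for i in n_u[u])
            + gamma * user_id_emb[u][k] for k in range(d)]
        for u in user_emb
    }
    return new_item, new_user

# Toy graph: two users who both interacted with a single item.
edges = [(0, 0), (1, 0)]
user_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
item_emb = {0: [1.0, 1.0]}
item_id_emb = {0: [2.0, 2.0]}
user_id_emb = {0: [0.0, 0.0], 1: [0.0, 0.0]}
new_item, new_user = propagate(item_emb, user_emb, item_id_emb,
                               user_id_emb, edges, gamma=0.5)
```

In the toy graph the item has two neighbors and each user has one, so every edge weight is $1/\sqrt{2}$; the item additionally receives $\gamma$ times its ID embedding. Stacking $K$ calls of this routine yields the layer-wise embeddings that are averaged afterwards.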

After computing the high-order embeddings at each of the $K$ layers via Eq. 7 and Eq. 8, we stack the embeddings of all layers (including layer 0) and take their unweighted arithmetic mean to obtain the modal-specific representations of a user (an item). Finally, we derive the structural representations of item $i$ and user $u$ by an attentional combination of the multimodal representations, as defined in Eq. 9 and Eq. 10, respectively:

$$\textbf{e}_{i}^{s}=\sum_{m\in\{t,v\}}\alpha_{i}^{m}\cdot\left(\frac{1}{K+1}\sum_{k=0}^{K}\textbf{e}_{i}^{m,(k)}\right), \tag{9}$$

$$\textbf{e}_{u}^{s}=\sum_{m\in\{t,v\}}\alpha_{u}^{m}\cdot\left(\frac{1}{K+1}\sum_{k=0}^{K}\textbf{e}_{u}^{m,(k)}\right), \tag{10}$$

where $\alpha_{i}^{m}$ and $\alpha_{u}^{m}$ represent the importance of modality $m$ for item $i$ and user $u$, respectively, and are computed in the same way as in Eq. 2.
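The layer averaging and modality weighting of Eq. 9 and Eq. 10 can be illustrated with a short plain-Python sketch. The attention weights are passed in as fixed numbers here instead of being computed by the attention mechanism, and the function names `layer_mean` and `structural_representation` are hypothetical.

```python
def layer_mean(layers):
    """Unweighted mean of the K+1 layer embeddings e^(0), ..., e^(K)."""
    d = len(layers[0])
    return [sum(layer[k] for layer in layers) / len(layers) for k in range(d)]

def structural_representation(per_modality_layers, alphas):
    """Attention-weighted sum over modalities of the layer means."""
    means = {m: layer_mean(ls) for m, ls in per_modality_layers.items()}
    d = len(next(iter(means.values())))
    return [sum(alphas[m] * means[m][k] for m in means) for k in range(d)]

# K = 2, so each modality contributes three layer embeddings.
layers = {
    "t": [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]],
    "v": [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]],
}
e_s = structural_representation(layers, {"t": 0.25, "v": 0.75})
```

Here the textual layers average to `[2.0, 0.0]` and the visual layers to `[0.0, 2.0]`, so the weighted combination is `[0.5, 1.5]`.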

3.5. Recommendation Prediction and Optimization

The three above-mentioned key modules allow us to extract the content representation ( $\textbf{e}_{i}^{c}$ ) for item $i$ as well as the high-order structure representation ( $\textbf{e}_{i}^{s}$ and $\textbf{e}_{u}^{s}$ ) for item $i$ and user $u$ , respectively. Therefore, we can predict user $u$ ’s preference for item $i$ as follows:

$$\hat{y}_{ui}={\textbf{e}_{u}^{s}}^{\top}(\textbf{e}_{i}^{c}+\textbf{e}_{i}^{s}), \tag{11}$$

where $\textbf{e}_{i}^{c}+\textbf{e}_{i}^{s}$ indicates the final representation of item $i$, obtained by combining its content and structure. (We do not introduce MLPs to map them, since the structure and content features reside in the same latent space.)

To train our model, we use Bayesian Personalized Ranking (BPR) (Rendle et al., 2009) loss. BPR is a well-known pairwise loss that encourages the prediction of an observed user-item pair ( $u,i$ ) to be higher than its unobserved counterparts ( $u,j$ ):

$$\mathcal{L}_{BPR}=-\sum_{u\in\mathcal{U}}\sum_{i\in\mathcal{N}_{u}}\sum_{j\notin\mathcal{N}_{u}}\ln\sigma(\hat{y}_{ui}-\hat{y}_{uj}), \tag{12}$$

where $\sigma(\cdot)$ is the sigmoid function, $\mathcal{N}_{u}$ represents the set of all items that the user $u$ has interacted with.

We simultaneously optimize the multimodal contrastive loss and the BPR loss, that is, the overall objective function can be written as follows:

$$\mathcal{L}=\mathcal{L}_{BPR}+\beta\mathcal{L}_{C}+\lambda\|\Theta\|_{2}^{2}, \tag{13}$$

where $\beta$, $\lambda$, and $\Theta$ represent the strength of the contrastive loss, the strength of the $L_{2}$ regularization, and the learnable parameters of the model, respectively.
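The prediction and the training objective (Eq. 11 to Eq. 13) can be put together in a small plain-Python sketch. It assumes that the BPR sum is evaluated over already paired positive and negative scores, for example from sampled negatives, rather than over all unobserved items; the function names and toy numbers are hypothetical.

```python
import math

def score(e_u, e_c, e_s):
    """Inner product of the user embedding with e_c + e_s."""
    return sum(u * (c + s) for u, c, s in zip(e_u, e_c, e_s))

def bpr_loss(pos_scores, neg_scores):
    """-sum ln sigmoid(y_ui - y_uj) over paired scores."""
    loss = 0.0
    for y_ui, y_uj in zip(pos_scores, neg_scores):
        diff = y_ui - y_uj
        # -ln(sigmoid(x)) = ln(1 + exp(-x)), written to avoid overflow
        loss += math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)
    return loss

def total_loss(l_bpr, l_c, params, beta, lam):
    """L = L_BPR + beta * L_C + lambda * ||Theta||_2^2."""
    l2 = sum(p * p for vec in params for p in vec)
    return l_bpr + beta * l_c + lam * l2

# Toy numbers only; they are not taken from the experiments.
y = score([1.0, 2.0], [0.5, 0.0], [0.5, 1.0])
l_bpr = bpr_loss([0.0], [0.0])
l_total = total_loss(1.0, 2.0, [[1.0, 2.0], [3.0]], beta=0.3, lam=1e-4)
```

When the positive and negative scores are equal, each pair contributes $\ln 2$ to the BPR loss, and the loss shrinks as the positive item is ranked further above the negative one.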

Compared with other GCN-based multimodal recommendation algorithms, our method does not increase the time complexity but achieves better results, as verified by the extensive experiments in Section 4.

4. Experiments and Analysis

4.1. Experimental Settings

Datasets. Three widely used Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/links.html) are used in our experiments (McAuley et al., 2015): (a) Baby, (b) Sports and Outdoors, and (c) Clothing, Shoes and Jewelry, referred to as Baby, Sports, and Clothing for brevity. Table 1 summarizes the statistics of the three datasets. In addition to user-item interactions, these datasets contain both visual and textual modalities of items. We extract visual features with a CNN, and textual features with sentence-transformers (Reimers and Gurevych, 2019) applied to the concatenation of the title, description, category, and brand of each item. All datasets are split into training, validation, and testing subsets with a ratio of 8:1:1. Three common metrics, $Precision@K$, $Recall@K$, and $NDCG@K$, are used for evaluation, with $K\in\{10,20\}$ by default.
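For reference, the three metrics can be computed per user as in the following plain-Python sketch, which assumes binary relevance and the common $\log_2$ discount for NDCG. The function names and the toy ranking are hypothetical, and evaluation details such as averaging over users are omitted.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k list divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: one hit ("a") at rank 1 among three held-out items.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d", "z"}
p = precision_at_k(ranked, relevant, 3)
r = recall_at_k(ranked, relevant, 3)
n = ndcg_at_k(ranked, relevant, 3)
```

In this example both precision and recall at 3 equal 1/3, while NDCG is higher because the single hit sits at the top of the list.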

Table 1. Statistics of the datasets.

Dataset	Modalities	# Items	# Users	# Interactions
Baby	visual, textual	7,050	19,445	139,110
Sports	visual, textual	18,357	35,598	256,308
Clothing	visual, textual	23,033	39,387	237,488

Table 2. Performance of all comparison methods. Each row’s second-best score is underlined and the top score is highlighted in bold. The final column indicates the percentage of performance improvement relative to the second best one.

Dataset	Metric( $\times$ 100 $\%$ )	CF Methods			Multi-modal Methods							Improve
Dataset	Metric( $\times$ 100 $\%$ )	BPR	NGCF	LightGCN	VBPR	MMGCN	GRCN	SLMREC	MICRO	BM3	IDSF	Improve
Baby	Recall@10	2.935	3.567	4.509	3.103	3.861	4.443	5.233	5.807	4.481	6.143	5.8%
	Precision@10	0.308	0.378	0.478	0.332	0.408	0.473	0.585	0.608	0.470	0.645	6.0%
	NDCG@10	1.588	1.964	2.498	1.749	1.996	2.392	2.885	3.218	2.343	3.431	6.6%
	Recall@20	4.566	5.751	7.248	4.863	6.240	7.895	7.775	8.884	6.688	9.505	6.9%
	Precision@20	0.243	0.307	0.383	0.262	0.332	0.422	0.445	0.466	0.354	0.499	7.1%
	NDCG@20	2.033	2.552	3.212	2.330	2.605	3.552	3.541	4.024	2.931	4.311	7.1%
Sports	Recall@10	2.977	4.925	5.786	3.560	3.352	5.879	6.559	6.671	5.947	7.001	5.0%
	Precision@10	0.318	0.521	0.611	0.382	0.355	0.627	0.693	0.701	0.659	0.734	4.7%
	NDCG@10	1.712	2.693	3.376	2.086	1.763	3.310	3.726	3.728	3.158	4.006	7.5%
	Recall@20	4.386	7.471	8.502	5.211	5.466	8.752	9.834	9.951	9.153	10.41	4.6%
	Precision@20	0.236	0.397	0.451	0.281	0.291	0.467	0.518	0.525	0.517	0.548	4.4%
	NDCG@20	2.095	3.371	4.102	2.533	2.284	4.078	4.588	4.599	3.978	4.914	6.8%
Clothing	Recall@10	1.322	2.693	3.684	2.199	1.996	4.006	4.964	5.047	4.144	5.381	6.6%
	Precision@10	0.135	0.273	0.373	0.223	0.202	0.407	0.502	0.436	0.656	0.545	6.5%
	NDCG@10	0.715	1.441	2.032	1.236	1.014	2.164	2.711	2.226	3.152	2.966	6.5%
	Recall@20	1.938	4.095	5.407	3.175	3.231	6.193	7.471	7.674	6.388	7.915	3.1%
	Precision@20	0.099	0.208	0.274	0.161	0.164	0.315	0.378	0.388	0.333	0.401	3.2%
	NDCG@20	0.875	1.801	2.471	1.456	1.325	2.721	3.347	3.451	2.834	3.611	4.6%
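The Improve column reports the relative gain of IDSF over the second-best method. A minimal sketch of that calculation for the Baby Recall@10 row is given below; the function name `relative_improvement` is hypothetical, and other rows may deviate slightly from the table because the reported scores are rounded.

```python
def relative_improvement(best, second_best):
    """Relative gain of the best score over the second best, in percent."""
    return 100.0 * (best - second_best) / second_best

# Baby, Recall@10: IDSF = 6.143, second best (MICRO) = 5.807.
gain = round(relative_improvement(6.143, 5.807), 1)
```

For this row the result is 5.8, which agrees with the value listed in the table.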

Baselines. We compare IDSF with nine competing methods, which can be divided into two categories: CF methods and multimodal methods. The first category contains three classic recommendation methods that generate recommendations solely from user-item interactions, with no consideration of modality information: BPR (Rendle et al., 2009), NGCF (Wang et al., 2019), and LightGCN (He et al., 2020). The second category contains methods that take additional modality information into account when learning item representations: VBPR (He and McAuley, 2016), MMGCN (Wei et al., 2019), GRCN (Wei et al., 2020), SLMREC (Tao et al., 2022), MICRO (Zhang et al., 2023), and BM3 (Zhou et al., 2023). MICRO is an extended version of LATTICE (Zhang et al., 2021), so we omit the performance of LATTICE. Details about the baselines are provided in Appendix B.

Hyperparameters. For a fair comparison, we set the embedding dimension to 128 and the batch size to 1024, initialize all model parameters with the Xavier initializer (Glorot and Bengio, 2010), and optimize them with the Adam optimizer (Kingma and Ba, 2015). Other hyper-parameters are determined via grid search on the validation set. The learning rate is set to $0.0005$, the coefficient of $\ell_{2}$ regularization to $10^{-4}$, and the temperature parameter $\tau$ to 0.5. The coefficient $\beta$, which controls the effect of the contrastive multimodal fusion task, is set to 0.3 for Baby and 1.0 for Sports and Clothing. The coefficient $\gamma$, which controls the amount of subtle (ID) information retained from the target node (Section 3.4), is set to 0.3 for Baby, 0.3 for Sports, and 1.0 for Clothing. We set the number of GCN layers to $K=2$, the same as LightGCN (He et al., 2020). To avoid overfitting, we adopt an early-stop strategy that halts training when Recall@20 on the validation set has not improved for $5$ epochs. For the comparison methods, we follow their original settings to achieve the best performance. Due to space limitations, the analysis of hyperparameters is given in Appendix C.
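The early-stop rule can be read as a patience counter on validation Recall@20. The sketch below shows one plausible interpretation, in which training stops once the metric has failed to exceed its best value for 5 consecutive epochs; the function name `early_stop_epoch` and the toy sequence are hypothetical, and the exact rule in the released code may differ.

```python
def early_stop_epoch(val_recalls, patience=5):
    """Return the (0-based) epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("-inf")
    since_best = 0
    for epoch, recall in enumerate(val_recalls):
        if recall > best:
            best, since_best = recall, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None

# Toy validation curve: the best value is reached at epoch 1.
stop = early_stop_epoch([0.10, 0.20, 0.19, 0.18, 0.20, 0.15, 0.10])
```

In the toy curve no later epoch strictly exceeds the value reached at epoch 1, so the counter reaches 5 at epoch 6 and training stops there.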

Table 3. Performance of different variants of our IDSF.

Dataset	Baby						Sports						Clothing
Metric( $\times$ 100 $\%$ )	R@10	P@10	N@10	R@20	P@20	N@20	R@10	P@10	N@10	R@20	P@20	N@20	R@10	P@10	N@10	R@20	P@20	N@20
w/o content	5.211	0.551	2.865	8.258	0.437	3.671	5.539	0.584	3.071	8.383	0.444	3.831	3.963	0.401	2.181	5.899	0.298	2.673
content w/o contrast	5.551	0.582	3.089	8.654	0.455	3.886	5.754	0.608	3.181	8.763	0.464	3.981	4.175	0.422	2.285	6.229	0.315	2.811
content w/o ID	5.923	0.625	3.236	9.276	0.488	4.114	6.857	0.721	3.897	10.21	0.538	4.783	4.927	0.499	2.668	7.298	0.369	3.271
structure w/o ID	5.629	0.591	3.152	8.611	0.452	3.931	6.498	0.682	3.573	9.653	0.507	4.403	5.151	0.521	2.781	7.698	0.391	3.433
IDSF	6.143	0.645	3.431	9.505	0.499	4.311	7.001	0.734	4.006	10.41	0.548	4.914	5.381	0.545	2.966	7.915	0.401	3.611

Table 4. Performance comparison between enhanced modalities and original modalities. Original visual, original textual, and original fused denote that the content and structure features are obtained from the original modalities without the subtle features, while enhanced visual, enhanced textual, and enhanced fused denote that the corresponding original features are enhanced by the subtle features.

Dataset	Baby						Sports						Clothing
Metric( $\times$ 100 $\%$ )	R@10	P@10	N@10	R@20	P@20	N@20	R@10	P@10	N@10	R@20	P@20	N@20	R@10	P@10	N@10	R@20	P@20	N@20
original visual	3.068	0.327	1.607	5.356	0.285	2.217	2.046	0.214	1.012	3.523	0.186	1.406	1.284	0.131	0.611	2.361	0.119	0.882
original textual	3.164	0.333	1.735	5.628	0.292	2.392	2.075	0.218	1.046	3.724	0.197	1.482	1.378	0.139	0.942	2.493	0.127	0.942
original fused	5.048	0.529	2.619	8.221	0.431	3.451	5.983	0.628	3.287	9.113	0.479	4.111	4.614	0.466	2.437	7.203	0.365	3.097
enhanced visual	5.639	0.593	3.112	8.866	0.466	3.958	6.831	0.719	3.861	10.11	0.535	4.738	4.798	0.485	2.627	7.041	0.357	3.199
enhanced textual	5.956	0.625	3.224	9.251	0.481	4.085	6.925	0.729	3.911	10.32	0.539	4.806	5.224	0.527	2.826	7.799	0.394	3.508
enhanced fused	6.143	0.645	3.431	9.505	0.499	4.311	7.001	0.734	4.006	10.41	0.548	4.914	5.381	0.545	2.966	7.915	0.401	3.611

4.2. Performance Comparison

The comparative results are summarized in Table 2, from which we can find that our proposed IDSF method outperforms both CF methods and multimodal methods. Generally, multimodal methods perform better than CF methods, implying the value of extracting salient features from multimodal information for item representation.

For the CF methods without considering multimodal information, NGCF performs better than BPR since the former method captures high-order structural features from the interactions between users and items, implying that it is significant to model structural features through graphs. In comparison, LightGCN performs the best in CF methods thanks to its adoption of an optimized lightweight aggregation method (rather than the aggregation in NGCF). This aggregation method obtains more precise structural features, demonstrating that optimized structural features can help improve recommendation performance.

Among the multimodal methods, VBPR obtains the worst performance because it considers only visual content features, without structural features. Nevertheless, the fact that VBPR beats BPR to a certain extent implies the effectiveness of content features. MMGCN performs better by extracting additional structural features of the textual and visual modalities. Meanwhile, MMGCN outperforms NGCF under the same convolution method, which indicates that it is important to improve the structural features with multimodal information. Furthermore, GRCN refines the structural features by identifying and fixing false-positive feedback, whereby better performance is obtained. This suggests that strengthening the structural features is beneficial in multimodal recommendation. However, LightGCN achieves the same performance as GRCN merely by optimizing the aggregation method, emphasizing the effect of optimized aggregation in improving the structural features. MICRO gains comparable sub-optimal performance owing to its fine-grained multimodal fusion with contrastive learning, which highlights the significance of contrastive learning in multimodal fusion.

Finally, our proposed IDSF method beats the second-best comparison method in terms of the three metrics by around 5.8-7.1% on Baby, 4.4-7.5% on Sports, and 3.1-6.6% on Clothing. We attribute these improvements mainly to the fine-grained use of ID embeddings for multimodal recommendation, which provides valuable subtle features (extracted from ID embeddings) that complement the salient features (from multimodal information). This complementation helps to better learn item representations from the perspectives of both content and structure.
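For reference, the three metrics reported in the tables (R@K, P@K, and N@K) can be computed per user as sketched below. The definitions used are the common binary-relevance forms of Recall, Precision, and NDCG; this section does not spell out the exact evaluation protocol, so the function names and the treatment of edge cases are assumptions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the user's held-out items that appear in the top-k list."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k list that consists of held-out items."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)            # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)       # best case: all hits ranked first
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Dataset-level scores would then be the average of these per-user values over all test users.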

4.3. Ablation Studies

To explore the effects of content features, contrastive learning, and ID embeddings, we compare the results of four variants: w/o content, which discards item content features; content w/o contrast, which omits the contrastive loss; content w/o ID, which skips fusing subtle features into content features; and structure w/o ID, which leaves out ID embeddings when obtaining structural features. Table 3 summarizes the performance of the different variants of IDSF, from which we make the following observations.

Without the support of content features, the performance of IDSF w/o content decreases significantly, indicating that content features, which most explicitly reveal the multimodal information of items, are essential. Comparing content w/o contrast and content w/o ID, we observe that it is necessary to introduce contrastive learning into the content enhancement with subtle features: without contrastive learning, ID embeddings tend to generate noisier subtle features, which degrades performance. Content w/o ID surpasses structure w/o ID on Baby and Sports (though not on Clothing), since structural features are more sensitive to ID embeddings. Therefore, for structural features, we use hyperparameters to control the amount of ID information injected during the convolution operation.

Finally, IDSF outperforms all variants on three datasets across these three evaluation metrics, further validating the significance of the enhancement with subtle features. It shows that subtle features can complement salient features and increase the appropriateness of content and structural representation to produce more accurate recommendations.
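To make the size of each ablation effect concrete, the relative drop of every variant with respect to the full model can be computed directly from Table 3. The sketch below uses the R@20 column only; the helper `relative_drop` is a hypothetical name, and the numbers are copied from the table rather than re-measured.

```python
# R@20 (x100%) copied from Table 3; tuple order is Baby, Sports, Clothing.
DATASETS = ("Baby", "Sports", "Clothing")
R20 = {
    "w/o content":          (8.258, 8.383, 5.899),
    "content w/o contrast": (8.654, 8.763, 6.229),
    "content w/o ID":       (9.276, 10.21, 7.298),
    "structure w/o ID":     (8.611, 9.653, 7.698),
    "IDSF":                 (9.505, 10.41, 7.915),
}


def relative_drop(full, variant):
    """Percentage by which `variant` falls below the full model `full`."""
    return (full - variant) / full * 100.0


# Relative R@20 drop of each variant against the full IDSF model, per dataset.
drops = {
    name: {d: relative_drop(f, v) for d, f, v in zip(DATASETS, R20["IDSF"], scores)}
    for name, scores in R20.items()
    if name != "IDSF"
}

for name, per_dataset in drops.items():
    print(name, {d: round(x, 1) for d, x in per_dataset.items()})
```

With these numbers, removing content features costs roughly 13.1% of R@20 on Baby, whereas removing ID embeddings from the content branch costs about 1.9% on Sports.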

4.4. The Contribution of Enhancement in Modality Missing

The modality missing problem remains a prominent challenge in multimodal recommendation. ID embeddings provide rich subtle information about content and structure, which can be exploited to enhance the expressiveness of salient features. In this subsection, we conduct incomplete-modality experiments to explore the contribution of the proposed enhancement when modalities are missing. Table 4 compares the performance of modalities enhanced with subtle features against that of the original modalities.

Comparing each original single modality with its enhanced counterpart, we observe that the proposed enhancement brings substantial improvements; for example, R@20 on Baby rises from 5.356 to 8.866 for the visual modality and from 5.628 to 9.251 for the textual modality. Meanwhile, the fusion of enhanced textual and visual features outperforms the fusion of the original features, demonstrating the value of subtle features for better representation of items. In other words, our proposed enhancement can alleviate the degradation of recommendation performance when certain modalities are missing.

Moreover, the textual modality is generally more useful than the visual modality for modeling items. This can be explained by the fact that the textual modality provides finer-grained attributes such as categories and descriptions of items. Using multiple modalities performs significantly better than using a single modality, since the multimodal combination makes item representations more comprehensive. The enhanced modalities follow the same trend, which implies that our proposed enhancement by subtle features extracted from ID embeddings does not interfere with the advantage of multiple modalities over a single modality.

5. Related Work

5.1. ID-based Recommendation

ID embeddings contain abundant and important internal characteristics and are essential in the existing recommendation literature (Yuan et al., 2023). For non-sequential tasks, recommendation models have evolved from early item-item collaborative filtering (Linden et al., 2003), through various matrix factorization-based approaches (Koren et al., 2009; Guo et al., 2019, 2015; He et al., 2017; Guo et al., 2017), to graph-based methods (wei Wei et al., 2020; He et al., 2020; Mao et al., 2021; Wu et al., 2021; Xia et al., 2022), taking user-item pairs as input to predict matching scores between users and items. For sequential tasks, recommendation models use a sequential prediction model such as an RNN, LSTM, GRU, or Transformer as the backbone and take a user’s historical sequence of items as input to generate the next interactions (Kang and McAuley, 2018; Cui et al., 2020; Wu et al., 2019; Sun et al., 2019; Dang et al., 2023). These recommendation methods treat the user/item ID as a complete representation for making predictions.

5.2. ID in Multimodal Recommendation

Multimodal recommendations leverage ID embedding as a component, yet they do not assign significant importance to this particular technique. Matrix factorization-based approaches incorporate ID embeddings derived from historical interactions as content features for modeling user preferences. In this context, multimodal features are concatenated with ID embeddings as side information to enrich item and user representations. For instance, VBPR (He and McAuley, 2016) adds the score obtained by the inner product of item image features and user visual preference in prediction calculation as an extension of ID embeddings. DeepStyle (Liu et al., 2017) employs a shared user latent factor to interact with both image features and item ID embeddings. ACF (Chen et al., 2017) utilizes attention mechanisms to encode user preferences while incorporating user and item IDs as latent factors, which are subsequently aggregated.

In some GCN-based methods (wei Wei et al., 2019, 2020), ID embeddings are integrated into modality-specific structural features as a unique cross-modal global feature. These methods create modality-aware interaction graphs and perform information propagation and fusion to aggregate neighbors’ information for target nodes, with ID embeddings serving as connections between modalities. SLMREC (Tao et al., 2022) recognizes the underlying semantic role of ID embeddings in recommendation and treats ID embeddings as a general modality, performing graph convolution on the "ID-modality".

Moreover, a few works use ID embeddings just as in conventional recommendation. For example, MICRO (Zhang et al., 2023), LATTICE (Zhang et al., 2021), BM3 (Zhou et al., 2023), and FREEDOM (Zhou and Shen, 2023) use ID embeddings only for the needs of the collaborative filtering task and treat the multimodal module as an auxiliary task, without considering the connections between ID embeddings and the modalities. CMBF (Chen et al., 2021), DualGNN (Wang et al., 2023), and HUIGN (Wei et al., 2022) abandon IDs and simply concatenate multimodal content to replace item IDs for matrix factorization or graph convolution.

Recently, Xiao et al. (2022) proposed a generative multimodal fusion framework (GMMF) for the CTR prediction task, which improves multimodal features by generating new visual and textual representations with a Difference Set Network (DSN). It maps the item ID into the modal space and treats it as a "special modality" to model the differences and connections between modalities, as does PAMD (Han et al., 2022).

Although their usage of IDs varies, these methods all regard the ID as a whole without explanation. In contrast, we contend that a more nuanced exploration of ID utilization is warranted. Our method goes a step further by uncovering the underlying information in IDs and using it in a fine-grained manner to model item representations more comprehensively.

6. Conclusion and Future Work

In this paper, we revisit the significance of ID embeddings and conduct analyses for ID embeddings and multimodal features in the context of multimodal recommendation. Leveraging the subtle features within ID embeddings and recognizing the distinctions between modalities, we propose a novel multimodal recommendation method called IDSF to use subtle content and structural features in ID embeddings effectively, reaching desirable performance.

For future work, we intend to further consider the subtle features in sequential recommendation and find appropriate methods to integrate them. For users, exposed salient content features are unavailable in some datasets, so we plan to derive this information from users’ interactive behaviors such as historical reviews. Moreover, in an era dominated by large-scale models, the exploration of ID embeddings in pre-training and fine-tuning is worth further investigation.

References

Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur D. Szlam, and Yann LeCun. 2014. Spectral Networks and Locally Connected Networks on Graphs. CoRR (2014).
Cai et al. (2018) Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. 2018. Deep Adversarial Learning for Multi-Modality Missing Data Completion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1158–1166.
Cao et al. (2019) Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying Knowledge Graph Learning and Recommendation: Towards a Better Understanding of User Preferences. In The World Wide Web Conference. 151–161.
Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–344.
Chen et al. (2021) Xi Chen, Yangsiyi Lu, Yuehai Wang, and Jianyi Yang. 2021. CMBF: Cross-Modal-Based Fusion Recommendation Algorithm. Sensors (2021), 1–14.
Cui et al. (2020) Qiang Cui, Shu Wu, Q. Liu, Wen Zhong, and Liang Wang. 2020. MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation. IEEE Transactions on Knowledge and Data Engineering (2020), 317–331.
Dang et al. (2023) Yizhou Dang, Enneng Yang, Guibing Guo, Linying Jiang, Xingwei Wang, Xiaoxiao Xu, Qinghui Sun, and Hong Liu. 2023. Uniform sequence better: time interval aware data augmentation for sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence. 4225–4232.
Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 34. 1–9.
Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics. 1–8.
Gori and Pucci (2007) Marco Gori and Augusto Pucci. 2007. ItemRank: A Random-Walk Based Scoring Algorithm for Recommender Engines. In International Joint Conference on Artificial Intelligence. 2766–2771.
Guo et al. (2019) Guibing Guo, Enneng Yang, Li Shen, Xiaochun Yang, and Xiaodong He. 2019. Discrete Trust-aware Matrix Factorization for Fast Recommendation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019. 1380–1386.
Guo et al. (2015) Guibing Guo, Jie Zhang, and N. Yorke-Smith. 2015. TrustSVD: Collaborative Filtering with Both the Explicit and Implicit Influence of User Trust and of Item Ratings. In Association for the Advancement of Artificial Intelligence. 123–129.
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine Based Neural Network for CTR Prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
Han et al. (2022) Tengyue Han, Pengfei Wang, Shaozhang Niu, and Chenliang Li. 2022. Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation. In Proceedings of the ACM Web Conference 2022. 2058–2066.
He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 144–150.
He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
He et al. (2018) Xiangnan He, Zhankui He, Xiaoyu Du, and Tat-Seng Chua. 2018. Adversarial Personalized Ranking for Recommendation. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 355–364.
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. Proceedings of the 26th International Conference on World Wide Web (2017), 1–10.
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. 2018 IEEE International Conference on Data Mining (ICDM) (2018), 197–206.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR (2015).
Koren et al. (2009) Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer (2009), 30–37.
Li et al. (2014) Rongjian Li, Wenlu Zhang, Heung-Il Suk, Li Wang, Jiang Li, Dinggang Shen, and Shuiwang Ji. 2014. Deep Learning Based Imaging Data Completion for Improved Brain Disease Diagnosis. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014. 305–312.
Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Distributed Syst. Online (2003), 76–80.
Liu et al. (2022) Hao Liu, Yongxin Tong, Jindong Han, Panpan Zhang, Xinjiang Lu, and Hui Xiong. 2022. Incorporating Multi-Source Urban Data for Personalized and Context-Aware Multi-Modal Transportation Recommendation. IEEE Transactions on Knowledge and Data Engineering (2022), 723–735.
Liu et al. (2017) Q. Liu, Shu Wu, and Liang Wang. 2017. DeepStyle: Learning User Preferences for Visual Recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 841–844.
Liu et al. (2021) Yong Liu, Susen Yang, Chenyi Lei, Guoxin Wang, Haihong Tang, Juyong Zhang, Aixin Sun, and Chunyan Miao. 2021. Pre-training Graph Transformer with Multimodal Side Information for Recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 2853–2861.
Mao et al. (2021) Kelong Mao, Jieming Zhu, Xi Xiao, Biao Lu, Zhaowei Wang, and Xiuqiang He. 2021. UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1253–1262.
McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (2015), 43–52.
McInnes et al. (2018) Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. 2018. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. (2018), 861.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. ArXiv abs/1908.10084 (2019).
Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019), 1441–1450.
Tao et al. (2022) Zhulin Tao, Xiaohao Liu, Yewei Xia, Xiang Wang, Lifang Yang, Xianglin Huang, and Tat-Seng Chua. 2022. Self-supervised Learning for Multimedia Recommendation. IEEE Transactions on Multimedia (2022), 1–10.
Tsai et al. (2019) Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Learning Factorized Multimodal Representations. In International Conference on Learning Representations. 1–20.
Wang et al. (2018) Cheng Wang, Mathias Niepert, and Hui Li. 2018. LRMM: Learning to Recommend with Missing Modalities. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3360–3370.
Wang et al. (2023) Qifan Wang, Yin wei Wei, Jianhua Yin, Jianlong Wu, Xuemeng Song, Liqiang Nie, and Min Zhang. 2023. DualGNN: Dual Graph Neural Network for Multimedia Recommendation. IEEE Transactions on Multimedia (2023), 1074–1084.
Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 165–174.
Wei et al. (2022) Yinwei Wei, Xiang Wang, Xiangnan He, Liqiang Nie, Yong Rui, and Tat-Seng Chua. 2022. Hierarchical User Intent Graph Network for Multimedia Recommendation. IEEE Transactions on Multimedia (2022), 2701–2712.
wei Wei et al. (2020) Yin wei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, and Tat-Seng Chua. 2020. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. In Proceedings of the 28th ACM International Conference on Multimedia. 3541–3549.
wei Wei et al. (2019) Yin wei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In Proceedings of the 27th ACM International Conference on Multimedia. 1437–1445.
Wu et al. (2021) Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-supervised Graph Learning for Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 726–735.
Wu et al. (2019) Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based Recommendation with Graph Neural Networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. 346–353.
Xia et al. (2022) Lianghao Xia, Chao Huang, Yong Xu, Jiashu Zhao, Dawei Yin, and Jimmy Huang. 2022. Hypergraph Contrastive Collaborative Filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1–10.
Xiao et al. (2022) Fangxiong Xiao, Lixi Deng, Jingjing Chen, Houye Ji, Xiaorui Yang, Zhuoye Ding, and Bo Long. 2022. From Abstract to Details: A Generative Multimodal Fusion Framework for Recommendation. In Proceedings of the 30th ACM International Conference on Multimedia. 258–267.
Yuan et al. (2023) Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), 1–11.
Zhang et al. (2021) Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Shu Wu, Shuhui Wang, and Liang Wang. 2021. Mining Latent Structures for Multimedia Recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 3872–3880.
Zhang et al. (2023) Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Mengqi Zhang, Shu Wu, and Liang Wang. 2023. Latent Structures Mining with Contrastive Modality Fusion for Multimedia Recommendation. IEEE Transactions on Knowledge and Data Engineering (2023), 9154–9167.
Zhou et al. (2020) Kun Zhou, Haibo Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 1893–1902.
Zhou et al. (2019) Xiaokang Zhou, Wei Liang, Kevin I-Kai Wang, and Shohei Shimizu. 2019. Multi-Modality Behavioral Influence Analysis for Personalized Recommendations in Health Social Media Environment. IEEE Transactions on Computational Social Systems (2019), 888–897.
Zhou and Shen (2023) Xin Zhou and Zhiqi Shen. 2023. A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia. 935–943.
Zhou et al. (2023) Xin Zhou, Hongyu Zhou, Yong Liu, Zhiwei Zeng, Chunyan Miao, Pengwei Wang, Yuan You, and Feijun Jiang. 2023. Bootstrap Latent Representations for Multi-modal Recommendation. In Proceedings of the ACM Web Conference 2023. 845–854.

Appendix A Experimental Details

In Section 2, we conducted a visualization analysis of the pre-trained ID embeddings (referred to as IDs) and multimodal features. To ensure the generalizability of our experiments, we randomly selected 10 users from the training set and created a sample set that included all the items they interacted with for analysis. After sampling, we checked for any overlap in the interaction items between users to avoid excessive complexity in the experiment.

For the scatter plots (Figure 2 and Figure 5), we used UMAP (McInnes et al., 2018) to reduce the dimensions of the IDs to 2-D and plotted each ID using the matplotlib package. Points that are closer in the plot indicate that they are also closer in the latent space, revealing structural similarity. Points of the same color (i.e., item IDs interacted with by the same user) that are closer together indicate that the IDs can reflect structural similarity.

For the heatmaps (Figure 3 and Figure 4), we directly computed the cosine similarity between each pair of item IDs, resulting in a similarity matrix of size $43\times 43$ (where 43 is the number of items), which was visualized as a heatmap. To enhance clarity, we only retained the top 10 values of each row in the similarity matrix, setting the rest to 0. The heatmaps directly display the semantic similarity between IDs, and it is evident that IDs of items that interacted with the same user exhibit higher semantic similarity, indicating that IDs can reflect content similarity.
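The heatmap construction described above (pairwise cosine similarity, then keeping the 10 largest values in each row) can be sketched in NumPy as follows. This is an illustrative reconstruction, not the authors' script: the function name `topk_cosine_similarity` is hypothetical, the random matrix stands in for the pre-trained item ID embeddings, and the sketch assumes that the diagonal (self-similarity) counts toward the top 10.

```python
import numpy as np


def topk_cosine_similarity(embeddings, k=10):
    """Pairwise cosine similarity, keeping only the k largest values per row."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)      # guard against zero vectors
    sim = unit @ unit.T                           # shape (n_items, n_items)

    k = min(k, sim.shape[1])
    # Column indices of the k largest entries in each row (unordered within the k).
    top_idx = np.argpartition(sim, -k, axis=1)[:, -k:]

    sparse = np.zeros_like(sim)                   # everything else is set to 0
    rows = np.arange(sim.shape[0])[:, None]
    sparse[rows, top_idx] = sim[rows, top_idx]
    return sparse


rng = np.random.default_rng(0)
item_ids = rng.normal(size=(43, 128))             # stand-in for 43 item ID embeddings
heat = topk_cosine_similarity(item_ids, k=10)     # 43 x 43 matrix for the heatmap
```

The resulting matrix can be passed to a plotting routine such as matplotlib's `imshow` to obtain the heatmap.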

Based on these findings, we believe that the IDs can be categorized as content features and structure features.

Appendix B Baselines

We compare IDSF with nine competing methods, which can be divided into two categories: CF methods and multimodal methods. The first category contains three classic recommendation methods that generate recommendations based solely on user-item interactions, with no consideration of modality information.

• BPR (Rendle et al., 2009) is a classic item ranking method built upon the assumption that a user prefers an interacted item to an unknown one.
• NGCF (Wang et al., 2019) explicitly models the user-item interactions by a bipartite graph, uses graph convolution operation to learn the topology, and effectively harvests the high-order connectivity and collaborative signals for item recommendation.
• LightGCN (He et al., 2020) abandons the use of feature transformation and nonlinear activation, and only retains the most important neighbor aggregation module in GCNs for collaborative filtering.

Multimodal models are the second category of our comparison methods that take into account additional modality information for item recommendation.

• VBPR (He and McAuley, 2016) integrates the visual features and ID embedding of each item as its representation for matrix factorization.
• MMGCN (wei Wei et al., 2019) is one of the state-of-the-art multimodal recommendation models, which constructs a graph convolution network from multiple modalities to improve the structural representation.
• GRCN (wei Wei et al., 2020) is also one of the state-of-the-art multimodal recommendation methods. It refines the user-item interaction graph by identifying the false-positive feedback and fixing it.
• SLMREC (Tao et al., 2022) incorporates contrastive learning into modality-specific graph neural networks to promote multimodal fusion.
• MICRO (Zhang et al., 2023) mines latent item relations and conducts fine-grained multimodal fusion with contrastive learning as an auxiliary task before collaborative filtering.
• BM3 (Zhou et al., 2023) is a self-supervised multimodal recommendation model, which requires neither augmentations from auxiliary graphs nor negative samples.

Appendix C Hyperparameters Analysis

In this section, we conduct a sensitivity analysis of the hyperparameters of the graph convolution and the contrastive task, since they are crucial components of our method. We examine the performance of IDSF under various values of $\gamma$ and $\beta$ . The coefficient $\gamma$ dictates how much multimodal subtle information from the target node is retained. We then discuss the impact of the contrastive task weight $\beta$ on performance.

Figure 7 reports the performance comparison. Setting $\gamma=0$ means that no subtle features are retained during graph convolution, since only the linked neighbors are aggregated and the self-connections are ignored (i.e., structure w/o ID). We observe the following:

• Performance on Baby, Sports, and Clothing is best when $\gamma$ is set to 0.3, 0.3, and 1.0, respectively, which validates the significance of modality-specific subtle features in structural representations. An appropriate $\gamma$ yields better item representations by aggregating subtle and salient structural features, which boosts recommendation performance.
• The values of $\gamma$ at which performance peaks differ across the three datasets, because the importance of salient features in revealing item attributes varies for different types of items. Performance declines once $\gamma$ exceeds its best value, since an excessive proportion of subtle features masks the effect of the salient features.
• Furthermore, performance on Clothing keeps improving as $\gamma$ increases, given that Clothing is sparser than the other datasets and thus needs more multimodal item information to provide better recommendations.
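The role of $\gamma$ can be illustrated with a small propagation sketch. The exact propagation rule of IDSF is defined in the method section and is not reproduced here; the code below is only one plausible reading of the description above, in which each layer aggregates normalized neighbors and adds the target node's own representation scaled by $\gamma$ , so that $\gamma=0$ reduces to pure neighbor aggregation. The function names and the LightGCN-style symmetric normalization are assumptions.

```python
import numpy as np


def normalized_adjacency(interactions):
    """Symmetrically normalised adjacency of the user-item bipartite graph."""
    r = np.asarray(interactions, dtype=float)        # (n_users, n_items), binary
    n_users, n_items = r.shape
    adj = np.zeros((n_users + n_items, n_users + n_items))
    adj[:n_users, n_users:] = r                      # user -> item edges
    adj[n_users:, :n_users] = r.T                    # item -> user edges
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5         # isolated nodes keep weight 0
    return inv_sqrt[:, None] * adj * inv_sqrt[None, :]


def propagate(adj_norm, h0, gamma, num_layers=2):
    """Assumed layer rule: neighbour aggregation plus a gamma-weighted self term."""
    h = np.asarray(h0, dtype=float)
    for _ in range(num_layers):
        h = adj_norm @ h + gamma * h                 # gamma = 0 drops the self-connection
    return h
```

Under this reading, sweeping `gamma` over a grid while keeping `num_layers=2` would correspond to the sensitivity experiment reported in Figure 7.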

The effect of various coefficients $\beta$ on performance is shown in Figure 8. $\beta=0$ denotes IDSF w/o contrast, which discards the contrastive learning task. We can observe that:

• Performance on all datasets first improves as $\beta$ increases and is always better than with $\beta=0$ . The primary task benefits from being jointly optimized with the contrastive self-supervised auxiliary task when $\beta$ is small. This implies that it is important to take the semantic consistency of multimodal fusion into account.
• As $\beta$ continues to increase, performance starts to drop, indicating that an overly large $\beta$ interferes with the main recommendation task; the gain brought by the self-supervised task can be counteracted when $\beta$ exceeds the weight of the BPR task.
• Overall, there are no sharp rises or falls when $\beta\neq 0$ , which indicates that our method is not very sensitive to the weight of the auxiliary task.
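The trade-off governed by $\beta$ can be made concrete with a sketch of the joint objective, in which the BPR loss has weight 1 and the contrastive loss is scaled by $\beta$ . The BPR term follows its standard definition; the contrastive term is written here as an InfoNCE loss with temperature $\tau$ over two views of the same items, which is an assumption about the form of the contrastive task rather than the paper's exact formulation. All function names are hypothetical.

```python
import numpy as np


def bpr_loss(pos_scores, neg_scores):
    """Standard BPR loss: mean of -log(sigmoid(pos - neg))."""
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -diff)))   # stable form of log(1 + exp(-diff))


def info_nce(view_a, view_b, tau=0.5):
    """InfoNCE between two views; row i of each view is a positive pair."""
    a = np.asarray(view_a, dtype=float)
    b = np.asarray(view_b, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / tau                          # cosine similarity / temperature
    m = logits.max(axis=1, keepdims=True)             # subtract row max for stability
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return float(-np.mean(np.diag(log_prob)))         # positives sit on the diagonal


def joint_loss(pos_scores, neg_scores, view_a, view_b, beta=0.3, tau=0.5):
    """Main BPR task plus the beta-weighted contrastive auxiliary task."""
    return bpr_loss(pos_scores, neg_scores) + beta * info_nce(view_a, view_b, tau)
```

With `beta=0` the objective reduces to the BPR loss alone, which corresponds to the IDSF w/o contrast setting.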