DisCo: Towards Harmonious Disentanglement and Collaboration between Tabular and Semantic Space for Recommendation

Kounianhua Du (Shanghai Jiao Tong University, Shanghai, China), Jizheng Chen (Shanghai Jiao Tong University, Shanghai, China), Jianghao Lin (Shanghai Jiao Tong University, Shanghai, China), Yunjia Xi (Shanghai Jiao Tong University, Shanghai, China), Hangyu Wang (Shanghai Jiao Tong University, Shanghai, China), Xinyi Dai (Huawei Noah’s Ark Lab, Shanghai, China), Bo Chen (Huawei Noah’s Ark Lab, Shanghai, China), Ruiming Tang (Huawei Noah’s Ark Lab, Shenzhen, China), and Weinan Zhang (Shanghai Jiao Tong University, Shanghai, China)


Abstract.

Recommender systems play important roles in various applications such as e-commerce and social media. Conventional recommendation methods usually model the collaborative signals within the tabular representation space. Despite their personalization modeling and efficiency, they omit the latent semantic dependencies. Methods that introduce semantics into recommendation have since emerged, injecting knowledge from the semantic representation space, where general language understanding is compressed. However, existing semantic-enhanced recommendation methods focus on aligning the two spaces, during which the representations of the two spaces tend to get close while their unique patterns are discarded and left unexplored. In this paper, we propose DisCo to Disentangle the unique patterns from the two representation spaces and Collaborate the two spaces for recommendation enhancement, where both the specificity and the consistency of the two spaces are captured. Concretely, we propose 1) a dual-side attentive network to capture the intra-domain and inter-domain patterns, 2) a sufficiency constraint to preserve the task-relevant information of each representation space and filter out the noise, and 3) a disentanglement constraint to prevent the model from discarding the unique information. These modules strike a balance between disentanglement and collaboration of the two representation spaces to produce informative pattern vectors, which can serve as extra features and be appended to arbitrary recommendation backbones for enhancement. Experimental results validate the superiority of our method against different models and the compatibility of DisCo with different backbones. Various ablation studies and an efficiency analysis are also conducted to justify each model component.

Recommender Systems, User Modeling, Large Language Model

Conference: KDD; Aug 25–29, 2024; Barcelona. CCS Concepts: Information systems → Recommender systems.

1. Introduction

Figure 1. Illustration of the motivation.

Recommender systems have become an integral part of today’s digital ecosystem, enhancing user experiences, boosting engagement, facilitating decision-making, and fostering connections between users and relevant content or products. They are widely used in various industries, including e-commerce (Smith and Linden, 2017), entertainment (Christensen and Schiaffino, 2011), social media (Wu et al., 2023), and online streaming platforms (Zhou et al., 2018, 2019; Huang et al., 2022).

Conventional recommender systems usually model only the collaborative signals within the tabular representation space, where samples consist of multi-field categorical features. These methods focus on mining beneficial interactions among features (Juan et al., 2016a; Xiao et al., 2017; Liu et al., 2020; Song et al., 2019) and modeling user interests using historical user behaviors (Zhou et al., 2018, 2019; Feng et al., 2019; Pi et al., 2020; Qin et al., 2020) for accurate and personalized recommendation. While good at modeling feature interactions and personalized user preferences, these methods fail to learn the latent semantic dependencies of features. For example, as shown in Figure 1, a book written by Ernest Hemingway and a book written by Raymond Carver can be close in semantic space because both authors are known for a minimalist writing style and Carver said that he borrowed many elements from Hemingway’s style. This latent semantic dependency cannot be inferred from the tabular representation space, where features are first encoded in a one-hot manner. Attempts to incorporate external knowledge from the semantic representation space into recommendation have since emerged (Hou et al., 2022; Li et al., 2023; Hou et al., 2023), where textual descriptions and their encodings are used to hold the external semantic knowledge. Despite the general language understanding within the semantic representation space, the encoding of the large language model is ambiguous and fails to capture the correlations of some features well. For instance, beer and diapers are distant in semantic space but near in the tabular space used for recommendation, where correlation analysis is performed, since users tend to buy beer and diapers together.

As discussed above, the relationship among the same set of user behaviors can be different in the two representation spaces. The unique and disentangled patterns from the two representation spaces form a complementary relationship with each other, contributing different information. Therefore, it is vital to effectively and efficiently collaborate the tabular and semantic representation spaces. The existing works either: 1) take only one space into consideration (Geng et al., 2022) or 2) only focus on aligning the two spaces (Li et al., 2023), during which the representations tend to get close (green part in Figure 1), with the disentangled part (yellow and blue parts in Figure 1) being discarded.

In this paper, we propose DisCo to Disentangle and Collaborate the tabular and semantic representation spaces for user behavior pattern modeling. To this end, we design 1) a dual-side attentive network (DS-Attn) to capture the intra-domain and inter-domain patterns, 2) a sufficiency constraint to preserve the useful information (yellow, green, and blue parts in Figure 1) and eliminate the noisy information (grey part in Figure 1) from the two representation spaces, and 3) a disentanglement constraint to preserve the disentangled parts of the two representation spaces (yellow and blue parts in Figure 1). Together, these modules strike a balance between the collaboration and the disentanglement of the tabular and semantic representation spaces.

Concretely, DS-Attn uses intra-domain attention to infer the inner patterns within the tabular space (collaborative-based correlations) and within the semantic space (semantic-based dependencies). It uses inter-domain attention to capture the patterns between the two spaces for aligned knowledge, where a query-key exchange is adopted to make the two spaces attend to each other. The parameters of DS-Attn and the embedding networks are regularized by the sufficiency and disentanglement constraints. The sufficiency constraint maximizes the mutual information between the representations of each space and the labels, which preserves the task-relevant information offered by each space. The disentanglement constraint minimizes the mutual information between the output vectors of DS-Attn from the two spaces, which forces the two spaces to offer different information. The resulting vectors output by DS-Attn and regularized by the two constraints preserve both the consistent and the specific knowledge of the two representation spaces, and can then be fed into arbitrary recommendation backbones for prediction enhancement.

The main contributions of the paper are summarized as follows:

• We design a novel and effective framework, DisCo, that harmoniously captures both the consistent and the specific knowledge from tabular representation space and semantic representation space with the dual-side attentive network under the regularization of the proposed sufficiency and disentanglement constraints.
• We emphasize the importance of unique and disentangled information in both the tabular space and the semantic space, and propose the first work to disentangle the tabular and semantic representation spaces for unique domain knowledge.
• DisCo is a model-agnostic framework compatible with different recommendation backbones, offering flexibility and generality.

The experiment results over a series of recommendation backbones justify the consistent superiority of the proposed method. Ablation studies are also conducted to validate the effectiveness of different model components.

2. Related Work

2.1. Tabular-Only Methods

Early recommendation models focus on mining interactions among features. FM (Rendle, 2010) captures second-order feature interactions. FFM (Juan et al., 2016b) introduces field-aware interactions. Wide & Deep (Cheng et al., 2016) combines the strengths of linear models and deep neural networks. DeepFM (Guo et al., 2017) replaces the logistic regression layer in Wide & Deep (Cheng et al., 2016) with an FM layer. xDeepFM (Lian et al., 2018) introduces the cross layer for high-order feature interactions. PNN (Qu et al., 2018) utilizes the product layer to learn high-order product interactions. DCN (Wang et al., 2017) proposes to capture both shallow and deep feature interactions effectively. AutoInt (Song et al., 2019) utilizes the multi-head self-attention mechanism to learn high-order feature interactions. A second line of work models user behavior patterns to provide more personalized recommendations. DIN (Zhou et al., 2018) incorporates the attention mechanism (Vaswani et al., 2023) to capture user interests. DIEN (Zhou et al., 2019) uses a GRU module to better model the evolving interests of users. DSIN (Feng et al., 2019) proposes to capture the dynamic interests of users within a session. MIMN (Pi et al., 2019) leverages a memory network architecture to capture different aspects of user interests. SIM (Pi et al., 2020) models long-term user behaviors.

2.2. Semantic-Enhanced Methods

Recently, large language models have shown great impact and shed light on various domains of recommender systems. There are several attempts to incorporate large language models into recommender systems (Lin et al., 2023, 2024b; Xi et al., 2023; Lin et al., 2024a). PTab (Liu et al., 2022) adopts a classic BERT (Devlin et al., 2018) framework with Modality Transformation (MT), Masked-Language Fine-tuning (MF), and Classification Fine-tuning (CF) training stages. P5 (Geng et al., 2022), as well as its variants (Geng et al., 2023; Hua et al., 2023a, b), proposes to tune T5 (Raffel et al., 2020) as a unified recommendation model for various downstream tasks. ZESRec (Ding et al., 2021) proposes to obtain universal representations from item descriptions through BERT for zero-shot recommendation. UniSRec (Hou et al., 2022) learns item representations via a fixed BERT model followed by an MoE-enhanced network. CTRL (Li et al., 2023) adopts the contrastive learning methodology to align the tabular space and semantic space for recommendation enhancement. VQ-Rec (Hou et al., 2023) improves on UniSRec (Hou et al., 2022) by introducing a vector quantization technique.

3. Preliminaries

3.1. Problem Formulation

The click-through rate (CTR) prediction task aims at accurately predicting the probability of a user clicking an item, which is the core task for recommender systems. Therefore, we mainly focus on the CTR prediction task in this work. The conventional CTR prediction task within the tabular domain can be formulated as

(1)	$p(y_{i}|\mathbf{X^{U}_{i}},\mathbf{X^{I}_{i}},\mathbf{X^{C}_{i}},\theta),$

where $\mathbf{X^{U}_{i}}=\left[x_{i,1}^{U},\dots,x_{i,F_{U}}^{U}\right]$ is the set of user features, $\mathbf{X^{I}_{i}}=\left[x_{i,1}^{I},\dots,x_{i,F_{I}}^{I}\right]$ is the set of item features, $\mathbf{X^{C}_{i}}=\left[x_{i,1}^{C},\dots,x_{i,F_{C}}^{C}\right]$ is the set of context features for the click prediction event (e.g., device, season, etc.), $y_{i}$ is the label of the data point, and $\theta$ is the model parameter. We use $F_{U},F_{I},F_{C}$ to denote the number of user features, item features, and context features, respectively. Methods under this formulation model feature interactions within the target sample only and fail to model user behavior patterns.

Modeling user behavior patterns plays an important role in boosting personalized recommendation performance. This line of methods takes users’ historical behaviors as extra inputs and models the dependencies between the candidate item and historical items, which can be formulated as

(2)	$p(y_{i}|\mathbf{X^{U}_{i}},\mathbf{X^{I}_{i}},\mathbf{X^{C}_{i}},[\langle\mathbf{X^{I}_{i_{k}}},y_{i_{k}}\rangle]_{k=1}^{K},\theta),$

where $[\langle\mathbf{X^{I}_{i_{k}}},y_{i_{k}}\rangle]_{k=1}^{K}$ represents the list of user’s historical behaviors and their corresponding labels, and $\theta$ is the model parameter.

In this paper, we aim to involve the embedded open-world knowledge of large language models to achieve harmonious disentanglement and collaboration between the tabular and semantic space for recommendation enhancement. Hence, the prediction can be formulated as

(3)	$p(y_{i}|\mathbf{X^{U}_{i}},\mathbf{X^{I}_{i}},\mathbf{X^{C}_{i}},[\langle\mathbf{X^{I}_{i_{k}}},y_{i_{k}}\rangle]_{k=1}^{K},\Phi_{S},\theta),$

where $\Phi_{S}$ represents the encoder of a large language model.

3.2. Mutual Information Minimization & Maximization

Mutual information (MI) is important but hard to compute in neural networks. For mutual information maximization, MINE (Belghazi et al., 2018) builds connections between the expectations of variables and mutual information, and proposes a lower bound of the mutual information based on the Donsker-Varadhan representation of KL divergence.

DIM (Hjelm et al., 2019) points out that we do not necessarily need to obtain the precise value of MI but only need to maximize it. They use the Jensen-Shannon Divergence to estimate the MI and therefore propose a GAN-style loss to maximize it:

(4)	$L=E_{\mathbb{J}}\left[\log T_{w}(x,y)\right]+E_{\mathbb{M}}\left[\log(1-T_{w}(x,y))\right].$

For mutual information minimization, CLUB (Cheng et al., 2020) introduces an upper bound for mutual information. When the conditional distribution $p(y|x)$ is known, the upper bound can be represented as

(5)	$I(X;Y)\leq I_{CLUB}(X;Y)=E_{p(x,y)}\left[\log p(y|x)\right]-E_{p(x)}E_{p(y)}\left[\log p(y|x)\right].$

When the conditional distribution is not known, one could use a variational distribution $q_{\theta}(y|x)$ to approximate it and the upper bound then becomes

(6)	$I_{vCLUB}=E_{p(x,y)}\left[\log q_{\theta}(y|x)\right]-E_{p(x)}E_{p(y)}\left[\log q_{\theta}(y|x)\right].$

4. Methodology

In this section, we describe the methodology of DisCo, which is model-agnostic and compatible with different backbones.

4.1. Overview

As illustrated in Figure 2, we first prepare the tabular and semantic embeddings. In this paper, we encode the user behaviors in tabular and semantic representation space and extract patterns from the resulting tabular and semantic embeddings. To make use of the general open-world knowledge from the semantic space in an efficient manner, we pre-store the semantic embedding of each item in an indexed knowledge base, with the item ID being the indexing key. The semantic embeddings are generated from a frozen LLM, with the textual item descriptions as inputs. The tabular embeddings are obtained via tabular embedding layers with one-hot encoded features as inputs.

The proposed DisCo mainly consists of three components: the dual-side attentive network, the sufficiency constraint, and the disentanglement constraint. To collaborate the tabular and semantic representation spaces, we propose a Dual-Side Attentive Network that encodes the patterns of user behaviors in the collaborative (tabular) representation space, in the semantic representation space, and across the interactions of the two spaces. The resulting pattern vectors serve as additional features for arbitrary recommendation models. In addition, we propose two constraints to regularize the model, which preserve the useful and unique information from the two representation spaces: 1) a sufficiency constraint that maximizes the mutual information between encoded vectors and the labels, which preserves the task-related information and eliminates the noise, and 2) a disentanglement constraint that minimizes the mutual information between vectors from different representation spaces, which forces each space to provide unique domain-specific knowledge. Together, these components strike a harmonious balance between collaboration and disentanglement.

4.2. Indexed Knowledge Base

To efficiently utilize the semantic knowledge, we first build an indexed knowledge base to extract and pre-store the semantic embedding of each item.

For each item, we obtain a semantic description using the field-value prompt template. As shown in Figure 3, for a movie with title Titanic, genre Romantic, and director James Cameron, we obtain the item description "Here is a movie, title is Titanic, genre is Romantic, and director is James Cameron." The item description is then fed into a large language model $\Phi_{S}(\cdot)$ to acquire the semantic embedding, which is stored in an indexed knowledge base $KB[\cdot]$ for further usage, with the item features being the index key and the semantic embedding being the value.
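The construction above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for the frozen LLM $\Phi_{S}$ (it only hashes the text into a fixed-length vector), the helper names are invented for this example, and the item ID is assumed to serve as the lookup key.

```python
import hashlib

def build_description(kind, fields):
    """Fill the field-value prompt template, e.g. 'title is Titanic'."""
    parts = [f"{name} is {value}" for name, value in fields.items()]
    if len(parts) > 1:
        parts[-1] = "and " + parts[-1]
    return f"Here is a {kind}, " + ", ".join(parts) + "."

def toy_encoder(text, dim=8):
    """Hypothetical stand-in for the frozen LLM: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]

def build_knowledge_base(items, kind, encoder=toy_encoder):
    """Pre-compute one semantic embedding per item, keyed by item ID (assumption)."""
    return {
        item_id: encoder(build_description(kind, fields))
        for item_id, fields in items.items()
    }

items = {42: {"title": "Titanic", "genre": "Romantic", "director": "James Cameron"}}
kb = build_knowledge_base(items, kind="movie")
print(build_description("movie", items[42]))
```

Because the encoder is frozen, the embeddings only need to be computed once and can then be looked up at training and inference time.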

4.3. Dual-Side Attentive Network

As discussed in Section 1, the relations among the same set of user behaviors can be different in the tabular and semantic domains. In order to capture the distinct and shared patterns among the behaviors in both domains, we design a dual-side attentive network (DS-Attn) module, which consists of intra-domain attention and inter-domain attention. The intra-domain attention models the distinct domain-specific patterns within the tabular domain for collaborative-based correlations and within the semantic domain for semantic-based dependencies, respectively. The inter-domain attention models the shared patterns between the two domains, where a query-key exchange between the two domains is adopted to make the two domains attend to each other.

4.3.1. Behaviors Encoding

For each target sample $\mathbf{X_{i}=[X_{i}^{U},X_{i}^{I},X_{i}^{C}]}$ , we gather $K$ recent historical behaviors $\{\langle\mathbf{X_{i_{k}}^{I}},y_{i_{k}}\rangle\}_{k=1}^{K}$ to assist the prediction. The candidate item and the historical items are transformed into representations in tabular domain and semantic domain as follows, $\forall j\in\{i,i_{1},\dots,i_{K}\}$ :

(7)	$\mathbf{h_{j}^{S}}=MLP(KB[\mathbf{X_{j}^{I}}]),$
(8)	$\mathbf{h_{j}^{T}}=\Phi_{T}(\mathbf{X_{j}^{I}}),$

where $\Phi_{T}$ denotes the tabular domain embedding network and $KB[\cdot]$ denotes the indexed knowledge base constructed in the previous stage. $MLP$ is used to reduce the dimension of the semantic embedding. $\mathbf{h_{j}^{S}}$ and $\mathbf{h_{j}^{T}}$ denote the embedding of item features in semantic space and tabular space, respectively. In order to decrease the entanglement of the embeddings for intra-domain and inter-domain pattern extraction, we decouple the corresponding embeddings into chunks. $\forall j\in\{i,i_{1},\dots,i_{K}\}$ :

(9)	$\mathbf{h_{j}^{S}}=[\mathbf{h_{j}^{SI}},\mathbf{h_{j}^{SC}}],$
(10)	$\mathbf{h_{j}^{T}}=[\mathbf{h_{j}^{TI}},\mathbf{h_{j}^{TC}}],$

where $\mathbf{h_{j}^{SI}},\mathbf{h_{j}^{SC}},\mathbf{h_{j}^{TI}},\mathbf{h_{j}^{TC}}\in R^{\frac{d}{2}}$.

In addition, following (Qin et al., 2021; Du et al., 2022), we also encode the historical labels to model accurate click signals, $\forall k\in\{1,\dots,K\}$ :

(11)	$\mathbf{l_{i_{k}}}=\Phi_{Y}(y_{i_{k}}),$

where $\Phi_{Y}$ denotes the label embedding network.
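The encoding steps in Eqs. (7) to (11) can be sketched with NumPy as below. All sizes, the random lookup tables, and the single linear layer standing in for the $MLP$ are assumptions made for illustration. Taking the first half of each embedding as the intra-domain chunk and the second half as the inter-domain chunk follows the order written in Eqs. (9) and (10).

```python
import numpy as np

rng = np.random.default_rng(0)
d, sem_dim, num_items = 8, 32, 100  # illustrative sizes (assumptions)

kb = rng.normal(size=(num_items, sem_dim))       # stand-in for KB[.]
W_reduce = rng.normal(size=(sem_dim, d)) * 0.1   # one linear layer in place of the MLP
tab_table = rng.normal(size=(num_items, d))      # stand-in for Phi_T (single ID field)
label_table = rng.normal(size=(2, d))            # stand-in for Phi_Y (no click / click)

def encode_items(item_ids):
    """Return the four chunks (h_SI, h_SC, h_TI, h_TC) for a batch of item IDs."""
    h_s = kb[item_ids] @ W_reduce                # Eq. (7): reduce the semantic embedding
    h_t = tab_table[item_ids]                    # Eq. (8): tabular embedding lookup
    h_si, h_sc = np.split(h_s, 2, axis=-1)       # Eq. (9): split into two d/2 chunks
    h_ti, h_tc = np.split(h_t, 2, axis=-1)       # Eq. (10)
    return h_si, h_sc, h_ti, h_tc

history_ids = np.array([3, 17, 17, 56, 98])
history_labels = np.array([1, 0, 1, 1, 0])
h_si, h_sc, h_ti, h_tc = encode_items(history_ids)
l_hist = label_table[history_labels]             # Eq. (11): label embeddings
```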

4.3.2. Attention Block of DS-Attn

The dual-side attentive network consists of four attention blocks for tabular, semantic, tabular-to-semantic, and semantic-to-tabular user behavior patterns. We first define the function of the basic attention block as follows:

(12)	$\mathbf{H},\mathbf{L^{\prime}}=Attn_{\theta}(\mathbf{Q},\mathbf{K},\mathbf{V},\mathbf{L}),$
(13)	$\theta=\{\mathbf{W_{Q}},\mathbf{W_{K}},\mathbf{W_{V}}\},$

where $\theta$ denotes the parameters of an attention block. The detailed operations of $\mathbf{H},\mathbf{L^{\prime}}=Attn_{\theta}(\mathbf{Q},\mathbf{K},\mathbf{V},\mathbf{L})$ are:

(14)	$\mathbf{Q^{\prime}}=\mathbf{QW_{Q}},\quad\mathbf{K^{\prime}}=\mathbf{KW_{K}},\quad\mathbf{V^{\prime}}=\mathbf{VW_{V}},$
(15)	$\mathbf{A}=\mathrm{softmax}\left(\frac{\mathbf{Q^{\prime}K^{\prime T}}}{\sqrt{d}}\right),$
(16)	$\mathbf{H}=\mathbf{A}\mathbf{V^{\prime}},\quad\mathbf{L^{\prime}}=\mathbf{AL},$

where $\mathbf{Q}\in R^{1\times\frac{d}{2}},\mathbf{K}\in R^{K\times\frac{d}{2}},\mathbf{V}\in R^{K\times\frac{d}{2}},\mathbf{L}\in R^{K\times d}$ denote the inputs for query, key, value, and labels, and $\mathbf{W_{Q}},\mathbf{W_{K}},\mathbf{W_{V}}\in R^{\frac{d}{2}\times d}$ represent the trainable weights.
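A minimal NumPy sketch of the basic attention block in Eqs. (14) to (16) is given below. The random inputs and weights are placeholders, and the function name `attn` is chosen for this example only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn(Q, K, V, L, W_Q, W_K, W_V):
    """Return (H, L') for one query over K historical behaviors."""
    d = W_Q.shape[1]
    Q_p, K_p, V_p = Q @ W_Q, K @ W_K, V @ W_V   # Eq. (14): project to d dimensions
    A = softmax(Q_p @ K_p.T / np.sqrt(d))       # Eq. (15): weights of shape (1, K)
    return A @ V_p, A @ L                       # Eq. (16): aggregate values and labels

rng = np.random.default_rng(0)
d, K = 8, 5
Q = rng.normal(size=(1, d // 2))
keys = rng.normal(size=(K, d // 2))             # keys and values share the same input
L = rng.normal(size=(K, d))
W_Q, W_K, W_V = (rng.normal(size=(d // 2, d)) for _ in range(3))
H, L_out = attn(Q, keys, keys, L, W_Q, W_K, W_V)
```

The same attention weights aggregate both the behavior values and the label embeddings, so behaviors that match the query also contribute their click signals in the same proportion.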

4.3.3. Intra-Domain Attention

To capture the inner domain tabular and semantic behavior patterns, we apply the intra-domain attention among the candidate item and the behavior sequences within the tabular and semantic domains.

We define the query, key, and value for intra-tabular and intra-semantic behavior patterns as

(17)	$\mathbf{Q^{SS}_{i}}=\mathbf{h^{SI}_{i}},\quad\mathbf{Q^{TT}_{i}}=\mathbf{h^{TI}_{i}},$
(24)	$\mathbf{K^{SS}_{i}}=\mathbf{V^{SS}_{i}}=\begin{pmatrix}\mathbf{h^{SI}_{i_{1}}}\\ \vdots\\ \mathbf{h^{SI}_{i_{K}}}\end{pmatrix},\quad\mathbf{K^{TT}_{i}}=\mathbf{V^{TT}_{i}}=\begin{pmatrix}\mathbf{h^{TI}_{i_{1}}}\\ \vdots\\ \mathbf{h^{TI}_{i_{K}}}\end{pmatrix},$

where $\mathbf{Q^{SS}_{i}}$ , $\mathbf{K^{SS}_{i}}$ , and $\mathbf{V^{SS}_{i}}$ denote the input for intra-semantic attention and $\mathbf{Q^{TT}_{i}}$ , $\mathbf{K^{TT}_{i}}$ , and $\mathbf{V^{TT}_{i}}$ represent the input of intra-tabular attention.

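Then, with $\mathbf{L_{i}}\in R^{K\times d}$ stacking the historical label embeddings $\mathbf{l_{i_{1}}},\dots,\mathbf{l_{i_{K}}}$, the user behavior patterns in each space can be obtained by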

(25)	$\mathbf{P^{SS}_{i}}=[\mathbf{H^{SS}_{i}},\mathbf{L^{SS}_{i}}]=Attn_{\theta_{SS}}(\mathbf{Q^{SS}_{i}},\mathbf{K^{SS}_{i}},\mathbf{V^{SS}_{i}},\mathbf{L_{i}}),$
(26)	$\mathbf{P^{TT}_{i}}=[\mathbf{H^{TT}_{i}},\mathbf{L^{TT}_{i}}]=Attn_{\theta_{TT}}(\mathbf{Q^{TT}_{i}},\mathbf{K^{TT}_{i}},\mathbf{V^{TT}_{i}},\mathbf{L_{i}}).$

4.3.4. Inter-Domain Attention

We further capture the inter-domain patterns and model the shared knowledge with an inter-domain attention module.

We define the query, key, and value for the semantic-to-tabular and tabular-to-semantic behavior patterns as

(27)	$\mathbf{Q^{ST}_{i}}=\mathbf{h^{SC}_{i}},\quad\mathbf{Q^{TS}_{i}}=\mathbf{h^{TC}_{i}},$
(34)	$\mathbf{K^{ST}_{i}}=\mathbf{V^{ST}_{i}}=\begin{pmatrix}\mathbf{h^{TC}_{i_{1}}}\\ \vdots\\ \mathbf{h^{TC}_{i_{K}}}\end{pmatrix},\quad\mathbf{K^{TS}_{i}}=\mathbf{V^{TS}_{i}}=\begin{pmatrix}\mathbf{h^{SC}_{i_{1}}}\\ \vdots\\ \mathbf{h^{SC}_{i_{K}}}\end{pmatrix}.$

Then the user behavior patterns in different spaces can be obtained by

(35)	$\mathbf{P^{ST}_{i}}=[\mathbf{H^{ST}_{i}},\mathbf{L^{ST}_{i}}]=Attn_{\theta_{ST}}(\mathbf{Q^{ST}_{i}},\mathbf{K^{ST}_{i}},\mathbf{V^{ST}_{i}},\mathbf{L_{i}}),$
(36)	$\mathbf{P^{TS}_{i}}=[\mathbf{H^{TS}_{i}},\mathbf{L^{TS}_{i}}]=Attn_{\theta_{TS}}(\mathbf{Q^{TS}_{i}},\mathbf{K^{TS}_{i}},\mathbf{V^{TS}_{i}},\mathbf{L_{i}}).$
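The four attention blocks and the query-key exchange can be sketched together as follows. This is an illustrative NumPy version under assumed sizes and random inputs. The dictionary layout and the name `ds_attn` are not from the paper, and each block returns the concatenation $[\mathbf{H},\mathbf{L^{\prime}}]$ as its pattern vector.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn(Q, K, V, L, W):
    """One attention block; returns the pattern vector P = [H, L']."""
    W_Q, W_K, W_V = W
    A = softmax((Q @ W_Q) @ (K @ W_K).T / np.sqrt(W_Q.shape[1]))
    return np.concatenate([A @ (V @ W_V), A @ L], axis=-1)

def ds_attn(cand, hist, L, params):
    """cand/hist map chunk names (SI, SC, TI, TC) to candidate/history embeddings."""
    return {
        # Intra-domain: query and keys come from the same space.
        "SS": attn(cand["SI"], hist["SI"], hist["SI"], L, params["SS"]),
        "TT": attn(cand["TI"], hist["TI"], hist["TI"], L, params["TT"]),
        # Inter-domain: the query of one space attends to keys of the other.
        "ST": attn(cand["SC"], hist["TC"], hist["TC"], L, params["ST"]),
        "TS": attn(cand["TC"], hist["SC"], hist["SC"], L, params["TS"]),
    }

rng = np.random.default_rng(0)
d, K = 8, 5
chunks = ("SI", "SC", "TI", "TC")
cand = {c: rng.normal(size=(1, d // 2)) for c in chunks}
hist = {c: rng.normal(size=(K, d // 2)) for c in chunks}
L = rng.normal(size=(K, d))
params = {
    b: tuple(rng.normal(size=(d // 2, d)) for _ in range(3))
    for b in ("SS", "TT", "ST", "TS")
}
P = ds_attn(cand, hist, L, params)
```

Keeping separate chunks for the intra-domain and inter-domain blocks means that the vectors used for cross-space alignment are not the same ones that carry the space-specific patterns.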

4.4. Sufficiency Constraint

To preserve the task-related information from the two representation spaces and filter out the noise, we maximize the mutual information between the encoded vectors $\mathbf{h}$ from each domain and the labels. The difficulty here is that the label space is discrete and consists of only two values. Therefore, we follow DIM (Hjelm et al., 2019) and use the summarized pattern vectors $\mathbf{H_{\cdot}}$ from each space to represent the label space, which extends the discrete states into a high-dimensional continuous space. To maximize the mutual information, we maximize the divergence between the joint distribution of the two variables and the product of their marginal distributions.

Concretely, for tabular domain sufficiency preserving, we sample the positive pairs as $\mathcal{I^{T+}}=\{\langle\mathbf{h_{i}^{TI}},\mathbf{H_{i^{+}}^{TT}}\rangle|y_{i}=y_{i^{+}}\}$ and the negative pairs as $\mathcal{I^{T-}}=\{\langle\mathbf{h_{i}^{TI}},\mathbf{H_{i^{-}}^{TT}}\rangle|y_{i}\neq y_{i^{-}}\}$. For semantic domain sufficiency preserving, we sample the positive pairs as $\mathcal{I^{S+}}=\{\langle\mathbf{h_{i}^{SI}},\mathbf{H_{i^{+}}^{SS}}\rangle|y_{i}=y_{i^{+}}\}$ and the negative pairs as $\mathcal{I^{S-}}=\{\langle\mathbf{h_{i}^{SI}},\mathbf{H_{i^{-}}^{SS}}\rangle|y_{i}\neq y_{i^{-}}\}$. The discriminator networks $\mathcal{D}_{\theta 1}$ and $\mathcal{D}_{\theta 2}$ are used to distinguish the positive pairs from the negative pairs. The optimization objective is:

(37)	$\begin{aligned}l_{Suf}=&-\sum_{\mathcal{I^{T+}},\mathcal{I^{T-}}}\left(\log\mathcal{D}_{\theta 1}(\mathbf{h^{TI}_{i}},\mathbf{H^{TT}_{i^{+}}})+\left(1-\log\mathcal{D}_{\theta 1}(\mathbf{h^{TI}_{i}},\mathbf{H^{TT}_{i^{-}}})\right)\right)\\ &-\sum_{\mathcal{I^{S+}},\mathcal{I^{S-}}}\left(\log\mathcal{D}_{\theta 2}(\mathbf{h^{SI}_{i}},\mathbf{H^{SS}_{i^{+}}})+\left(1-\log\mathcal{D}_{\theta 2}(\mathbf{h^{SI}_{i}},\mathbf{H^{SS}_{i^{-}}})\right)\right).\end{aligned}$
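One domain's term of the sufficiency objective can be sketched as below, with several stated assumptions. The discriminator is a bilinear form followed by a sigmoid. Every in-batch pair $(i,j)$ is used, counted as positive when $y_{i}=y_{j}$, whereas the text above samples pairs. The negative pairs are scored with the GAN-style $\log(1-\mathcal{D})$ term of Eq. (4), which differs from the negative-pair term as printed in Eq. (37).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sufficiency_loss(h, H, labels, W, eps=1e-8):
    """GAN-style loss for one domain.

    h: (B, d/2) encoded vectors, H: (B, d) pattern vectors,
    labels: (B,) binary labels, W: (d/2, d) bilinear discriminator weights.
    """
    scores = sigmoid(h @ W @ H.T)                   # scores[i, j] = D(h_i, H_j)
    same = labels[:, None] == labels[None, :]       # positive pairs share a label
    pos = np.log(scores[same] + eps).sum()          # push D toward 1 on positives
    neg = np.log(1.0 - scores[~same] + eps).sum()   # push D toward 0 on negatives
    return -(pos + neg)

# Tiny hand-built check: an aligned discriminator yields a much lower loss.
h = np.array([[5.0], [-5.0]])
H = np.array([[5.0], [-5.0]])
labels = np.array([1, 0])
loss_aligned = sufficiency_loss(h, H, labels, np.array([[1.0]]))
loss_flipped = sufficiency_loss(h, H, labels, np.array([[-1.0]]))
```

The full loss would add the same term for the semantic domain with its own discriminator weights.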

4.5. Disentanglement Constraint

To preserve the unique information from the two representation spaces, a two-level disentanglement is applied. Concretely, we minimize the mutual information among the pattern vectors summarized from behavior vectors in the two domains. To achieve this, we minimize the vCLUB (Cheng et al., 2020) upper bound of the mutual information, which is defined as

(38)	$I_{vCLUB}(\mathbf{X;Y})=E_{p(\mathbf{X,Y})}\left[\log q_{\theta*}(\mathbf{Y|X})\right]-E_{p(\mathbf{X})}E_{p(\mathbf{Y})}\left[\log q_{\theta*}(\mathbf{Y|X})\right],$

where $q_{\theta*}(\mathbf{Y|X})$ is a variational distribution with parameter ${\theta*}$ to approximate $p(\mathbf{Y|X})$ .

At each training iteration, the variational approximation network is first updated to maximize $\log q_{\theta*}(\mathbf{Y|X})$, and then the main model is updated.

In this way, we train a vCLUB mutual information estimator for each pair of feature vectors from different domains and minimize the mutual information between each pair. The loss objective of the disentanglement module could then be formulated as

(39)	$l_{Dis}=I_{vCLUB1}(\mathbf{H^{TT}};\mathbf{H^{SS}})+I_{vCLUB2}(\mathbf{H^{TS}};\mathbf{H^{ST}}).$
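A sample-based version of the vCLUB bound in Eq. (38) can be sketched as follows. The variational distribution is assumed here to be a unit-variance Gaussian $q(\mathbf{y}|\mathbf{x})=\mathcal{N}(\mathbf{y};\mu(\mathbf{x}),\mathbf{I})$ with a user-supplied mean function `mean_fn`. This is a simplification chosen for illustration and not the estimator network described in the implementation details.

```python
import numpy as np

def vclub_estimate(x, y, mean_fn):
    """Sampled vCLUB upper-bound estimate of I(X; Y) under a Gaussian q(y | x)."""
    mu = mean_fn(x)                                             # (N, d)
    # log q(y_i | x_i) up to a constant that cancels in the difference
    positive = -0.5 * np.sum((y - mu) ** 2, axis=1)             # (N,)
    # log q(y_j | x_i) averaged over all j (product of marginals)
    diff = y[None, :, :] - mu[:, None, :]                       # diff[i, j] = y_j - mu_i
    negative = -0.5 * np.sum(diff ** 2, axis=2).mean(axis=1)    # (N,)
    return float(np.mean(positive - negative))

def disentanglement_loss(H_tt, H_ss, H_ts, H_st, mean_fn_1, mean_fn_2):
    """Eq. (39): sum of the two pairwise vCLUB estimates."""
    return (vclub_estimate(H_tt, H_ss, mean_fn_1)
            + vclub_estimate(H_ts, H_st, mean_fn_2))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
dependent = vclub_estimate(x, x, lambda v: v)                   # y is a copy of x
independent = vclub_estimate(x, rng.normal(size=(256, 4)),
                             lambda v: np.zeros_like(v))        # y ignores x
```

In training, `mean_fn` would be a small network that is first fitted to maximize $\log q$ on the current batch, after which the estimate is minimized with respect to the main model.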

4.6. Prediction and Training Objective

The aggregated feature and label embeddings are then disentangled and appended to any recommendation backbone models for prediction as

(40)	$\hat{y_{i}}=f(\mathbf{X_{i}},[\langle\mathbf{X^{I}_{i_{k}}},y_{i_{k}}\rangle]_{k=1}^{K},\mathbf{P^{TT}_{i}},\mathbf{P^{SS}_{i}},\mathbf{P^{TS}_{i}},\mathbf{P^{ST}_{i}}),$

where $f(\cdot)$ denotes an arbitrary recommendation model.

The training objective consists of the prediction loss, the sufficiency loss, and the disentanglement loss, which can be formulated as:

(41)	$l_{pred}=-\sum_{i}\left(y_{i}\log\hat{y_{i}}+(1-y_{i})\log(1-\hat{y_{i}})\right),$
(42)	$\mathcal{L}=l_{pred}+\alpha\cdot l_{Suf}+\beta\cdot l_{Dis},$

where $\alpha$ and $\beta$ are hyperparameters to scale the loss components.
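The combined objective in Eqs. (41) and (42) amounts to a weighted sum, sketched below in plain Python. The default values of `alpha` and `beta` are the coefficients reported in the implementation details. The auxiliary loss values in the usage lines are made-up numbers for illustration.

```python
import math

def prediction_loss(y_true, y_pred, eps=1e-12):
    """Eq. (41): binary cross-entropy summed over samples."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
        for y, p in zip(y_true, y_pred)
    )

def total_loss(l_pred, l_suf, l_dis, alpha=0.02, beta=0.01):
    """Eq. (42): prediction loss plus the two scaled constraint losses."""
    return l_pred + alpha * l_suf + beta * l_dis

l_pred = prediction_loss([1, 0], [0.5, 0.5])
loss = total_loss(l_pred, l_suf=1.0, l_dis=2.0)  # placeholder auxiliary losses
```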

5. Experiments

In this section, we empirically evaluate the proposed model on three datasets. Four research questions guide the experiments.

RQ1: How does DisCo perform against the baselines?
RQ2: Is DisCo compatible with different backbones?
RQ3: Does each model component contribute to the performance?
RQ4: How efficient is DisCo?

5.1. Setup

5.1.1. Datasets

We use three public datasets to evaluate DisCo. The statistics of the datasets are summarized in Table 1.

Table 1. Dataset statistics.

Dataset	# Users	# Items	# Samples	# Fields
ML-1M	6,040	3,706	970,009	8
AZ-Toys	208,175	77,687	716,197	5
ML-25M	162,541	59,047	24,187,390	4

• ML-1M (https://grouplens.org/datasets/movielens/1m/) is a collection of movie ratings provided by users of the MovieLens website.
• AZ-Toys (https://cseweb.ucsd.edu/~jmcauley/datasets.html) gathers product reviews and metadata for toys and games sold on Amazon.
• ML-25M (https://files.grouplens.org/datasets/movielens/ml-25m.zip) is a popular movie recommendation dataset widely used in machine learning and recommender systems.

Samples with ratings greater than 3 are treated as positive samples, with the others being negative samples. The window size of the historical behaviors is 30. Data is split according to the global timestamps. Specifically, the training data lies between $[0,T_{0})$ , the validation data lies between $[T_{0},T_{1})$ , and the test data lies between $[T_{1},+\infty)$ . The ratio of data amount for train/valid/test is $8:1:1$ .
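The labeling rule and the global temporal split can be sketched as follows. This is a minimal illustration that assumes each sample is a `(timestamp, rating)` tuple and that the split points $T_{0}$ and $T_{1}$ are chosen by position after sorting, so that the 8:1:1 ratio holds. The function name and data layout are hypothetical.

```python
def binarize_and_split(samples, ratios=(8, 1, 1)):
    """Sort by global timestamp, binarize ratings, and split into train/valid/test."""
    ordered = sorted(samples, key=lambda s: s[0])
    # Ratings greater than 3 become positive labels, all others negative.
    labeled = [(ts, 1 if rating > 3 else 0) for ts, rating in ordered]
    n, total = len(labeled), sum(ratios)
    t0 = n * ratios[0] // total                  # index playing the role of T_0
    t1 = n * (ratios[0] + ratios[1]) // total    # index playing the role of T_1
    return labeled[:t0], labeled[t0:t1], labeled[t1:]

samples = [(5, 4), (1, 3), (9, 5), (2, 2), (7, 4),
           (3, 5), (10, 1), (4, 3), (8, 4), (6, 5)]
train, valid, test = binarize_and_split(samples)
```

Splitting on global time rather than per user keeps every training interaction earlier than every validation and test interaction.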

5.1.2. Evaluation Metrics

Two widely used metrics, AUC (area under the ROC curve) and Log Loss (binary cross-entropy loss), are used to evaluate the performance.
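For reference, both metrics can be computed in a few lines of plain Python. The sketch below uses the pairwise definition of AUC, which is quadratic in the number of samples and therefore only suitable for small illustrations. It reports Log Loss as a per-sample average, which is the common convention and an assumption here.

```python
import math

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def log_loss(labels, scores, eps=1e-12):
    """Average binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(auc(labels, scores), log_loss(labels, scores))
```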

5.1.3. Competing Models

We compare the proposed DisCo with the following methods: 1) conventional tabular methods including DeepFM (Guo et al., 2017), DCN (Wang et al., 2017), PNN (Qu et al., 2018), xDeepFM (Lian et al., 2018), AutoInt (Song et al., 2019), DIN (Zhou et al., 2018), and DIEN (Zhou et al., 2019), and 2) semantic-enhanced methods including P5 (Geng et al., 2022), UniSRec (Hou et al., 2022), CTRL (Li et al., 2023), and VQ-Rec (Hou et al., 2023).

5.1.4. Implementation Details

We utilize Vicuna-13b (Chiang et al., 2023) released by FastChat (https://github.com/lm-s) for text encoding. For a fair comparison, we fix the embedding size and the hidden layer size to be the same for all backbone models. The embedding size for the tabular domain representation is 32. The hidden layer sizes of the MLP are $[128,64]$. We use bilinear networks as the discriminator network in the sufficiency constraint and as the vCLUB mutual information estimator in the disentanglement constraint. The coefficients for the sufficiency constraint loss and the disentanglement constraint loss are $0.02$ and $0.01$, respectively. For each model, the learning rate is searched in $\{1e-4,3e-4,5e-4,1e-3\}$, and the weight decay is searched in $\{1e-5,3e-5,5e-5,1e-4,3e-4\}$. We use the Adam (Kingma and Ba, 2017) optimizer during training. The early-stopping patience is 10. The code is available at https://github.com/KounianhuaDu/DisCo and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DisCo.

5.2. Overall Performance (RQ1)

In this section, we compare our proposed DisCo with various baseline models. The experimental results are displayed in Table 2.

Table 2. Major results. For all the baselines, we append the user histories and their corresponding ratings/labels for fair comparison. The best result is in bold, while the second-best value is underlined. Rel. Impr. denotes the relative AUC improvement of DisCo against each baseline model. The symbol * indicates statistically significant improvement with p-value $<0.001$.

Category	Model	ML-1M AUC	ML-1M Logloss	ML-1M Rel. Impr.	AZ-Toys AUC	AZ-Toys Logloss	AZ-Toys Rel. Impr.	ML-25M AUC	ML-25M Logloss	ML-25M Rel. Impr.
Tabular Only (CRS)	DeepFM	0.7947	0.5470	1.11%	0.7423	0.3720	0.74%	0.8133	0.4892	1.36%
	DCN	0.7961	0.5417	0.93%	0.7424	0.3716	0.73%	0.8134	0.4875	1.35%
	PNN	0.7932	0.5451	1.30%	0.7418	0.3705	0.81%	0.8135	0.4869	1.33%
	xDeepFM	0.7938	0.5464	1.22%	0.7392	0.3748	1.16%	0.8093	0.4922	1.84%
	AutoInt	0.7945	0.5435	1.13%	0.7429	0.3705	0.66%	0.8128	0.4883	1.42%
	DIN	0.7976	0.5401	0.74%	0.7424	0.3707	0.73%	0.8174	0.4820	0.86%
	DIEN	0.7970	0.5428	0.82%	0.7446	0.3704	0.43%	0.8189	0.4841	0.68%
Semantic Enhanced	P5	0.7937	0.5478	1.23%	0.7418	0.3736	0.81%	0.8091	0.4921	1.90%
	UnisRec	0.7991	0.5410	0.55%	0.7452	0.3837	0.35%	0.8162	0.5223	1.02%
	CTRL	0.7979	0.5413	0.70%	0.7432	0.3723	0.62%	0.8189	0.4922	0.68%
	VQ-Rec	0.7972	0.5449	0.79%	0.7456	0.3826	0.30%	0.8185	0.5210	0.73%
Best CRS + DisCo		0.8035*	0.5343*	-	0.7478*	0.3704*	-	0.8245*	0.4743*	-

Table 3. Compatibility experiments. The proposed method offers additional feature fields, which can be followed by different feature interaction operations. We test its compatibility with different CTR backbones.

Backbones	ML-1M					AZ-Toys					ML-25M
	Original		+DisCo		Rel. Impr.	Original		+DisCo		Rel. Impr.	Original		+DisCo		Rel. Impr.
	AUC	LL	AUC	LL	Rel. Impr.	AUC	LL	AUC	LL	Rel. Impr.	AUC	LL	AUC	LL	Rel. Impr.
DeepFM	0.7947	0.5470	0.8029	0.5340	1.03%	0.7423	0.3720	0.7462	0.3708	0.53%	0.8133	0.4892	0.8217	0.4771	1.03%
DCN	0.7961	0.5417	0.8035	0.5343	0.93%	0.7424	0.3716	0.7470	0.3693	0.62%	0.8134	0.4875	0.8231	0.4875	1.19%
PNN	0.7932	0.5451	0.8017	0.5401	1.07%	0.7418	0.3705	0.7466	0.3681	0.65%	0.8135	0.4869	0.8245	0.4743	1.35%
xDeepFM	0.7938	0.5464	0.7999	0.5384	0.77%	0.7392	0.3748	0.7434	0.3718	0.57%	0.8093	0.4922	0.8240	0.4792	1.82%
AutoInt	0.7945	0.5435	0.8029	0.5343	1.06%	0.7429	0.3705	0.7472	0.3710	0.58%	0.8128	0.4883	0.8196	0.4863	0.83%
DIN	0.7976	0.5401	0.8016	0.5373	0.51%	0.7424	0.3707	0.7478	0.3704	0.73%	0.8174	0.4820	0.8212	0.4829	0.46%
DIEN	0.7970	0.5428	0.8025	0.5353	0.69%	0.7446	0.3704	0.7468	0.3693	0.30%	0.8189	0.4841	0.8219	0.4792	0.37%

From the results, one can draw the following conclusions. 1) Our proposed DisCo consistently outperforms all the baseline models, including both the tabular-only methods and the semantic-enhanced methods. The improvements are statistically significant under p-value $<0.001$ . This shows the effectiveness of the proposed paradigm that disentangles and collaborates tabular and semantic domain knowledge for enhanced recommendation. 2) The methods that involve semantic knowledge can surpass the conventional tabular-only methods, which shows the effectiveness of introducing external semantic knowledge into recommendation. 3) DisCo outperforms methods that focus on aligning the two representation spaces. For example, CTRL utilizes contrastive learning to align the two representation spaces. Such methods tend to pull the representations in the two spaces closer together, so unique information is discarded during training. This validates the effectiveness of DisCo, which preserves the unique information of the two different representation spaces.
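The Rel. Impr. column of Table 2 can be reproduced with a one-line formula. The sketch below assumes that the relative improvement is defined as the AUC gap divided by the baseline AUC; the function name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(auc_disco: float, auc_baseline: float) -> float:
    """Relative AUC improvement of DisCo over a baseline, in percent."""
    return (auc_disco - auc_baseline) / auc_baseline * 100.0


# ML-1M AUC values copied from Table 2.
DISCO_AUC = 0.8035
baselines = {"DeepFM": 0.7947, "DCN": 0.7961, "DIN": 0.7976}

for name, auc in baselines.items():
    print(f"{name}: {relative_improvement(DISCO_AUC, auc):.2f}%")
# prints DeepFM: 1.11%, DCN: 0.93%, DIN: 0.74%
```

These values match the corresponding ML-1M entries in Table 2, which suggests that the assumed definition is the one used.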

5.3. Compatibility Study (RQ2)

Since the extracted patterns obtained from the dual-side attentive network can serve as extra features, they could be appended to arbitrary conventional recommendation models. In this section, we evaluate the compatibility of the proposed framework on different conventional backbones.

The feature interaction methods of the backbones include product-based, MLP-based, and attention-based operators. We test DisCo on these different operators and justify the effectiveness of the resulting feature fields. The results are displayed in Table 3. From the results, we can see that the proposed method could offer performance gains for various backbone models and operations. The improvements are statistically significant under p-value $<0.001$ , which validates the superior compatibility of DisCo.

5.4. Ablation Studies (RQ3)

5.4.1. Impact of the Dual-Side Attentive Network

In this section, we validate the effectiveness of the proposed dual-side attentive network module. Concretely, we remove the inter-domain attention, within which the two representation spaces attend to each other, and only retain the intra-domain attention. This results in the common two-tower structure in recommender systems, where the semantic and tabular representations are modeled separately for semantic dependencies and collaborative signals, respectively. In contrast, our dual-side attentive network module models both the intra-domain and inter-domain user behavior patterns. The results of the two-tower attention and the dual-side attentive network module are displayed in Table 4.
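The difference between the two variants can be illustrated with a simplified sketch. It assumes plain scaled dot-product attention without learned projections, with keys reused as values, and it assumes that $\mathbf{H^{TS}}$ denotes tabular queries attending to semantic keys; the actual module may differ on all of these points. The names `attend` and `pattern_vectors` are hypothetical.

```python
import math


def attend(queries, keys):
    """Scaled dot-product attention with keys reused as values.

    Assumption: no learned projections, unlike a real attention layer.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return outputs


def pattern_vectors(tabular, semantic, inter_domain=True):
    """Intra-domain patterns always; inter-domain patterns only if enabled."""
    patterns = {
        "H_TT": attend(tabular, tabular),
        "H_SS": attend(semantic, semantic),
    }
    if inter_domain:  # dual-side attention; False gives the two-tower ablation
        patterns["H_TS"] = attend(tabular, semantic)
        patterns["H_ST"] = attend(semantic, tabular)
    return patterns


tabular = [[1.0, 0.0], [0.0, 1.0]]    # toy tabular behavior embeddings
semantic = [[0.5, 0.5], [1.0, 1.0]]   # toy semantic behavior embeddings
print(sorted(pattern_vectors(tabular, semantic)))
print(sorted(pattern_vectors(tabular, semantic, inter_domain=False)))
```

Setting `inter_domain=False` corresponds to the two-tower ablation, in which each space only attends to itself.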

Table 4. Experiment on the Aggregation Module.

Datasets	Two-Tower Attention		Dual-Side Attention
Datasets	AUC	Logloss	AUC	Logloss
ML-1M	0.8015	0.5360	0.8035	0.5343
AZ-Toys	0.7464	0.3689	0.7478	0.3704
ML-25M	0.8208	0.4837	0.8245	0.4743

From the results, we can see that the proposed dual-side attentive network outperforms the two-tower aggregation in AUC on all three datasets and in Logloss on ML-1M and ML-25M, which validates the effectiveness of the proposed module that captures both the intra-domain and the inter-domain knowledge.

5.4.2. Impact of the Constraints

In this section, we study the impact of the proposed constraints.

Table 5. Impacts of the two constraints.

Datasets	ML-1M		AZ-Toys		ML-25M
Datasets	AUC	Logloss	AUC	Logloss	AUC	Logloss
w/ Both (DisCo)	0.8035	0.5343	0.7478	0.3704	0.8245	0.4743
w/o Sufficiency	0.8019	0.5354	0.7474	0.3686	0.8220	0.4780
w/o Disentanglement	0.8011	0.5377	0.7476	0.3687	0.8212	0.4833
w/o Both	0.7988	0.5391	0.7460	0.3701	0.8163	0.4928

Firstly, we conduct experiments with and without the constraints, the results of which are displayed in Table 5. From the results, we can draw the following conclusions. 1) Both the sufficiency and the disentanglement constraints offer performance gains to the model, with sufficiency helping to preserve task-relevant information from each space and disentanglement helping to retain the unique information of each space. 2) The two constraints complement each other and jointly boost performance, which helps to capture both the consistent and the specific information of the two spaces.
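The four rows of Table 5 can be read as switching the two regularizers on and off. The sketch below assumes an additive objective in which each constraint loss is scaled by the coefficient reported in Section 5.1.4 and added to the task loss; the function `total_loss` and its flags are hypothetical, and each constraint term is assumed to be expressed as a quantity to minimize.

```python
SUFFICIENCY_WEIGHT = 0.02      # coefficient reported in Section 5.1.4
DISENTANGLEMENT_WEIGHT = 0.01  # coefficient reported in Section 5.1.4


def total_loss(task_loss, sufficiency_loss, disentanglement_loss,
               use_sufficiency=True, use_disentanglement=True):
    """Assumed additive objective: task loss plus weighted constraint losses."""
    loss = task_loss
    if use_sufficiency:
        loss += SUFFICIENCY_WEIGHT * sufficiency_loss
    if use_disentanglement:
        loss += DISENTANGLEMENT_WEIGHT * disentanglement_loss
    return loss


# Toy loss values, for illustration only.
print(total_loss(0.5, 1.0, 2.0))                         # w/ Both
print(total_loss(0.5, 1.0, 2.0, use_sufficiency=False))  # w/o Sufficiency
```

Disabling both flags reduces the objective to the task loss alone, which corresponds to the "w/o Both" row.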

In addition, we further visualize the representations of the dual-side attentive module output with and without the disentanglement constraint to dig into how the disentanglement impacts the distribution of representations. The visualization of ML-1M is displayed in Figure 4. Visualizations of more datasets could be found in the Appendix.

Concretely, we visualize the representations of the intra-domain pattern vectors $\mathbf{H^{TT},H^{SS}}$ and inter-domain pattern vectors $\mathbf{H^{TS},H^{ST}}$ with t-SNE (Van der Maaten and Hinton, 2008). The left column of Figure 4 displays the representations without the disentanglement constraint, while the right column displays those with the disentanglement constraint. From the figure, we can see that under the regularization of the disentanglement constraint, the distributions of representations from different domains are better separated and the representation spaces exhibit clearer manifolds. This illustrates that the disentanglement constraint can separate and extract different information from the two spaces, which validates the effectiveness of our design.

5.5. Efficiency Analysis (RQ4)

In this section, we discuss the efficiency of the proposed model. Firstly, the semantic embeddings for items could be pre-computed and stored in the indexed knowledge base, the construction of which can be done offline and only once. In addition, after the MLP used to reduce the dimension of the semantic embedding is trained, we could use the trained MLP to reduce the dimension and further keep a reduced-dimension version of the indexed knowledge base. Therefore, in the inference stage we do not need to deal with the high-dimensional semantic vectors.
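The offline pipeline described above can be sketched as follows. The snippet assumes a plain dictionary as the indexed knowledge base and a single linear layer as a stand-in for the trained dimension-reduction MLP; the names `reduce_dimension` and `build_reduced_knowledge_base`, the toy embeddings, and the projection weights are all hypothetical.

```python
def reduce_dimension(vector, weights, bias):
    """Single linear layer standing in for the trained dimension-reduction MLP."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def build_reduced_knowledge_base(semantic_embeddings, weights, bias):
    """Offline, one-off step: project every item embedding and index it by item ID."""
    return {item_id: reduce_dimension(vec, weights, bias)
            for item_id, vec in semantic_embeddings.items()}


# Toy 4-dimensional "LLM" embeddings and a hypothetical 4 -> 2 projection.
semantic_embeddings = {
    "item_a": [1.0, 2.0, 3.0, 4.0],
    "item_b": [0.0, 1.0, 0.0, 1.0],
}
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.5]

knowledge_base = build_reduced_knowledge_base(semantic_embeddings, weights, bias)

# At inference time only a lookup of the low-dimensional vector is needed.
print(knowledge_base["item_a"])  # [1.0, 4.5]
```

With this layout, neither the LLM nor the high-dimensional vectors are touched on the serving path.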

Table 6. Training and inference time per sample (s).

Dataset	ML-1M		AZ-Toys
Dataset	Training	Inference	Training	Inference
DCN	$2.18\times 10^{-3}$	$2.86\times 10^{-5}$	$2.06\times 10^{-3}$	$2.68\times 10^{-5}$
DIN	$1.99\times 10^{-3}$	$1.34\times 10^{-5}$	$2.05\times 10^{-3}$	$2.15\times 10^{-5}$
DisCo	$5.04\times 10^{-3}$	$7.42\times 10^{-5}$	$4.29\times 10^{-3}$	$6.89\times 10^{-5}$

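The training and inference time analysis is displayed in Table 6. All the experiments are run on a single V100 GPU with an Intel Xeon Gold 6278C 2.60GHz CPU, and each is repeated three times to obtain the average time. From the results, we can see that the proposed method does not cause a heavy overhead: its per-sample cost is roughly 2.1 to 2.5 times that of DCN and DIN in training and roughly 2.6 to 5.5 times in inference, and inference still takes less than $10^{-4}$ s per sample.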
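A per-sample timing of this kind can be obtained with a small helper such as the one below. This is only a sketch of the averaging protocol: `average_time_per_sample` is a hypothetical name, the workload is a toy stand-in, and on a GPU any asynchronously launched work would need to be synchronized inside `run_fn` for the wall-clock measurement to be meaningful.

```python
import time


def average_time_per_sample(run_fn, num_samples, repeats=3):
    """Average wall-clock seconds per sample over `repeats` full runs."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_fn()  # one full pass over `num_samples` samples
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / repeats / num_samples


# Toy workload standing in for one inference pass over 1000 samples.
seconds = average_time_per_sample(lambda: sum(i * i for i in range(1000)), 1000)
print(f"{seconds:.2e} s per sample")
```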

6. Conclusion

Recommender systems play a vital role in our daily life. Conventional recommendation methods focus on modeling feature interactions and user behaviors within the ID-based tabular representation space and fail to capture semantic dependencies among user behaviors. Existing semantic-enhanced methods focus on aligning the tabular and semantic spaces, while the unique and disentangled parts of the two representation spaces are not well explored. In this paper, we propose DisCo to disentangle and collaborate the tabular and semantic representation spaces, capturing both the consistent and the specific knowledge from the two spaces for enhanced recommendation. Concretely, we design three modules, namely the dual-side attentive network, the sufficiency constraint, and the disentanglement constraint. To efficiently utilize the semantic knowledge, a textual description for each item is first obtained and encoded by LLMs, and the resulting embedding is then stored in an indexed knowledge base. The dual-side attention module models intra-domain and inter-domain patterns to offer additional knowledge for arbitrary recommendation backbones, and it is regularized by the designed sufficiency and disentanglement constraints. The two constraints force the model to preserve useful information and extract unique information from the two spaces. Extensive experiments and ablation studies on three datasets and various backbone models justify the effectiveness of the proposed method.

References

Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, and Aaron C. Courville. 2018. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80. PMLR, 530–539.
Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS@RecSys.
Cheng et al. (2020) Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. 2020. CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1779–1788.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
Christensen and Schiaffino (2011) Ingrid A. Christensen and Silvia Schiaffino. 2011. Entertainment recommender systems for group of users. Expert Systems with Applications 38, 11 (2011), 14127–14135. https://doi.org/10.1016/j.eswa.2011.04.221
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Ding et al. (2021) Hao Ding, Yifei Ma, Anoop Deoras, Yuyang Wang, and Hao Wang. 2021. Zero-Shot Recommender Systems. arXiv:2105.08318 [cs.LG]
Du et al. (2022) Kounianhua Du, Weinan Zhang, Ruiwen Zhou, Yangkun Wang, Xilong Zhao, Jiarui Jin, Quan Gan, Zheng Zhang, and David P Wipf. 2022. Learning Enhanced Representation for Tabular Data via Neighborhood Propagation. Advances in Neural Information Processing Systems 35 (2022), 16373–16384.
Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. 2301–2307.
Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
Geng et al. (2023) Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. 2023. VIP5: Towards Multimodal Foundation Models for Recommendation. arXiv preprint arXiv:2305.14302 (2023).
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI.
Hjelm et al. (2019) R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
Hou et al. (2023) Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023. Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. arXiv:2210.12316 [cs.IR]
Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards Universal Sequence Representation Learning for Recommender Systems. arXiv:2206.05941 [cs.IR]
Hua et al. (2023a) Wenyue Hua, Yingqiang Ge, Shuyuan Xu, Jianchao Ji, and Yongfeng Zhang. 2023a. UP5: Unbiased Foundation Model for Fairness-aware Recommendation. arXiv preprint arXiv:2305.12090 (2023).
Hua et al. (2023b) Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023b. How to Index Item IDs for Recommendation Foundation Models. arXiv preprint arXiv:2305.06569 (2023).
Huang et al. (2022) Yanhua Huang, Hangyu Wang, Yiyun Miao, Ruiwen Xu, Lei Zhang, and Weinan Zhang. 2022. Neural Statistics for Click-Through Rate Prediction. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1849–1853.
Juan et al. (2016a) Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016a. Field-aware Factorization Machines for CTR Prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016. ACM, 43–50.
Juan et al. (2016b) Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016b. Field-aware factorization machines for CTR prediction. In RecSys.
Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]
Li et al. (2023) Xiangyang Li, Bo Chen, Lu Hou, and Ruiming Tang. 2023. CTRL: Connect Tabular and Language Model for CTR Prediction. arXiv preprint arXiv:2306.02841 (2023).
Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1754–1763.
Lin et al. (2024a) Jianghao Lin, Bo Chen, Hangyu Wang, Yunjia Xi, Yanru Qu, Xinyi Dai, Kangning Zhang, Ruiming Tang, Yong Yu, and Weinan Zhang. 2024a. ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction. In Proceedings of the ACM on Web Conference 2024 (WWW ’24). 3319–3330.
Lin et al. (2023) Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. 2023. How Can Recommender Systems Benefit from Large Language Models: A Survey. arXiv preprint arXiv:2306.05817 (2023).
Lin et al. (2024b) Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, and Weinan Zhang. 2024b. ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation. In Proceedings of the ACM on Web Conference 2024 (WWW ’24). 3497–3508.
Liu et al. (2020) Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 2636–2645.
Liu et al. (2022) Guang Liu, Jie Yang, and Ledell Wu. 2022. PTab: Using the Pre-trained Language Model for Modeling Tabular Data. arXiv preprint arXiv:2209.08060 (2022).
Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019. ACM, 2671–2679.
Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 2685–2692.
Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
Qin et al. (2020) Jiarui Qin, Weinan Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Yong Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. ACM, 2347–2356.
Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Transactions on Information Systems (2018).
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In ICDM.
Smith and Linden (2017) Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing (2017). https://www.amazon.science/publications/two-decades-of-recommender-systems-at-amazon-com
Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019. ACM, 1161–1170.
Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, August 13 - 17, 2017. ACM, 12:1–12:7.
Wu et al. (2023) Chuhan Wu, Fangzhao Wu, Yongfeng Huang, and Xing Xie. 2023. Personalized news recommendation: Methods and challenges. ACM Transactions on Information Systems 41, 1 (2023), 1–50.
Xi et al. (2023) Yunjia Xi, Weiwen Liu, Jianghao Lin, Jieming Zhu, Bo Chen, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023. Towards Open-World Recommendation with Knowledge Augmentation from Large Language Models. arXiv:2306.10933 [cs.IR]
Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. 3119–3125.
Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. 5941–5948.
Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chengru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018. 1059–1068.

Appendix A Visualizations

Visualizations of the representations for the dual-side attentive network on AZ-Toys and ML-25M are displayed in Figure 5 and Figure 6.

Table 7. Items used in the case study.

Item Title	Galaxy Quest	Babe: Pig in the City
Genre provided in the MovieLens dataset	Adventure	Children’s movie
Tags provided by Douban (Open world knowledge, unknown by the tabular model)	Comedy, Action, Fantasy and adventure.	Comedy, Fantasy and Adventure.

Table 8. Similarities of items in different spaces.

Cosine Similarity	In Tabular Space	In Semantic Space (BERT)	In Semantic Space (Vicuna-13b)
Galaxy Quest – Babe: Pig in the City	-0.0210	0.0599	0.9606

Appendix B Experiments on Long-Tail Data

In this section, we validate the effectiveness of DisCo on long-tail data, where features occur infrequently in the tabular representation space. Concretely, we sort items based on their frequency of occurrence in the training set. The bottom 10% in terms of frequency are classified as long-tail items. Then we compare the performance of the best-performing tabular-only model, DIN, with and without the proposed DisCo.
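The long-tail split described above can be sketched as follows. The function name `long_tail_items` is hypothetical, and two details are assumptions because the text does not specify them: ties in frequency are broken by item ID, and the 10% is taken over distinct items rather than over interactions.

```python
from collections import Counter


def long_tail_items(train_item_ids, fraction=0.10):
    """Return the least frequent `fraction` of distinct items in the training set.

    Assumptions: ties are broken by item ID, and at least one item is returned.
    """
    counts = Counter(train_item_ids)
    ranked = sorted(counts, key=lambda item: (counts[item], item))
    cutoff = max(1, int(len(ranked) * fraction))
    return set(ranked[:cutoff])


# Toy training log: item i appears i times, for 20 distinct items.
train_item_ids = [i for i in range(1, 21) for _ in range(i)]
print(long_tail_items(train_item_ids))  # {1, 2}
```

Test samples whose target item falls in the returned set would then form the tail subset used for evaluation.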

Table 9. Experiment on the tail data.

Datasets	DIN		DIN + DisCo
Datasets	AUC	Logloss	AUC	Logloss
ML-1M	0.6710	0.6564	0.6934	0.6308
AZ-Toys	0.6673	0.3982	0.7416	0.3778
ML-25M	0.7963	0.5430	0.8032	0.5380

The performance comparisons are displayed in Table 9. From the results, we can see that the proposed method gives a clear performance boost on the tail data (e.g., the AUC on AZ-Toys rises from 0.6673 to 0.7416). This is because the general knowledge contained in the pretrained semantic embeddings helps to complement the less-trained features in the tabular representation space; the semantic information is injected through the dual-side attentive network under the regularization of the constraints.

Appendix C Case Study

In this section, we provide an example to support the claim that encodings from LLMs can capture open-world knowledge.

Consider two items in the MovieLens dataset: "Galaxy Quest" and "Babe: Pig in the City". Their genres provided in the dataset and the tags provided by the open world are listed in Table 7.

For the two items:

• The two items do not share any common tokens.
• The two items do not share any common features given in the dataset.
• However, they are actually close, as they share many common features in the open world (e.g., the tags given by Douban).

We then compute the cosine similarity between the two items in the tabular space and in the semantic space, as shown in Table 8. (Note that no generation is involved. We use the same genres provided by the MovieLens dataset to obtain the item encodings for both the tabular and the semantic representations.)
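For reference, the cosine similarity between two item encodings can be computed as in the sketch below. The function name and the toy vectors are illustrative only and are not the actual item encodings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: orthogonal encodings score 0, parallel encodings score 1.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

A value near 0 indicates unrelated encodings, while a value near 1 indicates that the two encodings point in almost the same direction.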

From the results, we can see that:

• Tabular representations cannot find the relevance between the two items, since no common features exist.
• Semantic representations from small language models cannot find the relevance between the two items well, since no common tokens exist.
• Semantic representations from LLMs can find the relevance between the two items well, even though almost no common tokens are shared between the two items in the dataset. This is because many anchors for the two items exist in the large open-world training corpus (e.g., the common tags given by Douban, similar movie descriptions, etc.); as these are trained together, the representations of the two items tend to get close.

The above case study can justify that the encodings from LLMs can help to capture open-world knowledge.