DisCo: Towards Harmonious Disentanglement and Collaboration between Tabular and Semantic Space for Recommendation

Kounianhua Du (Shanghai Jiao Tong University, Shanghai, China), Jizheng Chen (Shanghai Jiao Tong University, Shanghai, China), Jianghao Lin (Shanghai Jiao Tong University, Shanghai, China), Yunjia Xi (Shanghai Jiao Tong University, Shanghai, China), Hangyu Wang (Shanghai Jiao Tong University, Shanghai, China), Xinyi Dai (Huawei Noah’s Ark Lab, Shanghai, China), Bo Chen (Huawei Noah’s Ark Lab, Shanghai, China), Ruiming Tang (Huawei Noah’s Ark Lab, Shenzhen, China), and Weinan Zhang (Shanghai Jiao Tong University, Shanghai, China)
Abstract.

Recommender systems play important roles in various applications such as e-commerce and social media. Conventional recommendation methods usually model the collaborative signals within the tabular representation space. Although they offer personalized modeling and efficiency, they omit the latent semantic dependencies. Methods that introduce semantics into recommendation have therefore emerged, injecting knowledge from the semantic representation space, where general language understanding is compressed. However, existing semantic-enhanced recommendation methods focus on aligning the two spaces, during which the representations of the two spaces tend to get close while the unique patterns are discarded and left unexplored. In this paper, we propose DisCo to Disentangle the unique patterns from the two representation spaces and Collaborate the two spaces for recommendation enhancement, where both the specificity and the consistency of the two spaces are captured. Concretely, we propose 1) a dual-side attentive network to capture the intra-domain patterns and the inter-domain patterns, 2) a sufficiency constraint to preserve the task-relevant information of each representation space and filter out the noise, and 3) a disentanglement constraint to prevent the model from discarding the unique information. These modules strike a balance between disentanglement and collaboration of the two representation spaces to produce informative pattern vectors, which can serve as extra features and be appended to arbitrary recommendation backbones for enhancement. Experimental results validate the superiority of our method over different models and the compatibility of DisCo with different backbones. Various ablation studies and an efficiency analysis are also conducted to justify each model component.

Recommender Systems, User Modeling, Large Language Model
Conference: KDD; Aug 25–29, 2024; Barcelona. CCS concepts: Information systems; Recommender systems.

1. Introduction

Figure 1. Illustration of the motivation.

Recommender systems have become an integral part of today’s digital ecosystem, enhancing user experiences, boosting engagement, facilitating decision-making, and fostering connections between users and relevant content or products. They are widely used in various industries, including e-commerce (Smith and Linden, 2017), entertainment (Christensen and Schiaffino, 2011), social media (Wu et al., 2023), and online streaming platforms (Zhou et al., 2018, 2019; Huang et al., 2022).

Conventional recommender systems usually model only the collaborative signals within the tabular representation space, where samples consist of multi-field categorical features. These methods focus on mining beneficial interactions among features (Juan et al., 2016a; Xiao et al., 2017; Liu et al., 2020; Song et al., 2019) and modeling user interests using historical user behaviors (Zhou et al., 2018, 2019; Feng et al., 2019; Pi et al., 2020; Qin et al., 2020) for accurate and personalized recommendation. While good at modeling feature interactions and personalized user preferences, these methods fail to learn the latent semantic dependencies of features. For example, as shown in Figure 1, a book written by Ernest Hemingway and a book written by Raymond Carver can be close in the semantic space because both authors are known for a minimalist writing style, and Carver said that he borrowed many elements from Hemingway’s style. This latent semantic dependency cannot be inferred from the tabular representation space, where features are first encoded in a one-hot manner. Attempts to incorporate external knowledge from the semantic representation space into recommendation have therefore emerged (Hou et al., 2022; Li et al., 2023; Hou et al., 2023), where textual descriptions and their encodings hold the external semantic knowledge. Despite the general language understanding captured in the semantic representation space, the encodings of a large language model can be ambiguous and fail to reflect the correlations among some features. For instance, beer and diapers are distant in the semantic space but close in the tabular space, where correlation analysis is performed, since users tend to buy beer and diapers together.

As discussed above, the relationship among the same set of user behaviors can be different in the two representation spaces. The unique and disentangled patterns from the two representation spaces form a complementary relationship with each other, contributing different information. Therefore, it is vital to effectively and efficiently collaborate the tabular and semantic representation spaces. The existing works either: 1) take only one space into consideration (Geng et al., 2022) or 2) only focus on aligning the two spaces (Li et al., 2023), during which the representations tend to get close (green part in Figure 1), with the disentangled part (yellow and blue parts in Figure 1) being discarded.

In this paper, we propose DisCo to Disentangle and Collaborate the tabular and semantic representation spaces for user behavior pattern modeling. To this end, we design 1) a dual-side attentive network (DS-Attn) to capture the intra-domain and inter-domain patterns, 2) a sufficiency constraint to preserve the useful information (yellow, green, and blue parts in Figure 1) and eliminate the noisy information (grey part in Figure 1) from the two representation spaces, and 3) a disentanglement constraint to preserve the disentangled parts of the two representation spaces (yellow and blue parts in Figure 1). Together, these modules strike a balance between the collaboration and the disentanglement of the tabular and semantic representation spaces.

Concretely, DS-Attn uses intra-domain attention to infer the inner patterns within the tabular space (collaborative-based correlations) and within the semantic space (semantic-based dependencies). It uses inter-domain attention to capture the patterns shared between the two spaces as aligned knowledge, where a query-key exchange is adopted to make the two spaces attend to each other. The parameters of DS-Attn and the embedding networks are regularized by the sufficiency and disentanglement constraints. The sufficiency constraint maximizes the mutual information between the representations of each space and the labels, which preserves the task-relevant information offered by each space. The disentanglement constraint minimizes the mutual information between the output vectors of DS-Attn from the two spaces, which forces the two spaces to offer different information. The resulting vectors, output by DS-Attn and regularized by the two constraints, preserve both the consistent and the specific knowledge of the two representation spaces and can then be fed into arbitrary recommendation backbones for prediction enhancement.

The main contributions of the paper are summarized as follows:

  • We design a novel and effective framework, DisCo, that harmoniously captures both the consistent and the specific knowledge from tabular representation space and semantic representation space with the dual-side attentive network under the regularization of the proposed sufficiency and disentanglement constraints.

  • We emphasize the importance of unique and disentangled information in both the tabular space and the semantic space, and propose the first work to disentangle the tabular and semantic representation spaces for unique domain knowledge.

  • DisCo is a model-agnostic framework compatible with different recommendation backbones, offering flexibility and generality.

The experiment results over a series of recommendation backbones justify the consistent superiority of the proposed method. Ablation studies are also conducted to validate the effectiveness of different model components.

2. Related Work

2.1. Tabular-Only Methods

Early recommendation models focus on mining interactions among features. FM (Rendle, 2010) captures second-order feature interactions. FFM (Juan et al., 2016b) introduces field-aware interactions. Wide & Deep (Cheng et al., 2016) combines the strengths of linear models and deep neural networks. DeepFM (Guo et al., 2017) replaces the logistic regression layer in Wide & Deep (Cheng et al., 2016) with an FM layer. xDeepFM (Lian et al., 2018) introduces the cross layer for high-order feature interactions. PNN (Qu et al., 2018) utilizes the product layer to learn high-order product interactions. DCN (Wang et al., 2017) proposes to capture both shallow and deep feature interactions effectively. AutoInt (Song et al., 2019) utilizes the multi-head self-attention mechanism to learn high-order feature interactions. By modeling user behavior patterns, recommenders provide more personalized recommendations. DIN (Zhou et al., 2018) incorporates the attention mechanism (Vaswani et al., 2023) to capture user interests. DIEN (Zhou et al., 2019) uses a GRU module to better model the evolving interests of users. DSIN (Feng et al., 2019) proposes to capture the dynamic interests of users within a session. MIMN (Pi et al., 2019) leverages a memory network architecture to capture different aspects of user interests. SIM (Pi et al., 2020) models long-term user behaviors.

2.2. Semantic-Enhanced Methods

Recently, large language models have had a great impact on various domains of recommender systems, and there have been several attempts to incorporate them into recommender systems (Lin et al., 2023, 2024b; Xi et al., 2023; Lin et al., 2024a). PTab (Liu et al., 2022) adopts a classic BERT (Devlin et al., 2018) framework with Modality Transformation (MT), Masked-Language Fine-tuning (MF), and Classification Fine-tuning (CF) training stages. P5 (Geng et al., 2022), as well as its variants (Geng et al., 2023; Hua et al., 2023a, b), proposes to tune T5 (Raffel et al., 2020) as a unified recommendation model for various downstream tasks. ZESRec (Ding et al., 2021) proposes to obtain universal representations from item descriptions through BERT for zero-shot recommendation. UniSRec (Hou et al., 2022) learns item representations via a fixed BERT model followed by an MoE-enhanced network. CTRL (Li et al., 2023) adopts the contrastive learning methodology to align the tabular space and the semantic space for recommendation enhancement. VQ-Rec (Hou et al., 2023) improves on UniSRec (Hou et al., 2022) by introducing a vector quantization technique.

3. Preliminaries

3.1. Problem Formulation

The click-through rate (CTR) prediction task aims at accurately predicting the probability of a user clicking an item, which is the core task for recommender systems. Therefore, we mainly focus on the CTR prediction task in this work. The conventional CTR prediction task within the tabular domain can be formulated as

$$p(y_{i} \mid \mathbf{X}^{U}_{i}, \mathbf{X}^{I}_{i}, \mathbf{X}^{C}_{i}, \theta), \tag{1}$$

where $\mathbf{X}^{U}_{i} = [x_{i,1}^{U}, \dots, x_{i,F_{U}}^{U}]$ is the set of user features, $\mathbf{X}^{I}_{i} = [x_{i,1}^{I}, \dots, x_{i,F_{I}}^{I}]$ is the set of item features, $\mathbf{X}^{C}_{i} = [x_{i,1}^{C}, \dots, x_{i,F_{C}}^{C}]$ is the set of context features for the click prediction event (e.g., device, season, etc.), $y_{i}$ is the label of the data point, and $\theta$ is the model parameter. We use $F_{U}, F_{I}, F_{C}$ to denote the number of user features, item features, and context features, respectively. These methods model feature interactions within the target sample only and fail to model user behavior patterns.

Modeling user behavior patterns plays an important role in boosting personalized recommendation performance. This line of methods takes users’ historical behaviors as extra inputs and models the dependencies between the candidate item and historical items, which can be formulated as

$$p(y_{i} \mid \mathbf{X}^{U}_{i}, \mathbf{X}^{I}_{i}, \mathbf{X}^{C}_{i}, [\langle \mathbf{X}^{I}_{i_{k}}, y_{i_{k}} \rangle]_{k=1}^{K}, \theta), \tag{2}$$

where $[\langle \mathbf{X}^{I}_{i_{k}}, y_{i_{k}} \rangle]_{k=1}^{K}$ represents the list of the user’s historical behaviors and their corresponding labels, and $\theta$ is the model parameter.

In this paper, we aim to involve the embedded open-world knowledge of large language models to achieve harmonious disentanglement and collaboration between the tabular and semantic space for recommendation enhancement. Hence, the prediction can be formulated as

$$p(y_{i} \mid \mathbf{X}^{U}_{i}, \mathbf{X}^{I}_{i}, \mathbf{X}^{C}_{i}, [\langle \mathbf{X}^{I}_{i_{k}}, y_{i_{k}} \rangle]_{k=1}^{K}, \Phi_{S}, \theta), \tag{3}$$

where $\Phi_{S}$ represents the encoder of a large language model.

3.2. Mutual Information Minimization & Maximization

Mutual information (MI) is important but hard to compute in neural networks. For mutual information maximization, MINE (Belghazi et al., 2018) builds connections between the expectations of variables and mutual information, and proposes a lower bound of the mutual information based on the Donsker-Varadhan representation of KL divergence.

DIM (Hjelm et al., 2019) points out that we do not necessarily need to obtain the precise value of MI but only need to maximize it. They use the Jensen-Shannon Divergence to estimate the MI and therefore propose a GAN-style loss to maximize it:

$$L = E_{\mathbb{J}}\left[\log T_{w}(x,y)\right] + E_{\mathbb{M}}\left[\log(1 - T_{w}(x,y))\right]. \tag{4}$$
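To make Eq. (4) concrete, the following minimal Python sketch estimates the objective from samples. It is an illustration under stated assumptions, not the implementation of DIM or DisCo: the critic `toy_critic`, the toy data, and the use of all cross pairings as samples from the product of marginals are hypothetical choices made only for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[float, float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def jsd_mi_objective(
    critic: Callable[[float, float], float],
    joint_pairs: Sequence[Pair],
    marginal_pairs: Sequence[Pair],
) -> float:
    """Monte Carlo estimate of Eq. (4).

    `critic` plays the role of T_w and must return a value in (0, 1).
    `joint_pairs` are samples from the joint distribution (J in Eq. (4));
    `marginal_pairs` are samples from the product of marginals (M in Eq. (4)).
    """
    eps = 1e-12  # guards against log(0)
    joint_term = sum(math.log(critic(x, y) + eps) for x, y in joint_pairs)
    joint_term /= len(joint_pairs)
    marginal_term = sum(math.log(1.0 - critic(x, y) + eps) for x, y in marginal_pairs)
    marginal_term /= len(marginal_pairs)
    return joint_term + marginal_term


def toy_critic(x: float, y: float) -> float:
    """Hypothetical critic: scores a pair higher when x and y share a sign."""
    return sigmoid(2.0 * x * y)


# Toy joint samples in which y equals x (strong dependence).
joint: List[Pair] = [(1.0, 1.0), (-1.0, -1.0), (0.5, 0.5), (-0.5, -0.5)]
# All cross pairings of the observed x and y values, used here as samples
# from the product of the empirical marginals.
marginal: List[Pair] = [(x, y) for x, _ in joint for _, y in joint]

informed_value = jsd_mi_objective(toy_critic, joint, marginal)
uninformed_value = jsd_mi_objective(lambda x, y: 0.5, joint, marginal)
```

On this toy data the sign-aware critic attains a larger objective value than a constant critic that outputs 0.5 everywhere, which is the behavior the maximization in Eq. (4) rewards.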

For mutual information minimization, CLUB (Cheng et al., 2020) introduces an upper bound for mutual information. When the conditional distribution $p(y|x)$ is known, the upper bound can be represented as

$$\begin{split} I(X;Y) &\leq I_{CLUB}(X;Y) \\ &= E_{p(x,y)}\left[\log p(y|x)\right] - E_{p(x)}E_{p(y)}\left[\log p(y|x)\right]. \end{split} \tag{5}$$

When the conditional distribution is not known, one could use a variational distribution $q_{\theta}(y|x)$ to approximate it and the upper bound then becomes

$$I_{vCLUB} = E_{p(x,y)}\left[\log q_{\theta}(y|x)\right] - E_{p(x)}E_{p(y)}\left[\log q_{\theta}(y|x)\right]. \tag{6}$$
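A sample-based reading of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the estimator used in DisCo: the Gaussian form of $q_{\theta}(y|x)$, the fixed (untrained) parameters, and the toy data are hypothetical. The first expectation is averaged over the observed pairs $(x_i, y_i)$ and the second over all cross pairings $(x_i, y_j)$.

```python
import math
from typing import Callable, Sequence


def gaussian_log_prob(y: float, mean: float, std: float) -> float:
    """Log density of a Gaussian N(mean, std^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (y - mean) ** 2 / (2.0 * std * std)


def vclub_estimate(
    xs: Sequence[float],
    ys: Sequence[float],
    log_q: Callable[[float, float], float],
) -> float:
    """Sample estimate of Eq. (6); log_q(y, x) returns log q_theta(y | x)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    # E_{p(x,y)}[log q(y|x)]: average over the observed (x_i, y_i) pairs.
    paired = sum(log_q(ys[i], xs[i]) for i in range(n)) / n
    # E_{p(x)} E_{p(y)}[log q(y|x)]: average over all (x_i, y_j) pairings.
    crossed = sum(log_q(ys[j], xs[i]) for i in range(n) for j in range(n)) / (n * n)
    return paired - crossed


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = list(xs)  # toy data in which y is fully determined by x


def q_dependent(y: float, x: float) -> float:
    """Hypothetical variational model that predicts y from x."""
    return gaussian_log_prob(y, mean=x, std=1.0)


def q_independent(y: float, x: float) -> float:
    """Hypothetical variational model that ignores x."""
    return gaussian_log_prob(y, mean=0.0, std=1.0)


dependent_estimate = vclub_estimate(xs, ys, q_dependent)
independent_estimate = vclub_estimate(xs, ys, q_independent)
```

In this toy setting the first estimate is positive, while the second is zero up to floating-point error, because a $q_{\theta}$ that ignores $x$ makes the two expectations in Eq. (6) coincide. Minimizing such an estimate with respect to the encoders is what pushes two representations toward carrying different information.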

4. Methodology

In this section, we describe the methodology of DisCo, which is model-agnostic and compatible with different backbones.

4.1. Overview

As illustrated in Figure 2, we first prepare the tabular and semantic embeddings. In this paper, we encode the user behaviors in tabular and semantic representation space and extract patterns from the resulting tabular and semantic embeddings. To make use of the general open-world knowledge from the semantic space in an efficient manner, we pre-store the semantic embedding of each item in an indexed knowledge base, with the item ID being the indexing key. The semantic embeddings are generated from a frozen LLM, with the textual item descriptions as inputs. The tabular embeddings are obtained via tabular embedding layers with one-hot encoded features as inputs.

The proposed DisCo mainly consists of three components: the dual-side attentive network, the sufficiency constraint, and the disentanglement constraint. To collaborate the tabular and semantic representation spaces, we propose a Dual-Side Attentive Network that encodes user behavior patterns in the collaborative-learning-based (tabular) representation space, in the semantic-dependency-based representation space, and in the interactions between the two spaces. The resulting pattern vectors serve as additional features for arbitrary recommendation models. In addition, we propose two constraints to regularize the model, which preserve the useful and unique information from the two representation spaces: 1) a sufficiency constraint that maximizes the mutual information between the encoded vectors and the labels, which preserves the task-relevant information and eliminates the noise; 2) a disentanglement constraint that minimizes the mutual information between vectors from different representation spaces, which forces each space to provide unique domain-specific knowledge. Together, these components strike a harmonious balance between collaboration and disentanglement.

Figure 2. Overview. (a) To extract the semantic knowledge, a textual description is obtained for each item using a field-value prompt template, which is then fed to an LLM for semantic embedding and stored in an indexed knowledge base. (b) The candidate item and the historical behaviors are encoded in the tabular and semantic representation spaces and then sent to the Dual-Side Attentive Network to obtain intra-domain and inter-domain pattern vectors. The resulting pattern vectors serve as extra features and can be appended to an arbitrary recommendation model. (c) Two constraints are devised to regularize the model and preserve both the aligned part and the disentangled part of the useful information from the two representation spaces. The sufficiency constraint is applied to the behavior vectors and the summarized pattern vectors to preserve the useful information. The disentanglement constraint is applied to the pattern vectors from the two different domains to force the model to capture unique information from both domains.

4.2. Indexed Knowledge Base

To efficiently utilize the semantic knowledge, we first build an indexed knowledge base to extract and pre-store the semantic embedding of each item.

Figure 3. The illustration of the field-value prompt template.

For each item, we obtain a semantic description using the field-value prompt template. As shown in Figure 3, for a movie with title Titanic, genre Romantic, and director James Cameron, we can obtain the item description “Here is a movie, title is Titanic, genre is Romantic, and director is James Cameron.” The item description is then fed into a large language model $\Phi_{S}(\cdot)$ to acquire the semantic embedding, which will be stored in an indexed knowledge base $KB[\cdot]$ for further usage, with the item features being the index key and the semantic embedding being the value.
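The construction described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the paper's implementation: `toy_encoder` is a hypothetical, hash-based stand-in for the frozen LLM encoder $\Phi_{S}(\cdot)$, and the function names, the key layout, and the 8-dimensional output are illustrative only.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Embedding = List[float]
ItemKey = Tuple[Tuple[str, str], ...]


def field_value_prompt(item_type: str, fields: Dict[str, str]) -> str:
    """Render item fields as 'Here is a <type>, f1 is v1, ..., and fn is vn.'"""
    if not fields:
        raise ValueError("an item needs at least one field")
    parts = [f"{name} is {value}" for name, value in fields.items()]
    if len(parts) == 1:
        body = parts[0]
    else:
        body = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return f"Here is a {item_type}, {body}."


def toy_encoder(text: str, dim: int = 8) -> Embedding:
    """Hypothetical stand-in for the frozen LLM: a deterministic hash-based vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def item_key(fields: Dict[str, str]) -> ItemKey:
    """Hashable index key built from the item features."""
    return tuple(sorted(fields.items()))


def build_knowledge_base(
    items: Sequence[Dict[str, str]],
    item_type: str,
    encoder: Callable[[str], Embedding],
) -> Dict[ItemKey, Embedding]:
    """Pre-compute one embedding per item so that no encoder call is needed later."""
    return {item_key(f): encoder(field_value_prompt(item_type, f)) for f in items}


titanic = {"title": "Titanic", "genre": "Romantic", "director": "James Cameron"}
prompt = field_value_prompt("movie", titanic)
kb = build_knowledge_base([titanic], "movie", toy_encoder)
embedding = kb[item_key(titanic)]  # plain dictionary lookup
```

Because the embeddings are computed once and looked up by key afterwards, the cost of the language model is paid offline, which is the efficiency argument made for the indexed knowledge base.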

4.3. Dual-Side Attentive Network

As discussed in Section 1, the relations among the same set of user behaviors can be different in the tabular and semantic domains. In order to capture the distinct and shared patterns among the behaviors in both domains, we design a dual-side attentive network (DS-Attn) module, which consists of intra-domain attention and inter-domain attention. The intra-domain attention models the distinct domain-specific patterns within the tabular domain for collaborative-based correlations and within the semantic domain for semantic-based dependencies, respectively. The inter-domain attention models the shared patterns between the two domains, where a query-key exchange between the two domains is adopted to make the two domains attend to each other.

4.3.1. Behaviors Encoding

For each target sample $\mathbf{X}_{i} = [\mathbf{X}_{i}^{U}, \mathbf{X}_{i}^{I}, \mathbf{X}_{i}^{C}]$, we gather $K$ recent historical behaviors $\{\langle \mathbf{X}_{i_{k}}^{I}, y_{i_{k}} \rangle\}_{k=1}^{K}$ to assist the prediction. The candidate item and the historical items are transformed into representations in the tabular domain and the semantic domain as follows, $\forall j \in \{i, i_{1}, \dots, i_{K}\}$:

$$\mathbf{h}_{j}^{S} = MLP(KB[\mathbf{X}_{j}^{I}]), \tag{7}$$
$$\mathbf{h}_{j}^{T} = \Phi_{T}(\mathbf{X}_{j}^{I}), \tag{8}$$

where $\Phi_{T}$ denotes the tabular domain embedding network and $KB[\cdot]$ denotes the indexed knowledge base constructed in the previous stage. $MLP$ is used to reduce the dimension of the semantic embedding. $\mathbf{h}_{j}^{S}$ and $\mathbf{h}_{j}^{T}$ denote the embeddings of item features in the semantic space and the tabular space, respectively. In order to decrease the entanglement of the embeddings for intra-domain and inter-domain pattern extraction, we decouple the corresponding embeddings into chunks, $\forall j \in \{i, i_{1}, \dots, i_{K}\}$:

$$\mathbf{h}_{j}^{S} = [\mathbf{h}_{j}^{SI}, \mathbf{h}_{j}^{SC}], \tag{9}$$
$$\mathbf{h}_{j}^{T} = [\mathbf{h}_{j}^{TI}, \mathbf{h}_{j}^{TC}], \tag{10}$$

where $\mathbf{h}_{j}^{SI} \in \mathbb{R}^{\frac{d}{2}}$, $\mathbf{h}_{j}^{SC} \in \mathbb{R}^{\frac{d}{2}}$, $\mathbf{h}_{j}^{TI} \in \mathbb{R}^{\frac{d}{2}}$, and $\mathbf{h}_{j}^{TC} \in \mathbb{R}^{\frac{d}{2}}$.
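The dimension reduction of Eq. (7) and the chunking of Eqs. (9) and (10) can be illustrated with a small sketch. It rests on assumptions made only for this example: a single linear map stands in for the $MLP$, the weights and the 6-dimensional input are toy values, and the two halves are simply taken in order, following the ordering written in Eqs. (9) and (10).

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def linear(x: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Project x (length n) with a weight of shape (n, d).

    A hypothetical one-layer stand-in for the MLP in Eq. (7).
    """
    if len(x) != len(weight):
        raise ValueError("input length must match the number of weight rows")
    out_dim = len(weight[0])
    return [sum(x[r] * weight[r][c] for r in range(len(x))) for c in range(out_dim)]


def split_into_chunks(h: Sequence[float]) -> Tuple[Vector, Vector]:
    """Split a d-dimensional embedding into two d/2-dimensional chunks (Eqs. (9)-(10))."""
    if len(h) % 2 != 0:
        raise ValueError("embedding dimension d must be even")
    half = len(h) // 2
    return list(h[:half]), list(h[half:])


# Toy 6-dimensional "semantic embedding" as it might come out of the knowledge base.
semantic_raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Toy (6, 4) weight: input index r contributes to output index r % 4.
weight = [[1.0 if r % 4 == c else 0.0 for c in range(4)] for r in range(6)]

h_s = linear(semantic_raw, weight)          # reduced embedding, d = 4
first_chunk, second_chunk = split_into_chunks(h_s)  # two chunks of size d / 2
```

The same `split_into_chunks` call would apply unchanged to a tabular embedding of dimension $d$, since Eqs. (9) and (10) treat both spaces symmetrically.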

In addition, following (Qin et al., 2021; Du et al., 2022), we also encode the historical labels to model accurate click signals, $\forall k \in \{1, \dots, K\}$:

(11) $\mathbf{l_{i_{k}}} = \Phi_{Y}(y_{i_{k}}),$

where $\Phi_{Y}$ denotes the label embedding network.

4.3.2. Attention Block of DS-Attn

The dual-side attentive network consists of four attention blocks for tabular, semantic, tabular-to-semantic, and semantic-to-tabular user behavior patterns. We first define the function of the basic attention block as follows:

(12) $\mathbf{H}, \mathbf{L^{\prime}} = Attn_{\theta}(\mathbf{Q}, \mathbf{K}, \mathbf{V}, \mathbf{L}),$
(13) $\theta = \{\mathbf{W_{Q}}, \mathbf{W_{K}}, \mathbf{W_{V}}\},$

where $\theta$ denotes the parameters of an attention block. The detailed operations of $\mathbf{H}, \mathbf{L^{\prime}} = Attn_{\theta}(\mathbf{Q}, \mathbf{K}, \mathbf{V}, \mathbf{L})$ are:

(14) $\mathbf{Q^{\prime}} = \mathbf{Q W_{Q}}, \quad \mathbf{K^{\prime}} = \mathbf{K W_{K}}, \quad \mathbf{V^{\prime}} = \mathbf{V W_{V}},$
(15) $\mathbf{A} = softmax\left(\frac{\mathbf{Q^{\prime} K^{\prime T}}}{\sqrt{d}}\right),$
(16) $\mathbf{H} = \mathbf{A V^{\prime}}, \quad \mathbf{L^{\prime}} = \mathbf{A L},$

where $\mathbf{Q} \in R^{1 \times \frac{d}{2}}$, $\mathbf{K} \in R^{K \times \frac{d}{2}}$, $\mathbf{V} \in R^{K \times \frac{d}{2}}$, and $\mathbf{L} \in R^{K \times d}$ denote the inputs for query, key, value, and labels, and $\mathbf{W_{Q}}, \mathbf{W_{K}}, \mathbf{W_{V}} \in R^{\frac{d}{2} \times d}$ represent the trainable weights.
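The attention block of Eqs. (14)-(16) can be sketched in a few lines of NumPy. This is only an illustration under the stated shapes: the function name `attn`, the random inputs, and the sizes `d = 8` and `num_hist = 5` are assumptions made for the example, not taken from the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V, L, W_Q, W_K, W_V):
    """Return the pattern vector H and the attention-aggregated labels L'."""
    d = W_Q.shape[1]
    Q_p, K_p, V_p = Q @ W_Q, K @ W_K, V @ W_V      # Eq. (14): (1, d), (K, d), (K, d)
    A = softmax(Q_p @ K_p.T / np.sqrt(d))          # Eq. (15): (1, K) attention weights
    return A @ V_p, A @ L                          # Eq. (16): both (1, d)

# Toy inputs (hypothetical sizes): one candidate item and num_hist behaviors.
rng = np.random.default_rng(0)
d, num_hist = 8, 5
Q = rng.normal(size=(1, d // 2))
K = rng.normal(size=(num_hist, d // 2))
V = K                                              # keys and values share the same input
L = rng.normal(size=(num_hist, d))                 # label embeddings of the behaviors
W_Q, W_K, W_V = (rng.normal(size=(d // 2, d)) for _ in range(3))

H, L_prime = attn(Q, K, V, L, W_Q, W_K, W_V)
```

Because the same weights $\mathbf{A}$ aggregate both the values and the label embeddings, $\mathbf{L^{\prime}}$ is a convex combination of the rows of $\mathbf{L}$.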

4.3.3. Intra-Domain Attention

To capture the behavior patterns within each domain, we apply intra-domain attention between the candidate item and the behavior sequence, separately in the tabular and semantic domains.

We define the query, key, and value for intra-tabular and intra-semantic behavior patterns as

(17) $\mathbf{Q^{SS}_{i}} = \mathbf{h^{SI}_{i}}, \quad \mathbf{Q^{TT}_{i}} = \mathbf{h^{TI}_{i}},$
(24) $\mathbf{K^{SS}_{i}} = \mathbf{V^{SS}_{i}} = \left(\begin{array}{c}\mathbf{h^{SI}_{i_{1}}} \\ \cdots \\ \mathbf{h^{SI}_{i_{K}}}\end{array}\right), \quad \mathbf{K^{TT}_{i}} = \mathbf{V^{TT}_{i}} = \left(\begin{array}{c}\mathbf{h^{TI}_{i_{1}}} \\ \cdots \\ \mathbf{h^{TI}_{i_{K}}}\end{array}\right),$

where $\mathbf{Q^{SS}_{i}}$, $\mathbf{K^{SS}_{i}}$, and $\mathbf{V^{SS}_{i}}$ denote the inputs of the intra-semantic attention, and $\mathbf{Q^{TT}_{i}}$, $\mathbf{K^{TT}_{i}}$, and $\mathbf{V^{TT}_{i}}$ denote the inputs of the intra-tabular attention.

Then the user behavior patterns in different spaces can be obtained by

(25) $\mathbf{P^{SS}_{i}} = [\mathbf{H^{SS}_{i}}, \mathbf{L^{SS}_{i}}] = Attn_{\theta_{SS}}(\mathbf{Q^{SS}_{i}}, \mathbf{K^{SS}_{i}}, \mathbf{V^{SS}_{i}}, \mathbf{L_{i}}),$
(26) $\mathbf{P^{TT}_{i}} = [\mathbf{H^{TT}_{i}}, \mathbf{L^{TT}_{i}}] = Attn_{\theta_{TT}}(\mathbf{Q^{TT}_{i}}, \mathbf{K^{TT}_{i}}, \mathbf{V^{TT}_{i}}, \mathbf{L_{i}}).$

4.3.4. Inter-Domain Attention

We further capture the inter-domain patterns and model the shared knowledge with an inter-domain attention module.

We define the query, key, and value for the semantic-to-tabular and tabular-to-semantic behavior patterns as

(27) $\mathbf{Q^{ST}_{i}} = \mathbf{h^{SC}_{i}}, \quad \mathbf{Q^{TS}_{i}} = \mathbf{h^{TC}_{i}},$
(34) $\mathbf{K^{ST}_{i}} = \mathbf{V^{ST}_{i}} = \left(\begin{array}{c}\mathbf{h^{TC}_{i_{1}}} \\ \cdots \\ \mathbf{h^{TC}_{i_{K}}}\end{array}\right), \quad \mathbf{K^{TS}_{i}} = \mathbf{V^{TS}_{i}} = \left(\begin{array}{c}\mathbf{h^{SC}_{i_{1}}} \\ \cdots \\ \mathbf{h^{SC}_{i_{K}}}\end{array}\right).$

Then the user behavior patterns in different spaces can be obtained by

(35) $\mathbf{P^{ST}_{i}} = [\mathbf{H^{ST}_{i}}, \mathbf{L^{ST}_{i}}] = Attn_{\theta_{ST}}(\mathbf{Q^{ST}_{i}}, \mathbf{K^{ST}_{i}}, \mathbf{V^{ST}_{i}}, \mathbf{L_{i}}),$
(36) $\mathbf{P^{TS}_{i}} = [\mathbf{H^{TS}_{i}}, \mathbf{L^{TS}_{i}}] = Attn_{\theta_{TS}}(\mathbf{Q^{TS}_{i}}, \mathbf{K^{TS}_{i}}, \mathbf{V^{TS}_{i}}, \mathbf{L_{i}}).$
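To make the routing of the four attention blocks concrete, the sketch below wires up the SS, TT, ST, and TS blocks on random toy data. It assumes that the first half of each encoded vector is the part used for intra-domain attention (superscript I) and the second half is the part used for inter-domain attention (superscript C), following the concatenation order of Eqs. (9)-(10). All variable names and sizes are hypothetical.

```python
import numpy as np

def attn(Q, K, V, L, W_Q, W_K, W_V):
    # Basic attention block: scaled dot-product weights applied to values and labels.
    Qp, Kp, Vp = Q @ W_Q, K @ W_K, V @ W_V
    s = Qp @ Kp.T / np.sqrt(W_Q.shape[1])
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ Vp, A @ L

def split(h):
    # Split an encoded vector into its I half and its C half (assumed order).
    half = h.shape[-1] // 2
    return h[..., :half], h[..., half:]

rng = np.random.default_rng(0)
d, n_hist = 8, 5
h_S_cand, h_T_cand = rng.normal(size=(1, d)), rng.normal(size=(1, d))            # candidate item
h_S_hist, h_T_hist = rng.normal(size=(n_hist, d)), rng.normal(size=(n_hist, d))  # behaviors
labels = rng.normal(size=(n_hist, d))                                            # label embeddings

S_I, S_C = split(h_S_cand)
T_I, T_C = split(h_T_cand)
S_I_hist, S_C_hist = split(h_S_hist)
T_I_hist, T_C_hist = split(h_T_hist)

# (query, key/value) per block: intra-domain blocks stay inside one space,
# inter-domain blocks query one space and attend over the other.
inputs = {
    "SS": (S_I, S_I_hist),
    "TT": (T_I, T_I_hist),
    "ST": (S_C, T_C_hist),
    "TS": (T_C, S_C_hist),
}

# Each block has its own parameters theta = {W_Q, W_K, W_V}.
theta = {name: tuple(rng.normal(size=(d // 2, d)) for _ in range(3)) for name in inputs}

P = {}
for name, (q, kv) in inputs.items():
    H, L_agg = attn(q, kv, kv, labels, *theta[name])
    P[name] = np.concatenate([H, L_agg], axis=-1)   # pattern vector [H, L'] of shape (1, 2d)
```

The four entries of `P` correspond to $\mathbf{P^{SS}_{i}}$, $\mathbf{P^{TT}_{i}}$, $\mathbf{P^{ST}_{i}}$, and $\mathbf{P^{TS}_{i}}$.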

4.4. Sufficiency Constraint

To preserve the task-related information from the two representation spaces and filter out the noise, we maximize the mutual information between the encoded vectors $\mathbf{h}$ from each domain and the labels. The difficulty is that the label space is discrete and consists of only two values. Therefore, we follow DIM (Hjelm et al., 2019) and use the summarized pattern vectors $\mathbf{H_{\cdot}}$ from each space to represent the label space, which extends the discrete states into a high-dimensional continuous space. To maximize the mutual information, we maximize the distance between the joint distribution of the two variables and the product of their marginal distributions.

Concretely, for tabular domain sufficiency preserving, we sample the positive pairs as $\mathcal{I^{T+}} = \{\langle \mathbf{h_{i}^{TI}}, \mathbf{H_{i^{+}}^{TT}} \rangle \mid y_{i} = y_{i+}\}$ and the negative pairs as $\mathcal{I^{T-}} = \{\langle \mathbf{h_{i}^{TI}}, \mathbf{H_{i^{-}}^{TT}} \rangle \mid y_{i} \neq y_{i-}\}$.
For semantic domain sufficiency preserving, we sample the positive pairs as $\mathcal{I^{S+}} = \{\langle \mathbf{h_{i}^{SI}}, \mathbf{H_{i^{+}}^{SS}} \rangle \mid y_{i} = y_{i+}\}$ and the negative pairs as $\mathcal{I^{S-}} = \{\langle \mathbf{h_{i}^{SI}}, \mathbf{H_{i^{-}}^{SS}} \rangle \mid y_{i} \neq y_{i-}\}$. The discriminator networks $\mathcal{D}_{\theta 1}$ and $\mathcal{D}_{\theta 2}$ are used to distinguish the positive pairs from the negative pairs. The optimization objective is:

(37) $$\begin{split} l_{Suf} = &-\sum_{\mathcal{I^{T+}}, \mathcal{I^{T-}}} \left( \log \mathcal{D}_{\theta 1}(\mathbf{h^{TI}_{i}}, \mathbf{H^{TT}_{i+}}) + \left(1 - \log \mathcal{D}_{\theta 1}(\mathbf{h^{TI}_{i}}, \mathbf{H^{TT}_{i-}})\right) \right) \\ &-\sum_{\mathcal{I^{S+}}, \mathcal{I^{S-}}} \left( \log \mathcal{D}_{\theta 2}(\mathbf{h^{SI}_{i}}, \mathbf{H^{SS}_{i+}}) + \left(1 - \log \mathcal{D}_{\theta 2}(\mathbf{h^{SI}_{i}}, \mathbf{H^{SS}_{i-}})\right) \right). \end{split}$$
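A minimal sketch of the tabular half of this objective is given below. It makes several assumptions: the discriminator is a single bilinear form followed by a sigmoid, positive and negative partners are drawn uniformly from the same mini-batch, and the negative term uses the usual binary cross-entropy form $\log(1 - \mathcal{D})$ rather than the $(1 - \log \mathcal{D})$ printed in Eq. (37). The function names are hypothetical. The semantic half would be identical with its own discriminator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bilinear_score(h, H, W):
    # D(h, H) = sigmoid(h^T W H), evaluated row-wise over a batch.
    return sigmoid(np.einsum("bi,ij,bj->b", h, W, H))

def sample_partner(labels, same, rng):
    # For each anchor i, draw another index whose label is equal to y_i
    # (same=True, positive partner) or different from y_i (same=False, negative partner).
    idx = np.arange(len(labels))
    out = np.empty(len(labels), dtype=int)
    for i, y in enumerate(labels):
        mask = (labels == y) if same else (labels != y)
        mask = mask & (idx != i)
        out[i] = rng.choice(idx[mask])
    return out

def sufficiency_loss(h, H, labels, W, rng, eps=1e-8):
    pos = sample_partner(labels, True, rng)
    neg = sample_partner(labels, False, rng)
    d_pos = bilinear_score(h, H[pos], W)   # should be pushed towards 1
    d_neg = bilinear_score(h, H[neg], W)   # should be pushed towards 0
    # Assumed standard discriminator loss: -(log D(pos) + log(1 - D(neg))).
    return -np.sum(np.log(d_pos + eps) + np.log(1.0 - d_neg + eps))

rng = np.random.default_rng(0)
B, d = 6, 8
labels = np.array([1, 0, 1, 0, 1, 0])      # each class needs at least two samples
h_TI = rng.normal(size=(B, d // 2))        # encoded candidate vectors (tabular, I half)
H_TT = rng.normal(size=(B, d))             # summarized intra-tabular pattern vectors
W = 0.1 * rng.normal(size=(d // 2, d))     # bilinear discriminator weights
loss = sufficiency_loss(h_TI, H_TT, labels, W, rng)
```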

4.5. Disentanglement Constraint

To preserve the unique information from the two representation spaces, a two-level disentanglement is applied. Concretely, we minimize the mutual information among the pattern vectors summarized from behavior vectors in the two domains. To achieve this, we minimize the vCLUB (Cheng et al., 2020) upper bound of the mutual information, which is defined as

(38) $I_{vCLUB}(\mathbf{X}; \mathbf{Y}) = E_{p(\mathbf{X}, \mathbf{Y})}\left[\log q_{\theta*}(\mathbf{Y}|\mathbf{X})\right] - E_{p(\mathbf{X})} E_{p(\mathbf{Y})}\left[\log q_{\theta*}(\mathbf{Y}|\mathbf{X})\right],$

where $q_{\theta*}(\mathbf{Y}|\mathbf{X})$ is a variational distribution with parameter $\theta*$ that approximates $p(\mathbf{Y}|\mathbf{X})$.

At each training iteration, the variational approximation network is first optimized to maximize $\log q_{\theta*}(\mathbf{Y}|\mathbf{X})$, and then the main model is optimized.

In this way, we train a vCLUB mutual information estimator for each pair of feature vectors from different domains and minimize the mutual information between each pair. The loss objective of the disentanglement module could then be formulated as

(39) $l_{Dis} = I_{vCLUB1}(\mathbf{H^{TT}}; \mathbf{H^{SS}}) + I_{vCLUB2}(\mathbf{H^{TS}}; \mathbf{H^{ST}}).$
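The sampled form of the vCLUB bound in Eq. (38) can be illustrated as follows. The sketch swaps in a deliberately simple variational family, a unit-variance Gaussian whose mean is a linear map of $\mathbf{X}$ fitted by least squares, which is an assumption made for brevity and not the paper's estimator. The least-squares fit stands in for the step that maximizes $\log q_{\theta*}(\mathbf{Y}|\mathbf{X})$ before the bound is evaluated.

```python
import numpy as np

def log_q(y, mu):
    # log N(y; mu, I) up to an additive constant, which cancels in the bound.
    return -0.5 * np.sum((y - mu) ** 2, axis=-1)

def fit_q(x, y):
    # Step 1 (assumed): fit the variational mean mu(x) = x @ M by least squares,
    # which maximizes the Gaussian log-likelihood log q(y | x).
    return np.linalg.lstsq(x, y, rcond=None)[0]

def vclub(x, y, M):
    # Step 2: sampled estimate of Eq. (38).
    mu = x @ M                                        # (N, d)
    paired = log_q(y, mu)                             # log q(y_i | x_i), joint samples
    shuffled = log_q(y[None, :, :], mu[:, None, :])   # [i, j] = log q(y_j | x_i)
    return paired.mean() - shuffled.mean()

rng = np.random.default_rng(0)
N = 256
x = rng.normal(size=(N, 4))
y_dep = x @ rng.normal(size=(4, 4)) + 0.1 * rng.normal(size=(N, 4))   # depends on x
y_ind = rng.normal(size=(N, 4))                                       # independent of x

mi_dep = vclub(x, y_dep, fit_q(x, y_dep))
mi_ind = vclub(x, y_ind, fit_q(x, y_ind))
```

The estimate is large when the two vectors are strongly dependent and close to zero when they are independent, which is why minimizing it pushes the paired pattern vectors apart.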

4.6. Prediction and Training Objective

The aggregated feature and label embeddings, once disentangled, are appended to an arbitrary recommendation backbone model for prediction:

(40) $\hat{y_{i}} = f(\mathbf{X_{i}}, [\langle \mathbf{X^{I}_{i_{k}}}, y_{i_{k}} \rangle]_{k=1}^{K}, \mathbf{P^{TT}_{i}}, \mathbf{P^{SS}_{i}}, \mathbf{P^{TS}_{i}}, \mathbf{P^{ST}_{i}}),$

where $f(\cdot)$ denotes an arbitrary recommendation model.

The training objective consists of the prediction loss, the sufficiency loss, and the disentanglement loss, which can be formulated as:

(41) $l_{pred} = -\sum_{i}\left(y_{i} \log \hat{y_{i}} + (1 - y_{i}) \log(1 - \hat{y_{i}})\right),$
(42) $\mathcal{L} = l_{pred} + \alpha \cdot l_{Suf} + \beta \cdot l_{Dis},$

where $\alpha$ and $\beta$ are hyperparameters that scale the loss components.

5. Experiments

In this section, we empirically evaluate the proposed model on three datasets. The experiments are guided by the following research questions.

  • RQ1: How does DisCo perform against the baselines?

  • RQ2: Is DisCo compatible with different backbones?

  • RQ3: Does each model component contribute to the performance?

  • RQ4: How efficient is DisCo?

5.1. Setup

5.1.1. Datasets

We use three public datasets to evaluate DisCo. The statistics of the datasets are summarized in Table 1.

Table 1. Dataset statistics.
| Dataset | # Users | # Items | # Samples | # Fields |
|---|---|---|---|---|
| ML-1M | 6,040 | 3,706 | 970,009 | 8 |
| AZ-Toys | 208,175 | 77,687 | 716,197 | 5 |
| ML-25M | 162,541 | 59,047 | 24,187,390 | 4 |

  • ML-1M (https://grouplens.org/datasets/movielens/1m/) is a collection of movie ratings provided by users of the MovieLens website.

  • AZ-Toys (https://cseweb.ucsd.edu/~jmcauley/datasets.html) gathers product reviews and metadata for toys and games available on the Amazon e-commerce platform.

  • ML-25M (https://files.grouplens.org/datasets/movielens/ml-25m.zip) is a popular movie recommendation dataset widely used in machine learning and recommender systems.

Samples with ratings greater than 3 are treated as positive samples, with the others being negative samples. The window size of the historical behaviors is 30. Data is split according to the global timestamps. Specifically, the training data lies in $[0, T_{0})$, the validation data lies in $[T_{0}, T_{1})$, and the test data lies in $[T_{1}, +\infty)$. The ratio of the train/valid/test data amounts is $8:1:1$.
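The labeling rule and the global-timestamp split can be sketched as below. How $T_{0}$ and $T_{1}$ are chosen is not stated, so the sketch assumes they are the timestamps found at the 80% and 90% positions of the time-sorted data. The `(timestamp, payload)` tuple layout and the function names are hypothetical.

```python
def binarize(rating):
    # Ratings greater than 3 are positive samples, all others negative.
    return 1 if rating > 3 else 0

def temporal_split(samples, train_frac=0.8, valid_frac=0.1):
    """Split (timestamp, payload) tuples into train/valid/test by global time."""
    ordered = sorted(samples, key=lambda s: s[0])
    n = len(ordered)
    # Assumed choice of cut points: timestamps at the 80% and 90% positions.
    t0 = ordered[int(n * train_frac)][0]
    t1 = ordered[int(n * (train_frac + valid_frac))][0]
    train = [s for s in ordered if s[0] < t0]          # [0, T0)
    valid = [s for s in ordered if t0 <= s[0] < t1]    # [T0, T1)
    test = [s for s in ordered if s[0] >= t1]          # [T1, +inf)
    return train, valid, test

# Ten toy interactions with distinct timestamps, given in reverse order.
toy = [(t, {"rating": 1 + t % 5}) for t in range(10)][::-1]
train, valid, test = temporal_split(toy)
```

Splitting on global timestamps rather than per user means no training interaction happens after any validation or test interaction.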

5.1.2. Evaluation Metrics

Two widely used metrics including AUC (Area under the ROC curve) and Log Loss (binary cross-entropy loss) are applied to evaluate the performance.
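For reference, both metrics can be computed with a few lines of plain Python. This is a small illustrative sketch, not the evaluation code used in the experiments; the quadratic pairwise AUC is chosen for readability and would be too slow for the full datasets.

```python
import math

def auc(y_true, y_score):
    # Probability that a random positive is scored above a random negative
    # (ties count as one half).
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob, eps=1e-12):
    # Mean binary cross-entropy, with probabilities clipped away from 0 and 1.
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
toy_auc = auc(y_true, y_prob)
toy_ll = log_loss(y_true, y_prob)
```

A higher AUC and a lower Log Loss both indicate better click prediction.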

5.1.3. Competing Models

We compare the proposed DisCo with the following methods: 1) conventional tabular methods including DeepFM (Guo et al., 2017), DCN (Wang et al., 2017), PNN (Qu et al., 2018), xDeepFM (Lian et al., 2018), AutoInt (Song et al., 2019), DIN (Zhou et al., 2018), DIEN (Zhou et al., 2019), and 2) semantic-enhanced methods P5 (Geng et al., 2022), UnisRec (Hou et al., 2022), CTRL (Li et al., 2023), and VQ-Rec (Hou et al., 2023).

5.1.4. Implementation Details

We utilize Vicuna-13b (Chiang et al., 2023), released by FastChat (https://github.com/lm-s), for text encoding. For a fair comparison, we fix the embedding size and the hidden layer sizes to be the same for all backbone models. The embedding size for the tabular domain representation is 32. The hidden layer sizes of the MLP are $[128, 64]$. We use bilinear networks as the discriminator networks in the sufficiency constraint and as the vCLUB mutual information estimators in the disentanglement constraint. The coefficients of the sufficiency constraint loss and the disentanglement constraint loss are $0.02$ and $0.01$, respectively. For each model, the learning rate is searched in $\{1e-4, 3e-4, 5e-4, 1e-3\}$, and the weight decay is searched in $\{1e-5, 3e-5, 5e-5, 1e-4, 3e-4\}$. We use the Adam (Kingma and Ba, 2017) optimizer during training. The early-stopping patience is 10. The code is available at https://github.com/KounianhuaDu/DisCo and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DisCo.

5.2. Overall Performance (RQ1)

In this section, we compare our proposed DisCo with various baseline models. The experiment results are displayed in Table 2.

Table 2. Major results. For all the baselines, we append the user histories and their corresponding ratings/labels for fair comparison. The best result is in bold, while the second-best value is underlined. Rel. Impr. denotes the relative AUC improvement of DisCo against each baseline model. The symbol * indicates statistically significant improvement with p-value $< 0.001$.

| Type | Model | ML-1M AUC | ML-1M Logloss | ML-1M Rel. Impr. | AZ-Toys AUC | AZ-Toys Logloss | AZ-Toys Rel. Impr. | ML-25M AUC | ML-25M Logloss | ML-25M Rel. Impr. |
|---|---|---|---|---|---|---|---|---|---|---|
| Tabular Only (CRS) | DeepFM | 0.7947 | 0.5470 | 1.11% | 0.7423 | 0.3720 | 0.74% | 0.8133 | 0.4892 | 1.36% |
| | DCN | 0.7961 | 0.5417 | 0.93% | 0.7424 | 0.3716 | 0.73% | 0.8134 | 0.4875 | 1.35% |
| | PNN | 0.7932 | 0.5451 | 1.30% | 0.7418 | 0.3705 | 0.81% | 0.8135 | 0.4869 | 1.33% |
| | xDeepFM | 0.7938 | 0.5464 | 1.22% | 0.7392 | 0.3748 | 1.16% | 0.8093 | 0.4922 | 1.84% |
| | AutoInt | 0.7945 | 0.5435 | 1.13% | 0.7429 | 0.3705 | 0.66% | 0.8128 | 0.4883 | 1.42% |
| | DIN | 0.7976 | 0.5401 | 0.74% | 0.7424 | 0.3707 | 0.73% | 0.8174 | 0.4820 | 0.86% |
| | DIEN | 0.7970 | 0.5428 | 0.82% | 0.7446 | 0.3704 | 0.43% | 0.8189 | 0.4841 | 0.68% |
| Semantic Enhanced | P5 | 0.7937 | 0.5478 | 1.23% | 0.7418 | 0.3736 | 0.81% | 0.8091 | 0.4921 | 1.90% |
| | UnisRec | 0.7991 | 0.5410 | 0.55% | 0.7452 | 0.3837 | 0.35% | 0.8162 | 0.5223 | 1.02% |
| | CTRL | 0.7979 | 0.5413 | 0.70% | 0.7432 | 0.3723 | 0.62% | 0.8189 | 0.4922 | 0.68% |
| | VQ-Rec | 0.7972 | 0.5449 | 0.79% | 0.7456 | 0.3826 | 0.30% | 0.8185 | 0.5210 | 0.73% |
| Best CRS + DisCo | | 0.8035* | 0.5343* | - | 0.7478* | 0.3704* | - | 0.8245* | 0.4743* | - |
Table 3. Compatibility experiments. The proposed method offers additional feature fields, which can be followed by different feature interaction operations. We test its compatibility with different CTR backbones. (Orig.: original backbone; LL: Logloss.)

| Backbone | ML-1M Orig. AUC | ML-1M Orig. LL | ML-1M +DisCo AUC | ML-1M +DisCo LL | ML-1M Rel. Impr. | AZ-Toys Orig. AUC | AZ-Toys Orig. LL | AZ-Toys +DisCo AUC | AZ-Toys +DisCo LL | AZ-Toys Rel. Impr. | ML-25M Orig. AUC | ML-25M Orig. LL | ML-25M +DisCo AUC | ML-25M +DisCo LL | ML-25M Rel. Impr. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepFM | 0.7947 | 0.5470 | 0.8029 | 0.5340 | 1.03% | 0.7423 | 0.3720 | 0.7462 | 0.3708 | 0.53% | 0.8133 | 0.4892 | 0.8217 | 0.4771 | 1.03% |
| DCN | 0.7961 | 0.5417 | 0.8035 | 0.5343 | 0.93% | 0.7424 | 0.3716 | 0.7470 | 0.3693 | 0.62% | 0.8134 | 0.4875 | 0.8231 | 0.4875 | 1.19% |
| PNN | 0.7932 | 0.5451 | 0.8017 | 0.5401 | 1.07% | 0.7418 | 0.3705 | 0.7466 | 0.3681 | 0.65% | 0.8135 | 0.4869 | 0.8245 | 0.4743 | 1.35% |
| xDeepFM | 0.7938 | 0.5464 | 0.7999 | 0.5384 | 0.77% | 0.7392 | 0.3748 | 0.7434 | 0.3718 | 0.57% | 0.8093 | 0.4922 | 0.8240 | 0.4792 | 1.82% |
| AutoInt | 0.7945 | 0.5435 | 0.8029 | 0.5343 | 1.06% | 0.7429 | 0.3705 | 0.7472 | 0.3710 | 0.58% | 0.8128 | 0.4883 | 0.8196 | 0.4863 | 0.83% |
| DIN | 0.7976 | 0.5401 | 0.8016 | 0.5373 | 0.51% | 0.7424 | 0.3707 | 0.7478 | 0.3704 | 0.73% | 0.8174 | 0.4820 | 0.8212 | 0.4829 | 0.46% |
| DIEN | 0.7970 | 0.5428 | 0.8025 | 0.5353 | 0.69% | 0.7446 | 0.3704 | 0.7468 | 0.3693 | 0.30% | 0.8189 | 0.4841 | 0.8219 | 0.4792 | 0.37% |

From the results, one can draw the following conclusions. 1) DisCo consistently outperforms all the baseline models, including both the tabular-only methods and the semantic-enhanced methods. The improvements are statistically significant with p-value $< 0.001$. This shows the effectiveness of the proposed paradigm, which disentangles and collaborates tabular and semantic domain knowledge for enhanced recommendation. 2) Methods that incorporate semantic knowledge can surpass the conventional tabular-only methods, which shows the effectiveness of introducing external semantic knowledge into recommendation. 3) DisCo outperforms methods that focus on aligning the two representation spaces. For example, CTRL uses contrastive learning to align the two representation spaces. Such methods tend to pull the representations of the two spaces closer together, so unique information is discarded during training. This validates the effectiveness of DisCo, which preserves the unique information of the two representation spaces.
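As a worked check of the Rel. Impr. column, the relative AUC improvement can be recomputed from the reported AUCs. The formula below is an assumption (the paper only describes the column verbally), but it reproduces the tabulated ML-1M values; the function name is hypothetical.

```python
def relative_improvement(auc_disco: float, auc_baseline: float) -> float:
    """Relative AUC improvement of DisCo over a baseline, in percent."""
    return (auc_disco - auc_baseline) / auc_baseline * 100.0


# ML-1M AUCs taken from Table 2.
BEST_DISCO_ML1M = 0.8035
print(round(relative_improvement(BEST_DISCO_ML1M, 0.7947), 2))  # DeepFM row
print(round(relative_improvement(BEST_DISCO_ML1M, 0.7976), 2))  # DIN row
```

For DeepFM this gives $(0.8035 - 0.7947) / 0.7947 \approx 1.11\%$, and for DIN $(0.8035 - 0.7976) / 0.7976 \approx 0.74\%$, matching the table.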

5.3. Compatibility Study (RQ2)

Since the extracted patterns obtained from the dual-side attentive network can serve as extra features, they could be appended to arbitrary conventional recommendation models. In this section, we evaluate the compatibility of the proposed framework on different conventional backbones.

The feature interaction methods of the backbones include product-based, MLP-based, and attention-based operators. We test DisCo on these different operators and justify the effectiveness of the resulting feature fields. The results are displayed in Table 3. From the results, we can see that the proposed method could offer performance gains for various backbone models and operations. The improvements are statistically significant under p-value $< 0.001$, which validates the superior compatibility of DisCo.

5.4. Ablation Studies (RQ3)

5.4.1. Impact of the Dual-Side Attentive Network

In this section, we validate the effectiveness of the proposed dual-side attentive network module. Concretely, we remove the inter-domain attention, within which the two representation spaces attend to each other, and retain only the intra-domain attention. This results in the common two-tower structure in recommender systems, where the semantic and tabular representations are modeled separately for semantic dependencies and collaborative signals, respectively. In contrast, our dual-side attentive network module models both the intra-domain and the inter-domain user behavior patterns. The results of the two-tower attention and the dual-side attentive network module are displayed in Table 4.

Table 4. Experiment on the Aggregation Module.

| Dataset | Two-Tower Attention AUC | Two-Tower Attention Logloss | Dual-Side Attention AUC | Dual-Side Attention Logloss |
|---|---|---|---|---|
| ML-1M | 0.8015 | 0.5360 | 0.8035 | 0.5343 |
| AZ-Toys | 0.7464 | 0.3689 | 0.7478 | 0.3704 |
| ML-25M | 0.8208 | 0.4837 | 0.8245 | 0.4743 |

From the results, we can see that the proposed dual-side attentive network achieves a higher AUC than the two-tower aggregation on all three datasets (and a lower Logloss on ML-1M and ML-25M), which validates the effectiveness of the proposed module that captures both the intra-domain and the inter-domain knowledge.
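The difference between the two variants in this ablation can be illustrated with a toy attention sketch. This is not the module from the paper: it assumes plain scaled dot-product attention with no learned projections, assumes the semantic embeddings have already been reduced to the tabular dimension, and uses hypothetical names, with `H_TS` taken to mean tabular queries attending over semantic keys.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]


def attend(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Scaled dot-product attention where keys and values are the same matrix."""
    d = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)])
    return out


def dual_side_patterns(tab: Matrix, sem: Matrix, use_inter: bool = True) -> Dict[str, Matrix]:
    """Intra-domain patterns always; inter-domain patterns only if use_inter."""
    patterns = {"H_TT": attend(tab, tab), "H_SS": attend(sem, sem)}
    if use_inter:
        patterns["H_TS"] = attend(tab, sem)  # tabular queries over semantic keys
        patterns["H_ST"] = attend(sem, tab)  # semantic queries over tabular keys
    return patterns
```

Calling `dual_side_patterns(tab, sem, use_inter=False)` corresponds to the two-tower ablation, in which the two spaces never attend to each other.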

5.4.2. Impact of the Constraints

In this section, we study the impact of the proposed constraints.

Table 5. Impacts of the two constraints.

| Variant | ML-1M AUC | ML-1M Logloss | AZ-Toys AUC | AZ-Toys Logloss | ML-25M AUC | ML-25M Logloss |
|---|---|---|---|---|---|---|
| w/ Both (DisCo) | 0.8035 | 0.5343 | 0.7478 | 0.3704 | 0.8245 | 0.4743 |
| w/o Sufficiency | 0.8019 | 0.5354 | 0.7474 | 0.3686 | 0.8220 | 0.4780 |
| w/o Disentanglement | 0.8011 | 0.5377 | 0.7476 | 0.3687 | 0.8212 | 0.4833 |
| w/o Both | 0.7988 | 0.5391 | 0.7460 | 0.3701 | 0.8163 | 0.4928 |

Firstly, we conduct experiments with and without the constraints, the results of which are displayed in Table 5. From the results, we can draw the following conclusions. 1) Both the sufficiency and the disentanglement constraints offer performance gains to the model: sufficiency helps to preserve the task-relevant information of each space, and disentanglement helps to retain the unique information of each space. 2) The two constraints complement each other and boost performance together, which helps to capture both the consistency and the specificity of the two spaces.
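To make the disentanglement constraint more concrete, the sampled CLUB upper bound of mutual information (Cheng et al., 2020), on which the vCLUB estimator is based, can be sketched as below. This is a simplified illustration rather than the training code: variables are scalar, the variational distribution $q(y|x)$ is a Gaussian with a fixed mean function instead of a learned network, and all names are hypothetical. Which pattern vectors are paired in DisCo follows the method section and is not restated here.

```python
import math
from typing import Callable, Sequence


def gaussian_log_density(y: float, mean: float, var: float) -> float:
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mean) ** 2 / (2.0 * var)


def club_upper_bound(
    xs: Sequence[float],
    ys: Sequence[float],
    mean_fn: Callable[[float], float],
    var: float = 1.0,
) -> float:
    """Sampled CLUB estimate:
    (1/N) * sum_i [ log q(y_i | x_i) - (1/N) * sum_j log q(y_j | x_i) ].
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        mu = mean_fn(xs[i])  # stand-in for the learned variational network
        positive = gaussian_log_density(ys[i], mu, var)
        negative = sum(gaussian_log_density(ys[j], mu, var) for j in range(n)) / n
        total += positive - negative
    return total / n
```

When `ys` carries no information about `xs` (for example, a constant), the estimate is zero; when the pairs are strongly dependent, it is positive. Minimizing such a bound between two representations pushes them to share less information.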

Figure 4. T-SNE visualization of the representations for the dual-side attentive network output (ML-1M).

In addition, we visualize the representations of the dual-side attentive module output with and without the disentanglement constraint to examine how the disentanglement impacts the distribution of representations. The visualization of ML-1M is displayed in Figure 4. Visualizations of more datasets can be found in the Appendix.

Concretely, we visualize the representations of the intra-domain pattern vectors $\mathbf{H^{TT}}, \mathbf{H^{SS}}$ and the inter-domain pattern vectors $\mathbf{H^{TS}}, \mathbf{H^{ST}}$ with t-SNE (Van der Maaten and Hinton, 2008). The left column of Figure 4 displays the representations without the disentanglement constraint, while the right column displays those with the disentanglement constraint. From the figure, we can see that under the regularization of the disentanglement constraint, the distributions of representations from different domains are better separated and form clearer manifolds in the representation spaces. This illustrates that the disentanglement constraint can separate and extract different information from the two spaces, which validates the effectiveness of our design.

5.5. Efficiency Analysis (RQ4)

In this section, we discuss the efficiency of the proposed model. Firstly, the semantic embeddings for items could be pre-computed and stored in the indexed knowledge base, the construction of which can be done offline and only once. In addition, after the MLP used to reduce the dimension of the semantic embedding is trained, we could use the trained MLP to reduce the dimension and further keep a reduced-dimension version of the indexed knowledge base. Therefore, in the inference stage we do not need to deal with the high-dimensional semantic vectors.
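The offline/online separation described above can be sketched as follows. This is an illustrative stand-in, not the released implementation: the trained dimension-reduction MLP is replaced by a single bias-free linear layer, the knowledge base is a plain dictionary keyed by item id, and all names are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def linear_reducer(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """Stand-in for the trained reduction MLP: one linear layer, no bias."""
    def reduce(vec: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, vec)) for row in weights]
    return reduce


def build_reduced_index(
    full_index: Dict[int, Vector], reduce: Callable[[Vector], Vector]
) -> Dict[int, Vector]:
    """Offline step: project every stored semantic embedding exactly once."""
    return {item_id: reduce(vec) for item_id, vec in full_index.items()}


def lookup_history(index: Dict[int, Vector], item_ids: List[int]) -> List[Vector]:
    """Online step: one dictionary lookup per item, no LLM call or projection."""
    return [index[i] for i in item_ids]
```

At inference time only `lookup_history` is on the critical path, so the cost no longer depends on the original dimension of the semantic embeddings.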

Table 6. Training and inference time per sample (s).

| Model | ML-1M Training | ML-1M Inference | AZ-Toys Training | AZ-Toys Inference |
|---|---|---|---|---|
| DCN | $2.18 \times 10^{-3}$ | $2.86 \times 10^{-5}$ | $2.06 \times 10^{-3}$ | $2.68 \times 10^{-5}$ |
| DIN | $1.99 \times 10^{-3}$ | $1.34 \times 10^{-5}$ | $2.05 \times 10^{-3}$ | $2.15 \times 10^{-5}$ |
| DisCo | $5.04 \times 10^{-3}$ | $7.42 \times 10^{-5}$ | $4.29 \times 10^{-3}$ | $6.89 \times 10^{-5}$ |

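The training and inference time analysis is displayed in Table 6. All the experiments are run on a single V100 GPU with an Intel Xeon Gold 6278C 2.60GHz CPU, and each is run three times to obtain the average time. From the results, we can see that the proposed method does not cause a heavy overhead: compared with DCN and DIN, DisCo takes roughly 2.1 to 2.5 times the training time and roughly 2.6 to 5.5 times the inference time per sample, with inference remaining below $10^{-4}$ seconds per sample.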

6. Conclusion

Recommender systems play a vital role in our daily life. Conventional recommendation methods focus on modeling feature interactions and user behaviors within the ID-based tabular representation space and fail to capture semantic dependencies among user behaviors. Existing semantic-enhanced methods focus on aligning the tabular and semantic spaces, while the unique and disentangled parts of the two representation spaces are not well explored. In this paper, we propose DisCo to disentangle and collaborate the tabular and semantic representation spaces, capturing both the consistent and the specific knowledge of the two spaces for enhanced recommendation. Concretely, we design three modules, namely the dual-side attentive network, the sufficiency constraint, and the disentanglement constraint. To efficiently utilize the semantic knowledge, a textual description for each item is first obtained and encoded by LLMs, and the resulting embedding is stored in an indexed knowledge base. The dual-side attention module models intra-domain and inter-domain patterns to offer additional knowledge for arbitrary recommendation backbones, and is regularized by the designed sufficiency and disentanglement constraints. The two constraints force the model to preserve useful information and extract unique information from the two spaces. Extensive experiments and ablation studies on three datasets and various backbone models justify the effectiveness of the proposed method.

References

  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, and Aaron C. Courville. 2018. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80. PMLR, 530–539.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS@RecSys.
  • Cheng et al. (2020) Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. 2020. CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1779–1788.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
  • Christensen and Schiaffino (2011) Ingrid A. Christensen and Silvia Schiaffino. 2011. Entertainment recommender systems for group of users. Expert Systems with Applications 38, 11 (2011), 14127–14135. https://doi.org/10.1016/j.eswa.2011.04.221
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Ding et al. (2021) Hao Ding, Yifei Ma, Anoop Deoras, Yuyang Wang, and Hao Wang. 2021. Zero-Shot Recommender Systems. arXiv:2105.08318 [cs.LG]
  • Du et al. (2022) Kounianhua Du, Weinan Zhang, Ruiwen Zhou, Yangkun Wang, Xilong Zhao, Jiarui Jin, Quan Gan, Zheng Zhang, and David P Wipf. 2022. Learning Enhanced Representation for Tabular Data via Neighborhood Propagation. Advances in Neural Information Processing Systems 35 (2022), 16373–16384.
  • Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. 2301–2307.
  • Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
  • Geng et al. (2023) Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. 2023. VIP5: Towards Multimodal Foundation Models for Recommendation. arXiv preprint arXiv:2305.14302 (2023).
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI.
  • Hjelm et al. (2019) R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  • Hou et al. (2023) Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023. Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. arXiv:2210.12316 [cs.IR]
  • Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards Universal Sequence Representation Learning for Recommender Systems. arXiv:2206.05941 [cs.IR]
  • Hua et al. (2023a) Wenyue Hua, Yingqiang Ge, Shuyuan Xu, Jianchao Ji, and Yongfeng Zhang. 2023a. UP5: Unbiased Foundation Model for Fairness-aware Recommendation. arXiv preprint arXiv:2305.12090 (2023).
  • Hua et al. (2023b) Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023b. How to Index Item IDs for Recommendation Foundation Models. arXiv preprint arXiv:2305.06569 (2023).
  • Huang et al. (2022) Yanhua Huang, Hangyu Wang, Yiyun Miao, Ruiwen Xu, Lei Zhang, and Weinan Zhang. 2022. Neural Statistics for Click-Through Rate Prediction. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1849–1853.
  • Juan et al. (2016a) Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016a. Field-aware Factorization Machines for CTR Prediction. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016. ACM, 43–50.
  • Juan et al. (2016b) Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016b. Field-aware factorization machines for CTR prediction. In RecSys.
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]
  • Li et al. (2023) Xiangyang Li, Bo Chen, Lu Hou, and Ruiming Tang. 2023. CTRL: Connect Tabular and Language Model for CTR Prediction. arXiv preprint arXiv:2306.02841 (2023).
  • Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1754–1763.
  • Lin et al. (2024a) Jianghao Lin, Bo Chen, Hangyu Wang, Yunjia Xi, Yanru Qu, Xinyi Dai, Kangning Zhang, Ruiming Tang, Yong Yu, and Weinan Zhang. 2024a. ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction. In Proceedings of the ACM on Web Conference 2024 (WWW ’24). 3319–3330.
  • Lin et al. (2023) Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. 2023. How Can Recommender Systems Benefit from Large Language Models: A Survey. arXiv preprint arXiv:2306.05817 (2023).
  • Lin et al. (2024b) Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, and Weinan Zhang. 2024b. ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation. In Proceedings of the ACM on Web Conference 2024 (WWW ’24). 3497–3508.
  • Liu et al. (2020) Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 2636–2645.
  • Liu et al. (2022) Guang Liu, Jie Yang, and Ledell Wu. 2022. PTab: Using the Pre-trained Language Model for Modeling Tabular Data. arXiv preprint arXiv:2209.08060 (2022).
  • Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019. ACM, 2671–2679.
  • Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 2685–2692.
  • Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
  • Qin et al. (2020) Jiarui Qin, Weinan Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Yong Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. ACM, 2347–2356.
  • Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Transactions on Information Systems (2018).
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In ICDM.
  • Smith and Linden (2017) Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing (2017). https://www.amazon.science/publications/two-decades-of-recommender-systems-at-amazon-com
  • Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019. ACM, 1161–1170.
  • Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, August 13 - 17, 2017. ACM, 12:1–12:7.
  • Wu et al. (2023) Chuhan Wu, Fangzhao Wu, Yongfeng Huang, and Xing Xie. 2023. Personalized news recommendation: Methods and challenges. ACM Transactions on Information Systems 41, 1 (2023), 1–50.
  • Xi et al. (2023) Yunjia Xi, Weiwen Liu, Jianghao Lin, Jieming Zhu, Bo Chen, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023. Towards Open-World Recommendation with Knowledge Augmentation from Large Language Models. arXiv:2306.10933 [cs.IR]
  • Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017. 3119–3125.
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. 5941–5948.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chengru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018. 1059–1068.

Appendix A Visualizations

Visualizations of the representations for the dual-side attentive network on AZ-Toys and ML-25M are displayed in Figure 5 and Figure 6.

Figure 5. T-SNE visualization of the representations for the dual-side attentive network output (AZ-Toys).
Figure 6. T-SNE visualization of the representations for the dual-side attentive network output (ML-25M).
Table 7. Items used in the case study.

| Item Title | Galaxy Quest | Babe: Pig in the City |
|---|---|---|
| Genre provided in the movielens dataset | Adventure | Children’s movie |
| Tags provided by Douban (Open world knowledge, unknown by the tabular model) | Comedy, Action, Fantasy and adventure. | Comedy, Fantasy and Adventure. |

Table 8. Similarities of items in different spaces.

| Cosine Similarity | In Tabular Space | In Semantic Space (BERT) | In Semantic Space (Vicuna-13b) |
|---|---|---|---|
| Galaxy Quest – Babe: Pig in the City | -0.0210 | 0.0599 | 0.9606 |

Appendix B Experiments on Long-Tail Data

In this section, we validate the effectiveness of DisCo on long-tail data, where features are rarely hit in the tabular representation space. Concretely, we sort items by their frequency of occurrence in the training set and classify the bottom 10% as long-tail items. We then compare the best-performing tabular-only model, DIN, with and without the proposed DisCo.
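The long-tail selection described above can be sketched as follows. The tie-breaking rule, the rounding of the 10% cut, and the function name are assumptions made for illustration; items that never appear in the training set are not considered.

```python
from collections import Counter
from typing import Iterable, Set


def long_tail_items(train_item_ids: Iterable[int], tail_percent: int = 10) -> Set[int]:
    """Return the least frequent `tail_percent` percent of training items."""
    counts = Counter(train_item_ids)
    # Sort ascending by frequency; break ties by item id for determinism.
    ranked = sorted(counts, key=lambda item: (counts[item], item))
    n_tail = len(ranked) * tail_percent // 100
    return set(ranked[:n_tail])
```

Evaluation on the tail would then be restricted to samples whose target item falls in this set, which is again an assumption about the protocol rather than a detail stated in the text.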

Table 9. Experiment on the tail data.

| Dataset | DIN AUC | DIN Logloss | DIN + DisCo AUC | DIN + DisCo Logloss |
|---|---|---|---|---|
| ML-1M | 0.6710 | 0.6564 | 0.6934 | 0.6308 |
| AZ-Toys | 0.6673 | 0.3982 | 0.7416 | 0.3778 |
| ML-25M | 0.7963 | 0.5430 | 0.8032 | 0.5380 |

The performance comparisons are displayed in Table 9. From the results, we can see that the proposed method gives a clear performance boost on the tail data, with AUC rising from 0.6710 to 0.6934 on ML-1M, from 0.6673 to 0.7416 on AZ-Toys, and from 0.7963 to 0.8032 on ML-25M. This is because the general knowledge contained in the pretrained semantic embeddings helps to complement the less-trained features in the tabular representation space, with the semantic information injected through the dual-side attentive network under the regularization of the constraints.

Appendix C Case Study

In this section, we provide an example to support the claim that encodings from LLMs can capture open-world knowledge.

Consider two items in the MovieLens dataset: "Galaxy Quest" and "Babe: Pig in the City". Their genres provided in the dataset and their tags from the open world are listed in Table 7.

For the two items:

  • The two items do not share any common tokens.

  • The two items do not share any common features given in the dataset.

  • But they are actually close, as they share many common features in the open world (e.g., the tags given by Douban).

We then compute the cosine similarities between the two items in the tabular space and in the semantic space, as shown in Table 8. (Note that no generation is involved. We use the same genres provided by the MovieLens dataset to obtain both the tabular encodings and the semantic encodings.)
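For completeness, the similarity measure used in this case study is the standard cosine similarity between two embedding vectors. The sketch below is a plain-Python version with a hypothetical function name; it makes no assumption about how the item embeddings themselves are obtained.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)
```

Values near 0 indicate unrelated directions, while values near 1 indicate that the two item representations point in almost the same direction.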

From the results, we can see that:

  • Tabular representations cannot find the relevance between the two items, since no common features exist.

  • Semantic representations from small language models fail to capture the relevance between the two items, since no common tokens exist.

  • Semantic representations from LLMs capture the relevance between the two items well, even though the two items share nearly no common tokens in the dataset. Many anchors for the two items exist in the large open-world training corpus (e.g., the common tags given by Douban, the similar descriptions of movies, etc.), and because these are seen together during training, the representations of the two items tend to get close.

The above case study can justify that the encodings from LLMs can help to capture open-world knowledge.