Extracting Essential and Disentangled Knowledge
for Recommendation Enhancement

Kounianhua Du (Shanghai Jiao Tong University, Shanghai, China), Jizheng Chen (Shanghai Jiao Tong University, Shanghai, China), Jianghao Lin (Shanghai Jiao Tong University, Shanghai, China), Menghui Zhu (Huawei Noah’s Ark Lab, Shanghai, China), Bo Chen (Huawei Noah’s Ark Lab, Shanghai, China), Ruiming Tang (Huawei Noah’s Ark Lab, Shenzhen, China), and Weinan Zhang (Shanghai Jiao Tong University, Shanghai, China)
Abstract.

Recommender models play a vital role in various industrial scenarios but often suffer from catastrophic forgetting caused by fast-shifting data distributions, e.g., evolving user interests and fluctuating click signals during sales promotions. A common remedy is to reuse knowledge from historical data. However, preserving the vast and fast-accumulating raw data incurs dramatic storage overhead. Parametric knowledge bases have therefore been proposed to memorize old data by compressing the raw data into model parameters. Despite this flexibility, improving the memorization and generalization capabilities of a parametric knowledge base remains challenging. In this paper, we propose two constraints for extracting Essential and Disentangled Knowledge from past data for rational and generalized recommendation enhancement, which improve the capabilities of the parametric knowledge base without increasing its size. The essential principle helps compress the input into representative vectors that capture task-relevant information and filter out noise. The disentanglement principle reduces the redundancy of stored information and pushes the knowledge base to capture disentangled, invariant patterns. Together, these two rules promote rational compression of information into robust and generalized knowledge representations. For the parametric knowledge base, we train a knowledge extractor that extracts knowledge patterns of arbitrary order from past data and a knowledge encoder that memorizes these patterns; the two serve as the retrieval key generator and the memory network, respectively, in the subsequent knowledge reuse phase. The whole process is regularized by the two proposed constraints. Extensive experiments on two datasets justify the effectiveness of the proposed method.

Recommender Systems, Information Compression
CCS Concepts: Information systems → Recommender systems

1. Introduction

Recommender systems play an important role in alleviating information overload and helping users discover relevant content in today’s vast digital landscape. They are widely used in various industries, including e-commerce (Smith and Linden, 2017), entertainment (Christensen and Schiaffino, 2011), social media (Wu et al., 2023), and online streaming platforms (Zhou et al., 2018, 2019).

However, the data distribution in recommender systems shifts quickly, e.g., through evolving user interests and fluctuating click signals during sales promotions. This fast-shifting distribution often leads to catastrophic forgetting, where models cannot retain old knowledge well. One line of methods reuses knowledge from old data in a non-parametric way (Pi et al., 2020). However, preserving the vast and fast-accumulating raw data is costly and incurs dramatic storage overhead. Compressing the knowledge in raw data into parameters for later reuse has therefore been proposed (Qin et al., 2024; Wang et al., 2023).

Figure 1. (a) The illustration of the overall process. The vast knowledge of old data is compressed into the parametric knowledge base. (b) The essential principle for compression, which encourages a compressed representation that captures the fundamental knowledge of data and filters out the noises. (c) The disentangled principle for compression, which reduces the redundancy of stored patterns and decomposes the invariance for better generalization.

Despite this flexibility, improving the memorization and generalization capabilities of the parameters is challenging because of the noisy information and the diverse patterns in the vast amount of old data. For example, a user named $Alice$ might follow a trend and buy a product $Meal\_Replacement$ during a sales promotion, then give positive feedback because of a cash-back offer, even though the product is not an actual interest of hers. $Alice$-$Meal\_Replacement$ then forms a spurious pattern that could bias the model away from memorizing the invariant patterns. In addition, the patterns in the data space are diverse and complex, and are hard to memorize fully. Some patterns may entangle with each other, leading to redundant storage and poorer generalization.
For example, $Female$-$Actress$-$Lipstick$, $Female$-$Lipstick$, and $Actress$-$Lipstick$ entangle with each other: the $Actress$ information contains the $Female$ information, so $Female$-$Actress$-$Lipstick$ leads to redundant memorization. Decomposing $Female$-$Actress$-$Lipstick$ into $Female$-$Lipstick$ and $Actress$-$Lipstick$ helps reduce redundancy while improving the expressiveness of the features, since the minimal information that has better invariance is kept.

To address the challenges above, we propose the Essential and Disentangled Knowledge base (EDK), where two constraints regularize a parametric knowledge base for robust and generalized knowledge preservation without increasing the size of the knowledge base. Concretely, the Essential principle aims to capture the task-relevant information (the red part in Figure 1(b)) and filter out the noisy information (the grey part in Figure 1(b)) in the data. It follows the Information Bottleneck principle (Tishby et al., 2000a), which seeks the optimal trade-off between compression and prediction. By striking a balance between these two aspects, the principle enables the model to preserve sufficient information about the data while keeping the representation minimal and invariant (Achille and Soatto, 2018), which contributes to lower storage cost and higher generalization capability. The Disentanglement principle eliminates redundant information among different patterns (the grey part in Figure 1(c)) and decomposes the invariance of different patterns. This reduction of redundancy helps improve memorization within limited model parameters. At the same time, as discussed in (Montero et al., 2020), improving disentanglement may contribute to generalization. Together, the two rules yield compressed and rational knowledge vectors that capture the fundamental knowledge of past data and generalize well to future distributions.

Concretely, our method can be split into two stages: knowledge compression and knowledge utilization. During the first stage, we compress the massive old data documents into the learned parameters. Specifically, we first extract informative patterns from the old data using a knowledge extractor. The extracted patterns are then encoded and memorized as knowledge vectors by a permutation-equivariant encoder. The whole process is regularized by the essential and disentangled principles. After the compression, the knowledge extractor and the knowledge encoder together make up the parametric knowledge base and serve as the retrieval key generator and the memory network, respectively, in the following stage. During the knowledge utilization stage, we extract knowledge from the constructed knowledge base for each data instance and use it as a supplement for recommendation enhancement. Extensive experimental results validate the effectiveness of the proposed method, as well as its superior model compatibility.

Our contributions are threefold.

  • We propose a parametric knowledge base that compresses the massive old data documents into robust and generalized parametric knowledge. It is a model-agnostic method that can offer diverse knowledge to various base models.

  • We propose two rules to effectively regularize the parametric knowledge base. They promote robustness and generalization by extracting essential and disentangled knowledge from the old data without increasing the size of the knowledge base.

  • The knowledge extraction and utilization of EDK are instance-level, which offers fine-grained information tailored to each instance for enhancement.

Extensive experiments over different baseline models, together with ablation studies, validate the effectiveness of the proposed method.

2. Related Work

2.1. Recommender Systems

This work is closely related to the research topics of recommender systems, which focus on capturing collaborative signals based on feature interaction learning and user behavior modeling.

For feature interaction learning, many works mine the interactive patterns of categorical data. FM (Rendle, 2010) models features as low-dimensional embeddings and captures feature interactions by inner products. FFM (Juan et al., 2016) further develops field-aware interactions through a multiple-embedding setting. NeuFM (He et al., 2017) proposes to use deep neural networks, improving FM by incorporating a multi-layer perceptron. Wide & Deep (Cheng et al., 2016) combines the strengths of linear models and DNNs for memorization and generalization, respectively. DeepFM (Guo et al., 2017) uses an FM layer to replace the wide component in (Cheng et al., 2016) to model pairwise feature interactions. xDeepFM (Lian et al., 2018) introduces the cross layer to replace the wide component and constructs limited high-order feature interactions. PNN (Qu et al., 2018) proposes a product layer to better model the “AND” relations between features. DCN (Wang et al., 2017) learns both low-dimensional feature crossing and high-dimensional nonlinear features efficiently. AutoInt (Song et al., 2019) learns high-order feature interactions via the self-attention mechanism. AutoFIS (Liu et al., 2020) applies a gate to each feature interaction and selects beneficial limited-order interactions by learning the gate values.

User behavior modeling plays a crucial role in recommender systems by providing valuable insights into user preferences, interests, and behaviors. By understanding how users interact with a system and make choices, recommendation algorithms can effectively personalize recommendations for individual users. DIN (Zhou et al., 2018) models user interests with an attention mechanism, which reduces the influence of historical behaviors on items unrelated to the candidate advertisement when estimating the current click. DIEN (Zhou et al., 2019) further points out that user interests evolve over time and models the evolving interests of users with a GRU module. DSIN (Feng et al., 2019) splits user behaviors into sessions to better model the evolving property of user interests. MIMN (Pi et al., 2019) further categorizes user interests into different channels and models different aspects of user interests. SIM (Pi et al., 2020) proposes to model long-term user behaviors and aggregates them using the fast SimHash algorithm.

2.2. Model Invariance & Generalization

Due to the limited capacity of model parameters, improving the quality of the memorized knowledge is important.

As stated in (Achille and Soatto, 2018), an ideal representation should be sufficient, minimal, invariant, and maximally disentangled. The same work also proves that a sufficient representation of the data is invariant if and only if it contains the smallest amount of information. In other words, achieving the optimal balance between sufficiency and minimality also promotes invariance, which contributes to generalization (Deng et al., 2022). This aligns with the information bottleneck theory (Tishby et al., 2000b), which states that a model should extract the information of an input sample $X$ that is most relevant to $Y$:

(1) $\min \left( I(X;S) - \beta I(S;Y) \right),$

where $S$ denotes the representation of the encoded input.
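As a small worked illustration of the trade-off in Eq. (1), the following sketch evaluates the objective exactly on a toy discrete problem. The data, the two candidate encoders, and all function names (`mutual_information`, `joint_table`, `ib_objective`) are hypothetical and chosen only for illustration; they are not part of the proposed method.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats, computed from a joint probability table joint[a][b]."""
    row = [sum(r) for r in joint]
    col = [sum(joint[a][b] for a in range(len(joint))) for b in range(len(joint[0]))]
    total = 0.0
    for a, r in enumerate(joint):
        for b, p in enumerate(r):
            if p > 0.0:
                total += p * math.log(p / (row[a] * col[b]))
    return total

def joint_table(pairs, n_a, n_b):
    """Empirical joint distribution of equally likely (a, b) pairs."""
    table = [[0.0] * n_b for _ in range(n_a)]
    for a, b in pairs:
        table[a][b] += 1.0 / len(pairs)
    return table

def ib_objective(xs, encode, label, n_x, n_s, n_y, beta):
    """Information bottleneck objective I(X;S) - beta * I(S;Y) for a deterministic encoder."""
    ss = [encode(x) for x in xs]
    ys = [label(x) for x in xs]
    i_xs = mutual_information(joint_table(list(zip(xs, ss)), n_x, n_s))
    i_sy = mutual_information(joint_table(list(zip(ss, ys)), n_s, n_y))
    return i_xs - beta * i_sy

# Toy assumption: X is uniform on {0, 1, 2, 3} and the label is Y = X // 2.
xs = [0, 1, 2, 3]
label = lambda x: x // 2

# Encoder A copies the input; encoder B keeps only the label-relevant bit.
ib_identity = ib_objective(xs, lambda x: x, label, n_x=4, n_s=4, n_y=2, beta=1.0)
ib_compact = ib_objective(xs, lambda x: x // 2, label, n_x=4, n_s=2, n_y=2, beta=1.0)
print(ib_identity, ib_compact)
```

Both encoders are equally informative about the label in this toy setting, but the compact one retains less information about the input, so it attains the lower objective value.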

As for the disentanglement requirement, prior work on machine learning tasks reports that disentanglement contributes to generalization (Montero et al., 2020; Yang et al., 2023). A common way to improve disentanglement is to minimize the mutual information among the representations.

Since mutual information is hard to estimate, one can minimize it by decreasing an upper bound and maximize it by increasing a lower bound. CLUB (Cheng et al., 2020) introduces a contrastive log-ratio upper bound of mutual information, which can be written as

(2) $I(X;Y) \leq I_{CLUB}(X;Y) = E_{p(x,y)}\left[\log p(y|x)\right] - E_{p(x)}E_{p(y)}\left[\log p(y|x)\right].$

When the conditional distribution $p(y|x)$ is not known, one can use a variational distribution $q_{\theta}(y|x)$ to approximate it, and the upper bound then becomes

(3) $I_{vCLUB} = E_{p(x,y)}\left[\log q_{\theta}(y|x)\right] - E_{p(x)}E_{p(y)}\left[\log q_{\theta}(y|x)\right].$
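A sample-based version of Eq. (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the variational distribution is a one-dimensional Gaussian whose parameters are fixed by hand rather than learned, the data are synthetic, and the names `gaussian_log_prob` and `vclub_estimate` are hypothetical.

```python
import math
import random

def gaussian_log_prob(y, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (y - mean) ** 2 / (2.0 * std * std)

def vclub_estimate(xs, ys, log_q):
    """Sample estimate of Eq. (3): paired term minus the term over all (x_i, y_j) combinations."""
    n = len(xs)
    positive = sum(log_q(ys[i], xs[i]) for i in range(n)) / n
    negative = sum(log_q(ys[j], xs[i]) for i in range(n) for j in range(n)) / (n * n)
    return positive - negative

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys_dependent = [x + rng.gauss(0.0, 0.5) for x in xs]      # y depends on x
ys_independent = [rng.gauss(0.0, 1.0) for _ in range(200)]  # y ignores x

# Assumed variational distributions: q(y|x) = N(x, 0.5^2) for the dependent case,
# and q(y|x) = N(0, 1), which ignores x, for the independent case.
estimate_dependent = vclub_estimate(
    xs, ys_dependent, lambda y, x: gaussian_log_prob(y, x, 0.5))
estimate_independent = vclub_estimate(
    xs, ys_independent, lambda y, x: gaussian_log_prob(y, 0.0, 1.0))
print(estimate_dependent, estimate_independent)
```

In practice $q_{\theta}$ would be a learned network fitted by maximum likelihood on paired samples; when $q_{\theta}$ ignores its conditioning input, the two terms cancel and the estimate vanishes.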

MINE (Belghazi et al., 2018) proposes a lower bound of the mutual information based on the Donsker-Varadhan representation of KL divergence:

(4) $I(X;Y) := D_{KL}(\mathbb{J} \,\|\, \mathbb{M}) \geq \hat{I}_{w}^{(DV)}(X;Y) := E_{\mathbb{J}}\left[T_{w}(x,y)\right] - \log E_{\mathbb{M}}\left[e^{T_{w}(x,y)}\right],$

where $\mathbb{J}$ is the joint distribution, $\mathbb{M}$ is the product of the marginal distributions, and $T_{w}$ is a discriminator. DIM (Hjelm et al., 2019) then points out that the precise value of the mutual information is not necessarily needed and uses the Jensen-Shannon divergence to estimate it instead, proposing a GAN-style loss:

(5) $L = E_{\mathbb{J}}\left[\log T_{w}(x,y)\right] + E_{\mathbb{M}}\left[\log(1 - T_{w}(x,y))\right].$
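The two estimators in Eqs. (4) and (5) can be sketched from discriminator outputs alone. The sketch below assumes the discriminator has already been evaluated on samples from the joint and from the product of marginals: for Eq. (4) the outputs are treated as unbounded real scores, while for Eq. (5) they are assumed to lie in (0, 1). The function names and the example values are hypothetical.

```python
import math

def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound of Eq. (4): E_J[T] - log E_M[exp(T)]."""
    mean_joint = sum(joint_scores) / len(joint_scores)
    mean_exp = sum(math.exp(t) for t in marginal_scores) / len(marginal_scores)
    return mean_joint - math.log(mean_exp)

def gan_style_objective(joint_probs, marginal_probs):
    """GAN-style objective of Eq. (5): E_J[log T] + E_M[log(1 - T)], with T in (0, 1)."""
    term_joint = sum(math.log(p) for p in joint_probs) / len(joint_probs)
    term_marginal = sum(math.log(1.0 - p) for p in marginal_probs) / len(marginal_probs)
    return term_joint + term_marginal

# Hand-picked discriminator outputs, used purely for illustration.
dv_value = dv_lower_bound([1.0, 1.0], [-1.0, -1.0])
sharp = gan_style_objective([0.9, 0.9], [0.1, 0.1])   # discriminator separates the two
chance = gan_style_objective([0.5, 0.5], [0.5, 0.5])  # discriminator at chance level
print(dv_value, sharp, chance)
```

A discriminator that separates joint pairs from marginal pairs attains a larger objective than one at chance level, which is the signal the representation learner exploits.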

3. Problem Formulation

Click-through rate (CTR) prediction estimates the probability that a user clicks a candidate item, which is an essential task in recommender systems. Conventional CTR prediction methods (Rendle, 2010; Juan et al., 2016; He et al., 2017; Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018; Liu et al., 2020; Song et al., 2019; Qu et al., 2018; Lin et al., 2023; Wang et al., 2017) can be formulated as

(6) $p(y|X,\theta),$

where $X=\left[x_{1},x_{2},\dots,x_{F}\right]$ is the input consisting of features from $F$ feature fields and $\theta$ denotes the model parameters. These methods make predictions based solely on the target features and the model parameters. Another line of methods, which model user behavior patterns (Zhou et al., 2018, 2019; Pi et al., 2019; Feng et al., 2019; Pi et al., 2020), has a strong impact on personalized recommendation and brings great performance gains. These methods can be formulated as

(7) $p(y|X,D_{his},\theta),$

where $D_{his}$ represents the user behavior histories. These methods focus on modeling users’ historical patterns by directly taking users’ historical behaviors as input.

In this paper, we aim to extract the diverse patterns existing in old data and build an efficient parametric knowledge base to supplement the target prediction. The framework can be formulated as

(8) $KB_{\theta^{*}} \leftarrow D_{old},$
(9) $p(y|X,KB_{\theta^{*}},\theta),$
(10) $p(y|X,D_{his},KB_{\theta^{*}},\theta),$

where $D_{old}$ denotes the old data and $KB_{\theta^{*}}$ denotes the knowledge base we construct from the old data.

Figure 2. The framework of EDK. In the knowledge compression stage, we compress the essential and disentangled knowledge within old data into the parametric knowledge base. Concretely, we extract patterns within data instances with the knowledge extractor and memorize them through the knowledge encoder, which can deal with inputs of arbitrary scales. The overall knowledge compression process is regularized by two principles, essential and disentangled, for better generalization and robustness. During prediction, the target can access the frozen knowledge base for instance-wise knowledge and adapt the knowledge to inject it into an arbitrary recommendation backbone for enhanced prediction.

4. Methodology

Our framework can be divided into two stages: knowledge compression and knowledge utilization. In the knowledge compression stage, we compress the essential and disentangled knowledge within the old data into compact vectors stored in the parameters of the parametric knowledge base. This stage consists of two modules: a knowledge extractor that extracts informative patterns from the raw data features and a knowledge encoder that encodes and memorizes patterns of arbitrary scales. The compression process is guided by the essential and disentangled principles to obtain robust and generalized knowledge within limited model parameters. After the compression, the knowledge extractor and the knowledge encoder make up the parametric knowledge base and serve as the retrieval key generator and the memory network, respectively, in the following phase. In the knowledge utilization stage, we access the knowledge base trained in the previous stage and adapt the knowledge for enhanced prediction.

4.1. Knowledge Compression

During this stage, we compress the knowledge of old data into a parametric knowledge base consisting of a knowledge extractor $f(\cdot)$ and a knowledge encoder $g(\cdot)$, guided by the essential and disentangled principles. The overall knowledge compression process can be formulated as

(11) $\underbrace{X \xrightarrow{f(\cdot)} \{\tilde{\mathbf{s}}_{j}\} \xrightarrow{g(\cdot)} \{\mathbf{s}_{j}\} \xrightarrow{Aggregator} \mathbf{c}}_{\text{Regularized by Essential \& Disentangled}},$

where $X$ denotes a sample of the old data, $\{\tilde{\mathbf{s}}_{j}\}$ denotes the different patterns of the input, $\{\mathbf{s}_{j}\}$ denotes the encoded knowledge vectors, and $\mathbf{c}$ denotes the final abbreviated representation of the input. For simplicity, we omit the index of the data sample. After training, the knowledge extractor $f(\cdot)$ and the knowledge encoder $g(\cdot)$ constitute the parametric knowledge base into which the vast old data documents are compressed.

4.1.1. Essential & Disentangled Principles

The whole compression process is regularized by the essential and the disentangled principles. The essential principle encourages a compressed representation that captures the sufficient and minimal information relevant to the task. Eliminating irrelevant or noisy information helps the parametric knowledge base focus only on the invariant patterns and prevents it from being biased by spurious patterns. The disentangled principle decomposes the entanglement of the memorized knowledge patterns. This eliminates redundancy in the memorized knowledge and promotes the disentanglement of different underlying factors, which helps the parametric knowledge base generalize its knowledge to new instances or distributions more effectively. The two principles are formulated as follows:

  • Essential. $\min I(X;\{\mathbf{s}_{j}\}) - \alpha I(\{\mathbf{s}_{j}\};y)$. The generated knowledge representations should contain as much task-relevant information of the original input and as little task-irrelevant information as possible. This helps distill complex data into a sufficient and minimal representation, enabling better generalization and robustness.

  • Disentangled. $\min \sum_{i\neq j} I(\mathbf{s}_{i};\mathbf{s}_{j})$. The generated knowledge representations should contain different aspects of information about the input. This helps reduce memory redundancy and disentangle underlying factors for better invariance.

Together, the two principles contribute to sufficient, minimal, and invariant knowledge representations, which improve the expressiveness and generalization of the parameters of the knowledge base without increasing its size. The objective of the two principles can be formulated as

(12) $\min I(X;\{\mathbf{s}_{j}\}) - \alpha I(\{\mathbf{s}_{j}\};y) + \beta \sum_{i\neq j} I(\mathbf{s}_{i};\mathbf{s}_{j}),$

where $\alpha$ and $\beta$ are hyperparameters that scale the loss components.

Due to the complexity of computing mutual information among a set of variables, we can relax the above objective and obtain the following loss:

(13) $l_{reg} = \underbrace{-\alpha \sum_{j} I(\mathbf{s}_{j};y) + I(X;\mathbf{c})}_{\text{Essential}} + \underbrace{\beta \sum_{i\neq j} I(\mathbf{s}_{i};\mathbf{s}_{j})}_{\text{Disentangled}},$

where $\mathbf{c}$ is the abbreviated representation of $X$.

To achieve the essential objective, we need to 1) maximize the mutual information between each encoded knowledge vector and the label and at the same time 2) minimize the mutual information between the abbreviated representation and the input. These two strike a balance between compression and prediction. For maximizing the mutual information between each encoded knowledge vector and the label, the difficulty lies in that the label space is discrete and only contains two values. Therefore, we follow DIM (Hjelm et al., 2019) to maximize the distance between the joint distribution and the marginal distribution. Concretely, we regard a pattern representation and a random embedded input with the same label as the joint distribution. For the marginal distribution, we sample the pair as the pattern representation and a random embedded input with the opposite label.

(14) $\max \sum_{j} I(\mathbf{s}_{j};y) = \max \sum_{j} \left( \log D_{\theta}(\mathbf{s}_{j},\mathbf{c}^{+}) + \log\left(1 - D_{\theta}(\mathbf{s}_{j},\mathbf{c}^{-})\right) \right).$
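The label-aware pairing described above can be sketched as follows, using the GAN-style form of Eq. (5). The discriminator here is a hypothetical stand-in (a sigmoid of the dot product) rather than the learned $D_{\theta}$, and the embeddings, labels, and function names are illustrative assumptions.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def discriminator(s, c):
    """Hypothetical stand-in for D_theta: sigmoid of the dot product of s and c."""
    return sigmoid(sum(a * b for a, b in zip(s, c)))

def sample_contrast(index, labels, embeddings, rng):
    """Pick c+ (another input with the same label) and c- (an input with the opposite label).

    Assumes the batch holds at least one other sample of each label.
    """
    same = [i for i, lab in enumerate(labels) if lab == labels[index] and i != index]
    opposite = [i for i, lab in enumerate(labels) if lab != labels[index]]
    return embeddings[rng.choice(same)], embeddings[rng.choice(opposite)]

def essential_label_objective(patterns, c_pos, c_neg, eps=1e-12):
    """Sum over patterns of log D(s_j, c+) + log(1 - D(s_j, c-)); larger is better."""
    return sum(
        math.log(discriminator(s, c_pos) + eps)
        + math.log(1.0 - discriminator(s, c_neg) + eps)
        for s in patterns
    )

rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.2]]  # toy embedded inputs
labels = [1, 1, 0, 0]
patterns = [[2.0, 0.0], [1.0, 1.0]]  # toy knowledge vectors s_j of sample 0

c_pos, c_neg = sample_contrast(0, labels, embeddings, rng)
aligned = essential_label_objective(patterns, c_pos, c_neg)
swapped = essential_label_objective(patterns, c_neg, c_pos)
print(aligned, swapped)
```

Swapping the positive and negative partners lowers the objective in this toy setting, which is the behavior that pushes each knowledge vector to carry label information.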

To minimize the mutual information between the abbreviated representation and the input, we minimize the upper bound of the mutual information proposed in (Cheng et al., 2020):

(15) $\min I(X;\mathbf{c}) = \min I_{vCLUB}(X;\mathbf{c}).$

To achieve the disentangled objective, we minimize the mutual information among the encoded knowledge vectors. Since the number of vector pairs grows polynomially with the number of vectors, it is impractical to train a mutual information estimator for each pair. Instead, we use a loss that is equivalent to the contrastive loss, which enlarges the discrepancy among different patterns.

(16) $$\min \sum_{i \neq j} I(\mathbf{s}_i; \mathbf{s}_j) = \min -\sum_{i} \log \frac{z(\mathbf{s}_i)\, z(\mathbf{s}_i)}{\sum_{i \neq j} z(\mathbf{s}_i)\, z(\mathbf{s}_j)},$$

where $z$ is the augmentation operator defined by

(17) $$z = MLP(Norm(Dropout(\cdot))).$$
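One possible reading of Equation (16) is an InfoNCE-style loss over two augmented views of the $K$ pattern vectors. The sketch below follows that reading. Exponentiating the dot products and taking the views as given inputs (instead of producing them with the MLP, Norm, and Dropout of Equation (17)) are assumptions, and `disentangle_loss` is a hypothetical name.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def disentangle_loss(view_a, view_b):
    """-sum_i log( exp(a_i . b_i) / sum_{j != i} exp(a_i . b_j) ) over K patterns."""
    loss = 0.0
    for i, a_i in enumerate(view_a):
        # Similarity of pattern i with its own second view.
        positive = math.exp(dot(a_i, view_b[i]))
        # Similarity of pattern i with every other pattern.
        negatives = sum(math.exp(dot(a_i, b_j))
                        for j, b_j in enumerate(view_b) if j != i)
        loss -= math.log(positive / negatives)
    return loss

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # two clearly distinct patterns
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # two identical patterns
loss_orthogonal = disentangle_loss(orthogonal, orthogonal)
loss_collapsed = disentangle_loss(collapsed, collapsed)
```

Distinct patterns obtain a lower loss than collapsed ones, which matches the stated goal of enlarging the discrepancy among patterns.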

4.1.2. Knowledge Extractor $f(\cdot)$

A sample of the data can be represented by $X=\left[x_{1},x_{2},\dots,x_{F}\right]$, where $x_{i}$ is the feature value of the $i$-th field and $F$ is the number of feature fields. For each sample of the old data, we generate different masks to obtain different patterns and encode the patterns for knowledge. Concretely, for an arbitrary input $X$, we use a global attentive readout to extract the mutual relations among features and obtain the context-aware features for mask generation.

Firstly, we embed the features of input $X=\left[x_{1},x_{2},\dots,x_{F}\right]$ by

(18) $$\forall x_{i} \in X, \quad \mathbf{x}_{i} = \Phi_{1}(x_{i}),$$
(19) $$\mathbf{H}_{0} = \left[\mathbf{x}_{1}, \dots, \mathbf{x}_{F}\right],$$

where $\Phi_{1}$ denotes the embedding operation used to generate embeddings for knowledge extraction, $\mathbf{H}_{0} \in R^{F \times d}$, and $d$ is the embedding size.

Then we feed the embedded input into a global self-attention layer to obtain the context-aware features for pattern extraction.

(20) $$\mathbf{H} \xleftarrow{GlobalAtt} \mathbf{H}_{0}.$$

After the global self-attention readout, we generate the masks for the $K$ patterns of the input by

(21) $$\mathbf{M} = \mathbf{H}\mathbf{P},$$

where $\mathbf{M} \in R^{F \times K}$ is the generated mask, $\mathbf{P} \in R^{d \times K}$ is a learnable parameter, and $K$ denotes the number of knowledge patterns we generate.

To make each entry of the mask approximately binary and ensure gradient consistency, we map each entry $m_{ij}$ of the generated mask matrix $\mathbf{M}$ using the hard concrete distribution (Maddison et al., 2017), which is a continuous relaxation of discrete random variables. Concretely, we obtain the masks as

(22) $$u_{ij} \sim U(0,1),$$
(23) $$m_{ij} = \sigma\left(\left(\log u_{ij} - \log(1-u_{ij}) + \log m_{ij}\right)/\beta\right),$$
(24) $$m_{ij} = Tanh\left(m_{ij}(\delta-\gamma)+\gamma\right),$$

where $u_{ij}$ is sampled from a uniform distribution and $\sigma$ denotes the sigmoid function.
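The three steps in Equations (22)-(24) can be sketched for a single mask entry as below. The values chosen for $\beta$, $\gamma$, and $\delta$ are assumptions (the text does not report them), the function name `relaxed_mask_entry` is hypothetical, and the argument `log_m` stands in for the $\log m_{ij}$ term of Equation (23).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relaxed_mask_entry(log_m, beta=0.5, gamma=-0.1, delta=1.1, rng=random):
    """One mask entry following Eqs. (22)-(24); beta, gamma, delta are assumed values."""
    # Eq. (22): uniform noise, clamped away from 0 and 1 so both logs stay finite.
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    # Eq. (23): sigmoid of the noisy logit, sharpened by the temperature beta.
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_m) / beta)
    # Eq. (24): stretch to the interval (gamma, delta), then squash with tanh as printed.
    return math.tanh(s * (delta - gamma) + gamma)

rng = random.Random(0)
high = [relaxed_mask_entry(5.0, rng=rng) for _ in range(500)]    # entries pushed "on"
low = [relaxed_mask_entry(-5.0, rng=rng) for _ in range(500)]    # entries pushed "off"
```

Because the noise enters through a differentiable transformation of $u_{ij}$, gradients can flow to the mask logits even though the resulting entries cluster near the two ends of the range.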

After we obtain the masks, we apply them to the original input to generate different patterns.

(25) $$\mathbf{H}' = \left[\Phi_{2}(x_{1}), \dots, \Phi_{2}(x_{F})\right],$$
(26) $$\tilde{\mathbf{s}}_{j} = \mathbf{H}' \odot \mathbf{M}_{:,j}, \quad \forall j \in \{1,\dots,K\},$$

where $\Phi_{2}$ is the embedding operation used to generate embeddings for knowledge memorization, $\odot$ denotes the element-wise product, and $\tilde{\mathbf{s}}_{j} \in R^{F \times d}$ is a pattern vector.
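To make the shapes in Equations (21) and (26) concrete, the following sketch builds a raw mask from toy context-aware features and applies each mask column to the second set of embeddings. All sizes and values are made up for illustration, the helper names are hypothetical, and the hard concrete mapping is skipped for brevity.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_masks(h_prime, masks):
    """Eq. (26): pattern j keeps row i of H' scaled by masks[i][j]."""
    num_fields, num_patterns = len(masks), len(masks[0])
    return [[[masks[i][j] * v for v in h_prime[i]] for i in range(num_fields)]
            for j in range(num_patterns)]

# Toy sizes (assumed): F = 3 fields, d = 2, K = 2 patterns.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # context-aware features, F x d
P = [[1.0, 0.0], [0.0, 1.0]]                      # learnable projection, d x K
M = matmul(H, P)                                  # Eq. (21): raw mask, F x K
H_prime = [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]    # Phi_2 embeddings, F x d
patterns = apply_masks(H_prime, M)                # K patterns, each F x d
```

Each mask column is broadcast across the embedding dimension, so a field is either kept or suppressed as a whole within a given pattern.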

4.1.3. Knowledge Encoder $g(\cdot)$

After we obtain the extracted patterns $\{\tilde{\mathbf{s}}_{j}\}$, we feed the patterns into the knowledge encoder to memorize them. Since the generated patterns are of arbitrary lengths, the encoder should be capable of handling inputs of different lengths. Hence, we adopt the classical self-attention architecture (Vaswani et al., 2017).

For every masked knowledge pattern, each entry attends to the others through a self-attention layer as follows:

(27) $$\mathbf{Q} = \tilde{\mathbf{s}}\mathbf{W}_{Q}, \quad \mathbf{K} = \tilde{\mathbf{s}}\mathbf{W}_{K}, \quad \mathbf{V} = \tilde{\mathbf{s}}\mathbf{W}_{V},$$
(28) $$\mathbf{A} = softmax\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right),$$
(29) $$Attention(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathbf{A}\mathbf{V}.$$
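A single-head version of Equations (27)-(29) can be sketched in plain Python as follows. The identity projection weights and the toy pattern are assumptions chosen so the result is easy to inspect; the paper's encoder uses several heads and layers.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(s, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows (fields) of one pattern."""
    q, k, v = matmul(s, w_q), matmul(s, w_k), matmul(s, w_v)            # Eq. (27)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]                                 # transpose of K
    scores = matmul(q, k_t)
    a = [softmax([x / math.sqrt(d) for x in row]) for row in scores]     # Eq. (28)
    return a, matmul(a, v)                                               # Eq. (29)

identity = [[1.0, 0.0], [0.0, 1.0]]
pattern = [[1.0, 0.0], [0.0, 1.0]]     # toy masked pattern with F = 2, d = 2
weights, output = self_attention(pattern, identity, identity, identity)
```

Since the attention weights are computed from whatever rows are present, the same function handles patterns with different numbers of active fields.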

We represent the above process with $\psi(\cdot)$. Then the encoded vector for each knowledge pattern and the final abbreviated representation $\mathbf{c}$ of an input can be obtained by:

(30) $$\mathbf{s}_{j} = MLP(\psi(\tilde{\mathbf{s}}_{j})),$$
(31) $$\mathbf{c} = MLP(AGG_{j}\{\mathbf{s}_{j}\}).$$

4.1.4. Objective

The training objective of the knowledge compression stage can then be summarized as

(32) $$l_{compression} = l_{ce} + \lambda_{1} l_{reg} + \lambda_{2} l_{0},$$

where $l_{ce}$ denotes the cross-entropy loss of predicting labels on $\mathbf{c}$, $l_{reg}$ is the loss corresponding to the two principles proposed in Equation (13), and $l_{0}$ denotes the regularization loss of the generated masks.

4.2. Knowledge Utilization

After we establish the parametric knowledge base $KB_{\theta*} = g(f(\cdot))$ using the old data, we can extract the essential and disentangled knowledge vectors for target prediction enhancement. When making a prediction for a target sample $X$, we extract the knowledge from the knowledge base and append it for prediction. The process can be formulated as:

(33) $$\hat{y} = \Phi_{base}\left(MLP\left(KB_{\theta*}(X)\right), X\right),$$

where $\Phi_{base}$ denotes the backbone model and $KB_{\theta*}$ denotes the frozen parametric knowledge base.

The final prediction is optimized by the cross-entropy loss:

(34) $$l_{pred} = \sum_{\langle X, y\rangle \in D_{train}} \left(y\log\hat{y} + (1-y)\log(1-\hat{y})\right).$$

5. Experiments

In this section, we empirically evaluate the proposed model on two datasets for the click-through rate (CTR) prediction task. The evaluation is organized around the following research questions.

  • RQ1

    Is the proposed parametric knowledge base model-agnostic? Does the proposed parametric knowledge base offer performance gain for all the base models?

  • RQ2

    Do the proposed two rules improve the quality of the parametric knowledge base?

  • RQ3

    Are the components of the parametric knowledge effective? How does the number of knowledge vectors per data instance influence the performance?

  • RQ4

    Are the objectives of the proposed constraints (i.e., essential & disentangled principles) achieved during training?

5.1. Experimental Setup

5.1.1. Datasets

We use two datasets to validate the proposed model.

Table 1. Statistics of the Used Datasets.
| Dataset | # Users | # Items | # Documents | # Fields | # Features |
|---|---|---|---|---|---|
| AD | 1,061,768 | 827,009 | 25,029,435 | 12 | 3,029,333 |
| Eleme | 5,782,482 | 1,853,764 | 49,114,930 | 17 | 16,516,885 |
  • Ali Display (https://tianchi.aliyun.com/dataset/56). It is a dataset provided by Alibaba to estimate the click-through rate of Taobao display ads.

  • Eleme (https://tianchi.aliyun.com/dataset/131047). It is a dataset provided by Eleme, which offers take-away services to users.

The statistics of the datasets are summarized in Table 1. For both datasets, clicked samples are treated as positive samples and exposed but not clicked samples are treated as negative samples. We split the data using global timestamps (Qin et al., 2020, 2021). For regular baseline models, we use the logs before $T_{1}$ for training, 50% of the logs after $T_{1}$ for validation, and the other 50% of the logs after $T_{1}$ for testing. For models supplemented with extra knowledge vectors, we use the logs before $T_{0}$ for building the knowledge base, the logs in $[T_{0}, T_{1})$ for training the base models, 50% of the logs after $T_{1}$ for validation, and the other 50% of the logs after $T_{1}$ for testing, where $T_{0} < T_{1}$. The training setup is illustrated in Figure 3.

Figure 3. The dataset splitting and training setup.
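The knowledge-augmented split described above can be sketched as a small function over timestamped logs. The function name `split_logs` and the toy records are hypothetical, and taking the two post-$T_{1}$ halves in time order is an assumption, since the text does not say how the two 50% portions are chosen.

```python
def split_logs(logs, t0, t1):
    """Split (timestamp, record) pairs into knowledge-base, train, validation, and test sets."""
    logs = sorted(logs, key=lambda item: item[0])
    kb_data = [r for r in logs if r[0] < t0]           # builds the knowledge base
    train = [r for r in logs if t0 <= r[0] < t1]       # trains the base model
    later = [r for r in logs if r[0] >= t1]            # held out for evaluation
    # Assumption: the two halves are taken in time order.
    half = len(later) // 2
    return kb_data, train, later[:half], later[half:]

toy_logs = [(t, "record-%d" % t) for t in range(10)]
kb_data, train, valid, test = split_logs(toy_logs, t0=3, t1=6)
```

For a regular baseline without the knowledge base, the first two parts would simply be merged into one training set.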

5.1.2. Baselines

Since our method is model-agnostic, we use different widely-used CTR models as our base models.

We first compare a group of feature interaction models. DeepFM (Guo et al., 2017) models 2-way pairwise feature interactions and uses a DNN for generalization. xDeepFM (Lian et al., 2018) utilizes the CIN network to model finite-order feature interactions. PNN (Qu et al., 2018) explicitly models the "AND" relation between features by introducing the product layer. AutoInt (Song et al., 2019) builds high-order feature interactions with multi-head attention. DCN (Wang et al., 2017) learns both low-order feature interactions and high-order nonlinear features efficiently. We also compare with models that extract multi-sample information for recommendation enhancement. DIN (Zhou et al., 2018) models user interests according to the attention scores between the candidate item and the target user's historically clicked items.

5.1.3. Metrics

The evaluation metrics are the area under the ROC curve (AUC) and the negative log-likelihood (LogLoss).
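For reference, both metrics can be computed from binary labels and predicted click probabilities as in the sketch below. This is a direct, quadratic-time illustration with hypothetical function names, not the evaluation code used in the experiments.

```python
import math

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, scores, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep the logarithms finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

toy_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
toy_logloss = log_loss([1, 0], [0.5, 0.5])
```

Higher AUC is better, while lower LogLoss is better.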

5.1.4. Hyperparameters

We use a consistent embedding size for all base models for fair comparison. For the AD dataset, the embedding size for base models is set to 32 and the size of the knowledge vectors is set to 16. For the Eleme dataset, the embedding size for base models is set to 16 and the size of the knowledge vectors is set to 8. The number of patterns to extract from each data instance is set to 20. The depth and the number of attention heads of the knowledge encoder are both set to 3. The learning rate is searched in the range of [1e-4, 3e-4, 5e-4, 1e-3], while the weight decay is searched in the range of [1e-4, 3e-4, 5e-4, 5e-5, 3e-5, 1e-5]. Adam (Kingma and Ba, 2014) is used for training.
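The reported settings can be collected into a small configuration sketch. The dictionary keys and the function name `search_space` are hypothetical, and enumerating every learning rate and weight decay pair exhaustively is an assumption, since the text only gives the searched values.

```python
from itertools import product

LEARNING_RATES = [1e-4, 3e-4, 5e-4, 1e-3]
WEIGHT_DECAYS = [1e-4, 3e-4, 5e-4, 5e-5, 3e-5, 1e-5]

# Fixed settings reported in the text, keyed by dataset.
FIXED = {
    "AD": {"base_embedding_size": 32, "knowledge_vector_size": 16},
    "Eleme": {"base_embedding_size": 16, "knowledge_vector_size": 8},
}
SHARED = {"num_patterns": 20, "encoder_depth": 3, "encoder_heads": 3, "optimizer": "Adam"}

def search_space(dataset):
    """Yield one configuration per (learning rate, weight decay) pair."""
    for lr, wd in product(LEARNING_RATES, WEIGHT_DECAYS):
        yield {**FIXED[dataset], **SHARED, "learning_rate": lr, "weight_decay": wd}

ad_configs = list(search_space("AD"))
```

Under the exhaustive-grid assumption, this gives 24 candidate configurations per dataset.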

5.2. Main Results (RQ1)

Table 2. Experiment results on AD and Eleme. Since EDK is model-agnostic, we test its compatibility on different backbones. N/A denotes the original recommendation backbone. Rel. Impr. denotes the relative AUC improvement of EDK against each backbone model. The symbol * indicates statistically significant improvement with p-value < 0.001.
| Model | Variant | AD AUC | AD Logloss | AD Rel. Impr. | Eleme AUC | Eleme Logloss | Eleme Rel. Impr. |
|---|---|---|---|---|---|---|---|
| DeepFM | N/A | 0.6220 | 0.1949 | | 0.6221 | 0.0907 | |
| DeepFM | w/ EDK | 0.6378* | 0.1935* | 2.54% | 0.6555* | 0.0901* | 5.36% |
| DCN | N/A | 0.6231 | 0.1948 | | 0.6386 | 0.0889 | |
| DCN | w/ EDK | 0.6371* | 0.1936* | 2.25% | 0.6563* | 0.0881* | 2.77% |
| PNN | N/A | 0.6299 | 0.1944 | | 0.6324 | 0.0884 | |
| PNN | w/ EDK | 0.6364* | 0.1935* | 1.03% | 0.6552* | 0.0881* | 3.61% |
| xDeepFM | N/A | 0.6291 | 0.1943 | | 0.6466 | 0.0883 | |
| xDeepFM | w/ EDK | 0.6359* | 0.1939* | 1.08% | 0.6562* | 0.0887* | 1.48% |
| AutoInt | N/A | 0.6339 | 0.1944 | | 0.6464 | 0.0934 | |
| AutoInt | w/ EDK | 0.6361* | 0.1937* | 0.35% | 0.6554* | 0.0885* | 1.39% |
| DIN | N/A | 0.6233 | 0.1947 | | 0.6411 | 0.0883 | |
| DIN | w/ EDK | 0.6372* | 0.1935* | 2.23% | 0.6556* | 0.0883* | 2.26% |

The overall performance of the proposed method is displayed in Table 2. Six widely used conventional recommendation models are used as the backbone models, whose feature interaction operators include DNN-based, product-based, and attention-based methodologies. From the results, one can see that EDK improves the AUC of every selected backbone model on both datasets. The improvements are statistically significant with p-value < 0.001. This validates that compressing the knowledge of old data documents into parameters can preserve useful knowledge and offer performance gains for recommender systems. In addition, EDK shows superior model compatibility, which demonstrates the effectiveness of our established parametric knowledge base under the essential and disentangled constraints.
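The Rel. Impr. column can be reproduced with a one-line formula, assuming the usual definition of relative improvement as the AUC gain divided by the backbone AUC. The function name is hypothetical; the input values are the DeepFM and DCN results on AD from Table 2.

```python
def relative_improvement(base_auc, edk_auc):
    """Relative AUC improvement of the EDK variant over its backbone, in percent."""
    return (edk_auc - base_auc) / base_auc * 100.0

# Values taken from Table 2 (AD dataset).
deepfm_ad = relative_improvement(0.6220, 0.6378)
dcn_ad = relative_improvement(0.6231, 0.6371)
```

Rounded to two decimals, these evaluate to 2.54 and 2.25, matching the corresponding table entries.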

5.3. Ablation Study (RQ2-RQ3)

In this section, we conduct various ablation studies to validate the effectiveness of model designs.

5.3.1. Impact of the Two Proposed Principles (RQ2)

To force the parametric knowledge base to memorize the essential and disentangled knowledge patterns within limited model parameters, we propose two constraints to regularize the knowledge compression process. In this section, we investigate the effectiveness of the two proposed constraints. We conduct experiments with the backbones on which EDK achieves the best performance. Concretely, we use the DeepFM backbone for AD and the DCN backbone for Eleme. We remove the regularizations of the essential and disentangled principles to observe their impact. The results are displayed in Table 3.

Table 3. Impact of the Proposed Principles.
| Variant | AD AUC | AD Logloss | Eleme AUC | Eleme Logloss |
|---|---|---|---|---|
| EDK | 0.6378 | 0.1935 | 0.6563 | 0.0881 |
| w/o Disentangled | 0.6356 | 0.1939 | 0.6433 | 0.0886 |
| w/o Essential | 0.6347 | 0.1937 | 0.6520 | 0.0884 |
| w/o both | 0.6304 | 0.1940 | 0.6411 | 0.0888 |

From the results, we can see that the proposed essential and disentangled constraints improve the quality of the knowledge memorized by the parametric knowledge base. The essential principle and the disentangled principle each contribute to the final performance on their own, and the two complement each other to yield higher-quality knowledge.

5.3.2. Impact of the Knowledge Extractor (RQ3)

The patterns of the old data space are diverse and complex, and are hard to memorize fully due to the existence of spurious patterns and entanglement. To memorize the informative patterns within the data space, a knowledge extractor is designed to extract the essential and disentangled patterns within data points. When the number of knowledge patterns per data instance is set to 1, it reduces to memorizing a raw data point in the space. In this section, we conduct experiments on the number of knowledge patterns per data point to study the impact of the knowledge extractor. Similarly, we select DeepFM as the backbone for the AD dataset and DCN as the backbone for the Eleme dataset. We test values in the range of [5, 10, 15, 20, 25]; the results are shown in Figure 4.

Figure 4. Performance of EDK w.r.t. different numbers of knowledge patterns per sample. Panels: (a) AD; (b) Eleme.

From the results, we can see that increasing the number of knowledge vectors per data point generally improves performance, which validates the effectiveness of the knowledge extractor. However, increasing the number of knowledge patterns further gives only minor improvements or even impairs the final performance. This is because additional knowledge patterns may introduce redundancy and noise, which hurts the robustness of the model parameters.

5.4. Objectives of the Principles (RQ4)

In this section, we analyze whether the proposed two principles are achieved during the knowledge compression process.

5.4.1. Loss Curves of Different Components

First of all, we study whether the losses corresponding to the two principles are learned well during the knowledge compression process.

Figure 5. The learning curves of different losses on AD dataset.

The learning curves of different losses on the AD dataset are illustrated in Figure 5. The blue line displays the loss curve for the disentangled constraint, which is $\sum_{i \neq j} I(\mathbf{s}_{i}; \mathbf{s}_{j})$. The red line and the black line together make up the loss for the essential constraint, which is $-\alpha \sum_{j} I(\mathbf{s}_{j}; y) + I(X; \mathbf{c})$. Concretely, the black line illustrates the curve for $-\sum_{j} I(\mathbf{s}_{j}; y)$, which promotes the memorization of task-relevant information, and the red line illustrates the curve for $I(X; \mathbf{c})$, which compresses the representation and helps to filter out the noisy information.

From the learning curves, we can see that all losses are learned well. For the red curve, which corresponds to the vCLUB value, the MI upper bound estimate increases during the first epoch and decreases in the following training steps, since it takes some steps to train the MI estimator well enough for correct estimation.

5.4.2. Visualizations of the Extracted Patterns

Then, we visualize the extracted patterns for more explicit analysis. Concretely, we analyze the pattern masks generated by the knowledge extractor and the corresponding representation encoded by the knowledge encoder.

Firstly, we study the masks generated by the trained knowledge extractor. Note that the entries of the masks are continuous. Here we regard an entry with a value larger than $0.5$ as an existing feature and count the non-zero entries of the thresholded masks. The resulting distribution of pattern scales is illustrated in Figure 6.

Figure 6. Pattern scales for AD and Eleme. (An entry with mask value $> 0.5$ is regarded as 1.) Panels: (a) AD; (b) Eleme.
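The pattern scale statistic can be sketched as a threshold-and-count over the mask columns. The toy mask values and the name `pattern_scales` are hypothetical; only the 0.5 threshold comes from the text.

```python
from collections import Counter

def pattern_scales(masks, threshold=0.5):
    """For each pattern (column), count the fields whose mask entry exceeds the threshold."""
    num_patterns = len(masks[0])
    return [sum(1 for row in masks if row[j] > threshold) for j in range(num_patterns)]

# Toy mask with F = 3 fields (rows) and K = 3 patterns (columns).
masks = [[0.9, 0.2, 0.7],
         [0.1, 0.8, 0.6],
         [0.4, 0.3, 0.95]]
scales = pattern_scales(masks)     # number of active fields per pattern
histogram = Counter(scales)        # how many patterns have each scale
```

A scale of 1 corresponds to a single distinct feature, intermediate values to feature interactions, and values close to $F$ to a lightly denoised view of the full input.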

From the results, we find that the scales of the extracted patterns vary. For Eleme, the extracted patterns include distinct features, feature interactions, and denoised views of the original inputs. For AD, in contrast, the extracted patterns mainly consist of high-order feature interactions and denoised views of the inputs.

Figure 7. T-SNE visualization of the representations of the raw data features and the extracted patterns. Panels: (a) AD; (b) Eleme.

In addition, we study the representations of the raw data instance features and the extracted patterns. We use T-SNE to visualize the distribution of the encodings of raw data instances and the extracted pattern vectors, as displayed in Figure 7. Each point in the left part of the figure denotes the representation of a raw data instance, and each point in the right part represents an encoded pattern vector. From the results, we can see that clearly different manifolds exist in the representation space of the extracted patterns for both the AD and Eleme datasets, while the raw data instances merely fall into different clusters. This validates the effectiveness of the proposed method in extracting the diverse and complex patterns from the old data.

6. Conclusion

In this paper, we design a parametric knowledge base, EDK, that compresses the massive old data into compact knowledge stored in parameters and is regularized by the proposed essential and disentangled principles. The two principles promote robust and generalized knowledge memorized by the parametric knowledge base without increasing its size. The essential principle strikes a balance between compression and prediction, which filters out the noisy and irrelevant information of the old data and preserves the task-relevant information. The disentangled principle helps to reduce the redundancy of stored information and decompose the entanglement of the knowledge representations, which gives rise to the invariance of the produced knowledge vectors. EDK is model-agnostic, flexible, and tailored for fine-grained instance-level knowledge augmentation. Experiments and various ablation studies justify its effectiveness.

References

  • Achille and Soatto (2018) Alessandro Achille and Stefano Soatto. 2018. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19, 1 (2018), 1947–1980.
  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, and Aaron C. Courville. 2018. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80. PMLR, 530–539.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS@RecSys.
  • Cheng et al. (2020) Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. 2020. CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1779–1788.
  • Christensen and Schiaffino (2011) Ingrid A. Christensen and Silvia Schiaffino. 2011. Entertainment recommender systems for group of users. Expert Systems with Applications 38, 11 (2011), 14127–14135. https://doi.org/10.1016/j.eswa.2011.04.221
  • Deng et al. (2022) Weijian Deng, Stephen Gould, and Liang Zheng. 2022. On the strong correlation between model invariance and generalization. Advances in Neural Information Processing Systems 35 (2022), 28052–28067.
  • Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. 2301–2307.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW.
  • Hjelm et al. (2019) R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  • Juan et al. (2016) Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. In RecSys.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1754–1763.
  • Lin et al. (2023) Jianghao Lin, Yanru Qu, Wei Guo, Xinyi Dai, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023. MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1384–1395.
  • Liu et al. (2020) Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 2636–2645.
  • Maddison et al. (2017) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • Montero et al. (2020) Milton Llera Montero, Casimir JH Ludwig, Rui Ponte Costa, Gaurav Malhotra, and Jeffrey Bowers. 2020. The role of disentanglement in generalisation. In International Conference on Learning Representations.
  • Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019. ACM, 2671–2679.
  • Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 2685–2692.
  • Qin et al. (2024) Jiarui Qin, Weiwen Liu, Ruiming Tang, Weinan Zhang, and Yong Yu. 2024. D2K: Turning Historical Data into Retrievable Knowledge for Recommender Systems. arXiv:2401.11478 [cs.IR]
  • Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
  • Qin et al. (2020) Jiarui Qin, Weinan Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Yong Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. ACM, 2347–2356.
  • Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Transactions on Information Systems (2018).
  • Rendle (2010) Steffen Rendle. 2010. Factorization machines. In ICDM.
  • Smith and Linden (2017) Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing (2017). https://www.amazon.science/publications/two-decades-of-recommender-systems-at-amazon-com
  • Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019. ACM, 1161–1170.
  • Tishby et al. (2000a) Naftali Tishby, Fernando C. N. Pereira, and William Bialek. 2000a. The information bottleneck method. CoRR physics/0004057 (2000).
  • Tishby et al. (2000b) Naftali Tishby, Fernando C. N. Pereira, and William Bialek. 2000b. The information bottleneck method. CoRR physics/0004057 (2000).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008.
  • Wang et al. (2023) Cheng Wang, Jiacheng Sun, Zhenhua Dong, Jieming Zhu, Zhenguo Li, Ruixuan Li, and Rui Zhang. 2023. Data-free Knowledge Distillation for Reusing Recommendation Models. In Proceedings of the 17th ACM Conference on Recommender Systems. 386–395.
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, August 13 - 17, 2017. ACM, 12:1–12:7.
  • Wu et al. (2023) Chuhan Wu, Fangzhao Wu, Yongfeng Huang, and Xing Xie. 2023. Personalized news recommendation: Methods and challenges. ACM Transactions on Information Systems 41, 1 (2023), 1–50.
  • Yang et al. (2023) Tao Yang, Yuwang Wang, Cuiling Lan, Yan Lu, and Nanning Zheng. 2023. Vector-based Representation is the Key: A Study on Disentanglement and Compositional Generalization. arXiv preprint arXiv:2305.18063 (2023).
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. 5941–5948.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chengru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018. 1059–1068.