Extracting Essential and Disentangled Knowledge
for Recommendation Enhancement

Kounianhua Du (Shanghai Jiao Tong University, Shanghai, China), Jizheng Chen (Shanghai Jiao Tong University, Shanghai, China), Jianghao Lin (Shanghai Jiao Tong University, Shanghai, China), Menghui Zhu (Huawei Noah’s Ark Lab, Shanghai, China), Bo Chen (Huawei Noah’s Ark Lab, Shanghai, China), Ruiming Tang (Huawei Noah’s Ark Lab, Shenzhen, China), and Weinan Zhang (Shanghai Jiao Tong University, Shanghai, China)


Abstract.

Recommender models play a vital role in various industrial scenarios, but they often face the catastrophic forgetting problem caused by fast-shifting data distributions, e.g., evolving user interests and fluctuating click signals during sales promotions. To alleviate this problem, a common approach is to reuse knowledge from historical data. However, preserving the vast and fast-accumulating data is hard and causes dramatic storage overhead. Memorizing old data through a parametric knowledge base, which compresses the vast amount of raw data into model parameters, has therefore been proposed. Despite its flexibility, improving the memorization and generalization capabilities of the parametric knowledge base remains challenging. In this paper, we propose two constraints to extract Essential and Disentangled Knowledge from past data for rational and generalized recommendation enhancement, which improve the capabilities of the parametric knowledge base without increasing its size. The essential principle helps to compress the input into representative vectors that capture the task-relevant information and filter out the noisy information. The disentanglement principle reduces the redundancy of stored information and pushes the knowledge base to focus on capturing the disentangled invariant patterns. Together, these two rules promote rational compression of information for robust and generalized knowledge representations. For the parametric knowledge base, we train a knowledge extractor that extracts knowledge patterns of arbitrary order from past data and a knowledge encoder that memorizes the arbitrary-order patterns; they serve as the retrieval key generator and the memory network, respectively, in the subsequent knowledge reusing phase. The whole process is regularized by the two proposed constraints. Extensive experiments on two datasets justify the effectiveness of the proposed method.

Recommender Systems, Information Compression


1. Introduction

Recommender systems play an important role in alleviating information overload and helping users discover relevant content in today’s vast digital landscape. They are widely used in various industries, including e-commerce (Smith and Linden, 2017), entertainment (Christensen and Schiaffino, 2011), social media (Wu et al., 2023), and online streaming platforms (Zhou et al., 2018, 2019).

However, the data distribution shifts fast in recommender systems, e.g., through evolving user interests and fluctuating click signals during sales promotions. This fast-shifting distribution often results in the catastrophic forgetting problem, where models cannot remember old knowledge well. One line of methods reuses knowledge from old data in a non-parametric way (Pi et al., 2020). However, preserving the vast and fast-accumulating raw data is costly and causes dramatic storage overhead. Compressing the knowledge from raw data into parameters for reuse has therefore been proposed (Qin et al., 2024; Wang et al., 2023).

Figure 1. (a) The illustration of the overall process. The vast knowledge of old data is compressed into the parametric knowledge base. (b) The essential principle for compression, which encourages a compressed representation that captures the fundamental knowledge of data and filters out the noises. (c) The disentangled principle for compression, which reduces the redundancy of stored patterns and decomposes the invariance for better generalization.

Despite the flexibility, improving the memorization and generalization capabilities of the parameters is challenging due to the noisy information and the diverse patterns in the vast amount of old data. For example, a user named $Alice$ might follow the trend to buy a product $Meal\_Replacement$ during sales promotions and give positive feedback because of a cash-back offer, even though the product is not an actual interest of hers. $Alice-Meal\_Replacement$ forms a spurious pattern that could bias the model away from memorizing the invariant pattern. In addition, the patterns in the data space are diverse and complex, which makes them hard to memorize fully. Some patterns may entangle with each other, leading to redundant storage and poorer generalization. For example, $Female-Actress-Lipstick$ , $Female-Lipstick$ , and $Actress-Lipstick$ entangle with each other: $Actress$ information contains $Female$ information, so $Female-Actress-Lipstick$ leads to redundant memorization. Decomposing $Female-Actress-Lipstick$ into $Female-Lipstick$ and $Actress-Lipstick$ helps to reduce redundancy and at the same time improves the expressiveness of the features, since the minimal information that has better invariance is kept.

To solve the challenges above, we propose the Essential and Disentangled Knowledge base (EDK), where two constraints regularize a parametric knowledge base for robust and generalized knowledge preservation without increasing the size of the knowledge base. Concretely, the Essential principle aims to capture the task-relevant information (the red part in Figure 1(b)) and filter out the noisy information (the grey part in Figure 1(b)) in the data. It follows the Information Bottleneck principle (Tishby et al., 2000a), which finds the optimal trade-off between compression and prediction. By striking a balance between these two aspects, the principle enables the model to preserve sufficient information about the data while keeping the representation minimal and invariant (Achille and Soatto, 2018), which contributes to lower storage cost and higher generalization capability. The Disentanglement principle eliminates redundant information among different patterns (the grey part in Figure 1(c)) and decomposes the invariance of different patterns. This reduction of redundancy helps to improve memorization within limited model parameters. At the same time, as shown in (Montero et al., 2020), improving disentanglement may contribute to generalization. Together, the two rules yield compressed and rational knowledge vectors that capture the fundamental knowledge of past data and generalize well to future distributions.

Concretely, our method can be split into two stages: knowledge compression and knowledge utilization. During the first stage, we compress the massive old data documents into the learned parameters. Specifically, we first extract informative patterns from the old data using a knowledge extractor. The extracted patterns are then encoded and memorized by a permutation-equivariant encoder that produces knowledge vectors. The whole process is regularized by the essential and disentangled principles. After the compression, the knowledge extractor and knowledge encoder together make up the parametric knowledge base and serve as the retrieval key generator and the memory network, respectively, in the following stage. During the knowledge utilization stage, we extract knowledge from the constructed knowledge base for each data instance and use it as a supplement for recommendation enhancement. Extensive experimental results validate the effectiveness of the proposed method, as well as its superior model compatibility.

Our contributions are threefold.

• We propose a parametric knowledge base that compresses the massive old data documents into robust and generalized parametric knowledge. It is a model-agnostic method that can offer diverse knowledge to various base models.

• We propose two rules to effectively regularize the parametric knowledge base. They promote robustness and generalization by extracting essential and disentangled knowledge from the old data without increasing the size of the knowledge base.

• The knowledge extraction and utilization of EDK is instance-level, which offers fine-grained information tailored to each instance for enhancement.

Extensive experiments over different baseline models, together with ablation studies, validate the effectiveness of the proposed method.

2. Related Work

2.1. Recommender Systems

This work is closely related to the research topics of recommender systems, which focus on capturing collaborative signals based on feature interaction learning and user behavior modeling.

For feature interaction learning, plenty of works mine the interactive patterns of categorical data. FM (Rendle, 2010) models features as low-dimensional embeddings and captures feature interactions by inner products. FFM (Juan et al., 2016) further develops field-aware interactions through multiple embedding settings. NeuFM (He et al., 2017) proposes to use deep neural networks, improving FM by incorporating a multi-layer perceptron. Wide & Deep (Cheng et al., 2016) combines the strengths of linear models and DNNs for memorization and generalization, respectively. DeepFM (Guo et al., 2017) uses an FM layer to replace the wide component in (Cheng et al., 2016) to model the pairwise feature interactions. xDeepFM (Lian et al., 2018) introduces the cross layer to replace the wide component and constructs limited high-order feature interactions. PNN (Qu et al., 2018) proposes a product layer to better model the “AND” relations between features. DCN (Wang et al., 2017) learns both low-dimensional feature crossing and high-dimensional nonlinear features efficiently. AutoInt (Song et al., 2019) learns high-order feature interactions via the self-attention mechanism. AutoFIS (Liu et al., 2020) applies a gate to each feature interaction and selects beneficial interactions of limited order by learning the gate values.

User behavior modeling plays a crucial role in recommendation systems by providing valuable insights into user preferences, interests, and behaviors. By understanding how users interact with a system and make choices, recommendation algorithms can effectively personalize and tailor recommendations to individual users. DIN (Zhou et al., 2018) models user interests with an attention mechanism, which reduces the influence of historical behaviors on items unrelated to the candidate advertisement when estimating the current click. DIEN (Zhou et al., 2019) further points out that user interests evolve over time and models the evolving interests of users with a GRU module. DSIN (Feng et al., 2019) splits behaviors into sessions to better model the evolving property of user interests. MIMN (Pi et al., 2019) further categorizes user interests into different channels and models different aspects of user interests. SIM (Pi et al., 2020) proposes to model long-term user behaviors and aggregates them using the fast SimHash algorithm.

2.2. Model Invariance & Generalization

Due to the limited capacity of model parameters, improving the quality of the memorized knowledge is important.

As stated in (Achille and Soatto, 2018), an ideal representation should be sufficient, minimal, invariant, and maximally disentangled. That work also proves that a sufficient representation of the data is invariant if and only if it contains the smallest amount of information. In other words, achieving the optimal balance of sufficiency and minimality will at the same time promote invariance, which contributes to generalization (Deng et al., 2022). This aligns with the information bottleneck theory (Tishby et al., 2000b), which states that a model should extract the information of an input sample $X$ that is most relevant to $Y$ :

$$\min\left(I(X;S)-\beta I(S;Y)\right), \tag{1}$$

where $S$ denotes the representation of the encoded input.
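As a toy illustration of Equation (1), the following sketch evaluates the objective on small discrete distributions. The joint probability tables and the value of $\beta$ are assumed numbers chosen for illustration only; they are not quantities from our experiments.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0.0:  # zero-probability cells contribute nothing
                total += p_ab * math.log(p_ab / (p_a[a] * p_b[b]))
    return total

def ib_objective(joint_xs, joint_sy, beta):
    """I(X;S) - beta * I(S;Y), the quantity minimized in Equation (1)."""
    return mutual_information(joint_xs) - beta * mutual_information(joint_sy)

# Assumed toy tables: S is a copy of a binary X. It is paired first with a
# label Y that is independent of S, then with a label Y that equals S.
copy_table = [[0.5, 0.0], [0.0, 0.5]]
independent_table = [[0.25, 0.25], [0.25, 0.25]]

useless = ib_objective(copy_table, independent_table, beta=2.0)
useful = ib_objective(copy_table, copy_table, beta=2.0)
print(useless, useful)
```

In this toy case the representation costs the same $I(X;S)$ in both settings, and the objective is lower only when $S$ is informative about $Y$.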

As for the disentanglement requirement, it has been shown in machine learning tasks that disentanglement contributes to generalization (Montero et al., 2020; Yang et al., 2023). A common way to improve disentanglement is to minimize the mutual information between the representations.

Since mutual information is hard to estimate, one can decrease an upper bound of the mutual information to minimize it and increase a lower bound to maximize it. CLUB (Cheng et al., 2020) introduces a contrastive log-ratio upper bound of mutual information, which can be written as

$$\begin{split}I(X;Y)&\leq I_{CLUB}(X;Y)\\ &=E_{p(x,y)}\left[\log p(y|x)\right]-E_{p(x)}E_{p(y)}\left[\log p(y|x)\right].\end{split} \tag{2}$$

When the conditional distribution $p(y|x)$ is not known, one could use a variational distribution $q_{\theta}(y|x)$ to approximate it and the upper bound then becomes

$$I_{vCLUB}=E_{p(x,y)}\left[\log q_{\theta}(y|x)\right]-E_{p(x)}E_{p(y)}\left[\log q_{\theta}(y|x)\right]. \tag{3}$$
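A sample-based version of Equation (3) can be sketched as follows. The variational model here is an assumed fixed Gaussian $q(y|x)=\mathcal{N}(x,1)$ used only for illustration; in practice $q_{\theta}$ would be a learned network, and the function names are hypothetical.

```python
import math

def log_q(y, x, weight=1.0, sigma=1.0):
    """Log-density of an assumed variational model q(y|x) = N(weight * x, sigma^2)."""
    variance = sigma ** 2
    return -0.5 * math.log(2.0 * math.pi * variance) - (y - weight * x) ** 2 / (2.0 * variance)

def vclub_estimate(xs, ys):
    """Sample estimate of Equation (3) from paired samples (x_i, y_i)."""
    n = len(xs)
    # First term: average log-likelihood over the paired (joint) samples.
    joint_term = sum(log_q(y, x) for x, y in zip(xs, ys)) / n
    # Second term: average over all (x_i, y_j) combinations (product of marginals).
    marginal_term = sum(log_q(y, x) for x in xs for y in ys) / (n * n)
    return joint_term - marginal_term

xs = [-1.0, 0.0, 1.0, 2.0]
dependent = vclub_estimate(xs, ys=list(xs))        # y is a copy of x
constant = vclub_estimate(xs, ys=[0.0] * len(xs))  # y carries no information about x
print(dependent, constant)
```

The estimate is positive when $y$ depends on $x$ and vanishes when $y$ is constant, which is the behavior one would expect from a quantity that upper-bounds the mutual information.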

MINE (Belghazi et al., 2018) proposes a lower bound of the mutual information based on the Donsker-Varadhan representation of KL divergence:

$$\begin{split}I(X;Y)&:=D_{KL}(\mathbb{J}||\mathbb{M})\geq\hat{I}_{w}^{(DV)}(X;Y)\\ &:=E_{\mathbb{J}}\left[T_{w}(x,y)\right]-\log E_{\mathbb{M}}\left[e^{T_{w}(x,y)}\right],\end{split} \tag{4}$$

where $\mathbb{J}$ is the joint distribution, $\mathbb{M}$ is the product of marginal distributions, and $T_{w}$ is a discriminator. DIM (Hjelm et al., 2019) then points out that the precise value of MI is not necessarily needed and uses the Jensen-Shannon divergence to estimate it instead, for which a GAN-style loss is proposed:

$$L=E_{\mathbb{J}}\left[\log T_{w}(x,y)\right]+E_{\mathbb{M}}\left[\log(1-T_{w}(x,y))\right]. \tag{5}$$
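Given discriminator outputs on pairs drawn from $\mathbb{J}$ and from $\mathbb{M}$, Equations (4) and (5) reduce to simple averages. The sketch below uses assumed, hand-picked outputs rather than a trained $T_{w}$: raw scores for Equation (4) and probabilities in $(0,1)$ for Equation (5).

```python
import math

def dv_lower_bound(joint_scores, marginal_scores):
    """Equation (4): mean score on joint pairs minus log-mean-exp on marginal pairs."""
    joint_mean = sum(joint_scores) / len(joint_scores)
    exp_mean = sum(math.exp(t) for t in marginal_scores) / len(marginal_scores)
    return joint_mean - math.log(exp_mean)

def gan_style_objective(joint_probs, marginal_probs):
    """Equation (5): mean log T on joint pairs plus mean log(1 - T) on marginal pairs."""
    joint_mean = sum(math.log(t) for t in joint_probs) / len(joint_probs)
    marginal_mean = sum(math.log(1.0 - t) for t in marginal_probs) / len(marginal_probs)
    return joint_mean + marginal_mean

# Assumed discriminator outputs, chosen by hand for illustration.
uninformative_dv = dv_lower_bound([0.0, 0.0], [0.0, 0.0])
informative_dv = dv_lower_bound([2.0, 2.0], [-2.0, -2.0])
chance_level = gan_style_objective([0.5, 0.5], [0.5, 0.5])
confident = gan_style_objective([0.9, 0.9], [0.1, 0.1])
print(uninformative_dv, informative_dv, chance_level, confident)
```

Both quantities grow as the discriminator separates joint pairs from marginal pairs more clearly, which is why maximizing them over $T_{w}$ serves as a proxy for maximizing mutual information.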

3. Problem Formulation

Click-through rate (CTR) prediction estimates the probability that a user clicks a candidate item, which is an essential task in recommender systems. Conventional CTR prediction methods (Rendle, 2010; Juan et al., 2016; He et al., 2017; Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018; Liu et al., 2020; Song et al., 2019; Qu et al., 2018; Lin et al., 2023; Wang et al., 2017) can be formulated as

$$p(y|X,\theta), \tag{6}$$

where $X=\left[x_{1},x_{2},\dots,x_{F}\right]$ is the input consisting of features from $F$ feature fields and $\theta$ is the model parameter. These methods make predictions based solely on the target features and the model parameters. Another line of methods, which models user behavior patterns (Zhou et al., 2018, 2019; Pi et al., 2019; Feng et al., 2019; Pi et al., 2020), has an important impact on personalized recommendation and brings large performance gains. It can be formulated as

$$p(y|X,D_{his},\theta), \tag{7}$$

where $D_{his}$ represents the user behavior histories. These methods focus on modeling users’ historical patterns by directly taking their historical behaviors as input.

In this paper, we aim to extract the diverse patterns in old data and build an efficient parametric knowledge base to supplement the target prediction. The framework can be formulated as

$$KB_{\theta*}\leftarrow D_{old}, \tag{8}$$
$$p(y|X,KB_{\theta*},\theta), \tag{9}$$
$$p(y|X,D_{his},KB_{\theta*},\theta), \tag{10}$$

where $D_{old}$ denotes the old data and $KB_{\theta*}$ denotes the knowledge base we construct based on old data.

4. Methodology

Our framework can be divided into two stages: knowledge compression and knowledge utilization. In the knowledge compression stage, we compress the essential and disentangled knowledge within the old data into compact vectors stored in the parameters of the parametric knowledge base. This stage consists of two modules: a knowledge extractor that extracts informative patterns from the raw data features and a knowledge encoder that encodes and memorizes patterns of arbitrary scales. The compression process is guided by the essential and disentangled principles to obtain robust and generalized knowledge within limited model parameters. After the compression, the knowledge extractor and knowledge encoder make up the parametric knowledge base and serve as the retrieval key generator and the memory network, respectively, in the following phase. In the knowledge utilization stage, we access the knowledge base trained in the previous stage and adapt the knowledge for enhanced prediction.

4.1. Knowledge Compression

During this stage, we compress the knowledge of old data into a parametric knowledge base consisting of a knowledge extractor $f(\cdot)$ and a knowledge encoder $g(\cdot)$ , guided by the essential and disentangled principles. The overall knowledge compression process can be formulated as

$$\underbrace{X\xrightarrow[]{f(\cdot)}\{\tilde{\mathbf{s_{j}}}\}\xrightarrow[]{g(\cdot)}\{\mathbf{s_{j}}\}\xrightarrow[]{Aggregator}{\mathbf{c}}}_{\text{Regularized by Essential \& Disentangled}}, \tag{11}$$

where $X$ denotes a sample of the old data, $\{\tilde{\mathbf{s_{j}}}\}$ denotes the different patterns of the input, $\{\mathbf{s_{j}}\}$ denotes the encoded knowledge vectors, and $\mathbf{c}$ denotes the final abbreviated representation of the input. For simplicity, we omit the index of the data sample. After training, the knowledge extractor $f(\cdot)$ and knowledge encoder $g(\cdot)$ constitute the parametric knowledge base into which the vast old data documents are compressed.

4.1.1. Essential & Disentangled Principles

The whole compression process is regularized by the essential and the disentangled principles. The essential principle encourages a compressed representation that captures the sufficient and minimal information relevant to the task. Eliminating irrelevant or noisy information helps the parametric knowledge base focus only on the invariant patterns and prevents it from being biased by spurious patterns. The disentangled principle decomposes the entanglement of the memorized knowledge patterns. This eliminates redundancy in the memorized knowledge and promotes the disentanglement of different underlying factors, which helps the parametric knowledge base generalize its knowledge to new instances or distributions more effectively. The two principles are formulated as follows:

• Essential. $\min I(X;\{\mathbf{s_{j}}\})-\alpha I(\{\mathbf{s_{j}}\};y)$ . The generated knowledge representations should contain as much of the task-relevant information of the original input and as little of the task-irrelevant information as possible. This helps to distill complex data into a sufficient and minimal representation, enabling better generalization and robustness.

• Disentangled. $\min\sum_{i\neq j}I(\mathbf{s_{i}};\mathbf{s_{j}})$ . The generated knowledge representations should contain different aspects of information about the input. This helps to reduce memory redundancy and disentangle underlying factors for better invariance.

Together, the two principles contribute to sufficient, minimal, and invariant knowledge representations, which improve the expressiveness and generalization of the parameters of the knowledge base without increasing its size. The objective of the two principles can be formulated as

$$\min I(X;\{\mathbf{s_{j}}\})-\alpha I(\{\mathbf{s_{j}}\};y)+\beta\sum_{i\neq j}I(\mathbf{s_{i}};\mathbf{s_{j}}), \tag{12}$$

where $\alpha$ and $\beta$ are hyperparameters to scale the loss components.

Due to the complexity of computing mutual information among a set of variables, we can relax the above objective and obtain the following loss:

$$l_{reg}=\underbrace{-\alpha\sum_{j}I(\mathbf{s_{j}};y)+I(X;\mathbf{c})}_{\text{Essential}}+\underbrace{\beta\sum_{i\neq j}I(\mathbf{s_{i}};\mathbf{s_{j}})}_{\text{Disentangled}}, \tag{13}$$

where $\mathbf{c}$ is the abbreviated representation of $X$ .

To achieve the essential objective, we need to 1) maximize the mutual information between each encoded knowledge vector and the label and, at the same time, 2) minimize the mutual information between the abbreviated representation and the input. Together these strike a balance between compression and prediction. For the first term, the difficulty is that the label space is discrete and contains only two values. Therefore, we follow DIM (Hjelm et al., 2019) and maximize the distance between the joint distribution and the marginal distribution with a discriminator $D_{\theta}$ . Concretely, a sample from the joint distribution pairs a pattern representation $\mathbf{s_{j}}$ with a random embedded input $\mathbf{c^{+}}$ that has the same label, and a sample from the marginal distribution pairs it with a random embedded input $\mathbf{c^{-}}$ that has the opposite label:

$$\begin{split}\max\sum_{j}I(\mathbf{s_{j}};y)&=\max\sum_{j}\left(\log D_{\theta}(\mathbf{s_{j}},\mathbf{c^{+}})\right.\\ &\left.+\log\left(1-D_{\theta}(\mathbf{s_{j}},\mathbf{c^{-}})\right)\right).\end{split} \tag{14}$$

To minimize the mutual information between the abbreviated representation and the input, we minimize the upper bound of the mutual information proposed in (Cheng et al., 2020):

$$\min I(X;\mathbf{c})=\min I_{vCLUB}(X;\mathbf{c}). \tag{15}$$

To achieve the disentangled objective, we minimize the mutual information among the encoded knowledge vectors. Since the number of combinations of different vectors is polynomial, it is hard to train a mutual information estimator for each pair of vectors. Instead, we use a loss that is equivalent to the contrastive loss, which enlarges the discrepancy among different patterns:

$$\min\sum_{i\neq j}I(\mathbf{s_{i}};\mathbf{s_{j}})=\min-\sum_{i}\log\frac{z(\mathbf{s_{i}})z(\mathbf{s_{i}})}{\sum_{j\neq i}z(\mathbf{s_{i}})z(\mathbf{s_{j}})}, \tag{16}$$

where $z$ is the augmentation operator defined by

$$z=MLP(Norm(Dropout(\cdot))). \tag{17}$$
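The effect of the contrastive term in Equation (16) can be illustrated with a small sketch. It makes two assumptions that go beyond the text: each product is taken to be an exponentiated cosine similarity with a hypothetical `temperature`, as in common InfoNCE-style losses, and the augmentation $z$ of Equation (17) is replaced by the identity for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def disentangle_loss(vectors, temperature=1.0):
    """Contrastive loss in the spirit of Equation (16), under the stated assumptions."""
    loss = 0.0
    for i, s_i in enumerate(vectors):
        # Numerator: similarity of a vector with itself (augmentation omitted here).
        positive = math.exp(cosine(s_i, s_i) / temperature)
        # Denominator: similarities with all the other knowledge vectors.
        negatives = sum(
            math.exp(cosine(s_i, s_j) / temperature)
            for j, s_j in enumerate(vectors)
            if j != i
        )
        loss -= math.log(positive / negatives)
    return loss

orthogonal = disentangle_loss([[1.0, 0.0], [0.0, 1.0]])  # two unrelated patterns
aligned = disentangle_loss([[1.0, 0.0], [1.0, 0.0]])     # two redundant patterns
print(orthogonal, aligned)
```

Under these assumptions, redundant (aligned) vectors incur a higher loss than orthogonal ones, so minimizing the loss pushes the knowledge vectors apart.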

4.1.2. Knowledge Extractor $f(\cdot)$

A sample of the data could be represented by $X=\left[x_{1},x_{2},\dots,x_{F}\right]$ , where $x_{i}$ is the feature value of the $i$ -th field and $F$ is the number of feature fields. For each sample of the old data, we generate different masks to obtain different patterns and encode the patterns for knowledge. Concretely, for an arbitrary input $X$ , we use a global attentive readout to extract the mutual relations among features and obtain the context-aware features for mask generation.

Firstly, we embed the features of input $X=\left[x_{1},x_{2},\dots,x_{F}\right]$ by

$$\forall x_{i}\in X,\quad\mathbf{x}_{i}=\Phi_{1}(x_{i}), \tag{18}$$
$$\mathbf{H_{0}}=\left[\mathbf{x}_{1},\dots,\mathbf{x}_{F}\right], \tag{19}$$

where $\Phi_{1}$ denotes the embedding operation used to generate embeddings for knowledge extraction, $\mathbf{H_{0}}\in R^{F\times d}$ , and $d$ is the embedding size.

Then we feed the embedded input into a global self-attention layer to obtain the context-aware features for pattern extraction.

$$\mathbf{H}\xleftarrow{GlobalAtt}\mathbf{H_{0}}. \tag{20}$$

After the global self-attention readout, we generate the mask for $K$ patterns of the input by

$$\mathbf{M}=\mathbf{H}\mathbf{P}, \tag{21}$$

where $\mathbf{M}\in R^{F\times K}$ is the generated mask, $\mathbf{P}\in R^{d\times K}$ is a learnable parameter, and $K$ denotes the number of knowledge patterns we generate.

To make each entry of the mask approximately binary and ensure the gradient consistency, we map each entry $m_{ij}$ of the generated mask matrix $\mathbf{M}$ using the hard concrete distribution (Maddison et al., 2017), which is a continuous relaxation of discrete random variables. Concretely, we obtain the masks as

$$u_{ij}\sim U(0,1), \tag{22}$$
$$m_{ij}=\sigma\left(\left(\log u_{ij}-\log(1-u_{ij})+\log m_{ij}\right)/\beta\right), \tag{23}$$
$$m_{ij}=Tanh\left(m_{ij}(\delta-\gamma)+\gamma\right), \tag{24}$$

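where $u_{ij}$ is sampled from a uniform distribution and $\sigma$ denotes the sigmoid function. Here $\beta$ plays the role of a temperature (it is not the loss weight in Equation (12)), and $\gamma$ and $\delta$ set the interval $[\gamma,\delta]$ to which the relaxed value is stretched before the $Tanh$ is applied.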
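The mapping in Equations (22)-(24) can be sketched for a single mask entry as follows. The values of `beta`, `gamma`, and `delta`, the clipping constant `eps`, and the function names are assumptions made for illustration; the sketch also assumes a positive score $m_{ij}$ so that $\log m_{ij}$ is defined.

```python
import math
import random

def relaxed_mask_entry(m, u, beta=0.5, gamma=-0.1, delta=1.1):
    """Apply Equations (23)-(24) to one positive mask score m with noise u in (0, 1)."""
    logit = (math.log(u) - math.log(1.0 - u) + math.log(m)) / beta
    s = 1.0 / (1.0 + math.exp(-logit))             # Equation (23)
    return math.tanh(s * (delta - gamma) + gamma)  # Equation (24)

def sample_mask_entry(m, rng, eps=1e-6):
    """Draw u ~ U(0, 1) as in Equation (22), clipped away from 0 and 1 for stability."""
    u = min(max(rng.random(), eps), 1.0 - eps)
    return relaxed_mask_entry(m, u)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_mask_entry(5.0, rng) for _ in range(5)]
midpoint = relaxed_mask_entry(1.0, 0.5)  # zero logit, so s = 0.5
print(samples, midpoint)
```

For a fixed noise value, a larger score yields a larger mask entry, and lowering `beta` sharpens the sigmoid so that entries move toward the two ends of the stretched range.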

After we obtain the masks, we apply them on the original input to generate different patterns.

$$\mathbf{H^{\prime}}=\left[\Phi_{2}(x_{1}),\dots,\Phi_{2}(x_{F})\right], \tag{25}$$
$$\tilde{\mathbf{s}}_{j}=\mathbf{H^{\prime}}\odot\mathbf{M}_{:,j},\quad\forall j\in\{1,\dots,K\}, \tag{26}$$

where $\Phi_{2}$ is the embedding operation used to generate embeddings for knowledge memorization, $\odot$ denotes the element-wise product, and $\tilde{\mathbf{s}}_{j}\in R^{F\times d}$ is a pattern vector.
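The shapes involved in Equations (21) and (26) can be checked with a small sketch. All numbers are assumed toy values ($F=3$, $d=2$, $K=2$), and a hard threshold stands in for the relaxation of Equations (22)-(24) to keep the example short.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def extract_patterns(h_prime, mask):
    """Equation (26): pattern j rescales the embedding of field i by mask[i][j]."""
    num_fields, num_patterns = len(mask), len(mask[0])
    return [
        [[mask[i][j] * value for value in h_prime[i]] for i in range(num_fields)]
        for j in range(num_patterns)
    ]

# Assumed toy sizes: F = 3 fields, d = 2 embedding dimensions, K = 2 patterns.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # context-aware features H (F x d)
p = [[1.0, 0.0], [0.0, 1.0]]              # learnable projection P (d x K)
mask_scores = matmul(h, p)                # Equation (21): M has shape F x K

# Stand-in for Equations (22)-(24): a hard threshold instead of the relaxation.
binary_mask = [[1.0 if v > 0.5 else 0.0 for v in row] for row in mask_scores]

h_prime = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # memorization embeddings H' (F x d)
patterns = extract_patterns(h_prime, binary_mask)
print(patterns)
```

Each of the $K$ patterns keeps the full $F\times d$ shape, with the embeddings of unselected fields zeroed out.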

4.1.3. Knowledge Encoder $g(\cdot)$

After we obtain the extracted patterns $\{\tilde{\mathbf{s_{j}}}\}$ , we feed the patterns into the knowledge encoder to memorize them. Since the generated patterns are of arbitrary lengths, the encoder should be capable of handling inputs of different lengths. Hence, we adopt the classical self-attention architecture (Vaswani et al., 2017).

For every masked knowledge pattern, each entry of it attends to others through a self-attention layer as follows:

$$\mathbf{Q}=\mathbf{\tilde{s}W_{Q}},\quad\mathbf{K}=\mathbf{\tilde{s}W_{K}},\quad\mathbf{V}=\mathbf{\tilde{s}W_{V}}, \tag{27}$$
$$\mathbf{A}=softmax\left(\frac{\mathbf{QK^{T}}}{\sqrt{d}}\right), \tag{28}$$
$$Attention(\mathbf{Q,K,V})=\mathbf{AV}. \tag{29}$$
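Equations (27)-(29) can be sketched with plain nested lists as below. The identity projection matrices and the toy pattern are assumptions for illustration; a trained encoder would use learned $\mathbf{W_{Q}}$, $\mathbf{W_{K}}$, and $\mathbf{W_{V}}$ and multiple heads.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(s, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, Equations (27)-(29)."""
    q, k, v = matmul(s, w_q), matmul(s, w_k), matmul(s, w_v)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    attention = [softmax([x / scale for x in row]) for row in scores]
    return matmul(attention, v)

identity = [[1.0, 0.0], [0.0, 1.0]]   # assumed projection weights
zeros = [[0.0, 0.0], [0.0, 0.0]]
pattern = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy pattern with 3 fields

encoded = self_attention(pattern, identity, identity, identity)
reordered = self_attention(pattern[::-1], identity, identity, identity)
uniform = self_attention(pattern, zeros, identity, identity)
print(encoded)
```

Reversing the order of the fields reverses the order of the output rows and changes nothing else, which is the permutation-equivariance property the encoder relies on.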

We represent the above process with $\psi(\cdot)$ . Then the encoded vectors for each knowledge pattern and the final abbreviated representation $\mathbf{c}$ of an input could be obtained by:

$$\mathbf{s_{j}}=MLP(\psi(\tilde{\mathbf{s}}_{j})), \tag{30}$$
$$\mathbf{c}=MLP(AGG_{j}\{\mathbf{s_{j}}\}). \tag{31}$$

4.1.4. Objective

The training objective of the knowledge compression stage can then be summarized as

$$l_{compression}=l_{ce}+\lambda_{1}l_{reg}+\lambda_{2}l_{0}, \tag{32}$$

where $l_{ce}$ denotes the cross entropy loss of predicting labels on $\mathbf{c}$ , $l_{reg}$ is the loss corresponding to the two principles proposed in Equation (13), and $l_{0}$ denotes the regularization loss of the generated masks.
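How the terms of Equations (13) and (32) combine can be written out as a short sketch. The individual loss values and the weights below are assumed placeholders; in training they would come from the estimators described in Section 4.1.1.

```python
def regularization_loss(label_mi_terms, input_mi, pairwise_mi_terms, alpha, beta):
    """Equation (13): essential part plus disentangled part."""
    essential = -alpha * sum(label_mi_terms) + input_mi
    disentangled = beta * sum(pairwise_mi_terms)
    return essential + disentangled

def compression_loss(l_ce, l_reg, l_0, lambda_1, lambda_2):
    """Equation (32): weighted sum of the three training losses."""
    return l_ce + lambda_1 * l_reg + lambda_2 * l_0

# Assumed placeholder values for two knowledge vectors.
l_reg = regularization_loss(
    label_mi_terms=[0.2, 0.3],    # estimates of I(s_j; y)
    input_mi=0.4,                 # estimate of I(X; c)
    pairwise_mi_terms=[0.1, 0.1], # estimates of I(s_i; s_j) for i != j
    alpha=1.0,
    beta=0.5,
)
total = compression_loss(l_ce=0.6, l_reg=l_reg, l_0=0.2, lambda_1=0.1, lambda_2=0.5)
print(l_reg, total)
```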

4.2. Knowledge Utilization

After we establish the parametric knowledge base $KB_{\theta*}=g(f(\cdot))$ using the old data, we can extract the essential and disentangled knowledge vectors to enhance the target prediction. When making a prediction for a target sample $X$ , we extract the knowledge from the knowledge base and append it to the input of the backbone model. The process can be formulated as:

$$\hat{y}=\Phi_{base}\left(MLP\left(KB_{\theta*}(X)\right),X\right), \tag{33}$$

where $\Phi_{base}$ denotes the backbone model and $KB_{\theta*}$ denotes the frozen parametric knowledge base.

The final prediction is optimized by the cross-entropy loss:

$$l_{pred}=-\sum_{\langle X,y\rangle\in D_{train}}\left(y\log\hat{y}+(1-y)\log(1-\hat{y})\right). \tag{34}$$

5. Experiments

In this section, we empirically evaluate the proposed model on the click-through rate (CTR) prediction task with two datasets. The evaluation is organized around the following research questions.

RQ1: Is the proposed parametric knowledge base model-agnostic? Does it offer performance gains for all the base models?

RQ2: Do the two proposed rules improve the quality of the parametric knowledge base?

RQ3: Are the components of the parametric knowledge base effective? How does the number of knowledge vectors per data instance influence the performance?

RQ4: Are the objectives of the proposed constraints (i.e., the essential and disentangled principles) achieved during training?

5.1. Experimental Setup

5.1.1. Datasets

We use two datasets to validate the proposed model.

Table 1. Statistics of the Used Datasets.

Dataset	# Users	# Items	# Documents	# Fields	# Features
AD	1,061,768	827,009	25,029,435	12	3,029,333
Eleme	5,782,482	1,853,764	49,114,930	17	16,516,885

• Ali Display (https://tianchi.aliyun.com/dataset/56). It is a dataset provided by Alibaba to estimate the click-through rate of Taobao display ads.

• Eleme (https://tianchi.aliyun.com/dataset/131047). It is a dataset provided by Eleme, which offers take-away services to users.

Table 1 summarizes the datasets. For both datasets, clicked samples are treated as positive samples and exposed but not clicked samples are treated as negative samples. We split the data by global timestamps (Qin et al., 2020, 2021). For regular baseline models, we use the logs before $T_{1}$ for training, 50% of the logs after $T_{1}$ for validation, and the remaining 50% of the logs after $T_{1}$ for testing. For models supplemented with extra knowledge vectors, we use the logs before $T_{0}$ to build the knowledge base, the logs in $[T_{0},T_{1})$ to train the base models, 50% of the logs after $T_{1}$ for validation, and the remaining 50% for testing ( $T_{0}<T_{1}$ ). The training setup is illustrated in Figure 3.
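The split for the knowledge-enhanced setting can be sketched as follows. The record layout (a dict with a `ts` field), the function name, and the choice to use the earlier half of the post-$T_{1}$ logs for validation are assumptions for illustration; the text above only fixes the proportions.

```python
def temporal_split(logs, t0, t1):
    """Split logs into knowledge-base, training, validation, and test parts by timestamp."""
    knowledge_base = [r for r in logs if r["ts"] < t0]
    training = [r for r in logs if t0 <= r["ts"] < t1]
    later = sorted((r for r in logs if r["ts"] >= t1), key=lambda r: r["ts"])
    half = len(later) // 2
    # Assumption: the earlier half of the later logs is used for validation.
    return knowledge_base, training, later[:half], later[half:]

# Hypothetical logs with integer timestamps 0..9.
logs = [{"ts": t, "clicked": t % 2} for t in range(10)]
kb_part, train_part, valid_part, test_part = temporal_split(logs, t0=4, t1=6)
print(len(kb_part), len(train_part), len(valid_part), len(test_part))
```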

5.1.2. Baselines

Since our method is model-agnostic, we use different widely-used CTR models as our base models.

We first consider a group of feature interaction models. DeepFM (Guo et al., 2017) models pairwise feature interactions and uses a DNN for generalization. xDeepFM (Lian et al., 2018) uses the CIN network to model finite-order feature interactions. PNN (Qu et al., 2018) explicitly models the “AND” relation between features by introducing the product layer. AutoInt (Song et al., 2019) builds high-order feature interactions with multi-head attention. DCN (Wang et al., 2017) learns both low-order feature interactions and high-order nonlinear features efficiently. We also consider a model that extracts information from multiple samples for recommendation enhancement. DIN (Zhou et al., 2018) models user interests according to the attention scores between the candidate item and the target user’s historically clicked items.

5.1.3. Metrics

The evaluation metrics are the area under the ROC curve (AUC) and the negative log-likelihood (LogLoss).
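For reference, both metrics can be computed as in the sketch below. It uses the pairwise-ranking definition of AUC, which is quadratic in the number of samples and intended only for illustration; the labels, scores, and clipping constant `eps` are assumed values.

```python
import math

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

toy_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
toy_logloss = log_loss([1, 0], [0.5, 0.5])
print(toy_auc, toy_logloss)
```

Higher AUC and lower LogLoss indicate better performance.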

5.1.4. Hyperparameters

We use a consistent embedding size for all base models for fair comparison. For the AD dataset, the embedding size for base models is set to 32 and the size of the knowledge vectors is set to 16. For the Eleme dataset, the embedding size for base models is set to 16 and the size of the knowledge vectors is set to 8. The number of patterns to extract from each data instance is set to 20. The depth and the number of attention heads of the knowledge encoder are both 3. The learning rate is searched in $[1e-4,3e-4,5e-4,1e-3]$ , and the weight decay is searched in $[1e-4,3e-4,5e-4,5e-5,3e-5,1e-5]$ . Adam (Kingma and Ba, 2014) is used for training.

5.2. Main Results (RQ1)

Table 2. Experiment results on AD and Eleme. Since EDK is model-agnostic, we test its compatibility on different backbones. N/A denotes the original recommendation backbone. Rel. Impr. denotes the relative AUC improvement of EDK against each backbone model. The symbol * indicates statistically significant improvement with p-value $<0.001$ .

Model	Variant	AD AUC	AD Logloss	AD Rel. Impr.	Eleme AUC	Eleme Logloss	Eleme Rel. Impr.
DeepFM	N/A	0.6220	0.1949	-	0.6221	0.0907	-
DeepFM	w/ EDK	0.6378*	0.1935*	2.54%	0.6555*	0.0901*	5.36%
DCN	N/A	0.6231	0.1948	-	0.6386	0.0889	-
DCN	w/ EDK	0.6371*	0.1936*	2.25%	0.6563*	0.0881*	2.77%
PNN	N/A	0.6299	0.1944	-	0.6324	0.0884	-
PNN	w/ EDK	0.6364*	0.1935*	1.03%	0.6552*	0.0881*	3.61%
xDeepFM	N/A	0.6291	0.1943	-	0.6466	0.0883	-
xDeepFM	w/ EDK	0.6359*	0.1939*	1.08%	0.6562*	0.0887*	1.48%
AutoInt	N/A	0.6339	0.1944	-	0.6464	0.0934	-
AutoInt	w/ EDK	0.6361*	0.1937*	0.35%	0.6554*	0.0885*	1.39%
DIN	N/A	0.6233	0.1947	-	0.6411	0.0883	-
DIN	w/ EDK	0.6372*	0.1935*	2.23%	0.6556*	0.0883*	2.26%

The overall performance of the proposed method is displayed in Table 2. Six widely used conventional recommendation models serve as the backbones, whose feature interaction operators cover DNN-based, product-based, and attention-based designs. From the results, one can see that EDK consistently improves the AUC of every selected backbone, and the improvements are statistically significant with p-value $<0.001$ . This validates that compressing the knowledge of old data into parameters preserves useful knowledge and offers performance gains for recommender systems. In addition, EDK shows strong model compatibility, which demonstrates the effectiveness of our parametric knowledge base under the essential and disentangled constraints.
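The relative AUC improvement reported in Table 2 can be checked with a short calculation. The sketch below assumes the plain ratio (AUC with EDK minus backbone AUC, divided by backbone AUC); the function name is illustrative. Applied to the AD columns of Table 2, this formula matches the reported percentages after rounding to two decimals.

```python
def relative_improvement(base_auc, edk_auc):
    """Relative AUC gain of the EDK-enhanced model over its backbone."""
    return (edk_auc - base_auc) / base_auc

# (backbone AUC, AUC with EDK) on the AD dataset, copied from Table 2.
AD_AUC = {
    "DeepFM": (0.6220, 0.6378),
    "DCN": (0.6231, 0.6371),
    "PNN": (0.6299, 0.6364),
    "xDeepFM": (0.6291, 0.6359),
    "AutoInt": (0.6339, 0.6361),
    "DIN": (0.6233, 0.6372),
}

if __name__ == "__main__":
    for name, (base, edk) in AD_AUC.items():
        print(f"{name}: {relative_improvement(base, edk):.2%}")
```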

5.3. Ablation Study (RQ2-RQ3)

In this section, we conduct various ablation studies to validate the effectiveness of model designs.

5.3.1. Impact of the Two Proposed Principles (RQ2)

To make the parametric knowledge base memorize essential and disentangled knowledge patterns within limited model parameters, we propose two constraints that regularize the knowledge compression process. In this section, we investigate the effectiveness of these two constraints. We conduct experiments with the backbones on which EDK achieves the best performance, namely the DeepFM backbone for AD and the DCN backbone for Eleme. We remove the regularization terms of the essential and disentangled principles to observe their impact. The results are displayed in Table 3.

Table 3. Impact of the Proposed Principles.

Setting	AD AUC	AD Logloss	Eleme AUC	Eleme Logloss
EDK	0.6378	0.1935	0.6563	0.0881
w/o Disentangled	0.6356	0.1939	0.6433	0.0886
w/o Essential	0.6347	0.1937	0.6520	0.0884
w/o both	0.6304	0.1940	0.6411	0.0888

From the results, we can see that the proposed essential and disentangled constraints improve the quality of the knowledge memorized by the parametric knowledge base. Removing either constraint lowers the AUC on both datasets, and removing both gives the lowest AUC (0.6304 on AD and 0.6411 on Eleme, against 0.6378 and 0.6563 for the full EDK). Each principle therefore contributes on its own, while the two complement each other and yield higher-quality knowledge when combined.

5.3.2. Impact of the Knowledge Extractor (RQ3)

The patterns of the old data space are diverse and complex, and they are hard to memorize fully because of spurious patterns and entanglement. To memorize the informative patterns within the data space, a knowledge extractor is designed to extract the essential and disentangled patterns within data points. When the number of knowledge patterns per data instance is set to 1, the method reduces to memorizing a raw data point in the space. In this section, we vary the number of knowledge patterns per data point to study the impact of the knowledge extractor. As before, we select DeepFM as the backbone for the AD dataset and DCN as the backbone for the Eleme dataset. We test the values $\{5,10,15,20,25\}$ , and the results are shown in Figure 4.

From the results, we can see that increasing the number of knowledge vectors per data point generally improves performance, which validates the effectiveness of the knowledge extractor. Beyond a certain point, however, further increasing the number of knowledge patterns brings only minor improvements or even impairs the final performance. This is because additional knowledge patterns may introduce redundancy and noise, which hurts the robustness of the model parameters.

5.4. Objectives of the Principles (RQ4)

In this section, we analyze whether the proposed two principles are achieved during the knowledge compression process.

5.4.1. Loss Curves of Different Components

First of all, we study whether the losses corresponding to the two principles are learned well during the knowledge compression process.

The learning curves of the different losses on the AD dataset are illustrated in Figure 5. The blue line displays the loss curve for the disentangled constraint, which is $\sum_{i\neq j}I(s_{i};s_{j})$ . The red line and the black line together make up the loss for the essential constraint, which is $-\alpha\sum_{j}I(\mathbf{s_{j}};y)+I(X;\mathbf{c})$ . Concretely, the black line illustrates the curve for $-\sum_{j}I(\mathbf{s_{j}};y)$ , which promotes the memorization of task-relevant information, and the red line illustrates the curve for $I(X;\mathbf{c})$ , which compresses the representation and helps to filter out noisy information.

From the learning curves, we can see that all losses are optimized well. The red curve, which corresponds to the vCLUB value, behaves differently at the start: the MI upper bound estimate increases during the first epoch and decreases in the following training steps, since the MI estimator needs some training steps before its estimates become reliable.
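To make the vCLUB quantity concrete, the following is a minimal sketch of the sample-based variational CLUB estimate of Cheng et al. (2020): the average log-density of matched pairs under a variational approximation $q(y|x)$ minus the average over all pairs. The diagonal Gaussian form of $q$, the fixed `mean_fn` and `log_var` arguments, and all function names are assumptions of this sketch. In practice $q$ is a trained network, which is why the estimate is unreliable until the estimator has been fitted.

```python
import math
import random

def gaussian_log_density(y, mean, log_var):
    """Log-density of y under a diagonal Gaussian N(mean, exp(log_var))."""
    return -0.5 * sum(
        lv + (yi - mi) ** 2 / math.exp(lv) + math.log(2 * math.pi)
        for yi, mi, lv in zip(y, mean, log_var)
    )

def vclub_estimate(xs, ys, mean_fn, log_var):
    """Sample-based vCLUB estimate of an upper bound on I(x; y).

    xs, ys: lists of equal length holding paired samples (lists of floats).
    mean_fn: maps an x sample to the mean of q(y | x) (assumed given here).
    """
    means = [mean_fn(x) for x in xs]
    n = len(xs)
    # Matched pairs (x_i, y_i).
    positive = sum(gaussian_log_density(y, m, log_var) for y, m in zip(ys, means)) / n
    # All pairs (x_i, y_j), approximating the product of marginals.
    negative = sum(gaussian_log_density(y, m, log_var) for m in means for y in ys) / (n * n)
    return positive - negative

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    noise = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
    identity = lambda x: x
    print("dependent (y = x):", vclub_estimate(xs, xs, identity, [0.0, 0.0]))
    print("independent noise:", vclub_estimate(xs, noise, identity, [0.0, 0.0]))
```

With `y = x` the estimate is clearly positive, while for independent noise it fluctuates around zero, which is the behaviour expected of a mutual information bound.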

5.4.2. Visualizations of the Extracted Patterns

Then, we visualize the extracted patterns for more explicit analysis. Concretely, we analyze the pattern masks generated by the knowledge extractor and the corresponding representation encoded by the knowledge encoder.

Firstly, we study the masks generated by the trained knowledge extractor. Note that the entries of the masks are continuous. Here we regard an entry with a value larger than $0.5$ as an existing feature and count such entries in each generated mask. The resulting distribution of extracted pattern scales is illustrated in Figure 6.
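The thresholding step described above can be sketched as follows. The function names and the toy mask values are hypothetical; only the $0.5$ threshold and the counting of entries above it come from the text.

```python
from collections import Counter

def pattern_scales(masks, threshold=0.5):
    """Count, for each continuous mask, the entries strictly above the threshold."""
    return [sum(1 for value in mask if value > threshold) for mask in masks]

def scale_distribution(masks, threshold=0.5):
    """Histogram mapping pattern scale (number of kept features) to frequency."""
    return Counter(pattern_scales(masks, threshold))

if __name__ == "__main__":
    # Toy masks over four feature fields (illustrative values only).
    masks = [
        [0.9, 0.1, 0.2, 0.3],  # a single feature
        [0.8, 0.7, 0.1, 0.2],  # a second-order interaction
        [0.9, 0.8, 0.6, 0.4],  # a denoised view keeping most fields
    ]
    print(pattern_scales(masks))
    print(dict(scale_distribution(masks)))
```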

From the results, we find that the scales of the extracted patterns vary. For Eleme, the extracted patterns include distinct features, feature interactions, and denoised views of the original inputs, whereas for AD they mainly consist of high-order feature interactions and denoised views of the inputs.

In addition, we study the representations of the raw data instance features and of the extracted patterns. We use t-SNE to visualize the distribution of the encodings of raw data instances and of the extracted pattern vectors, as displayed in Figure 7. Each point in the left part of the figure denotes the representation of a raw data instance, and each point in the right part represents an encoded pattern vector. From the results, we can see that clearly different manifolds exist in the representation space of the extracted patterns for both the AD and Eleme datasets, while the raw data instances merely fall into different clusters. This validates the effectiveness of the proposed method in extracting diverse and complex patterns from the old data.

6. Conclusion

In this paper, we design EDK, a parametric knowledge base that compresses massive old data into compact knowledge stored in parameters and is regularized by the proposed essential and disentangled principles. The two principles promote robust and generalized knowledge in the parametric knowledge base without increasing its size. The essential principle strikes a balance between compression and prediction: it filters out the noisy and irrelevant information of the old data and preserves the task-relevant information. The disentangled principle reduces the redundancy of the stored information and decomposes the entanglement of the knowledge representations, which promotes the invariance of the produced knowledge vectors. EDK is model-agnostic, flexible, and tailored for fine-grained instance-level knowledge augmentation. Experiments and ablation studies demonstrate its effectiveness.

References

Achille and Soatto (2018) Alessandro Achille and Stefano Soatto. 2018. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19, 1 (2018), 1947–1980.
Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, and Aaron C. Courville. 2018. Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Vol. 80. PMLR, 530–539.
Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS@RecSys.
Cheng et al. (2020) Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, and Lawrence Carin. 2020. CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1779–1788.
Christensen and Schiaffino (2011) Ingrid A. Christensen and Silvia Schiaffino. 2011. Entertainment recommender systems for group of users. Expert Systems with Applications 38, 11 (2011), 14127–14135. https://doi.org/10.1016/j.eswa.2011.04.221
Deng et al. (2022) Weijian Deng, Stephen Gould, and Liang Zheng. 2022. On the strong correlation between model invariance and generalization. Advances in Neural Information Processing Systems 35 (2022), 28052–28067.
Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. 2301–2307.
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI.
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW.
Hjelm et al. (2019) R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
Juan et al. (2016) Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. 2016. Field-aware factorization machines for CTR prediction. In RecSys.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1754–1763.
Lin et al. (2023) Jianghao Lin, Yanru Qu, Wei Guo, Xinyi Dai, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023. MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1384–1395.
Liu et al. (2020) Bin Liu, Chenxu Zhu, Guilin Li, Weinan Zhang, Jincai Lai, Ruiming Tang, Xiuqiang He, Zhenguo Li, and Yong Yu. 2020. AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. ACM, 2636–2645.
Maddison et al. (2017) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Montero et al. (2020) Milton Llera Montero, Casimir JH Ludwig, Rui Ponte Costa, Gaurav Malhotra, and Jeffrey Bowers. 2020. The role of disentanglement in generalisation. In International Conference on Learning Representations.
Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019. ACM, 2671–2679.
Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020. ACM, 2685–2692.
Qin et al. (2024) Jiarui Qin, Weiwen Liu, Ruiming Tang, Weinan Zhang, and Yong Yu. 2024. D2K: Turning Historical Data into Retrievable Knowledge for Recommender Systems. arXiv:2401.11478 [cs.IR]
Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
Qin et al. (2020) Jiarui Qin, Weinan Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Yong Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. ACM, 2347–2356.
Qu et al. (2018) Yanru Qu, Bohui Fang, Weinan Zhang, Ruiming Tang, Minzhe Niu, Huifeng Guo, Yong Yu, and Xiuqiang He. 2018. Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data. ACM Transactions on Information Systems (2018).
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In ICDM.
Smith and Linden (2017) Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing (2017). https://www.amazon.science/publications/two-decades-of-recommender-systems-at-amazon-com
Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019. ACM, 1161–1170.
Tishby et al. (2000a) Naftali Tishby, Fernando C. N. Pereira, and William Bialek. 2000a. The information bottleneck method. CoRR physics/0004057 (2000).
Tishby et al. (2000b) Naftali Tishby, Fernando C. N. Pereira, and William Bialek. 2000b. The information bottleneck method. CoRR physics/0004057 (2000).
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008.
Wang et al. (2023) Cheng Wang, Jiacheng Sun, Zhenhua Dong, Jieming Zhu, Zhenguo Li, Ruixuan Li, and Rui Zhang. 2023. Data-free Knowledge Distillation for Reusing Recommendation Models. In Proceedings of the 17th ACM Conference on Recommender Systems. 386–395.
Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, August 13 - 17, 2017. ACM, 12:1–12:7.
Wu et al. (2023) Chuhan Wu, Fangzhao Wu, Yongfeng Huang, and Xing Xie. 2023. Personalized news recommendation: Methods and challenges. ACM Transactions on Information Systems 41, 1 (2023), 1–50.
Yang et al. (2023) Tao Yang, Yuwang Wang, Cuiling Lan, Yan Lu, and Nanning Zheng. 2023. Vector-based Representation is the Key: A Study on Disentanglement and Compositional Generalization. arXiv preprint arXiv:2305.18063 (2023).
Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. 5941–5948.
Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chengru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep Interest Network for Click-Through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018. 1059–1068.
