Collaboration and Transition: Distilling Item Transitions into Multi-Query Self-Attention for Sequential Recommendation
Abstract.
Modern recommender systems employ various sequential modules such as self-attention to learn dynamic user interests. However, these methods are less effective in capturing collaborative and transitional signals within user interaction sequences. First, the self-attention architecture uses the embedding of a single item as the attention query, making it challenging to capture collaborative signals. Second, these methods typically follow an auto-regressive framework, which is unable to learn global item transition patterns. To overcome these limitations, we propose a new method called Multi-Query Self-Attention with Transition-Aware Embedding Distillation (MQSA-TED). First, we propose an $L$-query self-attention module that employs flexible window sizes for attention queries to capture collaborative signals. In addition, we introduce a multi-query self-attention method that balances the bias-variance trade-off in modeling user preferences by combining long and short-query self-attentions. Second, we develop a transition-aware embedding distillation module that distills global item-to-item transition patterns into item embeddings, which enables the model to memorize and leverage transitional signals and serves as a calibrator for collaborative signals. Experimental results on four real-world datasets demonstrate the effectiveness of the proposed modules.
1. Introduction
In recent years, there has been an increasing focus on modeling dynamic user preferences in modern recommender systems (Zhou et al., 2018; Gao et al., 2022), which is achieved by incorporating various sequential modules such as RNN (Hidasi et al., 2015), CNN (Tang and Wang, 2018a), and Transformer (Kang and McAuley, 2018; Sun et al., 2019). These sequential recommenders aim to integrate contextual factors derived from recent user interactions into personalized user interests. Contextual factors reveal typical item-to-item transition patterns. The main challenge in sequential recommendation lies in effectively learning both personalized user interests and general item transition patterns while maintaining an appropriate balance between the two factors. For instance, a user interested in sportswear may also seek a shirt after purchasing a suit. If we only rely on collaborative signals to generate recommendations, we may overlook the user’s temporary need for items to complement their suit. On the other hand, if we only consider transitional signals to make recommendations, we may neglect the user’s primary interest in sportswear. Therefore, it is crucial to leverage both signals and find a balance between them. We define the collaborative and transitional signals in the context of sequential recommendation tasks as follows:
Definition 1.1 (Collaborative Signals).
In the context of sequential recommendation, collaborative signals refer to the similarities between sequences of users’ interacted items.
Definition 1.2 (Transitional Signals).
In the context of sequential recommendation, transitional signals refer to the transition frequency between pairs of users’ interacted items.
Specifically, collaborative signals can be used by following a sequence-to-item methodology, leveraging the collaborative behavior of users to identify patterns in their interactions and recommend relevant items. On the other hand, transitional signals exploit item-to-item relationships in user interaction sequences, enabling the identification of trigger items that will lead to related purchases.
Although recent sequential recommendation methods such as SASRec (Kang and McAuley, 2018) have demonstrated remarkable performance, they have inherent limitations in effectively capturing both signals within user interaction sequences. To highlight these limitations, we conducted experiments comparing the performance of SASRec with two baseline methods: Item Transition and LightGCN (He et al., 2020). Item Transition is a memory-based, non-personalized method that makes recommendations based on the global transition frequency from the current item to candidate items, serving as a benchmark based on transitional signals (see Section 3.2 for details). LightGCN is a state-of-the-art non-sequential recommendation method that learns user and item embeddings through linear propagation on the user-item interaction graph, serving as a benchmark based on collaborative signals. We conducted experiments on two Amazon datasets, Beauty and Sports (Zhou et al., 2022), and grouped the test samples based on the transition frequency observed in the training data. Results shown in Figure 1 reveal two limitations of SASRec in leveraging both signals:
First, SASRec has a lower ability to leverage collaborative signals than LightGCN. For test samples where the item transition frequency is zero, LightGCN consistently outperforms SASRec on both datasets. This observation shows the limited ability of SASRec to generalize to test samples lacking observed item transitions. Notably, SASRec uses the embedding of the most recent item as the query in its self-attention module, which can be regarded as an attention-enhanced first-order Markov chain model that is inherently limited in leveraging collaborative signals.
Second, SASRec’s ability to leverage transitional signals is lower than Item Transition. For test samples where the item transition frequency exceeds one, i.e., the transition occurs multiple times in the training data, Item Transition significantly outperforms SASRec on both datasets. This observation highlights the limited effectiveness of SASRec in leveraging transitional signals.
Inspired by these observations, we propose a new method called Multi-Query Self-Attention with Transition-Aware Embedding Distillation (MQSA-TED) for sequential recommendation tasks, which consists of two main components to capture collaborative and transitional signals, respectively. First, we propose an $L$-query self-attention module that uses several items (instead of a single item) in windows of flexible sizes as attention queries to capture collaborative signals. By enlarging the window size $L$, the model can leverage similarities between longer-range sequences of users’ interacted items to generate recommendations. However, using a large $L$ will result in a global bias, as the recommendation will mainly focus on the user’s long-term interests while ignoring the interest shift over time. To strike a balance between bias and variance in modeling users’ dynamic interests, we introduce a multi-query self-attention method that combines long and short-query self-attentions. Second, we develop a transition-aware embedding distillation module that distills global item-to-item transition patterns into item embeddings, which serves as a calibration module that enables the model to effectively memorize and leverage transitional signals when making recommendations. Notably, our proposed method achieves inherent disentanglement of user collaboration modeling and item transition modeling by employing dual supervision: the original item embedding captures item-to-item transitional signals, while the item embedding created after self-attention modules captures sequence-to-item collaborative signals. Our contributions in this paper are summarized as follows:
- We propose an $L$-query self-attention module that uses flexible window sizes for attention queries to capture collaborative signals. We also design a multi-query self-attention method that combines long and short-query self-attentions to balance the bias-variance trade-off in modeling users’ dynamic interests.
- We develop a transition-aware embedding distillation module that distills the global item-to-item transition patterns into item embeddings to capture transitional signals, which serves as a calibration module for collaborative signals.
- We conduct extensive experiments on four real-world datasets to show the effectiveness of our proposed method. The results also highlight the different effects of the two proposed modules in improving recommendation performance.
2. Preliminaries
2.1. Problem Formulation
The sequential recommendation task aims to predict the next item that a user will interact with based on their historical interactions. Let $\mathcal{U}$ be the set of users, $\mathcal{I}$ be the set of items, and $S^u = [i_1^u, i_2^u, \ldots, i_{|S^u|}^u]$ be the interaction sequence of user $u \in \mathcal{U}$, where $|S^u|$ denotes the length of the sequence. The problem is formulated as calculating the probability that each item $i \in \mathcal{I}$ will be the next item interacted with, given the user’s historical interactions:
$$p\left(i_{|S^u|+1}^u = i \mid S^u\right) \tag{1}$$
Then the top-N items are recommended to user $u$ in descending order of these probabilities.
2.2. SASRec
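We first briefly introduce the SASRec (Kang and McAuley, 2018) model, a state-of-the-art sequential recommender built on the self-attention module of the Transformer (Vaswani et al., 2017), which serves as the base model in our approach. Given a user interaction sequence of the most recent $N$ items $S = [i_1, i_2, \ldots, i_N]$ (here we omit the superscript $u$ for simplicity), an embedding matrix $\mathbf{E} \in \mathbb{R}^{|\mathcal{I}| \times d}$ is used to convert the sequence into an embedding sequence $[\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_N]$, where $d$ is the embedding size. Then a learnable positional embedding $\mathbf{P} = [\mathbf{p}_1, \ldots, \mathbf{p}_N] \in \mathbb{R}^{N \times d}$ is added to encode the position information, resulting in $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$, where $\mathbf{x}_t = \mathbf{e}_t + \mathbf{p}_t$. Next, the Transformer (Vaswani et al., 2017) module is used:
$$\mathbf{H} = \mathrm{Transformer}(\mathbf{X}) \tag{2}$$
which adopts multiple blocks of self-attention and feed-forward networks. The self-attention layer is used to capture the long-term sequential dependency as follows:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} \tag{3}$$
$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V \tag{4}$$
where $\mathbf{Q}$ represents the queries, $\mathbf{K}$ the keys, $\mathbf{V}$ the values, and $\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$ are the projection matrices for queries, keys, and values, respectively. Finally, the model predicts ranking scores by taking the dot product between the sequence embedding $\mathbf{h}_t$ (the $t$-th row of $\mathbf{H}$) and the candidate item embeddings as $\hat{\mathbf{y}}_t = \mathrm{softmax}(\mathbf{h}_t \mathbf{E}^\top)$. The cumulative cross-entropy loss is used for model training as follows (this loss function has been shown to be more effective than the negative sampling-based binary cross-entropy loss (Li et al., 2023), and we use it for all models in our experiments):
$$\mathcal{L}_{CE} = -\sum_{t=1}^{N} \sum_{i \in \mathcal{I}} y_{t,i} \log \hat{y}_{t,i} \tag{5}$$
where $\mathbf{y}_t$ is a one-hot vector converted from the index of the ground-truth item at timestamp $t+1$.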
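The scaled dot-product self-attention described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the SASRec implementation: it uses a single head, omits the feed-forward layers, residual connections, and layer normalization, and the causal restriction (position `t` attends only to positions up to `t`) is written as an explicit loop. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention where step t sees steps 0..t."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    out = []
    for t in range(len(x)):
        # Scaled dot products between the query at t and all keys up to t.
        logits = [
            sum(a * b for a, b in zip(q[t], k[j])) / math.sqrt(d)
            for j in range(t + 1)
        ]
        weights = softmax(logits)
        # Weighted sum of the visible value vectors.
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        )
    return out
```

With identity projection matrices, the first position can only attend to itself, so its output equals its own input embedding.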
3. Methodology
In this section, we present the proposed method, which consists of two main components as illustrated in Figure 2: 1) Multi-Query Self-Attention for user collaboration modeling, and 2) Transition-Aware Embedding Distillation for item transition modeling.
3.1. Multi-Query Self-Attention for User Collaboration Modeling
We adopt SASRec as our base model owing to its strong ability to capture long-term sequential dependency and its state-of-the-art performance in sequential recommendation tasks (Kang and McAuley, 2018). SASRec uses the self-attention module in Transformer (Vaswani et al., 2017), whose main components are the queries, keys, and values, as shown in Equation (4). Specifically, the attention query at timestamp $t$ in SASRec can be expressed as follows:
$$\mathbf{q}_t = \mathbf{x}_t \mathbf{W}^Q \tag{6}$$
where $\mathbf{x}_t$ is the embedding vector of the item at timestamp $t$ after adding the positional embedding, and $\mathbf{W}^Q$ is a learnable projection matrix. Then, the attention weights assigned to historical items at timestamp $t$ are determined by the scaled dot-product between the query embedding and the key embeddings as shown in Equation (3). Therefore, the attention weights are dominated by the single item at timestamp $t$, leading to a type of short-query self-attention.
However, this type of self-attention is limited in leveraging collaborative signals, especially when the item at timestamp $t$ is inconsistent with the user’s primary preference. Specifically, SASRec can be viewed as a self-attention-enhanced first-order Markov chain model, and its recommendation results can be significantly affected by a minor change in the order of users’ interacted item sequences, such as swapping the position of the last two items. In other words, SASRec may generalize poorly on test samples lacking observed item transitions. Yet real-world recommendation scenarios such as restaurant recommendations on Yelp have shown that user interests are relatively stable and less sensitive to the order of several recent choices (Zhu et al., 2021), which SASRec may have difficulty coping with. To address this limitation, we propose an $L$-query self-attention approach. First, we define the $L$-query self-attention as follows:
Definition 3.1 ($L$-query Self-Attention).
An $L$-query self-attention is a type of self-attention module that uses the embeddings, or transformed representations, of the items (tokens) at the $L$ most recent timestamps as the attention query.
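Here we use simple mean-pooling over the embeddings of the last $L$ items at timestamp $t$ as the query embedding:
$$\mathbf{q}_t^{(L)} = \frac{1}{L} \sum_{k=t-L+1}^{t} \mathbf{x}_k \mathbf{W}^Q \tag{7}$$
where $\mathbf{x}_k$ is the position-aware embedding of the item at timestamp $k$, $\mathbf{W}^Q$ is the query projection matrix, and $L$ is a hyperparameter that controls the range of the attention query. Alternatively, other functions can be used to generate the query embedding, such as a weighted summation with time decay.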
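The mean-pooled query described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `mean_pooled_inputs` is hypothetical, and early positions with fewer than `window` preceding items simply average over what is available (the text does not specify this boundary case). Because the projection is linear, pooling the input embeddings and then projecting gives the same query as projecting first and then pooling.

```python
from typing import List

Matrix = List[List[float]]


def mean_pooled_inputs(x: Matrix, window: int) -> Matrix:
    """For each timestamp t, average the embeddings of the last `window` items."""
    if window < 1:
        raise ValueError("window must be at least 1")
    pooled = []
    for t in range(len(x)):
        # Assumption: truncate the window at the start of the sequence.
        start = max(0, t - window + 1)
        recent = x[start : t + 1]
        pooled.append([sum(col) / len(recent) for col in zip(*recent)])
    return pooled


# Three items with 2-dimensional embeddings.
x = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(mean_pooled_inputs(x, window=2))  # [[1.0, 0.0], [2.0, 1.0], [4.0, 3.0]]
```

With `window=1` the function returns the inputs unchanged, which corresponds to the single-item query used by the short-query attention.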
It is important to note that the hyperparameter $L$ controls the range of the historical context in self-attention. Using a large value of $L$ means that the model relies on long-range historical items to represent user interests, which contributes to capturing collaborative signals but may accumulate bias as user interests may shift over time. Conversely, using a small value of $L$ means that the model adopts the latest interacted items to represent user interests but can introduce variance due to the small number of used items. To balance the bias-variance trade-off, we propose a Multi-Query Self-Attention (MQSA) method that combines the short-query self-attention (with $L = 1$, similar to SASRec) with the long-query self-attention (with a larger $L$) using a hyperparameter $\alpha$:
$$\mathbf{h}_t = \alpha\, \mathbf{h}_t^{\mathrm{long}} + (1 - \alpha)\, \mathbf{h}_t^{\mathrm{short}} \tag{8}$$
where $\mathbf{h}_t^{\mathrm{long}}$ and $\mathbf{h}_t^{\mathrm{short}}$ denote the outputs of the long-query and short-query self-attention modules at timestamp $t$, respectively. Then, the sequence embedding $\mathbf{h}_t$ is used along with the embeddings of candidate items to predict their ranking scores through dot product. Notably, we could also allow the model to learn the optimal $\alpha$. However, simultaneously learning the weights and the embeddings is challenging due to the inherent complexity. We could also incorporate more than two query lengths. We leave these for exploration in future work.
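The combination of short-query and long-query outputs, followed by dot-product scoring, can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and the convention that `alpha` weights the long-query branch (so that `alpha = 0` falls back to the short-query output) is an assumption of this example.

```python
from typing import List


def combine_multi_query(h_short: List[float], h_long: List[float], alpha: float) -> List[float]:
    """Convex combination of the two sequence embeddings.

    Assumption: alpha weights the long-query output and (1 - alpha)
    weights the short-query output.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * l + (1.0 - alpha) * s for s, l in zip(h_short, h_long)]


def score_items(h: List[float], item_embeddings: List[List[float]]) -> List[float]:
    """Ranking score of every candidate item: dot product with the sequence embedding."""
    return [sum(a * b for a, b in zip(h, e)) for e in item_embeddings]


h = combine_multi_query(h_short=[1.0, 0.0], h_long=[0.0, 1.0], alpha=0.25)
print(h)                                         # [0.75, 0.25]
print(score_items(h, [[1.0, 0.0], [0.0, 1.0]]))  # [0.75, 0.25]
```

Keeping `alpha` as a fixed hyperparameter, as in this sketch, mirrors the choice described in the text; learning it jointly with the embeddings is left open there.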
It is worth mentioning that the formulation of MQSA shares similar ideas with some approaches in the literature, such as FPMC (Rendle et al., 2010) and Fossil (He and McAuley, 2016a), which explicitly model long-term user interests by employing user or item embeddings, respectively, and combine them with factorized Markov chains for sequential recommendation tasks. Compared to Fossil, which uses all of a user’s interacted items, MQSA introduces flexible window sizes over the last $L$ items to control the bias-variance trade-off. Furthermore, MQSA employs self-attention modules to enhance expressiveness, resulting in improved performance compared to the use of pure item embeddings in Fossil.
3.2. Transition-Aware Embedding Distillation for Item Transition Modeling
Sequential recommendation models have demonstrated their effectiveness in enhancing recommendation accuracy by capturing long-term user interests (Hidasi et al., 2015; Kang and McAuley, 2018; Zhou et al., 2022). However, these models may have limitations in leveraging the global item-to-item transitional signals. Specifically, most existing methods follow an auto-regressive framework (Kang and McAuley, 2018; Zhou et al., 2022). For each user, their preference at timestamp $t$ is learned based on their interacted items up to and including $t$, which is then used to predict the item at timestamp $t+1$. Nevertheless, this framework does not enable the model to learn the global item-to-item transition patterns. In other words, all items the user has not interacted with are treated equally, without considering which items the current item is more likely to trigger.
To address this limitation, we propose a heuristic recommender based on item transitions and then develop a knowledge distillation method to integrate these global item transition patterns into sequential models. Specifically, we construct a global item transition graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ represents item nodes and $\mathcal{E}$ represents transition edges between items. $\mathcal{G}$ is a weighted and directed graph, where the weight of each edge represents the transition frequency between two items within a time span $T$, based on all user interaction sequences. Note that the time span hyperparameter $T$ is used to control the long-term item transition patterns and is set to $1$ by default (i.e., only transitions between directly adjacent items are considered). We use the adjacency matrix $\mathbf{A}$ of $\mathcal{G}$ as the heuristic recommender, where $a_{ij}$ is the transition frequency from item $i$ to item $j$, as shown in Figure 2. It is a memory-based non-personalized method that recommends items based on the transition frequency from the current item to candidate items, as introduced in our preliminary experiments in Section 1.
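The transition-frequency recommender described above can be sketched with nested dictionaries as a sparse adjacency structure. This is a minimal illustration, not the authors' code: the function names and the `span` parameter name are hypothetical, and ties are broken by item id purely to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_transition_counts(
    sequences: Sequence[Sequence[int]], span: int = 1
) -> Dict[int, Dict[int, int]]:
    """Count directed transitions src -> dst where dst follows src within `span` steps."""
    counts: Dict[int, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for pos, src in enumerate(seq):
            # span=1 keeps only transitions between directly adjacent items.
            for dst in seq[pos + 1 : pos + 1 + span]:
                counts[src][dst] += 1
    return counts


def recommend_by_transition(
    counts: Dict[int, Dict[int, int]], current_item: int, n: int
) -> List[int]:
    """Non-personalized top-n: items most frequently reached from `current_item`."""
    row = counts.get(current_item, {})
    ranked = sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


train = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
counts = build_transition_counts(train)
print(recommend_by_transition(counts, current_item=2, n=2))  # [3, 4]
```

An item that never appears as a transition source yields an empty recommendation list, which reflects the limited generalization of a purely memory-based method.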
To distill the item transitions into the sequential model, we propose a Transition-Aware Embedding Distillation (TED) method. First, we normalize the transition frequencies $a_{ij}$ from item $i$ to item $j$ using a row normalization approach as $\tilde{a}_{ij} = a_{ij} / \sum_{k} a_{ik}$. Then, we use a softmax function with temperature $\tau$ to generate pseudo-labels for knowledge distillation:
$$p_{ij} = \frac{\exp(\tilde{a}_{ij} / \tau)}{\sum_{k} \exp(\tilde{a}_{ik} / \tau)} \tag{9}$$
where a higher value of $\tau$ generates a softer probability distribution over items (Hinton et al., 2015).
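The pseudo-label generation described above (row normalization followed by a temperature softmax) can be sketched for a single row of transition frequencies. This is an illustrative sketch: the function name is hypothetical, and leaving an all-zero row at zero before the softmax (which then yields a uniform distribution) is an assumption, since the text does not discuss items with no outgoing transitions.

```python
import math
from typing import List


def teacher_distribution(freq_row: List[float], temperature: float) -> List[float]:
    """Turn one row of transition counts into soft pseudo-labels."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(freq_row)
    # Row normalization; assumption: an all-zero row stays all-zero.
    normalized = [f / total if total > 0 else 0.0 for f in freq_row]
    logits = [a / temperature for a in normalized]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


row = [2.0, 1.0, 0.0]
soft = teacher_distribution(row, temperature=1.0)
hard = teacher_distribution(row, temperature=0.1)
print(soft[0] < hard[0])  # True: a lower temperature sharpens the distribution
```

The comparison at the end illustrates the role of the temperature: smaller values concentrate probability mass on the most frequent transitions, giving harder pseudo-labels.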
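We adopt a simple factorization model as the student model, which predicts the item transition distribution of item $i$ by taking the dot product between its embedding vector $\mathbf{e}_i$ and the embedding matrix $\mathbf{E}$ before the self-attention layers, where the dropout (Srivastava et al., 2014) strategy is also used for robust learning. We apply the softmax function with temperature $\tau$ to obtain the predicted transition probabilities:
$$\hat{p}_{ij} = \frac{\exp(\mathbf{e}_i \mathbf{e}_j^\top / \tau)}{\sum_{k} \exp(\mathbf{e}_i \mathbf{e}_k^\top / \tau)} \tag{10}$$
We use the cross-entropy loss to distill the item transitions into the sequential model by comparing the predicted transition probabilities $\hat{p}_{ij}$ with the pseudo-label transition probabilities $p_{ij}$ from Equation (9):
$$\mathcal{L}_{TED} = -\sum_{i} \sum_{j} p_{ij} \log \hat{p}_{ij} \tag{11}$$
Therefore, the factorization model can learn from the Item Transition model, enabling the item embeddings to memorize the item transition patterns. The overall loss function for the full model is:
$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{TED} + \mu\, \|\Theta\|_2^2 \tag{12}$$
where $\mathcal{L}_{CE}$ is the cumulative cross-entropy loss in Equation (5), $\Theta$ denotes the model parameters, and $\lambda$ and $\mu$ are the hyperparameters that control the weights of distillation and regularization, respectively.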
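The distillation objective described above can be sketched as a cross-entropy between teacher pseudo-labels and the student's softmax over embedding dot products. This is a simplified sketch under several assumptions: dropout is omitted, the loss is summed over all items rather than a mini-batch, the student distribution includes the item itself as a candidate, and the squared L2 form of the regularizer is assumed. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def student_distribution(i: int, embeddings: Matrix, temperature: float) -> List[float]:
    """Predicted transition distribution of item i from embedding dot products."""
    logits = [
        sum(a * b for a, b in zip(embeddings[i], e_j)) / temperature
        for e_j in embeddings
    ]
    return softmax(logits)


def ted_loss(teacher: Matrix, embeddings: Matrix, temperature: float) -> float:
    """Cross-entropy between teacher pseudo-labels and student predictions."""
    loss = 0.0
    for i, p_row in enumerate(teacher):
        q_row = student_distribution(i, embeddings, temperature)
        # Terms with zero teacher probability contribute nothing.
        loss -= sum(p * math.log(q) for p, q in zip(p_row, q_row) if p > 0)
    return loss


def total_loss(ce: float, ted: float, l2_sq: float, lam: float, mu: float) -> float:
    """Weighted sum of recommendation loss, distillation loss, and regularizer."""
    return ce + lam * ted + mu * l2_sq
```

As a sanity check, when all embeddings are zero the student is uniform, so a uniform teacher over three items gives a loss of `3 * log(3)`.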
3.3. Discussion
3.3.1. Relationship Between Two Modules
Here we discuss the relationship between the user collaboration and item transition modules, and how they complement each other.
Expressiveness vs. Calibration. The item transition module learns from a memory-based method that generates potential candidate items based on the global transition trends of the current item. However, it may generalize poorly to the items lacking observed transition patterns. On the other hand, the user collaboration module is a neural model that employs self-attentions to capture long-term user preferences and select the most likely next item based on historical items, resulting in a stronger ability to generalize but a limited ability to memorize and leverage item-to-item transition patterns. Therefore, the user collaboration model requires the item transition model to act as a calibrator for its predictions.
Disentangled Learning. The user collaboration and item transition modules are inherently disentangled, as we employ dual supervision where the original item embedding captures item-to-item transitional signals while the item embedding after self-attentions captures sequence-to-item collaborative signals.
Retrieval vs. Re-Ranking. The item transition and user collaboration modules can be regarded as a retrieval model and a re-ranking model, respectively. The retrieval model provides insight into generating potential candidate items, while the re-ranking model provides insight into selecting the most relevant items for users based on their respective interaction histories.
3.3.2. Comparison with Existing Methods
The proposed Transition-Aware Embedding Distillation (TED) module serves as a calibrator based on the item transition graph. Here we compare it with recent graph-based regularization methods:
Graph Regularization (GraReg) (Zhang et al., 2020a) is a Euclidean distance-based regularization term on embedding layers using a $k$-nearest neighbor ($k$-NN) graph:
(13) |
where is the coefficient hyperparameter for graph regularization, and is the edges in the -NN graph. We can use the transition frequency as the weights of the edges here. Therefore, GraReg uses the most related items for regularization, leading to learning localized transition patterns. Additionally, GraReg introduces an alignment loss but lacks a uniformity loss, where related items should be close to each other while unrelated ones should be separated (Wang et al., 2022). In contrast, TED uses the global item transitions as the teacher model, enabling the item embeddings to memorize and leverage transitional signals.
Graph-based Embedding Smoothing (GES) (Zhu et al., 2021) employs graph convolutions on the global item transition graph for embedding smoothing in sequential recommenders:
(14) |
where is the adjacency matrix of the item transition graph with self-loops, is the degree matrix of , and is the number of graph convolutional layers. However, stacking multiple graph convolutional layers may result in over-smoothing problems (Kipf and Welling, 2016), potentially leading to a decline in model performance. In comparison, TED incorporates a hyperparameter to control the power of item transition distillation, allowing for flexibility in different recommendation scenarios.
3.3.3. Model Complexity
Here we analyze the space and time complexity of the proposed model.
Space Complexity. The learnable parameters in SASRec are for item embeddings, positional embeddings, self-attention layers, feed-forward layers, and layer normalization. The total number of parameters in SASRec is (Kang and McAuley, 2018). Our proposed model introduces the long-query self-attention, which adds for projection matrices, feed-forward networks, and layer normalization. The embedding distillation module does not add any extra parameters. Therefore, the space complexity of our proposed model is the same as that of SASRec.
Time Complexity. The computational complexity of the self-attention layer and the feed-forward layer in SASRec is . The cumulative cross-entropy loss has a complexity of . Thus, the total computational complexity of SASRec is . In our proposed model, the self-attention module has the same complexity as in SASRec. The embedding distillation module has a complexity of . Hence, the time complexity of the proposed model is the same as that of SASRec with the cumulative cross-entropy loss.
4. Experiments
We conduct experiments on four real-world datasets to evaluate the effectiveness of the proposed method (the code and datasets are available at https://github.com/zhuty16/MQSA-TED). The experiments are designed to answer the following research questions:
- RQ1. How does the proposed method compare with state-of-the-art sequential recommendation methods?
- RQ2. How do the hyperparameters and various components affect the model performance?
- RQ3. How does the proposed TED method compare with graph-based regularization methods?
- RQ4. Can the proposed TED method benefit various recommendation models?
- RQ5. How do the two proposed modules improve the model performance?
4.1. Experimental Settings
4.1.1. Datasets
Table 1. Dataset statistics.
Dataset | # Users | # Items | # Actions | Density | Avg. Len. |
---|---|---|---|---|---|
Beauty | 22,363 | 12,101 | 198,502 | 0.073% | 8.88 |
Sports | 25,598 | 18,357 | 296,337 | 0.063% | 8.32 |
Toys | 19,412 | 11,924 | 167,597 | 0.072% | 8.63 |
Yelp | 30,431 | 20,033 | 316,354 | 0.052% | 10.40 |
We adopt four datasets from (Zhou et al., 2022) for experiments. The Beauty, Sports, and Toys datasets are from the Amazon Review Dataset (McAuley et al., 2015; He and McAuley, 2016b), available at https://cseweb.ucsd.edu/~jmcauley/datasets.html. The Yelp dataset is from the Yelp Open Dataset, available at https://www.yelp.com/dataset. The training data, validation data, and test data are identical to those used in (Zhou et al., 2022), which follows the leave-one-out evaluation protocol that treats the last item as the test data, the second last item as the validation data, and the remaining items as the training data for each user (Kang and McAuley, 2018). The dataset statistics are shown in Table 1.
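The leave-one-out protocol described above amounts to slicing each user's chronologically ordered sequence. A minimal sketch follows; the function name is hypothetical, and rejecting sequences with fewer than three items is an assumption of this example (the datasets used here are already preprocessed).

```python
from typing import List, Tuple


def leave_one_out_split(sequence: List[int]) -> Tuple[List[int], int, int]:
    """Split one user's sequence into (training items, validation item, test item)."""
    if len(sequence) < 3:
        # Assumption: at least one training item is required per user.
        raise ValueError("need at least three interactions per user")
    return sequence[:-2], sequence[-2], sequence[-1]


train_items, valid_item, test_item = leave_one_out_split([10, 11, 12, 13])
print(train_items, valid_item, test_item)  # [10, 11] 12 13
```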
Table 2. Performance comparison of different methods.
Dataset | Metric | POP | LightGCN | FPMC | Caser | GRU4Rec | SASRec | BERT4Rec | FMLP-Rec | MQSA-TED | Improv. |
---|---|---|---|---|---|---|---|---|---|---|---|
Beauty | HR@5 | 0.0077 | 0.0374 | 0.0596 | 0.0359 | 0.0489 | 0.0694 | 0.0419 | 0.0698 | 0.0752* | 7.23% |
 | NDCG@5 | 0.0042 | 0.0247 | 0.0419 | 0.0241 | 0.0342 | 0.0492 | 0.0275 | 0.0488 | 0.0534* | 8.58% |
 | HR@10 | 0.0135 | 0.0571 | 0.0838 | 0.0511 | 0.0695 | 0.0932 | 0.0647 | 0.0995 | 0.1039* | 4.44% |
 | NDCG@10 | 0.0061 | 0.0311 | 0.0497 | 0.0290 | 0.0408 | 0.0568 | 0.0349 | 0.0583 | 0.0627* | 7.48% |
 | HR@20 | 0.0217 | 0.0841 | 0.1151 | 0.0720 | 0.0998 | 0.1286 | 0.0992 | 0.1361 | 0.1435* | 5.40% |
 | NDCG@20 | 0.0081 | 0.0379 | 0.0576 | 0.0343 | 0.0484 | 0.0657 | 0.0435 | 0.0675 | 0.0726* | 7.62% |
Sports | HR@5 | 0.0057 | 0.0252 | 0.0337 | 0.0195 | 0.0221 | 0.0380 | 0.0241 | 0.0415 | 0.0455* | 9.52% |
 | NDCG@5 | 0.0041 | 0.0170 | 0.0234 | 0.0128 | 0.0143 | 0.0267 | 0.0161 | 0.0287 | 0.0320* | 11.34% |
 | HR@10 | 0.0091 | 0.0384 | 0.0499 | 0.0290 | 0.0357 | 0.0541 | 0.0380 | 0.0598 | 0.0643* | 7.48% |
 | NDCG@10 | 0.0052 | 0.0212 | 0.0286 | 0.0159 | 0.0187 | 0.0318 | 0.0206 | 0.0346 | 0.0380* | 9.85% |
 | HR@20 | 0.0175 | 0.0576 | 0.0703 | 0.0431 | 0.0548 | 0.0752 | 0.0583 | 0.0847 | 0.0906* | 6.93% |
 | NDCG@20 | 0.0073 | 0.0260 | 0.0337 | 0.0195 | 0.0235 | 0.0371 | 0.0257 | 0.0409 | 0.0446* | 9.09% |
Toys | HR@5 | 0.0065 | 0.0378 | 0.0664 | 0.0307 | 0.0420 | 0.0736 | 0.0379 | 0.0785 | 0.0834* | 6.24% |
 | NDCG@5 | 0.0044 | 0.0251 | 0.0463 | 0.0224 | 0.0297 | 0.0533 | 0.0244 | 0.0570 | 0.0600* | 5.31% |
 | HR@10 | 0.0090 | 0.0564 | 0.0925 | 0.0420 | 0.0597 | 0.0989 | 0.0589 | 0.1062 | 0.1130* | 6.42% |
 | NDCG@10 | 0.0052 | 0.0311 | 0.0547 | 0.0260 | 0.0354 | 0.0615 | 0.0312 | 0.0659 | 0.0696* | 5.56% |
 | HR@20 | 0.0143 | 0.0795 | 0.1212 | 0.0597 | 0.0834 | 0.1299 | 0.0857 | 0.1399 | 0.1503* | 7.41% |
 | NDCG@20 | 0.0065 | 0.0370 | 0.0619 | 0.0305 | 0.0414 | 0.0693 | 0.0379 | 0.0743 | 0.0789* | 6.23% |
Yelp | HR@5 | 0.0056 | 0.0290 | 0.0272 | 0.0199 | 0.0211 | 0.0232 | 0.0264 | 0.0270 | 0.0320* | 10.18% |
 | NDCG@5 | 0.0036 | 0.0184 | 0.0173 | 0.0129 | 0.0134 | 0.0151 | 0.0169 | 0.0169 | 0.0205* | 11.74% |
 | HR@10 | 0.0096 | 0.0486 | 0.0433 | 0.0334 | 0.0367 | 0.0379 | 0.0441 | 0.0446 | 0.0517* | 6.36% |
 | NDCG@10 | 0.0049 | 0.0246 | 0.0224 | 0.0172 | 0.0184 | 0.0198 | 0.0226 | 0.0225 | 0.0269* | 8.95% |
 | HR@20 | 0.0158 | 0.0790 | 0.0695 | 0.0535 | 0.0603 | 0.0623 | 0.0737 | 0.0721 | 0.0832* | 5.24% |
 | NDCG@20 | 0.0065 | 0.0323 | 0.0290 | 0.0222 | 0.0244 | 0.0259 | 0.0300 | 0.0294 | 0.0348* | 7.62% |
4.1.2. Baselines
We compare the proposed method with various types of state-of-the-art baselines in sequential recommendation:
- POP: a non-personalized method that ranks items based on their popularity.
- LightGCN (He et al., 2020): a GCN-based method that learns user and item embeddings through linear propagation on the user-item interaction graph.
- FPMC (Rendle et al., 2010): a Markov chain-based method that combines matrix factorization and factorized Markov chains.
- Caser (Tang and Wang, 2018a): a CNN-based method that uses horizontal and vertical convolutions to learn sequential patterns.
- GRU4Rec (Hidasi et al., 2015): an RNN-based method that uses Gated Recurrent Units (GRU) to model dynamic user preferences.
- SASRec (Kang and McAuley, 2018): a self-attention-based method that uses the Transformer to capture long-term sequential dependency (see Section 2.2).
- BERT4Rec (Sun et al., 2019): a Transformer-based method that uses bidirectional self-attention trained with a masked-item (cloze) objective.
- FMLP-Rec (Zhou et al., 2022): an MLP-based method that is currently the state-of-the-art sequential recommendation model based on filter-enhanced MLP.
4.1.3. Evaluation Metrics and Protocols
We adopt Hit Ratio@N (HR@N) and NDCG@N to evaluate the performance of the compared methods on the sequential recommendation task (Zhou et al., 2020, 2022). We set $N$ to 5, 10, and 20 by default and report the average scores over users. For each user, we rank all items except for the positive ones in their training or validation data (Krichene and Rendle, 2022). To ensure the robustness of the results, we randomly initialize each model five times and report the average performance.
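The full-ranking evaluation described above can be sketched for a single user with one held-out item. With a single relevant item the ideal DCG is 1, so NDCG@N reduces to `1 / log2(rank + 1)` when the item is ranked within the top N. The function names are hypothetical, and counting only strictly higher-scoring items when computing the rank (an optimistic treatment of ties) is an assumption of this sketch.

```python
import math
from typing import Dict, Set


def rank_of_target(scores: Dict[int, float], target: int, excluded: Set[int]) -> int:
    """1-based rank of the target among all items not in `excluded`."""
    target_score = scores[target]
    higher = sum(
        1
        for item, s in scores.items()
        if item != target and item not in excluded and s > target_score
    )
    return higher + 1


def hit_ratio_at_n(rank: int, n: int) -> float:
    return 1.0 if rank <= n else 0.0


def ndcg_at_n(rank: int, n: int) -> float:
    return 1.0 / math.log2(rank + 1) if rank <= n else 0.0


scores = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.1}
rank = rank_of_target(scores, target=3, excluded={1})  # item 1 was seen in training
print(rank, hit_ratio_at_n(rank, 5), round(ndcg_at_n(rank, 5), 4))  # 2 1.0 0.6309
```

Averaging these per-user values over all users gives the reported HR@N and NDCG@N.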
4.1.4. Implementation and Hyperparameter Settings
We implement all models with TensorFlow and use the cross-entropy loss for all models for a fair comparison, which has been proved to outperform the negative sampling-based losses significantly (Li et al., 2023). For common hyperparameters in all models, the maximum sequence length is set to , the embedding size is set to , the learning rate is tuned in {5e-3, 1e-3, 5e-4, 1e-4}, and the regularization is tuned in {0, 1e-6, 1e-5, 1e-4, 1e-3}. All models are trained with mini-batch Adam (Kingma and Ba, 2014) and the batch size is set to . Other hyperparameters of different models are tuned on the validation set according to the suggestions in their respective papers. The results of baseline methods under their optimal hyperparameter settings are reported.
4.2. Main Results (RQ1)
Table 2 presents a performance comparison of different methods. The results show that, on Amazon datasets, sequential methods such as FPMC, SASRec, and FMLP-Rec outperform the non-sequential method LightGCN significantly. Among the sequential methods, FMLP-Rec performs the best. However, on the Yelp dataset, LightGCN outperforms the sequential methods due to the weak sequentiality of user interactions on Yelp (Zhu et al., 2021). Furthermore, our proposed method significantly outperforms all baseline methods, with an average improvement of about 6.2% in Hit Ratio@20 and 7.6% in NDCG@20 compared to the best baseline (averaging the Improv. column of Table 2 over the four datasets).
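The average improvements over the best baseline can be recomputed directly from the Improv. column of Table 2. The short check below uses only the HR@20 and NDCG@20 entries reported there for Beauty, Sports, Toys, and Yelp; the variable names are illustrative.

```python
# Improv. column of Table 2, in percent, ordered Beauty, Sports, Toys, Yelp.
hr20_improvements = [5.40, 6.93, 7.41, 5.24]
ndcg20_improvements = [7.62, 9.09, 6.23, 7.62]


def mean(values):
    return sum(values) / len(values)


avg_hr20 = mean(hr20_improvements)      # about 6.2 percent
avg_ndcg20 = mean(ndcg20_improvements)  # about 7.6 percent
print(f"HR@20: {avg_hr20:.1f}%  NDCG@20: {avg_ndcg20:.1f}%")
```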
Figure 3 shows the performances of SASRec and our proposed method with respect to the training epochs. One can observe that our proposed method consistently outperforms SASRec by a notable margin, showing the effectiveness of the proposed modules.
4.3. Hyperparameter and Ablation Studies (RQ2)
Figure 4 presents the performance of our proposed method with respect to various hyperparameters and modules:
4.3.1. Length of Long-Query Self-Attention $L$.
It can be observed the best depends on the datasets and the model generally performs well when is in the range of , showing the effectiveness of long-query self-attention in capturing collaborative signals.
4.3.2. Balance of Long and Short-Query Self-Attention $\alpha$.
The results show that when is approximately , the model achieves the best performance, indicating a proper bias-variance trade-off in modeling user interests. Notably, when , the model degrades to SASRec with TED. Therefore, the proposed multi-query self-attention significantly outperforms the short-query self-attention used in SASRec with a proper .
4.3.3. Weight of Embedding Distillation $\lambda$.
It can be seen that the model performs better when is approximately , demonstrating the effectiveness of the TED module. Note that when , our proposed method degrades to the MQSA model without TED, resulting in a significant drop in performance.
4.3.4. Temperature of Embedding Distillation $\tau$.
The results suggest that the model requires relatively hard pseudo-labels of item transition distributions for effective knowledge distillation, as the best performance is achieved when or .
Table 3. Comparison of TED with graph-based regularization methods on top of MQSA.
Dataset | Metric | MQSA | +GES | +GraReg | +TED |
---|---|---|---|---|---|
Beauty | NDCG@10 | 0.0599 | 0.0623 | 0.0611 | 0.0627 |
 | NDCG@20 | 0.0694 | 0.0724 | 0.0708 | 0.0726 |
Sports | NDCG@10 | 0.0344 | 0.0370 | 0.0351 | 0.0380 |
 | NDCG@20 | 0.0408 | 0.0434 | 0.0416 | 0.0446 |
Toys | NDCG@10 | 0.0654 | 0.0672 | 0.0667 | 0.0696 |
 | NDCG@20 | 0.0749 | 0.0765 | 0.0755 | 0.0789 |
Yelp | NDCG@10 | 0.0255 | 0.0244 | 0.0257 | 0.0269 |
 | NDCG@20 | 0.0327 | 0.0320 | 0.0330 | 0.0348 |
Table 4. Performance of base models with and without TED.
Dataset | Metric | LightGCN | +TED | FMLP-Rec | +TED |
---|---|---|---|---|---|
Beauty | NDCG@10 | 0.0311 | 0.0399 | 0.0583 | 0.0596 |
 | NDCG@20 | 0.0379 | 0.0484 | 0.0675 | 0.0684 |
Sports | NDCG@10 | 0.0212 | 0.0246 | 0.0346 | 0.0356 |
 | NDCG@20 | 0.0260 | 0.0298 | 0.0409 | 0.0423 |
Toys | NDCG@10 | 0.0311 | 0.0388 | 0.0659 | 0.0675 |
 | NDCG@20 | 0.0370 | 0.0459 | 0.0743 | 0.0762 |
Yelp | NDCG@10 | 0.0246 | 0.0236 | 0.0225 | 0.0226 |
 | NDCG@20 | 0.0323 | 0.0312 | 0.0294 | 0.0296 |
4.4. Comparison with Graph Methods (RQ3)
We also compare the proposed Transition-Aware Embedding Distillation (TED) module with graph-based regularization methods in Table 3. The results show that most of these methods improve the performance of MQSA. Specifically, GES performs better than GraReg on the Amazon datasets but worse on the Yelp dataset, where it falls below the MQSA baseline. Moreover, our proposed TED method outperforms both GES and GraReg on every dataset and metric reported in Table 3, indicating the effectiveness of learning global and accurate item transition patterns by knowledge distillation.
4.5. TED for Various Base Models (RQ4)
We also compare the performance of various base models with and without our proposed Transition-Aware Embedding Distillation (TED) module in Table 4. The results demonstrate that TED can act as a domain adapter, which enhances the performance of the non-sequential method LightGCN on sequential recommendation tasks for the Beauty, Sports, and Toys datasets. Furthermore, incorporating TED yields consistent, though smaller, improvements for the state-of-the-art sequential recommendation method FMLP-Rec. Notably, TED shows limited effects on the Yelp dataset, where it leaves FMLP-Rec nearly unchanged and slightly lowers the NDCG of LightGCN, due to the weak sequentiality of user interactions. In other words, transitional signals are less important in this dataset.
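The size of these effects can be made concrete by computing relative gains from the NDCG@10 values in Table 4. The snippet below is only a small arithmetic sketch; the helper `relative_gain` and the selection of rows are illustrative choices, not part of the original evaluation.

```python
def relative_gain(base, with_ted):
    """Relative change of `with_ted` over `base`, as a fraction of `base`."""
    return (with_ted - base) / base

# NDCG@10 values copied from Table 4: (base model, base model + TED).
ndcg10 = {
    ("LightGCN", "Beauty"): (0.0311, 0.0399),
    ("LightGCN", "Yelp"): (0.0246, 0.0236),
    ("FMLP-Rec", "Beauty"): (0.0583, 0.0596),
    ("FMLP-Rec", "Yelp"): (0.0225, 0.0226),
}

gains = {key: relative_gain(*pair) for key, pair in ndcg10.items()}
for (model, dataset), gain in gains.items():
    print(f"{model:9s} {dataset:7s} {gain:+.1%}")
```

On Beauty the relative gain is large for LightGCN and small but positive for FMLP-Rec, while on Yelp it is close to zero for FMLP-Rec and negative for LightGCN.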
4.6. Performance Comparison by Groups (RQ5)
Figure 5 presents the performance of different methods on test samples grouped by how often the transition from the validation item (the second-to-last item) to the test item (the last item) is observed in the training data. We evaluate the SASRec model with Transition-Aware Embedding Distillation (SASRec-TED), the Multi-Query Self-Attention model (MQSA), and the full MQSA-TED model. Compared with the results in Figure 1, the improvement of MQSA over SASRec mainly comes from test samples lacking transition instances. However, the integration of long-query self-attention may hurt the performance on test samples with frequent transitions. By incorporating the TED module as a calibrator, MQSA-TED performs better than MQSA mainly on test samples with high transition frequencies. As MQSA and TED focus on collaborative and transitional signals, respectively, their combination results in a reasonable balance between the two signal types.
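The grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions: training sequences are given as lists of item IDs, only consecutive (first-order) transitions are counted, and the group boundaries are hypothetical values, not those used for Figure 5.

```python
from collections import Counter, defaultdict

def count_transitions(train_sequences):
    """Count consecutive item-to-item transitions over all training sequences."""
    counts = Counter()
    for seq in train_sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            counts[(prev_item, next_item)] += 1
    return counts

def group_test_samples(test_pairs, counts, boundaries=(0, 1, 5)):
    """Bucket (validation_item, test_item) pairs by training transition frequency.

    `boundaries` are hypothetical lower bounds of each group in ascending
    order and must start at 0 so that every frequency falls into a group.
    """
    groups = defaultdict(list)
    for pair in test_pairs:
        freq = counts[pair]  # Counter returns 0 for unseen transitions
        bucket = max(b for b in boundaries if b <= freq)
        groups[bucket].append(pair)
    return dict(groups)

# Toy data: three training sequences and three test samples.
train_sequences = [[1, 2, 3], [1, 2, 4], [5, 2, 3]]
test_pairs = [(2, 3), (2, 4), (3, 9)]

counts = count_transitions(train_sequences)
groups = group_test_samples(test_pairs, counts)
print(groups)
```

Here the pair `(3, 9)` never occurs as a training transition and therefore lands in the zero-frequency group, which corresponds to the test samples "lacking transition instances" discussed above.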
5. Related Work
Sequential Recommendation. Sequential recommendation methods aim to capture dynamic user preferences (He et al., 2017; Chen et al., 2018; Wang et al., 2020). Early efforts adopt Markov Chains (MCs) to learn item transition patterns, such as FPMC (Rendle et al., 2010), which combines Matrix Factorization (MF) with a first-order Markov chain. Fossil (He and McAuley, 2016a) fuses a similarity-based model with high-order Markov chains. Recent efforts incorporate deep models, such as GRU4Rec (Hidasi et al., 2015) and NARM (Li et al., 2017), which employ Gated Recurrent Units (GRU). Caser (Tang and Wang, 2018a) uses horizontal and vertical convolutional filters. SASRec (Kang and McAuley, 2018) and BERT4Rec (Sun et al., 2019) use the unidirectional and bidirectional self-attention modules of the Transformer (Vaswani et al., 2017), respectively. FMLP-Rec (Zhou et al., 2022) is an all-MLP model with learnable filters in the frequency domain. However, previous efforts typically follow an auto-regressive framework, which neglects the valuable information in global item transition patterns. In this paper, we propose a Transition-Aware Embedding Distillation module to memorize and leverage the transitional signals.
Self-Attention in Recommendation. The Transformer architecture has achieved remarkable success in modeling long-term dependencies in Natural Language Processing (NLP) (Vaswani et al., 2017; Devlin et al., 2018). Consequently, recent efforts employ this architecture for sequential recommendation tasks (Ren et al., 2020; Qiu et al., 2022). For example, SASRec (Kang and McAuley, 2018) and BERT4Rec (Sun et al., 2019) use unidirectional and bidirectional self-attention modules, respectively. In addition, some efforts aim to enhance self-attention-based models by incorporating side information. For instance, TiSASRec (Li et al., 2020) incorporates time interval embeddings into SASRec. S3-Rec (Zhou et al., 2020) introduces self-supervision tasks to learn correlations among attributes, items, sub-sequences, and sequences based on mutual information maximization. SASRec-GES (Zhu et al., 2021) employs graph convolutions on sequential and semantic item graphs to generate smoothed item embeddings. Efforts have also been made to improve the efficiency or effectiveness of SASRec (Li et al., 2021). CL4SRec (Xie et al., 2022) uses contrastive learning to derive self-supervision signals from user interaction sequences. Despite these advances, previous studies have paid less attention to the limitations of the conventional self-attention architecture in capturing collaborative signals. In this paper, we propose a Multi-Query Self-Attention method that combines long and short-query self-attentions to enhance its effectiveness in modeling user collaborations.
Knowledge Distillation in Recommendation. Knowledge distillation is a widely used model compression technique in various fields (Hinton et al., 2015), where a student model is trained with both a ground-truth label distribution and a smoothed pseudo-label distribution generated by a teacher model. Recent efforts apply this method to recommender systems, such as Ranking Distillation (Tang and Wang, 2018b), which trains a student model to rank items based on both training data and teacher model predictions. Collaborative Distillation (Lee et al., 2019) uses probabilistic rank-aware sampling with teacher-guided and student-guided training strategies. Other existing methods aim to distill knowledge from side information into recommendation models to enhance their performance and interpretability. For instance, SCML (Zhu et al., 2020) combines the item-based CF model with the social CF model through embedding-level and output-level mutual learning. DESIGN (Tao et al., 2022) integrates information from the user-item interaction graph and the user-user social graph and makes them learn from each other. Zhang et al. (2020b) propose a joint learning framework to distill structured knowledge from a path-based model into a neural model. However, knowledge distillation has received less attention in the context of sequential recommendation. In this paper, we distill the knowledge of item transitions into sequential recommendation models to enhance their performance.
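As a generic illustration of the distillation setup described at the start of this paragraph, the sketch below combines a cross-entropy term on the ground-truth label with a cross-entropy term on the teacher's pseudo-label distribution. The convex weighting by `weight`, the function names, and the toy distributions are assumptions made for illustration; this is not the specific loss used by TED or by any of the cited methods.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def distillation_loss(one_hot, teacher_probs, student_probs, weight):
    """Weighted sum of a ground-truth term and a teacher (pseudo-label) term.

    `weight` in [0, 1] is a hypothetical mixing coefficient: 0 uses only the
    ground-truth label, 1 uses only the teacher distribution.
    """
    hard_term = cross_entropy(one_hot, student_probs)
    soft_term = cross_entropy(teacher_probs, student_probs)
    return (1.0 - weight) * hard_term + weight * soft_term

# Toy three-class example.
one_hot = [1.0, 0.0, 0.0]
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.6, 0.3, 0.1]

only_labels = distillation_loss(one_hot, teacher_probs, student_probs, weight=0.0)
only_teacher = distillation_loss(one_hot, teacher_probs, student_probs, weight=1.0)
print(only_labels, only_teacher)
```

Setting `weight` to zero recovers ordinary supervised training, which mirrors how a model without any distillation term is obtained.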
6. Conclusion
In this paper, we addressed the limitations of existing sequential recommendation methods in capturing collaborative and transitional signals in user interaction sequences. To overcome these limitations, we proposed a new method called Multi-Query Self-Attention with Transition-Aware Embedding Distillation (MQSA-TED). To capture collaborative signals, we introduced an L-query self-attention module using flexible window sizes for attention queries and combined long and short-query self-attentions. In addition, we developed a transition-aware embedding distillation module that distills global item transition patterns into item embeddings, enabling the model to memorize and leverage transitional signals. Experimental results on four real-world datasets demonstrated the effectiveness of both modules in improving sequential recommendation performance.
Acknowledgements.
This research was partly supported by a CIHR-NSERC-SSHRC Healthy Cities Research Training Platform grant of Canada.
References
- Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the eleventh ACM international conference on web search and data mining. 108–116.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Gao et al. (2022) Chongming Gao, Shijun Li, Yuan Zhang, Jiawei Chen, Biao Li, Wenqiang Lei, Peng Jiang, and Xiangnan He. 2022. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3953–3957.
- He et al. (2017) Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017. Translation-based recommendation. In Proceedings of the eleventh ACM conference on recommender systems. 161–169.
- He and McAuley (2016a) Ruining He and Julian McAuley. 2016a. Fusing similarity models with markov chains for sparse sequential recommendation. In 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 191–200.
- He and McAuley (2016b) Ruining He and Julian McAuley. 2016b. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th international conference on world wide web. 507–517.
- He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 639–648.
- Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
- Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- Krichene and Rendle (2022) Walid Krichene and Steffen Rendle. 2022. On sampled metrics for item recommendation. Commun. ACM 65, 7 (2022), 75–83.
- Lee et al. (2019) Jae-woong Lee, Minjin Choi, Jongwuk Lee, and Hyunjung Shim. 2019. Collaborative distillation for top-N recommendation. In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 369–378.
- Li et al. (2023) Fangyu Li, Shenbao Yu, Feng Zeng, and Fang Yang. 2023. Effective and Efficient Training for Sequential Recommendation Using Cumulative Cross-Entropy Loss. arXiv preprint arXiv:2301.00979 (2023).
- Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
- Li et al. (2020) Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th ACM international conference on web search and data mining. 322–330.
- Li et al. (2021) Yang Li, Tong Chen, Peng-Fei Zhang, and Hongzhi Yin. 2021. Lightweight self-attentive sequential recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 967–977.
- McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th ACM SIGIR international conference on research and development in information retrieval. 43–52.
- Qiu et al. (2022) Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2022. Contrastive learning for representation degeneration problem in sequential recommendation. In Proceedings of the fifteenth ACM international conference on web search and data mining. 813–823.
- Ren et al. (2020) Ruiyang Ren, Zhaoyang Liu, Yaliang Li, Wayne Xin Zhao, Hui Wang, Bolin Ding, and Ji-Rong Wen. 2020. Sequential recommendation with self-attentive multi-adversarial network. In Proceedings of the 43rd ACM SIGIR international conference on research and development in information retrieval. 89–98.
- Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on world wide web. 811–820.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
- Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
- Tang and Wang (2018a) Jiaxi Tang and Ke Wang. 2018a. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
- Tang and Wang (2018b) Jiaxi Tang and Ke Wang. 2018b. Ranking distillation: Learning compact ranking models with high performance for recommender system. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2289–2298.
- Tao et al. (2022) Ye Tao, Ying Li, Su Zhang, Zhirong Hou, and Zhonghai Wu. 2022. Revisiting Graph based Social Recommendation: A Distillation Enhanced Social Graph Network. In Proceedings of the ACM Web Conference 2022. 2830–2838.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Wang et al. (2022) Chenyang Wang, Yuanqing Yu, Weizhi Ma, Min Zhang, Chong Chen, Yiqun Liu, and Shaoping Ma. 2022. Towards Representation Alignment and Uniformity in Collaborative Filtering. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1816–1825.
- Wang et al. (2020) Chenyang Wang, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Make it a chorus: knowledge-and time-aware item modeling for sequential recommendation. In Proceedings of the 43rd ACM SIGIR International conference on research and development in Information Retrieval. 109–118.
- Xie et al. (2022) Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 1259–1273.
- Zhang et al. (2020a) Yuan Zhang, Fei Sun, Xiaoyong Yang, Chen Xu, Wenwu Ou, and Yan Zhang. 2020a. Graph-based regularization on embedding layers for recommendation. ACM Transactions on Information Systems (TOIS) 39, 1 (2020), 1–27.
- Zhang et al. (2020b) Yuan Zhang, Xiaoran Xu, Hanning Zhou, and Yan Zhang. 2020b. Distilling structured knowledge into embeddings for explainable and accurate recommendation. In Proceedings of the 13th ACM international conference on web search and data mining. 735–743.
- Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1059–1068.
- Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.
- Zhou et al. (2022) Kun Zhou, Hui Yu, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM Web Conference 2022. 2388–2399.
- Zhu et al. (2020) Tianyu Zhu, Guannan Liu, and Guoqing Chen. 2020. Social collaborative mutual learning for item recommendation. ACM Transactions on Knowledge Discovery from Data (TKDD) 14, 4 (2020), 1–19.
- Zhu et al. (2021) Tianyu Zhu, Leilei Sun, and Guoqing Chen. 2021. Graph-based embedding smoothing for sequential recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 1 (2021), 496–508.