License: arXiv.org perpetual non-exclusive license
arXiv:2311.01056v2 [cs.IR] 25 Dec 2023

Collaboration and Transition: Distilling Item Transitions into Multi-Query Self-Attention for Sequential Recommendation

Tianyu Zhu (Beihang University, Beijing, China), Yansong Shi (Tsinghua University, Beijing, China), Yuan Zhang (Kuaishou Technology, Beijing, China), Yihong Wu (Université de Montréal, Montréal, Canada), Fengran Mo (Université de Montréal, Montréal, Canada), and Jian-Yun Nie (Université de Montréal, Montréal, Canada)
(2024)
Abstract.

Modern recommender systems employ various sequential modules such as self-attention to learn dynamic user interests. However, these methods are less effective in capturing collaborative and transitional signals within user interaction sequences. First, the self-attention architecture uses the embedding of a single item as the attention query, making it challenging to capture collaborative signals. Second, these methods typically follow an auto-regressive framework, which is unable to learn global item transition patterns. To overcome these limitations, we propose a new method called Multi-Query Self-Attention with Transition-Aware Embedding Distillation (MQSA-TED). First, we propose an $L$-query self-attention module that employs flexible window sizes for attention queries to capture collaborative signals. In addition, we introduce a multi-query self-attention method that balances the bias-variance trade-off in modeling user preferences by combining long and short-query self-attentions. Second, we develop a transition-aware embedding distillation module that distills global item-to-item transition patterns into item embeddings, which enables the model to memorize and leverage transitional signals and serves as a calibrator for collaborative signals. Experimental results on four real-world datasets demonstrate the effectiveness of the proposed modules.

sequential recommendation, self-attention, knowledge distillation
Published in: Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24), March 4–8, 2024, Merida, Mexico. ISBN: 979-8-4007-0371-3/24/03. DOI: 10.1145/3616855.3635787. CCS Concepts: Information systems → Recommender systems.

1. Introduction

Figure 1. Performance of three methods w.r.t. item transition frequency on two datasets. Item Transition performs better on test samples with frequent transitions, while LightGCN performs better on test samples lacking transition instances. SASRec achieves the best performance on average.

In recent years, there has been an increasing focus on modeling dynamic user preferences in modern recommender systems (Zhou et al., 2018; Gao et al., 2022), which is achieved by incorporating various sequential modules such as RNN (Hidasi et al., 2015), CNN (Tang and Wang, 2018a), and Transformer (Kang and McAuley, 2018; Sun et al., 2019). These sequential recommenders aim to integrate contextual factors derived from recent user interactions into personalized user interests. Contextual factors reveal typical item-to-item transition patterns. The main challenge in sequential recommendation lies in effectively learning both personalized user interests and general item transition patterns while maintaining an appropriate balance between the two factors. For instance, a user interested in sportswear may also seek a shirt after purchasing a suit. If we only rely on collaborative signals to generate recommendations, we may overlook the user’s temporary need for items to complement their suit. On the other hand, if we only consider transitional signals to make recommendations, we may neglect the user’s primary interest in sportswear. Therefore, it is crucial to leverage both signals and find a balance between them. We define the collaborative and transitional signals in the context of sequential recommendation tasks as follows:

Definition 1.0 (Collaborative Signals).

In the context of sequential recommendation, collaborative signals refer to the similarities between sequences of users’ interacted items.

Definition 1.0 (Transitional Signals).

In the context of sequential recommendation, transitional signals refer to the transition frequency between pairs of users’ interacted items.

Specifically, collaborative signals can be used by following a sequence-to-item methodology, leveraging the collaborative behavior of users to identify patterns in their interactions and recommend relevant items. On the other hand, transitional signals exploit item-to-item relationships in user interaction sequences, enabling the identification of trigger items that will lead to related purchases.

Although recent sequential recommendation methods such as SASRec (Kang and McAuley, 2018) have demonstrated remarkable performance, they have inherent limitations in effectively capturing both signals within user interaction sequences. To highlight these limitations, we conducted experiments comparing the performance of SASRec with two baseline methods: Item Transition and LightGCN (He et al., 2020). Item Transition is a memory-based, non-personalized method that makes recommendations based on the global transition frequency from the current item to candidate items, serving as a benchmark based on transitional signals (see Section 3.2 for details). LightGCN is a state-of-the-art non-sequential recommendation method that learns user and item embeddings through linear propagation on the user-item interaction graph, serving as a benchmark based on collaborative signals. We conducted experiments on two Amazon datasets, Beauty and Sports (Zhou et al., 2022), and grouped the test samples based on the transition frequency observed in the training data. Results shown in Figure 1 reveal two limitations of SASRec in leveraging both signals:

First, SASRec has a lower ability to leverage collaborative signals than LightGCN. For test samples where the item transition frequency is zero, LightGCN consistently outperforms SASRec on both datasets. This observation shows the limited ability of SASRec to generalize to test samples lacking observed item transitions. Notably, SASRec uses the embedding of the most recent item as the query in its self-attention module, which can be regarded as an attention-enhanced first-order Markov chain model that is inherently limited in leveraging collaborative signals.

Second, SASRec’s ability to leverage transitional signals is lower than Item Transition. For test samples where the item transition frequency exceeds one, i.e., the transition occurs multiple times in the training data, Item Transition significantly outperforms SASRec on both datasets. This observation highlights the limited effectiveness of SASRec in leveraging transitional signals.

Inspired by these observations, we propose a new method called Multi-Query Self-Attention with Transition-Aware Embedding Distillation (MQSA-TED) for sequential recommendation tasks, which consists of two main components to capture collaborative and transitional signals, respectively. First, we propose an $L$-query self-attention module that uses several items (instead of a single item) in windows of flexible sizes as attention queries to capture collaborative signals. By enlarging the window size $L$, the model can leverage similarities between longer-range sequences of users’ interacted items to generate recommendations. However, using a large $L$ will result in a global bias as the recommendation will mainly focus on the user’s long-term interests while ignoring the interest shift over time. To strike a balance between bias and variance in modeling users’ dynamic interests, we introduce a multi-query self-attention method by combining long and short-query self-attentions. Second, we develop a transition-aware embedding distillation module that distills global item-to-item transition patterns into item embeddings, which serves as a calibration module that enables the model to effectively memorize and leverage transitional signals when making recommendations. Notably, our proposed method achieves inherent disentanglement of user collaboration modeling and item transition modeling by employing dual supervision: the original item embedding captures item-to-item transitional signals, while the item embedding created after self-attention modules captures sequence-to-item collaborative signals. Our contributions in this paper are summarized as follows:

  • We propose an $L$-query self-attention module that uses flexible window sizes for attention queries to capture collaborative signals. We also design a multi-query self-attention method that combines long and short-query self-attentions to balance the bias-variance trade-off in modeling users’ dynamic interests.

  • We develop a transition-aware embedding distillation module that distills the global item-to-item transition patterns into item embeddings to capture transitional signals, which serves as a calibration module for collaborative signals.

  • We conduct extensive experiments on four real-world datasets to show the effectiveness of our proposed method. The results also highlight the different effects of the proposed two modules in improving recommendation performances.

2. Preliminaries

2.1. Problem Formulation

The sequential recommendation task aims to predict the next item that a user will interact with based on their historical interactions. Let $\mathcal{U}=\{u_1,u_2,\cdots,u_{|\mathcal{U}|}\}$ be the set of users, $\mathcal{I}=\{i_1,i_2,\cdots,i_{|\mathcal{I}|}\}$ be the set of items, and $S^{(u)}=[i_1^{(u)},i_2^{(u)},\cdots,i_{n_u}^{(u)}]$ be the interaction sequence of user $u$, where $n_u$ denotes the length of the sequence. The problem is formulated as calculating the probability that the next item will be interacted with, given the user’s historical interactions:

(1) $$p\left(i_{n_u+1}^{(u)} \mid S^{(u)}\right).$$

Then the top-$N$ items are recommended to user $u$ in descending order of these probabilities.

2.2. SASRec

We first briefly introduce the SASRec (Kang and McAuley, 2018) model, which is a state-of-the-art sequential recommender based on the self-attention module in Transformer (Vaswani et al., 2017) and will be used as the base model in our approach. Given a user interaction sequence of the most recent $n$ items $[i_1,i_2,\cdots,i_n]$ (here we omit the superscript $(u)$ for simplicity), an embedding matrix $\mathbf{E}\in\mathbb{R}^{|\mathcal{I}|\times d}$ is used to convert the sequence into an embedding sequence $[\mathbf{e}_1,\mathbf{e}_2,\cdots,\mathbf{e}_n]$, where $d$ is the embedding size. Then a learnable positional embedding $\mathbf{P}\in\mathbb{R}^{n\times d}$ is added to encode the position information, resulting in $[\hat{\mathbf{e}}_1,\hat{\mathbf{e}}_2,\cdots,\hat{\mathbf{e}}_n]$, where $\hat{\mathbf{e}}_t=\mathbf{e}_t+\mathbf{p}_t$. Next, the Transformer (Vaswani et al., 2017) module is used:

(2) $$[\tilde{\mathbf{e}}_1,\tilde{\mathbf{e}}_2,\cdots,\tilde{\mathbf{e}}_n]=\textrm{Transformer}([\hat{\mathbf{e}}_1,\hat{\mathbf{e}}_2,\cdots,\hat{\mathbf{e}}_n]),$$

which adopts multiple blocks of self-attention and feed-forward networks. The self-attention layer is used to capture the long-term sequential dependency as follows:

(3) $$\textrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V},$$
(4) $$\mathbf{Q}=\hat{\mathbf{E}}\mathbf{W}^{Q},\ \mathbf{K}=\hat{\mathbf{E}}\mathbf{W}^{K},\ \mathbf{V}=\hat{\mathbf{E}}\mathbf{W}^{V},$$

where $\mathbf{Q}$ represents the queries, $\mathbf{K}$ the keys, $\mathbf{V}$ the values, and $\mathbf{W}^{Q}$, $\mathbf{W}^{K}$, $\mathbf{W}^{V}\in\mathbb{R}^{d\times d}$ are the projection matrices for queries, keys, and values, respectively. Finally, the model predicts ranking scores by taking the dot product between the sequence embedding and the candidate item embeddings as $\hat{\mathbf{r}}_t=\tilde{\mathbf{e}}_t\mathbf{E}^{T}$. The cumulative cross-entropy loss is used for model training (this loss function has been shown to be more effective than the negative sampling-based binary cross-entropy loss (Li et al., 2023), and we use it for all models in our experiments):

(5) $$\mathcal{L}_{rec}=-\sum_{t=1}^{n}\mathbf{r}_t\log\textrm{softmax}(\hat{\mathbf{r}}_t),$$

where $\mathbf{r}_t\in\mathbb{R}^{1\times|\mathcal{I}|}$ is a one-hot vector converted from the index of the ground truth item at timestamp $t$.
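As a rough illustration of Equations (3)–(5), the following NumPy sketch computes single-head scaled dot-product attention and the cumulative cross-entropy loss. It is a simplification under stated assumptions, not the SASRec implementation: it uses one attention layer without the feed-forward network, residual connections, or layer normalization, and it applies an explicit causal mask so that position $t$ only attends to positions up to $t$. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(E_hat, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (Eqs. 3-4) with a causal mask."""
    n, d = E_hat.shape
    Q, K, V = E_hat @ W_q, E_hat @ W_k, E_hat @ W_v
    scores = Q @ K.T / np.sqrt(d)
    # Assumption: position t may only attend to positions <= t.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ V

def cumulative_cross_entropy(E_tilde, E, targets):
    """Eq. 5: sum over positions of -log softmax(e_tilde_t E^T)[target_t]."""
    logits = E_tilde @ E.T                      # shape (n, num_items)
    log_probs = np.log(softmax(logits, axis=-1))
    return float(-log_probs[np.arange(len(targets)), targets].sum())

# Toy usage with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
num_items, n, d = 50, 6, 8
E = rng.normal(scale=0.1, size=(num_items, d))   # item embedding matrix
P = rng.normal(scale=0.1, size=(n, d))           # positional embeddings
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

seq = rng.integers(0, num_items, size=n + 1)     # n inputs plus one final target
E_hat = E[seq[:-1]] + P
E_tilde = causal_self_attention(E_hat, W_q, W_k, W_v)
loss = cumulative_cross_entropy(E_tilde, E, seq[1:])
```

Here the target at each position is the next item in the sequence, which is the usual auto-regressive setup for this kind of model.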

3. Methodology

Figure 2. Illustration of the proposed MQSA-TED method. It consists of two main components: 1) Multi-Query Self-Attention for user collaboration modeling, and 2) Transition-Aware Embedding Distillation for item transition modeling.

In this section, we present the proposed method, which consists of two main components as illustrated in Figure 2: 1) Multi-Query Self-Attention for user collaboration modeling, and 2) Transition-Aware Embedding Distillation for item transition modeling.

3.1. Multi-Query Self-Attention for User Collaboration Modeling

We adopt SASRec as our base model owing to its strong ability to capture long-term sequential dependency and its state-of-the-art performance in sequential recommendation tasks (Kang and McAuley, 2018). SASRec uses the self-attention module in Transformer (Vaswani et al., 2017), whose main components are the queries, keys, and values, as shown in Equation (4). Specifically, the attention query at timestamp $t$ in SASRec can be expressed as follows:

(6) $$\mathbf{q}_t=\hat{\mathbf{e}}_t\mathbf{W}^{Q},$$

where $\hat{\mathbf{e}}_t$ is the embedding vector of the item at timestamp $t$ after adding the positional embedding, and $\mathbf{W}^{Q}$ is a learnable projection matrix. Then, the attention weights assigned to historical items $[i_1,i_2,\cdots,i_t]$ at timestamp $t$ are determined by the scaled dot-product between the query embedding and the key embeddings as shown in Equation (3). Therefore, the attention weights are dominated by the single item at timestamp $t$, leading to a type of short-query self-attention.

However, this type of self-attention is limited in leveraging collaborative signals, especially when the item at timestamp $t$ is inconsistent with the user’s primary preference. Specifically, SASRec can be viewed as a self-attention-enhanced first-order Markov chain model, and its recommendation results can be significantly affected by a minor change in the order of users’ interacted item sequences, such as swapping the position of the last two items. In other words, SASRec may generalize poorly on test samples lacking observed item transitions. However, real-world recommendation scenarios such as restaurant recommendations on Yelp have shown that user interests are relatively stable and less sensitive to the order of several recent choices (Zhu et al., 2021), which SASRec may have difficulty coping with. To address this limitation, we propose an $L$-query self-attention approach. First, we define the $L$-query self-attention as follows:

Definition 3.0 ($L$-query Self-Attention).

An $L$-query self-attention is a type of self-attention module that uses the embeddings or their transformed representations of the most recent $L$ timestamps’ items (tokens) as the attention query.

Here we use the simple mean-pooling of the embeddings of the last $L$ items at timestamp $t$ as the query embedding:

(7) $$\tilde{\mathbf{q}}_t=\textrm{mean-pooling}(\hat{\mathbf{e}}_{t-L+1},\hat{\mathbf{e}}_{t-L+2},\cdots,\hat{\mathbf{e}}_t)\tilde{\mathbf{W}}^{Q},$$

where $L$ is a hyperparameter that controls the range of the attention query. Alternatively, other functions can be used to generate the query embedding, such as a weighted summation with time decay.
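The mean-pooled query of Equation (7) can be sketched in NumPy as follows. This is only an illustrative sketch: the function names are hypothetical, and the handling of early positions with fewer than $L$ preceding items (truncating the window to the available items) is an assumption, since the text does not specify it.

```python
import numpy as np

def l_query_inputs(E_hat, L):
    """Mean of the last L position-aware embeddings at every timestamp t.

    Assumption: for t < L the window is truncated to the items seen so far.
    """
    E_hat = np.asarray(E_hat, dtype=float)
    n, d = E_hat.shape
    # Prefix sums make each window sum an O(d) difference of two rows.
    csum = np.vstack([np.zeros((1, d)), np.cumsum(E_hat, axis=0)])
    pooled = np.empty_like(E_hat)
    for t in range(n):
        start = max(0, t - L + 1)
        pooled[t] = (csum[t + 1] - csum[start]) / (t + 1 - start)
    return pooled

def l_query(E_hat, W_q_tilde, L):
    """Eq. 7: project the mean-pooled window with the query matrix."""
    return l_query_inputs(E_hat, L) @ W_q_tilde

# Toy usage: 4 timestamps, embedding size 3, identity projection.
E_hat = np.arange(12, dtype=float).reshape(4, 3)
q_short = l_query(E_hat, np.eye(3), L=1)   # reduces to the single-item query
q_long = l_query(E_hat, np.eye(3), L=3)
```

With $L=1$ the pooled input equals the single item embedding, so the query coincides with the form of Equation (6).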

It is important to note that the hyperparameter $L$ controls the range of the historical context in self-attention. Using a large value of $L$ means that the model relies on long-range historical items to represent user interests, which contributes to capturing collaborative signals but may accumulate bias as user interests may shift over time. Conversely, using a small value of $L$ means that the model adopts the latest interacted items to represent user interests but can introduce variance due to the small number of used items. To balance the bias-variance trade-off, we propose a Multi-Query Self-Attention (MQSA) method that combines the short-query self-attention (with $L=1$, similar to SASRec) with the long-query self-attention (with a larger $L$) using a hyperparameter $\alpha$:

(8) $$\tilde{\mathbf{e}}_t=\alpha\cdot\tilde{\mathbf{e}}_t^{short}+(1-\alpha)\cdot\tilde{\mathbf{e}}_t^{long}.$$

Then, the sequence embedding $\tilde{\mathbf{e}}_t$ is used along with the embeddings of candidate items to predict their ranking scores through dot product. Notably, we could also allow the model to learn the optimal $\alpha$. However, simultaneously learning the weights and the embeddings is challenging due to the inherent complexity. We could also incorporate more than two values of $L$. We leave these for exploration in future work.
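A minimal sketch of Equation (8) followed by dot-product scoring is given below. The short- and long-query outputs are taken as given inputs here (random vectors in the usage example), and the function name and the `top_n` parameter are hypothetical.

```python
import numpy as np

def combine_and_score(e_short, e_long, E, alpha, top_n=5):
    """Eq. 8 followed by dot-product ranking over all candidate items."""
    e = alpha * e_short + (1.0 - alpha) * e_long   # convex combination
    scores = e @ E.T                               # one score per item
    top = np.argsort(-scores, kind="stable")[:top_n]
    return e, scores, top

# Toy usage with random vectors standing in for the two attention outputs.
rng = np.random.default_rng(0)
E = rng.normal(size=(20, 8))
e_short, e_long = rng.normal(size=8), rng.normal(size=8)
e, scores, top = combine_and_score(e_short, e_long, E, alpha=0.5)
```

Setting `alpha=1.0` keeps only the short-query output and `alpha=0.0` keeps only the long-query output, which makes the two extremes of the trade-off easy to inspect.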

It is worth mentioning that the formulation of MQSA shares similar ideas with some approaches in the literature, such as FPMC (Rendle et al., 2010) and Fossil (He and McAuley, 2016a), which explicitly model long-term user interests by employing user or item embeddings, respectively, and combine them with factorized Markov chains for sequential recommendation tasks. Compared to Fossil, which uses all of a user’s interacted items, MQSA introduces flexible window sizes over the last $L$ items to control the bias-variance trade-off. Furthermore, MQSA employs self-attention modules to enhance expressiveness, resulting in improved performance compared to the use of pure item embeddings in Fossil.

3.2. Transition-Aware Embedding Distillation for Item Transition Modeling

Sequential recommendation models have demonstrated their effectiveness in enhancing recommendation accuracy by capturing long-term user interests (Hidasi et al., 2015; Kang and McAuley, 2018; Zhou et al., 2022). However, these models may have limitations in leveraging the global item-to-item transitional signals. Specifically, most existing methods follow an auto-regressive framework (Kang and McAuley, 2018; Zhou et al., 2022). For each user, their preference at timestamp $t$ is learned based on their interacted items up to and including $t$, which is then used to predict the item at timestamp $t+1$. Nevertheless, this framework fails to enable the model to learn the global item-to-item transition patterns. In other words, all items that the user has not interacted with are treated equally, without considering which items the current item $i_t$ is more likely to trigger.

To address this limitation, we propose a heuristic recommender based on item transitions and then develop a knowledge distillation method to integrate these global item transition patterns into sequential models. Specifically, we construct a global item transition graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ represents item nodes and $\mathcal{E}$ represents transition edges between items. $\mathcal{G}$ is a weighted and directed graph, where the weight of each edge represents the transition frequency between two items within a time span $k$, based on all user interaction sequences. Note that the time span hyperparameter $k$ is used to control the long-term item transition patterns and is set to $1$ by default (i.e., only transitions between directly adjacent items are considered). We use the adjacency matrix $\mathbf{A}\in\mathbb{R}^{|\mathcal{I}|\times|\mathcal{I}|}$ of $\mathcal{G}$ as the heuristic recommender, where $a_{i,j}$ is the transition frequency from item $i$ to item $j$, as shown in Figure 2. It is a memory-based non-personalized method that recommends items based on the transition frequency from the current item to candidate items, as introduced in our preliminary experiments in Section 1.
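The construction of the transition matrix and the Item Transition recommender can be sketched as below. The interpretation of the time span $k$ (counting every ordered pair of items at most $k$ steps apart) is an assumption about the wording above, and the function names, the dense matrix, and the toy sequences are hypothetical; a sparse matrix would be more practical for a real item catalogue.

```python
import numpy as np

def build_transition_matrix(sequences, num_items, k=1):
    """A[i, j] counts how often item j follows item i within k steps."""
    A = np.zeros((num_items, num_items))
    for seq in sequences:
        for t, src in enumerate(seq):
            # Assumption: all items up to k positions after t count as targets.
            for dst in seq[t + 1 : t + 1 + k]:
                A[src, dst] += 1.0
    return A

def item_transition_recommend(A, current_item, top_n=3):
    """Rank candidates by transition frequency from the current item."""
    row = A[current_item]
    order = np.argsort(-row, kind="stable")[:top_n]
    return [int(j) for j in order if row[j] > 0]

# Toy usage: three user sequences over four items.
sequences = [[0, 1, 2], [0, 1, 3], [2, 1, 0]]
A = build_transition_matrix(sequences, num_items=4, k=1)
recs = item_transition_recommend(A, current_item=0)
```

Items with no observed outgoing transitions yield an empty list, which reflects why this memory-based recommender cannot generalize to unseen transitions.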

To distill the item transitions into the sequential model, we propose a Transition-Aware Embedding Distillation (TED) method. First, we normalize the transition frequencies using a row normalization approach as $\bar{a}_{i,j}=\frac{a_{i,j}}{\max_{j}a_{i,j}}$. Then, we use a softmax function with temperature $\tau$ to generate pseudo-labels for knowledge distillation:

(9) $$\tilde{\mathbf{a}}_i=\textrm{softmax}(\bar{\mathbf{a}}_i/\tau),$$

where a higher value of $\tau$ generates a softer probability distribution over items (Hinton et al., 2015).
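The row normalization and Equation (9) can be sketched as follows. The treatment of items with no outgoing transitions (their normalized row is left at zero, so the softmax yields a uniform distribution) is an assumption, as the text does not cover this case; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_labels(A, tau=1.0):
    """Row-max normalization followed by a temperature softmax (Eq. 9)."""
    A = np.asarray(A, dtype=float)
    row_max = A.max(axis=1, keepdims=True)
    # Assumption: rows without observed transitions stay all-zero.
    A_bar = np.divide(A, row_max, out=np.zeros_like(A), where=row_max > 0)
    return softmax(A_bar / tau, axis=1)

# Toy usage: item 1 has no outgoing transitions.
A = np.array([[0, 2, 1],
              [0, 0, 0],
              [4, 0, 0]])
soft = pseudo_labels(A, tau=1.0)    # softer distribution
sharp = pseudo_labels(A, tau=0.1)   # concentrates mass on frequent transitions
```

Comparing `soft` and `sharp` shows the effect of the temperature: lowering $\tau$ moves probability mass toward the most frequent transition in each row.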

We adopt a simple factorization model as the student model, which predicts the item transition distribution of item $i$ by using the dot product between its embedding vector $\mathbf{e}_i$ and the embedding matrix $\mathbf{E}$ before the self-attention layers, where the dropout (Srivastava et al., 2014) strategy is also used for robust learning. We apply the softmax function with temperature $\tau$ to obtain the predicted transition probabilities:

(10) $$\hat{\mathbf{a}}_i=\textrm{softmax}(\mathbf{e}_i\mathbf{E}^{T}/\tau).$$

We use the cross-entropy loss to distill the item transitions into the sequential model by comparing the predicted and pseudo-label transition probabilities:

(11) $$\mathcal{L}_{kd}=-\sum_{i\in\mathcal{I}}\tilde{\mathbf{a}}_i\log\hat{\mathbf{a}}_i.$$

Therefore, the factorization model can learn from the Item Transition model, enabling the item embeddings to memorize the item transition patterns. The overall loss function for the full model is:

(12) $$\mathcal{L}=\mathcal{L}_{rec}+\lambda_{kd}\mathcal{L}_{kd}+\lambda_{\Theta}\|\Theta\|_2^2,$$

where $\Theta$ denotes the model parameters, and $\lambda_{kd}$ and $\lambda_{\Theta}$ are the hyperparameters that control the weights of distillation and $l_2$ regularization, respectively.
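Equations (10)–(12) can be sketched in NumPy as below. This is a simplified illustration, not the authors' training code: dropout is omitted, the loss is summed over all items at once rather than over mini-batches, and the recommendation loss value and the hyperparameter values in the usage example are placeholders. All names are hypothetical.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_loss(E, A_tilde, tau=1.0):
    """Eqs. 10-11: cross-entropy between pseudo-labels and student predictions."""
    log_a_hat = log_softmax(E @ E.T / tau, axis=1)   # row i is log a_hat_i
    return float(-(A_tilde * log_a_hat).sum())

def total_loss(l_rec, l_kd, params, lambda_kd, lambda_theta):
    """Eq. 12: recommendation loss + weighted distillation + l2 penalty."""
    l2 = sum(float((p ** 2).sum()) for p in params)
    return l_rec + lambda_kd * l_kd + lambda_theta * l2

# Toy usage: uniform pseudo-labels over 4 items, placeholder values elsewhere.
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(4, 8))
A_tilde = np.full((4, 4), 0.25)
l_kd = distillation_loss(E, A_tilde, tau=0.5)
loss = total_loss(l_rec=3.2, l_kd=l_kd, params=[E], lambda_kd=0.1, lambda_theta=1e-4)
```

Because the student scores come from the raw embedding matrix rather than the self-attention output, the gradient of the distillation term acts directly on the item embeddings.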

3.3. Discussion

3.3.1. Relationship Between Two Modules

Here we discuss the relationship between the user collaboration and item transition modules, and how they complement each other.

Expressiveness vs. Calibration. The item transition module learns from a memory-based method that generates potential candidate items based on the global transition trends of the current item. However, it may generalize poorly to the items lacking observed transition patterns. On the other hand, the user collaboration module is a neural model that employs self-attentions to capture long-term user preferences and select the most likely next item based on historical items, resulting in a stronger ability to generalize but a limited ability to memorize and leverage item-to-item transition patterns. Therefore, the user collaboration model requires the item transition model to act as a calibrator for its predictions.

Disentangled Learning. The user collaboration and item transition modules are inherently disentangled, as we employ dual supervision where the original item embedding captures item-to-item transitional signals while the item embedding after self-attentions captures sequence-to-item collaborative signals.

Retrieval vs. Re-Ranking. The item transition and user collaboration modules can be regarded as a retrieval model and a re-ranking model, respectively. The retrieval model provides insight into generating potential candidate items, while the re-ranking model provides insight into selecting the most relevant items for users based on their respective interaction histories.

3.3.2. Comparison with Existing Methods

The proposed Transition-Aware Embedding Distillation (TED) module serves as a calibrator based on the item transition graph. Here we compare it with recent graph-based regularization methods:

Graph Regularization (GraReg) (Zhang et al., 2020a) is a Euclidean distance-based regularization term on embedding layers using a $k$-nearest neighbor ($k$-NN) graph:

(13) $\mathcal{L}=\mathcal{L}_{rec}+\lambda_{reg}\sum_{(i,j)\in\mathcal{E}}\|\mathbf{e}_{i}-\mathbf{e}_{j}\|^{2},$

where $\lambda_{reg}$ is the coefficient hyperparameter for graph regularization, and $\mathcal{E}$ is the set of edges in the $k$-NN graph. We can use the transition frequency as the weights of the edges here. Therefore, GraReg uses the $k$ most related items for regularization, which leads it to learn localized transition patterns. Additionally, GraReg introduces an alignment loss but lacks a uniformity loss, whereas related items should be close to each other while unrelated ones should be separated (Wang et al., 2022). In contrast, TED uses the global item transitions as the teacher model, enabling the item embeddings to memorize and leverage transitional signals.
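To make the regularization term in Eq. (13) concrete, the following sketch sums squared Euclidean distances over a given edge list. It assumes embeddings are stored in a dictionary keyed by item id and that the $k$-NN edges have already been selected; both the layout and the name `grareg_penalty` are hypothetical.

```python
def grareg_penalty(embeddings, edges):
    """Sum of squared Euclidean distances over graph edges, as in Eq. (13).

    embeddings: dict mapping item id -> list of floats (assumed layout).
    edges: iterable of (i, j) item-id pairs from a prebuilt k-NN graph.
    """
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
    return total


toy_embeddings = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0]}
penalty = grareg_penalty(toy_embeddings, [(0, 1), (0, 2)])  # 1.0 + 4.0
```

The penalty only pulls connected items together. Nothing in it pushes unconnected items apart, which is the missing uniformity component noted above.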

Graph-based Embedding Smoothing (GES) (Zhu et al., 2021) employs graph convolutions on the global item transition graph for embedding smoothing in sequential recommenders:

(14) $\mathbf{E}^{(l+1)}=\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\mathbf{E}^{(l)},$

where $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ is the adjacency matrix of the item transition graph with self-loops, $\tilde{\mathbf{D}}$ is the degree matrix of $\tilde{\mathbf{A}}$, and $l$ is the index of the graph convolutional layer. However, stacking multiple graph convolutional layers may result in over-smoothing problems (Kipf and Welling, 2016), potentially leading to a decline in model performance. In comparison, TED incorporates a hyperparameter to control the power of item transition distillation, allowing for flexibility in different recommendation scenarios.
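The propagation step in Eq. (14) can be illustrated with a small dense example. The sketch below assumes a symmetric adjacency matrix given as nested lists and uses row sums of $\tilde{\mathbf{A}}$ as degrees; the function name `smooth_once` is an assumption for illustration.

```python
import math


def smooth_once(adj, emb):
    """One propagation step E' = D^{-1/2} (A + I) D^{-1/2} E, as in Eq. (14).

    adj: symmetric adjacency matrix as nested lists (assumed dense for clarity).
    emb: item embeddings, one row per item.
    """
    n = len(adj)
    # Add self-loops: A_tilde = A + I
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # diagonal of D_tilde
    dim = len(emb[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            weight = a_tilde[i][j] / math.sqrt(deg[i] * deg[j])
            for k in range(dim):
                row[k] += weight * emb[j][k]
        out.append(row)
    return out


# Two connected items with orthogonal embeddings
smoothed = smooth_once([[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In this two-item toy graph a single step already maps both embeddings to the same vector, a small-scale picture of the over-smoothing effect mentioned above.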

3.3.3. Model Complexity

Here we analyze the space and time complexity of the proposed model.

Space Complexity. The learnable parameters in SASRec are for item embeddings, positional embeddings, self-attention layers, feed-forward layers, and layer normalization. The total number of parameters in SASRec is $\mathcal{O}(|\mathcal{I}|d+nd+d^{2})$ (Kang and McAuley, 2018). Our proposed model introduces the long-query self-attention, which adds $\mathcal{O}(d^{2})$ parameters for projection matrices, feed-forward networks, and layer normalization. The embedding distillation module does not add any extra parameters. Therefore, the space complexity of our proposed model is the same as that of SASRec.

Time Complexity. The computational complexity of the self-attention layer and the feed-forward layer in SASRec is $\mathcal{O}(n^{2}d+nd^{2})$. The cumulative cross-entropy loss has a complexity of $\mathcal{O}(|\mathcal{I}|nd)$. Thus, the total computational complexity of SASRec is $\mathcal{O}(|\mathcal{I}|nd+n^{2}d+nd^{2})$. In our proposed model, the self-attention module has the same complexity as in SASRec. The embedding distillation module has a complexity of $\mathcal{O}(|\mathcal{I}|nd)$. Hence, the time complexity of the proposed model is the same as that of SASRec with the cumulative cross-entropy loss.
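To give a rough sense of the relative size of these terms, the sketch below evaluates them with constants dropped, using the Beauty item count from Table 1 together with the sequence length and embedding size from Section 4.1.4. These are order-of-magnitude counts, not measured costs, and the function name is hypothetical.

```python
def sasrec_cost_terms(num_items, n, d):
    """Leading cost terms per sequence, with constant factors ignored."""
    return {
        "loss": num_items * n * d,   # O(|I| n d): scoring all items at each position
        "attention": n * n * d,      # O(n^2 d): self-attention
        "feed_forward": n * d * d,   # O(n d^2): projections and feed-forward layers
    }


# Beauty: |I| = 12,101 items, n = 50, d = 64
terms = sasrec_cost_terms(num_items=12101, n=50, d=64)
```

With these settings the loss term (38,723,200) is far larger than the attention term (160,000) and the feed-forward term (204,800), so a distillation term of the same order as the loss leaves the overall asymptotic cost unchanged.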

4. Experiments

We conduct experiments on four real-world datasets to evaluate the effectiveness of the proposed method. The code and datasets are available at https://github.com/zhuty16/MQSA-TED. The experiments are designed to answer the following research questions:

  • RQ1. How does the proposed method compare with state-of-the-art sequential recommendation methods?

  • RQ2. How do the hyperparameters and various components affect the model performance?

  • RQ3. How does the proposed TED method compare with graph-based regularization methods?

  • RQ4. Can the proposed TED method benefit various recommendation models?

  • RQ5. How do the proposed two modules improve the model performance?

4.1. Experimental Settings

4.1.1. Datasets

Table 1. Summary of evaluation datasets. The datasets are from (Zhou et al., 2022).

| Dataset | # Users | # Items | # Actions | Density | Avg. Len. |
|---|---|---|---|---|---|
| Beauty | 22,363 | 12,101 | 198,502 | 0.073% | 8.88 |
| Sports | 25,598 | 18,357 | 296,337 | 0.063% | 8.32 |
| Toys | 19,412 | 11,924 | 167,597 | 0.072% | 8.63 |
| Yelp | 30,431 | 20,033 | 316,354 | 0.052% | 10.40 |

We adopt four datasets from (Zhou et al., 2022) for experiments. The Beauty, Sports, and Toys datasets are from the Amazon Review Dataset (https://cseweb.ucsd.edu/~jmcauley/datasets.html) in (McAuley et al., 2015; He and McAuley, 2016b). The Yelp dataset is from the Yelp Open Dataset (https://www.yelp.com/dataset). The training, validation, and test data are identical to those used in (Zhou et al., 2022), which follow the leave-one-out evaluation protocol: for each user, the last item is the test data, the second last item is the validation data, and the remaining items are the training data (Kang and McAuley, 2018). The dataset statistics are shown in Table 1.
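The leave-one-out protocol can be sketched as follows for a single user, assuming the interaction sequence is a chronologically ordered list of item ids. The helper name `leave_one_out` and the minimum-length check are illustrative assumptions, not part of the original preprocessing.

```python
def leave_one_out(sequence):
    """Split one user's chronologically ordered items into (train, valid, test).

    The last item is held out for testing, the second last for validation,
    and everything before that is used for training.
    """
    if len(sequence) < 3:
        raise ValueError("need at least three interactions to split")
    return sequence[:-2], sequence[-2], sequence[-1]


train_items, valid_item, test_item = leave_one_out([10, 42, 7, 99, 3])
```

Here `train_items` is `[10, 42, 7]`, `valid_item` is `99`, and `test_item` is `3`.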

Table 2. Performance comparison of different methods on four datasets. The best results are in boldface and the second best are underlined. Asterisk (*) indicates statistically significant improvements over the best baseline determined by a two-sample t-test ($p<0.01$) after repeating the experiments five times.

| Dataset | Metric | POP | LightGCN | FPMC | Caser | GRU4Rec | SASRec | BERT4Rec | FMLP-Rec | MQSA-TED | Improv. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | HR@5 | 0.0077 | 0.0374 | 0.0596 | 0.0359 | 0.0489 | 0.0694 | 0.0419 | 0.0698 | 0.0752* | 7.23% |
| Beauty | NDCG@5 | 0.0042 | 0.0247 | 0.0419 | 0.0241 | 0.0342 | 0.0492 | 0.0275 | 0.0488 | 0.0534* | 8.58% |
| Beauty | HR@10 | 0.0135 | 0.0571 | 0.0838 | 0.0511 | 0.0695 | 0.0932 | 0.0647 | 0.0995 | 0.1039* | 4.44% |
| Beauty | NDCG@10 | 0.0061 | 0.0311 | 0.0497 | 0.0290 | 0.0408 | 0.0568 | 0.0349 | 0.0583 | 0.0627* | 7.48% |
| Beauty | HR@20 | 0.0217 | 0.0841 | 0.1151 | 0.0720 | 0.0998 | 0.1286 | 0.0992 | 0.1361 | 0.1435* | 5.40% |
| Beauty | NDCG@20 | 0.0081 | 0.0379 | 0.0576 | 0.0343 | 0.0484 | 0.0657 | 0.0435 | 0.0675 | 0.0726* | 7.62% |
| Sports | HR@5 | 0.0057 | 0.0252 | 0.0337 | 0.0195 | 0.0221 | 0.0380 | 0.0241 | 0.0415 | 0.0455* | 9.52% |
| Sports | NDCG@5 | 0.0041 | 0.0170 | 0.0234 | 0.0128 | 0.0143 | 0.0267 | 0.0161 | 0.0287 | 0.0320* | 11.34% |
| Sports | HR@10 | 0.0091 | 0.0384 | 0.0499 | 0.0290 | 0.0357 | 0.0541 | 0.0380 | 0.0598 | 0.0643* | 7.48% |
| Sports | NDCG@10 | 0.0052 | 0.0212 | 0.0286 | 0.0159 | 0.0187 | 0.0318 | 0.0206 | 0.0346 | 0.0380* | 9.85% |
| Sports | HR@20 | 0.0175 | 0.0576 | 0.0703 | 0.0431 | 0.0548 | 0.0752 | 0.0583 | 0.0847 | 0.0906* | 6.93% |
| Sports | NDCG@20 | 0.0073 | 0.0260 | 0.0337 | 0.0195 | 0.0235 | 0.0371 | 0.0257 | 0.0409 | 0.0446* | 9.09% |
| Toys | HR@5 | 0.0065 | 0.0378 | 0.0664 | 0.0307 | 0.0420 | 0.0736 | 0.0379 | 0.0785 | 0.0834* | 6.24% |
| Toys | NDCG@5 | 0.0044 | 0.0251 | 0.0463 | 0.0224 | 0.0297 | 0.0533 | 0.0244 | 0.0570 | 0.0600* | 5.31% |
| Toys | HR@10 | 0.0090 | 0.0564 | 0.0925 | 0.0420 | 0.0597 | 0.0989 | 0.0589 | 0.1062 | 0.1130* | 6.42% |
| Toys | NDCG@10 | 0.0052 | 0.0311 | 0.0547 | 0.0260 | 0.0354 | 0.0615 | 0.0312 | 0.0659 | 0.0696* | 5.56% |
| Toys | HR@20 | 0.0143 | 0.0795 | 0.1212 | 0.0597 | 0.0834 | 0.1299 | 0.0857 | 0.1399 | 0.1503* | 7.41% |
| Toys | NDCG@20 | 0.0065 | 0.0370 | 0.0619 | 0.0305 | 0.0414 | 0.0693 | 0.0379 | 0.0743 | 0.0789* | 6.23% |
| Yelp | HR@5 | 0.0056 | 0.0290 | 0.0272 | 0.0199 | 0.0211 | 0.0232 | 0.0264 | 0.0270 | 0.0320* | 10.18% |
| Yelp | NDCG@5 | 0.0036 | 0.0184 | 0.0173 | 0.0129 | 0.0134 | 0.0151 | 0.0169 | 0.0169 | 0.0205* | 11.74% |
| Yelp | HR@10 | 0.0096 | 0.0486 | 0.0433 | 0.0334 | 0.0367 | 0.0379 | 0.0441 | 0.0446 | 0.0517* | 6.36% |
| Yelp | NDCG@10 | 0.0049 | 0.0246 | 0.0224 | 0.0172 | 0.0184 | 0.0198 | 0.0226 | 0.0225 | 0.0269* | 8.95% |
| Yelp | HR@20 | 0.0158 | 0.0790 | 0.0695 | 0.0535 | 0.0603 | 0.0623 | 0.0737 | 0.0721 | 0.0832* | 5.24% |
| Yelp | NDCG@20 | 0.0065 | 0.0323 | 0.0290 | 0.0222 | 0.0244 | 0.0259 | 0.0300 | 0.0294 | 0.0348* | 7.62% |
Figure 3. Performance curves of SASRec and our proposed MQSA-TED on four datasets (panels (a)-(d)).

4.1.2. Baselines

We compare the proposed method with various types of state-of-the-art baselines in sequential recommendation:

  • POP: a non-personalized method that ranks items based on their popularity.

  • LightGCN (He et al., 2020): a GCN-based method that learns user and item embeddings through linear propagation on the user-item interaction graph.

  • FPMC (Rendle et al., 2010): a Markov chain-based method that combines matrix factorization and factorized Markov chains.

  • Caser (Tang and Wang, 2018a): a CNN-based method that uses horizontal and vertical convolutions to learn sequential patterns.

  • GRU4Rec (Hidasi et al., 2015): an RNN-based method that uses Gated Recurrent Units (GRU) to model dynamic user preferences.

  • SASRec (Kang and McAuley, 2018): a unidirectional Transformer-based method that models user interests using the self-attention module in Transformer (Vaswani et al., 2017).

  • BERT4Rec (Sun et al., 2019): a bidirectional Transformer-based method that models user interests using the self-attention module in BERT (Devlin et al., 2018).

  • FMLP-Rec (Zhou et al., 2022): an MLP-based method that is currently the state-of-the-art sequential recommendation model based on filter-enhanced MLP.

4.1.3. Evaluation Metrics and Protocols

We adopt Hit Ratio@N (HR@N) and NDCG@N to evaluate the performance of the compared methods on the sequential recommendation task (Zhou et al., 2020, 2022). We set $N$ to 5, 10, and 20 by default and report the average scores over users. For each user, we rank all items except for the positive ones in their training or validation data (Krichene and Rendle, 2022). To ensure the robustness of the results, we randomly initialize each model five times and report the average performance.
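For a single held-out target item, both metrics reduce to simple functions of the target's rank among the candidate items. The sketch below assumes scores are given as a dictionary from item id to score and that ties are broken in favor of the target; the function names and the data layout are hypothetical.

```python
import math


def rank_of_target(scores, target, seen):
    """1-based rank of the target among all items not in `seen`.

    scores: dict mapping item id -> predicted score (assumed layout).
    seen: items from the user's training or validation data, excluded from ranking.
    Ties are resolved in favor of the target (an assumption of this sketch).
    """
    target_score = scores[target]
    rank = 1
    for item, score in scores.items():
        if item == target or item in seen:
            continue
        if score > target_score:
            rank += 1
    return rank


def hr_at_n(rank, n):
    """Hit Ratio@N for one user: 1 if the target is in the top N, else 0."""
    return 1.0 if rank <= n else 0.0


def ndcg_at_n(rank, n):
    """NDCG@N for one user with a single relevant item (ideal DCG is 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= n else 0.0


toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
toy_rank = rank_of_target(toy_scores, target="c", seen={"a"})  # only "b" outranks "c"
```

Averaging `hr_at_n` and `ndcg_at_n` over all users gives the reported scores for a given N.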

4.1.4. Implementation and Hyperparameter Settings

We implement all models with TensorFlow and use the cross-entropy loss for all models for a fair comparison, as it has been shown to significantly outperform negative sampling-based losses (Li et al., 2023). For common hyperparameters in all models, the maximum sequence length is set to 50, the embedding size $d$ is set to 64, the learning rate is tuned in {5e-3, 1e-3, 5e-4, 1e-4}, and the $l_{2}$ regularization is tuned in {0, 1e-6, 1e-5, 1e-4, 1e-3}. All models are trained with mini-batch Adam (Kingma and Ba, 2014) and the batch size is set to 256. Other hyperparameters of different models are tuned on the validation set according to the suggestions in their respective papers. The results of baseline methods under their optimal hyperparameter settings are reported.
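The shared search space described above can be enumerated as a small grid. The sketch below is only an illustration of that grid; the dictionary keys and the function name `build_grid` are assumptions and do not correspond to flags in the released code.

```python
from itertools import product

# Settings fixed across all models (values taken from the text above)
SHARED = {"max_seq_len": 50, "embedding_size": 64, "batch_size": 256, "optimizer": "adam"}

# Values tuned on the validation set
LEARNING_RATES = [5e-3, 1e-3, 5e-4, 1e-4]
L2_WEIGHTS = [0.0, 1e-6, 1e-5, 1e-4, 1e-3]


def build_grid():
    """Return one config dict per (learning rate, l2 weight) combination."""
    return [dict(SHARED, learning_rate=lr, l2_weight=l2)
            for lr, l2 in product(LEARNING_RATES, L2_WEIGHTS)]


grid = build_grid()  # 4 learning rates x 5 regularization weights = 20 configs
```

Model-specific hyperparameters would be added on top of each of these 20 configurations.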

4.2. Main Results (RQ1)

Table 2 presents a performance comparison of different methods. The results show that, on Amazon datasets, sequential methods such as FPMC, SASRec, and FMLP-Rec outperform the non-sequential method LightGCN significantly. Among the sequential methods, FMLP-Rec performs the best. However, on the Yelp dataset, LightGCN outperforms the sequential methods due to the weak sequentiality of user interactions on Yelp (Zhu et al., 2021). Furthermore, our proposed method significantly outperforms all baseline methods, with an average improvement of 6.24% in Hit Ratio@20 and 7.64% in NDCG@20 compared to the best baseline.
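The averages quoted above can be reproduced from the Improv. column of Table 2. The short check below simply averages the four per-dataset improvements for HR@20 and NDCG@20; the variable names are illustrative.

```python
# Per-dataset relative improvements (%) over the best baseline, from Table 2
hr20_improv = {"Beauty": 5.40, "Sports": 6.93, "Toys": 7.41, "Yelp": 5.24}
ndcg20_improv = {"Beauty": 7.62, "Sports": 9.09, "Toys": 6.23, "Yelp": 7.62}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_hr20 = mean(hr20_improv.values())      # 24.98 / 4 = 6.245
avg_ndcg20 = mean(ndcg20_improv.values())  # 30.56 / 4 = 7.64
```

The HR@20 mean of 6.245 matches the reported 6.24% up to rounding, and the NDCG@20 mean matches the reported 7.64%.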

Figure 3 shows the performances of SASRec and our proposed method with respect to the training epochs. One can observe that our proposed method consistently outperforms SASRec by a notable margin, showing the effectiveness of the proposed modules.

Figure 4. Performance of the proposed MQSA-TED w.r.t. various hyperparameters on four datasets (panels (a)-(p)).

4.3. Hyperparameter and Ablation Studies (RQ2)

Figure 4 presents the performance of our proposed method with respect to various hyperparameters and modules:

4.3.1. Length of Long-Query Self-Attention $L$.

It can be observed that the best $L$ depends on the dataset, and the model generally performs well when $L$ is in the range $[2,4]$, showing the effectiveness of long-query self-attention in capturing collaborative signals.

4.3.2. Balance of Long and Short-Query Self-Attention $\alpha$.

The results show that when $\alpha$ is approximately 0.5, the model achieves the best performance, indicating a proper bias-variance trade-off in modeling user interests. Notably, when $\alpha=1$, the model degrades to SASRec with TED. Therefore, the proposed multi-query self-attention significantly outperforms the short-query self-attention used in SASRec with a proper $\alpha$.
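The role of $\alpha$ can be pictured as a convex combination of the two attention outputs. The sketch below assumes the mixing takes the form $\alpha \cdot$ short $+ (1-\alpha) \cdot$ long, which is consistent with the statement that $\alpha=1$ recovers the short-query model; the exact form used in the model is defined in the method section, and the function name is hypothetical.

```python
def combine_queries(short_out, long_out, alpha):
    """Mix short-query and long-query attention outputs element-wise.

    Assumed form: alpha * short + (1 - alpha) * long, so alpha = 1 keeps only
    the short-query (SASRec-style) output.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * s + (1.0 - alpha) * l for s, l in zip(short_out, long_out)]


mixed = combine_queries([1.0, 0.0], [0.0, 1.0], alpha=0.5)  # equal weighting
```

With `alpha=0.5` both outputs contribute equally, which is the setting that the hyperparameter study finds to work best.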

4.3.3. Weight of Embedding Distillation $\lambda_{kd}$.

It can be seen that the model performs better when $\lambda_{kd}$ is approximately 0.1, demonstrating the effectiveness of the TED module. Note that when $\lambda_{kd}=0$, our proposed method degrades to the MQSA model without TED, resulting in a significant drop in performance.

4.3.4. Temperature of Embedding Distillation $\tau$.

The results suggest that the model requires relatively hard pseudo-labels of item transition distributions for effective knowledge distillation, as the best performance is achieved when $\tau=0.05$ or $\tau=0.1$.
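To illustrate why a small $\tau$ yields harder pseudo-labels, the sketch below applies a temperature-scaled softmax to a toy score vector. It assumes the teacher distribution is obtained by a softmax over transition scores divided by $\tau$; the score values and the function name are hypothetical, and the exact teacher definition is given in the method section.

```python
import math


def softmax_with_temperature(scores, tau):
    """Softmax of scores / tau, computed with max-subtraction for stability."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


toy_scores = [0.6, 0.3, 0.1]  # hypothetical normalized transition scores
hard_labels = softmax_with_temperature(toy_scores, tau=0.05)
soft_labels = softmax_with_temperature(toy_scores, tau=1.0)
```

At `tau=0.05` almost all probability mass falls on the most frequent transition, whereas at `tau=1.0` the distribution stays close to uniform.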

Table 3. Performance comparison of the proposed TED module with graph-based methods on four datasets. The best results are in boldface and the second best are underlined.

| Dataset | Metric | MQSA | +GES | +GraReg | +TED |
|---|---|---|---|---|---|
| Beauty | NDCG@10 | 0.0599 | 0.0623 | 0.0611 | 0.0627 |
| Beauty | NDCG@20 | 0.0694 | 0.0724 | 0.0708 | 0.0726 |
| Sports | NDCG@10 | 0.0344 | 0.0370 | 0.0351 | 0.0380 |
| Sports | NDCG@20 | 0.0408 | 0.0434 | 0.0416 | 0.0446 |
| Toys | NDCG@10 | 0.0654 | 0.0672 | 0.0667 | 0.0696 |
| Toys | NDCG@20 | 0.0749 | 0.0765 | 0.0755 | 0.0789 |
| Yelp | NDCG@10 | 0.0255 | 0.0244 | 0.0257 | 0.0269 |
| Yelp | NDCG@20 | 0.0327 | 0.0320 | 0.0330 | 0.0348 |
Table 4. Performance comparison of LightGCN and FMLP-Rec w/ and w/o the proposed TED module on four datasets. The best results under each backbone are in boldface.

| Dataset | Metric | LightGCN | +TED | FMLP-Rec | +TED |
|---|---|---|---|---|---|
| Beauty | NDCG@10 | 0.0311 | 0.0399 | 0.0583 | 0.0596 |
| Beauty | NDCG@20 | 0.0379 | 0.0484 | 0.0675 | 0.0684 |
| Sports | NDCG@10 | 0.0212 | 0.0246 | 0.0346 | 0.0356 |
| Sports | NDCG@20 | 0.0260 | 0.0298 | 0.0409 | 0.0423 |
| Toys | NDCG@10 | 0.0311 | 0.0388 | 0.0659 | 0.0675 |
| Toys | NDCG@20 | 0.0370 | 0.0459 | 0.0743 | 0.0762 |
| Yelp | NDCG@10 | 0.0246 | 0.0236 | 0.0225 | 0.0226 |
| Yelp | NDCG@20 | 0.0323 | 0.0312 | 0.0294 | 0.0296 |
Figure 5. Performance of three methods w.r.t. item transition frequency on two datasets (panels (a)-(c)). MQSA-TED outperforms MQSA on test samples with frequent transitions and outperforms SASRec-TED on test samples lacking transition instances.

4.4. Comparison with Graph Methods (RQ3)

We also compare the proposed Transition-Aware Embedding Distillation (TED) module with graph-based regularization methods in Table 3. The results show that most of the methods can improve the performance of MQSA. Specifically, GES performs better than GraReg on Amazon datasets but worse on the Yelp dataset. Moreover, our proposed TED method outperforms GES and GraReg in most cases, indicating the effectiveness of learning global and accurate item transition patterns by knowledge distillation.

4.5. TED for Various Base Models (RQ4)

We also compare the performance of various base models with and without our proposed Transition-Aware Embedding Distillation (TED) module in Table 4. The results demonstrate that TED can act as a domain adapter, which enhances the performance of the non-sequential method LightGCN on sequential recommendation tasks. Furthermore, the incorporation of TED yields remarkable improvement for the state-of-the-art sequential recommendation method FMLP-Rec. Notably, TED shows limited effects on the Yelp dataset due to the weak sequentiality of user interactions. In other words, transitional signals are less important in this dataset.

4.6. Performance Comparison by Groups (RQ5)

Figure 5 presents the performance of different methods on test samples grouped by transition frequencies observed in the training data from the validation item (the second last item) to the test item (the last item). We evaluate the SASRec model with the Transition-Aware Embedding Distillation (SASRec-TED), the Multi-Query Self-Attention model (MQSA), and the full MQSA-TED model. Compared with the results in Figure 1, the improvement of MQSA over SASRec mainly results from the improvement on test samples lacking transition instances. However, the integration of long-query self-attention may hurt the performance on test samples with frequent transitions. By incorporating the TED module as a calibrator, MQSA-TED performs better than MQSA mainly on test samples with high transition frequencies. As MQSA and TED focus on collaborative and transitional signals, respectively, their combination will result in a reasonable balance between the two signal types.
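The grouping used in this analysis can be sketched as follows: count consecutive item pairs in the training sequences, then look up the count for each user's (validation item, test item) pair. The bucket names and thresholds below are hypothetical and chosen only for illustration; the actual groups are those shown in Figure 5.

```python
from collections import Counter


def transition_counts(train_sequences):
    """Count consecutive (previous item, next item) pairs in training sequences."""
    counts = Counter()
    for seq in train_sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            counts[(prev_item, next_item)] += 1
    return counts


def bucket(freq):
    """Assign a transition frequency to a group (thresholds are hypothetical)."""
    if freq == 0:
        return "none"
    return "rare" if freq < 3 else "frequent"


def group_test_samples(test_pairs, counts):
    """Map each (validation item, test item) pair to its frequency group."""
    return {pair: bucket(counts[pair]) for pair in test_pairs}


toy_train = [[1, 2, 3], [1, 2, 4], [5, 2, 3], [6, 2, 3]]
toy_groups = group_test_samples([(2, 3), (2, 4), (3, 1)], transition_counts(toy_train))
```

In the toy data the pair `(2, 3)` occurs three times and is grouped as frequent, `(2, 4)` occurs once, and `(3, 1)` never occurs in training.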

5. Related Work

Sequential Recommendation. Sequential recommendation methods aim to capture dynamic user preferences (He et al., 2017; Chen et al., 2018; Wang et al., 2020). Early efforts adopt Markov Chains (MCs) to learn item transition patterns, such as FPMC (Rendle et al., 2010), which combines the Matrix Factorization (MF) with the first-order Markov chain. Fossil (He and McAuley, 2016a) fuses the similarity-based model with high-order Markov chains. Recent efforts incorporate deep models, such as GRU4Rec (Hidasi et al., 2015) and NARM (Li et al., 2017), which employ Gated Recurrent Units (GRU). Caser (Tang and Wang, 2018a) uses horizontal and vertical convolutional filters. SASRec (Kang and McAuley, 2018) and BERT4Rec (Sun et al., 2019) use unidirectional and bidirectional self-attention modules in Transformer (Vaswani et al., 2017), respectively. FMLP-Rec (Zhou et al., 2022) is an all-MLP model with learnable filters in the frequency domain. However, previous efforts typically follow an auto-regressive framework, which neglects the valuable information in global item transition patterns. In this paper, we propose a Transition-Aware Embedding Distillation module to memorize and leverage the transitional signals.

Self-Attention in Recommendation. The Transformer architecture has achieved remarkable success in modeling long-term dependencies in Natural Language Processing (NLP) (Vaswani et al., 2017; Devlin et al., 2018). Consequently, recent efforts employ such architecture for sequential recommendation tasks (Ren et al., 2020; Qiu et al., 2022). For example, SASRec (Kang and McAuley, 2018) and BERT4Rec (Sun et al., 2019) use unidirectional and bidirectional self-attention modules, respectively. In addition, some efforts aim to enhance self-attention-based models by incorporating side information. For instance, TiSASRec (Li et al., 2020) incorporates time interval embeddings into SASRec. S$^{3}$-Rec (Zhou et al., 2020) introduces self-supervision tasks to learn correlations among attributes, items, sub-sequences, and sequences based on mutual information maximization. SASRec-GES (Zhu et al., 2021) employs graph convolutions on sequential and semantic item graphs to generate smoothed item embeddings. Efforts have also been made to improve the efficiency or effectiveness of SASRec (Li et al., 2021). CL4SRec (Xie et al., 2022) uses contrastive learning to derive self-supervision signals from user interaction sequences. Despite these advances, previous studies paid less attention to the limitations of the conventional self-attention architecture in capturing collaborative signals. In this paper, we propose a Multi-Query Self-Attention method that combines long and short-query self-attentions to enhance its effectiveness in modeling user collaborations.

Knowledge Distillation in Recommendation. Knowledge distillation is a widely-used model compression technique in various fields (Hinton et al., 2015), where a student model is trained with both a ground-truth label distribution and a smoothed pseudo-label distribution generated by a teacher model. Recent efforts apply this method to recommender systems, such as Ranking Distillation (Tang and Wang, 2018b), which trains a student model to rank items based on both training data and teacher model predictions. Collaborative Distillation (Lee et al., 2019) uses probabilistic rank-aware sampling with teacher-guided and student-guided training strategies. Other existing methods aim to distill knowledge from side information into recommendation models to enhance their performance and interpretability. For instance, SCML (Zhu et al., 2020) combines the item-based CF model with the social CF model through embedding-level and output-level mutual learning. DESIGN (Tao et al., 2022) integrates information from the user-item interaction graph and the user-user social graph and makes them learn from each other. Zhang et al. (Zhang et al., 2020b) propose a joint learning framework to distill structured knowledge from a path-based model into a neural model. However, knowledge distillation has received less attention in the context of sequential recommendation. In this paper, we distill the knowledge of item transitions into sequential recommendation models to enhance their performances.

6. Conclusion

In this paper, we addressed the limitations of existing sequential recommendation methods in capturing collaborative and transitional signals in user interaction sequences. To overcome these limitations, we proposed a new method called Multi-Query Self-Attention with Transition-Aware Embedding Distillation (MQSA-TED). To capture collaborative signals, we introduced an $L$-query self-attention module using flexible window sizes for attention queries and combined long and short-query self-attentions. In addition, we developed a transition-aware embedding distillation module that distills global item transition patterns into item embeddings, enabling the model to memorize and leverage transitional signals. Experimental results on four real-world datasets demonstrated the effectiveness of both modules in improving sequential recommendation performance.

Acknowledgements.
This research was partly supported by a CIHR-NSERC-SSHRC Healthy Cities Research Training Platform grant of Canada.

References

  • Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the eleventh ACM international conference on web search and data mining. 108–116.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Gao et al. (2022) Chongming Gao, Shijun Li, Yuan Zhang, Jiawei Chen, Biao Li, Wenqiang Lei, Peng Jiang, and Xiangnan He. 2022. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3953–3957.
  • He et al. (2017) Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017. Translation-based recommendation. In Proceedings of the eleventh ACM conference on recommender systems. 161–169.
  • He and McAuley (2016a) Ruining He and Julian McAuley. 2016a. Fusing similarity models with markov chains for sparse sequential recommendation. In 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 191–200.
  • He and McAuley (2016b) Ruining He and Julian McAuley. 2016b. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th international conference on world wide web. 507–517.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 639–648.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Krichene and Rendle (2022) Walid Krichene and Steffen Rendle. 2022. On sampled metrics for item recommendation. Commun. ACM 65, 7 (2022), 75–83.
  • Lee et al. (2019) Jae-woong Lee, Minjin Choi, Jongwuk Lee, and Hyunjung Shim. 2019. Collaborative distillation for top-N recommendation. In 2019 IEEE International Conference on Data Mining (ICDM). IEEE, 369–378.
  • Li et al. (2023) Fangyu Li, Shenbao Yu, Feng Zeng, and Fang Yang. 2023. Effective and Efficient Training for Sequential Recommendation Using Cumulative Cross-Entropy Loss. arXiv preprint arXiv:2301.00979 (2023).
  • Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
  • Li et al. (2020) Jiacheng Li, Yujie Wang, and Julian McAuley. 2020. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th ACM international conference on web search and data mining. 322–330.
  • Li et al. (2021) Yang Li, Tong Chen, Peng-Fei Zhang, and Hongzhi Yin. 2021. Lightweight self-attentive sequential recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 967–977.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th ACM SIGIR international conference on research and development in information retrieval. 43–52.
  • Qiu et al. (2022) Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2022. Contrastive learning for representation degeneration problem in sequential recommendation. In Proceedings of the fifteenth ACM international conference on web search and data mining. 813–823.
  • Ren et al. (2020) Ruiyang Ren, Zhaoyang Liu, Yaliang Li, Wayne Xin Zhao, Hui Wang, Bolin Ding, and Ji-Rong Wen. 2020. Sequential recommendation with self-attentive multi-adversarial network. In Proceedings of the 43rd ACM SIGIR international conference on research and development in information retrieval. 89–98.
  • Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on world wide web. 811–820.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
  • Tang and Wang (2018a) Jiaxi Tang and Ke Wang. 2018a. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
  • Tang and Wang (2018b) Jiaxi Tang and Ke Wang. 2018b. Ranking distillation: Learning compact ranking models with high performance for recommender system. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2289–2298.
  • Tao et al. (2022) Ye Tao, Ying Li, Su Zhang, Zhirong Hou, and Zhonghai Wu. 2022. Revisiting Graph based Social Recommendation: A Distillation Enhanced Social Graph Network. In Proceedings of the ACM Web Conference 2022. 2830–2838.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2022) Chenyang Wang, Yuanqing Yu, Weizhi Ma, Min Zhang, Chong Chen, Yiqun Liu, and Shao** Ma. 2022. Towards Representation Alignment and Uniformity in Collaborative Filtering. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1816–1825.
  • Wang et al. (2020) Chenyang Wang, Min Zhang, Weizhi Ma, Yiqun Liu, and Shaoping Ma. 2020. Make it a chorus: knowledge-and time-aware item modeling for sequential recommendation. In Proceedings of the 43rd ACM SIGIR International conference on research and development in Information Retrieval. 109–118.
  • Xie et al. (2022) Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Jiandong Zhang, Bolin Ding, and Bin Cui. 2022. Contrastive learning for sequential recommendation. In 2022 IEEE 38th international conference on data engineering (ICDE). IEEE, 1259–1273.
  • Zhang et al. (2020a) Yuan Zhang, Fei Sun, Xiaoyong Yang, Chen Xu, Wenwu Ou, and Yan Zhang. 2020a. Graph-based regularization on embedding layers for recommendation. ACM Transactions on Information Systems (TOIS) 39, 1 (2020), 1–27.
  • Zhang et al. (2020b) Yuan Zhang, Xiaoran Xu, Hanning Zhou, and Yan Zhang. 2020b. Distilling structured knowledge into embeddings for explainable and accurate recommendation. In Proceedings of the 13th ACM international conference on web search and data mining. 735–743.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1059–1068.
  • Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.
  • Zhou et al. (2022) Kun Zhou, Hui Yu, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM Web Conference 2022. 2388–2399.
  • Zhu et al. (2020) Tianyu Zhu, Guannan Liu, and Guoqing Chen. 2020. Social collaborative mutual learning for item recommendation. ACM Transactions on Knowledge Discovery from Data (TKDD) 14, 4 (2020), 1–19.
  • Zhu et al. (2021) Tianyu Zhu, Leilei Sun, and Guoqing Chen. 2021. Graph-based embedding smoothing for sequential recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 1 (2021), 496–508.