GLINT-RU: Gated Lightweight Intelligent Recurrent Units for Sequential Recommender Systems

Sheng Zhang (City University of Hong Kong), Maolin Wang (City University of Hong Kong), and Xiangyu Zhao (City University of Hong Kong)
(2024)
Abstract.

In the rapidly evolving field of artificial intelligence, transformer-based models have gained significant attention in Sequential Recommender Systems (SRSs), demonstrating remarkable proficiency in capturing user-item interactions. However, such attention-based frameworks incur substantial computational overhead and long inference time. To address this problem, this paper proposes GLINT-RU, a novel and efficient sequential recommendation framework that leverages a dense selective Gated Recurrent Unit (GRU) module to accelerate inference. It is a pioneering effort to further exploit the potential of efficient GRU modules in SRSs. The GRU module lies at the heart of GLINT-RU and plays a crucial role in substantially reducing both inference time and GPU memory usage. Through the integration of a dense selective gate, our framework captures both long-term and short-term item dependencies, enabling the adaptive generation of item scores. GLINT-RU further integrates a mixing block that enriches it with global user-item interaction information to improve recommendation quality. Moreover, we design a gated Multi-layer Perceptron (MLP) for our framework, in which the information is deeply filtered. Extensive experiments on three datasets highlight the effectiveness and efficiency of GLINT-RU. GLINT-RU achieves exceptional inference speed and prediction accuracy, outperforming existing baselines based on Recurrent Neural Networks (RNNs), Transformers, MLPs, and State Space Models (SSMs). These results establish a new standard in sequential recommendation and highlight the potential of GLINT-RU as a fresh approach in the realm of recommender systems. The implementation code is available at https://github.com/szhang-cityu/GLINT-RU.

Sequential Recommender Systems, Gated Recurrent Units, Efficient Model
CCS Concepts: Information systems → Recommender systems

1. INTRODUCTION

In this era of data explosion, Sequential Recommender Systems (SRSs) (Zhang et al., 2019; Yue et al., 2024; Liu et al., 2023a; Hidasi et al., 2015; Kang and McAuley, 2018; Liu et al., 2024; Li et al., 2022) have gained much attention for capturing users’ preferences within large volumes of sequential interaction data. GRU4Rec (Hidasi et al., 2015), one of the earliest session-based recommendation models, employs stacked Gated Recurrent Units (GRU) for item-to-item recommendations. However, RNN-based methods (Hidasi et al., 2015; Li et al., 2017) are fading from the recommendation realm due to their relatively lower accuracy. In recent years, transformer-based SRSs (Kang and McAuley, 2018; Sun et al., 2019) have become increasingly popular thanks to the powerful multi-head attention mechanism (Vaswani et al., 2017). They exhibit remarkable ability in capturing sequential interactions and delivering accurate predictions (Li et al., 2023). However, despite their effectiveness, current attention-based models suffer from substantial computational demands and long inference time, caused by the dot-product operation in the attention mechanism (Wang et al., 2020; Liu et al., 2023a).

To tackle the high computational cost of attention-based SRSs, much research has sought to reduce the attention mechanism to linear computational complexity. LinRec (Liu et al., 2023a) changes the order of the dot product between the query and key matrices by designing a special mapping, dramatically reducing inference time. LightSAN (Fan et al., 2021) projects historical interactions into shorter interest representations, thereby reducing the computational complexity of transformers to a linear scale. Mamba4Rec (Liu et al., 2024) outperforms existing attention mechanisms by utilizing one efficient Mamba block based on a selective SSM variant (i.e., Mamba (Gu and Dao, 2023)), within which a structured state tensor is used to address long-range dependencies. SMLP4Rec (Gao et al., 2024) is an efficient pure Multi-layer Perceptron (MLP) framework that achieves fast inference by reshaping input sequence tensors, demonstrating the potential of MLP-based models (Li et al., 2022; Zhou et al., 2022; Long et al., 2024) for sequential recommendation tasks.

However, to achieve high performance, transformer-based recommendation requires deeply stacked transformers. Even with a linearized attention mechanism, the inference speed of such a large transformer architecture may not meet the desired efficiency standards (Zhang et al., 2024). Furthermore, the efficiency of the MLP architecture is adversely affected by the reshaping of large tensors, which is notably time-consuming. GRU4Rec (Hidasi et al., 2015) is also an efficient framework for recommendation tasks, but it is limited in learning complex user behaviors. In this paper, we aim to further improve the efficiency of recommendation and to tackle the low accuracy of traditional RNN-based frameworks.

To further reduce resource consumption, accelerate inference, and enhance model performance, we propose a novel efficient SRS framework named Gated Lightweight IntelligeNT Recurrent Units (GLINT-RU). It employs various gate mechanisms in appropriate positions to fully perceive the data environment and filter or select information automatically (Zhang et al., 2024; Chang et al., 2023; De et al., 2024). We introduce an expert mixing block that captures long-/short-term dependencies via a GRU and utilizes linear attention to capture global interaction information between users and items (Jacobs et al., 1991; Masoudnia and Ebrahimpour, 2014). This strategy not only supplements the sequential information but also improves inference speed, owing to the linear computational complexity of the parallel GRU and linear attention mechanisms. Additionally, we implement a dense selective GRU, which adaptively selects the output of the GRU and considers the connections among adjacent items. It leverages the gate mechanism to select crossed features and extracts short-duration patterns to refine the model’s understanding of user behavior dynamics. Moreover, a gated MLP block is utilized to select among the outputs of the expert mixing block, deeply filtering the information based on the data environment.

We summarize the major contributions of our work as:

  • In this paper, we introduce GLINT-RU, a novel and efficient recommendation framework that achieves remarkable inference speed with a streamlined architecture requiring only a single layer, and exhibits state-of-the-art performance. It is an advanced model that captures complex user-item interactions through mixing blocks, which substantially improves on the performance of traditional sequential recommendation models.

  • We introduce a dense selective GRU module, which not only incorporates connections between adjacent items, but also empowers the model with the capability to selectively learn sequential information. The integration of this advanced GRU module into the model markedly elevates its performance, establishing a new standard for efficient recommender systems.

  • We implement a gated MLP block, allowing for deep filtering of dense information by automatically perceiving the data environment. This mechanism enables our model to be more flexible and adaptive to complex sequential user behaviors.

2. PRELIMINARIES

In this section we will briefly introduce our recommendation task, and then introduce the basic efficient GRU and linear attention modules used in our framework.

2.1. Problem Statement

For a sequential recommendation task, we have a set of users $\mathcal{U}=\{u_{1},u_{2},\dots,u_{|\mathcal{U}|}\}$ who have historical interactions with a set of items $\mathcal{V}=\{v_{1},v_{2},\dots,v_{|\mathcal{V}|}\}$. Among these users, the $i$-th user has a preferred item sequence denoted as $s_{i}=[v_{1}^{(i)},v_{2}^{(i)},\dots,v_{n_{i}}^{(i)}]$, where $n_{i}$ is the length of the item list that the $i$-th user interacts with. Our goal is to design an efficient framework that predicts the next item with which the user is most likely to interact.
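As a minimal illustration of this setup, the sketch below represents each sequence $s_i$ as a chronologically ordered list of item ids and enumerates (history, next item) pairs. The toy data, the helper name `next_item_examples`, and the choice to use every prefix as an example are assumptions made for illustration, not details taken from this paper.

```python
from typing import Dict, List, Tuple


def next_item_examples(seq: List[int]) -> List[Tuple[List[int], int]]:
    """Turn one interaction sequence s_i into (history, next item) pairs."""
    return [(seq[:t], seq[t]) for t in range(1, len(seq))]


# Hypothetical toy data: user id -> chronologically ordered item ids.
sequences: Dict[str, List[int]] = {
    "u1": [3, 7, 7, 2],
    "u2": [5, 1],
}

# For each user, the model sees a history and must predict the item that follows.
examples = {user: next_item_examples(seq) for user, seq in sequences.items()}
```

Under this assumed setup, a sequence of length $n_i$ yields $n_i - 1$ prediction targets.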

2.2. Gated Recurrent Units

As an essential part of the GLINT-RU framework, the GRU (Chung et al., 2014) module contributes to the recommendation task by capturing long-term dependencies among the items in the sequence and dynamically adjusting its memory content across different parts of the sequence (Dey and Salem, 2017; Shen et al., 2018). The update mechanism of a GRU cell can be formalized as follows:

$$
\begin{aligned}
\boldsymbol{z}_{t} &= \sigma(\boldsymbol{W}_{z}\cdot[\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]+\boldsymbol{b}_{z}),\\
\boldsymbol{r}_{t} &= \sigma(\boldsymbol{W}_{r}\cdot[\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]+\boldsymbol{b}_{r}),\\
\tilde{\boldsymbol{h}}_{t} &= \tanh(\boldsymbol{W}\cdot[\boldsymbol{r}_{t}*\boldsymbol{h}_{t-1},\boldsymbol{x}_{t}]+\boldsymbol{b}),\\
\boldsymbol{h}_{t} &= \boldsymbol{z}_{t}*\boldsymbol{h}_{t-1}+(1-\boldsymbol{z}_{t})*\tilde{\boldsymbol{h}}_{t},
\end{aligned}
\tag{1}
$$

where $\sigma(\cdot)$ is the sigmoid activation function, $\boldsymbol{x}_{t}$ is the input of the GRU module at the $t$-th time step, $\boldsymbol{h}_{t}$ represents the $t$-th hidden state, and $\boldsymbol{z}_{t}$ and $\boldsymbol{r}_{t}$ are the update gate and the reset gate, respectively. $\boldsymbol{b}_{z},\boldsymbol{b}_{r},\boldsymbol{b}$ are biases, and $\boldsymbol{W}_{z},\boldsymbol{W}_{r},\boldsymbol{W}$ are trainable weight matrices. As shown in Eq. (1), the GRU uses the update gate to control how much information from the previous hidden state is retained in the current time step, while the reset gate controls the information that should be forgotten.
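To make Eq. (1) concrete, the following NumPy sketch implements one GRU step and unrolls it over a short sequence. The function names, the random initialization, and the toy dimensions are illustrative assumptions; a practical implementation would normally rely on the GRU layer of a deep learning framework.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_cell(x_t, h_prev, params):
    """One GRU step following Eq. (1)."""
    W_z, b_z, W_r, b_r, W, b = params
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx + b_z)                 # update gate
    r_t = sigmoid(W_r @ hx + b_r)                 # reset gate
    rhx = np.concatenate([r_t * h_prev, x_t])     # [r_t * h_{t-1}, x_t]
    h_tilde = np.tanh(W @ rhx + b)                # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_tilde   # new hidden state h_t


def init_params(d_in, d_hid, rng):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    shape = (d_hid, d_hid + d_in)
    W_z, W_r, W = (rng.normal(scale=0.1, size=shape) for _ in range(3))
    b_z, b_r, b = (np.zeros(d_hid) for _ in range(3))
    return (W_z, b_z, W_r, b_r, W, b)


rng = np.random.default_rng(0)
d_in, d_hid, N = 4, 3, 5
params = init_params(d_in, d_hid, rng)
X = rng.normal(size=(N, d_in))       # toy input sequence x_1, ..., x_N

h = np.zeros(d_hid)                  # initial hidden state h_0
states = []
for x_t in X:                        # the recurrence is inherently sequential
    h = gru_cell(x_t, h, params)
    states.append(h)
H = np.stack(states)                 # shape (N, d_hid)
```

Because $\boldsymbol{h}_{t}$ is an element-wise convex combination of $\boldsymbol{h}_{t-1}$ and a $\tanh$ output, every entry of `H` stays inside $(-1, 1)$ when the initial state is zero. An update gate close to 1 simply copies the previous hidden state forward.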

The GRU module, equipped with update and reset gates in sequential GRU cells, is adept at capturing the relationships among the items throughout the sequence while maintaining a relatively low computational complexity. However, hidden states at different positions in a GRU cannot interact directly, and each hidden state is primarily encoded from preceding elements. This restricts the representational capacity of GRU-based recommendation models, particularly in capturing complex item dependencies across the entire sequence.

2.3. Linear Attention Mechanism

The attention mechanism, the powerful core of the transformer structure, exhibits strong performance in learning sequence interactions in recommendation tasks. However, the high computational cost of the dot product between the query matrix $\boldsymbol{Q}$ and the key matrix $\boldsymbol{K}$ substantially lowers the inference speed of transformer-based SRSs, especially when the sequence length $N$ is much larger than the hidden size $d$ (Shen et al., 2021; Wang et al., 2020). To tackle this issue, the linear attention mechanism (Liu et al., 2023a) designs a special mapping function to change the order of the dot product and reduce the computational complexity to $\mathcal{O}(Nd^{2})$. The linear attention mechanism can be written as:

$$
\boldsymbol{A}^{\prime}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\mathcal{X}_{1}\left(\mathrm{elu}(\boldsymbol{Q})\right)\left(\mathcal{X}_{2}\left(\mathrm{elu}(\boldsymbol{K})\right)^{\mathrm{T}}\boldsymbol{V}\right),
\tag{2}
$$

where $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$ are row-wise and column-wise $L_{2}$ normalization mappings, respectively, $\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}$ are learnable query, key, and value matrices, and $\boldsymbol{A}^{\prime}$ is the output attention score. This approach mitigates the issue that the softmax layer concentrates on the scores of merely a few positions, enlarging the information capacity of the attention mechanism (Liu et al., 2023a). By implementing linear attention, our GLINT-RU framework is capable of learning interactions between items in long sequences.
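The reordering in Eq. (2) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: "row-wise" is read as normalizing each row of $\mathrm{elu}(\boldsymbol{Q})$ and "column-wise" as normalizing each column of $\mathrm{elu}(\boldsymbol{K})$, ELU uses $\alpha = 1$, and $\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}$ are passed in directly instead of being produced by learned projections. The helper `quadratic_order` is hypothetical and exists only to compare the two multiplication orders.

```python
import numpy as np


def elu(a):
    # ELU with alpha = 1: identity for a > 0, exp(a) - 1 otherwise.
    return np.where(a > 0, a, np.exp(np.minimum(a, 0.0)) - 1.0)


def l2_normalize(a, axis, eps=1e-12):
    return a / (np.linalg.norm(a, axis=axis, keepdims=True) + eps)


def linear_attention(Q, K, V):
    """Eq. (2): X1(elu(Q)) @ (X2(elu(K)).T @ V), costing O(N d^2)."""
    Qn = l2_normalize(elu(Q), axis=1)   # X1: each row has unit L2 norm
    Kn = l2_normalize(elu(K), axis=0)   # X2: each column has unit L2 norm
    return Qn @ (Kn.T @ V)              # (N, d) @ ((d, N) @ (N, d))


def quadratic_order(Q, K, V):
    """Same product evaluated as (Q K^T) V, which builds an N x N matrix."""
    Qn = l2_normalize(elu(Q), axis=1)
    Kn = l2_normalize(elu(K), axis=0)
    return (Qn @ Kn.T) @ V


rng = np.random.default_rng(0)
N, d = 50, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
A = linear_attention(Q, K, V)           # shape (N, d)
```

Both orderings give the same result by associativity of matrix multiplication. The linear ordering only ever forms a $d \times d$ intermediate, which is where the $\mathcal{O}(Nd^{2})$ cost comes from.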

3. METHODOLOGY

Figure 1. Framework of proposed GLINT-RU. The core part of this framework is the Dense Selective GRU, where hidden states are deeply selected and aggregated. Gate mechanisms including expert mixing gate, dense selective gate, and MLP gate are all employed in appropriate positions to deeply filter information.

3.1. Framework Overview

Many existing recommendation frameworks depend on the transformer structure (Sun et al., 2019; Kang and McAuley, 2018; Liu et al., 2023a), which incurs substantial computational overhead and low inference speed. Restricted by the large computational complexity of stacked transformers, linear attention-based models have approached a plateau in minimizing inference time and resource consumption. In contrast, we propose an advanced recommendation framework that integrates the linear attention mechanism with an efficient dense selective GRU module, which further reduces the computational cost compared with stacked linear transformers and SSM-based models. This dense selective GRU module also enables our framework to understand complex long-/short-term user behaviors while substantially reducing computational cost and inference time. Figure 1 shows the structure of our GLINT-RU framework.

GLINT-RU framework integrates an expert mixing block for mixing sequential information from the dense selective GRU expert and the linear attention expert, and a gated MLP block for further learning and filtering complex user behaviors.

In the expert mixing block, the dense selective GRU module is employed to capture the long-term dependencies and local connections among the items, and selectively learn the sequential information. In addition, the linear attention expert is responsible for modeling item interactions from the user. By combining these two powerful expert modules, our GLINT-RU is capable of adaptively learning dependencies and relevant items from the sequence, which gains deeper insights into the complex user behaviors.

After the user-item interactions are selectively learned by the mixing block, the item scores are passed to the gated MLP block, where the information is filtered according to the data environment. The framework employs various gates in appropriate positions to deeply filter information, improving the model’s flexibility and its ability to perceive and select information.

3.2. Item Embedding Layer

For sequential recommendation tasks, information on the items that users interact with should be encoded into tensors through the embedding layer (Zhao et al., 2023). We denote the length of the input user-item interaction sequence as $N$ and the embedding size as $d$. For an interaction sequence $s_{i}=[v_{1},v_{2},\dots,v_{n},\dots,v_{n_{i}}]$, the $n$-th item $v_{n}\in\mathbb{R}^{D_{n}}$ can be projected into the representation $\boldsymbol{e}_{n}$ by the following formulation:

$$
\boldsymbol{e}_{n}=\boldsymbol{W}_{n}v_{n},
\tag{3}
$$

where $\boldsymbol{W}_{n}\in\mathbb{R}^{d\times D_{n}}$ is a trainable weight matrix. The embedding layer outputs the encoded item sequence in a tensor:

$$
\boldsymbol{E}=[\boldsymbol{e}_{1},\boldsymbol{e}_{2},\cdots,\boldsymbol{e}_{N}]^{\mathrm{T}}.
\tag{4}
$$
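A small NumPy sketch of Eqs. (3) and (4) is given below. It assumes that each $v_{n}$ is a one-hot vector over a shared item vocabulary and that a single projection matrix is shared by all positions; both are simplifying assumptions for illustration, as are the toy sizes and the helper `one_hot`.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, d = 6, 4                      # hypothetical vocabulary and embedding sizes
W = rng.normal(size=(d, num_items))      # trainable projection of shape d x D


def one_hot(item_id, size):
    v = np.zeros(size)
    v[item_id] = 1.0
    return v


seq = [2, 5, 0]                          # toy interaction sequence of item ids

# Eq. (3) for every item, then Eq. (4): stack the e_n as the rows of E.
E = np.stack([W @ one_hot(i, num_items) for i in seq])   # shape (N, d)

# With one-hot inputs the projection reduces to selecting columns of W,
# which is how embedding tables are usually implemented.
E_lookup = W[:, seq].T
```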

In traditional transformer-based models, positional embeddings are typically necessary because the attention mechanism lacks the inherent capability to encode temporal information  (Kang and McAuley, 2018). Uniquely, in this paper we employ the GRU module to capture sequential behaviors, which has already taken temporal information into consideration. Therefore, we decide not to add the positional embedding layer into the framework.

3.3. Dense Selective GRU

A standard GRU cell learns sequential data by conveying information from preceding cells. Although this mechanism is well suited to capturing long-term dependencies in the sequence, it predominantly focuses on information from previous items and may overlook the valuable context provided by adjacent items, which are often closely related in real-world applications. To address this limitation and extract local temporal features, we introduce the dense selective GRU as the core component of the GLINT-RU framework. This design aims to enhance the model’s flexibility and its ability to capture complex long-/short-term patterns in user behavior. By implementing the dense selective GRU, the computational complexity can be further reduced, and the recommendation accuracy of the GRU-based framework can be substantially improved.

3.3.1. Dense GRU module

To expose each GRU cell to local temporal features of user behavior, we introduce a temporal convolution layer, in which adjacent item information is adaptively fused before being fed into the GRU module:

$$
\begin{aligned}
\boldsymbol{L} &= \boldsymbol{X}\boldsymbol{W}_{0}+\boldsymbol{b}_{0},\\
\boldsymbol{C} &= \mathrm{TemporalConv1d}(\boldsymbol{L}),
\end{aligned}
\tag{5}
$$

where $\boldsymbol{X}=[\boldsymbol{x}_{1},\boldsymbol{x}_{2},\dots,\boldsymbol{x}_{N}]^{\mathrm{T}}$ is the input tensor with $d$ feature dimensions, $\boldsymbol{L}$ is the output of the linear layer, and $\boldsymbol{C}=[\boldsymbol{c}_{1},\boldsymbol{c}_{2},\dots,\boldsymbol{c}_{N}]^{\mathrm{T}}$ is the output of the convolution operation $\mathrm{TemporalConv1d}(\cdot)$ with $N$ steps. $\boldsymbol{W}_{0}$ and $\boldsymbol{b}_{0}$ are the weight matrix and bias, respectively. The size of the convolution kernel is set as $k$. According to Eq. (1), the GRU cell generates the output based on the input and the hidden state at the previous time step:

$$
\boldsymbol{h}_{n}=\mathrm{GRUCell}(\boldsymbol{c}_{n},\boldsymbol{h}_{n-1}),
\tag{6}
$$

where $\mathrm{GRUCell}(\cdot)$ is the basic unit in Eq. (1) that computes the current hidden state. We then generate new features by linearly combining the outputs of the GRU, as shown in Figure 1, capturing complex patterns of user preferences. These feature representations can be more effectively utilized for predicting the behaviors and preferences of users. The feature-crossing process can be written as:

$$
\Phi(\boldsymbol{H})=\boldsymbol{H}\boldsymbol{W}_{H}+\boldsymbol{b}_{H},
\tag{7}
$$

where $\boldsymbol{H}=[\boldsymbol{h}_{1},\boldsymbol{h}_{2},\dots,\boldsymbol{h}_{N}]^{\mathrm{T}}$ is the output of the GRU, $\boldsymbol{W}_{H}$ is the learnable weight matrix, and $\boldsymbol{b}_{H}$ is the bias. Although each input state $\boldsymbol{c}_{t}$ incorporates information from both preceding and subsequent items, each output hidden state is still determined by the hidden state of the preceding time step. Therefore, to capture the context information of the output hidden states, we apply a temporal convolution to the crossed features. This convolution layer extracts local temporal features to understand user behavior dynamics and enhances the predictive accuracy of our model:

$$
\boldsymbol{Y}=\mathrm{TemporalConv1d}(\mathcal{G}),
\tag{8}
$$

where $\mathcal{G}$ is the selective gate function (Section 3.3.2) and $\boldsymbol{Y}=[\boldsymbol{y}_{1},\boldsymbol{y}_{2},\dots,\boldsymbol{y}_{N}]^{\mathrm{T}}$ is the output matrix from the dense selective GRU module. The two convolution layers, together with the GRU cells, improve the density of the sequential information, enabling each hidden state to be learned from the behaviors of more input time steps.
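The effect of the temporal convolution on the receptive field can be illustrated with a short NumPy sketch. The paper does not spell out the padding scheme or channel structure of $\mathrm{TemporalConv1d}$ in this section, so the version below assumes a depthwise convolution with symmetric zero padding and an odd kernel size $k$, chosen so that each output step sees both preceding and subsequent items as described above. The released code may differ.

```python
import numpy as np


def temporal_conv1d(L, kernel, bias=None):
    """Depthwise 1-D convolution over the time axis with 'same' zero padding.

    L:      (N, d) sequence of feature vectors
    kernel: (k, d) one length-k filter per feature channel, with k odd
    """
    k, d = kernel.shape
    N = L.shape[0]
    pad = k // 2
    padded = np.pad(L, ((pad, pad), (0, 0)))      # zero-pad the time axis only
    out = np.zeros((N, d), dtype=float)
    for j in range(k):
        out += padded[j:j + N] * kernel[j]        # tap j reads position n + j - pad
    if bias is not None:
        out = out + bias
    return out


# A unit impulse at step 3 shows which output steps a single item can reach.
N, d, k = 7, 2, 3
impulse = np.zeros((N, d))
impulse[3] = 1.0
kernel = np.ones((k, d))

once = temporal_conv1d(impulse, kernel)    # nonzero at steps 2, 3, 4
twice = temporal_conv1d(once, kernel)      # nonzero at steps 1 through 5
```

In this toy setting one convolution spreads an item over $k$ neighboring steps and two stacked convolutions spread it over $2k-1$ steps. The GRU that sits between the two convolutions in the dense selective GRU is omitted here.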

3.3.2. Selective Gate

To filter the hidden information of the GRU module, we design a selective gate in which the outputs of the feature-crossing layer are selected based on the input state of the GRU. The selective gate weights are generated by two small linear layers with a SiLU activation function (Elfwing et al., 2018), and we use them to select useful hidden states and filter information:

$$
\begin{aligned}
\boldsymbol{\delta}_{1}(\boldsymbol{C}) &= \boldsymbol{C}\boldsymbol{W}_{\delta}^{(1)}+\boldsymbol{b}_{\delta}^{(1)},\\
\mathcal{G}(\boldsymbol{H})=\boldsymbol{\Omega}(\boldsymbol{\delta}_{1}(\boldsymbol{C}),\boldsymbol{H}) &= (\mathrm{SiLU}(\boldsymbol{\delta}_{1}(\boldsymbol{C}))\boldsymbol{W}_{\Omega}^{(1)}+\boldsymbol{b}_{\Omega}^{(1)})\otimes\Phi(\boldsymbol{H}),
\end{aligned}
\tag{9}
$$

where $\boldsymbol{C}=[\boldsymbol{c}_{1},\boldsymbol{c}_{2},\dots,\boldsymbol{c}_{N}]^{\mathrm{T}}$ also serves as the input of the GRU module, $\boldsymbol{W}_{\delta}^{(1)}$ and $\boldsymbol{W}_{\Omega}^{(1)}$ are weight matrices, and $\boldsymbol{b}_{\delta}^{(1)}$ and $\boldsymbol{b}_{\Omega}^{(1)}$ are biases. With this gate mechanism, our GRU-based model becomes more flexible and gains the ability to perceive the data environment.
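A minimal NumPy sketch of the selective gate in Eq. (9), combined with the feature crossing of Eq. (7), is shown below. It assumes that $\otimes$ denotes an element-wise product and that all weight matrices are $d \times d$. The parameter dictionary, its key names, and the random toy inputs are hypothetical.

```python
import numpy as np


def silu(a):
    return a / (1.0 + np.exp(-a))     # a * sigmoid(a)


def selective_gate(C, H, p):
    """Eq. (9): gate the crossed GRU outputs Phi(H) with weights computed from C."""
    delta1 = C @ p["W_delta"] + p["b_delta"]             # first linear layer
    gate = silu(delta1) @ p["W_omega"] + p["b_omega"]    # second linear layer
    phi_H = H @ p["W_H"] + p["b_H"]                      # feature crossing, Eq. (7)
    return gate * phi_H                                  # element-wise selection


rng = np.random.default_rng(0)
N, d = 5, 4
C = rng.normal(size=(N, d))           # stands in for the GRU input from Eq. (5)
H = rng.normal(size=(N, d))           # stands in for the GRU output from Eq. (6)
p = {
    "W_delta": rng.normal(scale=0.1, size=(d, d)), "b_delta": np.zeros(d),
    "W_omega": rng.normal(scale=0.1, size=(d, d)), "b_omega": np.zeros(d),
    "W_H": rng.normal(scale=0.1, size=(d, d)), "b_H": np.zeros(d),
}
G = selective_gate(C, H, p)           # shape (N, d)
```

The gate weights depend only on $\boldsymbol{C}$, so the same GRU output can be passed or suppressed depending on the input context. Setting `W_omega` to zero and `b_omega` to one, for example, opens the gate fully and returns $\Phi(\boldsymbol{H})$ unchanged.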

3.4. Expert Mixing Block

Although a single GRU module can capture long-term dependencies for recommendation tasks, it lacks the ability to model global connections within the item sequence. By introducing the linear attention mechanism and mixing these two powerful experts, the computational complexity will be substantially reduced compared with transformer-based models, and more information about item connections and dependencies will be considered. Moreover, the two employed experts are parallel in our framework, which further improves the model efficiency.

In real applications, data conditions can vary considerably. The GRU is naturally suited to sequential data and performs effectively in sequential recommendation tasks that exhibit strong temporal dependencies, while attention dynamically focuses on relevant items in the sequence. To adapt to complex data conditions, we assign appropriate weights to the two experts through a mixing gate:

$$
\begin{aligned}
\boldsymbol{M} &= \alpha_{1}^{(t)}\boldsymbol{A}^{\prime}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})+\alpha_{2}^{(t)}\boldsymbol{Y},\\
\alpha_{i}^{(t)} &= \mathrm{softmax}(\alpha_{i}^{(t-1)})=\frac{e^{\alpha_{i}^{(t-1)}}}{e^{\alpha_{1}^{(t-1)}}+e^{\alpha_{2}^{(t-1)}}},\ \ i=1,2,
\end{aligned}
\tag{10}
$$

where $\alpha_{1}^{(t)}, \alpha_{2}^{(t)}$ are the trainable mixing parameters at the $t$-th training iteration. We then filter this output with another data-aware gate, which selects the outputs based on the original input data batch:

$$
\begin{aligned}
\boldsymbol{\delta}_{2}(\boldsymbol{X}) &= \boldsymbol{X} \boldsymbol{W}_{\delta}^{(2)} + \boldsymbol{b}_{\delta}^{(2)}, \\
\boldsymbol{Z} = \boldsymbol{\Omega}_{2}(\boldsymbol{\delta}_{2}(\boldsymbol{X}), \boldsymbol{M}) &= (\mathrm{GeLU}(\boldsymbol{\delta}_{2}(\boldsymbol{X}))) \otimes \boldsymbol{M},
\end{aligned}
\tag{11}
$$

where $\boldsymbol{X}$ is the input of the expert mixing block, $\boldsymbol{W}_{\delta}^{(2)}$ is the weight matrix of the linear layer, and $\boldsymbol{b}_{\delta}^{(2)}$ is its bias.
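As a rough illustration of Eqs. (10) and (11), the sketch below mixes two expert outputs with softmax-normalized weights and then applies a GeLU gate computed from the block input. It reads $\otimes$ as an element-wise product, which is an assumption, and it uses plain Python lists instead of a tensor library. The helper names (`softmax2`, `gelu`, `linear`, `mix_and_gate`) and the toy values are hypothetical and are not taken from the released implementation.

```python
import math


def softmax2(a1, a2):
    """Normalize two raw mixing parameters into weights that sum to 1."""
    m = max(a1, a2)  # subtract the max for numerical stability
    e1, e2 = math.exp(a1 - m), math.exp(a2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Row vector x (length d_in) times W (d_in x d_out), plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def mix_and_gate(X, A, Y, alpha, W, b):
    """Eq. (10): M = w1 * A + w2 * Y.  Eq. (11): Z = GeLU(X W + b) * M."""
    w1, w2 = softmax2(*alpha)
    Z = []
    for x_row, a_row, y_row in zip(X, A, Y):
        m_row = [w1 * a + w2 * y for a, y in zip(a_row, y_row)]
        gate = [gelu(g) for g in linear(x_row, W, b)]
        Z.append([g * m for g, m in zip(gate, m_row)])
    return Z


# Toy example: sequence length N = 2, hidden size d = 2 (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0]]   # block input
A = [[0.2, 0.4], [0.6, 0.8]]   # placeholder for the linear-attention expert output
Y = [[1.0, 1.0], [1.0, 1.0]]   # placeholder for the GRU expert output
W = [[1.0, 0.0], [0.0, 1.0]]   # identity gate weights, chosen for readability
b = [0.0, 0.0]
Z = mix_and_gate(X, A, Y, alpha=(0.0, 0.0), W=W, b=b)
```

With equal raw parameters, both experts receive weight 0.5. Positions where the gate input is zero are suppressed entirely, because GeLU(0) = 0.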

3.5. Gated MLP Block

Most existing frameworks, especially transformer-based models, utilize a two-layer feed forward network to capture the nonlinear relationships among features before giving predictions. To further enhance the performance of the model and augment useful features from the expert mixing block, we introduce the gated MLP block, which employs a gate mechanism again to deeply filter the information and generate item representations for predictions.

$$
\begin{aligned}
\boldsymbol{\delta}_{3}(\boldsymbol{Z}) &= \boldsymbol{Z} \boldsymbol{W}_{\delta}^{(3)} + \boldsymbol{b}_{\delta}^{(3)}, \\
\boldsymbol{P} = \boldsymbol{\Omega}_{2}(\boldsymbol{\delta}_{3}(\boldsymbol{Z}), \boldsymbol{Z}) &= (\mathrm{GeLU}(\boldsymbol{\delta}_{3}(\boldsymbol{Z}))) \otimes \boldsymbol{Z}, \\
\boldsymbol{R} &= \boldsymbol{P} \boldsymbol{W}_{o} + \boldsymbol{b}_{o},
\end{aligned}
\tag{12}
$$

where $\boldsymbol{Z}$ is the output of the expert mixing block, $\boldsymbol{P}$ denotes the output of the gated linear layer, $\boldsymbol{R}$ represents the item representation, $\boldsymbol{W}_{\delta}^{(3)}$ and $\boldsymbol{W}_{o}$ are weight matrices, and $\boldsymbol{b}_{\delta}^{(3)}$ and $\boldsymbol{b}_{o}$ are biases. The recommendation scores are computed from the item representations and the item embeddings, yielding the final prediction probability $\hat{y}_{i}$ of the $i$-th item:

$$
\hat{y}_{i} = \mathrm{softmax}(\boldsymbol{R}_{i} (\boldsymbol{e}_{i})^{\mathrm{T}}),
\tag{13}
$$

where $\boldsymbol{R}_{i}$ is the representation of the $i$-th item. Finally, GLINT-RU returns the top-$k$ items with the highest scores as recommendations.
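The gated MLP of Eq. (12) and the scoring step of Eq. (13) can be sketched for a single sequence position as follows. This is a minimal illustration in plain Python, assuming that $\otimes$ is an element-wise product and that the softmax is taken over the candidate item set. The function names and toy embeddings are hypothetical.

```python
import math


def gelu(x):
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, W, b):
    """Row vector x (length d_in) times W (d_in x d_out), plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def gated_mlp(z, W_delta, b_delta, W_o, b_o):
    """Eq. (12): P = GeLU(z W_delta + b_delta) * z, then R = P W_o + b_o."""
    gate = [gelu(g) for g in linear(z, W_delta, b_delta)]
    p = [g * v for g, v in zip(gate, z)]
    return linear(p, W_o, b_o)


def score_items(r, item_embeddings):
    """Eq. (13): softmax over dot products of r with each item embedding."""
    logits = [sum(a * e for a, e in zip(r, emb)) for emb in item_embeddings]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def top_k(probs, k):
    """Indices of the k highest-scoring items, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


# Toy example with hidden size d = 2 and three candidate items.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
z = [1.0, 2.0]                                   # one row of Z
r = gated_mlp(z, identity, zeros, identity, zeros)
items = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # illustrative item embeddings
probs = score_items(r, items)
recommended = top_k(probs, k=2)
```

In this toy setting the second item aligns best with the gated representation, so it is ranked first.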

3.6. Complexity Analysis

In this subsection, we will analyze the time complexity of our GLINT-RU, and illustrate why GLINT-RU has inherent superiority over other popular SRS models in model efficiency.

Assuming that the sequence length is $N$, the embedding size is $d$, and the kernel size for GLINT-RU is $k_{1}$, the time complexity of GLINT-RU is $\mathcal{O}((2k_{1}+12)Nd^{2})$. The complexity is calculated throughout the network, from the embedding layer to the prediction layer. GLINT-RU is more efficient than other models in the following aspects. Firstly, our model has significantly lower time complexity, as GLINT-RU is a highly parallel mixed network that needs only one layer to achieve high performance. Our framework utilizes parallel expert networks and employs an efficient GRU module to capture long-term dependencies, outperforming all the other models in theoretical complexity. Secondly, the traditional transformer-based model SASRec (Kang and McAuley, 2018) suffers from large computational complexity, especially when the sequence length $N$ is large. LinRec (Liu et al., 2023a) changes the dot product of the attention mechanism, substantially reducing the computational cost. However, LinRec still requires stacked transformers to achieve high performance. These transformer-based models require at least two layers to achieve satisfactory accuracy, while our GLINT-RU achieves outstanding performance with only one layer. Thirdly, Mamba4Rec utilizes a hardware-aware mechanism to accelerate training and inference (Gu and Dao, 2023), but its improvement in recommendation performance and efficiency is not stable, which will be further verified via our experiments.
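To make the scaling behavior concrete, the following sketch evaluates the leading term $(2k_{1}+12)Nd^{2}$ for growing sequence lengths and contrasts it with the $N^{2}d$ term of standard dot-product self-attention. Since big-O notation drops constants, the numbers are only a proxy for growth rates, not measured costs. The function names and the kernel size $k_{1}=3$ are assumptions made for illustration.

```python
def glint_ru_cost(N, d, k1):
    """Leading-term count (2 * k1 + 12) * N * d^2 stated in Section 3.6."""
    return (2 * k1 + 12) * N * d * d


def softmax_attention_term(N, d):
    """The N^2 * d term of standard dot-product self-attention, per layer."""
    return N * N * d


d = 128   # hidden size used for ML-1M in Section 4.3
k1 = 3    # hypothetical kernel size, chosen only for this illustration

for N in (200, 400, 800):
    print(N, glint_ru_cost(N, d, k1), softmax_attention_term(N, d))
```

Doubling $N$ doubles the first quantity but quadruples the second, which is why the gap widens on long sequences.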

4. EXPERIMENTS

In this section, we conduct extensive experiments to show the effectiveness and efficiency of our GLINT-RU framework. After introducing the implementation details, we analyze the experimental results in detail. The experiments are designed to answer the following research questions:

  • RQ1: How does GLINT-RU framework perform compared with other state-of-the-art SRS baseline models?

  • RQ2: To what extent does GLINT-RU improve model efficiency compared with other state-of-the-art SRS frameworks?

  • RQ3: How do dense selective GRU and linear attention mechanism contribute to GLINT-RU?

  • RQ4: How does the hyperparameter setting affect the performance of GLINT-RU?

4.1. Datasets and Evaluation Metrics

We evaluate GLINT-RU on three benchmark datasets: ML-1M (https://grouplens.org/datasets/movielens/), Amazon-Beauty, and Amazon-Video-Games (https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews).

Statistical information of the three datasets is shown in Table 1. We adopt Recall, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) as the evaluation metrics for our experiments. The interactions are grouped by users chronologically, and the datasets are split by the leave-one-out strategy (Liu et al., 2023a).
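The evaluation protocol can be sketched as follows. With one held-out target per user, the ideal DCG is 1, so NDCG@k reduces to $1/\log_{2}(\mathrm{rank}+1)$ when the target appears in the top $k$. The split convention used here (last item for testing, second-to-last for validation) is a common reading of leave-one-out and is an assumption, as are all function names. Per-user values would be averaged over users to obtain the reported metrics.

```python
import math


def leave_one_out(sequence):
    """Split one user's chronological items into (train, valid, test)."""
    if len(sequence) < 3:
        raise ValueError("need at least three interactions to split")
    return sequence[:-2], sequence[-2], sequence[-1]


def rank_of(target, ranked_items):
    """1-based rank of target in ranked_items, or None if it is absent."""
    for position, item in enumerate(ranked_items, start=1):
        if item == target:
            return position
    return None


def recall_at_k(rank, k):
    """1 if the target is within the top k, else 0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def mrr_at_k(rank, k):
    """Reciprocal rank of the target, or 0 if it is outside the top k."""
    return 1.0 / rank if rank is not None and rank <= k else 0.0


def ndcg_at_k(rank, k):
    """NDCG with a single relevant item: 1 / log2(rank + 1) inside the top k."""
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0


# Toy example: item IDs are arbitrary.
history = [10, 11, 12, 13, 14]
train, valid_target, test_target = leave_one_out(history)
ranked = [7, 14, 3]                 # hypothetical model ranking
rank = rank_of(test_target, ranked)
metrics = (recall_at_k(rank, 10), mrr_at_k(rank, 10), ndcg_at_k(rank, 10))
```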

4.2. Baselines

To demonstrate the effectiveness and efficiency of GLINT-RU, we make comparisons between GLINT-RU and state-of-the-art effective and efficient RNN-, attention-, MLP- and SSM-based baselines.

Traditional Models. (1) GRU4Rec (Hidasi et al., 2015): utilizes GRUs to capture sequential dependencies within user interaction data for session-based recommendations. (2) BERT4Rec (Sun et al., 2019): adapts the Bidirectional Encoder Representations from Transformers (BERT) architecture to model user behaviors for personalized recommendation. (3) SASRec (Kang and McAuley, 2018): captures long-term and short-term user preferences by applying a multi-head attention mechanism to learn representations of user interaction sequences.

Efficient Models. (1) LinRec (Liu et al., 2023a): substantially reduces the computational cost by changing the dot product of the attention mechanism in transformer-based models. We select SASRec as the backbone of LinRec. (2) SMLP4Rec (Gao et al., 2024): uses a tri-directional fusion scheme to learn correlations on the sequence, channel, and feature dimensions efficiently. (3) Mamba4Rec (Liu et al., 2024): explores the potential of selective SSMs for efficient sequential recommendation.

Table 1. Statistical Information of Adopted Datasets.

| Datasets | # Users | # Items | # Interactions | Avg. Length | Sparsity |
| --- | --- | --- | --- | --- | --- |
| ML-1M | 6,041 | 3,707 | 1,000,209 | 165.60 | 95.53% |
| Beauty | 22,364 | 12,102 | 198,502 | 8.88 | 99.93% |
| Video Games | 24,304 | 10,673 | 231,780 | 9.54 | 99.91% |

Table 2. Overall performance comparison between GLINT-RU and baselines.

| Models | ML-1M Recall@10 | ML-1M MRR@10 | ML-1M NDCG@10 | Amazon-Beauty Recall@10 | Amazon-Beauty MRR@10 | Amazon-Beauty NDCG@10 | Amazon-Video-Games Recall@10 | Amazon-Video-Games MRR@10 | Amazon-Video-Games NDCG@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GRU4Rec | 0.6954 | 0.4055 | 0.4748 | 0.3851 | 0.1891 | 0.2351 | 0.6028 | 0.2929 | 0.3660 |
| BERT4Rec | 0.7119 | 0.4041 | 0.4776 | 0.3478 | 0.1584 | 0.2027 | 0.5490 | 0.2541 | 0.2916 |
| SASRec | 0.7205 | 0.4251 | 0.4958 | 0.4332 | 0.2325 | 0.2798 | 0.6459 | 0.3404 | 0.4128 |
| LinRec | 0.7184 | 0.4316 | 0.5002 | 0.4270 | 0.2314 | 0.2775 | 0.6384 | 0.3355 | 0.4073 |
| SMLP4Rec | 0.6753 | 0.3870 | 0.4558 | 0.4457 | 0.2408 | 0.2891 | 0.6480 | 0.3484 | 0.4195 |
| Mamba4Rec | 0.7238 | 0.4368 | 0.5054 | 0.4233 | 0.2213 | 0.2689 | 0.6488 | 0.3389 | 0.4123 |
| GLINT-RU | 0.7379 | 0.4517 | 0.5202 | 0.4472 | 0.2498 | 0.2964 | 0.6573 | 0.3549 | 0.4266 |
| Improv. | 1.95% | 3.30% | 2.93% | 0.34% | 3.74% | 2.53% | 1.31% | 1.87% | 1.69% |

* Recommendation performance of GLINT-RU and existing state-of-the-art benchmark SRS models. "∗" indicates that the improvements are statistically significant (i.e., two-sided t-test with $p < 0.05$) over baselines. The best results are bolded, and the second-best are underlined.

4.3. Implementation

In this subsection, we introduce the implementation details of GLINT-RU. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 for training. Both the training and evaluation batch sizes are set to 2048. The hidden size is set to 128 for ML-1M and to 64 for Amazon Beauty and Video Games. As shown in Table 1, the average sequence lengths of ML-1M, Beauty, and Video Games are 165.60, 8.88, and 9.54, respectively, so we set the maximum sequence length to 200 for ML-1M and to 100 for the two Amazon datasets. We adopt a dropout rate of 0.5 for the Amazon datasets, considering their high level of sparsity, compared with 0.2 for ML-1M. Other implementation details follow the settings of the original papers (Liu et al., 2024; Gao et al., 2024; Liu et al., 2023a).
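For reference, the settings stated above can be collected into a small configuration sketch. The values come from this subsection, whereas the key names and the `get_config` helper are hypothetical and do not correspond to any particular library or to the released code.

```python
# Settings shared by all datasets (values from Section 4.3; key names are illustrative).
SHARED = {
    "optimizer": "adam",
    "learning_rate": 0.001,
    "train_batch_size": 2048,
    "eval_batch_size": 2048,
}

# Dataset-specific settings (values from Section 4.3).
PER_DATASET = {
    "ML-1M": {"hidden_size": 128, "max_seq_length": 200, "dropout": 0.2},
    "Amazon-Beauty": {"hidden_size": 64, "max_seq_length": 100, "dropout": 0.5},
    "Amazon-Video-Games": {"hidden_size": 64, "max_seq_length": 100, "dropout": 0.5},
}


def get_config(dataset):
    """Merge the shared settings with the dataset-specific ones."""
    if dataset not in PER_DATASET:
        raise KeyError(f"unknown dataset: {dataset}")
    return {**SHARED, **PER_DATASET[dataset]}


ml1m_config = get_config("ML-1M")
```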

4.4. Overall Performance (RQ1)

In this subsection, we compare the performance of GLINT-RU with both traditional recommendation frameworks and state-of-the-art efficient models. The results in Table 2 demonstrate the effectiveness of GLINT-RU on Recall@10, MRR@10, and NDCG@10. GLINT-RU outperforms all the selected transformer-, RNN-, MLP- and SSM-based baselines on every metric and dataset, raising the performance upper bound of efficient recommendation models by $0.34\% \sim 3.74\%$.
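The "Improv." row of Table 2 can be read as the relative gain of GLINT-RU over the strongest baseline in each column. This reading is an assumption, but it reproduces both endpoints of the reported range, as the short sketch below shows. The function name is hypothetical.

```python
def relative_improvement(ours, best_baseline):
    """Relative gain in percent of `ours` over `best_baseline`."""
    return 100.0 * (ours - best_baseline) / best_baseline


# Values copied from Table 2.
recall_ml1m = relative_improvement(0.7379, 0.7238)    # vs. Mamba4Rec, Recall@10 on ML-1M
recall_beauty = relative_improvement(0.4472, 0.4457)  # vs. SMLP4Rec, Recall@10 on Beauty
mrr_beauty = relative_improvement(0.2498, 0.2408)     # vs. SMLP4Rec, MRR@10 on Beauty

print(round(recall_ml1m, 2), round(recall_beauty, 2), round(mrr_beauty, 2))
```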

Traditional RNN-based models, like GRU4Rec, might have difficulty in dealing with complex user behaviors. Although GRU4Rec can capture long-term dependencies from long sequences, it struggles to learn effectively from extremely sparse datasets. Traditional attention-based methods like SASRec perform well on the three datasets, but they are quite inefficient due to the high computational complexity of the attention mechanism.

The efficient model LinRec changes the softmax operation and takes attention scores from more positions into consideration, improving the performance on long-term sequential recommendation tasks compared with its backbone SASRec. Both SMLP4Rec and Mamba4Rec achieve impressive performance on the three datasets. However, Mamba4Rec is proficient at modeling long sequences but performs poorly on relatively short ones. Conversely, SMLP4Rec shows superior performance in short-sequence recommendation while being less effective with longer sequences. In addition, SMLP4Rec requires additional features to enhance its performance. Uniquely, GLINT-RU integrates the advantages of the linear attention mechanism and the dense selective GRU module, adaptively extracting dependencies and the most relevant behaviors from users. The gate mechanism employed in GLINT-RU substantially enhances its ability to filter information based on the dynamic data environment and to mix the experts adaptively according to the data.

In summary, GLINT-RU, as a novel efficient framework, shows its superiority over state-of-the-art baselines. This underscores the potential of the dense selective GRU and of models with hybrid modules as more powerful tools for recommendation tasks.

Table 3. Efficiency comparison.

| Datasets | Model | Infer. | Training | GPU Memory |
| --- | --- | --- | --- | --- |
| ML-1M | GLINT-RU | 31ms | 86s/epoch | 8.81GB |
| | Mamba4Rec | 41ms | 108s/epoch | 7.72GB |
| | SASRec | 55ms | 172s/epoch | 21.51GB |
| | BERT4Rec | 88ms | 285s/epoch | 21.51GB |
| | LinRec | 37ms | 101s/epoch | 11.67GB |
| | SMLP4Rec | 51ms | 151s/epoch | 16.13GB |
| | Improv. | 16.22% | 17.44% | - |
| Beauty | GLINT-RU | 278ms | 3.8s/epoch | 2.62GB |
| | Mamba4Rec | 351ms | 4.5s/epoch | 2.32GB |
| | SASRec | 444ms | 8.1s/epoch | 7.67GB |
| | BERT4Rec | 1372ms | 13s/epoch | 11.69GB |
| | LinRec | 340ms | 5.6s/epoch | 4.14GB |
| | SMLP4Rec | 361ms | 5.3s/epoch | 2.95GB |
| | Improv. | 18.24% | 15.56% | - |
| Video Games | GLINT-RU | 247ms | 4.5s/epoch | 2.49GB |
| | Mamba4Rec | 309ms | 5.6s/epoch | 2.28GB |
| | SASRec | 406ms | 9.6s/epoch | 7.65GB |
| | BERT4Rec | 1290ms | 15s/epoch | 10.98GB |
| | LinRec | 327ms | 6.6s/epoch | 4.13GB |
| | SMLP4Rec | 389ms | 9.1s/epoch | 3.38GB |
| | Improv. | 20.06% | 19.64% | - |

* Inference time (ms) of each mini-batch, training time (s/epoch), and GPU memory (GB) of GLINT-RU and other baseline models. The best results are bolded, and the second-best results are underlined.

4.5. Efficiency Comparison (RQ2)

In this subsection, we analyze the efficiency of GLINT-RU and state-of-the-art sequential recommendation models. We evaluate the model efficiency according to their inference time of each mini-batch, training time and GPU memory occupation.

The results, shown in Table 3, provide several valuable insights. Firstly, by utilizing the efficient dense GRU module and the linear attention module, GLINT-RU dramatically reduces training and inference time, improving them by $15\% \sim 20\%$ compared with the most efficient recommendation baselines. In addition, due to the low computational cost of the GRU and the linear attention mechanism, GLINT-RU exhibits low GPU memory consumption, comparable to that of the state-of-the-art SSM-based efficient model Mamba4Rec. In Section 3.6, we showed that GLINT-RU has low theoretical computational complexity, as we employ parallel networks and an efficient GRU as the core component of our recommender system. The theoretical analysis is verified by the results in Table 3.

Traditional transformer-based recommendation models like SASRec and BERT4Rec suffer from extended inference time and high GPU memory occupation. When processing long sequential data, the conventional attention mechanism falls behind novel efficient models due to its high computational cost. Among all the baseline models, the SSM-based Mamba4Rec framework exhibits impressive efficiency, but Mamba requires complex mathematical computation, which slows down its inference and training. Additionally, LinRec inherits a shortcoming of its backbone SASRec, which requires stacked transformer layers to enhance model performance. Such large transformer structures extend both the inference and training time. Although SMLP4Rec achieves high accuracy, it struggles to train and infer efficiently, especially when processing long sequences.

4.6. Ablation Study (RQ3)

In this subsection, we will analyze the efficacy of essential components in GLINT-RU architecture. We conduct the ablation study on the ML-1M dataset, and the results are outlined in Table 4, providing insightful observations.

The results verify the essential role of the dense selective GRU module, as the performance of the model decreases dramatically without it. This shows that the gated GRU module effectively captures the long-term dependencies of user-item interactions. The linear attention mechanism supplements the information of relevant items in the sequence; as shown in Table 4, it improves the performance of GLINT-RU to some extent. Adding a temporal convolution layer incorporates context information from adjacent items, which also enhances model performance. In addition, the gated MLP block plays a role similar to that of the feed-forward network, filtering complex information from the expert mixing block. It is noteworthy that even without the gated MLP block, our framework still outperforms all the state-of-the-art efficient models, demonstrating its inherent superiority for sequential recommendation tasks. After we remove the gated MLP, the GPU memory occupation of GLINT-RU becomes 7.63GB, less than that of Mamba4Rec shown in Table 3, and the inference time is reduced to 241ms. We name this framework "Light GLINT-RU", which is more applicable to resource-constrained scenarios.

Table 4. Ablation study for Components of GLINT-RU.

| Model Components | Recall@10 | MRR@10 | NDCG@10 |
| --- | --- | --- | --- |
| Default | 0.7379 | 0.4517 | 0.5202 |
| w/o Gated MLP (Light GLINT-RU) | 0.7260 | 0.4369 | 0.5060 |
| w/o Attention | 0.7195 | 0.4312 | 0.5001 |
| w/o GRU | 0.6762 | 0.3913 | 0.4593 |
| w/o Temporal Conv1d | 0.7232 | 0.4322 | 0.5019 |

* "∗" indicates that the improvements are statistically significant (i.e., two-sided t-test with $p < 0.05$) over baselines.

4.7. Parameter Analysis

In this subsection, we analyze the impact of a crucial hyperparameter in GLINT-RU, the kernel size $k$, which determines the amount of information in each hidden state. We discuss the performance and efficiency of GLINT-RU with varied $k$, providing valuable insights into the stability and robustness of GLINT-RU.

We conduct the parameter analysis on the Amazon Beauty dataset, and the results are shown in Figures 2 and 3.

Figure 2. Impacts of the kernel size $k$ of the temporal convolution on the performance of GLINT-RU and Mamba4Rec and on the GPU memory occupation of GLINT-RU.
Figure 3. Impacts of the kernel size $k$ of the temporal convolution on the training and inference time.

The first three subfigures display the model performance of GLINT-RU with different kernel sizes. Kernel sizes larger than 1 enable each hidden state to aggregate information from adjacent items, enhancing model performance. Additionally, our model exhibits stable and high performance across a wide range of kernel sizes, which indicates the robustness of the GLINT-RU framework. The accuracy of the recommender system is observed to be positively correlated with the kernel size $k$. This enhancement can be attributed to the fact that a larger kernel aggregates information from more items, thereby bringing a more extensive context into the hidden state. However, as the kernel size continues to grow, the dense selective GRU might incorporate irrelevant data into the output state, which might lead to a marginal decline in accuracy. Mamba4Rec, as a novel efficient SSM-based model, also employs a temporal convolution layer in its structure. However, as shown in Figure 2, the performance of Mamba4Rec becomes quite unstable as the kernel size changes, compared with our GLINT-RU. Furthermore, GLINT-RU consistently outperforms Mamba4Rec across all kernel sizes, further demonstrating the stability and superiority of the GLINT-RU framework.

The histogram in Figure 2 shows the GPU memory occupation with different kernel sizes. Changing the kernel size does not significantly increase GPU memory usage, indicating that an appropriate kernel size can be freely chosen based on accuracy requirements. Moreover, as shown in Figure 3, increasing the kernel size has only a slight impact on the training and inference time of each mini-batch, which further verifies the efficiency and stability of our model.
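To illustrate how the kernel size controls the amount of neighboring context, the sketch below applies a one-dimensional convolution to a toy scalar sequence. It assumes causal left zero-padding, so each output only sees the current and the preceding $k-1$ items. The padding scheme actually used in GLINT-RU is not stated in this section, so this choice, like the function name, is an assumption. In the model, the convolution would act on embedding vectors rather than scalars.

```python
def causal_conv1d(seq, kernel):
    """y[t] = sum_j kernel[j] * seq[t - (k - 1) + j], with zeros before the start."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)   # left zero-padding keeps the output length
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(seq))]


seq = [1.0, 2.0, 3.0, 4.0]

# k = 1: every position only sees itself, so no context is aggregated.
no_context = causal_conv1d(seq, [1.0])

# k = 3: every position sums itself and the two preceding items.
with_context = causal_conv1d(seq, [1.0, 1.0, 1.0])
```

With unit weights, `with_context` is a running sum over a window of three items, which makes the wider receptive field of larger kernels easy to see.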

5. Related Works

Traditional Sequential Recommendation Models. Transformers and RNNs are popular and effective frameworks for sequential recommendation. Various studies have leveraged transformers for recommendation tasks. The transformer-based SASRec (Kang and McAuley, 2018) makes predictions based on multi-head attention within long- and short-term interactions. SSE-PT (Wu et al., 2020) improves the performance by adding user embeddings to the transformer. MB-STR (Yuan et al., 2022) is another transformer-based variant, which captures diverse user behavior dynamics and alleviates the impact of data sparsity. The RNN-based GRU4Rec (Hidasi et al., 2015) learns dependencies among items using gated recurrent units. HRNN (Quadrana et al., 2017), an advanced RNN-based model, adds an additional GRU layer to learn information from user sessions and instantly track user preferences.

Efficient Recommendation Models. To tackle the high computational complexity of effective transformer-based SRSs and to accelerate inference for real-world applications, researchers endeavor to design increasingly efficient models. AutoSeqRec (Liu et al., 2023b) is built on an autoencoder and is an innovative work that captures long-term preferences through collaborative filtering. To further improve model efficiency, DMAN (Tan et al., 2021) combines a long-term attention net and a recurrent attention net to memorize users' interests dynamically and support efficient inference. LinRec (Liu et al., 2023a) cuts down the computational cost of transformer-based backbones by changing the dot product in the attention mechanism. LightSAN (Fan et al., 2021) projects the initial interactions into shorter representations, which is also an efficient approach for transformer-based models. SSM-based models like Mamba4Rec (Liu et al., 2024) and RecMamba (Yang et al., 2024) utilize selective state space models to achieve both high performance and high efficiency, emerging as powerful tools for sequential recommendation tasks. FMLP-Rec (Zhou et al., 2022), with learnable filters, and SMLP4Rec (Gao et al., 2024), with diverse mixers, are representative pure MLP-based efficient SRSs. LRURec (Yue et al., 2024) is built on linear recurrent units and achieves fast inference by employing recursive parallelization.

6. Conclusion

In this paper, we have presented GLINT-RU, an innovative dense selective GRU framework for sequential recommendation tasks. Owing to the parallel network design and the efficient dense selective GRU, the computational cost is substantially reduced, resulting in faster inference than state-of-the-art efficient models. Additionally, GLINT-RU tackles the low accuracy of traditional GRU frameworks and represents a significant advancement in efficient recommendation models. GLINT-RU applies gate mechanisms to every module, deeply filtering and selecting information according to the data environment. The dense GRU allows each hidden state to contain information from adjacent items, capturing local temporal features of user behavior to enhance predictive accuracy. Our extensive experiments demonstrate that GLINT-RU achieves outstanding performance, not only improving model accuracy but also dramatically accelerating training and inference. These results underscore the potential of GLINT-RU to become a novel, stable, and efficient framework on datasets with various levels of sparsity and sequence lengths. Because GLINT-RU is an efficient and accurate recommender system that outperforms the state-of-the-art efficient RNN-, transformer-, MLP- and SSM-based recommendation models, we believe that it will become a valuable foundation for research in more domains of recommender systems.

References

  • Chang et al. (2023) Jianxin Chang, Chenbin Zhang, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, and Kun Gai. 2023. Pepnet: Parameter and embedding personalized network for infusing with personalized prior information. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3795–3804.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • De et al. (2024) Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. 2024. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv preprint arXiv:2402.19427 (2024).
  • Dey and Salem (2017) Rahul Dey and Fathi M Salem. 2017. Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS). IEEE, 1597–1600.
  • Elfwing et al. (2018) Stefan Elfwing, Eiji Uchibe, and Kenji Doya. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks 107 (2018), 3–11.
  • Fan et al. (2021) Xinyan Fan, Zheng Liu, Jianxun Lian, Wayne Xin Zhao, Xing Xie, and Ji-Rong Wen. 2021. Lighter and better: low-rank decomposed self-attention networks for next-item recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 1733–1737.
  • Gao et al. (2024) Jingtong Gao, Xiangyu Zhao, Muyang Li, Minghao Zhao, Runze Wu, Ruocheng Guo, Yiding Liu, and Dawei Yin. 2024. SMLP4Rec: An Efficient all-MLP Architecture for Sequential Recommendations. ACM Transactions on Information Systems 42, 3 (2024), 1–23.
  • Gu and Dao (2023) Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023).
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
  • Jacobs et al. (1991) Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Li et al. (2023) Chengxi Li, Yejing Wang, Qidong Liu, Xiangyu Zhao, Wanyu Wang, Yiqi Wang, Lixin Zou, Wenqi Fan, and Qing Li. 2023. STRec: Sparse Transformer for Sequential Recommendations. In Proceedings of the 17th ACM Conference on Recommender Systems. 101–111.
  • Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
  • Li et al. (2022) Muyang Li, Xiangyu Zhao, Chuan Lyu, Minghao Zhao, Runze Wu, and Ruocheng Guo. 2022. MLP4Rec: A pure MLP architecture for sequential recommendations. arXiv preprint arXiv:2204.11510 (2022).
  • Liu et al. (2024) Chengkai Liu, Jianghao Lin, Jianling Wang, Hanzhou Liu, and James Caverlee. 2024. Mamba4Rec: Towards Efficient Sequential Recommendation with Selective State Space Models. arXiv preprint arXiv:2403.03900 (2024).
  • Liu et al. (2023a) Langming Liu, Liu Cai, Chi Zhang, Xiangyu Zhao, Jingtong Gao, Wanyu Wang, Yifu Lv, Wenqi Fan, Yiqi Wang, Ming He, et al. 2023a. Linrec: Linear attention mechanism for long-term sequential recommender systems. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 289–299.
  • Liu et al. (2023b) Sijia Liu, Jiahao Liu, Hansu Gu, Dongsheng Li, Tun Lu, Peng Zhang, and Ning Gu. 2023b. Autoseqrec: Autoencoder for efficient sequential recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 1493–1502.
  • Long et al. (2024) Chao Long, Huanhuan Yuan, Junhua Fang, Xuefeng Xian, Guanfeng Liu, Victor S Sheng, and Pengpeng Zhao. 2024. Learning Global and Multi-granularity Local Representation with MLP for Sequential Recommendation. ACM Transactions on Knowledge Discovery from Data 18, 4 (2024), 1–15.
  • Masoudnia and Ebrahimpour (2014) Saeed Masoudnia and Reza Ebrahimpour. 2014. Mixture of experts: a literature survey. Artificial Intelligence Review 42 (2014), 275–293.
  • Quadrana et al. (2017) Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing session-based recommendations with hierarchical recurrent neural networks. In proceedings of the Eleventh ACM Conference on Recommender Systems. 130–137.
  • Shen et al. (2018) Guizhu Shen, Qingping Tan, Haoyu Zhang, Ping Zeng, and Jianjun Xu. 2018. Deep learning with gated recurrent unit networks for financial sequence predictions. Procedia Computer Science 131 (2018), 895–903.
  • Shen et al. (2021) Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. 2021. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 3531–3539.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
  • Tan et al. (2021) Qiaoyu Tan, Jianwei Zhang, Ninghao Liu, Xiao Huang, Hongxia Yang, Jingren Zhou, and Xia Hu. 2021. Dynamic memory based attention network for sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4384–4392.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
  • Wang et al. (2020) Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020).
  • Wu et al. (2020) Liwei Wu, Shuqing Li, Cho-Jui Hsieh, and James Sharpnack. 2020. SSE-PT: Sequential recommendation via personalized transformer. In Proceedings of the 14th ACM conference on recommender systems. 328–337.
  • Yang et al. (2024) Jiyuan Yang, Yuanzi Li, Jingyu Zhao, Hanbing Wang, Muyang Ma, Jun Ma, Zhaochun Ren, Mengqi Zhang, Xin Xin, Zhumin Chen, et al. 2024. Uncovering Selective State Space Model’s Capabilities in Lifelong Sequential Recommendation. arXiv preprint arXiv:2403.16371 (2024).
  • Yuan et al. (2022) Enming Yuan, Wei Guo, Zhicheng He, Huifeng Guo, Chengkai Liu, and Ruiming Tang. 2022. Multi-behavior sequential transformer recommender. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval. 1642–1652.
  • Yue et al. (2024) Zhenrui Yue, Yueqi Wang, Zhankui He, Huimin Zeng, Julian McAuley, and Dong Wang. 2024. Linear recurrent units for sequential recommendation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. 930–938.
  • Zhang et al. (2024) Sheng Zhang, Maolin Wang, Yao Zhao, Chenyi Zhuang, Jinjie Gu, Ruocheng Guo, Xiangyu Zhao, Zijian Zhang, and Hongzhi Yin. 2024. EASRec: Elastic Architecture Search for Efficient Long-term Sequential Recommender Systems. arXiv preprint arXiv:2402.00390 (2024).
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
  • Zhao et al. (2023) Xiangyu Zhao, Maolin Wang, Xinjian Zhao, Jiansheng Li, Shucheng Zhou, Dawei Yin, Qing Li, Jiliang Tang, and Ruocheng Guo. 2023. Embedding in Recommender Systems: A Survey. arXiv preprint arXiv:2310.18608 (2023).
  • Zhou et al. (2022) Kun Zhou, Hui Yu, Wayne Xin Zhao, and Ji-Rong Wen. 2022. Filter-enhanced MLP is all you need for sequential recommendation. In Proceedings of the ACM web conference 2022. 2388–2399.