A Simple Yet Effective Approach for Diversified Session-Based Recommendation
Abstract
Session-based recommender systems (SBRSs) have become extremely popular owing to their core capability of capturing short-term and dynamic user preferences. However, most SBRSs primarily maximize recommendation accuracy while ignoring users’ minor preferences, which leads to filter bubbles in the long run. The handful of works devoted to improving diversity depend on unique model designs and calibrated loss functions, which cannot be easily adapted to existing accuracy-oriented SBRSs. It is thus worthwhile to devise a simple yet effective design that can be used as a plugin to help existing SBRSs generate a more diversified list while preserving recommendation accuracy. To this end, we propose an end-to-end framework applicable to existing representative (accuracy-oriented) SBRSs, called diversified category-aware attentive SBRS (DCA-SBRS), to boost recommendation diversity. It consists of two novel designs: a model-agnostic diversity-oriented loss function and a non-invasive category-aware attention mechanism. Extensive experiments on three datasets show that our framework helps existing SBRSs achieve extraordinary performance in terms of recommendation diversity (e.g., an average increase of 74.1% on ILD@10) and comprehensive performance (e.g., an average lift of 52.3% on F-score@10), without significantly deteriorating recommendation accuracy compared to state-of-the-art accuracy-oriented SBRSs. The source code is available at github.com/qyin863/DCA-SBRS.
Keywords recommender systems, session-based recommendation, diversification, diversified recommendation
1 Introduction
Session-based recommender systems (SBRSs) have gained significant attention because they provide more timely and accurate recommendations by incorporating short-term and dynamic user preferences [fang2020deep, wang2021survey]. To enhance recommendation accuracy, existing SBRSs utilize sophisticated models such as deep neural networks to capture short-term preferences from the most recent session. For instance, GRU4Rec [hidasi2015session] employs gated recurrent units (GRUs) to learn a session’s sequential behaviors. Attention mechanisms were subsequently introduced to capture the main purpose (intent) of a session, as in NARM [li2017neural] and STAMP [LiuZMZ18]. Moreover, graph neural networks (GNNs) are utilized to learn more complex item relationships (e.g., SR-GNN [WuT0WXT19], GC-SAN [XuZLSXZFZ19], and GCE-GNN [wang2020global]). In these state-of-the-art (SOTA) SBRSs, attention mechanisms are used together with RNNs or GNNs to improve recommendation performance [wang2021survey].
However, the aforementioned SOTA (accuracy-oriented) SBRSs gradually overemphasize dominant interests and weaken minor ones [steck2018calibrated], thus leading to a filter bubble [nguyen2014exploring, khenissi2020theoretical] over time. As such, diversified recommender systems (RSs) have been proposed to recommend more diverse lists (e.g., with items covering many categories). Diversification works in traditional recommendation fall into three major categories: post-processing heuristic methods [carbonell1998use, steck2018calibrated], determinantal point process (DPP) methods [chen_fast_2018, wu2019pd, gan2020enhancing], and end-to-end learning methods [zheng2021dgcn, liang2021enhancing]. However, to the best of our knowledge, there are only three representative diversified SBRSs, namely MCPRN [Wang0WSOC19], ComiRec [Cen2020ControllableMF], and IDSR [chen2020improving]. Both MCPRN and ComiRec design multiple channels rather than one major channel to learn multiple purposes in a session, so that recommendations strive to satisfy all of these purposes instead of capturing only the main purpose, as representative accuracy-oriented works (e.g., NARM) do. Following the same multiple-purpose assumption, IDSR jointly incorporates both item relevance and diversity into the prediction score and loss function.
To conclude, existing studies on diversified SBRSs face two main challenges. (1) Diversified SBRSs rely on carefully calibrated model variants, such as multiple channels and diversity-oriented losses (objectives) fitted to special diversity modules. Such designs cannot be easily adopted by existing representative accuracy-oriented SOTA SBRSs. Thus, the first research challenge is how to devise simple yet effective designs (such as a loss function) that improve the diversity performance of SOTA accuracy-oriented SBRSs. (2) Previous diversified works mostly fail to obtain accuracy comparable to that of representative accuracy-oriented SBRSs, since in most cases improved diversity comes at the cost of a certain level of accuracy. To mitigate this adverse effect, side information such as item category is generally introduced to help better learn user preferences [zhao2018categorical, sun2019research, liu2021noninvasive]. However, for representative accuracy-oriented SBRSs, we surprisingly find that simply concatenating the item ID and its category information as the input and adopting the common multi-task learning framework, as in SBRS+MTL [zhao2018categorical], cannot considerably improve recommendation performance and may even result in worse accuracy (see Figure 1). Our second challenge is therefore to seek a solution that helps maintain recommendation accuracy for diversified SBRSs by better exploiting category information.
Towards the aforementioned two issues, we propose a simple yet effective end-to-end Diversified Category-aware Attentive framework, called DCA-SBRS, that can be easily instantiated with existing representative accuracy-oriented SBRSs to help them generate a more diversified recommendation list without significantly sacrificing accuracy. Given the widespread adoption and efficacy of attention mechanisms in existing state-of-the-art accuracy-oriented SBRSs [wang2021survey, under2023], we incorporate category information into the attention mechanism. Specifically, DCA-SBRS is composed of two purpose-built parts: (1) a Model-agnostic Diversity-oriented Loss (MDL) function, which works together with the accuracy-oriented loss (e.g., cross-entropy loss) and exploits items’ category attribute and the item scores estimated by the given SBRS; and (2) a Non-invasive Category-aware Attention (NCA) mechanism, which, inspired by NOVA [liu2021noninvasive], utilizes category information in a non-invasive way instead of directly fusing it, so that it acts as directional guidance (an attention signal) for more accurate session-based recommendation. The main contributions of this work are summarized as follows:
• We propose a simple yet effective diversity-oriented loss function that can be used as a model-agnostic and individual plugin to deep neural accuracy-oriented SBRSs to improve their diversity performance, mitigating the technical gap between accuracy-oriented and diversified SBRSs.
• We transfer the non-invasive idea from NOVA [liu2021noninvasive] into the common attention mechanism used in SOTA accuracy-oriented SBRSs (e.g., NARM and GCE-GNN) to capture more accurate preferences by utilizing category information in a non-invasive way, so as to efficiently help maintain recommendation accuracy.
• We conduct extensive experiments on three real-world datasets, in terms of accuracy, diversity, and comprehensive performance (jointly considering accuracy and diversity), to demonstrate the effectiveness of our DCA-SBRS framework. Experimental results show that our framework can help SOTA SBRSs achieve extraordinary performance in terms of diversity and comprehensive performance (e.g., average increases of 74.1% and 52.3% on ILD@10 and F-score@10, respectively), without significantly deteriorating recommendation accuracy in contrast with SOTA diversified SBRSs (e.g., an average of only decrease on accuracy regarding NDCG@ but increase on diversity for ILD@ on Diginetica). Additionally, we fairly analyze the limitations of the standard comprehensive measure and offer alternative solutions.
2 Related work
Our study is related to two major areas: session-based recommendation, and diversified recommendation.
2.1 Session-Based Recommendation
Approaches to SBRSs can be divided into two groups: conventional non-neural methods and deep neural ones. Typical conventional techniques include but are not limited to Item-KNN [sarwar2001item], BPR-MF [rendle2009bpr], and FPMC [rendle2010factorizing]. For example, FPMC combines Matrix Factorization (MF) with a Markov Chain (MC) to better deal with dependencies between items in a sequence. However, these methods generally handle item relationships in comparatively longer sequences inadequately. In contrast, deep neural networks can better deal with much longer sequences and thus generate more effective recommendations [tan2016improved, hidasi2018recurrent]. For example, GRU4Rec [hidasi2015session] and its variants [tan2016improved, hidasi2018recurrent] apply GRUs to capture the long-term dependency in a sequence. NARM [li2017neural] further adopts an attention mechanism to assess the similarity between previous items and the last item in every session, and the hidden states are then combined as a weighted average to obtain the main-purpose session representation. STAMP [LiuZMZ18] models both users’ general interests and current interests using attentive nets and basic multi-layer perceptrons (MLPs) instead of RNNs.
However, the above techniques only model one-way transitions between successive items, ignoring transitions between contexts (i.e., other items in the session) [qiu2019rethinking]. Recently, GNNs have been employed to mitigate the research gap [yu2020tagnn]. For instance, SR-GNN [WuT0WXT19] and GC-SAN [XuZLSXZFZ19] import GNNs to generate more accurate item embedding vectors based on the current session graph built for each session. Besides the current session graph, GCE-GNN [wang2020global] also explores item relationships in the global session graph.
It is worth noting that, the above conventional and deep neural SBRSs are all accuracy-oriented approaches that fail to consider diversity (i.e., non-diversified). Given that RSs have an iterative or closed feedback loop, this may result in filter bubbles [nguyen2014exploring, khenissi2020theoretical].
2.2 Diversified Recommendation
Towards individual diversity in traditional RSs, inspired by dissimilarity score in Maximal Marginal Relevance (MMR) [carbonell1998use], some studies [agrawal2009diversifying, santos2010exploiting] define diversification on explicit aspects (categories) or sub-queries. Besides, DPP is utilized [chen_fast_2018, kulesza2012determinantal, wu2019pd, gan2020enhancing] to provide a better relevance-diversity trade-off in recommendation as it can score sets of items collectively and consider negative correlations between various items. The aforementioned studies are two-stage ones which re-rank items accounting for diversity in the second stage. In traditional RS, there are only several end-to-end studies [zheng2021dgcn, liang2021enhancing] which simultaneously optimize diversity and accuracy.
To the best of our knowledge, there are only three diversified (and also end-to-end) works for session-based recommendation: MCPRN [Wang0WSOC19], ComiRec [Cen2020ControllableMF], and IDSR [chen2020improving]. Specifically, MCPRN uses mixture-channel purpose routing networks to guide multi-purpose learning, while ComiRec explores two methods as multi-interest extraction modules (i.e., the dynamic routing and self-attentive methods). Thus, MCPRN and ComiRec use multiple session representations to capture user preferences, which can implicitly satisfy user needs. In contrast, IDSR delivers end-to-end recommendation under the guidance of the intent-aware diversity promoting (IDP) loss and explicitly creates set diversity. A “trade-off hyper-parameter” (in IDSR) is adopted to keep the balance between recommendation relevance and diversity.
To summarize, the diversified designs in these three works cannot be easily adapted to existing representative accuracy-oriented SBRSs. Besides, consistent with the widely held “trade-off” relationship, these studies fail to obtain satisfying recommendation accuracy (as can also be observed in Tables 4-6).
3 Our DCA-SBRS Framework
In this section, we first formulate our research problem and then introduce the two components of the proposed framework in detail.
3.1 Problem Statement and Model Overview
Let $\mathcal{I}$ be the set of all items and $\mathcal{C}$ be the set of all categories. Each anonymous session, denoted by $S = [i_1, i_2, \dots, i_T]$, consists of item IDs in chronological order (i.e., items clicked by a user), where $i_t \in \mathcal{I}$ denotes the $t$-th item clicked within session $S$. Additionally, our framework uses the category attribute of items (i.e., $c_{i_t} \in \mathcal{C}$ denotes the corresponding category of $i_t$) to guide the session representation learning for better item prediction. Given a session $S$, the objective of our session-based recommendation is to recommend a Top-$N$ item list that is both diversified and accurate, denoted as $RL$, for next-item prediction.
To address the problem, we propose a Diversified Category-aware Attentive framework, named DCA-SBRS, which can be instantiated with a SOTA accuracy-oriented SBRS to improve its diversity performance while preserving its recommendation accuracy. It mainly consists of two novel components: 1) a Model-agnostic Diversity-oriented Loss function (MDL, $\mathcal{L}_{div}$), which works together with the accuracy-oriented loss (e.g., cross-entropy loss $\mathcal{L}_{acc}$) and is built on items’ category attribute and the item scores estimated by the SBRS. It helps existing SOTA accuracy-oriented SBRSs produce more diverse recommendation lists; 2) a Non-invasive Category-aware Attention (NCA) mechanism, which utilizes category information as directional guidance and replaces the normal attention mechanism widely used in existing SBRSs. Since there exists a widely known “trade-off” relationship between recommendation accuracy and diversity [chen2020improving], this design partially alleviates the adverse effect of the diversity objective on recommendation accuracy.
Figure 2 presents the architecture of our DCA-SBRS framework, which depicts the installation of the MDL and NCA components on top of the general encoder-decoder framework and common attention mechanism of a SOTA SBRS, NARM [li2017neural]. Without loss of generality, as shown in Figure 2, let the encoder-decoder framework denote the architecture of SOTA SBRSs, where the encoder encodes the session representation and the decoder estimates item scores for generating recommendations. The similarity layer projects the session representation into the item space and then produces a Top-$N$ recommendation list. We next present the two components in detail.
3.2 Model-agnostic Diversity-oriented Loss
The goal of this module is to enhance diversity performance by acting as a model-agnostic plugin to accuracy-oriented SBRSs. Non-diversified SBRSs typically predict relevance scores of items by capturing preferences from item sequences. For simplicity, we leverage the obtained relevance scores as the foundation of this module and increase recommendation diversity by penalizing monotonous recommended lists (e.g., a Top-$N$ recommended list in which most items belong to the same category). To fulfill the goal, as shown in Figure 2, the model-agnostic diversity-oriented loss ($\mathcal{L}_{div}$) is designed to let existing SBRSs learn end-to-end. Specifically, we define it using the entropy of the estimated category distribution in a recommended list, given by,
$$\mathcal{L}_{div} = -H\big(P(c \mid RL)\big) = \sum_{c \in \mathcal{C}} P(c \mid RL) \log P(c \mid RL), \tag{1}$$
where $H(\cdot)$ denotes the information entropy and $P(c \mid RL)$ is the estimated probability that category $c$ appears in the recommended list $RL$. A larger $H(P(c \mid RL))$ indicates that the recommended list is likely to be more diverse from the category perspective. In this case, its negative value can be regarded as penalizing recommended lists with low diversity. Intuitively, a reasonable $P(c \mid RL)$ in a recommendation list should satisfy the following two characteristics:
• In proportion to the number of items from category $c$: In real-world datasets, the grouping induced by a categorical attribute can be very unbalanced [zhao2018categorical]. To illustrate, we select two datasets (Diginetica and Retailrocket) and show the number of items per category using box plots in Figure 3. As the outliers in the box plots indicate, some categories contain a large group of items while others contain only a few. A category with a larger group of items is more likely to appear in the RL when personalized preference is not considered.
• In proportion to relevance scores of items: Regarding personalized preference, representative SBRSs recommend the Top-$N$ items by ranking the predicted scores given session $S$. As a result, the items with much higher scores are more likely to appear in the RL along with their corresponding categories.
Considering that common accuracy-oriented SBRSs only output predicted item scores without a special module capturing category scores, we simulate the category distribution in the RL so that it satisfies the above two characteristics:
$$P(c \mid RL) = \sum_{i \in \mathcal{I}_c} \hat{y}_i, \tag{2}$$
where $\mathcal{I}_c \subseteq \mathcal{I}$ is the set of items belonging to category $c$ and $\hat{y}_i$ is the predicted personalized preference score of item $i$ obtained by the given SOTA SBRS (normalized with a softmax function over all items). We sum the scores of items from category $c$ as the occurrence probability of $c$ so as to consider both the number of items in $c$ and the personalized preference $\hat{y}_i$. Then, $\mathcal{L}_{div}$ combined with the original accuracy-oriented loss $\mathcal{L}_{acc}$ (e.g., the cross-entropy of the prediction results [li2017neural, wang2020global]) forms the final loss function for model training,
$$\mathcal{L} = \mathcal{L}_{acc} + \lambda \mathcal{L}_{div}, \tag{3}$$
where $\lambda$ controls the importance of our proposed MDL.
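As a rough illustration of how the loss described above could be computed, the following pure-Python sketch sums softmax-normalized item scores per category, takes the negative entropy of the resulting distribution, and adds it to a cross-entropy term. The function names, the default weight `lam`, and the use of the natural logarithm are assumptions made for this sketch; an actual implementation would operate on batched tensors inside the training loop.

```python
import math


def softmax(scores):
    """Normalize raw item scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def category_distribution(item_scores, item_category, num_categories):
    """Sum the softmax score of every item into the bucket of its category."""
    probs = softmax(item_scores)
    dist = [0.0] * num_categories
    for p, c in zip(probs, item_category):
        dist[c] += p
    return dist


def diversity_loss(item_scores, item_category, num_categories):
    """Negative entropy of the estimated category distribution."""
    dist = category_distribution(item_scores, item_category, num_categories)
    entropy = -sum(p * math.log(p) for p in dist if p > 0)
    return -entropy


def total_loss(item_scores, target_item, item_category, num_categories, lam=1.0):
    """Cross-entropy on the target item plus the weighted diversity loss.

    `lam` is a hypothetical default; the tuned value is not stated here.
    """
    probs = softmax(item_scores)
    acc_loss = -math.log(probs[target_item])
    div_loss = diversity_loss(item_scores, item_category, num_categories)
    return acc_loss + lam * div_loss
```

With uniform scores over four items split evenly across two categories, the diversity loss equals $-\log 2$; if all four items share one category it is zero, which is the least diverse case.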
3.3 Non-invasive Category-aware Attention
There exists a widely known “trade-off” relationship between recommendation accuracy and diversity [chen2020improving]. Consequently, the plugged-in diversity loss (MDL) will probably deteriorate the recommendation accuracy of accuracy-oriented SBRSs. To address this issue, we exploit category information to enhance preference learning.
As shown in Figure 1, invasive fusion (such as merely concatenating item embeddings with the relevant category embeddings as input) might not considerably improve recommendation accuracy. Therefore, considering that attention mechanisms are widely adopted by SOTA accuracy-oriented SBRSs, we transfer the non-invasive idea from NOVA [liu2021noninvasive] into the common attention mechanism. Specifically, as shown in Figure 2, the encoder, employing the deep learning technique of an existing SBRS (e.g., RNN [li2017neural], MLP [LiuZMZ18], or GNN [WuT0WXT19, wang2020global]), first converts session $S$ into a set of high-dimensional hidden states $\{\mathbf{h}_1, \dots, \mathbf{h}_t\}$, which are weighted and summed by the attention signal output by the common attention mechanism at time $t$ (denoted as $\alpha_{tj}$) to obtain the current session representation decoded at time $t$ (denoted as $\mathbf{s}_t$).
The category-aware extensions for SOTA SBRSs with attention mechanisms (i.e., NARM, STAMP, and GCE-GNN) are described in detail in Table 1, where the symbols in the functions are unified with the original papers and thus the corresponding detailed explanation is omitted here. Note that $\mathbf{c}_j$ is the corresponding category embedding vector of item $i_j$ in session $S$.
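Here, we use NARM [li2017neural] as an example to further elaborate our NCA. In NARM, the attention signal is computed as the correlation between the final hidden state $\mathbf{h}_t$ and the hidden state of the $j$-th item, $\mathbf{h}_j$,
$$\alpha_{tj} = \mathbf{v}^{\top} \sigma(\mathbf{A}_1 \mathbf{h}_t + \mathbf{A}_2 \mathbf{h}_j), \tag{4}$$
where $\sigma$ is an activation function (e.g., the sigmoid function), matrices $\mathbf{A}_1$ and $\mathbf{A}_2$ transform $\mathbf{h}_t$ and $\mathbf{h}_j$ into a latent space, respectively, and $\mathbf{v}$ is a learnable weight vector as in NARM. Correspondingly, our NCA mechanism further uses the category attribute as directional guidance and keeps the hidden states undoped in their vector space. Specifically, NCA uses the category attribute to update the attention signal as:
$$\begin{aligned} \alpha_{tj} &= \mathbf{v}^{\top} \sigma\big(\mathbf{A}_1 (\mathbf{h}_t \oplus \mathbf{c}_t) + \mathbf{A}_2 (\mathbf{h}_j \oplus \mathbf{c}_j)\big), \\ \mathbf{s}_t &= \sum_{j=1}^{t} \alpha_{tj} \mathbf{h}_j, \end{aligned} \tag{5}$$
where $\mathbf{c}_j$ is the corresponding category embedding vector of item $i_j$ and $\oplus$ denotes element-wise addition. Note that here we use the simplest fusor, ‘addition’, to straightforwardly add the hidden states and category embedding vectors. It can also be replaced by other fusors, such as ‘concatenation’ or ‘gating’ [liu2021noninvasive].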
To conclude, by doing this, we have successfully exploited category information in a non-invasive way to help generate attention signals, with the goal of maintaining the recommendation accuracy.
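To make the non-invasive idea concrete, here is a minimal pure-Python sketch of a NARM-style attention in which category embeddings are added only when computing the attention weights, while the session representation remains a weighted sum of the unmodified hidden states. The function `nca_attention`, its argument layout, and the tiny list-based linear algebra helpers are assumptions for illustration; the parameters `A1`, `A2`, and `v` stand in for learned weights.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise addition, the simplest 'fusor'."""
    return [x + y for x, y in zip(a, b)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nca_attention(hidden, cat_emb, A1, A2, v):
    """Category-aware attention weights over undoped hidden states.

    hidden:  list of hidden-state vectors, one per item in the session
    cat_emb: list of category embedding vectors aligned with `hidden`
    Returns (attention weights, session representation).
    """
    # Category information only enters the query/key side of the attention.
    query = matvec(A1, add(hidden[-1], cat_emb[-1]))
    alphas = []
    for h_j, c_j in zip(hidden, cat_emb):
        key = matvec(A2, add(h_j, c_j))
        activated = [sigmoid(q + k) for q, k in zip(query, key)]
        alphas.append(sum(w * a for w, a in zip(v, activated)))
    # The weighted sum uses the original hidden states, not the fused ones.
    dim = len(hidden[0])
    session = [sum(a * h[k] for a, h in zip(alphas, hidden)) for k in range(dim)]
    return alphas, session
```

Passing all-zero category embeddings reduces this sketch to the plain attention of Equation 4, which gives a convenient sanity check when wiring the mechanism into an existing model.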
3.4 Discussion: Simple yet Effective Approach
Our DCA-SBRS framework can serve as a plugin for SOTA accuracy-oriented SBRSs to improve their diversity performance with MDL module while in the meantime striving to maintain their recommendation accuracy with NCA mechanism. Generally speaking, both MDL module and NCA mechanism can be easily equipped with existing SOTA accuracy-oriented SBRSs to further promote their performance regarding diversity towards more trustworthy recommender systems [ge2022survey, wang2022trustworthy]. Extensive experimental results in Section 5 verify that our approach can help SOTA SBRSs (i.e., NARM, STAMP, and GCE-GNN) obtain extraordinary performance in terms of recommendation diversity and comprehensive performance (considering both accuracy and diversity).
Besides, our approach is much more lightweight (simple yet effective) than existing diversified recommender systems: (1) in contrast to (two-stage) re-ranking methods (e.g., MMR [carbonell1998use]), MDL achieves end-to-end learning, that is, it simultaneously optimizes the accuracy and diversity objectives; (2) unlike other diversified SBRSs (e.g., IDSR [chen2020improving]) that rely on specifically calibrated diversity-aware components with a substantial number of extra parameters, our MDL module is a model-agnostic plugin that uses only the relevance scores estimated by an existing SOTA SBRS and the category information, and thus requires few extra parameters and is comparable in efficiency to the corresponding SBRS; and (3) both MMR [carbonell1998use] and IDSR [chen2020improving] employ a greedy iterative inference algorithm to generate the final Top-$N$ recommended lists, whereas our DCA-SBRS framework directly generates a recommended list of the Top-$N$ items with the highest final scores, implying that our approach is more computationally efficient at inference. This is also empirically verified in Table 7.
Dataset | Diginetica | Retailrocket | Tmall |
---|---|---|---|
# interactions | 993,483 | 1,040,796 | 1,505,683 |
# train | 186,670 | 283,446 | 188,756 |
# test | 18,101 | 11,718 | 51,894 |
# items | 43,097 | 45,831 | 96,182 |
# categories | 995 | 871 | 822 |
avg. len. | 4.8504 | 3.5262 | 6.0775 |
train DS | 0.3741 | 0.4646 | 0.6575 |
test DS | 0.3721 | 0.4893 | 0.6278 |
train RR | 0.1301 | 0.2488 | 0 |
test RR | 0.1317 | 0.2370 | 0 |
4 Experimental Settings
In this section, we introduce the selection of datasets, baselines, and evaluation metrics. The specifics of dataset preprocessing and partitioning, as well as the hyper-parameter settings for our methods and other baselines, are also provided. The source code and datasets are available online (https://github.com/qyin863/DCA-SBRS).
4.1 Datasets and Preprocessing
For our experiments, we carefully select three representative public e-commerce datasets with item category information, following [wang2021survey, li2017neural, wang2020global]: Diginetica (https://competitions.codalab.org/competitions/11161#learn_the_details-overview), Retailrocket (https://www.kaggle.com/retailrocket/ecommerce-dataset), and Tmall (https://tianchi.aliyun.com/dataset/dataDetail?dataId=42).
• Diginetica, from CIKM Cup 2016, contains user sessions taken from the records of an e-commerce search engine, each with its own ‘SessionId’. We only use the data with the behavior type ‘view’.
• Retailrocket collects users’ interactions on an e-commerce website over a period of 4.5 months. We select interactions with the behavior type ‘view’, and a new session is created when the user’s idle time exceeds 30 minutes, following [luo2020collaborative].
• Tmall, from the IJCAI-15 competition, includes anonymous Tmall shopping logs. We adopt interactions with the behavior types ‘buy’ and ‘view’, and partition user history into sessions by day following [ludewig2018evaluation]. We pick sessions as a sampling inspired by Yoochoose fractions [li2017neural].
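The idle-time rule used for Retailrocket can be sketched as follows. This is only an illustration under the assumption that each user's events are available as `(timestamp_in_seconds, item_id)` pairs; the function name `split_sessions` and its parameter are hypothetical and not taken from the released code.

```python
def split_sessions(events, max_idle_seconds=30 * 60):
    """Split one user's events into sessions on idle gaps.

    events: iterable of (timestamp_in_seconds, item_id) pairs.
    A new session starts when the gap to the previous event exceeds
    `max_idle_seconds` (30 minutes by default).
    """
    sessions = []
    current = []
    last_ts = None
    for ts, item in sorted(events, key=lambda e: e[0]):
        if last_ts is not None and ts - last_ts > max_idle_seconds:
            sessions.append(current)
            current = []
        current.append(item)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Partitioning by day, as done for Tmall, would follow the same pattern with the split condition replaced by a change of calendar date.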
For data preprocessing, following [li2017neural, LiuZMZ18, WuT0WXT19], we filter out sessions of length and items occurring less than times. Then we set the most recent data (i.e., the last week) as the test set and the previous sessions as the training set. The validation set contains the final week of data from the training set. Additionally, we drop items appearing in the test set but not in the training set. The statistics of these three datasets after preprocessing are shown in Table 2. A sequence-splitting preprocess, that is, generating sub-sequences $([i_1], i_2)$, $([i_1, i_2], i_3)$, $\dots$, $([i_1, \dots, i_{T-1}], i_T)$ for a session sequence $S = [i_1, i_2, \dots, i_T]$, is required if a recommendation model is not trained in a session-parallel manner [hidasi2015session].
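The filtering and sequence-splitting steps can be illustrated with the short sketch below. The thresholds are deliberately left as required arguments, since their values are not given here, and the order of the two filters (items first, then sessions) is an assumption; `filter_sessions` and `split_sequence` are hypothetical helper names.

```python
from collections import Counter


def filter_sessions(sessions, min_session_len, min_item_count):
    """Drop infrequent items, then drop sessions that became too short.

    Both thresholds are caller-supplied; the order of the two filters
    is an assumption of this sketch.
    """
    counts = Counter(item for session in sessions for item in session)
    kept = []
    for session in sessions:
        filtered = [item for item in session if counts[item] >= min_item_count]
        if len(filtered) >= min_session_len:
            kept.append(filtered)
    return kept


def split_sequence(session):
    """Generate (prefix, next_item) training pairs from one session."""
    return [(session[:k], session[k]) for k in range(1, len(session))]
```

For example, `split_sequence([1, 2, 3, 4])` yields `([1], 2)`, `([1, 2], 3)`, and `([1, 2, 3], 4)`, so a session of length $T$ contributes $T-1$ training samples.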
4.2 Baseline Models
To explore the recommendation performance on accuracy and diversity, following [wang2021survey, wang2020global, chen2020improving], we select three categories of popular and representative baseline models for session-based recommendation, including traditional methods, deep neural methods with attention mechanism (as they are chosen as the basic predictors in our proposed framework), and deep diversified methods.
1. Traditional Methods.
• Item-KNN [sarwar2001item] measures the cosine similarity of every two items based on the sessions in the training data. It recommends the items that are most similar to the last item of a session.
• BPR-MF [rendle2009bpr] performs Matrix Factorization (MF) with a pairwise ranking loss. Particularly, the session feature vector is averaged over all items in the session.
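As an illustration of the Item-KNN baseline described above, the sketch below represents each item by the set of training sessions it occurs in and scores candidates by cosine similarity to the last item of the current session. Treating occurrences as binary and the helper names are assumptions of this sketch, not details taken from the evaluated implementation.

```python
import math
from collections import defaultdict


def build_item_sessions(sessions):
    """Map each item to the set of session indices in which it occurs."""
    item_sessions = defaultdict(set)
    for sid, session in enumerate(sessions):
        for item in session:
            item_sessions[item].add(sid)
    return item_sessions


def item_knn_recommend(sessions, current_session, top_n=10):
    """Rank items by cosine similarity to the last item of the session."""
    item_sessions = build_item_sessions(sessions)
    last_item = current_session[-1]
    last_set = item_sessions.get(last_item, set())
    scores = {}
    for item, sids in item_sessions.items():
        if item == last_item:
            continue
        overlap = len(last_set & sids)
        if overlap:
            # Cosine similarity between binary session-occurrence vectors.
            scores[item] = overlap / math.sqrt(len(last_set) * len(sids))
    return sorted(scores, key=lambda i: -scores[i])[:top_n]
```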
2. Deep Neural Methods with Attention Mechanism.
• NARM [li2017neural] is an RNN-based model with an attention mechanism, which combines the last hidden vector and the main purpose from the hidden states as the final representation to produce recommendations.
• STAMP [LiuZMZ18] applies attention layers on item representations directly and captures the user’s long-term preference as well as short-term interest from the session context.
• GCE-GNN [wang2020global] constructs both the local (current session) and global (all sessions) graphs to obtain session- and global-level item embeddings. Then, before the soft attention, it incorporates the reversed position information into the item embedding.
3. Deep Diversified Methods.
• MCPRN [Wang0WSOC19] models users’ multiple purposes in a session, rather than only one purpose as in common SBRSs. Furthermore, it combines the learned purposes via target-aware attention to get the final representation. As stated in the original paper, MCPRN can boost both accuracy and diversity.
• NARM+MMR [chen2020improving] is a two-stage approach which, in the second stage, uses MMR [carbonell1998use] and a greedy algorithm to re-rank the items provided by NARM according to relevance scores in the first stage.
• IDSR [chen2020improving] is the first end-to-end deep neural network for SBRSs that takes both diversity and accuracy into account. A trade-off hyper-parameter is used to balance the relevance score and the diversification score.
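The greedy second stage of a re-ranking baseline such as NARM+MMR can be sketched as below, using the classic MMR criterion: at each step, pick the candidate that maximizes a weighted difference between its relevance and its maximum similarity to the items already selected. The binary same-category similarity, the `trade_off` default, and the function name are assumptions for illustration; the exact scoring used in the evaluated baseline may differ.

```python
def mmr_rerank(relevance, item_category, top_n, trade_off=0.5):
    """Greedy MMR re-ranking of first-stage relevance scores.

    relevance:     dict mapping item -> relevance score from the first stage
    item_category: dict mapping item -> category
    trade_off:     hypothetical weight between relevance and redundancy
    """
    candidates = set(relevance)
    selected = []

    def mmr_score(item):
        # Similarity is 1 if two items share a category, else 0 (assumption).
        max_sim = max(
            (1.0 if item_category[item] == item_category[s] else 0.0
             for s in selected),
            default=0.0,
        )
        return trade_off * relevance[item] - (1.0 - trade_off) * max_sim

    while candidates and len(selected) < top_n:
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Because every pick rescans the remaining candidates against the selected set, inference is iterative, in contrast to a single Top-$N$ sort over final scores.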
4.3 Evaluation Metrics
We adopt the following metrics related to accuracy, diversity, and both to conduct a thorough evaluation. Higher metric values indicate better performance. Towards accuracy, we select HR (Hit Rate), MRR (Mean Reciprocal Rank), and NDCG (Normalized Discounted Cumulative Gain), following state-of-the-art works [li2017neural, WuT0WXT19, wang2020global]. Specifically, HR indicates whether the Top-$N$ Recommended List (abbreviated as RL, where $N$ is the length of the RL) contains the target item; MRR and NDCG both measure the hit position and encourage the target item to rank higher in the recommended list. Towards diversity, we choose the widely used ILD (Intra-List Distance) [Cen2020ControllableMF, chen2020improving], Entropy [Wang0WSOC19, zheng2021dgcn], and Diversity Score [liang2021enhancing] as the evaluation metrics. Particularly, ILD measures the average distance between each pair of items in the recommended list,
$$\mathrm{ILD} = \frac{2}{N(N-1)} \sum_{i, j \in RL,\, i < j} d_{ij}, \tag{6}$$
where $d_{ij}$ represents the Euclidean distance between the respective embeddings (e.g., one-hot encoding) of the categories that items $i$ and $j$ belong to.
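A small sketch of ILD under the one-hot assumption mentioned above: two one-hot category vectors are at Euclidean distance $\sqrt{2}$ when the categories differ and $0$ when they coincide, so the metric reduces to counting differing pairs. The function name `ild` and the choice to return `0.0` for lists with fewer than two items are assumptions of this sketch.

```python
import math
from itertools import combinations


def ild(rec_categories):
    """Average pairwise category distance within one recommended list.

    rec_categories: list with the category of each recommended item.
    Assumes one-hot category embeddings, so the Euclidean distance is
    sqrt(2) for different categories and 0 for identical ones.
    """
    n = len(rec_categories)
    if n < 2:
        return 0.0
    total = sum(
        math.sqrt(2.0) if a != b else 0.0
        for a, b in combinations(rec_categories, 2)
    )
    return 2.0 * total / (n * (n - 1))
```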
Entropy measures the entropy of the item category distribution in the recommended list; and Diversity Score (abbreviated as DS) is calculated as the number of interacted/recommended categories divided by the number of interacted/recommended items. Additionally, we use F-score [hu2017diversifying], the harmonic mean of HR and ILD, as an aggregate indicator capturing both accuracy and diversity.
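The remaining diversity and aggregate measures can be sketched in a few lines. The natural logarithm in `category_entropy` is an assumption, as the base is not stated here, and `f_score` simply applies the harmonic mean to one pair of values; how the scores are aggregated over test sessions is likewise not specified in this section.

```python
import math
from collections import Counter


def category_entropy(rec_categories):
    """Entropy of the category distribution in one recommended list."""
    n = len(rec_categories)
    counts = Counter(rec_categories)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def diversity_score(rec_categories):
    """Number of distinct categories divided by the number of items."""
    return len(set(rec_categories)) / len(rec_categories)


def f_score(hr, ild_value):
    """Harmonic mean of an accuracy value (HR) and a diversity value (ILD)."""
    if hr + ild_value == 0:
        return 0.0
    return 2.0 * hr * ild_value / (hr + ild_value)
```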
4.4 Hyper-parameter Settings
For a fair comparison, we use the Bayesian TPE [bergstra2011algorithms] of the Hyperopt framework (https://github.com/hyperopt/hyperopt) to tune the hyper-parameters of all methods according to their performance on the validation set (i.e., the last week of the training set). Compared to grid and random search, TPE has proven to be a more intelligent and effective technique, especially for deep methods, which have more hyper-parameters [sun2020we]. We have integrated all the code with the PyTorch framework, except for IDSR, for which we adopt the official code (https://bitbucket.org/WanyuChen/idsr/) with its own early-stopping mechanism. For all methods, Adam is utilized as the model optimizer; the dimension of item embedding is searched in the range of stepped by 50; the learning rate is searched in ; the size of mini-batch is searched from ; the number of epochs is searched in the range of stepped by 5. The exceptions are GCE-GNN, where we set its dimension of item embedding and size of mini-batch as (consistent with the original paper setting) due to memory space limitations, and MCPRN, where we set the size of mini-batch as . For IDSR, we search , which balances the importance of relevance and diversification scores, in on every dataset. Moreover, for NARM+MMR, we set the multiplier for the diversification score in MMR, so as to avoid a significant decrease (e.g., more than 20% decline) in accuracy in comparison with NARM. The detailed best hyper-parameter settings are shown in Table 3.
5 Experimental Results
In this section, we evaluate the performance of DCA-SBRS on the three selected real-world datasets to verify its superiority (in comparison with other SOTA methods) and the effectiveness of its respective modules. Additionally, we analyze the shortcomings of the standard comprehensive measurement to measure both accuracy and diversity (i.e., F-score), and provide remedies accordingly.
5.1 Overall Comparisons
Tables 4-6 exhibit the experimental results of the chosen baselines on the three real-world datasets, where the best result for each metric is highlighted in boldface and the runner-up is underlined; the row ‘Improvements’ indicates the average relative improvement achieved by our DCA-SBRSs over the corresponding SBRSs on various metrics across the three datasets, as shown in Equation 7. Note that the reported performance per model in the tables is the average over 5 runs with the best hyper-parameter settings.
$$\begin{aligned} \text{Improvements} = \frac{1}{3}\Big(&\frac{\text{DCA-NARM} - \text{NARM}}{\text{NARM}} + \frac{\text{DCA-STAMP} - \text{STAMP}}{\text{STAMP}} \\ &+ \frac{\text{DCA-GCEGNN} - \text{GCE-GNN}}{\text{GCE-GNN}}\Big) \times 100\% \end{aligned} \tag{7}$$
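For clarity, the averaging behind an ‘Improvements’ entry can be sketched as the mean of per-model relative changes. The helper name and the toy numbers in the usage note are hypothetical and chosen only to show the arithmetic; they are not values from the tables.

```python
def average_relative_improvement(pairs):
    """Mean relative change over (baseline, enhanced) metric pairs.

    pairs: list of (baseline_value, enhanced_value), one per base SBRS.
    Returns a fraction; multiply by 100 for a percentage.
    """
    return sum((new - base) / base for base, new in pairs) / len(pairs)
```

For instance, the hypothetical pairs `[(0.2, 0.3), (0.4, 0.5)]` give relative changes of 50% and 25%, so the function returns `0.375`.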
Model | NDCG@10 | NDCG@20 | MRR@10 | MRR@20 | HR@10 | HR@20 | ILD@10 | ILD@20 | Entropy@10 | Entropy@20 | DS@10 | DS@20 | F-score@10 | F-score@20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Item-KNN | 0.1313 | 0.1438 | 0.0999 | 0.1036 | 0.2343 | 0.2814 | 0.1653 | 0.2247 | 0.2852 | 0.4353 | 0.1562 | 0.1376 | 0.0375 | 0.0635 |
BPR-MF | 0.0799 | 0.0954 | 0.0618 | 0.0661 | 0.1397 | 0.2012 | 0.5334 | 0.5799 | 0.9490 | 1.2148 | 0.2871 | 0.2159 | 0.0676 | 0.1061 |
NARM | 0.3191 | 0.3468 | 0.2578 | 0.2654 | 0.5162 | 0.6256 | 0.1811 | 0.2519 | 0.3047 | 0.5037 | 0.1575 | 0.1182 | 0.0921 | 0.1645 |
STAMP | 0.3143 | 0.3385 | 0.2558 | 0.2624 | 0.5018 | 0.5973 | 0.2704 | 0.3923 | 0.4781 | 0.8410 | 0.1977 | 0.1783 | 0.1381 | 0.2491 |
GCE-GNN | | | | | | | 0.1124 | 0.1623 | 0.1825 | 0.3096 | 0.1328 | 0.0892 | 0.0627 | 0.1145 |
MCPRN | 0.2321 | 0.2610 | 0.1858 | 0.1938 | 0.3829 | 0.4972 | 0.2671 | 0.3394 | 0.4651 | 0.7106 | 0.1935 | 0.1556 | 0.1100 | 0.1867 |
NARM+MMR | 0.2626 | 0.2896 | 0.2092 | 0.2167 | 0.4354 | 0.5420 | 0.3484 | 0.4574 | 0.6157 | 0.9691 | 0.2234 | 0.1909 | 0.1401 | 0.2443 |
IDSR() | 0.2681 | 0.2958 | 0.2140 | 0.2217 | 0.4438 | 0.5532 | 0.4105 | 0.4635 | 0.7464 | 1.0110 | 0.2593 | 0.2090 | 0.1814 | 0.2688 |
DCA-NARM | 0.3226 | 0.3435 | 0.2641 | 0.2699 | 0.5099 | 0.5920 | 0.4115 | 0.6791 | 0.7698 | 1.6254 | 0.2691 | 0.3399 | 0.2022 | 0.4017 |
DCA-STAMP | 0.3067 | 0.3237 | 0.2529 | 0.2577 | 0.4779 | 0.5444 | ||||||||
DCA-GCEGNN | 0.3342 | 0.3554 | 0.2813 | 0.2872 | 0.5032 | 0.5868 | 0.3090 | 0.5419 | 0.5844 | 1.2960 | 0.2304 | 0.2836 | 0.1426 | 0.3172 |
Improvements | -1.56% | -3.29% | -0.29% | -0.91% | -3.82% | -7.38% | 138% | 175% | 168% | 232% | 73.6% | 184% | 114% | 136% |
Table 5: Performance of all methods on Retailrocket.
Model | NDCG@10 | NDCG@20 | MRR@10 | MRR@20 | HR@10 | HR@20 | ILD@10 | ILD@20 | Entropy@10 | Entropy@20 | DS@10 | DS@20 | F-score@10 | F-score@20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Item-KNN | 0.1558 | 0.1634 | 0.1267 | 0.1289 | 0.2491 | 0.2777 | 0.6868 | 0.7954 | 1.2871 | 1.7206 | 0.3749 | 0.3822 | 0.1491 | 0.1979 |
BPR-MF | 0.1244 | 0.1369 | 0.1037 | 0.1072 | 0.1915 | 0.2407 | 0.8106 | 0.8599 | 1.5023 | 1.8863 | 0.4077 | 0.3183 | 0.1391 | 0.1899 |
NARM | 0.3625 | 0.3815 | 0.3138 | 0.3190 | 0.5181 | 0.5928 | 0.4860 | 0.5885 | 0.8698 | 1.2658 | 0.2767 | 0.2369 | 0.2475 | 0.3507 |
STAMP | 0.3516 | 0.3688 | 0.3068 | 0.3115 | 0.4945 | 0.5624 | 0.5313 | 0.6563 | 0.9769 | 1.4613 | 0.3046 | 0.2739 | 0.2530 | 0.3642 |
GCE-GNN | | | | | | | 0.3701 | 0.4525 | 0.6312 | 0.9139 | 0.2207 | 0.1744 | 0.2143 | 0.3044 |
MCPRN | 0.2363 | 0.2501 | 0.2085 | 0.2123 | 0.3252 | 0.3799 | 0.7664 | 0.8432 | 1.4931 | 2.0162 | 0.4322 | 0.3852 | 0.2293 | 0.2930 |
NARM+MMR | 0.3234 | 0.3413 | 0.2785 | 0.2834 | 0.4669 | 0.5375 | 0.6247 | 0.7436 | 1.1543 | 1.6684 | 0.3424 | 0.3073 | 0.2764 | 0.3863 |
IDSR() | 0.2863 | 0.3116 | 0.2526 | 0.2596 | 0.3998 | 0.4996 | | 1.0939 | | 2.7566 | | 0.5506 | | 0.5093 |
DCA-NARM | 0.3654 | 0.3804 | 0.3200 | 0.3241 | 0.5099 | 0.5688 | 0.7181 | 0.9328 | 1.3801 | 2.3049 | 0.4074 | 0.4618 | 0.3544 | 0.5053 |
DCA-STAMP | 0.3362 | 0.3471 | 0.2929 | 0.2960 | 0.4726 | 0.5155 | 0.9061 | | 1.8147 | | 0.5230 | | 0.3994 | |
DCA-GCEGNN | 0.3826 | 0.3985 | 0.3364 | 0.3408 | 0.5293 | 0.5921 | 0.5970 | 0.7813 | 1.1258 | 1.8713 | 0.3461 | 0.3754 | 0.3103 | 0.4533 |
Improvements | -1.97% | -3.05% | -1.45% | -1.80% | -3.15% | -5.78% | 59.9% | 67.7% | 74.3% | 96.5% | 58.6% | 111% | 48.6% | 45.8% |
Table 6: Performance of all methods on Tmall.
Model | NDCG@10 | NDCG@20 | MRR@10 | MRR@20 | HR@10 | HR@20 | ILD@10 | ILD@20 | Entropy@10 | Entropy@20 | DS@10 | DS@20 | F-score@10 | F-score@20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Item-KNN | | | | | 0.0551 | 0.0655 | 0.8888 | 0.9593 | 1.6790 | 2.0452 | 0.4546 | 0.4219 | 0.0442 | 0.0573 |
BPR-MF | 0.0096 | 0.0119 | 0.0069 | 0.0075 | 0.0186 | 0.0279 | 0.9963 | 1.0350 | 1.8716 | 2.3219 | 0.4852 | 0.3805 | 0.0168 | 0.0259 |
NARM | 0.0244 | 0.0306 | 0.0174 | 0.0191 | 0.0476 | 0.0720 | 0.9453 | 1.0085 | 1.7760 | 2.2625 | 0.4689 | 0.3778 | 0.0386 | 0.0642 |
STAMP | 0.0171 | 0.0215 | 0.0121 | 0.0133 | 0.0336 | 0.0511 | 1.0449 | 1.0959 | 2.0375 | 2.5806 | 0.5428 | 0.4494 | 0.0292 | 0.0481 |
GCE-GNN | 0.0282 | 0.0355 | 0.0187 | 0.0207 | | | 0.8571 | 0.9326 | 1.5691 | 2.0340 | 0.4161 | 0.3345 | 0.0443 | 0.0744 |
MCPRN | 0.0110 | 0.0142 | 0.0075 | 0.0084 | 0.0225 | 0.0354 | 1.0661 | 1.1042 | 2.1139 | 2.6437 | 0.5686 | 0.4679 | 0.0193 | 0.0326 |
NARM+MMR | 0.0198 | 0.0249 | 0.0141 | 0.0154 | 0.0386 | 0.0592 | 1.0116 | 1.0634 | 1.9386 | 2.4437 | 0.5124 | 0.4155 | 0.0331 | 0.0548 |
IDSR() | 0.0083 | 0.0114 | 0.0054 | 0.0063 | 0.0179 | 0.0303 | | 1.2969 | 2.8725 | 3.4530 | 0.8108 | 0.6773 | 0.0192 | 0.0327 |
DCA-NARM | 0.0145 | 0.0171 | 0.0106 | 0.0113 | 0.0272 | 0.0374 | 1.3096 | | | | | | 0.0274 | 0.0402 |
DCA-STAMP | 0.0164 | 0.0192 | 0.0122 | 0.0129 | 0.0304 | 0.0414 | 1.2720 | 1.3274 | 2.7719 | 3.7395 | 0.7919 | 0.7962 | 0.0311 | 0.0453 |
DCA-GCEGNN | 0.0259 | 0.0329 | 0.0165 | 0.0185 | 0.0566 | 0.0843 | 0.9647 | 1.0464 | 1.8368 | 2.4230 | 0.4894 | 0.4205 | ||
Improvements | -17.6% | -20.7% | -16.7% | -18.16% | -19.03% | -24.0% | 24.3% | 22.3% | 39.4% | 45.4% | 48.6% | 76.5% | -5.70% | -12.9% |
5.1.1 Performance on Recommendation Accuracy
The accuracy of all approaches is measured via NDCG@K, MRR@K, and HR@K (K ∈ {10, 20}) in Tables 4-6, from which several observations can be made. 1) For traditional methods, Item-KNN outperforms BPR-MF across all three datasets. Both are generally defeated by the deep neural approaches, except for Item-KNN on Tmall. 2) Compared with our proposed framework, the existing accuracy-focused SBRSs rank first, since their neural networks learn more precise item embeddings and their attention mechanisms filter out noise. Among them, GCE-GNN outperforms other methods on all three datasets, which demonstrates the expressive power of combining the local current-session graph with the global session graph. 3) The accuracy of the aforementioned SBRSs decreases slightly under our DCA framework, with a few exceptions, such as DCA-NARM vs. NARM on Diginetica and Retailrocket. However, this perturbation (e.g., average NDCG@10 drops of 1.56% and 1.97% on Diginetica and Retailrocket, respectively, per the ‘Improvements’ rows of Tables 4 and 5) can be tolerated given our significant enhancements in diversity and comprehensive metrics, which are elaborated in what follows. 4) Deep diversified SBRSs generally perform better than traditional methods but worse than the accuracy-oriented deep methods, owing to their special designs for gaining higher diversity. Compared with NARM, the accuracy of NARM+MMR drops significantly across the three datasets. It is worth noting that our DCA-SBRSs show a clear advantage over deep diversified methods; for instance, on Tmall our DCA-SBRSs achieve, on average, roughly twice the HR@10 of IDSR.
5.1.2 Performance on Recommendation Diversity
The diversity of all methods is measured via ILD@K, Entropy@K, and DS@K (K ∈ {10, 20}) in Tables 4-6. Three major findings can be noted. 1) Existing SBRSs benefit significantly from our proposed DCA framework. For instance, the average relative improvements on ILD@10 achieved by our DCA-SBRSs over the corresponding SBRSs (e.g., DCA-NARM vs. NARM) reach 138%, 59.9%, and 24.3% on Diginetica, Retailrocket, and Tmall, respectively. Besides, some of our DCA-SBRSs (e.g., DCA-STAMP) outperform all other methods (including the deep diversified models) on Diginetica and Tmall. 2) Among the diversified models, IDSR exceeds MCPRN on all three datasets. Meanwhile, all of them beat existing accuracy-oriented SBRSs (except MCPRN vs. STAMP on Diginetica), indicating the efficacy of these diversified methods in gaining better diversity. 3) Existing accuracy-oriented SBRSs perform worst because they ignore the demand for diversity. Among them, STAMP performs best across all three datasets. Moreover, traditional methods (led by BPR-MF), though surpassed by these accuracy-oriented SBRSs in recommendation accuracy, perform slightly better when it comes to diversity.
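To make the three diversity measures concrete, the sketch below computes them for a single recommendation list from its item categories. The exact definitions are not restated in this section, so the forms used here are assumptions: ILD as the mean pairwise Euclidean distance between one-hot category vectors (so two items differ by √2 or 0), Entropy as the base-2 Shannon entropy of the category distribution, and DS as the number of distinct categories divided by the list length. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ild(categories):
    """Assumed ILD: mean pairwise distance between one-hot category vectors."""
    pairs = list(combinations(categories, 2))
    if not pairs:
        return 0.0
    differing = sum(1 for a, b in pairs if a != b)
    # Two distinct one-hot vectors are sqrt(2) apart; identical ones are 0 apart.
    return math.sqrt(2.0) * differing / len(pairs)


def entropy(categories):
    """Assumed Entropy: base-2 Shannon entropy of the category distribution."""
    if not categories:
        return 0.0
    n = len(categories)
    return sum((c / n) * math.log2(n / c) for c in Counter(categories).values())


def diversity_score(categories):
    """Assumed DS: number of distinct categories divided by list length."""
    if not categories:
        return 0.0
    return len(set(categories)) / len(categories)


top_k_categories = ["a", "a", "b", "c"]  # categories of a hypothetical top-4 list
print(ild(top_k_categories), entropy(top_k_categories), diversity_score(top_k_categories))
```

Under these assumed definitions, ILD is bounded by √2 and DS by 1, which is consistent with ILD values above 1 and DS values below 1 appearing side by side in the tables.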
5.1.3 Comprehensive Performance
To comprehensively assess the performance from both accuracy and diversity perspectives, we further compare all methods in terms of F-score@K (K ∈ {10, 20}) in Tables 4-6, and several interesting findings emerge. 1) Our proposed DCA-SBRSs perform the best among all baselines. Specifically, it is encouraging that some of our DCA-SBRSs defeat diversified models in terms of both accuracy and diversity (e.g., DCA-STAMP on Diginetica, DCA-NARM on Tmall, and DCA-NARM vs. NARM+MMR on Diginetica and Retailrocket). Additionally, our framework outperforms accuracy-oriented SBRSs with significant gains in diversity and only minor drops in accuracy. 2) Among deep diversified models, IDSR achieves both better accuracy and diversity than MCPRN and NARM+MMR on Diginetica and Retailrocket, demonstrating the superiority of IDSR over MCPRN. 3) Typically, traditional methods perform worse than accuracy-oriented SBRSs. Comparing accuracy-oriented SBRSs and diversified SBRSs, the former perform better on Tmall but worse on Diginetica. This is mainly caused by the calculation of the F-score (harmonic mean of HR and ILD). Due to the different features (e.g., distribution) of the datasets, the HR and ILD results may vary considerably from one dataset to another. For instance, in Tables 4 and 6 the HR values of the deep methods are generally higher than their ILD values on Diginetica, while the opposite holds on Tmall. Therefore, the model achieving the best result w.r.t. the weaker metric (e.g., HR on Tmall) gains an advantage in the comprehensive performance, i.e., F-score.
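The dominance of the weaker metric follows directly from the harmonic mean. The sketch below uses made-up HR and ILD values (not entries from the tables) on a Tmall-like scale, where HR is far smaller than ILD; the function name is an illustrative assumption.

```python
def f_score(hr, ild):
    """Harmonic mean of an accuracy value (HR) and a diversity value (ILD)."""
    if hr + ild == 0:
        return 0.0
    return 2.0 * hr * ild / (hr + ild)


# Hypothetical operating point: HR is the weaker (smaller) metric.
base = f_score(0.05, 1.0)
more_diverse = f_score(0.05, 1.3)   # +30% ILD, same HR
more_accurate = f_score(0.06, 1.0)  # +20% HR, same ILD

print(base, more_diverse, more_accurate)
```

In this toy setting a 20% gain in HR moves the F-score far more than a 30% gain in ILD, which illustrates why accuracy-oriented models can come out ahead whenever HR is the smaller of the two metrics.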
Interestingly, we notice that all methods perform worse regarding the recommendation accuracy whilst better w.r.t. diversity on Tmall compared with the other two datasets. This might be caused by the unique data distribution of Tmall, i.e., lower RR and higher DS in Table 2. Nevertheless, our proposed DCA still exceeds other diversified SBRSs, showing the stability of our DCA.
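Table 7: Training time (per epoch) and inference time of representative methods.
Model | Training/epoch (Digi*) | Training/epoch (Retail*) | Training/epoch (Tmall) | Inference (Digi*) | Inference (Retail*) | Inference (Tmall) |
---|---|---|---|---|---|---|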
NARM | 49s | 68s | 156s | 560s | 137s | 2678s |
NARM+MMR | 49s | 68s | 156s | 5173s | 2082s | 6075s |
MCPRN | 244s | 3003s | 1138s | 2325s | 1689s | 3167s |
IDSR | 1486s | 1646s | 4604s | 62s | 29s | 1928s |
DCA-NARM | 127s | 159s | 418s | 8s | 4s | 134s |
Note: Diginetica and Retailrocket are shortened as Digi* and Retail*. |
5.1.4 Performance on Time Complexity
Following the discussion in Section 3.4, we empirically verify the efficiency of our lightweight DCA. To this end, we record the training and inference time of representative methods, including NARM, NARM+MMR, MCPRN, IDSR, and our DCA-NARM, across the three datasets, as shown in Table 7. Two major findings are noted. 1) MMR is a two-stage re-ranking method: it runs a greedy, diversity-promoting search over the output of the NARM trained in the first stage. NARM+MMR hence has a substantially longer inference time than NARM. By contrast, our DCA-NARM is trained end to end and avoids greedy search at inference, and is thus faster than NARM+MMR. 2) Unlike other diversified SBRSs (i.e., IDSR and MCPRN), which rely on specifically calibrated diversity-aware components, our DCA framework is efficient in both the training and inference stages because it introduces only a limited number of additional parameters.
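To illustrate why greedy re-ranking is costly at inference, here is a minimal sketch of a generic MMR-style re-ranker. It is not the configuration used for NARM+MMR: the similarity function (1 if two items share a category, else 0), the weight `lam`, and all names are assumptions made for illustration.

```python
def mmr_rerank(scores, categories, k, lam=0.5):
    """Greedy MMR-style re-ranking of candidate indices.

    scores: relevance score per candidate (e.g., from a trained predictor).
    categories: category label per candidate, used for a toy similarity.
    lam: trade-off between relevance (lam) and redundancy penalty (1 - lam).
    """
    candidates = set(range(len(scores)))
    selected = []
    while candidates and len(selected) < k:

        def mmr_value(i):
            # Redundancy: highest similarity to anything already selected.
            max_sim = max(
                (1.0 if categories[i] == categories[j] else 0.0 for j in selected),
                default=0.0,
            )
            return lam * scores[i] - (1.0 - lam) * max_sim

        # Every step rescans all remaining candidates; ties go to the lower index.
        best = max(candidates, key=lambda i: (mmr_value(i), -i))
        selected.append(best)
        candidates.remove(best)
    return selected


scores = [0.9, 0.8, 0.7, 0.6]
categories = ["a", "a", "b", "c"]
print(mmr_rerank(scores, categories, k=3, lam=0.5))  # prints [0, 2, 3]
print(mmr_rerank(scores, categories, k=3, lam=1.0))  # prints [0, 1, 2]
```

Each of the K selection steps rescans the remaining candidates and compares them against the items already chosen, whereas an end-to-end model only needs a single top-K pass over its output scores.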
5.1.5 Adaptation on F-score
We now discuss the drawbacks of the current comprehensive metric (F-score [hu2017diversifying]) and provide remedies accordingly. First, due to the different scales of HR and ILD, the weaker metric may easily dominate the final comprehensive performance, particularly on Tmall in Table 6. Therefore, it is necessary to map the two metrics into the same range before calculating the F-score. Alternatively, we may replace ILD with DS (Diversity Score [liang2021enhancing]) in the F-score, since HR and DS share the same range. Second, a clear decline in accuracy is generally not acceptable in real-world recommendation scenarios. According to Tables 4-6, diversified models suffer apparent drops in accuracy in exchange for their significant improvements in diversity. However, their comprehensive performance (i.e., F-score) is not the worst, and is even the best on Retailrocket (Table 5). That is to say, the current comprehensive metric does not match what real-world applications actually expect. As such, we propose a generalized comprehensive metric F to solve the aforementioned issue, as below:
(8) |
where . Accordingly, the F-score [hu2017diversifying] can be regarded as a special case, i.e., . For a consistent range of ACC and DIV, we recommend . Additionally, if accuracy is prioritized over diversity, we suggest , e.g., , to put more emphasis on accuracy since it is less meaningful to gain diversity without taking accuracy into account in real-world applications. Note that with the proposed , our proposed DCA-SBRSs rank first thanks to the satisfying performance on accuracy and superior performance on diversity, as shown in Table 8. Specifically, on Retailrocket, the ranking of our DCA-SBRS improves w.r.t. , while diversified models (e.g., IDSR) experience a decline in ranking by changing from to due to its inferior accuracy performance.
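The effect of weighting accuracy more heavily can be illustrated with the textbook F_β form. This is an assumption made purely for illustration, not a restatement of Equation 8: here ACC plays the role of precision and DIV the role of recall, so β = 1 recovers the harmonic mean and β < 1 puts more emphasis on accuracy. The function name and all numbers are hypothetical.

```python
def f_beta(acc, div, beta=1.0):
    """Textbook F_beta with ACC as 'precision' and DIV as 'recall' (assumed form)."""
    denom = beta ** 2 * acc + div
    if denom == 0:
        return 0.0
    return (1.0 + beta ** 2) * acc * div / denom


# Two hypothetical models: A is accuracy-leaning, B is diversity-leaning.
model_a = {"acc": 0.5, "div": 0.4}
model_b = {"acc": 0.3, "div": 0.9}

for beta in (1.0, 0.5):
    score_a = f_beta(model_a["acc"], model_a["div"], beta)
    score_b = f_beta(model_b["acc"], model_b["div"], beta)
    print(beta, round(score_a, 4), round(score_b, 4))
```

With β = 1 the diversity-leaning model B ranks slightly higher, whereas with β = 0.5 the accuracy-leaning model A ranks first, mirroring the kind of ranking change described above for models with inferior accuracy.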
5.2 The Impact of Essential Modules
5.2.1 Impact of Model-agnostic Diversified Loss (abbr. MDL)
Our proposed MDL in Equation 1 aims to improve the diversity of accuracy-oriented SBRSs as an end-to-end plugin by punishing monotonous RL with low diversity. In Figure 4, we compare the accuracy-oriented SBRSs (labeled as ‘SBRSs’) and the corresponding variants with our MDL supplemented solely (labeled as ‘SBRSs+MDL’) w.r.t. ILD@. Accordingly, by adding our MDL, the diversity of all baseline SBRSs significantly improves across the three datasets. Specifically, on Diginetica, Retailrocket, and Tmall, the average relative improvements are , , and , respectively. Besides, among the three selected baselines (NARM, STAMP, and GCE-GNN), MDL improves NARM most (i.e., ).
It is worth noting that, for simplicity, we set in Equation 3. To analyze the effect of MDL in a fine-grained manner, we select NARM as our basic predictor and vary the value of from 0 to 1 stepped by 0.1. Figure 5 depicts the variation w.r.t. accuracy (i.e., NDCG and HR), diversity (i.e., ILD), and comprehensive performance (i.e., F-score) with varied on the three datasets (for ease of presentation, we display the values of ‘ILD minus one’, i.e., ILD-1, on Tmall to keep all metrics on a proper scale without changing the overall trend). As noted, the accuracy slightly decreases with the increasing of on all three datasets, whilst a significant enhancement in diversity is noted on all datasets, showcasing the remarkable effectiveness of our MDL. As for comprehensive performance, F-score climbs when varies from 0 to 1 on Diginetica and Retailrocket, whereas it declines slightly on Tmall. A possible explanation can be found in Section 5.1.3. As a whole, the recommendation accuracy drops and diversity increases by boosting the value of gradually. This indicates the necessity of fine-tuning to achieve more satisfying performance.
5.2.2 Impact of Non-invasive Category-aware Attention (abbr. NCA)
As indicated in Section 5.2.1, the recommendation accuracy of baseline SBRSs may drop slightly when integrating our designed MDL. To ease this issue, we propose category-aware attention (i.e., NCA), which imports category information into the pervasive attention mechanisms in SBRSs with the goal of assisting item prediction. This differs from simply concatenating category information to the input of SBRSs. For verification, we compare accuracy-oriented SBRSs (labeled as ‘SBRSs’) and the corresponding variants obtained by simply substituting the attention mechanism with our category-aware attention (labeled as ‘SBRSs+NCA’) on accuracy (i.e., NDCG@10), as depicted in Figure 6. In general, replacing the attention mechanism with our NCA improves the accuracy of SBRSs. Specifically, NCA helps NARM and GCE-GNN enhance their accuracy on all datasets. A similar trend holds for STAMP on Tmall; however, on the other two datasets, the accuracy of STAMP+NCA does not improve. This is perhaps due to the straightforward design of STAMP, which employs item embeddings directly rather than hidden states from RNNs or GNNs (as in NARM and GCE-GNN). As a result, STAMP+NCA simply sums item embeddings and the relevant category embeddings before computing attention scores, which may introduce more noise that interferes with the final item prediction.
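The idea of letting category information influence only the attention scores can be sketched as follows. This is a simplified illustration under stated assumptions, not the NCA implementation: it uses plain dot-product scoring against a query vector (e.g., the last hidden state), sums item states with category embeddings to form the keys, and pools the unmodified item states as values. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def category_aware_attention(item_states, category_embs, query):
    """Score with (item + category) keys, but pool the original item states."""
    keys = [
        [h + c for h, c in zip(state, cat)]
        for state, cat in zip(item_states, category_embs)
    ]
    weights = softmax([dot(key, query) for key in keys])
    dim = len(item_states[0])
    session = [
        sum(w * state[d] for w, state in zip(weights, item_states))
        for d in range(dim)
    ]
    return weights, session


item_states = [[1.0, 0.0], [0.0, 1.0]]    # toy item representations
category_embs = [[0.5, 0.0], [0.0, 0.0]]  # toy category embeddings
query = [1.0, 0.0]

weights, session = category_aware_attention(item_states, category_embs, query)
plain_weights = softmax([dot(state, query) for state in item_states])
print(weights, plain_weights)
```

In this toy case the category embedding raises the attention weight of the first item relative to plain item-only attention, while the pooled session vector is still a combination of the original item states.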
Overall, our DCA framework helps existing accuracy-oriented SBRSs achieve extraordinary gains in diversity and comprehensive performance while maintaining accuracy, even though SBRSs+NCA do not improve accuracy consistently on all datasets, as shown in Figure 6 (this may be caused by the different features of the datasets or the designs of the baseline predictors). In other words, the efficacy of our proposed framework does not rely on NCA alone.
5.3 Discussion on our proposed DCA framework
Our proposed Diversified Category-aware Attentive (DCA) framework comprises two key components: a model-agnostic diversity-oriented loss function and a non-invasive category-aware attention mechanism. To evaluate the efficacy of the DCA framework, we selected three deep neural methods with attention mechanisms as their backbone, as detailed in Section 4.2 and Section 5. Notably, these methods all rank among the top five SBRSs in terms of accuracy [under2023]. In the session-based evaluation survey [under2023], it is evident that all of the top-performing SBRSs in accuracy leverage attention mechanisms.
However, our DCA framework is not limited to attention-based models. Even when the original SBRS does not use an attention mechanism, this component can be integrated seamlessly to enhance session representation. Specifically, we adopt GRU4Rec [hidasi2015session], an RNN-based SBRS without an attention mechanism, as our backbone model to showcase the effectiveness of our DCA framework in this context. In the accompanying figure, we compare GRU4Rec with two variants, GRU4Rec with an attention mechanism and DCA-GRU4Rec, in terms of accuracy, diversity, and comprehensive performance. In summary, GRU4Rec with an attention mechanism outperforms the baseline GRU4Rec in terms of accuracy but lags in terms of diversity. Our DCA-GRU4Rec, on the other hand, achieves similar accuracy to GRU4Rec with an attention mechanism while significantly enhancing diversity and delivering a satisfactory overall performance. This substantiates the effectiveness of our DCA framework when applied to backbone models without attention mechanisms.
In conclusion, our DCA framework is highly versatile and can be seamlessly integrated into common SBRSs, whether they incorporate attention mechanisms or not, consistently showcasing its effectiveness in enhancing recommendation system performance.