Are We Really Achieving Better Beyond-Accuracy Performance
in Next Basket Recommendation?

Ming Li (ORCID 0000-0001-7430-4961) and Yuanna Liu (ORCID 0000-0002-9868-6578), University of Amsterdam, Amsterdam, The Netherlands
Sami Jullien (ORCID 0000-0003-4507-6335), AIRLab, University of Amsterdam, Amsterdam, The Netherlands
Mozhdeh Ariannezhad (ORCID 0000-0002-1113-8094), Booking.com, Amsterdam, The Netherlands
Andrew Yates (ORCID 0000-0002-5970-880X), University of Amsterdam, Amsterdam, The Netherlands
Mohammad Aliannejadi (ORCID 0000-0002-9447-4172), University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke (ORCID 0000-0002-1086-0202), University of Amsterdam, Amsterdam, The Netherlands
(2024)
Abstract.
Next basket recommendation (NBR) is a special type of sequential recommendation that is increasingly receiving attention. So far, most NBR studies have focused on optimizing the accuracy of the recommendation, whereas optimizing for beyond-accuracy metrics, e.g., item fairness and diversity, remains largely unexplored. Recent studies into NBR have found a substantial performance difference between recommending repeat items and explore items. Repeat items contribute most of the users’ perceived accuracy compared with explore items.

Informed by these findings, we identify a potential “short-cut” to optimize for beyond-accuracy metrics while maintaining high accuracy. To leverage and verify the existence of such short-cuts, we propose a plug-and-play two-step repetition-exploration (TREx) framework that treats repeat items and explore items separately: a simple yet highly effective repetition module ensures high accuracy, while two exploration modules target only beyond-accuracy metrics.

Experiments are performed on two widely-used datasets w.r.t. a range of beyond-accuracy metrics, viz. five fairness metrics and three diversity metrics. Our experimental results show that: (i) we can achieve state-of-the-art performance w.r.t. accuracy via the designed repetition module in TREx; and (ii) the simple TREx framework achieves “better” beyond-accuracy performance than existing sophisticated methods. Prima facie, this appears to be good news: we can achieve high accuracy and improved beyond-accuracy metrics at the same time. However, we argue that the real-world value of our algorithmic solution, TREx, is likely to be limited, and we reflect on the reasonableness of the evaluation setup. We end up challenging existing evaluation paradigms, particularly in the context of beyond-accuracy metrics, and provide insights for researchers to navigate potential pitfalls and determine reasonable metrics to consider when optimizing for accuracy and beyond-accuracy metrics.

Keywords: Next basket recommendation; Repetition and exploration; Evaluation

Journal year: 2024. Copyright: rights retained.
Conference: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA.
DOI: 10.1145/3626772.3657835. ISBN: 979-8-4007-0431-4/24/07.
CCS concepts: Information systems → Recommender systems; Information systems → Retrieval models and ranking.

1. Introduction

Recommender systems have become an essential instrument for connecting people to the content, services, and products they need. In e-commerce, more and more consumers purchase food and household products online instead of visiting physical retail stores (Kumar and Hosanagar, 2019). The COVID-19 pandemic has only accelerated this shift (OECD, 2020). In this scenario, consumers usually purchase a set of items at the same time, a so-called basket. Next basket recommendation (NBR) is a type of sequential recommendation that caters to this scenario: baskets are the target of recommendation and historical sequential data consists of users’ interactions with baskets. NBR has increasingly been attracting attention in recent years (Ariannezhad et al., 2023). Many methods, based on different machine learning techniques, have been proposed for accurate recommendations, e.g., Markov chain (MC)-based methods (Rendle et al., 2010; Wang et al., 2015), frequency and nearest neighbor-based methods (Hu et al., 2020; Faggioli et al., 2020), RNN-based methods (Yu et al., 2016; Le et al., 2019; Hu and He, 2019; Qin et al., 2021), and self-attention methods (Yu et al., 2020; Sun et al., 2020; Chen et al., 2021a).

Repetition vs. exploration in NBR. Recently, Li et al. (2023d) have assessed the performance of state-of-the-art NBR in terms of repeat and explore items: items that a user has interacted with before and items that they have never interacted with before, respectively. The authors distinguish between the task of repetition recommendation (recommending repeat items) and the task of exploration recommendation (recommending explore items). Repetition and exploration recommendations have different levels of difficulty, where recommending items that are regularly present in a user’s baskets is shown to be a far easier task (Li et al., 2023d). Building on these findings, repetition-only (Katz et al., 2022; Ariannezhad et al., 2022) and exploration-only (Li et al., 2023a) methods have been proposed to optimize the accuracy of next basket recommendation.

Accuracy and beyond-accuracy metrics. Even though accuracy naturally serves as the most important objective of recommendations, it is widely recognized that it should not be the sole focus. Beyond-accuracy metrics such as item fairness (Ekstrand et al., 2019; Wu et al., 2021; Ge et al., 2021; Wu et al., 2022) and diversity (Chen et al., 2021b; Zhang and Hurley, 2008; Zhao et al., 2023) also play crucial roles in evaluating recommendation services. Such beyond-accuracy metrics have gained increasing attention and have been optimized in a range of recommendation scenarios (Yin et al., 2023; Zhao et al., 2023). In the NBR scenario, however, beyond-accuracy metrics have been far less studied than accuracy-based metrics. In this paper, we help to address this knowledge gap. Following the paradigm of multiple-objective recommender systems (Jannach, 2022), it is widely recognized that there is a trade-off between accuracy and beyond-accuracy metrics. E.g., diversity goals are reckoned to stand in contrast with accuracy. Accordingly, a method that achieves better beyond-accuracy performance while maintaining the same level of accuracy is considered to be a success (Yin et al., 2023; Zhao et al., 2023). This raises the question: how can we achieve a reasonable balance between accuracy and beyond-accuracy metrics in NBR?

Potential “short-cuts” to balancing accuracy and beyond-accuracy metrics. Besides the imbalance between repetition and exploration (Li et al., 2023d, e, c, b), Li et al. also found that repeat items contribute most of the accuracy, whereas the explore items in the recommended basket contribute very little to the user’s perceived utility. As Table 1 summarizes, there are essential differences between the repetition and exploration tasks, which explain the substantial performance differences between the two tasks.

Inspired by these findings, we hypothesize that there may be a “short-cut” strategy to optimize for both accuracy and beyond-accuracy metrics, which contains two aspects: (i) accuracy: Predict repeat items to achieve good accuracy: predicting repeat items is much easier than predicting explore items (Li et al., 2023d), and (ii) beyond-accuracy: Use explore items to improve beyond-accuracy metrics: it is very difficult to recommend quality explore items. Thus, exchange the low accuracy that is typically achieved on such items for beyond-accuracy metrics, i.e., trade accuracy for diversity and item fairness. We call this NBR strategy a short-cut strategy because it avoids making the fundamental trade-off between accuracy and beyond-accuracy metrics.

Table 1. Comparison of the repetition and exploration tasks in NBR.
Aspect | Repetition | Exploration
--- | --- | ---
Task difficulty | Easy | Difficult
Number of items | Dozens | Thousands
Item interactions | Previous | None
Users’ interest | With feedback | Without feedback
Task type | Re-consume | Infer new

TREx framework. To operationalize our short-cut idea, and check whether the “short-cut” strategy can be made to work, we propose the two-step repetition-exploration (TREx) framework. TREx decouples the prediction of repeat items and explore items. Specifically, TREx uses separate models for predicting (a) repeat items, and (b) explore items, and then combines the outcomes of the two prediction models to generate the next basket. In contrast, existing NBR methods usually output the scores/probabilities of all items and then select the top-$k$ items to fill up a basket to be recommended, ignoring the differences between repeat and explore items.

For TREx’s repeat item prediction, we propose a simple yet effective probability-based method, which considers the item characteristics and users’ repurchase frequency. For exploration recommendations, we design two strategies that cater to the different beyond-accuracy metrics. The flexibility of TREx allows us to design suitable models for repetition and exploration, with the possibility of controlling the proportions of repetition and exploration to investigate the relations between accuracy and various beyond-accuracy metrics.

Findings and reflections. We consider two types of widely-used beyond-accuracy metrics, i.e., diversity and item fairness. Specifically, we investigate five fairness metrics (i.e., logEUR, logRUR, EEL, EED, and logDP) (Liu et al., 2024; Raj and Ekstrand, 2022) and three diversity metrics (i.e., ILD, Entropy, and DS) (Yin et al., 2023). To provide an overall understanding of these metrics, we group them according to different levels of connection with accuracy as follows: (i) Strong connection: logRUR; (ii) Weak connection: logEUR, EEL, EED; (iii) No connection: logDP, ILD, Entropy, DS. Briefly, the strong connection between logRUR and accuracy stems from the fact that logRUR uses ground truth relevance to discount the exposure, making sure that only correctly predicted items contribute to effective exposure. The connection between logEUR, EEL, and accuracy is weak because they just ensure that the exposure distribution across groups of recommended results is close to the group exposure distribution of the ground truth, without considering whether the exposure is contributed by correctly predicted items. Since the position weighting model of EED considers ground truth, EED shows a weak connection. There is no connection between accuracy and logDP, ILD, Entropy, and DS because their exposure distributions across groups are designed to reflect a specific distribution. The strength of the connection between a beyond-accuracy metric and accuracy determines whether there is a short-cut towards optimizing both accuracy and the beyond-accuracy metric.

We perform experiments on two brick-and-mortar retailers’ NBR datasets, considering six NBR baselines and eight metrics. The experimental results show that: (1) State-of-the-art accuracy can be achieved by only recommending repeat items via the proposed simple yet effective repetition model. (2) Leveraging the “short-cut” using TREx achieves “better” beyond-accuracy performance w.r.t. seven out of eight beyond-accuracy metrics. (3) For the item fairness metric that has a strong connection with accuracy (i.e., logRUR), it is more difficult to achieve better beyond-accuracy performance via the proposed strategy.

Stepping back. Instead of blindly claiming TREx with the designed modules as a state-of-the-art method for optimizing both accuracy and various beyond-accuracy metrics, we reflect on and challenge our evaluation paradigm and the definition of success in this setting. The core question is:

Are we really achieving better beyond-accuracy performance in next basket recommendation?

Two perspectives offer different ways forward for researchers and practitioners to address this question:

  (1) If we are willing to sacrifice the accuracy of the exploration, then superior beyond-accuracy performance can be achieved by leveraging the “short-cut” strategy via TREx, which is straightforward and efficient. This “short-cut” strategy must be considered before developing more sophisticated and elaborate approaches.

  (2) Conversely, if we believe it is unreasonable to sacrifice the accuracy of exploration (Williams et al., 2014), the existence of the “short-cut” strategy reveals flaws in our current evaluation paradigm to demonstrate an NBR method’s superiority. A fine-grained analysis (i.e., distinguishing between repetition and exploration) needs to be performed to check whether “better” beyond-accuracy is achieved by triggering the “short-cut” strategy, which would hurt the exploration accuracy after all.

Our contributions. The main contributions of the paper are:

  • We identify a “short-cut” strategy (i.e., sacrificing the accuracy of exploration and using explore items to optimize for beyond-accuracy metrics), which could achieve “better” beyond-accuracy metrics without degrading overall accuracy.

  • We propose a simple repetition recommendation model that considers item features and users’ repurchase frequency, and that achieves state-of-the-art NBR accuracy by only recommending repeat items.

  • We propose TREx, a flexible two-step repetition-exploration framework for NBR, which allows us to control the trade-off between accuracy and beyond-accuracy metrics w.r.t. the recommended baskets.

  • We conduct experiments on two datasets w.r.t. eight beyond-accuracy metrics, and find that leveraging “short-cuts” via TREx can achieve better performance on a wide range of metrics. We also find that the stronger the connection with accuracy, the more challenging it becomes to utilize a “short-cut” strategy to enhance a beyond-accuracy metric.

  • We reflect on, and challenge, existing evaluation paradigms, and find that a fine-grained level analysis can provide a complementary view of a method’s performance.

2. Related Work

We summarize related research on next basket recommendation and beyond-accuracy metrics.

Next basket recommendation. The NBR problem has been studied for many years. Factorizing personalized Markov chains (FPMC) (Rendle et al., 2010) leverages matrix factorization and Markov chains to model users’ general interest and basket transition relations. HRM (Wang et al., 2015) applies aggregation operations to learn a hierarchical representation of baskets. RNNs have been adapted to the NBR task to learn long-term trends by modeling the whole basket sequence. E.g., Dream (Yu et al., 2016) uses max/avg pooling to encode baskets. Sets2Sets (Hu and He, 2019) adapts an attention mechanism and adds frequency information to improve performance. Some methods (Le et al., 2019; Wang et al., 2020) consider the underlying item relations to get a better representation. Yu et al. (2020) argue that item-item relations between baskets are important, and leverage GNNs to use these relations. Some methods (Bai et al., 2018; Wang et al., 2019b; Sun et al., 2020; Leng et al., 2020) exploit auxiliary information, including product categories, amounts, prices, and explicit timestamps. TIFUKNN (Hu et al., 2020) and UP-CF@r (Faggioli et al., 2020), frequency-neighbor-based methods, model temporal patterns, and then combine these with neighbor information or user-wise collaborative filtering. Li et al. (2023d) provide several metrics to evaluate repetition and exploration performance in the NBR task and find that the repetition task is easier than the exploration task. Inspired by this analysis, repetition-only (Ariannezhad et al., 2022; Katz et al., 2022) and exploration-only (Li et al., 2023a) models were proposed for next basket recommendation. Existing NBR work mainly focuses on optimizing accuracy whereas this paper extends to various beyond-accuracy metrics for NBR.

Beyond-accuracy metrics. In addition to accuracy, there are various beyond-accuracy metrics (i.e., diversity, fairness, novelty, serendipity, coverage) we need to consider when making recommendations (Ekstrand et al., 2019). Diversity is a crucial factor in meeting the diverse demands of users (Zhang and Hurley, 2008; Quadrana et al., 2018; Chen et al., 2020; Wang et al., 2019a). Recently, empirical and revisitation studies (Ludewig and Jannach, 2018; Yin et al., 2023) have been conducted to explore the trade-off between accuracy and diversity. The concepts of fairness and item exposure have emerged as crucial considerations since items and producers play pivotal roles within a recommender system and its ecosystem. Related metrics measure whether items receive a fair share of exposure according to different definitions of fairness. Current research on fairness primarily focuses on individual or group fairness, either from the customer’s perspective, adopting a user-centered approach (Bobadilla et al., 2020), or from the provider’s viewpoint, adopting an item-centered approach (Zehlike and Castillo, 2020; Morik et al., 2020), or a two-sided approach (Wu et al., 2021, 2022; Naghiaei et al., 2022). Recently, Liu et al. (2024) evaluated item fairness on existing NBR methods to investigate the robustness of different fairness metrics. Unlike the work listed above, this paper is not limited to optimizing a specific type of metric. It examines the possibility of leveraging a “short-cut” strategy to seemingly optimize various beyond-accuracy metrics and provides insights w.r.t. evaluation paradigms when extending NBR optimization and evaluation to these beyond-accuracy metrics.

Table 2. Notation used in the paper; fairness related notation is adapted from (Raj and Ekstrand, 2022; Liu et al., 2024).
Symbol | Description
--- | ---
$u \in U$ | Users
$i \in I$ | Items
$S_u$ | Sequence of historical baskets for $u$
$B_u^t$ | $t$-th basket in $S_u$, a set of items $i \in I$
$I_{u,t}^{\mathit{rep}}$ | Set of repeat items for $u$ up to timestamp $t$
$I_{u,t}^{\mathit{expl}}$ | Set of explore items for $u$ up to timestamp $t$
$T_u$ | Ground-truth basket for $u$ that we aim to predict
$T_u^{\mathit{rep}}$ | Set of repeat items in the ground-truth basket $T_u$ for $u$
$T_u^{\mathit{expl}}$ | Set of explore items in the ground-truth basket $T_u$ for $u$
$P_u$ | Predicted basket for $u$
$P_u^{\mathit{rep}}$ | Set of repeat items in the predicted basket $P_u$ for $u$
$P_u^{\mathit{expl}}$ | Set of explore items in the predicted basket $P_u$ for $u$
$G(P)$ | Group alignment matrix for items in $P$
$G^{+}$ | Popular group
$G^{-}$ | Unpopular group
$\mathbf{a}_P$ | Exposure vector for items in $P$
$\epsilon_P$ | The exposure of groups in $P$, i.e., $G(P)^{T}\mathbf{a}_P$

3. Task Formulation and Definitions

We describe the next basket recommendation problem and formalize the notions of repetition and exploration. Our notation is summarized in Table 2.

Next basket recommendation. Given a set of users $U = \{u_1, u_2, \ldots, u_n\}$ and items $I = \{i_1, i_2, \ldots, i_m\}$, $S_u = \{B_u^1, B_u^2, \ldots, B_u^t\}$ represents the historical interaction sequence for $u$, where $B_u^t$ is the user’s basket at time step $t$. $B_u^t$ consists of a set of items $i \in I$, and the goal of the next basket recommendation task is to predict $P_u = B_u^{t+1}$, the following basket of items that the user would probably like, based on the user’s past interactions $S_u$, i.e.,

(1) $P_u = \hat{B}_u^{t+1} = f(S_u),$

where $f$ is our basket generation algorithm. We assume that the user’s attention and screen space are limited; hence, like previous studies (Li et al., 2023d; Liu et al., 2024), we recommend fixed-size baskets of size 10 or 20.

Repetition and exploration. We assume that the set of items is fixed. Although this might not be the case in real-world settings, modeling the addition and deletion of items in the set of items is out of the scope of this paper. With this assumption in mind, the addition of every new basket to a user’s history may translate into fewer items left to explore. To differentiate between the items coming from exploration and from repeat consumption behavior, for a user $u$ and timestamp $t$, a set of items $I_{u,t}^{\mathit{rep}} \subset I$ is considered to be the set of “repeat items.” The set of explore items $I_{u,t}^{\mathit{expl}}$ is simply its complement within the overall item set $I$. We define $I_{u,t}^{\mathit{rep}}$ as:

(2) $I_{u,t}^{\mathit{rep}} = I_{u,t-1}^{\mathit{rep}} \cup B_u^t.$

This also means that $I_{u,1}^{\mathit{rep}} \subset \cdots \subset I_{u,t-1}^{\mathit{rep}} \subset I_{u,t}^{\mathit{rep}}$. Conversely, we have $I_{u,t}^{\mathit{expl}} \subset I_{u,t-1}^{\mathit{expl}} \subset \cdots \subset I_{u,1}^{\mathit{expl}}$.

The task of predicting the next basket for a user $u$ is equivalent to predicting which items from $I_{u,t}^{\mathit{rep}}$ and $I_{u,t}^{\mathit{expl}}$ will appear in $B_u^{t+1}$. One way to solve this problem is to decouple it into two subtasks: the repetition subtask, which aims to predict which items from $I_{u,t}^{\mathit{rep}}$ to recommend, and the exploration subtask, which recommends items from $I_{u,t}^{\mathit{expl}}$. Table 1 shows the different characteristics of the repetition and exploration tasks.
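As an illustration of Eq. (2) and the decoupling into two subtasks, the following minimal sketch builds the repeat and explore item sets from a user's basket history and splits a basket accordingly. The function names and the use of string item identifiers are assumptions made for this example; they do not come from the paper.

```python
from typing import List, Set, Tuple


def repeat_and_explore_sets(
    history: List[Set[str]], all_items: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (repeat items, explore items) after the last basket in `history`."""
    repeat_items: Set[str] = set()
    for basket in history:
        # Eq. (2): the repeat set at step t is the repeat set at step t-1
        # united with the basket at step t.
        repeat_items |= basket
    # The explore set is the complement of the repeat set within the item set I.
    explore_items = all_items - repeat_items
    return repeat_items, explore_items


def split_basket(
    basket: Set[str], repeat_items: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Split a basket (ground truth or predicted) into repeat and explore parts."""
    return basket & repeat_items, basket - repeat_items
```

Applied to a ground-truth basket, `split_basket` yields the sets written as $T_u^{\mathit{rep}}$ and $T_u^{\mathit{expl}}$ in Table 2; applied to a predicted basket, it yields $P_u^{\mathit{rep}}$ and $P_u^{\mathit{expl}}$.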

Table 3. Summary of fairness and diversity metrics; fairness metrics are adapted from (Raj and Ekstrand, 2022). $\uparrow$ indicates that higher values are better; $\downarrow$ indicates that lower values are better; $\circ$ means that the closer the value is to 0, the better the performance.
Category | Metric | Goal | Better | Accuracy connection
--- | --- | --- | --- | ---
Equal opportunity | logRUR | Click-through rate proportional to relevance | $\circ$ | Strong
Equal opportunity | logEUR | Exposure proportional to relevance | $\circ$ | Weak
Equal opportunity | EEL | Exposure matches ideal (from relevance) | $\downarrow$ | Weak
Statistical parity | EED | Exposure well-distributed | $\downarrow$ | Weak
Statistical parity | logDP | Exposure equal across groups | $\circ$ | None
Diversity | ILD | Average distance between categories for each pair of items in the list | $\uparrow$ | None
Diversity | Entropy | Entropy of item category distribution in the list | $\uparrow$ | None
Diversity | DS | Number of categories divided by the number of items in the list | $\uparrow$ | None

4. Evaluation Metrics

Next, we describe the accuracy and beyond-accuracy metrics (i.e., fairness and diversity) considered in the paper. [Footnote 1: Due to space limitations, we only provide brief introductions of each metric; more detailed information (e.g., function, responsibility, etc.) can be found in the original papers and relevant survey papers (Zhao et al., 2023; Raj and Ekstrand, 2022; Liu et al., 2024).]

Accuracy. In terms of accuracy, we use three metrics that are widely used for the NBR task: Recall@$k$, NDCG@$k$, and PHR@$k$. Recall measures the ability to find all items that the user will purchase in the next basket; NDCG is a ranking metric that also considers the order of the items; PHR is a user-level measurement that represents the ratio of users whose recommended basket contains at least one item from the ground truth.
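The three accuracy metrics can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: it assumes binary relevance, a ranked prediction list without duplicates, a base-2 logarithmic discount for NDCG, and an ideal ranking that places all ground-truth items (up to $k$) first. All function names are hypothetical.

```python
import math
from typing import List, Set


def recall_at_k(predicted: List[str], truth: Set[str], k: int) -> float:
    """Fraction of ground-truth items that appear in the top-k prediction."""
    if not truth:
        return 0.0
    hits = sum(1 for item in predicted[:k] if item in truth)
    return hits / len(truth)


def ndcg_at_k(predicted: List[str], truth: Set[str], k: int) -> float:
    """Binary-relevance NDCG with a log2 position discount (assumed form)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(predicted[:k])
        if item in truth
    )
    # Ideal DCG: all ground-truth items (at most k) ranked at the top.
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(truth))))
    return dcg / idcg if idcg > 0 else 0.0


def phr_at_k(
    all_predicted: List[List[str]], all_truth: List[Set[str]], k: int
) -> float:
    """Share of users whose top-k basket contains at least one true item."""
    if not all_predicted:
        return 0.0
    hits = sum(
        1
        for predicted, truth in zip(all_predicted, all_truth)
        if any(item in truth for item in predicted[:k])
    )
    return hits / len(all_predicted)
```

Recall and NDCG are computed per user and would typically be averaged over users, whereas PHR is defined directly at the user-population level.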

Fairness. Assume $\pi(P \mid u)$ is a user-dependent distribution and $\rho(u)$ is a distribution over users; overall, the recommended item rankings among all users follow the distribution $\rho(u)\pi(P \mid u)$. $\epsilon_P = G(P)^{\mathrm{T}}\mathbf{a}_P$ is the group exposure within a recommended basket. [Footnote 2: The formula to compute the exposure vector $\mathbf{a}_P$ using different position weighting models can be found in (Raj and Ekstrand, 2022; Liu et al., 2024).] Its expected value $\epsilon_\pi = E_{\pi\rho}[\epsilon_P]$ is the group exposure among all the recommended baskets. Following (Raj and Ekstrand, 2022; Liu et al., 2024), we select a set of well-known fairness metrics and cover two types of fairness considerations as follows. [Footnote 3: The item fairness metric Inequity of Amortized Attention (Biega et al., 2018) is not used in this paper since some baselines do not have predicted relevance for items.]

(1) Equal opportunity

Promote equal treatment based on merit or utility, regardless of group membership (Raj and Ekstrand, 2022; Liu et al., 2024). (i) Exposed Utility Ratio (EUR) (Singh and Joachims, 2018) quantifies the deviation from the objective that the exposure of each group is proportional to its utility $Y(G)$. (ii) Realized Utility Ratio (RUR) (Singh and Joachims, 2018) models actual user engagement: the objective is that the click-through rates for the groups, $\Gamma(G)$, are proportional to their utility. (iii) Expected Exposure Loss (EEL) (Diaz et al., 2020) is the distance between the expected exposure and the target exposure $\epsilon^{\ast}$, which is the exposure under the ideal policy.

(2) Statistical parity

Ensure comparable exposure among groups. (i) Expected Exposure Disparity (EED) (Diaz et al., 2020) measures the inequality in the exposure distribution across groups. (ii) Demographic Parity (DP) (Singh and Joachims, 2018) measures the ratio of the average exposure given to the two groups. Following (Raj and Ekstrand, 2022), we reformulate DP as logDP to handle empty-group scenarios and improve interpretability. The log variants of the Exposed Utility Ratio (logEUR) and the Realized Utility Ratio (logRUR) are defined in a similar manner.
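As an illustration, the group exposure $\epsilon_{P} = G(P)^{\mathrm{T}}\mathbf{a}_{P}$ and its expectation $\epsilon_{\pi}$ can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the logarithmic position discount is only one possible weighting model (the paper defers the choice to the cited work), $\rho(u)$ is taken to be uniform, and all function names are hypothetical.

```python
import math


def exposure_vector(k):
    """Position weights a_P for a basket of size k.

    Assumption: a logarithmic discount 1 / log2(rank + 1); other position
    weighting models are possible and are not specified in this section.
    """
    return [1.0 / math.log2(pos + 2) for pos in range(k)]


def group_exposure(basket, item_group, n_groups=2):
    """epsilon_P = G(P)^T a_P: total exposure each group receives in one basket."""
    eps = [0.0] * n_groups
    for item, weight in zip(basket, exposure_vector(len(basket))):
        eps[item_group[item]] += weight
    return eps


def expected_group_exposure(baskets, item_group, n_groups=2):
    """epsilon_pi, assuming a uniform distribution rho(u) over users."""
    total = [0.0] * n_groups
    for basket in baskets:
        for g, e in enumerate(group_exposure(basket, item_group, n_groups)):
            total[g] += e
    return [t / len(baskets) for t in total]
```

The fairness metrics listed above are all functions of these exposure vectors, combined with group utility or click-through information where the metric requires it.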

Diversity. Following (Yin et al., 2023), we consider the following widely-used diversity metrics, which reflect users' diversified demands. (i) Intra-List Distance (ILD) (Chen et al., 2020; Cen et al., 2020) measures the average distance between every pair of items in the recommendation list $P_{u}$, where the distance $d_{ij}$ between items $i$ and $j$ is the Euclidean distance between the embeddings of their respective categories. (ii) Entropy (Zheng et al., 2021; Wang et al., 2019a) quantifies the dispersion of the item category distribution in the recommendation list $P_{u}$; a higher degree of dispersion in the category distribution corresponds to increased diversity. (iii) Diversity Score (DS) (Liang et al., 2021) is calculated as the number of interacted/recommended categories divided by the number of interacted/recommended items. As shown in Table 3, we can group beyond-accuracy metrics according to their connection with accuracy.
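The three diversity metrics can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the paper: the function names, the dictionary-based inputs (`item_category`, `category_embedding`), and the use of the natural logarithm for Entropy are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def intra_list_distance(basket, item_category, category_embedding):
    """ILD: mean Euclidean distance between category embeddings of item pairs."""
    pairs = list(combinations(basket, 2))
    if not pairs:
        return 0.0
    total = sum(
        math.dist(category_embedding[item_category[i]],
                  category_embedding[item_category[j]])
        for i, j in pairs
    )
    return total / len(pairs)


def category_entropy(basket, item_category):
    """Entropy of the category distribution (natural log is an assumption)."""
    if not basket:
        return 0.0
    counts = Counter(item_category[i] for i in basket)
    n = len(basket)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def diversity_score(basket, item_category):
    """DS: number of distinct categories divided by number of items."""
    if not basket:
        return 0.0
    return len({item_category[i] for i in basket}) / len(basket)
```

None of these functions looks at the ground-truth basket, which is why these metrics can be improved without any accurate prediction.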

5. A Two-Step Repetition-Exploration Framework

Given the differences depicted in Table 1, we propose a two-step repetition-exploration (TREx) framework for NBR. TREx assembles recommendations from a repetition module and an exploration module, and allows one to easily swap out the sub-algorithms used for repetition and exploration. In the first step, we model the repetition and exploration behavior separately to get candidates from both sources. In the second step, we generate the recommended basket from those candidates. The main architectural difference is that previous approaches to the NBR problem typically apply a single treatment to all items, whereas TREx treats repeat and explore items differently. The pseudo-code for TREx is given in Algorithm 1. Next, we describe the three modules that make up TREx. (Footnote 4: Theoretically, TREx allows us to choose or design suitable repetition and exploration modules that both target accuracy, in order to achieve state-of-the-art performance. However, we aim to investigate the “short-cut” and the relationship between accuracy and various beyond-accuracy metrics.)

Data: Basket sequence $S$, basket size $k$, repetition confidence threshold $v$
Result: Recommended basket $B_{u}^{t+1}$ for each user $u$
1
2  Calculate the repetition feature $\mathit{RepI}(i)$ for each item;
3  for each user $u$ do
4      Get repeat items $I_{u,t}^{\mathit{rep}}$ and explore items $I_{u,t}^{\mathit{expl}}$;
5      Calculate the repetition score $\mathit{RepS}^{u}(i)$ for each $i \in I_{u,t}^{\mathit{rep}}$;
6      Remove items $i$ from $I_{u,t}^{\mathit{rep}}$ when $\mathit{RepS}^{u}(i) < v$;
7
8      Rank $I_{u,t}^{\mathit{rep}}$ according to $\mathit{RepS}^{u}(i)$ in descending order;
9      Initialize next basket $B_{u}^{t+1}$;
10     if $|I_{u,t}^{\mathit{rep}}| < k$ then
11         Fill $B_{u}^{t+1}$ using $I_{u,t}^{\mathit{rep}}$;
12         $m \leftarrow k - |I_{u,t}^{\mathit{rep}}|$;
13         Fill $m$ empty slots of $B_{u}^{t+1}$ using explore items via the exploration module;
14
15     else
16         Fill $B_{u}^{t+1}$ using the top-$k$ of $I_{u,t}^{\mathit{rep}}$;
17
18     end if
19
20 end for
Algorithm 1 TREx Framework

5.1. Repetition module

Since repetition is a much simpler task than exploration, we design a repetition module targeted at improving accuracy. Intuitively, if a user consumed an item several times in the past, they are likely to repurchase that item in the next basket. Thus, frequency information is a strong signal for repetition prediction (Wan et al., 2018). The personalized item frequency (PIF) introduced in TIFUKNN (Hu et al., 2020) and the recency window in UP-CF@r (Faggioli et al., 2020) both capture temporal dependencies by focusing more on recent behavior. However, they do not capture item characteristics w.r.t. repurchasing. For example, after a purchase of a bottle of milk and a pan, a repurchase of the milk is more likely than a repurchase of the pan, even if both currently have the same purchase frequency. To consider both item features and user interest simultaneously, we use the repetition score $\mathit{RepS}^{u}(i)$ to represent the repurchase score of item $i$ for user $u$. This score is decomposed into two parts: the item-specific repurchase feature $\mathit{RepI}(i)$ and the user's interest $E_{i}^{u}$ in item $i$. Formally:

(3) $\mathit{RepS}^{u}(i) = E_{i}^{u} \cdot \mathit{RepI}(i).$

This corresponds to line 5 in Algorithm 1. Given the items in the dataset $I = \{i_{1}, i_{2}, \ldots, i_{m}\}$, we need to derive the repurchase feature $\mathit{RepI}(i)$ for each item in the training set. First, the repurchase frequency $\mathit{Rep}^{F}(i)$ is calculated by aggregating statistics across users. To mitigate the impact of abnormally high values for some users, we introduce a hyperparameter $\alpha$ to discount the repurchase frequency of item $i$:

(4) $\mathit{Rep}^{F}(i) = \frac{\sum_{U} \left(\text{item $i$ repurchase frequency}\right)^{\alpha}}{\#\text{users who bought item $i$ at least once}}.$

In addition, some items might have only a few samples, which leads to low confidence in the estimate of their repetition feature. We leverage the average estimate $\overline{\mathit{RepF}}$ across all items as supplementary information for items with few samples. The final repetition feature is then given by:

(5) $\mathit{RepI}(i) = \mathit{Rep}^{F}(i) + \frac{\overline{\mathit{RepF}}}{N_{i}},$

where $N_{i}$ is the number of users who bought item $i$. Thus, the average $\overline{\mathit{RepF}}$ has only a small effect on $\mathit{RepI}(i)$ when more samples are available to compute the item-specific feature. This corresponds to line 2 in Algorithm 1.
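The computation of the item-specific repetition feature in Eqs. (4) and (5) can be sketched as follows. This is a minimal sketch rather than the released implementation: the function name is hypothetical, and we assume that a user's repurchase frequency for an item is the number of baskets containing the item minus one, which the text does not state explicitly.

```python
from collections import Counter, defaultdict


def repetition_features(histories, alpha):
    """Return {item: RepI(i)} from {user: [basket, ...]} training histories."""
    discounted_sum = defaultdict(float)  # numerator of Eq. (4)
    n_buyers = Counter()                 # N_i: users who bought item i

    for baskets in histories.values():
        # Number of baskets of this user that contain each item.
        counts = Counter(item for basket in baskets for item in set(basket))
        for item, c in counts.items():
            n_buyers[item] += 1
            # Assumption: repurchases = purchases - 1; users who never
            # repurchased contribute zero regardless of alpha.
            if c > 1:
                discounted_sum[item] += (c - 1) ** alpha

    if not n_buyers:
        return {}

    rep_f = {i: discounted_sum[i] / n_buyers[i] for i in n_buyers}  # Eq. (4)
    mean_rep_f = sum(rep_f.values()) / len(rep_f)                   # average over items
    # Eq. (5): the average matters less as N_i grows.
    return {i: rep_f[i] + mean_rep_f / n_buyers[i] for i in rep_f}
```

With `alpha < 1`, a single user who repurchased an item very often contributes sub-linearly, which is the discounting effect described above.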

The item frequency in a user's historical baskets can partially reflect the user's interest. Yet, user interests can also be dynamic. To model temporal dependencies, we introduce a time-decay factor $\beta$, which gives recent interactions more impact on the interest $E_{i}^{u}$. Assume that a specific item $i$ was purchased by user $u$ several times in their historical baskets $\{B_{u}^{l_{1}}, B_{u}^{l_{2}}, \ldots, B_{u}^{l_{m}}\}$; the corresponding position set is denoted as $L_{i} = \{l_{1}, l_{2}, \ldots, l_{m}\}$; then $E_{i}^{u}$ is defined as:

(6) $E_{i}^{u} = \sum_{j=1}^{m} \beta^{T - l_{j}},$

where $T$ is the length of the user's basket sequence. TREx's repeat recommendation model takes item features, user interests, and the temporal order of baskets into consideration. We treat the items in baskets independently and calculate the repetition score $\mathit{RepS}$ for all items that appeared in the previous baskets of each user; these scores are used in the final basket generation process.
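To make Eqs. (3) and (6) concrete, the following sketch computes the time-decayed interest $E_{i}^{u}$ and the resulting repetition score for one user. The function names are hypothetical, and we assume that basket positions $l_{j}$ are 1-indexed, so that the most recent basket receives weight $\beta^{0} = 1$.

```python
def user_interest(baskets, beta):
    """E_i^u (Eq. 6) for every item in one user's chronologically ordered baskets."""
    T = len(baskets)
    interest = {}
    # Assumption: positions are 1-indexed, so the last basket has weight beta**0.
    for position, basket in enumerate(baskets, start=1):
        for item in set(basket):
            interest[item] = interest.get(item, 0.0) + beta ** (T - position)
    return interest


def repetition_scores(baskets, rep_i, beta):
    """RepS^u(i) = E_i^u * RepI(i) (Eq. 3) for all items the user has bought."""
    return {
        item: e * rep_i.get(item, 0.0)
        for item, e in user_interest(baskets, beta).items()
    }
```

For example, with `beta = 0.5` an item bought in each of three baskets gets interest 0.25 + 0.5 + 1 = 1.75, whereas an item bought only in the oldest basket gets 0.25.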

5.2. Exploration module

Although it is more challenging than repetition, exploration is also an important aspect of NBR. To complement the repetition module, we design different exploration modules, targeting item fairness and diversity, respectively. For each user $u$, the exploration candidates $I_{u,t}^{\mathit{expl}}$ are the set of items that the user has never bought before.

Item fairness. According to (Li et al., 2023d), NBR methods usually exhibit varying degrees of popularity bias, which means that they recommend more popular items than appear in the ground truth, and thereby harm item fairness. Thus, the fairness-oriented exploration module recommends unpopular items $i \in G^{-}$, in order to approach the distribution of the ground truth and decrease the exposure gap between the popular and the unpopular groups. Specifically, we randomly sample explore items based on a sampling probability, which is calculated from the purchase frequency of unpopular items.
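A possible reading of this sampling step is sketched below. The text only states that the sampling probability is calculated from the purchase frequency of unpopular items; the sketch assumes that the probability is proportional to purchase frequency and that items are drawn without replacement. The function name and arguments are hypothetical.

```python
import random


def sample_unpopular(candidates, unpopular, purchase_freq, m, seed=None):
    """Draw up to m distinct explore items from the unpopular group G^-.

    Assumptions: sampling probability is proportional to purchase frequency
    (frequencies must be positive), and items are drawn without replacement.
    """
    rng = random.Random(seed)
    pool = [i for i in candidates if i in unpopular]
    chosen = []
    while pool and len(chosen) < m:
        weights = [purchase_freq[i] for i in pool]
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
    return chosen
```

Because the sampled items come from $G^{-}$ only, every slot filled this way shifts exposure toward the unpopular group, independently of whether the item is in the ground-truth basket.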

Diversity. The diversity-oriented exploration module optimizes for more dispersed categories in the predicted basket. For each user, we record the categories of the repetition candidates, rank the exploration candidates according to their popularity, and select explore items in turn to fill $B_{u}^{t+1}$. An explore item is selected only if its category differs from the categories already present in $B_{u}^{t+1}$.
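The category-based filling strategy can be sketched as follows. This is an illustrative sketch with hypothetical names; it assumes that exploration candidates are ranked by descending popularity and that the basket is left partially filled if no unseen category remains, neither of which is specified above.

```python
def fill_with_new_categories(basket, explore_candidates, item_category, popularity, k):
    """Append explore items whose category is not yet in the basket, up to size k."""
    basket = list(basket)
    seen = {item_category[i] for i in basket}
    # Assumption: more popular candidates are considered first.
    ranked = sorted(explore_candidates, key=lambda i: popularity.get(i, 0), reverse=True)
    for item in ranked:
        if len(basket) >= k:
            break
        if item_category[item] not in seen:
            basket.append(item)
            seen.add(item_category[item])
    return basket
```

Each added item introduces a new category by construction, so category-based metrics such as DS and Entropy can only benefit from the filled slots.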

5.3. Basket generation module

To construct the final basket to be recommended by TREx, we adopt a repetition-greedy approach in service of the accuracy objective: we first consider the item candidates generated by the repetition module and then fill the remaining slots via the exploration module. $\text{TREx}_{\mathit{Fairness}}$ and $\text{TREx}_{\mathit{Diversity}}$ denote TREx with the exploration module targeted at fairness and diversity, respectively. For a user $u$, we get the repetition score $\mathit{RepS}^{u}(i)$ for each $i \in I_{u,t}^{\mathit{rep}}$ (Algorithm 1, lines 4-5). First, we define a confidence threshold $v$ for the repetition score, and items are removed from $I_{u,t}^{\mathit{rep}}$ when the corresponding $\mathit{RepS}^{u}(i) < v$ (line 6). (Footnote 5: The confidence threshold $v$ controls the proportion of repeat items and explore items in the recommendation, and thereby the trade-off between accuracy and beyond-accuracy performance in this paper. We sweep the repetition confidence threshold $v$ to obtain TREx variants with different accuracy and beyond-accuracy performance.) Then, $I_{u,t}^{\mathit{rep}}$ can be seen as the repetition candidate set. If the number of repetition candidates reaches or exceeds the basket size, the items with the highest scores have priority to fill the basket (Algorithm 1, line 16). If the number of repetition candidates is smaller than the basket size, the basket is first filled with all items in the repetition candidate set $I_{u,t}^{\mathit{rep}}$. Then, we fill up the basket using explore items via the exploration module, where $m$ represents the number of empty slots (lines 11-13).
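The basket generation step for a single user can be sketched as follows. The sketch mirrors the thresholding, ranking, and filling logic of Algorithm 1; the function name and the `explore` callback interface are hypothetical and stand in for whichever exploration module is plugged in.

```python
def generate_basket(rep_scores, k, v, explore):
    """Build one user's basket from repetition scores and an exploration module.

    rep_scores: {item: RepS^u(i)} for the user's repeat items.
    explore:    callable(basket, m) returning candidate explore items
                (hypothetical interface for the exploration module).
    """
    # Drop repeat items whose score is below the confidence threshold v.
    candidates = [i for i, s in rep_scores.items() if s >= v]
    # Rank the remaining repeat items by score, highest first.
    candidates.sort(key=lambda i: rep_scores[i], reverse=True)

    if len(candidates) >= k:
        return candidates[:k]          # enough repeat items: take the top-k

    basket = candidates                # otherwise keep all repeat candidates
    m = k - len(basket)                # number of empty slots
    return basket + list(explore(basket, m))[:m]
```

Raising `v` removes more repeat items and hands more slots to the exploration module, which is how the threshold steers the balance between accuracy and beyond-accuracy performance.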

6. Experiments

Table 4. Statistics of the processed datasets.
Dataset | #items | #users | Avg. basket size | Avg. #baskets per user | Repeat ratio | Explore ratio
Instacart | 29,399 | 19,210 | 10.06 | 15.91 | 0.60 | 0.40
Dunnhumby | 37,162 | 2,482 | 10.07 | 43.17 | 0.43 | 0.57

6.1. Experimental setup

Datasets. We conduct experiments on two widely-used datasets: (i) Instacart (Footnote 6: https://www.kaggle.com/c/instacart-market-basket-analysis/data), which includes a large number of grocery orders from users; following (Liu et al., 2024; Naumov et al., 2023), $\sim$20,000 users are randomly selected to conduct experiments; and (ii) Dunnhumby (Footnote 7: https://www.dunnhumby.com/source-files/), which contains two years of household-level transactions of 2,500 frequent shoppers at a retailer. Following (Liu et al., 2024; Ariannezhad et al., 2022), we sample users who have at least three baskets and remove items that appear fewer than five times. The two datasets vary in the repeat ratio, i.e., the proportion of repeat items in the ground-truth baskets (Li et al., 2023d). We focus on the fixed-size (10 or 20) NBR problem. The statistics of the processed datasets are shown in Table 4. In our experiments, each dataset is partitioned according to (Naumov et al., 2023; Ariannezhad et al., 2022; Faggioli et al., 2020; Liu et al., 2024). The training baskets encompass all user baskets except the last one. For users with more than 50 baskets in the training data, only their last 50 baskets are included in the training set. The final baskets of all users are then split evenly into a validation set and a test set (50% each). Figure 1 shows the distribution of users across repeat ratios.
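The partitioning and the per-user repeat ratio can be sketched as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the even validation/test split is implemented as a seeded random assignment of users, which the text does not specify.

```python
import random


def split_dataset(histories, max_train_baskets=50, seed=0):
    """Hold out each user's last basket; keep at most the last 50 for training."""
    train, held_out = {}, {}
    for user, baskets in histories.items():
        if len(baskets) < 3:           # only users with at least three baskets
            continue
        train[user] = baskets[:-1][-max_train_baskets:]
        held_out[user] = baskets[-1]

    users = sorted(held_out)
    random.Random(seed).shuffle(users)  # assumption: random 50/50 user split
    half = len(users) // 2
    validation = {u: held_out[u] for u in users[:half]}
    test = {u: held_out[u] for u in users[half:]}
    return train, validation, test


def repeat_ratio(history, target):
    """Proportion of items in the ground-truth basket that the user bought before."""
    target = set(target)
    if not target:
        return 0.0
    seen = {item for basket in history for item in basket}
    return len(target & seen) / len(target)
```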

Figure 1. Distribution of users across different repeat ratios for Instacart and Dunnhumby.

NBR baselines. We compare TREx with 8 representative baselines, which we select based on their characteristics in the analysis performed in (Li et al., 2023d; Liu et al., 2024), divided into three groups:

6.1.1. Simple baselines

(i) G-TopFreq uses the $k$ most popular items in the dataset to form the recommended next basket. (ii) P-TopFreq is a personalized TopFreq method, which treats the $k$ most frequent items in the historical records of the user as the next basket. (iii) GP-TopFreq (Li et al., 2023d) is a simple combination of P-TopFreq and G-TopFreq, which first uses P-TopFreq to fill the basket and then uses G-TopFreq to fill the remaining slots.
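The three simple baselines can be sketched in a few lines. The function names are hypothetical; the sketch assumes that frequency counts item occurrences across baskets and that GP-TopFreq skips globally popular items that are already in the personalized part of the basket.

```python
from collections import Counter


def g_topfreq(histories, k):
    """G-TopFreq: the k most popular items across all users."""
    counts = Counter(i for baskets in histories.values() for b in baskets for i in b)
    return [item for item, _ in counts.most_common(k)]


def p_topfreq(baskets, k):
    """P-TopFreq: the k most frequent items in one user's history."""
    counts = Counter(i for b in baskets for i in b)
    return [item for item, _ in counts.most_common(k)]


def gp_topfreq(histories, user, k):
    """GP-TopFreq: fill with P-TopFreq first, then with G-TopFreq."""
    basket = p_topfreq(histories[user], k)
    n_items = len({i for baskets in histories.values() for b in baskets for i in b})
    for item in g_topfreq(histories, n_items):
        if len(basket) >= k:
            break
        if item not in basket:         # assumption: no duplicates in the basket
            basket.append(item)
    return basket
```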

6.1.2. Nearest neighbor-based methods

(i) TIFUKNN (Hu et al., 2020) is a state-of-the-art method that models the temporal dynamics of the frequency information in users' past baskets to obtain the personalized item frequency (PIF), and then applies a KNN-based method to the PIF. (ii) UP-CF@r (Faggioli et al., 2020) is a combination of recency-aware user-wise popularity and user-wise collaborative filtering.

6.1.3. Neural network-based methods

(i) Dream (Yu et al., 2016) models users' global sequential basket behavior for NBR using a recurrent neural network (RNN). (ii) DNNTSP (Yu et al., 2020) is a state-of-the-art method that leverages a GNN and self-attention techniques. It encodes item-item relations via a graph and employs a self-attention mechanism to capture temporal dependencies in users' basket sequences. (iii) ReCANet (Ariannezhad et al., 2022) is a repeat-only model for NBR, which combines user-item representations with historical consumption patterns via an RNN.

Configurations. To assess group fairness (Section 4), we follow configurations from previous research (Li et al., 2022; Liu et al., 2024); the group of an item is determined by its popularity (i.e., the number of purchases recorded in the historical baskets of the dataset). The top 20% of items with the highest purchase frequency are assigned to the popular group ($G^{+}$), while the remaining 80% of items are assigned to the unpopular group ($G^{-}$). For the baseline methods, a grid search is performed to find the optimal hyper-parameters on the validation set. For TIFUKNN, the number of neighbors $k$ is tuned on $\{100, 300, 500, 900, 1100, 1300\}$, the number of groups $m$ is tuned on $\{3, 7, 11, 15, 19, 23\}$, the within-basket time-decayed ratio $r_{b}$ and the group time-decayed ratio $r_{g}$ are selected from $\{0.1, 0.2, \ldots, 0.9, 1\}$, and the fusion weight $\alpha$ is selected from $\{0, 0.1, \ldots, 0.9, 1\}$. For UP-CF@r, the recency window $r$ is tuned on $\{1, 5, 10, 25, 100, \infty\}$, the locality $q$ is tuned on $\{1, 5, 10, 50, 100, 1000\}$, and the asymmetry $\alpha$ is tuned on $\{0, 0.25, 0.5, 0.75, 1\}$. For Dream, DNNTSP, and ReCANet, the item and user embedding size is tuned on $\{16, 32, 64, 128\}$.
For TREx, in the repetition module, $\alpha$ is selected from {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, and the time-decay factor $\beta$ is selected from {0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0}. To facilitate reproducibility, we release the source code and all hyper-parameters in an online repository: https://github.com/lynEcho/TREX.
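The popularity-based grouping used for the fairness evaluation can be sketched as follows. The function name is hypothetical, and rounding the 20% cutoff down to an integer is an assumption, since the text does not say how fractional cutoffs are handled.

```python
from collections import Counter


def popularity_groups(histories, top_fraction=0.2):
    """Split items into a popular group G+ and an unpopular group G-."""
    counts = Counter(i for baskets in histories.values() for b in baskets for i in b)
    ranked = [item for item, _ in counts.most_common()]  # most purchased first
    cutoff = int(len(ranked) * top_fraction)             # assumption: round down
    return set(ranked[:cutoff]), set(ranked[cutoff:])
```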

Table 5. Comparison of TREx-Rep (repetition module only) against baselines and two types of state-of-the-art methods; boldface indicates the maximum; underlining indicates the second-best performing method. † indicates that TREx-Rep achieves the same level of performance as the SOTA baselines (paired t-test).
Dataset | Metric | G-TopFreq | P-TopFreq | GP-TopFreq | UP-CF@r | TIFUKNN | Dream | DNNTSP | ReCANet | TREx-Rep
Instacart | Recall@10 | 0.0704 | 0.3143 | 0.3150 | 0.3377 | 0.3456 | 0.0704 | 0.3295 | 0.3490 | 0.3476†
Instacart | NDCG@10 | 0.0817 | 0.3339 | 0.3343 | 0.3582 | 0.3657 | 0.0817 | 0.3434 | 0.3699 | 0.3661†
Instacart | PHR@10 | 0.4600 | 0.8447 | 0.8460 | 0.8586 | 0.8639 | 0.4600 | 0.8581 | 0.8668 | 0.8655†
Instacart | Recall@20 | 0.0973 | 0.4138 | 0.4168 | 0.4405 | 0.4559 | 0.0979 | 0.4339 | 0.4562 | 0.4557†
Instacart | NDCG@20 | 0.0962 | 0.3889 | 0.3902 | 0.4161 | 0.4271 | 0.0968 | 0.4018 | 0.4303 | 0.4269†
Instacart | PHR@20 | 0.5302 | 0.8921 | 0.8959 | 0.9045 | 0.9098 | 0.5346 | 0.9033 | 0.9097 | 0.9092†
Dunnhumby | Recall@10 | 0.0897 | 0.1628 | 0.1628 | 0.1699 | 0.1763 | 0.0896 | 0.0871 | 0.1730 | 0.1815†
Dunnhumby | NDCG@10 | 0.0798 | 0.1562 | 0.1562 | 0.1639 | 0.1683 | 0.0759 | 0.0792 | 0.1625 | 0.1689†
Dunnhumby | PHR@10 | 0.3795 | 0.5399 | 0.5399 | 0.5536 | 0.5729 | 0.3873 | 0.4303 | 0.5655 | 0.5761†
Dunnhumby | Recall@20 | 0.1046 | 0.2075 | 0.2075 | 0.2168 | 0.2227 | 0.1081 | 0.1442 | 0.2252 | 0.2257†
Dunnhumby | NDCG@20 | 0.0877 | 0.1787 | 0.1787 | 0.1885 | 0.1917 | 0.0853 | 0.1021 | 0.1879 | 0.1921†
Dunnhumby | PHR@20 | 0.4392 | 0.6116 | 0.6116 | 0.6326 | 0.6342 | 0.4558 | 0.5378 | 0.6377 | 0.6390†
Figure 2. Performance of TREx-Rep when we add a time-decay factor $\beta$ (+T), and when we add both $\beta$ and the item-specific repetition feature $\mathit{RepI}(i)$ (+T+RF).

6.2. Overall accuracy performance

By decoupling the repetition and exploration tasks, TREx-Rep optimizes for the prediction of repeat items, which accounts for most of the accuracy in NBR. Table 5 shows the experimental results for TREx-Rep and the baselines. We observe that TREx-Rep surpasses two complex deep learning-based methods (i.e., Dream and DNNTSP) on both the Dunnhumby and Instacart datasets, by a particularly large margin on Dunnhumby, and that TREx-Rep achieves or matches the SOTA accuracy on both datasets across all accuracy metrics. Note that TREx-Rep achieves this competitive accuracy while using only part of the available slots in the basket. (Footnote 8: As TREx-Rep only recommends repeat items, the basket cannot be completely filled when the number of a user's repeat items (historical items) is smaller than the basket size. ReCANet also only recommends repeat items; however, it is a complex neural model, which is much slower than the proposed TREx-Rep module.) Compared to deep learning methods with complex architectures that try to learn basket representations and model temporal relations, TREx-Rep is very efficient due to its simplicity.
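For reference, the accuracy metrics reported in Table 5 can be sketched as follows. Their definitions are not given in this section, so the sketch assumes the formulations commonly used for NBR: binary relevance, a $\log_{2}$ discount for NDCG, and an ideal ranking that places all ground-truth items first. The function names are hypothetical.

```python
import math


def recall_at_k(pred, truth, k):
    """Fraction of ground-truth items that appear in the top-k prediction."""
    truth = set(truth)
    if not truth:
        return 0.0
    return len(set(pred[:k]) & truth) / len(truth)


def phr_at_k(pred, truth, k):
    """Personalized hit ratio: 1 if at least one top-k item is in the ground truth."""
    return 1.0 if set(pred[:k]) & set(truth) else 0.0


def ndcg_at_k(pred, truth, k):
    """NDCG with binary relevance and a log2 position discount (assumed form)."""
    truth = set(truth)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(pred[:k]) if item in truth)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(truth))))
    return dcg / idcg if idcg else 0.0
```

The per-user values are averaged over all test users to obtain dataset-level scores.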

To investigate the effect of the repetition features on repetition performance in NBR, we conduct experiments on TREx-Rep by gradually adding the time-decay factor $\beta$ and the item-specific repetition feature $\mathit{RepI}(i)$. The results are shown in Figure 2. The accuracy increases as we gradually integrate these factors into TREx-Rep, which indicates that both the time-decay factor $\beta$ and the item-specific repetition feature $\mathit{RepI}(i)$ contribute to the accuracy of TREx-Rep. On the Dunnhumby dataset, significant improvements over using only the time-decay factor $\beta$ can be observed when the item-specific repetition feature $\mathit{RepI}(i)$ is also used to compute the repetition score $\mathit{RepS}^{u}(i)$. Note that the improvement from adding $\mathit{RepI}(i)$ to TREx-Rep on the Instacart dataset is relatively small. We conjecture that items in the Instacart dataset are more regular products, which differ little from each other in their repetition feature. Figure 3 shows the performance when using different amounts of training samples: the improvement in recall resulting from adding $\mathit{RepI}(i)$ increases when we use more training data, since we have more samples for estimating the repetition feature $\mathit{RepI}(i)$.

Figure 3. The recall improvement of (+T+RF) over (+T) when the training sample ratio changes from 0.2 to 1.

6.3. Beyond-accuracy performance

We conduct experiments to verify whether TREx with the designed models (i.e., $\text{TREx}_{\mathit{Diversity}}$ and $\text{TREx}_{\mathit{Fairness}}$) can achieve better performance on representative diversity and item fairness metrics. Note that the recommended basket remains fixed for a specific user in existing baselines, resulting in fixed performance regarding both accuracy and beyond-accuracy metrics on each dataset. In contrast, TREx provides the flexibility to adjust the trade-off between accuracy and beyond-accuracy metrics by adjusting the repetition confidence threshold $v$. This allows for more nuanced control over the recommendation process compared to traditional baselines.

Figure 4. Performance of $\text{TREx}_{\mathit{Diversity}}$ at different $v$ values, compared with different NBR methods in terms of different diversity metrics. The red $+$ marker indicates the direction with both high accuracy and diversity.

Diversity. The experimental results w.r.t. accuracy and the different diversity metrics (i.e., ILD, Entropy, and DS) are shown in Figure 4. (Footnote 9: G-TopFreq and Dream exhibit low recall, fairness, and diversity, which prevents them from being visible in Figures 4 and 5.) We have the following observations: (1) Compared to the methods with the best accuracy (i.e., TIFUKNN and ReCANet), $\text{TREx}_{\mathit{Diversity}}$ can achieve better performance in terms of all three diversity metrics while preserving the same level of accuracy on both datasets. (2) In contrast to the other baseline methods (excluding TIFUKNN and ReCANet), $\text{TREx}_{\mathit{Diversity}}$ is able to recommend baskets with higher accuracy and higher diversity simultaneously.

Item fairness.

Figure 5. Performance of $\text{TREx}_{\mathit{Fairness}}$ at different $v$ values, compared with different NBR methods in terms of different fairness metrics. The red $+$ marker indicates the direction with both high accuracy and fairness.

The experimental results regarding accuracy and the five fairness metrics (logRUR, logEUR, logDP, EEL, and EED) are depicted in Figure 5. We make the following observations: (i) On the Instacart dataset, $\text{TREx}_{\mathit{Fairness}}$ demonstrates superior fairness w.r.t. logDP and logEUR while maintaining the same level of accuracy as the best-performing baselines (i.e., TIFUKNN and ReCANet). Similarly, on Dunnhumby, $\text{TREx}_{\mathit{Fairness}}$ achieves better fairness across four fairness metrics (logDP, logEUR, EEL, and EED) while achieving accuracy comparable to the best-performing baselines. (ii) Compared to complex baselines such as Dream, UP-CF@r, and DNNTSP, $\text{TREx}_{\mathit{Fairness}}$ is able to recommend baskets with both higher accuracy and better fairness w.r.t. logDP and logEUR. (iii) In terms of logRUR, $\text{TREx}_{\mathit{Fairness}}$ exhibits inferior fairness while maintaining accuracy similar to several existing baselines. Moreover, as accuracy and fairness decrease simultaneously, this fairness evaluation exhibits a win-win or lose-lose scenario rather than a conventional trade-off relationship.

Connections with accuracy. To get a better understanding of the possibility of leveraging the “short-cut” via TREx to improve beyond-accuracy metrics, we conduct an analysis by categorizing these beyond-accuracy metrics into different groups based on their connections with accuracy (see Section 4 and Table 3).

We observe that TREx can easily achieve better performance w.r.t. beyond-accuracy metrics that have no connection with accuracy (i.e., ILD, Entropy, DS, and logDP) on both datasets. When beyond-accuracy metrics (e.g., logEUR, EEL, and EED) are only weakly associated with accuracy, TREx outperforms alternative methods in some instances (4 out of 6). However, when beyond-accuracy metrics are strongly correlated with accuracy (e.g., logRUR), TREx struggles to achieve superior performance. Since only accurate predictions contribute to improvements in logRUR fairness, leveraging the exploration module to optimize such beyond-accuracy metrics is very challenging.

6.4. Reflections and discussions

The above results verify our hypothesis and demonstrate the effectiveness of leveraging a “short-cut” strategy to achieve better beyond-accuracy performance under the current evaluation paradigms.

Whether this “short-cut” strategy should be used in real-world scenarios, when NBR practitioners consider beyond-accuracy metrics, is open to debate. In scenarios where the accuracy of exploration is not important to practitioners and only overall accuracy is of concern, the “short-cut” strategy is a straightforward and efficient means to achieve better performance w.r.t. various beyond-accuracy metrics. In such scenarios, TREx should be considered, or at least serve as a baseline, before designing more sophisticated methods, such as multi-objective loss functions (Leng et al., 2020; Chen et al., 2020), integer programming (Zhao et al., 2023), and so on.
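To make the idea of such a baseline concrete, the following is a minimal sketch of a two-step basket construction in the spirit of the “short-cut” strategy. It is not the implementation of TREx: the function name, the `min_score` cut-off, the precomputed repetition scores, and the category-coverage rule for the exploration step are all assumptions introduced here for illustration.

```python
from typing import Dict, List, Set


def two_step_basket(
    repeat_scores: Dict[int, float],  # item -> repetition score for one user (assumed given)
    explore_pool: List[int],          # candidate items, ideally never bought by the user
    item_category: Dict[int, str],    # item -> category label
    k: int,                           # basket size
    min_score: float,                 # hypothetical cut-off on repetition scores
) -> List[int]:
    # Step 1 (repetition): keep the highest-scoring repeat items above the cut-off.
    ranked = sorted(repeat_scores, key=repeat_scores.get, reverse=True)
    basket = [i for i in ranked if repeat_scores[i] >= min_score][:k]

    # Step 2 (exploration): fill the remaining slots with explore items,
    # preferring categories not yet in the basket. Accuracy is ignored here.
    covered: Set[str] = {item_category[i] for i in basket}
    leftovers: List[int] = []
    for item in explore_pool:
        if len(basket) == k:
            break
        if item in repeat_scores:
            continue  # skip anything that is actually a repeat item
        if item_category[item] not in covered:
            basket.append(item)
            covered.add(item_category[item])
        else:
            leftovers.append(item)

    # If no new categories remain, pad with the remaining explore candidates.
    basket.extend(leftovers[: k - len(basket)])
    return basket
```

The design choice worth noting is that the second step never looks at a relevance score, which is exactly why overall accuracy is carried almost entirely by the first step.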

However, in some scenarios, it is unreasonable to sacrifice exploration accuracy even though it is low. The existence of the “short-cut” strategy therefore reveals a potential flaw in the existing evaluation paradigms (i.e., using overall metrics to define success). We look into the exploration accuracy (Li et al., 2023d) of $\mathrm{TREx}_{Diversity}$ when it outperforms several existing baselines in terms of both overall accuracy and diversity (i.e., success according to the existing evaluation paradigm). Table 6 shows a large decrease in the accuracy of the explore items in the baskets recommended by $\mathrm{TREx}_{Diversity}$ compared to these baselines, since the exploration module in $\mathrm{TREx}_{Diversity}$ is designed to improve diversity and does not consider accuracy. In this sense, we cannot claim that $\mathrm{TREx}_{Diversity}$ is superior to these baselines based on overall performance alone.

Note that the fundamental reason for the existence of this “short-cut” is that predicting explore items accurately is much more difficult than predicting repeat items, and exploration prediction accounts for only a limited share of a user’s overall accuracy (Li et al., 2023d, e, c, b). Given this minimal contribution, it becomes feasible to devote the exploration part of the basket to optimizing beyond-accuracy metrics instead of accuracy itself.
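The limited contribution of exploration can be made explicit with a per-user decomposition of recall. The notation below is an assumption introduced for illustration and is not taken from the cited works; the identity itself follows directly from splitting the ground-truth basket into two disjoint parts.

```latex
% Assumed notation: P is the recommended basket of size K, T the ground-truth
% basket, split into disjoint repeat items T^{rep} and explore items T^{expl},
% both assumed non-empty.
\begin{align*}
\mathrm{Recall}@K
  &= \frac{|P \cap T|}{|T|}
   = \frac{|P \cap T^{rep}|}{|T|} + \frac{|P \cap T^{expl}|}{|T|} \\
  &= \frac{|T^{rep}|}{|T|}\,\mathrm{Recall}_{rep}@K
   + \frac{|T^{expl}|}{|T|}\,\mathrm{Recall}_{expl}@K .
\end{align*}
```

When $\mathrm{Recall}_{expl}@K$ is close to zero for every method, the second term barely moves the overall score, so replacing accuracy-oriented explore items with items chosen for other objectives costs little overall recall.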

Therefore, beyond using overall performance to measure accuracy and beyond-accuracy metrics, a fine-grained evaluation could provide a more rigorous identification of success when considering beyond-accuracy metrics.
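A fine-grained evaluation of this kind can be sketched as follows for a single user. This is a minimal illustration under stated assumptions: explore items are taken to be ground-truth items absent from the user's history, the function names are hypothetical, and returning `None` for users without explore items (so that they can be skipped when averaging) is a choice made here, not necessarily the protocol of Li et al. (2023d).

```python
from typing import Iterable, Optional, Sequence, Set, Tuple


def split_target(history: Iterable[Set[int]], target: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Split a ground-truth basket into (repeat items, explore items)."""
    seen: Set[int] = set()
    for basket in history:
        seen.update(basket)
    return target & seen, target - seen


def recall_expl_at_k(recommended: Sequence[int], history: Iterable[Set[int]],
                     target: Set[int], k: int) -> Optional[float]:
    """Share of the user's explore items found in the top-k recommendation."""
    _, explore = split_target(history, target)
    if not explore:
        return None  # assumption: users without explore items are skipped
    hits = set(recommended[:k]) & explore
    return len(hits) / len(explore)


def phr_expl_at_k(recommended: Sequence[int], history: Iterable[Set[int]],
                  target: Set[int], k: int) -> Optional[float]:
    """1.0 if the top-k recommendation contains at least one explore item, else 0.0."""
    _, explore = split_target(history, target)
    if not explore:
        return None
    return 1.0 if set(recommended[:k]) & explore else 0.0
```

Averaging these per-user values over all users with at least one explore item gives dataset-level numbers that can be reported next to the overall metrics.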

Table 6. Exploration accuracy (Li et al., 2023d) of $\mathrm{TREx}_{Diversity}$ compared with NBR methods that are inferior to it within existing evaluation paradigms.
| Dataset | Metric | TIFUKNN | Dream | DNNTSP | TREx-Div |
|---|---|---|---|---|---|
| Instacart | $\mathrm{Recall}_{expl}@10$ | 0.0014 | 0.0322 | 0.0014 | 0.0002 |
| | $\mathrm{PHR}_{expl}@10$ | 0.0037 | 0.1431 | 0.0040 | 0.0009 |
| | $\mathrm{Recall}_{expl}@20$ | 0.0077 | 0.0526 | 0.0072 | 0.0008 |
| | $\mathrm{PHR}_{expl}@20$ | 0.0198 | 0.2120 | 0.0217 | 0.0031 |
| Dunnhumby | $\mathrm{Recall}_{expl}@10$ | 0.0042 | 0.0111 | 0.0017 | 0.0000 |
| | $\mathrm{PHR}_{expl}@10$ | 0.0139 | 0.0521 | 0.0085 | 0.0019 |
| | $\mathrm{Recall}_{expl}@20$ | 0.0069 | 0.0214 | 0.0028 | 0.0016 |
| | $\mathrm{PHR}_{expl}@20$ | 0.0232 | 0.1045 | 0.0115 | 0.0065 |

7. Conclusion

We have expanded the research objectives of NBR beyond accuracy alone to encompass both accuracy and beyond-accuracy metrics. We have identified a potential “short-cut” strategy to optimize beyond-accuracy metrics while preserving high accuracy. To leverage and validate the presence of such “short-cuts,” we have introduced a plug-and-play framework called two-step repetition-exploration (TREx), which builds on the differences between the repetition and exploration tasks. This framework treats repeat items and explore items separately, employing a simple yet highly effective repetition module to maintain accuracy, while two exploration modules target the optimization of beyond-accuracy metrics. We have conducted experiments on two publicly available datasets w.r.t. eight representative beyond-accuracy metrics, covering item fairness (i.e., logEUR, logRUR, logDP, EEL, and EED) and diversity (i.e., ILD, Entropy, and DS).

Our experimental results demonstrate the effectiveness of our proposed “short-cut” strategy, which can achieve better beyond-accuracy performance w.r.t. several fairness and diversity metrics on different datasets. Additionally, we group beyond-accuracy metrics according to the strength of their connection with accuracy. Our analysis reveals that the stronger the connection with accuracy, the more difficult it becomes to employ a “short-cut” strategy to optimize these beyond-accuracy metrics. This suggests favoring metrics with a stronger connection to accuracy in order to guard against such short-cuts.

As to the broader implications of our work, we have discussed the reasonableness of leveraging the “short-cut” strategy to trade the accuracy of exploration for beyond-accuracy metrics in various scenarios. The presence of this “short-cut” highlights a potential flaw in the definition of success within existing evaluation paradigms, particularly in scenarios where exploration accuracy is important despite being low (Williams et al., 2014). A fine-grained evaluation should be performed in NBR to determine more precisely whether “better” performance has been achieved in such scenarios.

Despite the simplicity of the “short-cut” strategy and TREx, our paper sheds light on the research direction of considering both accuracy and beyond-accuracy metrics in NBR. Rather than blindly embracing sophisticated methods in NBR, follow-up research should be aware of the existence of the “short-cut” and of the potential flaws of existing evaluation paradigms in this research direction.

Acknowledgements

This work is partially supported by the Dutch Research Council (NWO), under project numbers 024.004.022, NWA.1389.20.183, KICH3.LTP.20.006, and VI.Vidi.223.166. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

  • Ariannezhad et al. (2022) Mozhdeh Ariannezhad, Sami Jullien, Ming Li, Min Fang, Sebastian Schelter, and Maarten de Rijke. 2022. ReCANet: A Repeat Consumption-Aware Neural Network for Next Basket Recommendation in Grocery Shopping. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1240–1250.
  • Ariannezhad et al. (2023) Mozhdeh Ariannezhad, Ming Li, Sami Jullien, and Maarten de Rijke. 2023. Complex Item Set Recommendation. In SIGIR 2023: 46th international ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 3444–3447.
  • Bai et al. (2018) Ting Bai, Jian-Yun Nie, Wayne Xin Zhao, Yutao Zhu, Pan Du, and Ji-Rong Wen. 2018. An Attribute-aware Neural Attentive Model for Next Basket recommendation. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. 1201–1204.
  • Biega et al. (2018) Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. In The 41st international ACM SIGIR conference on research & development in information retrieval. 405–414.
  • Bobadilla et al. (2020) Jesús Bobadilla, Raúl Lara-Cabrera, Ángel González-Prieto, and Fernando Ortega. 2020. Deepfair: Deep Learning for Improving Fairness in Recommender Systems. arXiv preprint arXiv:2006.05255 (2020).
  • Cen et al. (2020) Yukuo Cen, Jianwei Zhang, Xu Zou, Chang Zhou, Hongxia Yang, and Jie Tang. 2020. Controllable Multi-interest Framework for Recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2942–2951.
  • Chen et al. (2020) Wanyu Chen, Pengjie Ren, Fei Cai, Fei Sun, and Maarten de Rijke. 2020. Improving end-to-end sequential recommendations with intent-aware diversification. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 175–184.
  • Chen et al. (2021b) Wanyu Chen, Pengjie Ren, Fei Cai, Fei Sun, and Maarten de Rijke. 2021b. Multi-interest Diversification for End-to-end Sequential Recommendation. ACM Transactions on Information Systems 40, 1 (2021), 1–30.
  • Chen et al. (2021a) Yongjun Chen, Jia Li, Chenghao Liu, Chenxi Li, Markus Anderle, Julian McAuley, and Caiming Xiong. 2021a. Modeling Dynamic Attributes for Next Basket Recommendation. arXiv preprint arXiv:2109.11654 (2021).
  • Diaz et al. (2020) Fernando Diaz, Bhaskar Mitra, Michael D Ekstrand, Asia J Biega, and Ben Carterette. 2020. Evaluating Stochastic Rankings with Expected Exposure. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 275–284.
  • Ekstrand et al. (2019) Michael D. Ekstrand, Robin Burke, and Fernando Diaz. 2019. Fairness and Discrimination in Retrieval and Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1403–1404.
  • Faggioli et al. (2020) Guglielmo Faggioli, Mirko Polato, and Fabio Aiolli. 2020. Recency Aware Collaborative Filtering for Next Basket Recommendation. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization. 80–87.
  • Ge et al. (2021) Yingqiang Ge, Shuchang Liu, Ruoyuan Gao, Yikun Xian, Yunqi Li, Xiangyu Zhao, Changhua Pei, Fei Sun, Junfeng Ge, Wenwu Ou, and Yongfeng Zhang. 2021. Towards Long-Term Fairness in Recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 445–453.
  • Hu and He (2019) Haoji Hu and Xiangnan He. 2019. Sets2Sets: Learning from Sequential Sets with Neural Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1491–1499.
  • Hu et al. (2020) Haoji Hu, Xiangnan He, Jinyang Gao, and Zhi-Li Zhang. 2020. Modeling Personalized Item Frequency Information for Next-basket Recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1071–1080.
  • Jannach (2022) Dietmar Jannach. 2022. Multi-Objective Recommender Systems: Survey and Challenges. In MORS workshop held in conjunction with the 16th ACM Conference on Recommender Systems (RecSys), 2022.
  • Katz et al. (2022) Ori Katz, Oren Barkan, Noam Koenigstein, and Nir Zabari. 2022. Learning to Ride a Buy-Cycle: A Hyper-Convolutional Model for Next Basket Repurchase Recommendation. In Proceedings of the 16th ACM Conference on Recommender Systems. 316–326.
  • Kumar and Hosanagar (2019) Anuj Kumar and Kartik Hosanagar. 2019. Measuring the Value of Recommendation Links on Product Demand. Information Systems Research 30, 3 (2019), 819–838.
  • Le et al. (2019) Duc-Trong Le, Hady W Lauw, and Yuan Fang. 2019. Correlation-sensitive Next-basket Recommendation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 2808–2814.
  • Leng et al. (2020) Youfang Leng, Li Yu, Jie Xiong, and Guanyu Xu. 2020. Recurrent Convolution Basket Map for Diversity Next-Basket Recommendation. In International Conference on Database Systems for Advanced Applications. 638–653.
  • Li et al. (2023a) Ming Li, Mozhdeh Ariannezhad, Andrew Yates, and Maarten de Rijke. 2023a. Masked and Swapped Sequence Modeling for Next Novel Basket Recommendation in Grocery Shopping. In Proceedings of the 17th ACM Conference on Recommender Systems. 35–46.
  • Li et al. (2023b) Ming Li, Mozhdeh Ariannezhad, Andrew Yates, and Maarten de Rijke. 2023b. Who Will Purchase this Item Next? Reverse Next Period Recommendation in Grocery Shopping. ACM Transactions on Recommender Systems 1, 2 (2023), 1–32.
  • Li et al. (2023c) Ming Li, Jin Huang, and Maarten de Rijke. 2023c. Repetition and Exploration in Offline Reinforcement Learning-based Recommendations. In 4th Workshop on Deep Reinforcement Learning for Information Retrieval at CIKM 2023. ACM.
  • Li et al. (2023d) Ming Li, Sami Jullien, Mozhdeh Ariannezhad, and Maarten de Rijke. 2023d. A next basket recommendation reality check. ACM Transactions on Information Systems 41, 4 (2023), 1–29.
  • Li et al. (2023e) Ming Li, Ali Vardasbi, Andrew Yates, and Maarten de Rijke. 2023e. Repetition and Exploration in Sequential Recommendation. In SIGIR 2023: 46th international ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2532–2541.
  • Li et al. (2022) Yunqi Li, Hanxiong Chen, Shuyuan Xu, Yingqiang Ge, Juntao Tan, Shuchang Liu, and Yongfeng Zhang. 2022. Fairness in recommendation: A survey. ACM Transactions on Intelligent Systems and Technology (2022).
  • Liang et al. (2021) Yile Liang, Tieyun Qian, Qing Li, and Hongzhi Yin. 2021. Enhancing Domain-level and User-level Adaptivity in Diversified Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 747–756.
  • Liu et al. (2024) Yuanna Liu, Ming Li, Mozhdeh Ariannezhad, Masoud Mansoury, Mohammad Aliannejadi, and Maarten de Rijke. 2024. Measuring Item Fairness in Next Basket Recommendation: A Reproducibility Study. In ECIR 2024: 46th European Conference on Information Retrieval. Springer, 210–225.
  • Ludewig and Jannach (2018) Malte Ludewig and Dietmar Jannach. 2018. Evaluation of Session-based Recommendation Algorithms. User Modeling and User-Adapted Interaction 28 (2018), 331–390.
  • Morik et al. (2020) Marco Morik, Ashudeep Singh, Jessica Hong, and Thorsten Joachims. 2020. Controlling Fairness and Bias in Dynamic Learning-to-Rank. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 429–438.
  • Naghiaei et al. (2022) Mohammadmehdi Naghiaei, Hossein A. Rahmani, and Yashar Deldjoo. 2022. CPFair: Personalized Consumer and Producer Fairness Re-Ranking for Recommender Systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 770–779.
  • Naumov et al. (2023) Sergey Naumov, Marina Ananyeva, Oleg Lashinin, Sergey Kolesnikov, and Dmitry I Ignatov. 2023. Time-Dependent Next-Basket Recommendations. In European Conference on Information Retrieval. Springer, 502–511.
  • OECD (2020) OECD. 2020. E-commerce in the Time of COVID-19. https://www.oecd.org/coronavirus/policy-responses/e-commerce-in-the-time-of-covid-19-3a2b78e8/.
  • Qin et al. (2021) Yuqi Qin, Pengfei Wang, and Chenliang Li. 2021. The World is Binary: Contrastive Learning for Denoising Next Basket Recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 859–868.
  • Quadrana et al. (2018) Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. Comput. Surveys 51, 4 (2018), 1–36.
  • Raj and Ekstrand (2022) Amifa Raj and Michael D Ekstrand. 2022. Measuring fairness in ranked results: An analytical and empirical comparison. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 726–736.
  • Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing Personalized Markov Chains for Next-basket Recommendation. In Proceedings of the 19th International Conference on World Wide Web. 811–820.
  • Singh and Joachims (2018) Ashudeep Singh and Thorsten Joachims. 2018. Fairness of Exposure in Rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2219–2228.
  • Sun et al. (2020) Leilei Sun, Yansong Bai, Bowen Du, Chuanren Liu, Hui Xiong, and Weifeng Lv. 2020. Dual Sequential Network for Temporal Sets Prediction. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1439–1448.
  • Wan et al. (2018) Mengting Wan, Di Wang, Jie Liu, Paul Bennett, and Julian McAuley. 2018. Representing and Recommending Shopping Baskets with Complementarity, Compatibility and Loyalty. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1133–1142.
  • Wang et al. (2015) Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2015. Learning Hierarchical Representation Model for Next Basket Recommendation. In Proceedings of the 38th International ACM SIGIR conference on Research and Development in Information Retrieval. 403–412.
  • Wang et al. (2019b) Pengfei Wang, Yongfeng Zhang, Shuzi Niu, and Jiafeng Guo. 2019b. Modeling Temporal Dynamics of Users’ Purchase Behaviors for Next Basket Prediction. Journal of Computer Science and Technology 34, 6 (2019), 1230–1240.
  • Wang et al. (2019a) Shoujin Wang, Liang Hu, Yan Wang, Quan Z Sheng, Mehmet Orgun, and Longbing Cao. 2019a. Modeling Multi-purpose Sessions for Next-item Recommendations via Mixture-channel Purpose Routing Networks. In International Joint Conference on Artificial Intelligence.
  • Wang et al. (2020) Shoujin Wang, Liang Hu, Yan Wang, Quan Z Sheng, Mehmet Orgun, and Longbing Cao. 2020. Intention Nets: Psychology-inspired User Choice Behavior Modeling for Next-basket Prediction. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. 6259–6266.
  • Williams et al. (2014) Patti Williams, Iris W. Hung, Anirban Mukhopadhyay, Rik Pieters, Xinyue Zhou, Tim Wildschut, Constantine Sedikides, Kan Shi, Cong Feng, Cassie Mogilner, Jennifer Aaker, Sepandar D. Kamvar, Fabrizio Di Muro, and Kyle B. Murray. 2014. Emotions and Consumer Behavior. Journal of Consumer Research 40, 5 (2014), viii–xi.
  • Wu et al. (2022) Haolun Wu, Bhaskar Mitra, Chen Ma, Fernando Diaz, and Xue Liu. 2022. Joint Multisided Exposure Fairness for Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 703–714.
  • Wu et al. (2021) Yao Wu, Jian Cao, Guandong Xu, and Yudong Tan. 2021. TFROM: A Two-Sided Fairness-Aware Recommendation Model for Both Customers and Providers. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1013–1022.
  • Yin et al. (2023) Qing Yin, Hui Fang, Zhu Sun, and Yew-Soon Ong. 2023. Understanding Diversity in Session-Based Recommendation. ACM Transactions on Information Systems 42, 1 (2023), 1–34.
  • Yu et al. (2016) Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. A Dynamic Recurrent Model for Next Basket Recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 729–732.
  • Yu et al. (2020) Le Yu, Leilei Sun, Bowen Du, Chuanren Liu, Hui Xiong, and Weifeng Lv. 2020. Predicting Temporal Sets with Deep Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1083–1091.
  • Zehlike and Castillo (2020) Meike Zehlike and Carlos Castillo. 2020. Reducing Disparate Exposure in Ranking: A Learning To Rank Approach. In Proceedings of The Web Conference 2020. 2849–2855.
  • Zhang and Hurley (2008) Mi Zhang and Neil Hurley. 2008. Avoiding Monotony: Improving the Diversity of Recommendation Lists. In Proceedings of the 2008 ACM Conference on Recommender Systems. 123–130.
  • Zhao et al. (2023) Yuying Zhao, Yu Wang, Yunchao Liu, Xueqi Cheng, Charu Aggarwal, and Tyler Derr. 2023. Fairness and Diversity in Recommender Systems: A Survey. arXiv preprint arXiv:2307.04644 (2023).
  • Zheng et al. (2021) Yu Zheng, Chen Gao, Liang Chen, Depeng Jin, and Yong Li. 2021. DGCN: Diversified Recommendation with Graph Convolutional Networks. In Proceedings of the Web Conference 2021. 401–412.