A First Look at Selection Bias in Preference Elicitation for Recommendation

Shashank Gupta (University of Amsterdam, Amsterdam, The Netherlands), Harrie Oosterhuis (Radboud Universiteit, Nijmegen, The Netherlands), and Maarten de Rijke (University of Amsterdam, Amsterdam, The Netherlands)
(2023)
Abstract.
Preference elicitation (PE) explicitly asks users what kind of recommendations they would like to receive. It is a popular technique for conversational recommender systems to deal with cold-starts. Previous work has studied selection bias in implicit feedback, e.g., clicks, and in some forms of explicit feedback, i.e., ratings on items. Although the extreme sparsity of preference elicitation interactions makes them considerably more prone to selection bias than natural interactions, the effect of selection bias in preference elicitation on the resulting recommendations has not been studied yet. To address this gap, we take a first look at the effects of selection bias in preference elicitation and how they may be further investigated in the future. We find that a big hurdle is the current lack of any publicly available dataset that has preference elicitation interactions. As a solution, we propose a simulation of a topic-based preference elicitation process. The results from our simulation-based experiments indicate (i) that ignoring the effect of selection bias early in preference elicitation can lead to an exacerbation of overrepresentation in subsequent item recommendations, and (ii) that debiasing methods can alleviate this effect, which leads to significant improvements in subsequent item recommendation performance. Our aim is for the proposed simulator and initial results to provide a starting point and motivation for future research into this important but overlooked problem setting.

CONSEQUENCES Workshop at RecSys ’23, September 18-22, 2023, Singapore. Submission ID: 760. CCS concepts: Information systems → Recommender systems.

1. Introduction

Traditional recommender systems provide a single-shot human-system interface that is static in nature. They often rely on the user’s past interactions to infer their preferences and generate a recommendation based on that. Traditional collaborative filtering (CF)-based methods fall into this category (Ilievski and Roy, 2013; Jannach et al., 2018; He et al., 2017). However, these methods have trouble handling settings where user preferences are dynamic – in practice, preferences often drift over time due to external covariates (Jannach et al., 2018) – or single-shot recommendation settings where user intent has to be inferred from contextual information, instead of past interactions (Mehrotra et al., 2019). Additionally, these methods struggle to generate good recommendations for cold-start users and items. These issues, coupled with the sparse nature of user-item interaction data, make learning a good recommendation model difficult. A solution to these issues could be asking for a user’s preferences directly at a coarser granularity in a preference elicitation (PE) stage. Users are generally very willing to indicate or clarify their preferences, when prompted (Priyogi, 2019).

Figure 1. Rating distribution over item topics on the Coat Music dataset (Left), and Genre popularity in the MovieLens dataset (Right).

PE can be used in a variety of settings, including so-called question-based conversational recommender systems (CRSs) (Zhang et al., 2020; Christakopoulou et al., 2018; Lei et al., 2020), which consist of the following main components: (i) PE, where the user’s preferences on items or item topics are collected or elicited, and, subsequently, (ii) item recommendation, where the system generates recommendations for users, conditioned on their responses during the PE stage. The interactive aspect of CRSs can help in dealing with dynamic user preferences and the lack of intent information. It can also help with the cold-start problem, by collecting a user’s preferences on a group of items, instead of on an item directly (Chang et al., 2015).

Recommender systems are commonly optimized based on logged user interactions. However, such interactions provide a biased view of the actual user preferences (Marlin and Zemel, 2009; Marlin et al., 2007; Saito et al., 2020; Yang et al., 2018). In particular, ratings are generally not evenly spread over all items but are heavily affected by popularity bias, resulting in a small number of items receiving most ratings. Figure 1 (left) demonstrates this effect on the rating distribution of item topics in Coat, a popular recommendation dataset with an unbiased test set (Schnabel et al., 2016). Popularity bias can be seen as a specific form of selection bias, due to which only part of the user preferences are observed in ratings (Marlin et al., 2007). Importantly, selection bias on the item level propagates to the topic level; for example, Figure 1 (right) shows the popularity distribution over movie genres in the MovieLens dataset. Similar to how selection bias in item ratings results in a biased view over topic preferences, it seems likely that selection bias in a PE stage could negatively affect the subsequent recommendation stage. While selection bias in user interaction data is widely studied (Marlin and Zemel, 2009; Marlin et al., 2007; Saito et al., 2020; Yang et al., 2018; Schnabel et al., 2016), to the best of our knowledge, previous work has not considered the effects of selection bias in PE. To address this gap, this work takes a first look at the problem of selection bias in PE for recommendation. We focus on elicitation at the topic level followed by item recommendation. Because there is currently no publicly available recommendation dataset that represents PE, we introduce a method for simulating a PE stage from static recommendation datasets. Our experimental results in the simulator reveal that selection bias in the PE stage does, indeed, have negative effects on subsequent item recommendation.
We find that existing debiasing methods can be adapted to reduce these effects, leading to significantly better recommendations.

2. Correcting for Selection Bias in Preference Elicitation

In this section, we discuss how common debiasing methods for item recommendation can be applied to topic-level PE (Schnabel et al., 2016). Let $U$ be the set of all users, $I$ the set of all items, and $T$ the set of all item-topics (referred to as topics hereafter) in the dataset, and $Y \in \{0,1\}^{|U| \cdot |T|}$ the user-topic complete rating matrix; $Y_{u,t}$ is the true rating for the pair $(u,t)$. $T \in \{0,1\}^{|I| \cdot |T|}$ is the indicator matrix where $T_{i,t} = 1$ if item $i$ belongs to the topic $t$. $R \in \{0,1\}^{|U| \cdot |I|}$ is the rating matrix, with entry $R_{u,i}$ indicating user $u$’s rating for item $i$. In reality, not all entries in the $Y$ matrix are observed; let $O \in \{0,1\}^{|U| \cdot |T|}$ be the observation matrix, with $O_{u,t}$ indicating whether the rating $Y_{u,t}$ is observed or not. The entries in the $Y$ matrix are affected by selection bias. $O$ controls the selection bias, where certain ratings are overrepresented or underrepresented in the dataset; we use $\rho_{u,t} = P(O_{u,t} = 1)$ to denote the probability of observing a rating $Y_{u,t}$ in the dataset.

Ideal rating estimator. An ideal rating prediction loss can be defined as follows:

$$\mathcal{L}_{\text{ideal}} = \frac{1}{|U||T|} \sum_{u,t} L(\hat{y}_{u,t}, y_{u,t}). \tag{1}$$

The loss function $L(\hat{y}_{u,t}, y_{u,t})$ used for rating prediction could be mean squared error (MSE).

Naive rating estimator. One could naively ignore selection bias in the observed rating data and estimate the prediction loss by simple averaging, resulting in the naive training loss estimator:

$$\mathcal{L}_{\text{naive}} = \frac{1}{|\{u,t : O_{u,t} = 1\}|} \sum_{u,t : O_{u,t} = 1} L(\hat{y}_{u,t}, y_{u,t}), \tag{2}$$

where $|\{u,t : O_{u,t} = 1\}|$ is the number of observed ratings in the dataset. It is clearly a biased estimator of the ideal loss (Eq. 1) (Schnabel et al., 2016).

Unbiased preference elicitation. To debias the loss function in Eq. 2, we apply inverse propensity scoring (IPS) (Joachims et al., 2017; Schnabel et al., 2016; Saito et al., 2020), where each observed loss term is weighted by the inverse of its propensity value $\rho_{u,t} = P(O_{u,t} = 1)$. The modified loss function is defined as follows:

$$\mathcal{L}_{\text{ips}} = \frac{1}{|U||T|} \sum_{u,t : O_{u,t} = 1} \frac{L(\hat{y}_{u,t}, y_{u,t})}{\rho_{u,t}}. \tag{3}$$

The modified $\mathcal{L}_{\text{ips}}$ is an unbiased estimate of the ideal loss defined in Eq. 1 (Schnabel et al., 2016; Saito et al., 2020), i.e., $\mathbb{E}_{O}[\mathcal{L}_{\text{ips}}] = \mathcal{L}_{\text{ideal}}$.
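To make the difference between the three estimators concrete, the following is a minimal sketch in plain Python, not the released implementation. It assumes a toy binary rating matrix, a constant prediction of 1.0, squared error as the per-entry loss, and hypothetical propensities of 0.8 for positive and 0.2 for negative ratings; all function and variable names are illustrative. Averaged over many sampled observation matrices, the IPS loss should approach the ideal loss, whereas the naive loss should not.

```python
import random


def squared_error(y_hat, y):
    return (y_hat - y) ** 2


def ideal_loss(y_hat, y):
    """Eq. 1: average loss over all user-topic pairs."""
    n_users, n_topics = len(y), len(y[0])
    total = sum(squared_error(y_hat[u][t], y[u][t])
                for u in range(n_users) for t in range(n_topics))
    return total / (n_users * n_topics)


def naive_loss(y_hat, y, obs):
    """Eq. 2: average loss over the observed pairs only."""
    losses = [squared_error(y_hat[u][t], y[u][t])
              for u in range(len(y)) for t in range(len(y[0])) if obs[u][t]]
    if not losses:
        raise ValueError("no observed ratings")
    return sum(losses) / len(losses)


def ips_loss(y_hat, y, obs, rho):
    """Eq. 3: observed losses weighted by 1 / rho, normalized by |U||T|."""
    n_users, n_topics = len(y), len(y[0])
    total = sum(squared_error(y_hat[u][t], y[u][t]) / rho[u][t]
                for u in range(n_users) for t in range(n_topics) if obs[u][t])
    return total / (n_users * n_topics)


rng = random.Random(0)
N_USERS, N_TOPICS = 20, 10

# Toy ground truth and a deliberately poor model that always predicts 1.0.
y = [[rng.randint(0, 1) for _ in range(N_TOPICS)] for _ in range(N_USERS)]
y_hat = [[1.0] * N_TOPICS for _ in range(N_USERS)]

# ASSUMPTION: positive ratings are observed far more often than negative ones.
rho = [[0.8 if y[u][t] == 1 else 0.2 for t in range(N_TOPICS)]
       for u in range(N_USERS)]

n_draws = 1000
naive_vals, ips_vals = [], []
for _ in range(n_draws):
    # Sample an observation matrix O with P(O[u][t] = 1) = rho[u][t].
    obs = [[rng.random() < p for p in row] for row in rho]
    naive_vals.append(naive_loss(y_hat, y, obs))
    ips_vals.append(ips_loss(y_hat, y, obs, rho))

ideal = ideal_loss(y_hat, y)
mean_naive = sum(naive_vals) / n_draws
mean_ips = sum(ips_vals) / n_draws
print(f"ideal={ideal:.3f} naive={mean_naive:.3f} ips={mean_ips:.3f}")
```

In this toy setting the naive estimate is optimistic because the entries the model gets wrong (the negative ratings) are rarely observed, while the inverse propensity weights compensate for this in expectation.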

3. Experiments

Below, we describe the semi-synthetic and fully-synthetic experimental setups, followed by the empirical results. For details on simulating preference elicitation data and on synthetic topic generation, we refer to Appendix A.

Yahoo! R3 dataset. This dataset was collected as part of a music-recommendation service; it includes ratings from 15,400 users on 1,000 items that are self-selected by users, i.e., these are missing-not-at-random (MNAR) ratings (Yahoo! R3, 2022). A separate test set consists of ratings collected under a uniformly random policy, ensuring that these ratings are free from selection bias. Topic information is not present in the dataset, hence we use the synthetic topic generation method discussed in Appendix A. We use 20% of the unbiased test data to build the bipartite user-item graph and generate item embeddings, followed by synthetic topic generation, and finally the unbiased PE data (Appendix A). For clustering, we experiment with different numbers of clusters to evaluate the robustness of the method under different setups.

Fully-synthetic dataset. Along with simulating conversations from user-item interactions, we also experiment with a fully-synthetic dataset setting, where we simulate user-topic interactions directly. Following Huang et al. (2020), we apply the following two-stage process: (i) Given $N$ users and $T$ topics, the latent factors for users ($\mathbf{P} \in \mathbb{R}^{N \times d}$) and topics ($\mathbf{Q} \in \mathbb{R}^{T \times d}$) are sampled from a Gaussian distribution $\mathcal{N}(0, 1)$. The rating scores are generated via the dot product of user and topic latent factors. (ii) The MNAR logged data is generated via the following mechanism:

$$P(o_{u,t} \mid y_{u,t}) = \alpha P(o_{u,t} \mid y_{u,t}, \text{pos-bias}) + (1 - \alpha) P(o_{u,t} \mid \text{uniform}) \tag{4}$$
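The two-stage process can be sketched as follows. This is an illustrative reading of Eq. 4 in plain Python, not the released simulator: the exact form of the positivity-biased term is not specified in the text, so the sketch assumes it is a sigmoid of the rating score, and it assumes a constant uniform observation probability of 0.2. The function names and default values are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_logged_data(n_users, n_topics, dim, alpha, uniform_prob=0.2, seed=0):
    """Return (ratings, propensity, observed) for a fully-synthetic PE dataset."""
    rng = random.Random(seed)
    # Stage (i): latent factors drawn from N(0, 1), ratings via dot products.
    user_factors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_users)]
    topic_factors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_topics)]
    ratings = [[sum(p * q for p, q in zip(pu, qt)) for qt in topic_factors]
               for pu in user_factors]
    # Stage (ii): mixture of a positivity-biased and a uniform observation model.
    # ASSUMPTION: the pos-bias term is sigmoid(rating); the paper leaves it open.
    propensity = [[alpha * sigmoid(r) + (1.0 - alpha) * uniform_prob for r in row]
                  for row in ratings]
    observed = [[rng.random() < rho for rho in row] for row in propensity]
    return ratings, propensity, observed


def mean_observed_rating(ratings, observed):
    vals = [r for r_row, o_row in zip(ratings, observed)
            for r, o in zip(r_row, o_row) if o]
    return sum(vals) / len(vals)


ratings, propensity, observed = simulate_logged_data(200, 20, 8, alpha=1.0)
all_mean = sum(map(sum, ratings)) / (200 * 20)
obs_mean = mean_observed_rating(ratings, observed)
print(f"mean rating: all={all_mean:.3f} observed={obs_mean:.3f}")
```

With `alpha=1.0` the observed ratings are skewed toward high scores, while `alpha=0.0` reduces the mechanism to uniform sampling with a constant propensity.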

The simulator is available at: https://github.com/shashankg7/Bias-Preference-Elicitation.

Table 1. Performance of the debiasing method on the unbiased rating prediction task on the Yahoo! R3 dataset. Significant improvements over the baseline (MF) are marked with ($p < 0.01$). Average values over 10 different runs are reported.

| Exp. setting | Method | MAE ↓ | MSE ↓ | NDCG@3 ↑ |
|---|---|---|---|---|
| #clusters = 25 | MF | 1.3041 | 2.5634 | 0.7461 |
| | ExpoMF | 1.3075 | 2.8213 | 0.7503 |
| | MF-IPS | 0.8327 | 1.0832 | 0.7511 |
| #clusters = 50 | MF | 1.3094 | 2.5857 | 0.7476 |
| | ExpoMF | 1.3050 | 2.8138 | 0.7511 |
| | MF-IPS | 0.8268 | 1.0777 | 0.7553 |
| #clusters = 75 | MF | 1.3112 | 2.5887 | 0.7460 |
| | ExpoMF | 0.8451 | 1.1530 | 0.7505 |
| | MF-IPS | 0.8451 | 1.1530 | 0.7521 |
| #clusters = 100 | MF | 1.3057 | 2.5403 | 0.7460 |
| | ExpoMF | 1.3109 | 2.8316 | 0.7499 |
| | MF-IPS | 0.8464 | 1.1553 | 0.7518 |

Table 2. Performance of the debiasing method on the unbiased rating prediction task on the fully-synthetic dataset. Significant improvements over the baseline (MF) are marked with ($p < 0.01$). Average values over 10 different runs are reported.

| Exp. setting | Method | MAE ↓ | MSE ↓ | NDCG@3 ↑ |
|---|---|---|---|---|
| $\alpha = 0.25$ | MF | 0.8449 | 1.0847 | 0.7611 |
| | ExpoMF | 1.6643 | 3.9344 | 0.6638 |
| | MF-IPS | 0.7894 | 0.9874 | 0.7511 |
| $\alpha = 0.5$ | MF | 0.8666 | 1.1461 | 0.7852 |
| | ExpoMF | 1.6506 | 3.9178 | 0.6838 |
| | MF-IPS | 0.7670 | 0.9185 | 0.7836 |
| $\alpha = 0.75$ | MF | 0.9012 | 1.2383 | 0.8053 |
| | ExpoMF | 1.6469 | 3.9708 | 0.6984 |
| | MF-IPS | 0.7330 | 0.8322 | 0.8230 |
| $\alpha = 1.0$ | MF | 0.9622 | 1.3974 | 0.8179 |
| | ExpoMF | 1.6473 | 4.0386 | 0.7121 |
| | MF-IPS | 0.7254 | 0.8078 | 0.8362 |

4. Results

We evaluate the effect of debiasing PE on the unbiased test set. We use mean absolute error (MAE) and mean squared error (MSE) as evaluation metrics (Schnabel et al., 2016) for measuring accuracy in rating prediction. To evaluate the quality of rankings, we use NDCG@3, following Saito (2020). We use ExpoMF (Liang et al., 2016) as a baseline for debiasing, which uses a generative model to correct for the bias.
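For reference, the three metrics can be computed as in the sketch below. It is a minimal illustration for a single user's topic list, not the evaluation code used for the reported numbers; in particular, the linear gain and the $\log_2$ discount in the DCG are assumptions, since the exact NDCG variant is not stated here, and all names are illustrative.

```python
import math


def mae(preds, targets):
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def mse(preds, targets):
    """Mean squared error between predicted and true ratings."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def dcg_at_k(relevances, k):
    # ASSUMPTION: linear gain with a log2(rank + 1) discount, ranks starting at 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(preds, targets, k=3):
    """NDCG@k of the ranking induced by the predicted ratings for one user."""
    order = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    ranked = [targets[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(targets, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: true and predicted ratings of one user over four topics.
targets = [5, 1, 3, 4]
preds = [4.0, 2.0, 3.5, 3.0]
example_mae = mae(preds, targets)
example_mse = mse(preds, targets)
example_ndcg = ndcg_at_k(preds, targets, k=3)
print(example_mae, example_mse, round(example_ndcg, 4))
```

In a full evaluation, NDCG@3 would be computed per user on the unbiased test ratings and then averaged over users.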

Results for the semi-synthetic dataset are presented in Table 1. Results are reported for different numbers of item clusters in the synthetic topic generation (see Appendix A). Each number of clusters represents a different PE setting in which the number of item topics varies. The metric values suggest that naively learning rating prediction (using the objective in Eq. 2) results in sub-optimal performance across all cluster settings. The results suggest that, even for a small-scale PE system (with 25 item-topics), selection bias exists and using IPS for debiasing helps.

For the fully-synthetic setup, results are presented in Table 2. Results are reported for different values of $\alpha$ (see Eq. 4), which controls the degree of positivity bias in the simulated logged data. A lower value of $\alpha$ represents a setting where the second term (with uniform observation probability) dominates, simulating a setting where data is sampled from a uniformly random policy. Similarly, a higher value of $\alpha$ represents a setting with higher positivity bias. The results from a debiasing rating-prediction method (MF-IPS) are consistent with the results in the semi-synthetic setting for the rating prediction task, for the MAE and MSE metrics. However, for lower values of $\alpha$ (0.25, 0.5), the baseline matrix factorization (MF) outperforms the other methods in terms of NDCG. We suspect this is caused by the uniform data generation part dominating the biased counterpart, hence there is less signal for learning user preferences. For higher values of $\alpha$, the results are consistently better for the IPS method. It is also interesting to note that even for the case where the uniformly random policy dominates ($\alpha = 0.25$), debiasing improves the performance in terms of MAE and MSE.

The results in this section show that a naive method for rating prediction in the PE stage results in a sub-optimal system, which we consistently observe across all experimental setups.

5. Conclusion

We have explored the effect of selection bias in PE for recommender systems. We have shown that user-item interactions (ratings) in the preference elicitation stage suffer from the issue of selection bias, which is a common issue when dealing with ratings at the item-level (Schnabel et al., 2016). We have also explored how training a PE system on biased data can lead to error propagation in downstream tasks. To the best of our knowledge, we are the first to explore and identify the issue of bias in the PE stage. We have shown that, similar to the case of static item recommendations, selection bias exists in a PE setting as well.

We have also investigated the application of existing debiasing methods used in item-based recommendation, and have shown that these methods can be successfully applied in our setting. Importantly, given the lack of unbiased test collections for evaluating bias in PE, we have proposed, and are sharing, a simulation method to generate an unbiased test collection for evaluating debiasing methods. Finally, with the release of our simulator and experimental source code, in addition to our comparison of existing methods, we wish to provide a starting point and motivation for future research to further investigate the problem of bias in similar areas. As part of future work, we propose a joint debiasing method for the PE stage and the corresponding downstream tasks.

References

  • Chang et al. (2015) Shuo Chang, F Maxwell Harper, and Loren Terveen. 2015. Using Groups of Items for Preference Elicitation in Recommender Systems. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. 1258–1269.
  • Christakopoulou et al. (2018) Konstantina Christakopoulou, Alex Beutel, Rui Li, Sagar Jain, and Ed H. Chi. 2018. Q&R: A Two-Stage Approach toward Interactive Recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 139–148.
  • Gao et al. (2018) Ming Gao, Leihui Chen, Xiangnan He, and Aoying Zhou. 2018. BiNE: Bipartite Network Embedding. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 715–724.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
  • Huang et al. (2020) Jin Huang, Harrie Oosterhuis, Maarten de Rijke, and Herke van Hoof. 2020. Keeping Dataset Biases out of the Simulation: A Debiased Simulator for Reinforcement Learning based Recommender Systems. In Fourteenth ACM Conference on Recommender Systems. 190–199.
  • Ilievski and Roy (2013) Ilija Ilievski and Sujoy Roy. 2013. Personalized News Recommendation based on Implicit Feedback. In Proceedings of the 2013 International News Recommender Systems Workshop and Challenge. 10–15.
  • Jannach et al. (2018) Dietmar Jannach, Lukas Lerche, and Markus Zanker. 2018. Recommending Based on Implicit Feedback. In Social Information Access. Springer, 510–569.
  • Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 781–789.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, ICLR.
  • Lei et al. (2020) Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. 2020. Estimation-Action-Reflection: Towards Deep Interaction between Conversational and Recommender Systems. In Proceedings of the 13th International Conference on Web Search and Data Mining. 304–312.
  • Liang et al. (2016) Dawen Liang, Laurent Charlin, James McInerney, and David M Blei. 2016. Modeling User Exposure in Recommendation. In Proceedings of the 25th international conference on World Wide Web. 951–961.
  • Marlin and Zemel (2009) Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative Prediction and Ranking with Non-Random Missing Data. In Proceedings of the Third ACM Conference on Recommender Systems. 5–12.
  • Marlin et al. (2007) Benjamin M Marlin, Richard S Zemel, Sam Roweis, and Malcolm Slaney. 2007. Collaborative Filtering and the Missing at Random Assumption. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence. 267–275.
  • Mehrotra et al. (2019) Rishabh Mehrotra, Mounia Lalmas, Doug Kenney, Thomas Lim-Meng, and Golli Hashemian. 2019. Jointly Leveraging Intent and Interaction Signals to Predict User Satisfaction with Slate Recommendations. In The World Wide Web Conference. 1256–1267.
  • Priyogi (2019) Bilih Priyogi. 2019. Preference Elicitation Strategy for Conversational Recommender System. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 824–825.
  • Reynolds (2009) Douglas A. Reynolds. 2009. Gaussian Mixture Models. Encyclopedia of Biometrics 741, 659-663 (2009).
  • Saito (2020) Yuta Saito. 2020. Asymmetric Tri-training for Debiasing Missing-not-at-random Explicit Feedback. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 309–318.
  • Saito et al. (2020) Yuta Saito, Suguru Yaginuma, Yuta Nishino, Hayato Sakata, and Kazuhide Nakata. 2020. Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback. In Proceedings of the 13th International Conference on Web Search and Data Mining. 501–509.
  • Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as Treatments: Debiasing Learning and Evaluation. In International Conference on Machine Learning. PMLR, 1670–1679.
  • Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. 2015. The Self-normalized Estimator for Counterfactual Learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2. 3231–3239.
  • Yahoo! R3 (2022) Yahoo! R3. 2022. R3 - Yahoo! Music Ratings for User Selected and Randomly Selected Songs, version 1.0. URL: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r.
  • Yang et al. (2018) Longqi Yang, Yin Cui, Yuan Xuan, Chenyang Wang, Serge Belongie, and Deborah Estrin. 2018. Unbiased Offline Recommender Evaluation for Missing-Not-At-Random Implicit Feedback. In Proceedings of the 12th ACM Conference on Recommender Systems. 279–287.
  • Zhang et al. (2020) Yichi Zhang, Zhijian Ou, and Zhou Yu. 2020. Task-oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9604–9611.

Appendix A Experiments

Simulating preference elicitation. To evaluate the effects of an unbiased recommendation method, ideally, we need an unbiased held-out dataset collected with a randomized logging policy at the item-topic level, free from the effects of selection bias (Huang et al., 2020; Schnabel et al., 2016). Unfortunately and to the best of our knowledge, no such dataset exists for PE. As a solution, we propose a simple method to simulate a benchmark dataset to evaluate the effects of selection bias in PE. For each topic $t$, we aggregate the ratings from each item $i$ that belongs to the topic, for both the biased training set and the unbiased test set. As a result, we get a biased training set with user-topic interactions and an unbiased test set without the effects of selection bias, to evaluate the performance of various debiasing methods.
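The aggregation step can be sketched as follows. The aggregation function is not specified above, so this sketch assumes that a user's topic rating is the mean of that user's ratings on the items belonging to the topic; the function name and the toy data are hypothetical. The same function would be applied separately to the biased training ratings and to the unbiased test ratings.

```python
from collections import defaultdict


def aggregate_to_topics(item_ratings, item_topics):
    """Map {(user, item): rating} to {(user, topic): aggregated rating}.

    ASSUMPTION: ratings are aggregated with the mean; an item that belongs to
    several topics contributes its rating to each of them.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (user, item), rating in item_ratings.items():
        for topic in item_topics.get(item, ()):
            sums[(user, topic)] += rating
            counts[(user, topic)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy example with two users, two items, and two topics.
item_ratings = {("u1", "i1"): 5, ("u1", "i2"): 3, ("u2", "i2"): 2}
item_topics = {"i1": ["rock"], "i2": ["rock", "jazz"]}
topic_ratings = aggregate_to_topics(item_ratings, item_topics)
print(topic_ratings)
```

Other aggregation choices, such as the maximum or a thresholded mean for binary topic ratings, would fit the same interface.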

Synthetic topic generation. An item’s topic category information is not always guaranteed to be present, for reasons such as privacy constraints from external vendors, noisy or unreliable topic labelling, etc. To deal with this issue, we propose a synthetic topic generation method that only relies on user-item interaction information. Given user-item interactions, we create a bipartite graph $G = \langle V, E \rangle$, where the set of vertices $V$ is divided into two groups, one of which consists of nodes representing users, and the other has nodes representing items. The set $E$ consists of edges between the two groups. Each interaction pair $(u,i)$ results in an edge between the nodes corresponding to $u$ and $i$. Given this bipartite graph, we learn node embeddings via bipartite network embedding (BiNE) (Gao et al., 2018), a graph representation learning method. We make use of a small unbiased test set to generate the bipartite graph, in an attempt to learn unbiased network embeddings. Given the vector representations of all items from the graph embedding method, we use clustering to group the items in the embedding space. We use Gaussian mixture models (Reynolds, 2009) to cluster the embeddings. The cluster centers are considered as the topics.

Coat dataset. This dataset consists of user interactions for a coat-recommendation service, which includes ratings from 290 users on 300 items that are self-selected by users, i.e., these are MNAR ratings (Schnabel et al., 2016). For the unbiased test set, a uniformly random policy is deployed to collect unbiased ratings on 10 items. Items are labelled with topics in the dataset, where each item can belong to multiple categories. Propensity scores $P(O_{u,i} = 1)$ are computed using logistic regression with item covariates.

Hyperparameters. We use 5-fold cross-validation for hyperparameter tuning in all our experiments. We use Adam (Kingma and Ba, 2014) for optimizing the model parameters for the loss functions defined previously. For hyperparameter tuning, we use the self-normalized inverse propensity scoring (SNIPS) estimator (Swaminathan and Joachims, 2015), and optimize for MAE.
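As an illustration of the validation criterion, the sketch below computes a SNIPS estimate of the MAE on held-out biased ratings: each absolute error is weighted by the inverse of its propensity, and the weighted sum is divided by the sum of the weights rather than by the total number of user-topic pairs. This is a minimal sketch under the assumption that propensities are available for the validation entries; the function name and toy values are illustrative.

```python
def snips_mae(preds, targets, propensities):
    """Self-normalized IPS estimate of MAE over observed (biased) ratings."""
    weights = [1.0 / rho for rho in propensities]
    weighted_errors = sum(w * abs(p - t)
                          for w, p, t in zip(weights, preds, targets))
    # Self-normalization: divide by the sum of weights, which typically
    # reduces the variance caused by very small propensities.
    return weighted_errors / sum(weights)


# Toy validation fold with three observed ratings.
preds = [4.0, 2.0, 3.0]
targets = [5.0, 2.0, 2.0]
propensities = [0.5, 0.25, 0.25]
estimate = snips_mae(preds, targets, propensities)
print(estimate)
```

With equal propensities the estimate reduces to the plain MAE over the observed ratings, which makes for a quick sanity check.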