MobileConvRec: A Conversational Dataset for Mobile Apps Recommendations
Abstract.
Existing recommendation systems have focused on two paradigms: (i) historical user-item interaction-based recommendations and (ii) conversational recommendations. Conversational recommendation systems facilitate natural language dialogues between users and the system, allowing the system to solicit users' explicit needs while enabling users to inquire about recommendations and provide feedback. Due to substantial advancements in natural language processing, conversational recommendation systems have gained prominence. Existing conversational recommendation datasets have greatly facilitated research in their respective domains. Despite the exponential growth in mobile users and apps in recent years, research in conversational mobile app recommender systems has faced substantial constraints. This limitation can primarily be attributed to the lack of high-quality benchmark datasets specifically tailored for mobile apps. To facilitate research for conversational mobile app recommendations, we introduce MobileConvRec. MobileConvRec simulates conversations by leveraging real user interactions with mobile apps on the Google Play store, originally captured in the large-scale mobile app recommendation dataset MobileRec. The proposed conversational recommendation dataset synergizes sequential user-item interactions, which reflect implicit user preferences, with comprehensive multi-turn conversations to effectively grasp explicit user needs. MobileConvRec consists of over 12K multi-turn recommendation-related conversations spanning 45 app categories. Furthermore, MobileConvRec presents rich metadata for each app such as permissions data, security and privacy-related information, and binary executables of apps, among others. We demonstrate that MobileConvRec can serve as an excellent testbed for conversational mobile app recommendation through a comparative study of several pre-trained large language models. The dataset is available at https://huggingface.co/datasets/recmeapp/MobileConvRec.
1. Introduction
![Refer to caption](x1.png)
Dataset features | RRGen (rrgen, ) | AARSynth (aarsynth, ) | Srisopha et al. (srisopha-how, )† | PPrior (pprior, )‡ | MobileRec (mobilerec, ) | MobileConvRec |
---|---|---|---|---|---|---|
Multi-turn conversations | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
Multiple interactions by a single user | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
Interaction timestamp | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
Security & privacy-related metadata | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
App executables | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
Number of apps | 58 | 103 | 1,600 | 9,869 | 10,173 | 1,730 |
Number of app categories | 15 | 23 | 32 | 48 | 48 | 45 |
† (srisopha-how, ) is not publicly available. ‡ (pprior, ) contains only negative user reviews. |
In the past decade, mobile apps have seen exponential growth, with over 5 billion users reported (number-of-smartphone-users, ). These apps are utilized for diverse purposes, including productivity, news consumption, entertainment, ride-sharing, and food services, to name a few. Consequently, app distribution channels have seen significant growth. Notably, the Apple App Store (appstore, ) and Google Play (googleplay, ) alone host over 2.2 million and 3.5 million apps, respectively (number-of-apps-on-stores, ). The expanding app marketplaces present a significant challenge for users in efficiently discovering apps that match their preferences. Conversational recommendation systems can play a crucial role in alleviating users' cognitive overload by discerning both their implicit needs, inferred from previous interaction history, and their explicit needs, expressed through conversational interactions. As illustrated in Figure 1, an app recommendation system can suggest new apps to users by integrating both their implicit preferences (i.e., prior installations and interactions with apps) and explicit needs (i.e., current conversation with the system).
Unlike traditional recommendation systems, which primarily rely on users' interaction history, conversational recommendation systems have the potential to: (i) understand users' historical interactions alongside multi-turn natural language dialog, and (ii) generate human-like responses that not only recommend items but also facilitate preference refinement, knowledgeable discussion, and recommendation justification. Conversational recommendation systems have shown remarkable success in a wide range of domains, such as movies (dodge2015evaluating, ; kang2019recommendation, ), music (moon2019opendialkg, ), sports (moon2019opendialkg, ), e-commerce (liu2023u, ; jia2022convrec, ), and travel (liao2021mmconv, ), among others (xu2020user, ). The existence of datasets specifically designed for various domains, featuring multi-turn conversational interactions (dodge2015evaluating, ; he2023large, ; liu2020towards, ), has played a crucial role in advancing the development and refinement of conversational recommendation systems.
Several prominent datasets focus on mobile apps, including RRGen (rrgen, ), AARSynth (aarsynth, ), Srisopha et al. (srisopha-how, ), PPrior (pprior, ), and MobileRec (mobilerec, ), among others. However, RRGen comprises only single-turn interactions and encompasses fewer than 100 apps from fewer than 20 categories. While datasets like AARSynth, Srisopha et al., and PPrior offer millions of interactions, their lack of unique user identifiers renders them inadequate for developing any type of app recommendation system. Moreover, PPrior contains only negative user interactions, and the dataset from Srisopha et al. (srisopha-how, ) is not publicly available. Although MobileRec (mobilerec, ) provides a large-scale dataset with unique user identifiers and has been employed in constructing app recommendation systems, it lacks multi-turn natural language interactions. A comparison of mobile app datasets is presented in Table 1.
In this work, we attempt to bridge this research gap by offering a large-scale, rich, and diverse benchmark dataset, which we call MobileConvRec. This dataset is designed to support researchers in the development of conversational app recommendation systems. We construct MobileConvRec by sampling real user interactions with mobile apps sourced from the Google Play store, originally captured in MobileRec, which serve as the basis for our conversational dataset. Our methodology for simulating natural language dialogues between users and the system is rooted in the sampled interactions, ensuring that the simulation faithfully reflects the user's actual interactions in retrospect. To this end, we develop a theoretical framework designed to process a sequential recommendation dataset containing user interactions with various apps over time (e.g., time-stamped reviews). The framework subsequently generates a conversational recommendation dataset as its output. To streamline the simulation process, we divide it into two steps. First, the simulation generates a dialogue outline at a semantic level. Second, this semantic information is transformed into contextual natural language utterances. Specifically, the conversation is initiated by the computer simulator, which selects a question aimed at understanding the user's interests. This is accomplished by sampling an aspect from global user preferences, following a normalized probability distribution over all aspects. In response to the computer simulator's inquiry, the human simulator provides a reply, considering the review text associated with the sampled interaction. To the best of our knowledge, this is the only recommendation dataset that integrates users' timestamped historical interactions and multi-turn dialogs, enabling the development of effective conversational recommendation systems.
Datasets | #Dialogs | #Turns | #Users | #Apps | Domain(s) |
---|---|---|---|---|---|
FacebookRec (dodge2015evaluating, ) | 1M | 6M | - | - | Movies |
REDIAL (li2018towards, ) | 10K | 182K | 956 | 6,281 | Movies |
GoRecDial (kang2019recommendation, ) | 9K | 170K | - | - | Movies |
OpenDialKG (moon2019opendialkg, ) | 15K | 91K | - | - | Movies, books, sports, music |
TG-ReDial (zhou2020towards, ) | 10K | 129K | 1,482 | - | Movies |
DuRecDial | 10K | - | 10K | - | Movies, music, food, restaurant, news, weather |
CCPE-M (radlinski2019coached, ) | 502 | 11K | - | - | Movies |
INSPIRED (hayati2020inspired, ) | 1K | 35K | 999 | 1,967 | Movies |
Reddit-Movie-Large (he2023large, ) | 85K | 133K | 10K | 24,326 | Movies |
Reddit-Movie-Base (he2023large, ) | 634K | 1.6M | 36K | 51,203 | Movies |
U-NEED (liu2023u, ) | 7K | 333K | - | - | E-commerce |
E-ConvRec (jia2022convrec, ) | 25K | 775K | - | - | E-commerce |
HOOPS (fu2021hoops, ) | - | 11.6M | - | - | E-commerce |
MGConvRec (xu2020user, ) | 7K | 73K | - | - | Restaurant |
MMConv (liao2021mmconv, ) | 5K | 39K | - | - | Travel |
MobileConvRec | 12.2K | 156K | 11.8K | 1,730 | All 45 categories on Google Play† |
† Including food & drink, news & magazines, music, shopping, social, sports, weather, etc. |
MobileConvRec contains over 12.2K multi-turn dialogs involving 11.8K unique users across 1,730 apps spanning 45 categories. These interactions result in over 156K conversation turns. In addition to the basic metadata provided for each app in MobileRec, MobileConvRec offers comprehensive metadata for each app, including permissions, data collection and sharing practices, security policies of app developers, and binary executables of free apps, among other details. Table 3 describes key features of the dataset. Furthermore, we compare our proposed dataset with the latest versions of well-established conversational recommendation datasets across various domains in Table 2.
Through a comparative study utilizing pre-trained large language models (LLMs) such as GPT-2 and Flan-T5, we demonstrate the utility of our dataset in facilitating research in the domain of conversational mobile app recommendations. In our analysis, we present results based on standard evaluation metrics such as Hit@K, NDCG@K, and BLEU for the baseline models. This comprehensive evaluation provides valuable insights into the performance of these models. Notably, our study serves a dual purpose: it lays the foundation for future research in this domain and establishes baseline results that can serve as a benchmark for future comparisons and advancements. Additionally, we identify areas for improvement and potential avenues for further exploration.
Specifically, this work makes the following contributions:
- We present MobileConvRec, the most extensive collection of recommendation-related multi-turn natural language user-system dialogs to date. With over 156K dialog turns spanning more than 1.7K distinct apps sourced from Google Play and covering 45 categories, it is unique in its domain. Notably, this is the only mobile app dataset that features multi-turn conversations.
- Our experimental study showcases the practical utility of MobileConvRec using various state-of-the-art LLMs. Furthermore, we establish baseline results, highlighting the dataset's potential role in driving advancements in conversational mobile app recommendations.
- MobileConvRec comprises rich metadata about apps, facilitating the often-overlooked follow-up question-answering about recommended apps in conversational recommender systems. Furthermore, the availability of executable files can aid in conducting security and privacy-related analyses, mitigating potential biases inherent in developer-provided information.
2. Related Work
Over the past few decades, numerous noteworthy works and datasets have significantly contributed to advancing the understanding and development of conversational recommendation systems. Moreover, there have been efforts to collect datasets for various purposes. Next, we discuss related work in the context of both conversational recommendation and mobile app datasets.
2.1. Datasets for Conversational Recommendations
There are several existing conversational recommendation datasets. We provide a brief list in Table 2. Initial research on conversational recommender systems primarily focused on user preferences among pre-determined choices (dodge2015evaluating, ; christakopoulou2016towards, ). Notably, FacebookRec (dodge2015evaluating, ) is based on four movie dialogue datasets derived from the Facebook movie dialog dataset: a question-answer (QA) dataset, a recommendation dataset, a mixed recommendation and QA dataset, and a general chit-chat dialogue dataset from Reddit. These synthetic datasets were generated using the ratings from the MovieLens dataset (movielens, ) and the Open Movie Database (OMDb). The recommendation dataset is synthetically generated, providing single movie names as answers. The Reddit dataset shares similarities, involving natural conversations about movies, but the discourse is more free-form and not primarily focused on obtaining recommendations.
In recent times, several studies and models have emerged that focus on engaging users in natural language multi-turn dialogs. These efforts prioritize real-time responses through sentiment analysis and seek to deliver desired recommendations (li2018towards, ; zhou2020towards, ). Crowd-sourced datasets such as ReDial (li2018towards, ), DuRecDial (liu2020towards, ), GoRecDial (kang2019recommendation, ), and INSPIRED (hayati2020inspired, ) are human-annotated with predefined goals, such as item recommendation and goal planning. These goal-oriented datasets integrate elements of chitchat and task-oriented dialogs, specifically in the context of recommendation tasks. A variant of ReDial called TG-ReDial (zhou2020towards, ) utilizes topic prediction to recommend movies. OpenDialKG (moon2019opendialkg, ) is built on top of Freebase to model dialogue logic through the traversal of the knowledge graph.
The DuRecDial (liu2020towards, ) dataset focuses on multilingual and cross-lingual conversational recommendation. Both the E-ConvRec (jia2022convrec, ) and U-NEED (liu2023u, ) datasets are proposed for e-commerce conversational recommendation. E-ConvRec (jia2022convrec, ) features dialogs on pre-sales topics between users and customer service staff, while U-NEED (liu2023u, ) provides fine-grained annotations for user needs in pre-sales dialogs, covering five popular categories and including user behaviors before and after the conversations, facilitating the development and evaluation of conversational recommender systems. The HOOPS (fu2021hoops, ) dataset for e-commerce builds a knowledge graph from Amazon reviews (amazon-dataset, ) to extract key entities, forming user-item interactions. Dialogs are then synthesized using templates, enabling the generation of substantial data for training policy and recommendation modules in conversational recommender systems. MGConvRec (xu2020user, ) focuses on facilitating restaurant bookings, while MMConv (liao2021mmconv, ) introduces multi-domain conversations, specifically within the context of travel. The Reddit-Movie datasets (base and large) (he2023large, ) support empirical studies on conversational recommendation tasks using LLMs in a zero-shot setting.
These datasets have played a significant role in the development of several conversational recommendation systems (li2018towards, ; fu2021hoops, ; liu2020towards, ; liu2023u, ; he2023large, ) in their respective domains. We anticipate that the proposed dataset will play a similar role in stimulating research on effective conversational app recommender systems. A comparison of existing conversational recommendation datasets with MobileConvRec, as presented in Table 2, indicates that the proposed dataset is on par with these datasets in its key attributes.
2.2. Datasets for Mobile Apps
Several datasets exist for user interactions with mobile apps, as listed in Table 1. Khalid et al. (khalid2014mobile, ; khalid2013identifying, ) provided a dataset of iOS apps consisting of 6,390 user reviews for 20 apps. Maalej and Nabil (maalej2015bug, ) collected the first large dataset with 1.3 million reviews for over 1,100 apps; their dataset focuses on user problems and understanding the user-developer dialogue. It is important to note that these datasets are not publicly available. Top 20 Apps (top20-dataset, ) is publicly available and contains 200K reviews for 20 apps spanning 9 categories. This dataset provides rating scores and text for the reviews. RRGen (rrgen, ) has more than 309K reviews spanning 58 apps. Similar to Top 20 Apps, RRGen provides only the text of reviews with rating scores. Neither of these datasets provides app metadata, a unique user identifier, or the review timestamp. AARSynth (aarsynth, ) collected over two million user reviews for over a hundred apps, including app metadata. Like the datasets mentioned earlier, its reviews lack a unique user identifier and a review timestamp.
Srisopha et al. (srisopha-how, ) collected over 9 million user reviews from 1,600 apps. This dataset has review timestamps, which can help to understand reviews in the context of time. However, this dataset includes neither a unique user identifier nor app metadata. Moreover, Srisopha et al. did not make this dataset publicly available. The PPrior (pprior, ) dataset provided more than 2 million reviews for over 9 thousand apps covering categories from Google Play. This dataset provides rating scores, review text, and timestamps of reviews. However, it does not include user identifiers for interactions (i.e., reviews) and lacks app metadata. Additionally, the PPrior dataset contains only negative user reviews. More recently, MobileRec (mobilerec, ) provided over 19.3 million user interactions with interaction timestamps. However, MobileRec does not provide multi-turn conversations. We complement MobileRec by building a conversational dataset on top of MobileRec, which spans 48 categories on Google Play. Furthermore, MobileConvRec includes security and privacy metadata along with executables for the apps. This makes our dataset an ideal testbed for further research on conversational recommendation for mobile apps as well as on security and privacy perspectives.
![Refer to caption](x2.png)
3. Dataset
The proposed conversational recommendation dataset has been curated in an end-to-end fashion with minimal human intervention. We first outline the theoretical framework that supports the dataset generation process. We then provide implementation details about each step of the dataset creation. Figure 2 provides an overview of the framework.
3.1. Theoretical Framework
At a high level, we develop a framework that ingests a sequential recommendation dataset containing user interactions over time (e.g., time-stamped reviews) with various apps. This framework then generates a conversational recommendation dataset as output. Formally, the framework $f$ maps a traditional recommendation dataset $\mathcal{D}$ to a conversational recommendation dataset $\mathcal{C}$:
$$f: \mathcal{D} \rightarrow \mathcal{C}, \qquad \mathcal{C} = \{d_1, d_2, \ldots, d_N\},$$
where each dialog $d_i$ consists of $T_i$ natural language interaction turns. Each turn $t$ consists of a user utterance $u_t$ and a system utterance $s_t$. Furthermore, at the $k$-th dialog turn, the system utterance $s_k$ presents a recommendation for an app to the user.
At the dialog level, the framework randomly selects a user interaction with an app along with the corresponding user profile. This user profile is built using the user's historical interactions. Additionally, the framework incorporates global user preferences regarding the significance of different aspects of the apps (e.g., customization). More precisely, the framework simulates a dialog between the user and the system, with the sampled user interaction serving as a reference to guide the conversational flow. We break down the simulation into two steps for simplicity. In the first step, the simulation generates the dialog outline at a semantic level, while the second step converts this semantic information into contextual natural language utterances. Formally, the first step takes as input the global user preferences $\mathcal{G}$, a sampled interaction $x$, and the corresponding user profile $\mathcal{P}_u$. It then maps this input to the semantic-level dialog outline $\hat{d}$. Subsequently, the second step maps the semantic-level dialog outline $\hat{d}$ to the natural language dialog $d$:
$$(\mathcal{G}, x, \mathcal{P}_u) \mapsto \hat{d}, \qquad \hat{d} \mapsto d.$$
In the first step, the user-system simulation follows a straightforward protocol called REQUEST-RESPONSE. In this protocol, either the user or the system can request information about any aspect using the message line "aspect-name = ?". The response line takes the form "aspect-name = value", where the value can either be an actual value (e.g., price = free) or a value between 0 and 1 representing the degree of importance attributed to the requested aspect (e.g., customization = 0.7). Formally, at turn $t$ of dialog $d$, the system simulator samples (without replacement) an aspect $a_t$ according to a weighted distribution over all aspects from the global user preferences $\mathcal{G}$ and requests information regarding that aspect. The user simulator then provides the value $v_t$ for the requested aspect $a_t$, taking into account the sampled interaction $x$ and the user's profile $\mathcal{P}_u$:
$$a_t \sim \mathcal{G}, \qquad v_t = \mathrm{UserSim}(a_t, x, \mathcal{P}_u),$$
where the probability values for the tuples in $\mathcal{G}$ are determined by analyzing the proportions of users who deem a particular aspect important in the dataset $\mathcal{D}$. Meanwhile, the user simulator leverages the user profile $\mathcal{P}_u$ and the sampled interaction $x$ to collect the value $v_t$ for the requested aspect $a_t$. This mechanism also accommodates users requesting information about the recommended app later in the conversation.
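The REQUEST-RESPONSE step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example preference values, and the dictionary encoding of the sampled review are hypothetical, and the rule that an unmentioned aspect receives the value 0 is taken from the simulation description in Section 3.2.4.

```python
import random


def sample_without_replacement(preferences, k, rng):
    """Draw k distinct aspects; each draw is weighted by the remaining preference mass."""
    remaining = dict(preferences)
    chosen = []
    for _ in range(min(k, len(remaining))):
        aspects = list(remaining)
        weights = [remaining[a] for a in aspects]
        pick = rng.choices(aspects, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # an aspect is never requested twice
    return chosen


def simulate_outline(preferences, interaction_aspects, num_questions, seed=0):
    """Return semantic-level turns as (request line, response line) pairs."""
    rng = random.Random(seed)
    outline = []
    for aspect in sample_without_replacement(preferences, num_questions, rng):
        # Aspects that the sampled review does not discuss get the value 0.
        value = interaction_aspects.get(aspect, 0)
        outline.append((f"{aspect} = ?", f"{aspect} = {value}"))
    return outline


# Hypothetical inputs, for illustration only.
global_prefs = {"performance": 0.5, "price": 0.3, "customization": 0.2}
review_aspects = {"price": "free", "customization": 0.7}
outline = simulate_outline(global_prefs, review_aspects, num_questions=3)
for request, response in outline:
    print(request, "->", response)
```

Because sampling is weighted rather than deterministic, different seeds yield different question orders while frequently mentioned aspects still tend to be asked first.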
The second step is straightforward and involves utilizing an off-the-shelf pre-trained LLM, such as ChatGPT, to transform the semantic-level turns into coherent contextual natural language dialogs. Specifically, at turn $t$, we provide the natural language dialog context $d_{<t}$ along with the semantic-level information $a_t$ or $v_t$, using the appropriate prompt, to the model. The model produces the corresponding natural language utterance, $s_t$ for the system or $u_t$ for the user:
$$s_t = \mathrm{LLM}(d_{<t}, a_t), \qquad u_t = \mathrm{LLM}(d_{<t}, v_t).$$
It is crucial to emphasize that the simulated conversation remains grounded in the sampled interaction, ensuring that the simulation closely aligns with the user's actual interaction in hindsight. Moreover, the recommended app always corresponds to the one involved in the sampled interaction.
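To make the second step concrete, the sketch below shows one way a semantic-level line and the dialog context could be assembled into a prompt. The prompt wording and the `build_prompt` helper are assumptions made for illustration; the actual prompts are not given in this section, and the call to the language model itself is deliberately left out.

```python
def build_prompt(context, speaker, semantic_line):
    """Assemble a text prompt asking an LLM to verbalize one semantic-level line.

    context: list of (speaker, utterance) pairs generated so far.
    speaker: "user" or "system", the side whose utterance is generated next.
    semantic_line: a REQUEST-RESPONSE line such as "price = ?" or "price = free".
    """
    history = "\n".join(f"{who}: {text}" for who, text in context)
    if not history:
        history = "(start of conversation)"
    return (
        "Conversation so far:\n"
        f"{history}\n\n"
        f"Write the next {speaker} utterance in natural language.\n"
        f"It must express exactly this information: {semantic_line}\n"
    )


# Hypothetical context and semantic line, for illustration only.
context = [("system", "Hi! What kind of app are you looking for?")]
prompt = build_prompt(context, "user", "customization = 0.7")
print(prompt)
```

Keeping the semantic line explicit in the prompt is what ties the generated utterance back to the sampled interaction.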
3.2. Conversational Dataset Construction
We construct MobileConvRec using the theoretical framework from Section 3.1 and a large-scale dataset for mobile app recommendations, called MobileRec (mobilerec, ). In the following, we provide details of each step and design choice.
3.2.1. Topic Modeling for App Aspect Extraction
First, we identify the key aspects of mobile apps that users prioritize. Our approach leverages the extensive MobileRec dataset, which comprises 19.3 million user reviews across diverse mobile apps. To obtain an efficient and scalable methodology, we employ the BERTopic (grootendorst2022bertopic, ) library for unsupervised topic modeling. For the vectorization of user review data, we utilize sentence transformers (reimers-2019-sentence-bert, ), specifically the pre-trained all-mpnet-base-v1 model. We select this model for text embedding because of its state-of-the-art performance across various natural language understanding benchmarks.
After obtaining the high-dimensional embeddings for user reviews, the next step involves dimensionality reduction. The goal is to reduce the dimensionality of the embeddings while retaining the relevant semantic information for effective topic modeling. In particular, we use Uniform Manifold Approximation and Projection (UMAP) (mcinnes2018umap, ) as the dimensionality reduction technique. UMAP can capture both local and global structures of the high-dimensional space in lower dimensions. Following dimensionality reduction, we employ the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm to cluster reviews that share similar aspects. HDBSCAN is chosen for its ability to identify clusters of varying shapes and densities in the data. Moreover, HDBSCAN is robust to noise and outliers, which are common in user reviews.
After the clustering is performed, we analyze the most significant terms in each cluster to extract the cluster-representative aspects. We experiment with n-gram ranges between 1 and 4. After investing some manual effort in consolidating semantically similar aspects, we arrive at the following unordered list of 20 aspects. These aspects form the foundation of the simulation framework: (i) User Interface Design; (ii) Navigation; (iii) Accessibility; (iv) Customization; (v) Functionality; (vi) Performance; (vii) Responsiveness; (viii) Security; (ix) Privacy; (x) Permissions; (xi) Data Collection; (xii) Data Sharing; (xiii) Updates; (xiv) Customer support; (xv) Reviews and ratings; (xvi) Developer; (xvii) Price; (xviii) In-app purchases; (xix) Advertisement Frequency; (xx) Battery Drainage.
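The pipeline described in this subsection (sentence embeddings, UMAP, HDBSCAN, and n-gram term extraction) can be wired together in BERTopic roughly as follows. This is a sketch under stated assumptions rather than the configuration used for the dataset: the UMAP and HDBSCAN hyperparameters, the English stop-word filter, and the helper names are illustrative choices, and only the embedding model name and the n-gram range come from the text. It requires the `bertopic`, `sentence-transformers`, `umap-learn`, `hdbscan`, and `scikit-learn` packages.

```python
NGRAM_RANGE = (1, 4)                    # n-gram range reported in the text
EMBEDDING_MODEL = "all-mpnet-base-v1"   # embedding model named in the text


def build_topic_model(min_cluster_size=50):
    """Wire sentence embeddings, UMAP, and HDBSCAN into a BERTopic model.

    All hyperparameter values below are assumptions, not reported settings.
    """
    # Imported lazily so this module loads even when the optional
    # topic-modeling packages are not installed.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(EMBEDDING_MODEL),
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size, prediction_data=True),
        vectorizer_model=CountVectorizer(ngram_range=NGRAM_RANGE, stop_words="english"),
    )


def top_terms_per_topic(reviews):
    """Fit the model on review texts and return the top terms of each cluster."""
    model = build_topic_model()
    model.fit_transform(reviews)
    # Topic -1 collects the reviews that HDBSCAN labels as noise.
    return {
        topic_id: [term for term, _ in terms]
        for topic_id, terms in model.get_topics().items()
        if topic_id != -1
    }
```

The returned term lists are the raw material for the manual consolidation into aspects; that consolidation step is not automated here.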
Feature | Description |
---|---|
UID | |
Recommendation | |
App Metadata | |
3.2.2. Global User Preferences
After extracting app aspects from the dataset, we gather global user preferences in a structured format. We utilize gpt-3.5-turbo to analyze user reviews and identify the aspects being discussed in each review. Subsequently, we aggregate this information to compute the percentage of reviews that address each specific aspect. We refer to these statistics as the global user preferences $\mathcal{G}$. This analysis provides the relative importance of different aspects from a global perspective. For instance, if a large proportion of reviews mention the "performance" aspect, it indicates that performance is a significant consideration for users.
These statistics play a key role in guiding the interactions of the computer simulator during semantic-level conversations. Prioritizing aspects that are frequently mentioned in user reviews increases the likelihood of a successful conversation. Conversely, querying about aspects that garner little attention from users may not only be unhelpful for recommendation purposes but also risk the failure of the conversation. It is crucial to note that when selecting an aspect to inquire from the user, we adopt a probabilistic approach. This means that aspects with higher probabilities are given higher priority. This approach is chosen over a deterministic one to ensure diverse and dynamic conversations. A deterministic approach would result in repetitive conversations, with the system side of the dialogue following the same order consistently. By employing a probabilistic approach, we introduce variability and spontaneity into the conversations.
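The aggregation and normalization described above can be sketched in a few lines of Python. The input format (one list of aspect labels per review) and the function names are assumptions made for illustration; in the paper the per-review labels come from gpt-3.5-turbo, which is not reproduced here.

```python
from collections import Counter


def global_preferences(review_aspects):
    """Map each aspect to the fraction of reviews that mention it."""
    if not review_aspects:
        return {}
    counts = Counter()
    for aspects in review_aspects:
        counts.update(set(aspects))  # count an aspect at most once per review
    total = len(review_aspects)
    return {aspect: count / total for aspect, count in counts.items()}


def normalize(preferences):
    """Rescale the fractions so they sum to 1 and can be used for sampling."""
    total = sum(preferences.values())
    return {aspect: share / total for aspect, share in preferences.items()}


# Hypothetical per-review aspect labels, for illustration only.
labeled_reviews = [
    ["performance", "price"],
    ["performance"],
    ["privacy", "performance"],
    ["price"],
]
prefs = global_preferences(labeled_reviews)
dist = normalize(prefs)
print(prefs)
print(dist)
```

The raw fractions need not sum to 1 because a review can mention several aspects, which is why a separate normalization step precedes sampling.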
3.2.3. Interaction Sampling and User Profile
To ensure that conversations are grounded in real interactions and to guide the user simulator effectively during semantic-level conversations, it is essential to sample user interactions from the traditional sequential recommendation dataset. To sample user interactions, we employ a weighted distribution that assigns a higher probability of selection to interactions that have longer reviews and diverse categories. The distribution is defined over weights $w_i$, where $w_i$ denotes the weight of interaction $i$ and combines $\ell_i^2$, the squared review length for interaction $i$, with a diversity term that is inversely proportional to the number of apps already sampled within the same category as $i$. This approach ensures that longer reviews are more likely to be selected while also promoting diversity in the sampled interactions across different app categories. Furthermore, longer reviews tend to cover a broader spectrum of topics, aspects, and sentiments related to the app. This increased information density enhances the likelihood of gaining deeper insight into user preferences, thereby offering more informative guidance for the user simulator.
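A minimal sketch of such length- and diversity-aware sampling is given below. Several details are assumptions, since the text does not pin them down: the weight is taken as the product of the squared review length and the diversity term, review length is measured in words, and diversity is computed as 1 / (1 + number of interactions already sampled from the same category).

```python
import random
from collections import Counter


def interaction_weight(review, category, sampled_per_category):
    """Assumed weight: squared review length (in words) times a diversity term."""
    length_sq = max(1, len(review.split())) ** 2
    diversity = 1.0 / (1 + sampled_per_category[category])
    return length_sq * diversity


def sample_interactions(interactions, n, seed=0):
    """Sample n (review, category) pairs without replacement.

    Weights are recomputed after every draw so that categories which were
    already sampled become less likely to be picked again.
    """
    rng = random.Random(seed)
    pool = list(interactions)
    sampled_per_category = Counter()
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [interaction_weight(r, c, sampled_per_category) for r, c in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        review, category = pool.pop(index)
        sampled_per_category[category] += 1
        chosen.append((review, category))
    return chosen


# Hypothetical interactions, for illustration only.
interactions = [
    ("great app", "games"),
    ("the interface is clean and the app rarely crashes", "productivity"),
    ("too many ads", "games"),
    ("love the privacy controls and the fast sync across devices", "tools"),
]
sample = sample_interactions(interactions, n=2)
print(sample)
```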
To construct the user profile, we gather all previous interactions associated with the user involved in the sampled interaction $x$. We accumulate the values for all the aspects and refer to the profile for the $u$-th user as $\mathcal{P}_u$. Both the user profile and the sampled interaction collectively guide the user simulator. They provide essential information and insights into the user's preferences, behaviors, and past interactions, facilitating the generation of contextually relevant responses in the simulation framework.
3.2.4. User System Simulation
The computer simulator initiates the conversation by asking a question aimed at understanding the user's interests. This involves sampling an aspect from global user preferences according to the normalized probability distribution over the aspects. We employ gpt-3.5-turbo to generate a contextual natural language query, with the prompt guiding the formulation of the question. The human simulator responds to the question posed by the computer simulator. If the sampled review discusses the aspect in question, the human simulator provides an answer. Otherwise, the response indicates disinterest (i.e., value 0) in that particular aspect. Once again, gpt-3.5-turbo is utilized to generate a natural language response, with the prompt instructing the model to formulate a response based on the aspect value and the conversational context. Following several exchanges, at turn $k$, a recommendation is presented to the user. This recommendation corresponds to the actual app that the user had interacted with in the sampled interaction $x$. Following the recommendation, the user simulator may pose follow-up questions based on the user's profile characteristics. These questions are sampled in accordance with the user's preferences captured in the user profile, such as a propensity for prioritizing security concerns. Subsequently, the computer simulator provides answers to the user simulator's inquiries, drawing upon the rich metadata about apps.
We employ automatic detection mechanisms to identify failed simulations, particularly focusing on scenarios where the system queries multiple aspects in which the user expresses no interest. Conversations of this nature fail to contribute meaningfully to guiding recommendation models due to their lack of relevance to the user's preferences. Moreover, cases where no user history is available have also been sampled, enabling the simulation of cold-start scenarios. This approach ensures that recommendation models cannot get away with only considering the historical interactions of the users. The final dataset has undergone human verification to ensure the informativeness and coherence of the dialogs.
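One simple way to detect such failed simulations is to count how many requested aspects received the value 0 in the semantic-level outline. The sketch below illustrates this idea only; the threshold `max_uninterested` and the dictionary encoding of a conversation are hypothetical, and the detection mechanism actually used for the dataset is not specified in the text.

```python
def is_failed_simulation(responses, max_uninterested=2):
    """Flag a simulated dialog when too many requested aspects got the value 0.

    responses: maps each requested aspect to the value returned by the user
    simulator (0 means the user expressed no interest in that aspect).
    max_uninterested: hypothetical threshold, chosen for illustration.
    """
    uninterested = sum(1 for value in responses.values() if value == 0)
    return uninterested > max_uninterested


# Hypothetical semantic-level responses, for illustration only.
engaged = {"price": "free", "customization": 0.7, "performance": 0}
disengaged = {"navigation": 0, "updates": 0, "developer": 0, "price": "free"}
print(is_failed_simulation(engaged))
print(is_failed_simulation(disengaged))
```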
![Refer to caption](x3.png)
3.3. Dataset Features
The proposed dataset comprises several features, encompassing users' historical interactions, explicit interest elicitation turns, an annotated recommendation turn, and subsequent question-answer exchanges about the recommended app. Table 3 provides detailed descriptions of the important features of MobileConvRec.
In addition to the basic metadata available in MobileRec, we have broadened our dataset's scope to encompass more exhaustive metadata attributes. This expansion includes comprehensive details regarding the permissions sought by apps from users, the data collected and its purpose, data-sharing practices with third parties, and security measures governing data transmission and sharing. Furthermore, we provide access to the executable (i.e., .apk) files of free apps, enabling researchers to analyze the actual binary code should they desire a deeper exploration.
The sharing of .apk files will be restricted to research and educational purposes only, as per Google Play store policies that prohibit the open distribution of executable files. Access will be granted exclusively to academic researchers upon request. The conversations and app metadata are accessible to the public via Hugging Face datasets.
![Refer to caption](x4.png)
3.4. Dataset Analysis
In Figure 3, we showcase the top 10 app categories. Notably, these categories collectively constitute over 50% of the dataset, closely matching the composition of the original recommendation dataset, MobileRec. Our sampling approach includes apps from all 45 categories while maintaining proportions that reflect the real-world popularity of these categories.
In Figure 4, we depict the distribution of the number of turns per dialogue. Most dialogues consist of 10 to 16 turns, where each turn comprises one user utterance and one system utterance. We believe that this distribution poses a reasonable challenge for recommendation models while also providing ample learning opportunities.
In Figure 5, we illustrate the distribution of the number of words per turn. The dataset is diverse in this respect, containing turns that range from brief exchanges to more extensive ones. This breadth is expected to help recommendation models handle a wide spectrum of conversations and adapt to diverse user interactions.
![Refer to caption](x5.png)
3.5. Potential Usage Scenarios
The proposed dataset, MobileConvRec, serves as a diverse resource for training both sequential and conversational recommender systems, offering a blend of users' historical interactions and conversational context. This integration gives recommender systems an opportunity to obtain richer insights and refine their recommendations with a deeper understanding of user preferences. Notably, existing research has predominantly focused on either historical interactions or conversational context, often overlooking the potential synergies afforded by combining the two.
The inclusion of rich metadata about apps further enhances the dataset's utility, enabling follow-up question-answering functionality that has been largely neglected by conversational recommender systems thus far. Moreover, the availability of executable files offers an opportunity for security and privacy-related analyses, including verification of functional correctness and program analysis of the apps. Developer-provided information may be incomplete or biased, whereas executable files allow a more robust evaluation. By analyzing the executables, researchers can explore the inner workings of the apps and uncover potential security vulnerabilities and privacy concerns. Such analyses hold promise for bolstering the safety and security of recommendations, thereby enhancing user trust and confidence.
In this work, our primary focus is on establishing baseline results for conversational mobile app recommendation. We anticipate that the dataset has many further applications and encourage the broader research community to explore them.
Input | Model | Success Rate |
---|---|---|
Dialog Context | GPT-2 | 65.6 % |
Dialog Context | Flan-T5 | 36.4 % |
Dialog Context + Previous Interactions | GPT-2 | 66.6 % |
Dialog Context + Previous Interactions | Flan-T5 | 37.7 % |
Dialog Context + Sampled Candidates | GPT-2 | 85.3 % |
Dialog Context + Sampled Candidates | Flan-T5 | 86.0 % |
Dialog Context + Similar Candidates | GPT-2 | 47.9 % |
Dialog Context + Similar Candidates | Flan-T5 | 53.6 % |
Dialog Context + Sampled Candidates + Previous Interactions | GPT-2 | 82.6 % |
Dialog Context + Sampled Candidates + Previous Interactions | Flan-T5 | 86.3 % |
Dialog Context + Similar Candidates + Previous Interactions | GPT-2 | 47.2 % |
Dialog Context + Similar Candidates + Previous Interactions | Flan-T5 | 54.7 % |
4. Baselines
We fine-tuned pre-trained GPT-2 [34] and Flan-T5 [6] models to establish baseline results across various experimental setups.
Input | Model | Hit@1 | Hit@2 | Hit@3 | Hit@4 | Hit@5 | Hit@6 | Hit@7 | Hit@8 | Hit@9 | Hit@10 |
---|---|---|---|---|---|---|---|---|---|---|---|
Dialog Context + Sampled Candidates | GPT-2 | 0.596 | 0.691 | 0.741 | 0.774 | 0.797 | 0.820 | 0.835 | 0.848 | 0.864 | 0.879 |
Dialog Context + Sampled Candidates | Flan-T5 | 0.893 | 0.938 | 0.951 | 0.958 | 0.962 | 0.965 | 0.966 | 0.970 | 0.972 | 0.973 |
Dialog Context + Similar Candidates | GPT-2 | 0.272 | 0.371 | 0.436 | 0.485 | 0.527 | 0.564 | 0.600 | 0.638 | 0.673 | 0.701 |
Dialog Context + Similar Candidates | Flan-T5 | 0.567 | 0.702 | 0.756 | 0.800 | 0.832 | 0.848 | 0.862 | 0.874 | 0.884 | 0.895 |
Dialog Context + Sampled Candidates + Previous Interactions | GPT-2 | 0.706 | 0.791 | 0.823 | 0.846 | 0.861 | 0.879 | 0.890 | 0.899 | 0.908 | 0.914 |
Dialog Context + Sampled Candidates + Previous Interactions | Flan-T5 | 0.897 | 0.944 | 0.956 | 0.961 | 0.963 | 0.967 | 0.970 | 0.973 | 0.973 | 0.977 |
Dialog Context + Similar Candidates + Previous Interactions | GPT-2 | 0.343 | 0.460 | 0.526 | 0.577 | 0.622 | 0.653 | 0.686 | 0.712 | 0.744 | 0.766 |
Dialog Context + Similar Candidates + Previous Interactions | Flan-T5 | 0.577 | 0.715 | 0.777 | 0.814 | 0.836 | 0.856 | 0.873 | 0.883 | 0.893 | 0.905 |
Input | Model | NDCG@1 | NDCG@2 | NDCG@3 | NDCG@4 | NDCG@5 | NDCG@6 | NDCG@7 | NDCG@8 | NDCG@9 | NDCG@10 |
---|---|---|---|---|---|---|---|---|---|---|---|
Dialog Context + Sampled Candidates | GPT-2 | 0.596 | 0.656 | 0.681 | 0.695 | 0.704 | 0.713 | 0.718 | 0.721 | 0.726 | 0.731 |
Dialog Context + Sampled Candidates | Flan-T5 | 0.893 | 0.921 | 0.928 | 0.931 | 0.933 | 0.934 | 0.934 | 0.935 | 0.936 | 0.936 |
Dialog Context + Similar Candidates | GPT-2 | 0.272 | 0.335 | 0.367 | 0.388 | 0.404 | 0.417 | 0.429 | 0.441 | 0.452 | 0.460 |
Dialog Context + Similar Candidates | Flan-T5 | 0.567 | 0.652 | 0.679 | 0.698 | 0.710 | 0.716 | 0.721 | 0.724 | 0.728 | 0.731 |
Dialog Context + Sampled Candidates + Previous Interactions | GPT-2 | 0.706 | 0.760 | 0.776 | 0.786 | 0.792 | 0.798 | 0.802 | 0.804 | 0.807 | 0.809 |
Dialog Context + Sampled Candidates + Previous Interactions | Flan-T5 | 0.897 | 0.926 | 0.932 | 0.935 | 0.935 | 0.937 | 0.938 | 0.939 | 0.939 | 0.940 |
Dialog Context + Similar Candidates + Previous Interactions | GPT-2 | 0.343 | 0.417 | 0.450 | 0.472 | 0.489 | 0.500 | 0.511 | 0.519 | 0.529 | 0.535 |
Dialog Context + Similar Candidates + Previous Interactions | Flan-T5 | 0.577 | 0.664 | 0.695 | 0.711 | 0.720 | 0.727 | 0.732 | 0.735 | 0.738 | 0.742 |
4.1. Experimental Settings
To ensure robust evaluation, we partitioned the data into distinct training, validation, and testing sets based on the date of interaction. Specifically, interactions occurring on earlier dates were allocated to the training and validation sets, with the most recent dates exclusively designated for testing. This approach ensures that the models are evaluated on interactions that occur after those seen during training. We consider the following experimental setups.
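A date-based partition of this kind can be sketched as below. This is a minimal illustration under stated assumptions: the `date` field name and the two cut-off dates are hypothetical, since the paper does not report the actual split boundaries.

```python
from datetime import date


def temporal_split(dialogs, val_start, test_start):
    """Assign each dialog to train, validation, or test by interaction date.

    Dialogs dated on or after `test_start` go to test, those on or after
    `val_start` go to validation, and everything earlier goes to train.
    """
    train, val, test = [], [], []
    for dialog in dialogs:
        if dialog["date"] >= test_start:
            test.append(dialog)
        elif dialog["date"] >= val_start:
            val.append(dialog)
        else:
            train.append(dialog)
    return train, val, test


# Hypothetical dialogs and cut-off dates.
dialogs = [
    {"id": 1, "date": date(2021, 3, 1)},
    {"id": 2, "date": date(2022, 1, 15)},
    {"id": 3, "date": date(2022, 6, 30)},
]
train, val, test = temporal_split(dialogs, date(2022, 1, 1), date(2022, 6, 1))
```

Splitting on time rather than at random keeps the most recent interactions out of training, so test performance reflects predictions about interactions the model has not yet seen.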
Recommendation Generation. We assess the models' ability to generate app names as recommendations through fuzzy matching between the ground truth and the generated app names. This experiment involves four different types of inputs to the models: (i) only the dialog context; (ii) the combination of the dialog context and users' historical interactions; (iii) the combination of the dialog context and the set of candidate apps available for recommendation; and (iv) the combination of the dialog context, the set of candidate apps, and previous interactions. In all our experiments, the number of candidate apps is set to 25: one is the ground-truth app that the user interacted with, and the other 24 are distinct apps. These 24 candidate apps are generated in one of two ways: (i) randomly sampling 24 apps (Sampled Candidates); or (ii) selecting them from a group of apps similar to the ground-truth recommended app (Similar Candidates).
Candidate Apps Ranking. We assess the models' ability to rank a set of candidate apps. We consider two different types of inputs to the models: (i) the combination of the dialog context and the set of candidate apps; and (ii) the combination of the dialog context, the set of candidate apps, and the users' historical interactions. As in the previous experiment, we use two sets of candidate apps: (i) Sampled Candidates, which consists of the ground-truth app and 24 apps randomly selected from the entire app pool, and (ii) Similar Candidates, which includes the ground-truth app and 24 apps similar to the ground-truth app.
Response Generation. In this experiment, we evaluate the models' proficiency in generating appropriate responses based on dialog context. This encompasses the models' ability to elicit users' preferences, recommend suitable apps in natural language text, and respond to users' inquiries about the recommended apps.
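The Sampled Candidates construction described above can be sketched as follows. The function name, the app identifiers, the fixed seed, and the final shuffle are illustrative assumptions; the paper only states that 24 apps are sampled at random and combined with the ground-truth app.

```python
import random


def sampled_candidates(ground_truth, app_pool, n_candidates=25, seed=0):
    """Return the ground-truth app plus (n_candidates - 1) distinct random apps."""
    rng = random.Random(seed)
    # Exclude the ground truth so it cannot be drawn a second time.
    others = [app for app in app_pool if app != ground_truth]
    candidates = rng.sample(others, n_candidates - 1) + [ground_truth]
    # Shuffle so the position of the ground truth carries no signal (an assumption).
    rng.shuffle(candidates)
    return candidates


# Hypothetical pool of 100 app identifiers.
app_pool = [f"app_{i}" for i in range(100)]
cands = sampled_candidates("app_7", app_pool)
```

A Similar Candidates set could reuse the same function by passing a pool restricted to apps similar to the ground truth; how that similarity is computed is not covered by this sketch.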
4.2. Evaluation Metrics
For each experimental setup, we use well-established metrics. (i) For the recommendation generation task, we employ the success rate metric. Specifically, the success rate is the percentage of test cases in which the generated app name and the ground-truth app name have a Levenshtein distance similarity ratio [22] of more than 0.95. (ii) For the app ranking task, we use the standard metrics [17, 39, 23] Hit@K and NDCG@K, where K ranges from 1 to 10. (iii) For the response generation task, we use the BLEU score [33].
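The metrics above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation script behind the reported numbers: the similarity ratio is taken here to be one minus the edit distance divided by the longer string's length (the paper does not give the exact normalization), and NDCG assumes a single relevant item per test case.

```python
import math


def levenshtein(a, b):
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_ratio(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def success_rate(generated, ground_truth, threshold=0.95):
    """Percentage of generated names whose ratio to the truth exceeds the threshold."""
    hits = sum(similarity_ratio(g, t) > threshold for g, t in zip(generated, ground_truth))
    return 100.0 * hits / len(ground_truth)


def hit_at_k(ranked, target, k):
    """1.0 if the target appears among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


rate = success_rate(["VLC for Android", "MX Player"], ["VLC for Android", "MX Player Pro"])
```

In this toy example the second generated name is four edits away from a 13-character ground truth, giving a ratio of about 0.69, so only one of the two generations counts as a success.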
5. Results and Discussion
Table 4 presents the results of the recommendation generation experiment, in which models are trained to directly generate recommended app names. Incorporating candidate apps (Sampled Candidates) as input substantially enhances the models' ability to generate the recommended app names accurately. Specifically, we observe a 136.26% relative improvement in the success rate of the Flan-T5 model when both dialog context and sampled candidate apps are provided, compared to when only dialog context is used (86.0 vs 36.4). The GPT-2 model also improves, by a smaller relative margin of about 30% (85.3 vs 65.6). This improvement was anticipated, as providing 25 sampled candidate apps as input greatly reduces the pool of apps to choose from (from over 1.7K in the training data to just 25). Furthermore, since the names of the apps are provided in the context, the models are less prone to errors during generation. There is no definitive winner across all settings: GPT-2 is stronger when no candidates are given, whereas Flan-T5 outperforms GPT-2 whenever the input includes candidate apps, whether Similar Candidates or Sampled Candidates. These findings suggest that Flan-T5 is more proficient than GPT-2 at selecting the appropriate app from candidates provided as context.
Additionally, the impact of incorporating historical interactions is ambiguous. Intuitively, historical interactions should enhance performance, yet in Table 4 they change the success rate by at most about three percentage points, and GPT-2 performs worse with them when candidates are also provided (82.6 vs 85.3 with Sampled Candidates, and 47.2 vs 47.9 with Similar Candidates). This suggests that simply passing historical interactions as part of the input may not be the optimal approach and warrants further investigation.
We also observe a noticeable difference in model performance between Sampled Candidates and Similar Candidates. For instance, the success rates of GPT-2 and Flan-T5 are approximately 43.8% and 37.6% lower in relative terms, respectively, when the input consists of dialog context and similar candidates (47.9 and 53.6) than when it consists of dialog context and sampled candidates (85.3 and 86.0). This decline in performance is expected, as the presence of similar candidates makes it more challenging for the models to select the correct app.
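The relative improvements quoted for the success rates can be reproduced directly from the table entries. The helper name below is illustrative; the inputs are the success rates reported for the Dialog Context and Dialog Context + Sampled Candidates settings.

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


# Success rates (%) with sampled candidates vs. dialog context only.
flan_t5_gain = relative_change(86.0, 36.4)
gpt2_gain = relative_change(85.3, 65.6)

print(f"Flan-T5: {flan_t5_gain:.2f}%")  # prints "Flan-T5: 136.26%"
print(f"GPT-2:   {gpt2_gain:.2f}%")     # prints "GPT-2:   30.03%"
```

These are relative changes, not percentage-point differences: Flan-T5 gains 49.6 points and GPT-2 gains 19.7 points in absolute terms.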
Moreover, upon subjectively evaluating the success and failure cases, we observed that the pre-trained models exhibit a bias towards more popular apps. For instance, VLC for Android is consistently favored over MX Player Pro. This observation underscores potential avenues for further investigation into the factors influencing model preferences and their implications for recommendation systems.
Table 5 and Table 6 present the results for the Hit and NDCG metrics, respectively, for the candidate ranking experiment. Both models achieve higher Hit and NDCG scores as K increases, as expected, since a larger K gives the ground-truth app more chances to appear in the top-K list. Flan-T5 scores higher than GPT-2 in every configuration and at every K. Particularly noteworthy is the case where the input comprises only Dialog Context and Sampled Candidates, where the Flan-T5 model outperforms the GPT-2 model by 35.74% in Hit@2 (0.938 vs 0.691) and by 40.39% in NDCG@2 (0.921 vs 0.656). These findings point to the advantage of the Flan-T5 model in candidate ranking tasks. Furthermore, as in the recommendation generation experiment, the performance of both models declines when using Similar Candidates as input compared to using Sampled Candidates.
Models | BLEU-4 Score |
---|---|
GPT-2 | 0.1934 |
Flan-T5 | 0.2998 |
The performance of the models in the response generation experiment is shown in Table 7. In this experiment, we again observe that the Flan-T5 model outperforms the GPT-2 model, by 55.01% in relative terms, achieving a BLEU-4 score of 0.2998 compared to 0.1934. While the BLEU scores for both models may not be particularly high, our subjective evaluations suggest that both models perform well. As an example, during user needs elicitation and response generation, generations such as "What average rating do you typically look for when deciding to install a mobile application?" or "Got it! how important is the reputation or credibility of the developer to you when choosing a mobile app?" can yield significantly different BLEU scores depending on the corresponding ground-truth utterance. However, from a qualitative perspective, both responses are perfectly reasonable and effectively advance the conversation.
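To make the point about BLEU concrete, the sketch below computes modified n-gram precision, the building block of BLEU, between the two example utterances, treating the second as a hypothetical ground truth for the first. It is a simplified illustration (no brevity penalty, no smoothing, and a regex tokenizer chosen for convenience), not the scoring script used for the reported BLEU-4 values.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp = ngrams(re.findall(r"\w+", hypothesis.lower()), n)
    ref = ngrams(re.findall(r"\w+", reference.lower()), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return clipped / total


generated = ("What average rating do you typically look for when deciding "
             "to install a mobile application?")
# Hypothetical ground truth: the other example utterance from the text.
reference = ("Got it! how important is the reputation or credibility of the "
             "developer to you when choosing a mobile app?")

p1 = modified_precision(generated, reference, 1)
p2 = modified_precision(generated, reference, 2)
p3 = modified_precision(generated, reference, 3)
```

Here only 5 of the 15 generated unigrams and 1 of the 14 bigrams appear in the reference, and no trigram matches, so an unsmoothed BLEU-4 for this pair would be zero even though both questions are sensible elicitation turns.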
6. Conclusion
This paper introduces MobileConvRec, a dataset tailored for conversational mobile app recommendations. The key novelty of the proposed dataset is the integration of users' historical interactions within multi-turn dialogs, providing a more holistic view of user preferences. MobileConvRec comprises over 12.2K multi-turn dialogs encompassing 11.8K unique users interacting with 1,730 apps spanning 45 categories. These interactions amount to over 156K conversation turns, providing fine-grained detail for understanding user preferences across diverse app categories. Furthermore, each app in the dataset includes comprehensive metadata, such as app permissions, data collection and sharing practices, security policies implemented by developers, and binary executables of free apps. We showcase the utility of MobileConvRec as an experimental testbed for research in conversational app recommendation by conducting a comparative evaluation of pre-trained language models. This evaluation also establishes baseline results that can serve as reference points for the research community.
References
- [1] Movielens. https://grouplens.org/datasets/movielens/, 2022. Accessed: 2022-11-06.
- [2] Top 20 play store app reviews. https://www.kaggle.com/datasets/odins0n/top-20-play-store-app-reviews-daily-update, 2022. Accessed: 2022-12-09.
- [3] Apple. Apple app store. https://apps.apple.com/, 2022. Accessed: 2022-11-06.
- [4] L. Ceci. Number of apps available in leading app store. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/. Accessed: 2022-11-06.
- [5] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 815–824, 2016.
- [6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, ** Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
- [7] J. Degenhard. Number of apps available in leading app store. https://www.statista.com/forecasts/1143723/smartphone-users-in-the-world. Accessed: 2022-02-02.
- [8] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931, 2015.
- [9] Umar Farooq, AB Siddique, Fuad Jamour, Zhijia Zhao, and Vagelis Hristidis. App-aware response synthesis for user reviews. In 2020 IEEE International Conference on Big Data (Big Data), pages 699–708. IEEE, 2020.
- [10] Moghis Fereidouni, Adib Mosharrof, Umar Farooq, and AB Siddique. Proactive prioritization of app issues via contrastive learning. In 2022 IEEE International Conference on Big Data (Big Data), pages 535–544. IEEE, 2022.
- [11] Zuohui Fu, Yikun Xian, Yaxin Zhu, Shuyuan Xu, Zelong Li, Gerard De Melo, and Yongfeng Zhang. Hoops: Human-in-the-loop graph reasoning for conversational recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2415–2421, 2021.
- [12] Cuiyun Gao, Jichuan Zeng, Xin Xia, David Lo, Michael R Lyu, and Irwin King. Automating app review response generation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 163–175. IEEE, 2019.
- [13] Google. Google play store. https://play.google.com/store/apps, 2022. Accessed: 2022-11-06.
- [14] Maarten Grootendorst. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794, 2022.
- [15] Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. Inspired: Toward sociable recommendation dialog systems. arXiv preprint arXiv:2009.14306, 2020.
- [16] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM international conference on information and knowledge management, pages 720–730, 2023.
- [17] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
- [18] Meihuizi Jia, Ruixue Liu, Peiying Wang, Yang Song, Zexi Xi, Haobin Li, Xin Shen, Meng Chen, Jinhui Pang, and Xiaodong He. E-convrec: a large-scale conversational recommendation dataset for e-commerce customer service. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5787–5796, 2022.
- [19] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston. Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue. arXiv preprint arXiv:1909.03922, 2019.
- [20] Hammad Khalid. On identifying user complaints of ios apps. In 2013 35th international conference on software engineering (ICSE), pages 1474–1476. IEEE, 2013.
- [21] Hammad Khalid, Emad Shihab, Meiyappan Nagappan, and Ahmed E Hassan. What do mobile app users complain about? IEEE software, 32(3):70–77, 2014.
- [22] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 10:707–710, 1965.
- [23] Dong Li, Ruoming Jin, Jing Gao, and Zhi Liu. On sampling top-k recommendation evaluation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2114–2124, 2020.
- [24] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. Towards deep conversational recommendations. Advances in neural information processing systems, 31, 2018.
- [25] Lizi Liao, Le Hong Long, Zheng Zhang, Minlie Huang, and Tat-Seng Chua. Mmconv: an environment for multimodal conversational search across multiple domains. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pages 675–684, 2021.
- [26] Yuanxing Liu, Weinan Zhang, Baohua Dong, Yan Fan, Hang Wang, Fan Feng, Yifan Chen, Ziyu Zhuang, Hengbin Cui, Yongbin Li, et al. U-need: A fine-grained dataset for user needs-centric e-commerce conversational recommendation. arXiv preprint arXiv:2305.04774, 2023.
- [27] Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. Towards conversational recommendation over multi-type dialogs. arXiv preprint arXiv:2005.03954, 2020.
- [28] Walid Maalej and Hadeer Nabil. Bug report, feature request, or simply praise? on automatically classifying app reviews. In 2015 IEEE 23rd international requirements engineering conference (RE), pages 116–125. IEEE, 2015.
- [29] M. H. Maqbool, Umar Farooq, Adib Mosharrof, A. B. Siddique, and Hassan Foroosh. Mobilerec: A large scale dataset for mobile apps recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, pages 3007–3016, New York, NY, USA, 2023. Association for Computing Machinery.
- [30] Julian McAuley. Amazon product data. http://jmcauley.ucsd.edu/data/amazon/, 2022. Accessed: 2022-11-06.
- [31] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- [32] Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 845–854, 2019.
- [33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
- [34] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- [35] Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. Coached conversational preference elicitation: A case study in understanding movie preferences. 2019.
- [36] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019.
- [37] Kamonphop Srisopha, Daniel Link, and Barry Boehm. How should developers respond to app reviews? features predicting the success of developer responses. EASE 2021, pages 119–128, New York, NY, USA, 2021. Association for Computing Machinery.
- [38] Hu Xu, Seungwhan Moon, Honglei Liu, Bing Liu, Pararth Shah, and Philip S Yu. User memory reasoning for conversational recommendation. arXiv preprint arXiv:2006.00184, 2020.
- [39] Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. Recbole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In CIKM, pages 4653–4664. ACM, 2021.
- [40] Kun Zhou, Yuanhang Zhou, Wayne Xin Zhao, Xiaoke Wang, and Ji-Rong Wen. Towards topic-guided conversational recommender system. arXiv preprint arXiv:2010.04125, 2020.