MobileConvRec: A Conversational Dataset for Mobile Apps Recommendations

Srijata Maji … University of Kentucky, Moghis Fereidouni … University of Kentucky, Vinaik Chhetri … Louisiana State University, Umar Farooq 0000-0001-7229-9847 Louisiana State University and A.B. Siddique 0000-0002-3587-7289 University of Kentucky
(2018)
Abstract.

Existing recommendation systems have focused on two paradigms: (i) historical user-item interaction-based recommendations and (ii) conversational recommendations. Conversational recommendation systems facilitate natural language dialogues between users and the system, allowing the system to solicit users’ explicit needs while enabling users to inquire about recommendations and provide feedback. Due to substantial advancements in natural language processing, conversational recommendation systems have gained prominence. Existing conversational recommendation datasets have greatly facilitated research in their respective domains. Despite the exponential growth in mobile users and apps in recent years, research in conversational mobile app recommender systems has faced substantial constraints. This limitation can primarily be attributed to the lack of high-quality benchmark datasets specifically tailored for mobile apps. To facilitate research for conversational mobile app recommendations, we introduce MobileConvRec. MobileConvRec simulates conversations by leveraging real user interactions with mobile apps on the Google Play store, originally captured in the large-scale mobile app recommendation dataset MobileRec. The proposed conversational recommendation dataset combines sequential user-item interactions, which reflect implicit user preferences, with comprehensive multi-turn conversations to capture explicit user needs.
MobileConvRec consists of over 12K multi-turn recommendation-related conversations spanning 45 app categories. Furthermore, MobileConvRec presents rich metadata for each app such as permissions data, security and privacy-related information, and binary executables of apps, among others. We demonstrate that MobileConvRec can serve as an excellent testbed for conversational mobile app recommendation through a comparative study of several pre-trained large language models. The MobileConvRec dataset is available at https://huggingface.co/datasets/recmeapp/MobileConvRec.

Conversational Recommendations, App Recommendations.
CCS concepts: Information systems → Personalization; Information systems → Recommender systems; Information systems → Test collections

1. Introduction

Figure 1. A sample natural language dialog between the user and the system. The system draws on the user’s historical interactions, proactively elicits the user’s explicit needs, and combines this information to make more meaningful recommendations. Additionally, the user may pose follow-up questions regarding the recommended app.
Table 1. Comparison of existing mobile apps datasets with MobileConvRec.

| Dataset features | RRGen (rrgen, ) | AARSynth (aarsynth, ) | Srisopha et al. (srisopha-how, )∗ | PPrior (pprior, )† | MobileRec (mobilerec, ) | MobileConvRec |
| --- | --- | --- | --- | --- | --- | --- |
| Multi-turn conversations | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Multiple interactions by a single user | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Interaction timestamp | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Security & privacy-related metadata | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| App executables | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Number of apps | 58 | 103 | 1,600 | 9,869 | 10,173 | 1,730 |
| Number of app categories | 15 | 23 | 32 | 48 | 48 | 45 |

∗ (srisopha-how, ) is not publicly available. † (pprior, ) contains only negative user reviews.

In the past decade, mobile apps have seen exponential growth, with over 5 billion users reported (number-of-smartphone-users, ). These apps are utilized for diverse purposes, including productivity, news consumption, entertainment, ride-sharing, and food services, to name a few. Consequently, app distribution channels have seen significant growth. Notably, the Apple App Store (appstore, ) and Google Play (googleplay, ) alone host over 2.2 million and 3.5 million apps, respectively (number-of-apps-on-stores, ). The expanding app marketplaces present a significant challenge for users in efficiently discovering apps that match their preferences. Conversational recommendation systems can play a crucial role in alleviating users’ cognitive overload by discerning both their implicit needs, inferred from previous interaction history, and their explicit needs, expressed through conversational interactions. As illustrated in Figure 1, an app recommendation system can suggest new apps to users by integrating both their implicit preferences (i.e., prior installations and interactions with apps) and explicit needs (i.e., current conversation with the system).

Unlike traditional recommendation systems, which primarily rely on users’ interaction history, conversational recommendation systems can: (i) understand users’ historical interactions alongside multi-turn natural language dialog, and (ii) generate human-like responses to not only recommend items but also facilitate preference refinement, knowledgeable discussion, and recommendation justification. Conversational recommendation systems have shown remarkable success in a wide range of domains, such as movies (dodge2015evaluating, ; kang2019recommendation, ), music (moon2019opendialkg, ), sports (moon2019opendialkg, ), e-commerce (liu2023u, ; jia2022convrec, ), and travel (liao2021mmconv, ), among others (xu2020user, ). The existence of datasets specifically designed for various domains, featuring multi-turn conversational interactions (dodge2015evaluating, ; he2023large, ; liu2020towards, ), has played a crucial role in advancing the development and refinement of conversational recommendation systems.

Several prominent datasets focus on mobile apps, including RRGen (rrgen, ), AARSynth (aarsynth, ), Srisopha et al. (srisopha-how, ), PPrior (pprior, ), and MobileRec (mobilerec, ), among others. However, RRGen comprises only single-turn interactions and covers fewer than 100 apps from fewer than 20 categories. While datasets like AARSynth, Srisopha et al., and PPrior offer millions of interactions, their lack of unique user identifiers renders them inadequate for developing any type of app recommendation system. Moreover, PPrior contains only negative user interactions, and the dataset from Srisopha et al. (srisopha-how, ) is not publicly available. Although MobileRec (mobilerec, ) provides a large-scale dataset with unique user identifiers and has been employed in constructing app recommendation systems, it lacks multi-turn natural language interactions. A comparison of mobile app datasets is presented in Table 1.

In this work, we attempt to bridge this research gap by offering a large-scale, rich, and diverse benchmark dataset, which we call MobileConvRec. This dataset is designed to facilitate researchers in the development of conversational app recommendation systems. We construct MobileConvRec by sampling real user interactions with mobile apps sourced from the Google Play store, originally captured in MobileRec, serving as the basis for our conversational dataset. Our methodology for simulating natural language dialogues between users and the system is rooted in the sampled interactions, ensuring that the simulation faithfully reflects the user’s actual interactions in retrospect. To this end, we develop a theoretical framework designed to process a sequential recommendation dataset containing user interactions with various apps over time (e.g., time-stamped reviews). The framework subsequently generates a conversational recommendation dataset as its output. To streamline the simulation process, we divide it into two steps. Firstly, the simulation generates a dialogue outline at a semantic level. Subsequently, in the second step, this semantic information is transformed into contextual natural language utterances. Specifically, the conversation is initiated by the computer simulator by selecting a question aimed at understanding the user’s interests. This is accomplished by sampling an aspect from global user preferences, following a normalized probability distribution over all aspects.
In response to the computer simulator’s inquiry, the human simulator provides a reply, considering the review text associated with the sampled interaction. To the best of our knowledge, this is the only recommendation dataset that integrates users’ timestamped historical interactions and multi-turn dialogs, enabling the development of effective conversational recommendation systems.

Table 2. Comparison of MobileConvRec with well-known conversational recommendation datasets in different domains.

| Datasets | #Dialogs | #Turns | #Users | #Apps | Domain(s) |
| --- | --- | --- | --- | --- | --- |
| FacebookRec (dodge2015evaluating, ) | 1M | 6M | - | - | Movies |
| ReDial (li2018towards, ) | 10K | 182K | 956 | 6,281 | Movies |
| GoRecDial (kang2019recommendation, ) | 9K | 170K | - | - | Movies |
| OpenDialKG (moon2019opendialkg, ) | 15K | 91K | - | - | Movies, books, sports, music |
| TG-ReDial (zhou2020towards, ) | 10K | 129K | 1,482 | - | Movies |
| DuRecDial (liu2020towards, ) | 10K | - | 10K | - | Movies, music, food, restaurant, news, weather |
| CCPE-M (radlinski2019coached, ) | 502 | 11K | - | - | Movies |
| INSPIRED (hayati2020inspired, ) | 1K | 35K | 999 | 1,967 | Movies |
| Reddit-Movie-Base (he2023large, ) | 85K | 133K | 10K | 24,326 | Movies |
| Reddit-Movie-Large (he2023large, ) | 634K | 1.6M | 36K | 51,203 | Movies |
| U-NEED (liu2023u, ) | 7K | 333K | - | - | E-commerce |
| E-ConvRec (jia2022convrec, ) | 25K | 775K | - | - | E-commerce |
| HOOPS (fu2021hoops, ) | - | 11.6M | - | - | E-commerce |
| MGConvRex (xu2020user, ) | 7K | 73K | - | - | Restaurant |
| MMConv (liao2021mmconv, ) | 5K | 39K | - | - | Travel |
| MobileConvRec | 12.2K | 156K | 11.8K | 1,730 | All 45 categories on Google Play† |

† Including food & drink, news & magazines, music, shopping, social, sports, weather, etc.

MobileConvRec contains over 12.2K multi-turn dialogs involving 11.8K unique users across 1,730 apps spanning 45 categories. These interactions result in over 156K turns in conversations. In addition to the basic metadata provided for each app in MobileRec, MobileConvRec offers comprehensive metadata for each app, including permissions, data collection and sharing practices, security policies of app developers, and binary executables of free apps, among other details. Table 3 describes key features of the dataset. Furthermore, we compare our proposed dataset with the latest versions of well-established conversational recommendation datasets across various domains in Table 2.

Through a comparative study utilizing pre-trained large language models (LLMs) such as GPT-2 and Flan-T5, we demonstrate the utility of our dataset in facilitating research in the domain of conversational mobile app recommendations. In our analysis, we present results based on standard evaluation metrics such as Hit@K, NDCG@K, and BLEU for the baseline models. This comprehensive evaluation provides valuable insights into the performance of these models. Notably, our study serves a dual purpose: it lays the foundation for future research in this domain and establishes baseline results that can serve as a benchmark for future comparisons and advancements. Additionally, we identify areas for improvement and potential avenues for further exploration.
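As a point of reference, the two ranking metrics named above can be sketched as follows. This is a minimal illustration, assuming each test dialog has exactly one ground-truth app; the function names and app identifiers are hypothetical, and this is not the evaluation code behind the reported baselines.

```python
import math
from typing import Sequence


def hit_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """Return 1.0 if the target app appears in the top-k of the ranking, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG@K with a single relevant item: 1 / log2(rank + 1), or 0.0 on a miss."""
    top_k = list(ranked[:k])
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)


ranked_apps = ["app_c", "app_a", "app_b"]  # hypothetical model ranking
print(hit_at_k(ranked_apps, "app_a", k=1))   # target sits at rank 2, so a miss at K=1
print(hit_at_k(ranked_apps, "app_a", k=3))   # found within the top 3
print(ndcg_at_k(ranked_apps, "app_a", k=3))  # equals 1 / log2(3)
```

With one relevant item the ideal DCG is 1, so NDCG@K reduces to the discounted gain at the target's rank. Corpus-level scores would be the mean of these per-dialog values.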

Specifically, this work makes the following contributions:

  • We present MobileConvRec, the most extensive collection of recommendation-related multi-turn natural language user-system dialogs for mobile apps to date. With over 156K dialog turns spanning more than 1.7K distinct apps sourced from Google Play and covering 45 categories, it is unique in its domain. Notably, this is the only mobile app dataset that features multi-turn conversations.

  • Our experimental study showcases the practical utility of MobileConvRec using several state-of-the-art LLMs. Furthermore, we establish baseline results, highlighting the dataset’s potential role in driving advancements in conversational mobile app recommendations.

  • MobileConvRec comprises rich metadata about apps, facilitating follow-up question answering about recommended apps, a capability often overlooked in conversational recommender systems. Furthermore, the availability of executable files can aid in conducting security and privacy-related analyses, mitigating potential biases inherent in developer-provided information.

2. Related Work

Over the past few decades, numerous noteworthy works and datasets have significantly contributed to advancing the understanding and development of conversational recommendation systems. Separately, there have been efforts to collect mobile app datasets for various purposes. Next, we discuss related work in the context of both conversational recommendation and mobile app datasets.

2.1. Datasets for Conversational Recommendations

There are several existing conversational recommendation datasets. We provide a brief list in Table 2. Initial research on conversational recommender systems primarily focused on user preferences among pre-determined choices (dodge2015evaluating, ; christakopoulou2016towards, ). Notably, FacebookRec (dodge2015evaluating, ) is based on four movie dialogue datasets derived from the Facebook movie dialog dataset: a question-answer (QA) dataset, a recommendation dataset, a mixed recommendation and QA dataset, and a general chit-chat dialogue dataset from Reddit. The synthetic datasets were generated using the ratings from the MovieLens dataset (movielens, ) and the Open Movie Database (OMDb). The recommendation dataset is synthetically generated, providing single movie names as answers. The Reddit dataset shares similarities, involving natural conversations about movies, but the discourse is more free-form and not primarily focused on obtaining recommendations.

In recent times, several studies and models have emerged that focus on engaging users in natural language multi-turn dialogs. These efforts prioritize real-time responses through sentiment analysis and seek to deliver desired recommendations (li2018towards, ; zhou2020towards, ). Crowd-sourced datasets such as ReDial (li2018towards, ), DuRecDial (liu2020towards, ), GoRecDial (kang2019recommendation, ), and INSPIRED (hayati2020inspired, ) are human-annotated with predefined goals, such as item recommendation and goal planning. These goal-oriented datasets integrate elements of chitchat and task-oriented dialogs, specifically in the context of recommendation tasks. TG-ReDial (zhou2020towards, ), a variant of ReDial, uses topic prediction to recommend movies. OpenDialKG (moon2019opendialkg, ) is built on top of Freebase to model dialogue logic through the traversal of the knowledge graph.

The DuRecDial (liu2020towards, ) dataset focuses on multilingual and cross-lingual conversational recommendation. Both the E-ConvRec (jia2022convrec, ) and U-NEED (liu2023u, ) datasets are proposed for e-commerce conversational recommendation. E-ConvRec (jia2022convrec, ) features dialogs on pre-sales topics between users and customer service staff, while U-NEED (liu2023u, ) provides fine-grained annotations for user needs in pre-sales dialogs, covering five popular categories and including user behaviors before and after the conversations, facilitating the development and evaluation of conversational recommender systems. The HOOPS (fu2021hoops, ) dataset for e-commerce is built on a knowledge graph derived from Amazon reviews (amazon-dataset, ); key entities are extracted to form user-item interactions. Dialogs are then synthesized using templates, enabling the generation of substantial data for training policy and recommendation modules in conversational recommender systems. MGConvRex (xu2020user, ) focuses on facilitating restaurant bookings, while MMConv (liao2021mmconv, ) introduces multi-domain conversations, specifically within the context of travel. The Reddit-Movie datasets (base and large) (he2023large, ) support empirical studies of conversational recommendation tasks using LLMs in a zero-shot setting.

These datasets have played a significant role in the development of several conversational recommendation systems (li2018towards, ; fu2021hoops, ; liu2020towards, ; liu2023u, ; he2023large, ) in their respective domains. We anticipate that the proposed dataset MobileConvRec will play a similar role in stimulating research on effective conversational app recommender systems. A comparison of existing conversational recommendation datasets with MobileConvRec, shown in Table 2, indicates that the proposed dataset is on par with these datasets in key attributes.

2.2. Datasets for Mobile Apps

Several datasets exist for user interactions with mobile apps, as listed in Table 1. Khalid et al. (khalid2014mobile, ; khalid2013identifying, ) provided a dataset of iOS apps consisting of 6,390 user reviews for 20 apps. Maalej and Nabil (maalej2015bug, ) collected the first large dataset with 1.3 million reviews for over 1,100 apps; their dataset focuses on user problems and understanding the user-developer dialogue. It is important to note that these datasets are not publicly available. Top 20 Apps (top20-dataset, ) is publicly available and contains 200K reviews for 20 apps spanning 9 categories. This dataset provides rating scores and text for the reviews. RRGen (rrgen, ) has more than 309K reviews spanning 58 apps. Similar to Top 20 Apps, RRGen provides only the text of reviews with rating scores. Neither of these datasets provides app metadata, unique user identifiers, or review timestamps. AARSynth (aarsynth, ) collected over two million user reviews for over a hundred apps, including app metadata. Reviews in this dataset also lack unique user identifiers and review timestamps, similar to the datasets mentioned earlier.

Srisopha et al. (srisopha-how, ) collected over 9 million user reviews from 1,600 apps. This dataset has review timestamps, which help to understand reviews in their temporal context. However, it includes neither unique user identifiers nor app metadata. Moreover, Srisopha et al. did not make this dataset publicly available. The PPrior (pprior, ) dataset provides more than 2 million reviews for over 9 thousand apps covering 48 categories from Google Play. It provides rating scores, review text, and review timestamps. However, it does not include user identifiers for interactions (i.e., reviews) and lacks app metadata. Additionally, the PPrior dataset contains only negative user reviews. More recently, MobileRec (mobilerec, ) provided over 1.9 million user interactions with interaction timestamps. However, MobileRec does not provide multi-turn conversations. We complement MobileRec, which spans 48 categories on Google Play, by building a conversational dataset on top of it. Furthermore, MobileConvRec includes security and privacy metadata along with executables for the apps. This makes our dataset an ideal testbed for further research on conversational recommendation for mobile apps as well as on security and privacy perspectives.

Figure 2. Overview of the framework: It transforms a conventional sequential recommendation dataset into a conversational recommendation dataset.

3. MobileConvRec Dataset

The proposed conversational recommendation dataset has been curated in an end-to-end fashion with minimal human intervention. We first outline the theoretical framework that supports the dataset generation process. We then provide implementation details about each step of the dataset creation. Figure 2 provides an overview of the framework.

3.1. Theoretical Framework

At a high level, we develop a framework that ingests a sequential recommendation dataset containing user interactions over time (e.g., time-stamped reviews) with various apps. This framework then generates a conversational recommendation dataset as output. Formally, the framework $F$ maps a traditional recommendation dataset $D$ to a conversational recommendation dataset $C$:

$$F(D) \rightarrow C = \{c_t;\ t \in 1, 2, \cdots, N\}$$
$$c_t = \{(u_1^t, s_1^t), (u_2^t, s_2^t), \cdots, (u_n^t, s_n^t)\}$$

where each dialog $c_t$ consists of $n$ natural language interaction turns. Each turn $(u_i^t, s_i^t)$ consists of user utterance $u_i^t$ and system utterance $s_i^t$. Furthermore, at the $k$-th dialog turn, the system utterance $s_k^t$ presents a recommendation for an app to the user.
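The notation above maps naturally onto a small data structure. The following is a minimal sketch, assuming hypothetical class and field names (`Turn`, `Dialog`, `recommended_app`) and invented utterances; it does not reflect the released file format of the dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    user_utterance: str                    # u_i^t
    system_utterance: str                  # s_i^t
    recommended_app: Optional[str] = None  # set only on the recommendation turn


@dataclass
class Dialog:
    turns: List[Turn] = field(default_factory=list)  # c_t = {(u_1, s_1), ..., (u_n, s_n)}

    def recommendation_turn(self) -> Optional[int]:
        """Return the 1-based index k of the turn whose system utterance recommends an app."""
        for k, turn in enumerate(self.turns, start=1):
            if turn.recommended_app is not None:
                return k
        return None


# Invented example dialog with n = 2 turns; the app name is hypothetical.
dialog = Dialog(turns=[
    Turn("I want a note-taking app.", "Do you need it to be free?"),
    Turn("Yes, free please.", "You could try ExampleNotes.", recommended_app="ExampleNotes"),
])
conversational_dataset: List[Dialog] = [dialog]  # C = {c_t; t in 1..N}, here N = 1
print(dialog.recommendation_turn())
```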

At the dialog level, the framework randomly selects a user interaction with an app along with the corresponding user profile. This user profile is built using the user’s historical interactions. Additionally, the framework incorporates global user preferences regarding the significance of different aspects of the apps (e.g., customization). More precisely, the framework simulates a dialog between the user and the system, with the sampled user interaction serving as a reference to guide the conversational flow. We break down the simulation into two steps for simplicity. In the first step, the simulation generates the dialog outline at a semantic level, while the second step converts this semantic information into contextual natural language utterances. Formally, the first step $F_1$ takes as input global user preferences $G$, a sampled interaction $d_i$, and the corresponding user profile $u_k$. It then maps this input to the semantic-level dialog outline $s_i$. Subsequently, the second step $F_2$ maps the semantic-level dialog outline $s_i$ to the natural language dialog $c_i$:

$$F_1(G, d_i, u_k) \rightarrow \{s_i\}$$
$$F_2(\{s_i\}) \rightarrow \{c_i\}$$
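The composition of the two steps can be sketched as follows. This is only an illustration under stated assumptions: the `Interaction` record, the way earlier interactions are gathered, and the toy stand-ins `toy_f1` and `toy_f2` are all hypothetical and merely show how $G$, $d_i$, and $u_k$ flow through $F_1$ and $F_2$.

```python
import random
from typing import Callable, Dict, List, NamedTuple, Tuple


class Interaction(NamedTuple):
    user_id: str
    app_id: str
    timestamp: int  # e.g. Unix time of the review
    review: str


def sample_interaction(
    data: List[Interaction], rng: random.Random
) -> Tuple[Interaction, List[Interaction]]:
    """Sample one interaction d_i and collect the same user's earlier interactions."""
    d_i = rng.choice(data)
    history = sorted(
        (x for x in data if x.user_id == d_i.user_id and x.timestamp < d_i.timestamp),
        key=lambda x: x.timestamp,
    )
    return d_i, history


def simulate_dialog(
    G: Dict[str, float],
    d_i: Interaction,
    u_k: Dict[str, str],
    f1: Callable[[Dict[str, float], Interaction, Dict[str, str]], List[str]],
    f2: Callable[[List[str]], List[str]],
) -> List[str]:
    """Compose the two steps: F2(F1(G, d_i, u_k))."""
    outline = f1(G, d_i, u_k)  # step 1: semantic-level outline
    return f2(outline)         # step 2: natural language utterances


# Toy stand-ins for F1 and F2, only to show how the pieces connect.
def toy_f1(G, d_i, u_k):
    aspect = max(G, key=G.get)  # ask about the most important aspect
    return [f"{aspect} = ?", f"{aspect} = {u_k[aspect]}", f"recommend = {d_i.app_id}"]


def toy_f2(outline):
    return [f"<utterance for: {line}>" for line in outline]


data = [
    Interaction("u1", "app_a", 1, "Great and free."),
    Interaction("u1", "app_b", 2, "Easy to customize."),
    Interaction("u2", "app_a", 3, "Too many ads."),
]
d_i, history = sample_interaction(data, random.Random(0))
G = {"price": 0.6, "customization": 0.4}
u_k = {"price": "free", "customization": "0.7"}
dialog = simulate_dialog(G, d_i, u_k, toy_f1, toy_f2)
print(dialog)
```

Keeping $F_1$ and $F_2$ as separate callables mirrors the split in the text: the outline can be inspected or validated before any natural language is generated.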

In the first step, the user-system simulation follows a straightforward protocol called REQUEST-RESPONSE. In this protocol, either the user or the system can request information about any aspect using the message line: “aspect-name = ?”. The response line takes the form: “aspect-name = value”, where the value can either be an actual value (e.g., price = free) or a value between 0 and 1 representing the degree of importance attributed to the requested aspect (e.g., customization = 0.7). Formally, at turn $t$ of dialog $i$, the system simulator $SYS$ samples (without replacement) an aspect $a_p$ according to a weighted distribution over all aspects from the global user preferences $G$ and requests information regarding that aspect. The user simulator $USR$ then provides the value $v_p$ for the requested aspect $a_p$, taking into account the sampled interaction $d_i$ and the user’s profile $u_k$:

$$SYS(G) \rightarrow s_t^i = (a_p = ?)$$
$$a_p \sim G = \{a_i = pr(a_i);\ i \in 1, 2, \cdots, m\}$$
$$USR(d_i, u_k, a_p) \rightarrow (a_p = v_p)$$
$$u_k = \{(a_i = v_i);\ i \in 1, 2, \cdots, m\}$$

where the probability values $pr(a_p)$ for tuples $(a_p = pr(a_p))$ in $G$ are determined by analyzing the proportions of users who deem a particular aspect important in the dataset $D$. Meanwhile, the user simulator leverages user profile information and the sampled interaction to collect the value $v_p$ for the requested aspect $a_p$. This mechanism also accommodates users requesting information about the recommended app later in the conversation.
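A minimal sketch of the REQUEST-RESPONSE step follows. It assumes a hypothetical per-user list of important aspects as the input for estimating $G$, and a plain dictionary as the user profile $u_k$; how aspect values are actually extracted from reviews is not covered here.

```python
import random
from typing import Dict, List


def estimate_global_preferences(user_aspects: Dict[str, List[str]]) -> Dict[str, float]:
    """pr(a): number of users who deem aspect a important, normalized to sum to 1."""
    counts: Dict[str, int] = {}
    for aspects in user_aspects.values():
        for a in set(aspects):  # count each user at most once per aspect
            counts[a] = counts.get(a, 0) + 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def sample_aspects(G: Dict[str, float], n: int, rng: random.Random) -> List[str]:
    """Draw up to n distinct aspects; each draw is weighted by pr(a) over the remaining ones."""
    remaining = dict(G)
    picked: List[str] = []
    for _ in range(min(n, len(remaining))):
        names = list(remaining)
        weights = [remaining[a] for a in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        picked.append(choice)
        del remaining[choice]  # sampling without replacement
    return picked


def request(aspect: str) -> str:
    """System (or user) message line asking about an aspect."""
    return f"{aspect} = ?"


def response(aspect: str, profile: Dict[str, str]) -> str:
    """Reply line carrying the value v_p stored for the requested aspect."""
    return f"{aspect} = {profile[aspect]}"


# Hypothetical inputs for illustration only.
user_aspects = {"u1": ["price", "customization"], "u2": ["price"], "u3": ["price", "privacy"]}
G = estimate_global_preferences(user_aspects)
profile = {"price": "free", "customization": "0.7", "privacy": "0.9"}
for aspect in sample_aspects(G, n=2, rng=random.Random(0)):
    print(request(aspect), "->", response(aspect, profile))
```

Removing each drawn aspect before the next draw guarantees that the system never asks about the same aspect twice within a dialog.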

The second step is straightforward and involves utilizing an off-the-shelf pre-trained LLM, such as ChatGPT, to transform the semantic-level turns into coherent contextual natural language dialogs. Specifically, at turn $t$, we provide the model with the natural language dialog context $c_{<t}$ along with the semantic-level information $a_{p} = v_{p}$ or $a_{p} = ?$, using the appropriate prompt. The model produces the natural language dialog $c_{\leq t}$:

$\texttt{LLM}(c_{<t}, (a_{p} = v_{p}) \vee (a_{p} = ?)) \rightarrow c_{\leq t}$

It is crucial to emphasize that the simulated conversation remains grounded in the sampled interaction, ensuring that the simulation closely aligns with the user’s actual interaction in hindsight. Moreover, the recommended app always corresponds to the one involved in the sampled interaction.
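As an illustration of the second step, the following minimal sketch assembles the input passed to the LLM for one turn from the dialog context $c_{<t}$ and the semantic-level information. The function name `build_turn_prompt` and the prompt wording are hypothetical; the actual prompts used to build the dataset are not given in this section.

```python
from typing import List, Optional


def build_turn_prompt(context: List[str], aspect: str, value: Optional[str]) -> str:
    """Assemble an LLM prompt for one turn (hypothetical template).

    context: natural language utterances so far (c_{<t}).
    aspect:  the aspect a_p for this turn.
    value:   v_p when the user side answers, or None for a_p = ? (a question).
    """
    history = "\n".join(context) if context else "(conversation start)"
    if value is None:
        # a_p = ? : the computer side asks about the aspect.
        instruction = (
            f"As the Computer, ask the user about their preference "
            f"for the aspect '{aspect}'."
        )
    else:
        # a_p = v_p : the human side states the value for the aspect.
        instruction = (
            f"As the Human, state that your preference for '{aspect}' is '{value}'."
        )
    return f"Dialog so far:\n{history}\n\nInstruction: {instruction}\nNext utterance:"
```

The returned string would be sent to the language model, and its completion appended to the context to form $c_{\leq t}$.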

3.2. Conversational Dataset Construction

We construct MobileConvRec using the theoretical framework from Section 3.1 and a large-scale dataset for mobile app recommendations, called MobileRec (mobilerec). In the following, we provide details of each step and the design choices.

3.2.1. Topic Modeling for App Aspect Extraction

First, we identify the key aspects of mobile apps that users prioritize. Our approach involves leveraging the extensive dataset, MobileRec, which comprises 19.3 million user reviews across diverse mobile apps. Recognizing the importance of an efficient and scalable methodology, we choose to employ the BERTopic (grootendorst2022bertopic) library for conducting unsupervised topic modeling. For the vectorization of user review data, we utilize sentence transformers (reimers-2019-sentence-bert), specifically employing the pre-trained all-mpnet-base-v1 model. The selection of this model for text embedding is driven by its state-of-the-art performance across various natural language understanding benchmarks.

After obtaining the high-dimensional embeddings for user reviews, the next step involves dimensionality reduction. The goal is to reduce the dimensionality of the embeddings while retaining the relevant semantic information for effective topic modeling. In particular, we use Uniform Manifold Approximation and Projection (UMAP) (mcinnes2018umap) as the dimensionality reduction technique. UMAP offers distinct advantages, as it can effectively capture both local and global structures of the high-dimensional space in lower dimensions. Following dimensionality reduction, we employ the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm to cluster reviews that share similar aspects. HDBSCAN is chosen for its ability to identify clusters of varying shapes and densities in the data. Moreover, HDBSCAN is robust to noise and outliers, allowing it to handle the noisy data commonly encountered in user reviews.

After the clustering is performed, we analyze the most significant terms in each cluster to extract the cluster-representative aspects. We experiment with n-gram ranges between 1 and 4. After investing some manual effort in consolidating semantically similar aspects, we arrive at the following unordered list of 20 aspects. These aspects form the foundation of the simulation framework: (i) User Interface Design; (ii) Navigation; (iii) Accessibility; (iv) Customization; (v) Functionality; (vi) Performance; (vii) Responsiveness; (viii) Security; (ix) Privacy; (x) Permissions; (xi) Data Collection; (xii) Data Sharing; (xiii) Updates; (xiv) Customer support; (xv) Reviews and ratings; (xvi) Developer; (xvii) Price; (xviii) In-app purchases; (xix) Advertisement Frequency; (xx) Battery Drainage.
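The pipeline described above (sentence embeddings, UMAP, HDBSCAN, and n-gram term extraction) can be wired together in BERTopic roughly as follows. This is a sketch only: the embedding model name and the n-gram range come from the text, while `build_topic_model` and every hyperparameter value (`n_neighbors`, `n_components`, `min_cluster_size`, and so on) are illustrative assumptions rather than the settings used for the dataset.

```python
NGRAM_RANGE = (1, 4)                   # n-gram range stated in the text
EMBEDDING_MODEL = "all-mpnet-base-v1"  # embedding model stated in the text


def build_topic_model(min_cluster_size: int = 50):
    """Return an unfitted BERTopic model; hyperparameters are assumptions."""
    # Imports are local so this module loads even when the libraries are absent.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    embedder = SentenceTransformer(EMBEDDING_MODEL)
    # Reduce the embeddings to a low-dimensional space before clustering.
    reducer = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                   metric="cosine", random_state=42)
    # Density-based clustering; reviews labeled -1 are treated as noise.
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
    # Candidate terms per cluster are n-grams of length 1 to 4.
    vectorizer = CountVectorizer(ngram_range=NGRAM_RANGE, stop_words="english")
    return BERTopic(embedding_model=embedder, umap_model=reducer,
                    hdbscan_model=clusterer, vectorizer_model=vectorizer)
```

Given a list of review strings `reviews`, calling `build_topic_model().fit_transform(reviews)` would assign a topic to each review, and `get_topic_info()` on the fitted model lists the top terms per cluster, which can then be consolidated manually into aspects.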

Table 3. Description of the key features in the MobileConvRec dataset.
Feature | Description
UID | A 16-character alphanumeric unique identifier for each user, effectively anonymizing their identity. Example: ajqpT7VwUFheTsw7, l80Is37SlA2J9Pl4, 2EXDIawV03jpHio1.
Sequential User Interactions | The timestamped series of user interactions preceding the current conversation. Example Interaction: “app_name”: “Stickman vs Zombies”, “package_name”: “com.aurecas.stickmanzombieshooter”, “date”: “2021-12-01”, “rating”: “2”.
Natural Language Conversation | Multi-turn user-system conversation in natural language. Example Turn: Computer: “Wonderful! Which specific category or type of apps interests you the most?” Human: “I’m generally drawn toward Music & Audio apps.”
Recommendation | The actual app with which the user interacted, based on their sequential interactions. The goal of the recommender system is to take into account both the user’s historical interactions and the current conversational context to make a meaningful recommendation. Example Recommendation: “app_name”: “Walk Band - Multitracks Music”, “package_name”: ⋯.
Negative Recommendation | Apps related to the actual recommended app (the one the user interacted with, based on their sequential interactions), which serve as negative candidates. Example Negative Recommendation: “app_name”: “Spotify: Music and Podcast”, “package_name”: “com.spotify.music”.
App Metadata | (i) Basic metadata about each app, such as app package, app name, developer name, app category, developer-provided long-form textual description, content rating of the app, number of reviews, average rating, price, app type, and positive and negative app features, among others.
 | (ii) Permissions: List of specific privileges granted to an application, allowing it to access certain resources or perform certain actions on the device. Example: Wi-Fi connection information, view Wi-Fi connections, Photos/Media/Files, ⋯.
 | (iii) Data Collected: The information gathered by the app and its purpose. Example: Approximate location for analytics, advertising ⋯; Financial info for purchase history ⋯.
 | (iv) Data Shared: The information shared with third parties. Example: Personal info such as email address for advertising or marketing; App activity such as ⋯.
 | (v) Security Practices: The measures and protocols implemented to protect data from unauthorized access. Example: Data is encrypted in transit; Your data is transferred over a secure connection.
 | (vi) App Executable: The application executable file, identified by the .apk extension. Example: com.aurecas.stickmanzombieshooter.apk

3.2.2. Global User Preferences

After extracting app aspects from the MobileRec dataset, we gather global user preferences in a structured format. We utilize gpt-3.5-turbo to analyze user reviews and identify the aspects being discussed in each review. Subsequently, we aggregate this information to compute the percentage of reviews that address each specific aspect. We refer to these statistics as $G = \{a_{i} = pr(a_{i});\ i \in 1, 2, \cdots, m\}$. This analysis provides the relative importance of different aspects from a global perspective. For instance, if a large proportion of reviews mention the “performance” aspect, it indicates that performance is a significant consideration for users.

These statistics play a key role in guiding the interactions of the computer simulator during semantic-level conversations. Prioritizing aspects that are frequently mentioned in user reviews increases the likelihood of a successful conversation. Conversely, querying about aspects that garner little attention from users may not only be unhelpful for recommendation purposes but also risk the failure of the conversation. It is crucial to note that when selecting an aspect to ask the user about, we adopt a probabilistic approach: aspects with higher probabilities are more likely to be selected. This approach is chosen over a deterministic one to ensure diverse and dynamic conversations. A deterministic approach would result in repetitive conversations, with the system side of the dialogue always following the same order. By employing a probabilistic approach, we introduce variability and spontaneity into the conversations.
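The probabilistic selection of aspects can be sketched as weighted sampling over $G$. In the minimal example below, `sample_aspect_order` and the example probabilities are hypothetical, and drawing without replacement (so that the same aspect is never asked twice in one conversation) is an assumption that the text does not state.

```python
import random
from typing import Dict, List


def sample_aspect_order(global_prefs: Dict[str, float], n_questions: int,
                        rng: random.Random) -> List[str]:
    """Draw distinct aspects, each with probability proportional to pr(a_i)."""
    remaining = dict(global_prefs)
    order: List[str] = []
    for _ in range(min(n_questions, len(remaining))):
        aspects = list(remaining)
        weights = [remaining[a] for a in aspects]
        # random.choices normalizes the weights internally.
        chosen = rng.choices(aspects, weights=weights, k=1)[0]
        order.append(chosen)
        del remaining[chosen]  # assumption: never ask about an aspect twice
    return order


# Hypothetical global preferences G (not taken from the dataset).
G = {"Performance": 0.4, "Price": 0.3, "Security": 0.2, "Updates": 0.1}
print(sample_aspect_order(G, 3, random.Random(0)))
```

Different seeds produce different question orders, while frequently mentioned aspects such as the hypothetical "Performance" entry still tend to be asked first.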

3.2.3. Interaction Sampling and User Profile

To ensure that conversations are grounded in real interactions and to guide the user simulator effectively during semantic-level conversations, it is essential to sample user interactions from the traditional sequential recommendation dataset. To sample user interactions, we employ a weighted distribution that assigns a higher probability of selection to interactions that have longer reviews and diverse categories. The distribution is defined as $w(l_{i}^{2} \times \texttt{diversity})$, where $w(\cdot)$ denotes the weight, $l_{i}^{2}$ represents the squared review length for interaction $d_{i}$, and diversity is inversely proportional to the number of apps already sampled within the same category as $d_{i}$. This approach ensures that longer reviews are more likely to be selected while also promoting diversity in the sampled interactions across different app categories. Furthermore, longer reviews tend to cover a broader spectrum of topics, aspects, and sentiments related to the app. This increased information density enhances the likelihood of gaining deeper insight into user preferences, thereby offering more informative guidance for the user simulator.
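A minimal sketch of this weighted sampling is given below. Several details are assumptions made for illustration: review length is measured in words, diversity is computed as $1 / (1 + n_c)$ with $n_c$ the number of interactions already sampled from the same category (the added 1 avoids division by zero), and the names `interaction_weight` and `sample_interactions` as well as the dictionary keys are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def interaction_weight(review_length: int, already_sampled: int) -> float:
    """Weight l_i^2 * diversity, with diversity = 1 / (1 + already_sampled)."""
    return (review_length ** 2) * (1.0 / (1 + already_sampled))


def sample_interactions(interactions: List[Dict], n: int,
                        rng: random.Random) -> List[Dict]:
    """Sample n distinct interactions; assumes every review is non-empty."""
    pool = list(interactions)  # copy so the caller's list is left untouched
    per_category: Counter = Counter()
    sampled: List[Dict] = []
    for _ in range(min(n, len(pool))):
        weights = [
            interaction_weight(len(d["review"].split()), per_category[d["category"]])
            for d in pool
        ]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen = pool.pop(idx)
        per_category[chosen["category"]] += 1  # lowers future weights in this category
        sampled.append(chosen)
    return sampled
```

Because the weights are recomputed after every draw, a category that has already contributed many interactions becomes progressively less likely to be chosen again.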

To construct the user profile, we gather all previous interactions associated with the user involved in the sampled interaction $d_{i}$. We accumulate the values for all the aspects and refer to the profile for the $k$-th user as $u_{k} = \{(a_{i} = v_{i});\ i \in 1, 2, \cdots, m\}$. Both the user profile and the sampled interaction collectively guide the user simulator. They provide essential information and insights into the user’s preferences, behaviors, and past interactions, facilitating the generation of contextually relevant responses in the simulation framework.

3.2.4. User System Simulation

The computer simulator initiates the conversation by deciding to ask a question aimed at understanding the user’s interests. This involves sampling an aspect from global user preferences $G$ according to the normalized probability distribution over the aspects. We employ gpt-3.5-turbo to generate a contextual natural language query, with the prompt guiding the formulation of the question. The human simulator responds to the question posed by the computer simulator. If the sampled review discussed the aspect in question, the human simulator provides an answer. Otherwise, the response indicates disinterest (i.e., value 0) in that particular aspect. Once again, gpt-3.5-turbo is utilized to generate a natural language response, with the prompt instructing the model to formulate a response based on the aspect value and the conversational context. Following several exchanges, at turn $k$, a recommendation is presented to the user. This recommendation corresponds to the actual app that the user had interacted with in the sampled interaction $d_{i}$. Following the recommendation, the user simulator may pose follow-up questions based on the user’s profile characteristics. These questions are sampled in accordance with the user’s preferences captured in the user profile, such as a propensity for prioritizing security concerns. Subsequently, the computer simulator provides answers to the user simulator’s inquiries, drawing upon information from rich metadata about apps.

We employ automatic detection mechanisms to identify failed simulations, particularly focusing on scenarios where multiple aspects are queried that users express no interest in. Conversations of this nature fail to contribute meaningfully to guiding recommendation models due to their lack of relevance to the user’s preferences. Moreover, cases where no user history is available have also been sampled, enabling the simulation of cold-start scenarios. This approach ensures that recommendation models cannot get away with only considering the historical interactions of the users. The final dataset has undergone human verification to ensure the informativeness and coherence of the dialogs.
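The semantic-level exchange and the detection of failed simulations can be sketched as follows. The function `simulate_semantic_turns` is hypothetical, and the failure threshold `max_uninterested` is an assumed parameter: the text only states that conversations with multiple uninteresting aspects are flagged, not the exact criterion.

```python
from typing import Dict, List, Tuple, Union

Value = Union[str, int]


def simulate_semantic_turns(asked_aspects: List[str],
                            review_aspects: Dict[str, str],
                            max_uninterested: int = 3
                            ) -> Tuple[List[Tuple[str, Value]], bool]:
    """Answer each queried aspect and flag the conversation if it failed.

    asked_aspects:  aspects sampled by the computer simulator, in order.
    review_aspects: aspect -> value extracted from the sampled review.
    Returns the (aspect, value) turns and a failure flag.
    """
    turns: List[Tuple[str, Value]] = []
    uninterested = 0
    for aspect in asked_aspects:
        if aspect in review_aspects:
            turns.append((aspect, review_aspects[aspect]))
        else:
            # The review never discussed this aspect: value 0 means disinterest.
            turns.append((aspect, 0))
            uninterested += 1
    # Assumed criterion: too many uninteresting aspects marks a failed simulation.
    failed = uninterested >= max_uninterested
    return turns, failed
```

Each resulting (aspect, value) pair would then be passed to the language model to produce the corresponding natural language turn, and flagged conversations would be discarded.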

Figure 3. Top-10 categories in the dataset.

3.3. Dataset Features

The proposed dataset comprises several features, encompassing users’ historical interactions, explicit interest elicitation turns, an annotated recommendation turn, and subsequent question-answer exchanges about the recommended app. Table 3 provides detailed descriptions for the important features of MobileConvRec.

In addition to the basic metadata available in MobileRec, we have broadened our dataset’s scope to encompass more exhaustive metadata attributes. This expansion entails comprehensive details regarding the permissions sought by apps from users, intricacies of data collection specifying its purpose, insights into data-sharing practices with third parties, and security measures governing data transmission and sharing. Furthermore, we provide access to the executable (i.e., .apk) files of free apps, enabling researchers to conduct thorough analyses of the actual binary code, should they desire a deeper exploration.

The sharing of .apk files will be restricted to research and educational purposes only, as per Google Play store policies that prohibit the open distribution of executable files. Access will be granted exclusively to academic researchers upon request. The conversations and app metadata are accessible to the public via Hugging Face datasets.

Figure 4. The distribution of the number of turns per dialog.

3.4. Dataset Analysis

In Figure 3, we showcase the top 10 app categories. Notably, these categories collectively constitute over 50% of the dataset, aligning closely with the composition of the original recommendation dataset, MobileRec. Our carefully designed sampling approach facilitated the inclusion of apps from a wide range of categories (spanning all 45 categories) while maintaining proportions that reflect the real-world popularity of these categories.

In Figure 4, we depict the distribution of the number of turns per dialogue. The majority of dialogues consist of between 10 and 16 turns, with each turn comprising one user utterance and one system utterance. We believe that this distribution of turns per dialogue poses a reasonable challenge for recommendation models, while also providing ample learning opportunities for them.

In Figure 5, we illustrate the distribution of the number of words per turn. A noteworthy observation is the diversity within the dataset, which encompasses turns of varying lengths, from brief exchanges to more extensive ones. This breadth of conversational lengths is anticipated to equip recommendation models with the capability to handle a wide spectrum of conversations, ensuring their adaptability to diverse user interactions.

Figure 5. The distribution of the number of words per turn.

3.5. Potential Usage Scenarios

The proposed dataset, MobileConvRec, serves as a diverse resource for training both sequential and conversational recommender systems, offering a unique blend of users’ historical interactions and contextual conversational data. This integration presents an opportunity for recommender systems to obtain richer insights and refine their recommendations with a deeper understanding of user preferences. Notably, existing research has predominantly focused on either historical interactions or conversational context, often overlooking the potential synergies afforded by combining the two.

The inclusion of rich metadata about apps further enhances the dataset’s utility, enabling follow-up question-answering functionalities that have been largely neglected by conversational recommender systems thus far. Moreover, the availability of executable files in the dataset offers a unique opportunity for conducting security and privacy-related analyses, allowing for program analysis and a thorough verification of the functional correctness of the apps. Unlike relying solely on developer-provided information, which may be incomplete or biased, the availability of executable files facilitates a more robust evaluation process. By analyzing the executable files, researchers can explore the inner workings of the apps, uncovering potential security vulnerabilities and privacy concerns. Such comprehensive analyses of executables hold promise in bolstering the safety and security of recommendations, thereby enhancing user trust and confidence.

In this work, our primary focus lies in establishing baseline results for conversational mobile app recommendations. Moreover, we anticipate manifold potential applications of this dataset across various domains and encourage the broader research community to explore and leverage its diverse capabilities.

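Table 4. Recommendation Generation Experiment.
Input | Model | Success Rate (%)
Dialog Context | GPT-2 | 65.6
Dialog Context | Flan-T5 | 36.4
Dialog Context + Previous Interactions | GPT-2 | 66.6
Dialog Context + Previous Interactions | Flan-T5 | 37.7
Dialog Context + Sampled Candidates | GPT-2 | 85.3
Dialog Context + Sampled Candidates | Flan-T5 | 86.0
Dialog Context + Similar Candidates | GPT-2 | 47.9
Dialog Context + Similar Candidates | Flan-T5 | 53.6
Dialog Context + Sampled Candidates + Previous Interactions | GPT-2 | 82.6
Dialog Context + Sampled Candidates + Previous Interactions | Flan-T5 | 86.3
Dialog Context + Similar Candidates + Previous Interactions | GPT-2 | 47.2
Dialog Context + Similar Candidates + Previous Interactions | Flan-T5 | 54.7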

4. Baselines

We fine-tuned pre-trained GPT-2 (Radford2019LanguageMA) and Flan-T5 (Chung2022) models to establish baseline results across various experimental setups.

Table 5. Candidate Apps Ranking Experiment: Hit@1-to-10.
Input | Model | Hit@1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10
Dialog Context + Sampled Candidates | GPT-2 | 0.596 | 0.691 | 0.741 | 0.774 | 0.797 | 0.820 | 0.835 | 0.848 | 0.864 | 0.879
Dialog Context + Sampled Candidates | Flan-T5 | 0.893 | 0.938 | 0.951 | 0.958 | 0.962 | 0.965 | 0.966 | 0.970 | 0.972 | 0.973
Dialog Context + Similar Candidates | GPT-2 | 0.272 | 0.371 | 0.436 | 0.485 | 0.527 | 0.564 | 0.600 | 0.638 | 0.673 | 0.701
Dialog Context + Similar Candidates | Flan-T5 | 0.567 | 0.702 | 0.756 | 0.800 | 0.832 | 0.848 | 0.862 | 0.874 | 0.884 | 0.895
Dialog Context + Sampled Candidates + Previous Interactions | GPT-2 | 0.706 | 0.791 | 0.823 | 0.846 | 0.861 | 0.879 | 0.890 | 0.899 | 0.908 | 0.914
Dialog Context + Sampled Candidates + Previous Interactions | Flan-T5 | 0.897 | 0.944 | 0.956 | 0.961 | 0.963 | 0.967 | 0.970 | 0.973 | 0.973 | 0.977
Dialog Context + Similar Candidates + Previous Interactions | GPT-2 | 0.343 | 0.460 | 0.526 | 0.577 | 0.622 | 0.653 | 0.686 | 0.712 | 0.744 | 0.766
Dialog Context + Similar Candidates + Previous Interactions | Flan-T5 | 0.577 | 0.715 | 0.777 | 0.814 | 0.836 | 0.856 | 0.873 | 0.883 | 0.893 | 0.905

Table 6. Candidate Apps Ranking Experiment: NDCG@1-to-10.
Input | Model | NDCG@1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10
Dialog Context + Sampled Candidates | GPT-2 | 0.596 | 0.656 | 0.681 | 0.695 | 0.704 | 0.713 | 0.718 | 0.721 | 0.726 | 0.731
Dialog Context + Sampled Candidates | Flan-T5 | 0.893 | 0.921 | 0.928 | 0.931 | 0.933 | 0.934 | 0.934 | 0.935 | 0.936 | 0.936
Dialog Context + Similar Candidates | GPT-2 | 0.272 | 0.335 | 0.367 | 0.388 | 0.404 | 0.417 | 0.429 | 0.441 | 0.452 | 0.460
Dialog Context + Similar Candidates | Flan-T5 | 0.567 | 0.652 | 0.679 | 0.698 | 0.710 | 0.716 | 0.721 | 0.724 | 0.728 | 0.731
Dialog Context + Sampled Candidates + Previous Interactions | GPT-2 | 0.706 | 0.760 | 0.776 | 0.786 | 0.792 | 0.798 | 0.802 | 0.804 | 0.807 | 0.809
Dialog Context + Sampled Candidates + Previous Interactions | Flan-T5 | 0.897 | 0.926 | 0.932 | 0.935 | 0.935 | 0.937 | 0.938 | 0.939 | 0.939 | 0.940
Dialog Context + Similar Candidates + Previous Interactions | GPT-2 | 0.343 | 0.417 | 0.450 | 0.472 | 0.489 | 0.500 | 0.511 | 0.519 | 0.529 | 0.535
Dialog Context + Similar Candidates + Previous Interactions | Flan-T5 | 0.577 | 0.664 | 0.695 | 0.711 | 0.720 | 0.727 | 0.732 | 0.735 | 0.738 | 0.742

4.1. Experimental Settings

To ensure robust evaluation, we partitioned the data into distinct training, validation, and testing sets based on the date of interaction. Specifically, interactions occurring on past dates were allocated to the training and validation sets, with the most recent dates exclusively designated for testing. This approach ensures that the models are evaluated on interactions that occur after those seen during training, enabling a comprehensive assessment of their performance across different periods. We consider the following experimental setups.
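A date-based partition of this kind can be sketched as follows. The function `temporal_split`, the cutoff dates, and the choice to place the validation set chronologically between the training and testing sets are assumptions made for illustration; the actual cutoffs are not given here. The `date` field follows the format of the example interaction in Table 3.

```python
from datetime import date
from typing import Dict, List, Tuple


def temporal_split(dialogs: List[Dict], val_start: date, test_start: date
                   ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split dialogs by interaction date into train, validation, and test."""
    train: List[Dict] = []
    val: List[Dict] = []
    test: List[Dict] = []
    for dialog in dialogs:
        when = date.fromisoformat(dialog["date"])  # e.g. "2021-12-01"
        if when >= test_start:
            test.append(dialog)   # most recent dates are reserved for testing
        elif when >= val_start:
            val.append(dialog)    # assumption: validation precedes the test period
        else:
            train.append(dialog)
    return train, val, test
```

Splitting by date rather than at random keeps every test interaction later than all training interactions, which avoids leaking future behavior into training.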

Recommendation Generation. We assess the models’ ability to generate app names as recommendations through fuzzy matching between the ground truth and the generated app names. This experiment involves four different types of inputs to the models: (i) only the dialog context; (ii) the combination of the dialog context and users’ historical interactions; (iii) the combination of the dialog context and the set of candidate apps available for recommendation; and (iv) the combination of the dialog context, the set of candidate apps, and previous interactions. In all our experiments, the number of candidate apps is set to 25: one is the ground truth app that the user interacted with, and the remaining 24 are other apps. These 24 candidate apps are generated in one of two ways: (i) randomly sampling 24 apps (Sampled Candidates); or (ii) selecting them from a group of apps similar to the ground truth recommended app (Similar Candidates).

Candidate Apps Ranking. We assess the models’ ability to rank a set of candidate apps. We consider two different types of inputs to the models: (i) the combination of the dialog context and the set of candidate apps; and (ii) the combination of the dialog context, the set of the candidate apps, and the users’ historical interactions. Similar to the previous experiment, we utilize two sets of candidate apps: (i) Sampled Candidates, which consists of the ground truth app and 24 apps randomly selected from the entire app pool, and (ii) Similar Candidates, which includes the ground truth app and 24 apps similar to the ground truth app.
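The construction of a 25-app candidate set can be sketched as below. The function `build_candidates` is hypothetical. For Sampled Candidates, `pool` would be the entire app pool; for Similar Candidates, it would be restricted to apps similar to the ground truth, where the similarity criterion is left open because it is not specified here. Shuffling the final list is an added assumption so that the position of the ground truth carries no signal.

```python
import random
from typing import List


def build_candidates(ground_truth: str, pool: List[str], rng: random.Random,
                     n_candidates: int = 25) -> List[str]:
    """Return the ground truth app plus (n_candidates - 1) distinct other apps."""
    # Deduplicate and exclude the ground truth so negatives are distinct apps.
    others = sorted(set(pool) - {ground_truth})
    negatives = rng.sample(others, n_candidates - 1)
    candidates = negatives + [ground_truth]
    rng.shuffle(candidates)  # assumption: hide the ground truth's position
    return candidates
```

With the default setting, each call yields exactly one ground truth app and 24 negatives, matching the setup described for both experiments.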

Response Generation. In this experiment, we evaluate the models’ proficiency in generating appropriate responses based on dialog context. This encompasses the models’ ability to elicit users’ preferences, recommend suitable apps in natural language text, and respond to users’ inquiries about the recommended apps.

4.2. Evaluation Metrics

For each experimental setup, we use well-established metrics. (i) For the recommendation generation task, we employ the success rate metric. Specifically, the success rate is the percentage of test cases in which the generated app name and the ground-truth app name have a Levenshtein distance similarity ratio (Levenshtein1965BinaryCC) of more than 0.95. (ii) For the apps ranking task evaluation, we utilize standard metrics (jarvelin2002cumulated; recbole[1.0]; li2020sampling): Hit@K and NDCG@K, where $K \in \{1, 2, 3, \cdots, 10\}$. (iii) For response generation task evaluation, we use BLEU score (papineni-etal-2002-bleu).
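The first two metrics can be sketched as follows. The similarity ratio is normalized here as $1 - \text{distance} / \max(|a|, |b|)$, which is an assumption: other Levenshtein-based ratios exist and the exact variant used in the experiments is not stated. Hit@K and NDCG@K are written for the single-ground-truth case, where `rank` is the 1-based position of the ground truth app in the model's ranking. All function names are illustrative.

```python
import math
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def success_rate(generated: List[str], truths: List[str],
                 threshold: float = 0.95) -> float:
    """Percentage of cases whose similarity ratio exceeds the threshold."""
    hits = sum(similarity_ratio(g, t) > threshold
               for g, t in zip(generated, truths))
    return 100.0 * hits / len(truths)


def hit_at_k(rank: int, k: int) -> float:
    """1 if the ground truth appears within the top k positions, else 0."""
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```

Reported scores are averages of these per-example values over the test set. Under this formulation Hit@1 and NDCG@1 coincide, which agrees with the identical first columns of Table 5 and Table 6.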

5. Results and Discussion

Table 4 presents the results of the recommendation generation experiment, which focused on training models to directly generate recommended app names. We notice that incorporating candidate apps (Sampled Candidates) as input to the model significantly enhances its ability to accurately generate the recommended app names. Specifically, we observe a remarkable 136.26% relative improvement in the performance of the Flan-T5 model when both dialog context and sampled candidate apps are utilized, compared to when only dialog context is employed as input (86.0 vs 36.4). An improvement can also be observed for the GPT-2 model when both dialog context and sampled candidates are utilized (85.3 vs 65.6). The substantial improvement observed was anticipated, as providing sampled candidate apps (i.e., 25 apps) as input significantly reduces the pool of apps to be recommended (from over 1.7K in the training data to just 25). Furthermore, since the names of the apps are provided in the context, the models are less prone to errors during generation. Although there is no definitive winner, our observations indicate that the Flan-T5 model outperforms the GPT-2 model when the input to the model includes candidate apps, whether these candidate apps are Similar Candidates or Sampled Candidates. These findings imply that the Flan-T5 model demonstrates superior proficiency compared to GPT-2 in selecting the appropriate apps when they are provided as context.
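For reference, the relative improvements quoted above follow from the success rates in Table 4 when computed as (new − old) / old:

\[
\frac{86.0 - 36.4}{36.4} \times 100\% \approx 136.26\% \ \text{(Flan-T5)}, \qquad
\frac{85.3 - 65.6}{65.6} \times 100\% \approx 30.03\% \ \text{(GPT-2)}.
\]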

Additionally, we observe ambiguity regarding the impact of incorporating historical interactions on the results. While intuitively, leveraging historical interactions should enhance performance, our experiments do not consistently demonstrate improvements. This suggests that simply passing historical interactions as part of the input may not be the optimal approach and warrants further investigation.

In addition, we observe a noticeable difference in model performance when comparing the use of Sampled Candidates versus Similar Candidates as input. For instance, the GPT-2 and Flan-T5 models show approximately 43.8% and 37.6% lower success rates, respectively, when the input consists of dialog context and similar candidates, compared to when the input includes dialog context and sampled candidates. This decline in performance is expected, as the presence of similar candidates makes it more challenging for the models to select the correct app.

Moreover, upon subjectively evaluating the success and failure cases, we observed that the pre-trained models exhibit a bias towards more popular apps. For instance, VLC for Android is consistently favored over MX Player Pro. This observation underscores potential avenues for further investigation into the factors influencing model preferences and their implications for recommendation systems.

Table 5 and Table 6 present the results for the Hit and NDCG metrics, respectively, for the candidate ranking experiment. Both models exhibit improved Hit and NDCG scores as the value of K increases, which is the expected behavior. Flan-T5 scores higher than GPT-2 in every configuration and at every K reported. Particularly noteworthy is the case where the input comprises only Dialog Context and Sampled Candidates, where the Flan-T5 model outperforms the GPT-2 model by 35.74% in Hit@2 and by 40.39% in NDCG@2. These findings underscore the potential superiority of the Flan-T5 model in candidate ranking tasks. Furthermore, as in the recommendation generation experiment, we observe that the performance of all models declines when using Similar Candidates as input compared to using Sampled Candidates.

Table 7. Response Generation Experiment.
Model | BLEU-4 Score
GPT-2 | 0.1934
Flan-T5 | 0.2998

The performance of the models in the response generation experiment is shown in Table 7. In this experiment, we again observe that the Flan-T5 model outperforms the GPT-2 model by 55.01%, achieving a BLEU-4 score of 0.2998 compared to 0.1934. While the BLEU scores for both models may not be particularly high, our subjective evaluations suggest that both models exhibit good performance. As an example, during user needs elicitation and response generation, generations such as “What average rating do you typically look for when deciding to install a mobile application?” or “Got it! how important is the reputation or credibility of the developer to you when choosing a mobile app?” can yield significantly different BLEU scores based on the corresponding ground truth utterance. However, from a qualitative perspective, both responses are perfectly reasonable and effectively advance the conversation.

6. Conclusion

This paper introduces MobileConvRec, a dataset tailored for conversational mobile app recommendations. The key novelty of the proposed dataset is the integration of users’ historical interactions within multi-turn dialogs, providing a more holistic view of user preferences. MobileConvRec comprises over 12.2K multi-turn dialogs encompassing 11.8K unique users interacting with 1,730 apps spanning 45 categories. These interactions accumulate to over 156K turns in conversations, providing an unparalleled level of detail for understanding user preferences across diverse app categories. Furthermore, each app in the dataset includes comprehensive metadata, including details such as app permissions, data collection and sharing practices, security policies implemented by developers, and binary executables of free apps. We showcase the utility of MobileConvRec as an experimental testbed for research in conversational app recommendation by conducting a comparative evaluation of various pre-trained language models. This evaluation also establishes baseline results that can serve as valuable reference points for the research community.

References

  • [1] Movielens. https://grouplens.org/datasets/movielens/, 2022. Accessed: 2022-11-06.
  • [2] Top 20 play store app reviews. https://www.kaggle.com/datasets/odins0n/top-20-play-store-app-reviews-daily-update, 2022. Accessed: 2022-12-09.
  • [3] Apple. Apple app store. https://apps.apple.com/, 2022. Accessed: 2022-11-06.
  • [4] L. Ceci. Number of apps available in leading app store. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/. Accessed: 2022-11-06.
  • [5] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 815–824, 2016.
  • [6] HyungΒ Won Chung, LeΒ Hou, Shayne Longpre, Barret Zoph, YiΒ Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, ShixiangΒ Shane Gu, Zhuyun Dai, Mirac Suzgun, ** Huang, AndrewΒ M. Dai, Hongkun Yu, Slav Petrov, EdΒ H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, QuocΒ V. Le, and Jason Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
  • [7] J.Β Degenhard. Number of apps available in leading app store. https://www.statista.com/forecasts/1143723/smartphone-users-in-the-world. Accessed: 2022-02-02.
  • [8] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931, 2015.
  • [9] Umar Farooq, ABΒ Siddique, Fuad Jamour, Zhijia Zhao, and Vagelis Hristidis. App-aware response synthesis for user reviews. In 2020 IEEE International Conference on Big Data (Big Data), pages 699–708. IEEE, 2020.
  • [10] Moghis Fereidouni, Adib Mosharrof, Umar Farooq, and ABΒ Siddique. Proactive prioritization of app issues via contrastive learning. In 2022 IEEE International Conference on Big Data (Big Data), pages 535–544. IEEE, 2022.
  • [11] Zuohui Fu, Yikun Xian, Yaxin Zhu, Shuyuan Xu, Zelong Li, Gerard DeΒ Melo, and Yongfeng Zhang. Hoops: Human-in-the-loop graph reasoning for conversational recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2415–2421, 2021.
  • [12] Cuiyun Gao, Jichuan Zeng, Xin Xia, David Lo, MichaelΒ R Lyu, and Irwin King. Automating app review response generation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 163–175. IEEE, 2019.
  • [13] Google. Google play store. https://play.google.com/store/apps, 2022. Accessed: 2022-11-06.
  • [14] Maarten Grootendorst. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794, 2022.
  • [15] ShirleyΒ Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. Inspired: Toward sociable recommendation dialog systems. arXiv preprint arXiv:2009.14306, 2020.
  • [16] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, BodhisattwaΒ Prasad Majumder, Nathan Kallus, and Julian McAuley. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM international conference on information and knowledge management, pages 720–730, 2023.
  • [17] Kalervo JΓ€rvelin and Jaana KekΓ€lΓ€inen. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
  • [18] Meihuizi Jia, Ruixue Liu, Peiying Wang, Yang Song, Zexi Xi, Haobin Li, Xin Shen, Meng Chen, **hui Pang, and Xiaodong He. E-convrec: a large-scale conversational recommendation dataset for e-commerce customer service. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5787–5796, 2022.
  • [19] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston. Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue. arXiv preprint arXiv:1909.03922, 2019.
  • [20] Hammad Khalid. On identifying user complaints of ios apps. In 2013 35th international conference on software engineering (ICSE), pages 1474–1476. IEEE, 2013.
  • [21] Hammad Khalid, Emad Shihab, Meiyappan Nagappan, and AhmedΒ E Hassan. What do mobile app users complain about? IEEE software, 32(3):70–77, 2014.
  • [22] VladimirΒ I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 10:707–710, 1965.
  • [23] Dong Li, Ruoming **, **g Gao, and Zhi Liu. On sampling top-k recommendation evaluation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2114–2124, 2020.
  • [24] Raymond Li, Samira EbrahimiΒ Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. Towards deep conversational recommendations. Advances in neural information processing systems, 31, 2018.
  • [25] Lizi Liao, LeΒ Hong Long, Zheng Zhang, Minlie Huang, and Tat-Seng Chua. Mmconv: an environment for multimodal conversational search across multiple domains. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pages 675–684, 2021.
  • [26] Yuanxing Liu, Weinan Zhang, Baohua Dong, Yan Fan, Hang Wang, Fan Feng, Yifan Chen, Ziyu Zhuang, Hengbin Cui, Yongbin Li, etΒ al. U-need: A fine-grained dataset for user needs-centric e-commerce conversational recommendation. arXiv preprint arXiv:2305.04774, 2023.
  • [27] Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. Towards conversational recommendation over multi-type dialogs. arXiv preprint arXiv:2005.03954, 2020.
  • [28] Walid Maalej and Hadeer Nabil. Bug report, feature request, or simply praise? on automatically classifying app reviews. In 2015 IEEE 23rd international requirements engineering conference (RE), pages 116–125. IEEE, 2015.
  • [29] M.Β H. Maqbool, Umar Farooq, Adib Mosharrof, A.Β B. Siddique, and Hassan Foroosh. Mobilerec: A large scale dataset for mobile apps recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 3007–3016, New York, NY, USA, 2023. Association for Computing Machinery.
  • [30] Julian McAuley. Amazon product data. http://jmcauley.ucsd.edu/data/amazon/, 2022. Accessed: 2022-11-06.
  • [31] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • [32] Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 845–854, 2019.
  • [33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-**g Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
  • [34] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • [35] Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. Coached conversational preference elicitation: A case study in understanding movie preferences. 2019.
  • [36] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019.
  • [37] Kamonphop Srisopha, Daniel Link, and Barry Boehm. How should developers respond to app reviews? features predicting the success of developer responses. EASE 2021, page 119–128, New York, NY, USA, 2021. Association for Computing Machinery.
  • [38] HuΒ Xu, Seungwhan Moon, Honglei Liu, Bing Liu, Pararth Shah, and PhilipΒ S Yu. User memory reasoning for conversational recommendation. arXiv preprint arXiv:2006.00184, 2020.
  • [39] WayneΒ Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min, Zhichao Feng, Xinyan Fan, XuΒ Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. Recbole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In CIKM, pages 4653–4664. ACM, 2021.
  • [40] Kun Zhou, Yuanhang Zhou, WayneΒ Xin Zhao, Xiaoke Wang, and Ji-Rong Wen. Towards topic-guided conversational recommender system. arXiv preprint arXiv:2010.04125, 2020.