MobileConvRec: A Conversational Dataset for Mobile Apps Recommendations

Srijata Maji … University of Kentucky , Moghis Fereidouni … University of Kentucky , Vinaik Chhetri … Louisiana State University , Umar Farooq 0000-0001-7229-9847 Louisiana State University and A.B. Siddique 0000-0002-3587-7289 University of Kentucky

(2018)

Abstract.

Existing recommendation systems have focused on two paradigms: (i) historical user-item interaction-based recommendations and (ii) conversational recommendations. Conversational recommendation systems facilitate natural language dialogues between users and the system, allowing the system to solicit users’ explicit needs while enabling users to inquire about recommendations and provide feedback. Due to substantial advancements in natural language processing, conversational recommendation systems have gained prominence. Existing conversational recommendation datasets have greatly facilitated research in their respective domains. Despite the exponential growth in mobile users and apps in recent years, research in conversational mobile app recommender systems has faced substantial constraints. This limitation can primarily be attributed to the lack of high-quality benchmark datasets specifically tailored for mobile apps. To facilitate research for conversational mobile app recommendations, we introduce $\mathsf{MobileConvRec}$ . $\mathsf{MobileConvRec}$ simulates conversations by leveraging real user interactions with mobile apps on the Google Play store, originally captured in large-scale mobile app recommendation dataset $\mathsf{MobileRec}$ . The proposed conversational recommendation dataset synergizes sequential user-item interactions, which reflect implicit user preferences, with comprehensive multi-turn conversations to effectively grasp explicit user needs. $\mathsf{MobileConvRec}$ consists of over 12K multi-turn recommendation-related conversations spanning 45 app categories. Furthermore, $\mathsf{MobileConvRec}$ presents rich metadata for each app such as permissions data, security and privacy-related information, and binary executables of apps, among others. We demonstrate that $\mathsf{MobileConvRec}$ can serve as an excellent testbed for conversational mobile app recommendation through a comparative study of several pre-trained large language models. The $\mathsf{MobileConvRec}$ dataset is available at https://huggingface.co/datasets/recmeapp/MobileConvRec.

Conversational Recommendations, App Recommendations.

^†^†copyright: acmcopyright^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†ccs: Information systems Personalization^†^†ccs: Information systems Recommender systems^†^†ccs: Information systems Test collections

1. Introduction

Refer to caption — Figure 1. A sample natural language dialog between the user and the system might unfold as follows: The system draws insights from the user’s historical interactions, proactively elicits the user’s needs explicitly, and synergizes this information to make more meaningful recommendations. Additionally, the user may pose follow-up questions regarding the recommended app.

Table 1. Comparison of existing mobile apps datasets with

\mathsf{MobileConvRec}

Dataset features	RRGen (rrgen, )	AARSynth (aarsynth, )	Srisopha et al. (srisopha-how, )^∗	PPrior (pprior, )^†	$\mathsf{MobileRec}$ (mobilerec, )	$\mathsf{MobileConvRec}$
Multi-turn conversations	✗	✗	✗	✗	✗	✓
Multiple interactions by a single user	✗	✗	✗	✗	✓	✓
Interaction timestamp	✗	✗	✓	✓	✓	✓
Security & privacy-related metadata	✗	✗	✗	✗	✗	✓
App executables	✗	✗	✗	✗	✗	✓
Number of apps	58	103	1,600	9,869	10,173	1,730
Number of app categories	15	23	32	48	48	45
^∗ (srisopha-how, ) is not publicly available. ^† (pprior, ) contains only negative user reviews.

In the past decade, mobile apps have seen exponential growth, with over 5 billion users reported (number-of-smartphone-users, ). These apps are utilized for diverse purposes, including productivity, news consumption, entertainment, ride-sharing, and food services, to name a few. Consequently, app distribution channels have seen significant growth. Notably, the Apple App Store (appstore, ) and Google Play (googleplay, ) alone host over 2.2 million and 3.5 million apps, respectively (number-of-apps-on-stores, ). The expanding app marketplaces present a significant challenge for users in efficiently discovering apps that match their preferences. Conversational recommendation systems can play a crucial role in alleviating users’ cognitive overload by discerning both their implicit needs, inferred from previous interaction history, and their explicit needs, expressed through conversational interactions. As illustrated in Figure 1, an app recommendation system can suggest new apps to users by integrating both their implicit preferences (i.e., prior installations and interactions with apps) and explicit needs (i.e., current conversation with the system).

Unlike traditional recommendation systems, which primarily rely on users’ interaction history, conversational recommendation systems possess the potential to: (i) understand users’ historical interactions alongside multi-turn natural language dialog, and (ii) generate human-like responses to not only recommend items but also facilitate preference refinement, knowledgeable discussion, and recommendation justification. Conversational recommendation systems have shown remarkable success in a wide range of domains, such as movies (dodge2015evaluating, ; dodge2015evaluating, ; kang2019recommendation, ), music (moon2019opendialkg, ), sports (moon2019opendialkg, ), e-commerce (liu2023u, ; jia2022convrec, ), and travel (liao2021mmconv, ), among others (xu2020user, ). The existence of datasets specifically designed for various domains, featuring multi-turn conversational interactions (dodge2015evaluating, ; he2023large, ; liu2020towards, ), has played a crucial role in advancing the development and refinement of conversational recommendation systems.

Several prominent datasets focus on mobile apps, including RRGen (rrgen, ), AARSyth (aarsynth, ), Srisopha et al.(srisopha-how, ), PPrior(pprior, ), and $\mathsf{MobileRec}$ (mobilerec, ), among others. However, it is worth noting that RRGen only comprises single-turn interactions and encompasses fewer than 100 apps from less than 20 categories. While datasets like AARSynth, Srisopha et al., and PPrior offer millions of interactions, their lack of unique user identifiers renders them inadequate for develo** any type of app recommendation system. Moreover, PPrior contains only negative user interactions and the dataset from Srisopha et al. (srisopha-how, ) is not publicly available. Although $\mathsf{MobileRec}$ (mobilerec, ) provides a large-scale dataset with unique user identifiers and has been employed in constructing app recommendation systems, it lacks multi-turn natural language interactions. A comparison of mobile app datasets is presented in Table 1.

In this work, we attempt to bridge this research gap by offering a large-scale, rich, and diverse benchmark dataset, which we call $\mathsf{MobileConvRec}$ . This dataset is designed to facilitate researchers in the development of conversational app recommendation systems. We construct $\mathsf{MobileConvRec}$ by sampling real user interactions with mobile apps sourced from the Google Play store, originally captured in $\mathsf{MobileRec}$ , serving as the basis for our conversational dataset. Our methodology for simulating natural language dialogues between users and the system is rooted in the sampled interactions, ensuring that the simulation faithfully reflects the user’s actual interactions in retrospect. To this end, we develop a theoretical framework designed to process a sequential recommendation dataset containing user interactions with various apps over time (e.g., time-stamped reviews). The framework subsequently generates a conversational recommendation dataset as its output. To streamline the simulation process, we divide it into two steps. Firstly, the simulation generates a dialogue outline at a semantic level. Subsequently, in the second step, this semantic information is transformed into contextual natural language utterances. Specifically, the conversation is initiated by the computer simulator by selecting a question aimed at understanding the user’s interests. This is accomplished by sampling an aspect from global user preferences, following a normalized probability distribution over all aspects. In response to the computer simulator’s inquiry, the human simulator provides a reply, considering the review text associated with the sampled interaction. To the best of our knowledge, this is the only recommendation dataset that integrates timestamped users’ historical interactions and multi-turn dialogs, enabling the development of effective conversational recommendation systems.

Table 2. MobileConvRec’s comparison with the well-known conversational recommendation datasets in different domains.

Datasets	#Dialogs	# Turns	#Users	#Apps	Domain (s)
FacebookRec (dodge2015evaluating, )	1M	6M	-	-	Movies
REDIAL (li2018towards, )	10K	182K	956	6,281	Movies
GoRecDial (kang2019recommendation, )	9K	170K	-	-	Movies
OpenDialKG (moon2019opendialkg, )	15K	91K	-	-	Movies, books, sports, music
TG-ReDial (zhou2020towards, )	10K	129K	1,482	-	Movies
DuRecDial	10K	-	10K	-	Movies, music, food, restaurant
DuRecDial	10K	-	10K	-	news, weather
CCPE-M (radlinski2019coached, )	502	11K	-	-	Movies
INSPIRED (hayati2020inspired, )	1K	35K	999	1,967	Movies
Reddit-Movie-Large (he2023large, )	85K	133K	10K	24,326	Movies
Reddit-Movie-Base (he2023large, )	634K	1.6M	36K	51,203	Movies
U-NEED (liu2023u, )	7K	333K	-	-	E-commerce
E-ConvRec (jia2022convrec, )	25K	775K	-	-	E-commerce
HOOPS (fu2021hoops, )	-	11.6M	-	-	E-commerce
MGConvRec (xu2020user, )	7K	73K	-	-	Restaurant
MMConv (liao2021mmconv, )	5K	39K	-	-	Travel
$\mathsf{MobileConvRec}$	12.2K	156K	11.8K	1,730	All 45 Categories on Google Play ^†
^† Including food & drink, news & magazines, music, shop**, social, sports, weather, etc.

$\mathsf{MobileConvRec}$ contains over 12.2K multi-turn dialogs involving 11.8K unique users across 1,730 apps spanning 45 categories. These interactions result in over 156K turns in conversations. In addition to the basic metadata provided for each app in $\mathsf{MobileRec}$ , $\mathsf{MobileConvRec}$ offers comprehensive metadata for each, including permissions, data collection and sharing practices, security policies of app developers, and binary executables of free apps, among other details. Table 3 describes key features of the dataset. Furthermore, we provide a comparative comparison of our proposed dataset with the latest versions of well-established conversational recommendation datasets across various domains in Table 2.

Through a comparative study utilizing pre-trained large language models (LLMs) such as GPT-2 and Flan-T5, we demonstrate the utility of our dataset in facilitating research in the domain of conversational mobile app recommendations. In our analysis, we present results based on standard evaluation metrics such as Hit@K, NDCG@K, and BLEU for the baseline models. This comprehensive evaluation provides valuable insights into the performance of these models. Notably, our study serves a dual purpose: it lays the foundation for future research in this domain and establishes baseline results that can serve as a benchmark for future comparisons and advancements. Additionally, we identify areas for improvement and potential avenues for further exploration.

Specifically, this work makes the following contributions:

•

We present $\mathsf{MobileConvRec}$ , the most extensive collection of recommendation-related multi-turn natural language user-system dialogs to date. With over 156K dialog turns spanning a diverse range of more than 1.7K distinct apps sourced from Google Play, covering 45 categories, it stands as the unique dataset in its domain. Notably, this is the only mobile app dataset that features multi-turn conversations.
•

Our experimental study showcases the practical utility of $\mathsf{MobileConvRec}$ through the utilization of various state-of-the-art LLMs. Furthermore, we establish baseline results, highlighting the dataset’s potential role in driving advancements in conversational mobile app recommendations.
•

$\mathsf{MobileConvRec}$ comprises rich metadata about apps, facilitating overlooked follow-up question-answering regarding recommended apps in conversational recommender systems. Furthermore, the availability of executable files can aid in conducting security and privacy-related analyses, mitigating potential biases inherent in developer-provided information.

2. Related Work

Over the past few decades, numerous noteworthy works and datasets have significantly contributed to advancing the understanding and development of conversational recommendation systems. Moreover, there have been efforts to collect datasets for various purposes. Next, we discuss related work in the context of both conversational recommendation and mobile app datasets.

2.1. Datasets for Conversational Recommendations

There are several existing conversational recommendation datasets. We provide a brief list in Table 2. Initial research on conversational recommended systems primarily focused on user preferences among pre-determined choices (dodge2015evaluating, ; christakopoulou2016towards, ). Notably, FacebookRec (dodge2015evaluating, ) is based on four movie dialogue datasets derived from the Facebook movie dialog dataset: a question-answer (QA) dataset, a recommendation dataset, a mix of recommendation and QA dataset, and a general chit-chat dialogue from Reddit dataset. These synthetic datasets were generated using the ratings from MovieLens dataset (movielens, ) and the Open Movie Database (OMDb). The recommendation dataset is synthetically generated, providing single movie names as answers. The Reddit dataset shares similarities, involving natural conversations about movies, but the discourse is more free-form and not primarily focused on obtaining any recommendations.

In recent times, several studies and models have emerged that focus on engaging users in natural language multi-turn dialogs. These efforts prioritize real-time responses through sentiment analysis and seek to deliver desired recommendations (li2018towards, ; zhou2020towards, ). Different crowd-sourced datasets like ReDial (li2018towards, ), DuRecDial (liu2020towards, ), GoRecDial (kang2019recommendation, ), INSPIRED (hayati2020inspired, ) are human annotated with predefined goals, such as item recommendation and goal planning. The goal-oriented datasets seamlessly integrate elements of chitchat and task-oriented dialogs, specifically in the context of recommendation tasks. Another variant of ReDail called TG-ReDial (zhou2020towards, ), utilizes topic prediction to recommend movies. OpenDialKG (moon2019opendialkg, ) is built on top of Freebase to model dialogue logic through the traversal of the knowledge graph.

DuRecDial (liu2020towards, ) dataset focuses on the multilingual and cross-lingual conversational recommendation. Both E-ConvRec (jia2022convrec, ) and U-NEED (liu2023u, ) datasets are proposed for E-commerce conversational recommendation. E-ConvRec (jia2022convrec, ) features dialogs on pre-sales topics between users and customer service staff, while U-NEED (liu2023u, ) provides fine-grained annotations for user needs in pre-sales dialogs, covering five popular categories and including user behaviors before and after the conversations, facilitating the development and evaluation of conversational recommender systems. The HOOPS (fu2021hoops, ) dataset for E-commerce is created on a knowledge graph from Amazon reviews (amazon-dataset, ) to extract key entities, forming user-item interactions. Dialogs are then synthesized using templates, enabling the generation of substantial data for training policy and recommendation modules in conversational recommender systems. MGConvRex (xu2020user, ) focuses on facilitating restaurant bookings, while MMConv (liao2021mmconv, ) introduces multi-domain conversations, specifically within the context of travel. The Reddit-Movie (base and larger) (he2023large, ) shows empirical studies on conversational recommendation tasks using LLMs in a zero-shot setting.

These datasets have played a significant role in the development of several conversational recommendation systems (li2018towards, ; fu2021hoops, ; liu2020towards, ; liu2023u, ; he2023large, ) in their respective domains. We anticipate that the proposed dataset $\mathsf{MobileConvRec}$ plays a similar role in stimulating research in building effective conversational app recommender systems. A comprehensive comparison of existing conversational recommender systems with $\mathsf{MobileConvRec}$ , as depicted in Table 2, indicates that the proposed dataset shares key attributes on par with these datasets.

2.2. Datasets for Mobile Apps

Several datasets exist for user interaction of mobile apps as listed in Table 1. Khalid et al. (khalid2014mobile, ) and (khalid2013identifying, ) provided a dataset of iOS apps consisting of 6,390 user reviews for 20 apps. Maalej and Nabil (maalej2015bug, ) collected the first large dataset with 1.3 million reviews for over 1,100 apps, their dataset focuses on user problems and understanding the user-developer dialogue. It is important to note that these datasets are not publicly available. Top 20 Apps (top20-dataset, ) is available publicly and contains 200K reviews for 20 apps spanning 9 categories. This dataset provides rating scores and text for the reviews. RRGen (rrgen, ) has more than 309K reviews spanning 58 apps. Similar to the Top 20 Apps, RRGen provides only text of reviews with rating scores. Both of these datasets do not provide app metadata, a unique user identifier, and the timestamp of review. AARSynth (aarsynth, ) collected over two million user reviews for over a hundred apps, including app metadata. Reviews of this dataset also miss out on a unique user identifier and the timestamp of review, similar to the datasets mentioned earlier.

Srisopha et al. (srisopha-how, ) collected over 9 million user reviews from 1,600 apps. This dataset has review timestamps, which can help to understand reviews in the context of the time. However, this dataset does not include a unique identifier for user and app metadata. Moreover, Srisopha et al. did not make this dataset publicly available. PPrior (pprior, ) dataset provided more than 2 million reviews for over 9 thousand apps covering $48$ categories from Google Play. This dataset provides rating scores, review text, and timestamps of reviews. However, it is worth mentioning that this dataset does not include user identifiers for interactions (i.e., reviews) and lacks app metadata. Additionally, it is important to note that PPrior dataset only contains negative user reviews. More recently, $\mathsf{MobileRec}$ (mobilerec, ) provided over 1.9 million user interactions with interaction timestamps. However, $\mathsf{MobileRec}$ does not provide multi-turn conversations. We complement $\mathsf{MobileRec}$ by building a conversational dataset on top of $\mathsf{MobileRec}$ spanning 48 categories on Google Play. Furthermore, $\mathsf{MobileConvRec}$ includes security and privacy metadata along with executables for the apps. This makes our dataset an ideal testbed for further research on mobile apps for conversational recommendation as well as understanding security and privacy perspectives.

3. $\mathsf{MobileConvRec}$ Dataset

The proposed conversational recommendation dataset has been curated in an end-to-end fashion with minimal human intervention. Initially, we outline the theoretical framework that supports the dataset generation process. Following this, we provide implementation details about each step of the dataset creation. Figure 2 provides an overview of the framework.

3.1. Theoretical Framework

At a high level, we develop a framework that ingests a sequential recommendation dataset containing user interactions over time (e.g., time-stamped reviews) with various apps. This framework then generates a conversational recommendation dataset as output. Formally, the framework $F$ maps a traditional recommendation dataset $D$ to a conversational recommendation dataset $C$ :

F(D)\rightarrow C=\{c_{t};t\in 1,2,\cdots,N\}

c_{t}=\{(u_{1}^{t},s_{1}^{t}),(u_{2}^{t},s_{2}^{t}),\cdots,(u_{n}^{t},s_{n}^{t% })\}

where each dialog $c_{t}$ consists of $n$ natural language interaction turns. Each turn $(u_{i}^{t},s_{i}^{t})$ consists of user utterance $u_{i}^{t}$ and system utterance $s_{i}^{t}$ . Furthermore, at $k-$ th dialog turn, the system utterance $s_{k}^{t}$ presents a recommendation for an app to the user.

At the dialog level, the framework randomly selects a user interaction with an app along with the corresponding user profile. This user profile is built using the user’s historical interactions. Additionally, the framework incorporates global user preferences regarding the significance of different aspects of the apps (e.g., customization). More precisely, the framework simulates a dialog between the user and the system, with the sampled user interaction serving as a reference to guide the conversational flow. We break down the simulation into two steps for simplicity. In the first step, the simulation generates the dialog outline at a semantic level, while the second step converts this semantic information into contextual natural language utterances. Formally, the first step $F_{1}$ takes as input global user preferences $G$ , a sampled interaction $d_{i}$ , and corresponding user profile $u_{k}$ . It then maps this input to the semantic-level dialog outline $s_{i}$ . Subsequently, the second step $F_{2}$ maps semantic-level dialog outline $s_{i}$ to natural language dialog $c_{i}$ :

F_{1}(G,d_{i},u_{k})\rightarrow\{s_{i}\}

F_{2}(\{s_{i}\})\rightarrow\{c_{i}\}

In the first step, the user-system simulation follows a straightforward protocol called REQUEST-RESPONSE. In this protocol, either the user or the system can request information about any aspect using the message line: “aspect-name = ?”. The response line takes the form: “aspect-name = value”, where the value can either be an actual value (e.g., price = free) or a value between 0 and 1 representing the degree of importance attributed to the requested aspect (e.g., customization: 0.7). Formally, at turn $t$ of dialog $i$ , the system simulator $SYS$ samples (without replacement) an aspect $a_{p}$ according to a weighted distribution over all aspects from the global user preferences $G$ and requests information regarding that aspect. The user simulator $USR$ then provides the value $v_{p}$ for the requested aspect $a_{p}$ , taking into account the sampled interaction $d_{i}$ and the user’s profile $u_{k}$ :

SYS(G)\rightarrow s_{t}^{i}=(a_{p}=?)

a_{p}\sim G=\{a_{i}=pr(a_{i});i\in 1,2,\cdots,m\}

USR(d_{i},u_{k},a_{p})\rightarrow(a_{p}=v_{p})

u_{k}=\{(a_{i}=v_{i});i\in 1,2,\cdots,m\}

where the probability value $pr(a_{p})$ for tuples $(a_{p}=pr(a_{p}))$ in $G$ are determined by analyzing the proportions of users who deem a particular aspect important in the dataset $D$ . Meanwhile, the user simulator leverages user profile information and the sampled interaction to collect the value $v_{p}$ for the requested aspect $a_{p}$ . This mechanism also accommodates users requesting information about the recommended app later in the conversation.

The second step is straightforward and involves utilizing an off-the-shelf pre-trained LLM, such as ChatGPT, to transform the semantic-level turns into coherent contextual natural language dialogs. Specifically, at turn $t$ , we provide the natural language dialog context $c_{<t}$ along with the semantic-level information $a_{p}=v_{p}$ or $a_{p}=?$ , using the appropriate prompt, to the model. The model produces the natural language dialog $c_{\leq t}$ :

\texttt{LLM}(c_{<t},(a_{p}=v_{p})\vee(a_{p}=?))\rightarrow c_{\leq t}

It is crucial to emphasize that the simulated conversation remains grounded in the sampled interaction, ensuring that the simulation closely aligns with the user’s actual interaction in hindsight. Moreover, the recommended app always corresponds to the one involved in the sampled interaction.

3.2. Conversational Dataset Construction

We construct $\mathsf{MobileConvRec}$ using the theoretical framework from Section 3.1 and a large-scale dataset for mobile app recommendations, called $\mathsf{MobileRec}$ (mobilerec, ). In the following, we provide details of each step and design choices.

3.2.1. Topics Modeling for App Aspect Extraction

First, we identify the key aspects of mobile apps that users prioritize. Our approach involves leveraging the extensive dataset, $\mathsf{MobileRec}$ , which comprises 19.3 million user reviews across diverse mobile apps. Recognizing the importance of an efficient and scalable methodology, we choose to employ the BERTopic(grootendorst2022bertopic, ) library for conducting unsupervised topic modeling. For the vectorization of user review data, we utilize sentence transformers (reimers-2019-sentence-bert, ), specifically employing the pre-trained all-mpnet-base-v1 model. The selection of this model for text embedding is driven by its state-of-the-art performance across various natural language understanding benchmarks.

After obtaining the high-dimensional embeddings for user reviews, the next step involves dimensionality reduction. The goal is to reduce the dimensionality of the embeddings while retaining the relevant semantic information for effective topic modeling. In particular, we use Universal Manifold Approximation and Projection (UMAP) (mcinnes2018umap, ) as the dimensionality reduction technique. UMAP offers distinct advantages, as it can effectively capture both local and global structures of the high-dimensional space in lower dimensions. Following dimensionality reduction, we employ the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm to cluster reviews that share similar aspects. HDBSCAN is chosen for its ability to effectively identify clusters of varying shapes and densities in the data. Moreover, HDBSCAN is robust to noise and outliers, allowing it to effectively handle noisy data commonly encountered in user reviews.

After the clustering is performed, we analyze the most significant terms in each cluster to extract the cluster-representative aspects. We experiment with n-gram ranges between 1 and 4. After investing some manual effort in consolidating semantically similar aspects, we arrive at the following unordered list of 20 aspects. These aspects form the foundation of the simulation framework: (i) User Interface Design; (ii) Navigation; (iii) Accessibility; (iv) Customization; (v) Functionality; (vi) Performance; (vii) Responsiveness; (viii) Security; (ix) Privacy; (x) Permissions; (xi) Data Collection; (xii) Data Sharing; (xiii) Updates; (xiv) Customer support; (xv) Reviews and ratings; (xvi) Developer; (xvii) Price; (xviii) In-app purchases; (xix) Advertisement Frequency; (xx) Battery Drainage.

Table 3. Description of the key features in the

\mathsf{MobileConvRec}

dataset.

Feature

Description

UID

A 16-character alphanumeric unique identifier for each user, effectively anonymizing their identity.

Example: ajqpT7VwUFheTsw7, l80Is37SlA2J9Pl4, 2EXDIawV03jpHio1.

Sequential

User Interactions

The timestamped series of user interactions preceding the current conversation.

Example Interaction: “app_name”: “Stickman vs Zombies”, “package_name”:

“com.aurecas.stickmanzombieshooter”, “date”: “2021-12-01”, “rating”: “2”.

Natural Language

Conversation

Multi-turn user-system conversation in natural language.

Example Turn: Computer: “Wonderful! Which specific category or type of apps interests you the most?

Human: I’m generally drawn toward Music & Audio apps.

Recommendation

This represents the actual app with which the user interacted, based on their sequential interactions. The

goal of the recommender system will be to take into account both the users’ historical interactions and

the current conversational context to make a meaningful recommendation.

Example Recommendation: “app_name”: “Walk Band - Multitracks Music”, “package_name”:

\cdots

Negative

Recommendation

This represents the relevant apps with actual recommended app which the user interacted based on their sequential

interactions.

Example Negative Recommendation: “app_name”: “Spotify: Music and Podcast”, “package_name”:“com.spotify.music”

App Metadata

(i) Basic metadata about each app such as app package, app name, developer name, app category, developer-provided

long-form textual description, content rating of the app, number of reviews, average rating, price, app type,

and positive and negative app feature among others.

(ii) Permissions: List of specific privileges granted to an application, allowing it to access certain resources

or perform certain actions on the device.

Example: Wi-Fi connection information, view Wi-Fi connections, Photos/Media/Files,

\cdots

(iii) Data Collected: The information gathered by the app and its purpose.

Example: Approximate location for analytics, advertising

\cdots

; Financial info for purchase history

\cdots

(iv) Data Shared: The information shared with third parties.

Example: Personal info such as email address for advertising or marketing; App activity such as

\cdots

(v) Security Practices: The measures and protocols implemented to protect data from unauthorized access.

Example: Data is encrypted in transit; Your data is transferred over a secure connection.

(vi) App Executable: The application executable file, identified by the .apk extension.

Example: com.aurecas.stickmanzombieshooter.apk

3.2.2. Global User Preferences

After extracting app aspects from the $\mathsf{MobileRec}$ dataset, we gather global user preferences in a structured format. We utilize gpt-3.5-turbo to analyze user reviews and identify the aspects being discussed in each review. Subsequently, we aggregate this information to compute the percentage of reviews that address each specific aspect. We refer to these statistics as $G=\{a_{i}=pr(a_{i});i\in 1,2,\cdots,m\}.$ This analysis provides relative importance of different aspects from a global perspective. For instance, if a large proportion of reviews mention the “performance” aspect, it indicates that performance is a significant consideration for users.

These statistics play a key role in guiding the interactions of the computer simulator during semantic-level conversations. Prioritizing aspects that are frequently mentioned in user reviews increases the likelihood of a successful conversation. Conversely, querying about aspects that garner little attention from users may not only be unhelpful for recommendation purposes but also risk the failure of the conversation. It is crucial to note that when selecting an aspect to inquire from the user, we adopt a probabilistic approach. This means that aspects with higher probabilities are given higher priority. This approach is chosen over a deterministic one to ensure diverse and dynamic conversations. A deterministic approach would result in repetitive conversations, with the system side of the dialogue following the same order consistently. By employing a probabilistic approach, we introduce variability and spontaneity into the conversations.

3.2.3. Interaction Sampling and User Profile

To ensure that conversations are grounded in real interactions and to guide the user simulator effectively during semantic-level conversations, it is essential to sample user interactions from the traditional sequential recommendation dataset. To sample user interactions, we employ a weighted distribution that assigns a higher probability of selection to interactions that have longer reviews and diverse categories. The distribution is defined as $w(l_{i}^{2}\times\texttt{diversity})$ , where $w(.)$ denotes the weight, $l_{i}^{2}$ represents the squared review length for interaction $d_{i}$ , and diversity is inversely proportional to the number of apps already sampled within the same category as $d_{i}$ . This approach ensures that longer reviews are more likely to be selected while also promoting diversity in the sampled interactions across different app categories. Furthermore, longer reviews tend to cover a broader spectrum of topics, aspects, and sentiments related to the app. This increased information density enhances the likelihood of gaining deeper insight into user preferences, thereby offering more informative guidance for the user simulator.

To construct the user profile, we gather all previous interactions associated with the user involved in the sampled interaction $d_{i}$ . We accumulate the values for all the aspects and refer to the profile for $k-$ th user as $u_{k}=\{(a_{i}=v_{i});i\in 1,2,\cdots,m\}$ . Both the user profile and the sampled interaction collectively guide the user simulator. They provide essential information and insights into the user’s preferences, behaviors, and past interactions, facilitating the generation of contextually relevant responses in the simulation framework.

3.2.4. User System Simulation

The computer simulator initiates the conversation by deciding to ask a question aimed at understanding the user’s interests. This involves sampling an aspect from global user preferences $G$ according to the normalized probability distribution over the aspects. We employ gpt-3.5-turbo to generate a contextual natural language query, with the prompt guiding the formulation of the question. The human simulator responds to the question posed by the computer simulator. If the sampled review discussed the aspect in question, the human simulator provides an answer. Otherwise, the response indicates disinterest (i.e., value 0) in that particular aspect. Once again, gpt-3.5-turbo is utilized to generate a natural language response, with the prompt dictating the model for formulating a response based on aspect value and conversational context. Following several exchanges, at turn $k$ , a recommendation is presented to the user. This recommendation corresponds to the actual app that the user had interacted with in the sampled interaction $d_{i}$ . Following the recommendation, the user simulator may pose follow-up questions based on the user’s profile characteristics. These questions are sampled in accordance with the user’s preferences captured in the user profile, such as a propensity for prioritizing security concerns. Subsequently, the computer simulator provides answers to the user simulator’s inquiries, drawing upon information from rich metadata about apps.

We employ automatic detection mechanisms to identify failed simulations, particularly focusing on scenarios where multiple aspects are queried that users express no interest in. Conversations of this nature fail to contribute meaningfully to guiding recommendation models due to their lack of relevance to the user’s preferences. Moreover, cases where no user history is available have also been sampled, enabling the simulation of cold-start scenarios. This approach ensures that recommendation models can not get away with only considering the historical interactions of the users. The final dataset has undergone human verification to ensure the informativeness and coherence of the dialogs.

3.3. Dataset Features

The proposed dataset comprises several features, encompassing users’ historical interactions, explicit interest elicitation turns, an annotated recommendation turn, and subsequent question-answer exchanges about the recommended app. Table 3 provides detailed descriptions for the important features of $\mathsf{MobileConvRec}$ .

In addition to the basic metadata available in $\mathsf{MobileRec}$ , we have broadened our dataset’s scope to encompass more exhaustive metadata attributes. This expansion entails comprehensive details regarding the permissions sought by apps from users, intricacies of data collection specifying its purpose, insights into data-sharing practices with third parties, and security measures governing data transmission and sharing. Furthermore, we provide access to the executable (i.e., .apk) files of free apps, enabling researchers to conduct thorough analyses of the actual binary code, should they desire a deeper exploration.

The sharing of .apk files will be restricted to research and educational purposes only, as per Google Play store policies that prohibit the open distribution of executable files. Access will be granted exclusively to academic researchers upon request. The conversations and app metadata are accessible to the public via Hugging Face datasets.

3.4. Dataset Analysis

In Figure 3, we showcase the top 10 app categories. It’s notable that these categories collectively constitute over 50% of the dataset, aligning closely with the composition of the original recommendation dataset, $\mathsf{MobileRec}$ . Our carefully designed sampling approach facilitated the inclusion of apps from a wide range of categories – spanning all 45 categories – while maintaining proportions that reflect the real-world popularity of these categories.

In Figure 4, we depict the distribution of the number of turns per dialogue. The majority of dialogues consist of turns ranging between 10 and 16, with each turn comprising one user utterance and one system utterance. We believe that this distribution of turns per dialogue poses a decent challenge for recommendation models, while also providing ample learning opportunities for them.

In Figure 5, we illustrate the distribution of the number of words per turn. A noteworthy observation is the diversity within the dataset, encompassing turns with varying lengths, spanning from brief exchanges to more extensive dialogues. This breadth of conversational lengths is anticipated to equip the recommendation models with the capability to effectively handle a wide spectrum of conversations, ensuring its adaptability to diverse user interactions.

3.5. Potential Usage Scenarios

The proposed dataset, $\mathsf{MobileConvRec}$ , serves as a diverse resource for training both sequential and conversational recommender systems, offering a unique blend of users’ historical interactions and contextual conversational data. This integration presents an opportunity for recommender systems to obtain richer insights and refine their recommendations with a deeper understanding of user preferences. Notably, existing research has predominantly focused on either historical interactions or conversational context, often overlooking the potential synergies afforded by combining the two.

The inclusion of rich metadata about apps further enhances the dataset’s utility, enabling follow-up question-answering functionalities that have been largely neglected by conversational recommender systems thus far. Moreover, the availability of executable files in the dataset offers a unique opportunity for conducting security and privacy-related analyses, allowing for a thorough verification of the functional correctness, and program analysis of the apps. Unlike relying solely on developer-provided information, which may be incomplete or biased, the availability of executable files facilitates a more robust evaluation process. By analyzing the executable files, researchers can explore the inner workings of the apps, uncovering potential security vulnerabilities, privacy concerns Such comprehensive analyses of executables hold promise in bolstering the safety and security of recommendations, thereby enhancing user trust and confidence.

In this work, our primary focus lies in establishing baseline results for conversational mobile app recommendations. Moreover, we expect the manifold potential applications of this dataset across various domains and encourage the broader research community to explore and leverage its diverse capabilities.

Table 4. Recommendation Generation Experiment.

Input	Model	Success Rate
Dialog Context	GPT-2	65.6 %
Dialog Context	Flan-T5	36.4 %
Dialog Context + Previous Interactions	GPT-2	66.6 %
Dialog Context + Previous Interactions	Flan-T5	37.7 %
Dialog Context + Sampled Candidates	GPT-2	85.3 %
Dialog Context + Sampled Candidates	Flan-T5	86.0 %
Dialog Context + Similar Candidates	GPT-2	47.9 %
Dialog Context + Similar Candidates	Flan-T5	53.6 %
Dialog Context + Sampled Candidates + Previous Interactions	GPT-2	82.6 %
Dialog Context + Sampled Candidates + Previous Interactions	Flan-T5	86.3 %
Dialog Context + Similar Candidates + Previous Interactions	GPT-2	47.2 %
Dialog Context + Similar Candidates + Previous Interactions	Flan-T5	54.7 %

4. Baselines

We fine-tuned pre-trained GPT-2 (Radford2019LanguageMA, ) and Flan-T5 (Chung2022, ) models to establish baseline results across various experimental setups.

Table 5. Candidate Apps Ranking Experiment: Hit@1-to-10.

Input	Model	Hit@1	@2	@3	@4	@5	@6	@7	@8	@9	@10
Dialog Context + Sampled Candidates	GPT-2	0.596	0.691	0.741	0.774	0.797	0.820	0.835	0.848	0.864	0.879
Dialog Context + Sampled Candidates	Flan-T5	0.893	0.938	0.951	0.958	0.962	0.965	0.966	0.970	0.972	0.973
Dialog Context + Similar Candidates	GPT-2	0.272	0.371	0.436	0.485	0.527	0.564	0.600	0.638	0.673	0.701
Dialog Context + Similar Candidates	Flan-T5	0.567	0.702	0.756	0.800	0.832	0.848	0.862	0.874	0.884	0.895
Dialog Context + Sampled Candidates + Previous Interactions	GPT-2	0.706	0.791	0.823	0.846	0.861	0.879	0.890	0.899	0.908	0.914
Dialog Context + Sampled Candidates + Previous Interactions	Flan-T5	0.897	0.944	0.956	0.961	0.963	0.967	0.970	0.973	0.973	0.977
Dialog Context + Similar Candidates + Previous Interactions	GPT-2	0.343	0.460	0.526	0.577	0.622	0.653	0.686	0.712	0.744	0.766
Dialog Context + Similar Candidates + Previous Interactions	Flan-T5	0.577	0.715	0.777	0.814	0.836	0.856	0.873	0.883	0.893	0.905

Table 6. Candidate Apps Ranking Experiment: NDCG@1-to-10.

Input	Model	NDCG@1	@2	@3	@4	@5	@6	@7	@8	@9	@10
Dialog Context + Sampled Candidates	GPT-2	0.596	0.656	0.681	0.695	0.704	0.713	0.718	0.721	0.726	0.731
Dialog Context + Sampled Candidates	Flan-T5	0.893	0.921	0.928	0.931	0.933	0.934	0.934	0.935	0.936	0.936
Dialog Context + Similar Candidates	GPT-2	0.272	0.335	0.367	0.388	0.404	0.417	0.429	0.441	0.452	0.460
Dialog Context + Similar Candidates	Flan-T5	0.567	0.652	0.679	0.698	0.710	0.716	0.721	0.724	0.728	0.731
Dialog Context + Sampled Candidates + Previous Interactions	GPT-2	0.706	0.760	0.776	0.786	0.792	0.798	0.802	0.804	0.807	0.809
Dialog Context + Sampled Candidates + Previous Interactions	Flan-T5	0.897	0.926	0.932	0.935	0.935	0.937	0.938	0.939	0.939	0.940
Dialog Context + Similar Candidates + Previous Interactions	GPT-2	0.343	0.417	0.450	0.472	0.489	0.500	0.511	0.519	0.529	0.535
Dialog Context + Similar Candidates + Previous Interactions	Flan-T5	0.577	0.664	0.695	0.711	0.720	0.727	0.732	0.735	0.738	0.742

4.1. Experimental Settings

To ensure robust evaluation, we partitioned the data into distinct training, validation, and testing sets based on the date of interaction. Specifically, interactions occurring on past dates were allocated to the training and testing sets, with the most recent dates exclusively designated for testing. This approach ensures that the models are trained and evaluated on temporally diverse datasets, enabling a comprehensive assessment of their performance across different periods. We consider the following experimental setups.

Recommendation Generation. We assess the models’ ability to generate app names as recommendations through fuzzy matching between the ground truth and the generated app names. This experiment involves four different types of inputs to the models: (i) only the dialog context; (ii) the combination of the dialog context and users’ historical interactions; (iii) the combination of the dialog context and the set of candidate apps available for recommendation; and (iv) the combination of the dialog context, the set of candidate apps, and previous interactions. In all our experiments, the number of candidate apps is set to 25, with one being the ground truth app that the user interacted with, and the 24 other different apps. These 24 candidate apps are generated in one of two ways: (i) randomly sampling 24 apps (Sampled Candidates); or (ii) selecting them from a group of apps similar to the ground truth recommended app (Similar Candidates).

Candidate Apps Ranking. We assess the models’ ability to rank a set of candidate apps. We consider two different types of inputs to the models: (i) the combination of the dialog context and the set of candidate apps; and (ii) the combination of the dialog context, the set of the candidate apps, and the users’ historical interactions. Similar to the previous experiment, we utilize two sets of candidate apps: (i) Sampled Candidates, which consists of the ground truth app and 24 apps randomly selected from the entire app pool, and (ii) Similar Candidates, which includes the ground truth app and 24 apps similar to the ground truth app.

Response Generation. In this experiment, we evaluate the models’ proficiency in generating appropriate responses based on dialog context. This encompasses the models’ ability to elicit users’ preferences, recommend suitable apps in natural language text, and respond to users’ inquiries about the recommended apps.

4.2. Evaluation Metrics

For each experimental setup, we use well-established metrics. (i) For the recommendation generation task, we employ the success rate metric. Specifically, the success rate calculates the percentage of apps where the generated app name and the ground-truth app name have a Levenshtein distance similarity ratio (Levenshtein1965BinaryCC, ) of more than 0.95. (ii) For the apps ranking task evaluation, we utilize standard metrics (jarvelin2002cumulated, ; recbole[1.0], ; li2020sampling, ): Hit@K and NDCG@K where $K\in\{1,2,3,\cdots,10\}$ . (iii) For response generation task evaluation, we use BLEU score (papineni-etal-2002-bleu, ).

5. Results and Discussion

Table 4 presents the results of the experiment, which focused on training models to directly generate recommended app names. We notice that incorporating candidate apps (Sampled Candidates) as input to the model significantly enhances its ability to accurately generate the recommended app names. Specifically, we observe a remarkable 136.26% improvement in the performance of the Flan-T5 model when both dialog context and sampled candidate apps are utilized, compared to when only dialog context is employed as input (86.0 vs 36.4). Similar improvement can also be observed for the GPT-2 model when both dialog context and sampled candidates are utilized (85.3 vs 65.6). The substantial improvement observed was anticipated, as providing sampled candidate apps (i.e., 25 apps) as input significantly reduces the pool of apps to be recommended (from over 1.7K in the training data to just 25). Furthermore, since the names of the apps are provided in the context, the models are less prone to errors during generation. Although there is no definitive winner, our observations indicate that the Flan-T5 model outperforms the GPT-2 model when the input to the model includes candidate apps, whether these candidate apps are Similar Candidates or Sampled Candidates. These findings imply that the Flan-T5 model demonstrates superior proficiency compared to GPT-2 in selecting the appropriate apps when provided as context.

Additionally, we observe ambiguity regarding the impact of incorporating historical interactions on the results. While intuitively, leveraging historical interactions should enhance performance, our experiments do not consistently demonstrate improvements. This suggests that simply passing historical interactions as part of the input may not be the optimal approach and warrants further investigation.

In addition to this, we observe a noticeable difference in model performance when comparing the use of Sampled Candidates versus Similar Candidates as input. For instance, both GPT-2 and Flan-T5 models show approximately 43.8% and 37.6% lower success rates respectively when the input consists of dialog context and similar candidates, compared to when the input includes dialog context and sampled candidates. This decline in performance is expected, as the presence of similar candidates makes it more challenging for the models to select the correct app.

Moreover, upon subjectively evaluating the success and failure cases, we observed that the pre-trained models exhibit a bias towards more popular apps. For instance, VLC for Android is consistently favored over MX Player Pro. This observation underscores potential avenues for further investigation into the factors influencing model preferences and their implications for recommendation systems.

Table 5 and Table 6 present the results for the Hit and NDCG metrics, respectively, for the candidate ranking experiment. Both models exhibit improved quantitative scores for Hit and NDCG metrics as the value of k increases, which aligns with desirable behavior. While the overall results for both models do not significantly differ, we observe slightly better performance from the Flan-T5 model. Particularly noteworthy is the case where the input comprises only Dialog Context and Sampled Candidates, where the Flan-T5 model outperforms the GPT-2 model by 35.74% in Hit@2 and by 40.39% in NDCG@2. These findings underscore the potential superiority of the Flan-T5 model in candidate ranking tasks. Furthermore, similar to the previous experiment (recommendation generation experiment), we observe that the performance of all models declines when using Similar Candidates as input compared to using Sampled Candidates.

Table 7. Response Generation Experiment.

Models	BLEU-4 Score
GPT-2	0.1934
Flan-T5	0.2998

The performance of the models in the response generation experiment is shown in Table 7. In this experiment, we again observe that the Flan-T5 model outperforms the GPT-2 model by 55.01%, achieving a performance score of 0.2998 compared to 0.1934. While the BLEU scores for both models may not be particularly high, our subjective evaluations suggest that both models exhibit good performance. As an example, during user needs elicitation and response generation, generations such as “What average rating do you typically look for when deciding to install a mobile application?” or “Got it! how important is the reputation or credibility of the developer to you when choosing a mobile app?” can yield significantly different BLEU scores based on the corresponding ground truth utterance. However, from a qualitative perspective, both responses are perfectly reasonable and effectively advance the conversation.

6. Conclusion

This paper introduces $\mathsf{MobileConvRec}$ , a dataset tailored for conversational mobile app recommendations. The key novelty of the proposed dataset is the integration of users’ historical interactions within multi-turn dialogs, providing a more holistic view of user preferences. $\mathsf{MobileConvRec}$ comprises over 12.2K multi-turn dialogs encompassing 11.8K unique users interacting with 1,730 apps spanning 45 categories. These interactions accumulate to over 156K turns in conversations, providing an unparalleled level of detail for understanding user preferences across diverse app categories. Furthermore, each app in the dataset includes comprehensive metadata, including details such as app permissions, data collection and sharing practices, security policies implemented by developers, and binary executables of free apps. We showcase the utility of $\mathsf{MobileConvRec}$ as an experimental testbed for research in conversational app recommendation by conducting a comparative evaluation of various pre-trained language models. This evaluation also establishes baseline results that can serve as valuable reference points for the research community.

References

[1] Movielens. https://grouplens.org/datasets/movielens/, 2022. Accessed: 2022-11-06.
[2] Top 20 play store app reviews. https://www.kaggle.com/datasets/odins0n/top-20-play-store-app-reviews-daily-update, 2022. Accessed: 2022-12-09.
[3] Apple. Apple app store. https://apps.apple.com/, 2022. Accessed: 2022-11-06.
[4] L. Ceci. Number of apps available in leading app store. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/. Accessed: 2022-11-06.
[5] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 815–824, 2016.
[6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, ** Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
[7] J. Degenhard. Number of apps available in leading app store. https://www.statista.com/forecasts/1143723/smartphone-users-in-the-world. Accessed: 2022-02-02.
[8] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931, 2015.
[9] Umar Farooq, AB Siddique, Fuad Jamour, Zhijia Zhao, and Vagelis Hristidis. App-aware response synthesis for user reviews. In 2020 IEEE International Conference on Big Data (Big Data), pages 699–708. IEEE, 2020.
[10] Moghis Fereidouni, Adib Mosharrof, Umar Farooq, and AB Siddique. Proactive prioritization of app issues via contrastive learning. In 2022 IEEE International Conference on Big Data (Big Data), pages 535–544. IEEE, 2022.
[11] Zuohui Fu, Yikun Xian, Yaxin Zhu, Shuyuan Xu, Zelong Li, Gerard De Melo, and Yongfeng Zhang. Hoops: Human-in-the-loop graph reasoning for conversational recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2415–2421, 2021.
[12] Cuiyun Gao, Jichuan Zeng, Xin Xia, David Lo, Michael R Lyu, and Irwin King. Automating app review response generation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 163–175. IEEE, 2019.
[13] Google. Google play store. https://play.google.com/store/apps, 2022. Accessed: 2022-11-06.
[14] Maarten Grootendorst. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794, 2022.
[15] Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. Inspired: Toward sociable recommendation dialog systems. arXiv preprint arXiv:2009.14306, 2020.
[16] Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. Large language models as zero-shot conversational recommenders. In Proceedings of the 32nd ACM international conference on information and knowledge management, pages 720–730, 2023.
[17] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.
[18] Meihuizi Jia, Ruixue Liu, Peiying Wang, Yang Song, Zexi Xi, Haobin Li, Xin Shen, Meng Chen, **hui Pang, and Xiaodong He. E-convrec: a large-scale conversational recommendation dataset for e-commerce customer service. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5787–5796, 2022.
[19] Dongyeop Kang, Anusha Balakrishnan, Pararth Shah, Paul Crook, Y-Lan Boureau, and Jason Weston. Recommendation as a communication game: Self-supervised bot-play for goal-oriented dialogue. arXiv preprint arXiv:1909.03922, 2019.
[20] Hammad Khalid. On identifying user complaints of ios apps. In 2013 35th international conference on software engineering (ICSE), pages 1474–1476. IEEE, 2013.
[21] Hammad Khalid, Emad Shihab, Meiyappan Nagappan, and Ahmed E Hassan. What do mobile app users complain about? IEEE software, 32(3):70–77, 2014.
[22] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 10:707–710, 1965.
[23] Dong Li, Ruoming **, **g Gao, and Zhi Liu. On sampling top-k recommendation evaluation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2114–2124, 2020.
[24] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. Towards deep conversational recommendations. Advances in neural information processing systems, 31, 2018.
[25] Lizi Liao, Le Hong Long, Zheng Zhang, Minlie Huang, and Tat-Seng Chua. Mmconv: an environment for multimodal conversational search across multiple domains. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pages 675–684, 2021.
[26] Yuanxing Liu, Weinan Zhang, Baohua Dong, Yan Fan, Hang Wang, Fan Feng, Yifan Chen, Ziyu Zhuang, Hengbin Cui, Yongbin Li, et al. U-need: A fine-grained dataset for user needs-centric e-commerce conversational recommendation. arXiv preprint arXiv:2305.04774, 2023.
[27] Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. Towards conversational recommendation over multi-type dialogs. arXiv preprint arXiv:2005.03954, 2020.
[28] Walid Maalej and Hadeer Nabil. Bug report, feature request, or simply praise? on automatically classifying app reviews. In 2015 IEEE 23rd international requirements engineering conference (RE), pages 116–125. IEEE, 2015.
[29] M. H. Maqbool, Umar Farooq, Adib Mosharrof, A. B. Siddique, and Hassan Foroosh. Mobilerec: A large scale dataset for mobile apps recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 3007–3016, New York, NY, USA, 2023. Association for Computing Machinery.
[30] Julian McAuley. Amazon product data. http://jmcauley.ucsd.edu/data/amazon/, 2022. Accessed: 2022-11-06.
[31] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[32] Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 845–854, 2019.
[33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-**g Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
[34] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[35] Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. Coached conversational preference elicitation: A case study in understanding movie preferences. 2019.
[36] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019.
[37] Kamonphop Srisopha, Daniel Link, and Barry Boehm. How should developers respond to app reviews? features predicting the success of developer responses. EASE 2021, page 119–128, New York, NY, USA, 2021. Association for Computing Machinery.
[38] Hu Xu, Seungwhan Moon, Honglei Liu, Bing Liu, Pararth Shah, and Philip S Yu. User memory reasoning for conversational recommendation. arXiv preprint arXiv:2006.00184, 2020.
[39] Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. Recbole: Towards a unified, comprehensive and efficient framework for recommendation algorithms. In CIKM, pages 4653–4664. ACM, 2021.
[40] Kun Zhou, Yuanhang Zhou, Wayne Xin Zhao, Xiaoke Wang, and Ji-Rong Wen. Towards topic-guided conversational recommender system. arXiv preprint arXiv:2010.04125, 2020.