
Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis

Lanling Xu (ORCID 0000-0002-7464-3776) and Junjie Zhang (ORCID 0009-0008-8864-915X), Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Bingqian Li, Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China; Jinpeng Wang and Mingchen Cai, Meituan Group, Beijing, China; Wayne Xin Zhao and Ji-Rong Wen, Gaoling School of Artificial Intelligence, Renmin University, Beijing, China

(2018; 8 January 2024)

Abstract.

Recently, large language models (LLMs) such as ChatGPT have showcased remarkable abilities in solving general tasks, demonstrating the potential for applications in recommender systems. To assess how effectively LLMs can be used in recommendation tasks, our study primarily focuses on employing LLMs as recommender systems through prompt engineering. We propose a general framework for utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders. To conduct our analysis, we formalize the input of LLMs for recommendation into natural language prompts with two key aspects, and explain how our framework can be generalized to various recommendation scenarios. As for the use of LLMs as recommenders, we analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task descriptions, user interest modeling, candidate items construction and prompting strategies. In each section, we first define and categorize concepts in line with the existing literature. Then, we propose inspiring research questions followed by experiments to systematically analyze the impact of different factors on two public datasets. Finally, we summarize promising directions to shed light on future research.

Large Language Models, Recommender Systems, Empirical Study

CCS Concepts: Information systems → Recommender systems

1. Introduction

In order to alleviate the problem of information overload (Ni et al., 2019; Hou et al., 2022), recommender systems explore the needs of users and provide them with recommendations based on their historical interactions, and they are widely studied in both industry and academia (Rendle et al., 2009; Guo et al., 2017; He et al., 2017, 2020). Over the past decade, various recommendation algorithms have been proposed to solve recommendation tasks by capturing the personalized interaction patterns from user behaviors (Kang and McAuley, 2018; Zhou et al., 2020). Despite the progress of conventional recommenders, their performance is highly dependent on the limited training data from a few datasets and domains, and there are two major drawbacks. On the one hand, traditional models lack general world knowledge beyond interaction sequences. For complex scenarios that require reasoning or planning, existing methods lack the commonsense knowledge needed to solve such tasks (Wei et al., 2023; Harte et al., 2023; Xi et al., 2023; Liu et al., 2023a). On the other hand, traditional models cannot truly understand the intentions and preferences of users. The recommendation results lack explainability, and requirements that users express explicitly, for example in natural language, are difficult to take into account (Li et al., 2023e; Yuan et al., 2023a; Li et al., 2023c).

Recently, Large Language Models (LLMs) such as ChatGPT have demonstrated impressive abilities in solving general tasks (Haleem et al., 2022; Wu et al., 2023a), showing their potential in developing next-generation recommender systems. The advantages of incorporating LLMs into recommendation tasks are two-fold. Firstly, the excellent performance of LLMs in complex reasoning tasks indicates rich world knowledge and superior inference ability, which can effectively compensate for the local knowledge of traditional recommenders (Sanner et al., 2023; Mysore et al., 2023; Agrawal et al., 2023). Secondly, the language modeling abilities of LLMs can seamlessly integrate massive textual data, enabling them to extract features beyond IDs and even understand user preferences explicitly (He et al., 2023; Li et al., 2023g). Therefore, researchers have attempted to leverage LLMs for recommendation tasks.

Typically, there are three ways to employ LLMs to make recommendations: (1) LLMs can serve as the recommender to make recommendation decisions, encompassing both discriminative and generative recommendations (Hou et al., 2023; Zhang et al., 2023f; Dai et al., 2023a; Gao et al., 2023; Bao et al., 2023b). (2) LLMs can be leveraged to enhance traditional recommendation models by extracting semantic representations of users and items from text corpora. The extensive semantic information and robust planning capabilities of LLMs are integrated into traditional models (Hou et al., 2022; Du et al., 2023; Wang et al., 2023a; Harte et al., 2023; Agrawal et al., 2023; Wei et al., 2023; Xi et al., 2023; Liu et al., 2023a). (3) LLMs are utilized as the recommendation simulator to execute external generative agents in the recommendation process, where users and items may be empowered by LLMs to simulate the virtual environment (Wang et al., 2023i, g; Zhang et al., 2023e, c; Friedman et al., 2023). We mainly focus on the first scenario in this paper.

Considering the gap between the general knowledge from large language models and the domain knowledge from recommendation models (Zhang et al., 2023b; Bao et al., 2023b), there are two key factors for prompting LLMs as recommenders, i.e., how to select an LLM as the foundation model and how to construct a prompt as the prompting text. As for LLMs, a growing number of open-source and closed-source models have emerged, and the same model also has different variants due to settings such as parameter scales and context lengths (Zhao et al., 2023). There are notable variations in the performance of different LLMs when it comes to general language tasks such as generation and reasoning (Wu et al., 2023a; Haleem et al., 2022). However, the performance differences of LLMs in recommendation tasks have not been fully explored. It is worth discussing how to select suitable LLMs for specific scenarios and develop corresponding training strategies. As for the prompt, it is an important medium for interactions between humans and language models, and a well-designed prompt can better stimulate the powerful capabilities of LLMs (Liu et al., 2023f; Le Scao and Rush, 2021). To stimulate the recommendation ability of language models, prompt engineering should involve not only task description and prompting strategies for general tasks, but also the incorporation of user interest modeling and the creation of candidate items in recommender systems (Yao et al., 2023; Fan et al., 2023; Liu et al., 2023f).

Table 1. An overview of the primary discoveries presented in our work. We summarize new findings in the second column as “new findings”, and conduct experiments to verify findings discussed in existing literature as “re-validated findings”.

Aspect: LLMs

New findings:

$\bullet$ LLMs have cold-start recommendation capabilities, but are inferior to fully-trained recommenders. Fine-tuning LLMs can surpass traditional models.

$\bullet$ The larger the parameter scale, the better the recommendation ability, while a longer maximum context length leads to worse recommendations.

$\bullet$ Fine-tuning all parameters of LLMs for recommendation is more effective than parameter-efficient fine-tuning, but more training time is required.

Re-validated findings:

$\bullet$ Instruction tuning can enhance the fine-tuning results of LLMs on recommendations (Bao et al., 2023b; Touvron et al., 2023b).

$\bullet$ In few-shot training scenarios, LLMs are more capable of adapting to recommendation tasks compared to traditional models (Liao et al., 2023; Fu et al., 2023).

$\bullet$ Despite capabilities, there are limitations leveraging LLMs as recommenders, such as position bias (Hou et al., 2023; Ma et al., 2023) and lack of domain knowledge (Yao et al., 2023; Zheng et al., 2023).

Aspect: Prompts (task description)

New findings:

$\bullet$ Our framework can be adapted to point-wise, pair-wise and list-wise recommendation tasks.

Re-validated findings:

$\bullet$ Different recommendation tasks can be implemented by LLMs through prompts (Wang et al., 2023c; Geng et al., 2022).

Aspect: Prompts (user interest modeling)

New findings:

$\bullet$ For short-term interest, it is preferable to summarize recent interactions into text and recommend.

$\bullet$ For long-term interest, it is useful to maintain a personalized memory to store and retrieve.

$\bullet$ Long-term preferences and short-term intentions should be effectively combined to model users.

Re-validated findings:

$\bullet$ For short-term interest, only truncating the most recent items is not the optimal strategy (Lin et al., 2023b).

$\bullet$ Increasing the number of historical items to represent users brings insignificant gains for LLMs (Hou et al., 2023).

$\bullet$ Personalized user profiles and customized item descriptions assist in the user interest modeling for LLM-based recommendations (Yao et al., 2023; Shu et al., 2023).

Aspect: Prompts (candidate items construction)

New findings:

$\bullet$ Candidate items construction for LLMs is crucial to the final recommendation results.

$\bullet$ Retrieving candidate items by traditional recommendation models first, and then re-ranking items by LLMs can further improve the results, but the performance varies on specific methods and datasets.

Re-validated findings:

$\bullet$ LLMs can select from a given candidate item list (Dai et al., 2023a; Liu et al., 2023c), as well as directly generate recommendation results (Liu et al., 2023a; Wang et al., 2023e).

$\bullet$ Indexing methods and grounding strategies of items are important factors affecting the effectiveness of LLM-based recommendations (Lin et al., 2023c; Liao et al., 2023; Li et al., 2023b).

Aspect: Prompts (prompting strategies)

New findings:

$\bullet$ In chain-of-thought prompting, specific problem decomposition is required for recommendation tasks rather than general prompts.

$\bullet$ The few-shot prompting strategy has insignificant advantage in recommendation scenarios.

Re-validated findings:

$\bullet$ Although provided with historical items in chronological order, LLMs still need explicit guidance to understand the importance of recent items (Hou et al., 2023; Ma et al., 2023).

$\bullet$ Role-playing and expert-like prompts can leverage the capabilities of LLMs in specific fields (** et al., 2023; Wang et al., 2023h).

Aspect: Overall

New findings:

$\bullet$ Leveraging LLMs as recommenders lies in stimulating the general knowledge of LLMs and integrating the domain knowledge with the user interest.

Re-validated findings:

$\bullet$ LLMs are restricted by the unacceptable inference time (Li et al., 2023b), expensive memory cost (Touvron et al., 2023b), limited context length (Lin et al., 2023b) and black-box abilities (Li et al., 2023h).

Although existing studies have made initial attempts to explore the recommendation capabilities of LLMs like ChatGPT (Shu et al., 2023; Dai et al., 2023a; Gao et al., 2023; Hou et al., 2023; Liu et al., 2023c), and some studies have used paradigms such as fine-tuning and instruction tuning to train LLMs in the field of recommender systems (Zhang et al., 2023f; Zheng et al., 2023; Bao et al., 2023a, b), they focus on exploring the performance of a certain task rather than constructing a comprehensive framework to formalize the potential applications of LLM-powered recommender systems. There are also systematic reviews concentrating on the progress of LLMs (Zhao et al., 2023) and surveys of recommender systems empowered by LLMs (Li et al., 2023l; Lin et al., 2023a; Wu et al., 2023b). However, previous surveys generally use specific criteria to classify existing work and introduce them separately. They mainly focus on showcasing related work and summarizing advantages and limitations, rather than conducting additional experiments to validate existing results and explore new discoveries. Our work focuses on the ability of LLMs to directly serve as recommenders, aiming to establish a general framework of Prompting Large Language Models for Recommendation (ProLLM4Rec).

In order to conduct our analysis for ProLLM4Rec, we formalize the input of LLMs for recommendation into natural language prompts with two key aspects: LLMs and prompts, and explain how our framework can be generalized to various recommendation scenarios and tasks. As for the use of LLMs as recommenders, we analyze the impact of the public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task description, user interest modeling, candidate items construction, and prompting strategies. Given personalized prompts that include task description and user interest, the LLM selects, generates, or explains candidate items based on general world knowledge and personalized user profiles. For each module, we first define and categorize concepts in line with the existing literature. Then, we propose inspiring research questions, followed by detailed experiments to systematically analyze the impact of different conditions on the recommendation performance. Based on the empirical analysis, we finally summarize empirical findings for future research.
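To make the four prompt components concrete, the following is a minimal Python sketch of how a list-wise ranking prompt might be assembled from a task description, a user history, a candidate list, and a prompting-strategy hint. The class name `PromptParts`, the function `build_prompt`, the field names, and the prompt wording are all illustrative assumptions and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptParts:
    """The four prompt components named in the text (field names are hypothetical)."""
    task_description: str
    user_history: List[str]   # user interest modeling: titles of interacted items
    candidates: List[str]     # candidate items construction
    strategy_hint: str = ""   # prompting strategy, e.g. a recency hint


def build_prompt(parts: PromptParts, max_history: int = 5) -> str:
    """Assemble a list-wise ranking prompt from the four components."""
    # Simple truncation to the most recent items (one of several possible choices).
    recent = parts.user_history[-max_history:]
    history = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(recent))
    cands = "\n".join(f"[{i}] {t}" for i, t in enumerate(parts.candidates))
    sections = [
        parts.task_description,
        "Interaction history (oldest to newest):\n" + history,
        "Candidate items:\n" + cands,
    ]
    if parts.strategy_hint:
        sections.append(parts.strategy_hint)
    sections.append("Answer with the candidate indices in ranked order.")
    return "\n\n".join(sections)


example = PromptParts(
    task_description="Rank the candidate movies for this user.",
    user_history=["Alien", "Blade Runner", "The Matrix"],
    candidates=["Inception", "Notting Hill"],
    strategy_hint="Note that the user watched 'The Matrix' most recently.",
)
prompt = build_prompt(example, max_history=2)
print(prompt)
```

Keeping the components as separate fields makes it straightforward to vary one factor (for example, the history length or the strategy hint) while holding the others fixed, which is the kind of controlled comparison the analysis describes.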

In general, the contributions of our work can be summarized as follows:

• We derive a general framework, ProLLM4Rec, to summarize existing work on utilizing LLMs as foundation models for recommendation, which can be generalized to multiple scenarios and tasks by different LLMs and prompts.

• We provide a systematic analysis on leveraging LLMs for recommendation, focusing on two aspects: LLMs and prompt engineering. The use of LLMs includes analysis of public availability, tuning strategies, model architecture, parameter scale, and context length. Moreover, prompt engineering consists of discussions on task description, user interest modeling, candidate items construction and prompting strategies. For each aspect, we first define and describe the relevant concepts, and then provide reference solutions with experiments.

• Extensive experiments on two public datasets yield key findings for recommendation with LLMs. As listed in Table 1, our findings cover experimental settings on each aspect of our proposed framework, and provide empirical experience on evaluating the performance of LLMs on recommendation tasks for future research.

In what follows, we first review the related work in Section 2. In Section 3, we present our proposed general framework and its instantiation, and introduce overall settings of the following experiments. As the core components of this paper, we discuss two main aspects of ProLLM4Rec, i.e., LLMs and prompts in Section 4 and Section 5, respectively. For each aspect, we generalize key factors that affect recommendation results, and conduct corresponding experiments to summarize empirical findings. At last, Section 6 concludes this paper and sheds light on future directions.

2. Related Work

2.1. Recommender Systems

To tackle the challenge of information overload (Ni et al., 2019; Thorat et al., 2015; Yuan et al., 2023b), recommender systems have become pivotal tools for delivering personalized content to users across various domains. In line with previous studies, recommendation algorithms aim to derive user preferences and behavioral patterns from their historical interactions. The most common technique for the interaction-based recommendation is Collaborative Filtering (CF) (Sarwar et al., 2001; Su and Khoshgoftaar, 2009), which recommends items based on preferences of similar users. Matrix Factorization (MF) (Koren et al., 2009) is a prevalent approach in collaborative filtering, and it constructs embedding representations for users and items from the interaction matrix, allowing similarity scores to be computed efficiently. Furthermore, Neural Collaborative Filtering (NCF) (He et al., 2017), integrating deep neural networks, replaces the inner product used in MF with a neural architecture, thereby demonstrating better performance than previous methods. Contemporary advancements in deep neural network architectures have enhanced the integration of user and item embeddings (Liu et al., 2023b). For example, since recommendation data can be represented as graph-structured data, Graph Neural Networks (GNNs) (Wu et al., 2022) can be utilized to encode the information of the interaction graph (nodes consist of users and items), and generate meaningful representations via message propagation and contrastive learning strategies (Wang et al., 2019; He et al., 2020; Wu et al., 2021a; Lin et al., 2022). As Pre-trained Language Models (PLMs) gain prominence, there is a growing interest in pre-trained large-scale recommendation models powered by PLMs (Zhou et al., 2020; Zhang et al., 2021; Hou et al., 2022). 
In addition to user-item pairs and IDs, content-based recommendation algorithms leverage auxiliary modalities such as textual and visual information to augment user and item representations in recommendation tasks (Qiu et al., 2021; Yuan et al., 2023b; Wang, 2023).
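
As a brief illustration of the distinction between MF and NCF mentioned above, the two scoring functions can be written as follows. The symbols are assumptions introduced here for illustration: $\mathbf{p}_u$ and $\mathbf{q}_i$ denote the learned embeddings of user $u$ and item $i$, $\hat{y}_{ui}$ the predicted preference score, and $f_\theta$ a neural network with parameters $\theta$.

\[
\hat{y}_{ui}^{\mathrm{MF}} = \mathbf{p}_u^{\top} \mathbf{q}_i,
\qquad
\hat{y}_{ui}^{\mathrm{NCF}} = f_\theta(\mathbf{p}_u, \mathbf{q}_i).
\]

In the MF form the interaction between user and item is fixed to an inner product, whereas in the NCF form the interaction function itself is learned from data.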

2.2. Large Language Models for Recommender Systems

Large Language Models (LLMs) are a cutting-edge advancement in artificial intelligence that excel in understanding and generating human-like texts (Touvron et al., 2023a; Zeng et al., 2022; Raffel et al., 2020). LLMs are usually transformer-based models with billions of parameters trained on vast amounts of textual data, allowing them to comprehend contexts, generate coherent sentences, and even mimic human conversations (Haleem et al., 2022; Wu et al., 2023a). Through this process, LLMs have shown great potential in the field of Natural Language Processing (NLP), and have demonstrated various capabilities in dealing with complex NLP tasks, including but not limited to In-Context Learning (ICL) (Brown et al., 2020), instruction following (Touvron et al., 2023b) and step-by-step reasoning abilities (Zhao et al., 2023). Recently, LLMs have been increasingly integrated into recommender systems to provide personalized recommendations (Li et al., 2023l; Wu et al., 2023b). Recent studies have explored the fusion of LLMs with recommender systems, which can be divided into three paradigms, i.e., LLM as recommendation model (Section 2.2.1), LLM improves recommendation models (Section 2.2.2) and LLM as recommendation simulator (Section 2.2.3) as follows.

2.2.1. LLM as Recommendation Model

This paradigm takes the LLM as a recommender system. Employing diverse strategies like pre-training, fine-tuning, or prompting, LLMs can combine general knowledge with input data to yield personalized recommendations for users (Hou et al., 2023; Dai et al., 2023a; Ji et al., 2023). Due to the variety of recommendation tasks, LLM as recommendation model can be categorized into two types: discriminative recommendation and generative recommendation.

$\bullet$ Discriminative recommendation instructs LLMs to make recommendation decisions on the given candidate items, usually focusing on item scoring (Fu et al., 2023) and re-ranking tasks (Dai et al., 2023a). For Click-Through Rate (CTR) prediction tasks, Liu et al. (Liu et al., 2023c) designed specific zero-shot and few-shot prompts to evaluate abilities of LLMs on rating predictions. LLMs were required to assign a score for the item according to the previous rating history of users and the score range given in prompts, while the result indicated that LLMs can outperform classical rating methods (e.g., MF and MLP) in few-shot conditions (Liu et al., 2023c). Kang et al. (Kang et al., 2023) further formulated the rating prediction task as multi-class classification and regression task, investigating the influence of model size on recommendation performance. Different from these methods, Hou et al. (Hou et al., 2023) structured a re-ranking task, employing in-context learning approaches for LLMs to rank items in the candidate pool. Previous studies highlighted the sensitivity of LLMs to the sequence of interaction histories provided in prompts (Ma et al., 2023), which can be alleviated by strategies such as recency-focused prompting (Hou et al., 2023).

$\bullet$ Generative recommendation requires LLMs to generate items recommended to users, either from candidate item lists within prompts or from LLMs with general knowledge (Li et al., 2023l). GenRec (Ji et al., 2023) leveraged the contextual comprehension ability of LLMs to transform interaction histories into formulated prompts for next-item predictions. To address instances where GenRec might propose items absent from candidate lists, GPT4Rec (Li et al., 2023m) used the BM25 algorithm to retrieve, from the candidate item list, the item most similar to the one generated by the LLM. In addition to top-n recommendations, LLMs can be leveraged for generative tasks such as explainable recommendations (Lei et al., 2023; Colas et al., 2023; Li et al., 2023f; Yang et al., 2023b; Luo et al., 2023a) and review summarization (Geng et al., 2022; Liu et al., 2023d; Wang et al., 2023c). Moreover, with their strong abilities in dialogue comprehension and communication, LLMs are naturally considered as the backbone of conversational and interactive recommender systems. ChatRec (Gao et al., 2023) designed an interactive recommendation framework based on ChatGPT, which can comprehend requirements of users through multi-turn dialogues and traditional recommendation models. In addition, RecLLM (Friedman et al., 2023) combined the dialogue management module with a ranker module and a controllable LLM-based user simulator to generate synthetic conversations for tuning system modules. Apart from these methods, InteRecAgent (Huang et al., 2023) employed LLMs as the brain and recommender models as tools, combining their respective strengths to create an interactive recommender system. As a conversational recommender system, InteRecAgent enabled traditional recommender systems to become interactive systems with a natural language interface through the integration of LLMs.
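
The BM25-based grounding step described for generative recommendation can be sketched as follows. This is a from-scratch Okapi BM25 scorer with whitespace tokenization, written only to illustrate the idea of mapping a free-form generated title onto an existing catalog item; it is not the GPT4Rec implementation, and the function names (`bm25_scores`, `ground_to_catalog`), the parameter defaults, and the toy catalog are assumptions.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25.

    Assumes `docs` is non-empty and contains at least one non-empty document.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def ground_to_catalog(generated_title: str, catalog: List[str]) -> str:
    """Return the catalog item with the highest BM25 score for the generated title."""
    scores = bm25_scores(generated_title, catalog)
    best = max(range(len(catalog)), key=scores.__getitem__)
    return catalog[best]


catalog = ["The Matrix", "The Matrix Reloaded", "Notting Hill"]
grounded = ground_to_catalog("matrix reloaded 2003", catalog)
print(grounded)
```

A lexical matcher of this kind only helps when the generated text shares tokens with the catalog title; a generated title with no overlapping tokens scores zero against every item, so a practical system would need a fallback for that case.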

There are mainly two paradigms for adapting LLMs as recommenders, i.e., non-tuning paradigm and tuning paradigm.

$\bullet$ Non-tuning paradigm keeps parameters of LLMs fixed and extracts the general knowledge of LLMs with prompting strategies. Existing work in the non-tuning paradigm focuses on designing appropriate prompts to stimulate recommendation abilities of LLMs (Yao et al., 2023; Wang et al., 2023c; Li et al., 2023k). Liu et al. (Liu et al., 2023c) proposed a prompt construction framework to evaluate abilities of ChatGPT on five common recommendation tasks, where each type of prompt had zero-shot and few-shot versions. Hou et al. (Hou et al., 2023) not only used prompts to evaluate abilities of LLMs on sequential recommendation, but also introduced recency-focused prompting and in-context learning strategies to alleviate order perception and position bias issues of LLMs. ChatRec (Gao et al., 2023) and InteRecAgent (Huang et al., 2023) mentioned above also fall within the non-tuning paradigm.

$\bullet$ Tuning paradigm aims to update parameters of LLMs to inject recommendation capabilities into the LLM itself. The tuning strategies include fine-tuning (Zheng et al., 2023; Hu et al., 2021) and instruction tuning (Luo et al., 2023b; Qiu et al., 2023). P5 (Geng et al., 2022) proposed five types of instructions targeting different recommendation tasks to fine-tune a T5 (Raffel et al., 2020) model. The instructions were formulated based on conventional recommendation datasets with designed templates, which equipped LLMs with generation abilities for unseen prompts or items (Geng et al., 2022). InstructRec (Zhang et al., 2023f) further designed abundant instructions for tuning, including 39 manually designed templates covering the preference, intention, task form and context of a user. Compared with these methods, TallRec (Bao et al., 2023b) used LoRA (Hu et al., 2021), a parameter-efficient tuning method, to handle the two-stage tuning for LLMs. It was first fine-tuned on general data of Alpaca (Taori et al., 2023), and then further fine-tuned with the historical information of users.
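
For readers unfamiliar with LoRA, the parameter-efficient update it relies on can be summarized in one equation. The notation below follows the common presentation of LoRA (Hu et al., 2021) and is introduced here only for illustration: $W_0 \in \mathbb{R}^{d \times k}$ is a frozen pre-trained weight matrix, $x$ is the layer input, and $B$ and $A$ are the trainable low-rank factors. The specific ranks and target layers used by TallRec are not stated in this section.

\[
h = W_0 x + \Delta W x = W_0 x + B A x,
\qquad
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k).
\]

Only $A$ and $B$ are updated during tuning, so the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$, which is why full fine-tuning and LoRA differ substantially in training cost.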

Although LLMs as recommendation models present a way of utilizing the common knowledge of LLMs, this paradigm still faces several problems. Due to the high computational cost (Touvron et al., 2023b; Liao et al., 2023) and slow inference (Li et al., 2023b), LLMs struggle to match the efficiency of traditional recommendation methods (Hou et al., 2023; Gao et al., 2023). Additionally, constraints on input sequence length limit the amount of external information (e.g., candidate item lists) that can be provided (Lin et al., 2023b), degrading the performance of LLMs in scenarios such as sequential recommendation. Furthermore, since information in recommendation tasks is difficult to express in natural language (Lin et al., 2023c; Yuan et al., 2023a), it is hard to formulate appropriate prompts that make LLMs truly understand what they are required to do, leading to unexpected performance.

2.2.2. LLM Improves Recommendation Models

This method mainly utilizes LLMs to generate auxiliary information to enhance the performance of recommendation models (Du et al., 2023; Wang et al., 2023a; Wei et al., 2023), based on the reasoning abilities and common knowledge. The research on how to improve recommendation models with LLMs can be divided into three categories, i.e., LLM as feature encoder, LLM for data augmentation and LLM co-optimized with domain-specific models.

$\bullet$ LLM as feature encoder. The representation embeddings of users and items are important factors in classical recommender systems (Rendle et al., 2009; He et al., 2020). LLMs serving as feature encoders can generate related textual data of users and items, and enrich their representations with semantic information. U-BERT (Qiu et al., 2021) injected user representations with user review texts, item review texts and domain IDs, augmenting the contextual semantic information in user vectors. Wu et al. (Wu et al., 2021b), on the other hand, employed language models to generate item representations for news recommendation. With the development of LLMs and prompting strategies, BDLM (Zhang et al., 2023d) fed a prompt consisting of interaction and contextual information into LLMs, and obtained top-layer feature embeddings as user and item representations.

$\bullet$ LLM for data augmentation. For this paradigm, LLMs are required to generate auxiliary textual information for data augmentation (Agrawal et al., 2023; Wei et al., 2023; Liu et al., 2023b, a). By using prompting or in-context learning strategies, the related knowledge will be extracted out in different text forms to facilitate recommendation tasks (Xi et al., 2023; Du et al., 2023; Wang et al., 2023a). One form of auxiliary textual information is summarization or text generation, enabling LLMs to enrich representations of users or items (Wang, 2023). For example, Du et al. (Du et al., 2023) proposed a job recommendation model which utilized the capability of LLMs for summarization to extract user information and job requirements. Considering item descriptions and user reviews, KAR (Xi et al., 2023) extracted the reasoning knowledge on user preferences and the factual knowledge on items through specifically designed prompts, while SAGCN (Liu et al., 2023b) utilized a chain-based prompting strategy to generate semantic information. Another form of using the textual features generated from LLMs is for graph augmentation in the recommendation field. LLMRG (Wang et al., 2023a) leveraged LLMs to extend nodes in recommendation graphs. The resulting reasoning graph was encoded using GNN, which served as additional input to enhance sequential models. LLMRec (Wei et al., 2023) adopted three types of prompts to generate information for graph augmentation, including implicit feedback, user profile and item attributes.

$\bullet$ LLM co-optimized with domain-specific models. The categories mentioned above mainly focus on the impact of common knowledge for domain-specific models (Wang, 2023). However, LLM itself often struggles to handle domain-specific tasks due to the lack of task-related information (Yao et al., 2023; Kang et al., 2023). Therefore, some studies conducted experiments to bridge the gap between LLMs and domain-specific models. BDLM (Zhang et al., 2023d) proposed an information sharing module serving as an information storage mechanism between LLMs and domain-specific models. The user embeddings and item embeddings stored in the module were updated in turn by the LLM and the domain-specific model, enhancing the performance of both sides. CoLLM (Zhang et al., 2023b) combined LLMs with a collaborative model, which formed collaborative embeddings for LLM usage. By tuning LLM and collaborative module, CoLLM showed great improvements in both warm and cold-start scenarios. In conversational recommender systems, approaches such as ChatRec (Gao et al., 2023) and InteRecAgent (Huang et al., 2023) considered LLMs as the backbone, and leveraged traditional recommendation models for candidate item retrieval.

In addition to the context limitation and computational cost of LLMs (Touvron et al., 2023b), the paradigm that LLM improves recommendation models also encounters other problems. (1) Although LLMs can enhance offline recommender systems to avoid online latency, this paradigm also limits the ability of LLMs to model real-time collaborative filtering information, neglecting the key factor for recommendation (Wang, 2023; Xi et al., 2023; Wei et al., 2023). (2) Feature encoding, data augmentation, and collaborative training inevitably expose the user data to LLMs, which may bring privacy, security and ethical issues (Weidinger et al., 2021; Shen et al., 2023; Carranza et al., 2023).

2.2.3. LLM as Recommendation Simulator

Due to the gap between offline metrics and online performance of recommendation methods (Pan et al., 2022; He et al., 2017), it is necessary to capture the intents of users by simulating real-world elements. To this end, LLM as the recommendation simulator takes LLMs as the foundational architecture of generative agents, and these agents simulate virtual users in the recommendation environment (Wang et al., 2023i; Zhang et al., 2023e, c).

Recently, a considerable body of work has studied the performance of LLMs as the recommendation simulator. Agent4rec (Zhang et al., 2023e) was a movie simulator consisting of two core components: LLM-empowered generative agents and the recommendation environment. The work equipped each agent with user profile, memory and actions modules, mapping basic behaviors of real-world users. AgentCF (Zhang et al., 2023c), on the other hand, considered not only users but also items as agents. It captured the two-sided relations between users and items, and optimized these agents by prompting them to reflect on and adjust the misleading simulations collaboratively (Zhang et al., 2023c). Moreover, in addition to behaviors within the recommender system, RecAgent (Wang et al., 2023i, g) took external influential factors of user agent simulation into account, such as friend chatting and social advertisement. In order to describe users accurately, RecAgent applied five features for users, and implemented two global functions including real-human playing and system intervention to operate agents flexibly.

Although LLM as recommendation simulator aims to imitate real-world recommendation behaviors to enhance the recommendation performance, it still has deficiencies in some aspects. Firstly, since current work consists mainly of demo systems that operate a few agents (Wang et al., 2023i, g), there remains a gap between the virtual agent environment and practical real-world recommendation applications, which requires further research and development. Additionally, LLMs may raise privacy and safety concerns. Many studies take ChatGPT as the architecture of agents, which poses security risks to the information recommended to users (Zhang et al., 2023c). Moreover, Zhang et al. (Zhang et al., 2023e) have shown that hallucination in LLMs can have a large impact on recommendation simulations. The LLM sometimes fails to accurately simulate human users, for example by providing inconsistent scores for an item or fabricating non-existent items for rating.

2.3. Differences with Existing Work

Previous surveys of LLMs for recommender systems usually categorized existing work according to a classification standard and introduced the studies in each category in turn. Lin et al. (Lin et al., 2023a) categorized existing work into the targets and the methods of adapting LLMs to recommendation tasks. Wu et al. (Wu et al., 2023b) mainly focused on the form of information between LLMs and recommender systems, including LLM embeddings, LLM tokens and LLMs. Fan et al. (Fan et al., 2023) summarized the framework into four sections, involving deep representation learning, pre-training, fine-tuning and prompting of LLMs. These surveys mainly concentrated on presenting related work and summarizing its advantages and limitations.

Compared to previous work, we concentrate on the ability of LLMs leveraged for recommendation tasks, and provide a systematic empirical analysis on prompting LLM-based recommendations by devising a general framework ProLLM4Rec. We mainly focus on two aspects, i.e., LLMs and prompt engineering, providing definitions and solutions from both conceptual and methodological perspectives. Furthermore, we conduct experiments to discover new findings and validate results previously discussed in existing research, serving as an inspiration for future research efforts.

3. GENERAL FRAMEWORK AND OVERALL SETTINGS

Figure 1. The overall framework of our proposed ProLLM4Rec.

In this section, we present our proposed ProLLM4Rec with two important components, namely LLMs and prompts. Generally speaking, the rich world knowledge and general capabilities of LLMs demonstrate the potential to develop LLM-powered recommender systems. Nevertheless, it is essential to introduce appropriate prompting engineering to provide domain-specific knowledge on recommendation tasks for LLMs to serve as personalized recommenders. In the following, we first describe our general framework by defining several key elements (Section 3.1). Then, we explain how our framework can be generalized to various recommendation scenarios and tasks by framework instantiation (Section 3.2). Finally, we introduce the details of the overall experimental settings for further analysis (Section 3.3).

3.1. Overview of the Approach

When it comes to the use of LLMs in recommender systems, the first step involves choosing suitable LLMs tailored to specific scenarios, with an emphasis on their distinct capabilities. Subsequently, it is important to conduct prompt engineering to instruct LLMs to perform effective recommendations. Typically, in the realm of recommender systems using LLMs, although the task settings in existing studies vary a lot, the overall prompting format remains relatively consistent with minor variations (Hou et al., 2023; Gao et al., 2023; Dai et al., 2023a; Wang et al., 2023c). Therefore, to unify existing prompting approaches for different recommendation purposes, as illustrated in Fig. 1, we establish a general framework for prompt engineering in ProLLM4Rec, consisting of four key factors in the prompts. Firstly, a task description is required to clearly express the specific goal of the recommendation task. Secondly, prompts need to be carefully designed to express user interest while enabling LLMs to provide personalized recommendations with both world and domain knowledge. Thirdly, since the purpose of a recommender system is to provide users with items, potential candidate items should be constructed so that LLMs can make recommendations with an understanding of domain-specific item information. Finally, special prompting strategies can be employed to enhance the specialized recommendation capabilities of LLMs. In what follows, we will thoroughly examine the impact of each factor on the recommendation performance.

3.1.1. Key Elements in Our Framework

To carry out the experiments, we first describe the key elements in ProLLM4Rec, to clarify their definitions and scope in this work. Specifically, we introduce the following five elements for our framework:

$\bullet$ Large language models. As proposed in previous research (Hou et al., 2023; Zhang et al., 2023b; Li et al., 2023a; Bao et al., 2023a), there exists a large gap between general language modeling and personalized user behavioral modeling, making it non-trivial to utilize LLMs in recommender systems. In this work, we investigate the efficacy of LLMs from perspectives of public availability, tuning strategies, model architecture, parameter scale, and context length, aiming to gain insights into the selection of appropriate LLMs for performing recommendation tasks. Observations and discussions on the use of LLMs are presented in Section 4.

$\bullet$ Task description. To adapt LLMs to the scenario of recommendation, it is necessary to clearly express the context and target of recommender systems for LLMs in the prompt, i.e., task description. With different task descriptions in the prompt, large language models can be applied to various recommendation scenarios and tasks such as click-through rate predictions (Kang et al., 2023; Bao et al., 2023b, a), sequential recommendation (Hou et al., 2023; Dai et al., 2023a; Zheng et al., 2023) and conversational recommender systems (He et al., 2023; Gao et al., 2023; Wang et al., 2023h).

$\bullet$ User interest modeling. The modeling of user interest is the key to recommendation tasks (Rendle et al., 2009; He et al., 2017). When leveraging LLMs for recommendation, users are generally expressed in natural language text, which is different from traditional approaches capturing user preference from ID-based behavior sequences (Kang and McAuley, 2018; Wu et al., 2021a). In this paper, we mainly consider the user interest reflected by his or her interactions with items. Specifically, as detailed in Section 5.2, we employ item description texts, user profiles, and historical interactions between users and items to reveal the underlying user interest in natural language (Wang et al., 2023c; Lin et al., 2023b; Shu et al., 2023; Yao et al., 2023).

$\bullet$ Candidate items construction. The purpose of recommender systems is to provide users with items to choose from, so candidate items construction is a crucial step in our framework (Hou et al., 2023; Zhang et al., 2023f; Dai et al., 2023a). A simple approach is to provide several candidate items in prompts for the LLM, e.g., the items recalled by traditional recommendation models (Lin et al., 2023b; Yue et al., 2023). Due to the input length limitation of LLMs, it is not possible to include all items in the prompts. In addition to selecting suitable candidate sets, there are also methods that directly generate candidate items by LLMs, utilizing strategies such as output probability distribution (Yue et al., 2023) and vector quantization (Zheng et al., 2023) for item indexing and grounding. Section 5.3 will focus on the construction strategies of candidate items, including selection and generation.

$\bullet$ Prompting strategies. Despite the impressive capabilities of LLMs, they tend to exhibit unsatisfactory performance in providing personalized recommendations (Hou et al., 2023; Dai et al., 2023a; Liu et al., 2023c; Kang et al., 2023). The reason may stem from the significant semantic gap between the general knowledge encoded in LLMs and the domain-specific behavioral pattern and item catalogs of recommender systems (Zhang et al., 2023b; Li et al., 2023a). To specialize LLMs to recommender systems, we summarize and propose several prompting strategies specialized for recommendation tasks. Details will be discussed in Section 5.4.
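To make the four prompt factors above concrete, the following minimal sketch assembles a single prompt from a task description, a user history, a candidate list and an optional prompting strategy. The class name `PromptParts`, the function `build_prompt` and all template wording are hypothetical choices for illustration and are not the prompts used in our experiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptParts:
    """Hypothetical container for the four prompt factors of the framework."""
    task_description: str        # what the LLM is asked to do
    user_history: List[str]      # titles of recently interacted items, oldest first
    candidates: List[str]        # candidate item titles shown to the LLM
    strategy_hint: str = ""      # optional prompting strategy, e.g. role prompting


def build_prompt(parts: PromptParts) -> str:
    """Assemble one natural-language prompt from the four factors."""
    history = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(parts.user_history))
    candidates = "\n".join(f"[{i}] {t}" for i, t in enumerate(parts.candidates))
    sections = [
        parts.strategy_hint,
        f"The user interacted with these items in order:\n{history}",
        f"Candidate items:\n{candidates}",
        parts.task_description,
    ]
    # Empty sections (e.g. no strategy hint) are skipped.
    return "\n\n".join(s for s in sections if s)
```

For instance, calling `build_prompt` with `strategy_hint="You are a movie recommender."` and a task description such as "Rank the candidate items by how likely the user is to watch them next." yields a role-prompted ranking prompt, and changing only the task description turns the same history and candidates into a rating or click-through rate prompt.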

Table 2. Instantiation of existing work for ProLLM4Rec.

LLMs	Task description	User interest	Candidate items	Prompting strategies	Related Work
Not tuning setting
ChatGPT, GPT-4	CTR predictions, rating, re-ranking	recent and relevant items (with attributes), user profile	pointwise, pairwise, listwise item(s)	chain of thoughts, in-context learning, role prompting	(Di Palma et al., 2023; Kang et al., 2023; Dai et al., 2023a; Shu et al., 2023; Hou et al., 2023; Wang and Lim, 2023; Wang et al., 2023c; Liu et al., 2023c; Yao et al., 2023; Zhiyuli et al., 2023; Liu et al., 2023d; Li et al., 2023k; Wang et al., 2023f; Liu et al., 2023e; Ma et al., 2023)
ChatGPT, GPT-4	conversational recommender systems	user explicit interest, interactive feedback	recalled from traditional models	role prompting	(Gao et al., 2023; Friedman et al., 2023; He et al., 2023; ** et al., 2023; Zhu et al., 2023; Wang et al., 2023h; Huang et al., 2023)
ChatGPT, GPT-4	generative recommendation	recent items (with attributes), user profile	(not provided, generation methods)	basic prompts	(Liu et al., 2023a; Wang et al., 2023e)
(Parameter-efficient) Fine-tuning setting
LLaMA, LLaMA2, Vicuna, ChatGPT	CTR predictions, rating, re-ranking	recent and relevant items (with attributes), user profile, collaborative embedding	pointwise, listwise item(s)	chain of thoughts, role prompting, soft prompting	(Bao et al., 2023b; Zhang et al., 2023b; Kang et al., 2023; Yang et al., 2023a; Lin et al., 2023b; Yue et al., 2023; Liao et al., 2023; Wang et al., 2023b; Di Palma et al., 2023; Li et al., 2023h; Fu et al., 2023; Shi et al., 2023)
LLaMA, LLaMA2, BART, GPT	recall, retrieving	recent and relevant items (with attributes)	(not provided, item grounding methods)	basic prompts	(Lin et al., 2023c; Bao et al., 2023a; Zheng et al., 2023; Li et al., 2023b; Petrov and Macdonald, 2023; Li et al., 2023m; Ji et al., 2023; Zhang et al., 2021)
Instruction tuning setting
(Flan-)T5, LLaMA2	rating, ranking, retrieving, explanation, summarization	recently interacted items, user profile, short-term intentions	pointwise, pairwise, listwise item(s)	basic prompts	(Geng et al., 2022; Zhang et al., 2023f; Li et al., 2023j; Chu et al., 2023; Qiu et al., 2023; Luo et al., 2023b)

3.2. Instantiation of ProLLM4Rec

By combining the key elements mentioned above, we can instantiate various types of recommender systems in our framework with the following five steps. Specifically, (1) we can employ LLMs with varying levels of public availability, different tuning strategies, model architectures, parameter scales, and context lengths. (2) We can define a range of task descriptions, such as retrieving, rating, recalling, and ranking. (3) Regarding user interest modeling, we can employ different types of interest, representation forms, and modeling methods. (4) When collecting candidate items, we take into account their different representation types, sources, and grounding methods. (5) Additionally, we can introduce several well-designed prompting strategies to effectively guide the recommendation capabilities of LLMs. To demonstrate the compatibility and versatility of our framework, we summarize previous work on LLM-powered recommender systems in Table 2 based on various settings in our framework ProLLM4Rec.

3.3. Experimental Settings

In this section, we introduce the overall settings of the following experiments. We first describe the basic information of the two public datasets, and then present the configurations and implementations to conduct our study in detail.

3.3.1. Datasets

The domain characteristics of movies and books are close to the general knowledge of LLMs, which facilitates further analysis. Considering the scale, popularity and side information of public datasets, we select two representative datasets to conduct our study, i.e., MovieLens-1M (Harper, 2015) and Amazon Books (2018) (Ni et al., 2019) as follows.

$\bullet$ MovieLens-1M (Harper, 2015) is one of the most widely used benchmark datasets in the field of recommender systems, covering movie ratings and attributes on the website movielens.org. We use the one million version from the MovieLens datasets.

$\bullet$ Amazon Books (2018) (Ni et al., 2019) is an updated version of the Amazon review dataset released in 2014. Amazon initially sold only books online, so the data in the book domain is the most abundant. To improve the data quality, we filter out inactive users and unpopular products, and remove noisy records that lack necessary attributes.

In our research, we are concerned about how LLMs can fully utilize the domain knowledge to make recommendations, and use the title of items as the input for prompts. However, titles are not enough to describe items, and there are deviations between the text of the title and the content of the item itself (e.g., the movie Twelve Monkeys). Therefore, we further investigate the benefits of detailed item descriptions on the recommendation effect. As shown in Table 3, there are no item descriptions in the original dataset of MovieLens, only the release year, title, and genre. To enrich the movie dataset, we use the general knowledge of ChatGPT¹ to generate text descriptions for movies.

¹ The URL of ChatGPT API: https://chat.openai.com/. Note that there are multiple versions of the ChatGPT API, and OpenAI has released interfaces on March 1 and June 13, 2023, respectively. In the absence of clear annotations, the ChatGPT used in this article is “gpt3.5-turbo-4k-0613”.

Table 3. Statistics of two public datasets for ProLLM4Rec.

Dataset	#User	#Item	#Interaction	Sparsity	Item Attributes
MovieLens-1M	6,040	3,706	1,000,209	4.4642%	release year, title, genre
Amazon-Books	13,469	12,984	1,142,940	0.6536%	title, categories, brand, price, description

3.3.2. Configuration and Implementation

As for ProLLM4Rec, we can directly evaluate the cold-start recommendation ability of large language models in the zero-shot setting, as well as evaluate the fine-tuned ability with few or full recommendation samples in the fine-tuning setting. Considering the typical scenarios of LLMs in recommender systems, we conduct experiments on two representative task settings, i.e., (1) the zero-shot ranking task without modifying parameters of LLMs (Dai et al., 2023a; Hou et al., 2023; Gao et al., 2023; Ma et al., 2023) and (2) the click-through rate prediction task with LLMs tuned (Bao et al., 2023b; Kang et al., 2023; Fu et al., 2023; Shi et al., 2023).

$\bullet$ Zero-shot ranking task. On the one hand, we evaluate the zero-shot recommendation performance of LLMs for cold-start scenarios to study the effect of LLMs and the design of prompts. In this paper, our approach mainly concentrates on ranking tasks that better reflect the capabilities of LLMs (Dai et al., 2023a; Hou et al., 2023; Gao et al., 2023; Ma et al., 2023). As shown in Fig. 1, information of users and items is encoded into the prompt as inputs for LLMs. In this setting, we do not modify the parameters of LLMs, so the evaluated models are closed-source large language models or open-source models without fine-tuning. To conduct experiments on the impact of each factor on recommendation results with LLMs, we implement the overall architecture based on the open-source recommendation library RecBole (Zhao et al., 2021, 2022; Xu et al., 2023) and the zero-shot re-ranker LLMRank (Hou et al., 2023). Our basic prompt used in the MovieLens-1M dataset for the zero-shot ranking task is shown in Fig. 2(a).

$\bullet$ Click-through rate prediction task with LLMs tuned. On the other hand, we evaluate the fine-tuned recommendation performance of LLMs to explore how LLMs adapt to recommendation scenarios with data provided (Bao et al., 2023b; Kang et al., 2023; Fu et al., 2023; Shi et al., 2023). Although our framework can be generalized to various recommendation tasks, we concentrate on exploring the fine-tuning performance of LLMs with point-wise Click-Through Rate (CTR) prediction tasks to reduce selection bias. In this setting, we not only consider fine-tuning LLMs using recommendation data, but also devise a two-stage approach of using instruction data to fine-tune LLMs first, and then implement recommendation fine-tuning for further adaptation. Specifically, we compare the fine-tuned recommendation performance of the original LLM and the LLM after instruction tuning, respectively. LLaMA-7B (Touvron et al., 2023a) and LLaMA2-7B (Touvron et al., 2023b) are the original models, while Alpaca-lora-7B (Taori et al., 2023) and LLaMA2-chat-7B (Touvron et al., 2023b) are LLMs after instruction tuning. In terms of the tuning strategies of LLMs, we report results with both parameter-efficient fine-tuning and complete fine-tuning. We implement the fine-tuning framework based on the open-source library transformers and the instruction tuning code of LLaMA with Stanford Alpaca data². Our instruction data used in the MovieLens-1M dataset for the click-through rate prediction task is illustrated in Fig. 2(b).

² The repository of Alpaca-LoRA: https://github.com/tloen/alpaca-lora.

3.3.3. Evaluation Metrics

As for the zero-shot ranking task, considering cost and efficiency, we follow existing literature (Hou et al., 2023; Shu et al., 2023) and randomly sample 200 users instead of evaluating on the whole dataset. For each sampled user, we sort all items that the user has interacted with in chronological order. Then, we evaluate the results with the leave-one-out strategy and treat the last interacted item as the ground truth. For performance comparison, we fix the number of candidate items to 20 as in (Hou et al., 2023), and by default mix the ground-truth item with 19 other randomly sampled items at random positions. As for evaluation, we use two widely used ranking metrics in recommender systems, i.e., Recall (Buckland and Gey, 1994) and Normalized Discounted Cumulated Gain (NDCG) (Järvelin and Kekäläinen, 2002). Since the candidate set contains 20 items, we set $k=20$ for Recall@$k$ to measure whether the ground-truth item is recalled in the output of the LLM. In existing literature, the response can be re-generated until it meets the format requirements (Li et al., 2023i), whereas we measure the recall capability within a single inference for fair evaluation. Furthermore, we set $k=1,10,20$ for NDCG to examine the ranking ability in more detail. To reduce the effect of randomness, we repeat each experiment three times and report the average as the final result.
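The leave-one-out protocol and the two metrics can be sketched as follows. This is a minimal illustration rather than our evaluation code: the function names are hypothetical, and drawing the 19 other candidates only from items the user has not interacted with is an assumption of the sketch. With a single ground-truth item per user, the ideal DCG is 1, so NDCG@$k$ reduces to $1/\log_2(\mathrm{rank}+1)$ when the item is ranked within the top $k$ and 0 otherwise.

```python
import math
import random
from typing import List, Sequence


def build_candidates(ground_truth: str, item_pool: Sequence[str],
                     interacted: Sequence[str], n_candidates: int = 20,
                     seed: int = 0) -> List[str]:
    """Mix the held-out item with random other items at random positions.

    Assumption of this sketch: the other items are sampled from items
    the user has not interacted with.
    """
    rng = random.Random(seed)
    excluded = set(interacted) | {ground_truth}
    pool = [item for item in item_pool if item not in excluded]
    candidates = rng.sample(pool, n_candidates - 1) + [ground_truth]
    rng.shuffle(candidates)
    return candidates


def recall_at_k(ranked: Sequence[str], ground_truth: str, k: int) -> float:
    """1.0 if the ground-truth item appears in the top-k of the ranking."""
    return 1.0 if ground_truth in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], ground_truth: str, k: int) -> float:
    """NDCG@k with a single relevant item (ideal DCG equals 1)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == ground_truth:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

Here `ranked` is the item list parsed from one LLM response. If the response omits the ground-truth item, for example because of a formatting failure, both metrics are 0 for that user, which is why Recall@20 remains informative even though the candidate set itself has 20 items.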

As for the CTR prediction task, we first sort the original dataset by timestamp, use the latest 10,000 records for training and evaluation, and regard all earlier data as the interaction history of users. Then, we split the 10,000 interactions into training, validation and test sets in a ratio of 8:1:1. For each interaction, we retain 10 historical interacted items as the user representation, and set thresholds on rating data to obtain user preferences (Bao et al., 2023b). Interactions with a rating higher than or equal to the threshold are treated as items that the user likes, while those below the threshold are treated as dislikes. The rating threshold is 4 for the MovieLens-1M dataset and 5 for the Amazon-Books dataset. We use the training data to fine-tune LLMs and evaluate the recommendation performance on the test set. The validation set is used to select the best checkpoint during training. As for evaluation, we use accuracy, a widely used metric for CTR prediction. Random guessing yields an accuracy of around 0.5.
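The data preparation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not our preprocessing code: the record layout, the function names and the instruction wording are hypothetical, and splitting the latest records in chronological order is an assumption, since only the 8:1:1 ratio is specified. The `instruction`/`input`/`output` fields follow the Stanford Alpaca data format.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str, int, int]  # (user_id, item_title, rating, timestamp)


def split_latest(records: Sequence[Record], n_latest: int = 10_000,
                 ratios: Tuple[int, int, int] = (8, 1, 1)):
    """Return (history, train, valid, test) after sorting by timestamp."""
    ordered = sorted(records, key=lambda r: r[3])
    cut = max(len(ordered) - n_latest, 0)
    history, latest = ordered[:cut], ordered[cut:]
    n_train = len(latest) * ratios[0] // sum(ratios)
    n_valid = len(latest) * ratios[1] // sum(ratios)
    train = latest[:n_train]
    valid = latest[n_train:n_train + n_valid]
    test = latest[n_train + n_valid:]
    return history, train, valid, test


def to_instruction_sample(user_history: Sequence[Tuple[str, int]],
                          target: Tuple[str, int], threshold: int,
                          max_history: int = 10) -> Dict[str, str]:
    """Turn one interaction into an Alpaca-style sample with a Yes/No label."""
    recent = list(user_history)[-max_history:]
    liked = [title for title, rating in recent if rating >= threshold]
    disliked = [title for title, rating in recent if rating < threshold]
    title, rating = target
    return {
        # Hypothetical wording, not the template of Fig. 2(b).
        "instruction": "Given the user's liked and disliked items, "
                       "predict whether the user will like the target item. "
                       "Answer Yes or No.",
        "input": f"Liked: {liked}\nDisliked: {disliked}\nTarget: {title}",
        "output": "Yes" if rating >= threshold else "No",
    }
```

With `threshold=4` a MovieLens-1M rating of 4 or 5 maps to "Yes", whereas with `threshold=5` only a rating of 5 on Amazon-Books does. Accuracy is then the fraction of test samples whose generated answer matches `output`.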

3.3.4. Discussion on Variable Factors

To conduct our analysis, we mainly focus on two key aspects, i.e., LLMs and prompts. As for the effects of LLMs, we analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task description, user interest modeling, candidate items construction and prompting strategies. Because all factors affect the final result, the remaining factors must be fixed to sensible defaults when one factor is studied. Limited by resources and efficiency, it is neither necessary nor feasible to exhaust all combinations. When not explicitly specified, the LLM is the version of ChatGPT released in June 2023, which keeps the recommendation quality consistent across experiments. For the design of prompts, we refer to the template in (Hou et al., 2023) and emphasize the most recent items to re-rank the 20 random candidate items. The default method for modeling user interest is to concatenate the titles of recently interacted items into a natural language sequence.

4. THE IMPACT OF LARGE LANGUAGE MODELS AS RECOMMENDER SYSTEMS

LLMs are the core of our framework ProLLM4Rec, and directly determine the performance of LLMs as recommender systems (Liu et al., 2023d, c; Kang et al., 2023; Hou et al., 2023). Therefore, it is worth exploring how to choose a suitable LLM as the foundation for recommendation. In this section, we compare the differences between LLMs and traditional models on the recommendation performance, discuss how the different properties and tuning strategies of LLMs affect the recommendation results, present the limitations of LLMs as recommenders, and draw empirical conclusions through systematic experiments.

4.1. Classification of Large Language Models

In this paper, we consider language models with more than one billion parameters as LLMs. In line with existing research (Zhao et al., 2023; Wu et al., 2023b; Lin et al., 2023a; Li et al., 2023l; Fan et al., 2023), LLMs can be categorized into different classes from several perspectives. Most typically, LLMs can be divided into open-source and closed-source models in terms of public availability. When it comes to leveraging LLMs as the foundation model in recommender systems, tuning strategies can adapt LLMs to specific recommendation tasks. From the perspective of the model architecture, LLMs can also be categorized into encoder-decoder, causal decoder, and prefix decoder types (Zhao et al., 2023). Within the same framework of an LLM, it is widely acknowledged that the parameter scale and context length are two key factors that jointly determine the abilities of LLMs (Lin et al., 2023b; Hou et al., 2023). To explore the recommendation performance with respect to different variants of LLMs, we focus on five aspects, i.e., public availability, tuning strategies, model architecture, parameter scale and context length as follows.

4.1.1. Public Availability

According to whether the model checkpoints can be publicly obtained, existing LLMs can be divided into open-source models and closed-source models, and both can be leveraged as recommender systems.

$\bullet$ Open-source models refer to LLMs whose model checkpoints can be publicly accessible. As shown in Table 2, researchers often use recommendation data to fine-tune open-source models for performance improvement. As the representative of open-source models, LLaMA (Touvron et al., 2023a) and its variants like Vicuna (Chiang et al., 2023) are widely used when leveraging LLMs for recommender systems (Zheng et al., 2023; Li et al., 2023b). Parameter-Efficient Fine-Tuning (PEFT) strategies such as Low-Rank Adaptation (LoRA) (Hu et al., 2021) are frequently adopted for recommendation data considering the trade-off between effect and efficiency (Liao et al., 2023; Bao et al., 2023b). Other models such as the Flan-T5 (Chung et al., 2022) series from Google Inc. and ChatGLM (Zeng et al., 2022) from Tsinghua University are also popular in the field of recommender systems. The publicly available checkpoints of open-source models provide flexibility for LLMs to modify parameters tailored for recommendation tasks.

$\bullet$ Closed-source models refer to LLMs whose model checkpoints cannot be publicly accessed. For the closed-source LLMs utilized for recommenders, researchers generally study the zero-shot recommendation ability in cold-start scenarios. The most typical closed-source model is the ChatGPT series from OpenAI. The subsequent GPT-4 has stronger capabilities compared to ChatGPT, but it is still not open-source. In this paper, ChatGPT refers to the API function of “gpt3.5-turbo-4k-0613” unless specified. Without releasing checkpoints, OpenAI provides several ways for researchers and users to improve the model performance on specific tasks, such as plugins for website browsing (Lewis et al., 2020) and interfaces for fine-tuning ChatGPT (Li et al., 2023h). However, the flexibility of closed-source models as recommender systems is still limited due to the high price and black-box parameters. Faced with this challenge, existing literature has explored injecting knowledge of recommender systems into closed-source models by means of prompt design (Yao et al., 2023; Wang and Lim, 2023), retrieval enhancement (Huang et al., 2023; Wang et al., 2023c; Shu et al., 2023) and combination with traditional recommendation models (Wang et al., 2023h; Gao et al., 2023).

4.1.2. Tuning Strategies

During the deployment of LLMs in recommender systems, we can also classify existing work depending on whether LLMs are fine-tuned, as well as the various fine-tuning strategies employed, i.e., not-tuning setting, fine-tuning setting and instruction tuning setting.

$\bullet$ Not-tuning setting means evaluating the zero-shot recommendation ability of LLMs, which is generally used for closed-source models such as ChatGPT. By designing different prompt templates, LLMs without fine-tuning can be directly used for recommendation tasks such as click-through rate predictions (Di Palma et al., 2023; Kang et al., 2023), sequential recommendation (Hou et al., 2023; Wang et al., 2023c), and conversational recommender systems (He et al., 2023; Wang et al., 2023h; ** et al., 2023). In this case, the user interest is expressed explicitly (e.g., ratings and reviews) or implicitly (interacted items of users), and the limited candidate items can be recalled by traditional models, while prompting strategies such as role prompting and chain of thoughts are used. Inspired by Artificial Intelligence Generated Content (AIGC), it is worth noting that the excellent generation ability of LLMs provides opportunities for generative recommendation (Liu et al., 2023a; Wang et al., 2023e). Without providing candidate items, generative language models can directly generate the desired items that users need based on recommendation requirements, and they can also be generalized into our framework ProLLM4Rec as shown in Table 2.

$\bullet$ Fine-tuning setting means using recommendation data to fine-tune LLMs as recommender systems. Considering cost and efficiency, researchers often use parameter-efficient fine-tuning (e.g., Low-Rank Adaptation of Large Language Models, LoRA (Hu et al., 2021)) to quickly adapt to recommendation scenarios (Bao et al., 2023b; Liao et al., 2023; Zheng et al., 2023). As for LLMs, open-source models based on LLaMA (Touvron et al., 2023a) are widely used, including but not limited to LLaMA, LLaMA2 and Vicuna (Chiang et al., 2023) with different parameter sizes. Based on whether candidate items are provided, existing work on fine-tuning LLMs can be further divided into two kinds of recommendation tasks. On the one hand, researchers have explored fine-tuning models for recommendation tasks that provide candidate items, such as rating, re-ranking and predictions. Notably, the fine-tuning interface of the closed-source model ChatGPT has brought new breakthroughs to the research of LLMs, and there have been attempts to fine-tune ChatGPT for recommendation tasks (Li et al., 2023h). On the other hand, LLMs can also be fine-tuned for recall tasks of recommender systems by retrieving candidates from the whole item pool (Bao et al., 2023a; Lin et al., 2023c; Petrov and Macdonald, 2023; Li et al., 2023b). Through well-designed indexing, alignment, and retrieval strategies, directly generating recommendation items without providing lengthy candidate sequences is more suitable for practical application scenarios, which has not yet been fully explored.

$\bullet$ Instruction tuning setting means providing template instructions of recommenders as prompts to tune targeted LLMs, generally involving multiple recommendation tasks (Zhang et al., 2023f; Geng et al., 2022; Qiu et al., 2023; Chu et al., 2023). However, some fine-tuning methods (e.g., TALLRec (Bao et al., 2023b)) also involve the form of instructions. In order to avoid ambiguity, we consider the instructions of a single task template as fine-tuning settings, and the instructions of multiple tasks as instruction-tuning settings. As the backbone of recommender systems, researchers generally use T5 or Flan-T5 for various recommendation scenarios such as rating, ranking, retrieving, explanation generation and news recommendation. Besides Flan-T5, RecRanker (Luo et al., 2023b) instruction-tunes LLaMA2 as the ranker for top-$k$ recommendation. By instantiating appropriate instructions for each recommendation task, this setting can also be summarized into our framework.
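For readers unfamiliar with the parameter-efficient option mentioned in the fine-tuning setting, the low-rank update of LoRA (Hu et al., 2021) can be summarized as below. The symbols follow the LoRA paper rather than notation defined elsewhere in this article: $W_0$ is a frozen pre-trained weight matrix, $A$ and $B$ are the only trainable matrices, $r$ is the rank and $\alpha$ is a scaling constant.

```latex
% LoRA: the pre-trained weight W_0 is frozen, only B and A are trained.
\[
  h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x,
  \qquad
  W_0 \in \mathbb{R}^{d \times k},\;
  B \in \mathbb{R}^{d \times r},\;
  A \in \mathbb{R}^{r \times k},\;
  r \ll \min(d, k).
\]
% Trainable parameters per adapted matrix: r(d + k) instead of d k.
```

As a worked example under assumed sizes, for $d = k = 4096$ and $r = 8$ the adapted matrix has $8 \times (4096 + 4096) = 65{,}536$ trainable parameters instead of $4096^2 \approx 1.68 \times 10^{7}$, which is why this kind of tuning is attractive when cost and efficiency matter.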

4.1.3. Model Architecture

For the architecture design of LLMs leveraged as recommender systems, we consider three mainstream architectures as summarized in (Zhao et al., 2023), i.e., encoder-decoder, prefix decoder, and causal decoder. The recommendation scenarios for each framework are introduced in what follows.

$\bullet$ The encoder-decoder architecture adopts two stacks of Transformer blocks as the encoder and decoder, following the original Transformer (Vaswani et al., 2017). In line with existing literature on LLM-based recommender systems, the bidirectional encoder of this architecture allows LLMs to easily customize encoders and decoders towards recommendation (e.g., dual-encoder considering both ids and text (Qiu et al., 2023)), and conveniently adapt to multiple recommendation tasks (Chu et al., 2023; Li et al., 2023j; Geng et al., 2022; Zhang et al., 2023f; Qiu et al., 2023). Only a few LLMs use the encoder-decoder architecture, the typical ones being the T5 series and its variant Flan-T5 (Chung et al., 2022).

$\bullet$ The prefix decoder architecture is also a decoder-only architecture, known as the non-causal decoder. It can bidirectionally encode the prefix tokens like the encoder-decoder architecture, and perform unidirectional attention on the generated tokens like the causal decoder architecture. Representative LLMs based on the prefix decoder architecture are ChatGLM (Zeng et al., 2022) and its variants, and researchers have explored the recommendation performance of ChatGLM as one of the benchmarks in related work (Liu et al., 2023d).

$\bullet$ The causal decoder architecture has been widely adopted in various LLMs, and the series of GPT (Brown et al., 2020) and LLaMA (Touvron et al., 2023a) are the most representative models. It uses the unidirectional attention mask, and only decoders are deployed to process both the input and output tokens. Due to the popularity of the causal decoder architecture, most LLM-based recommender systems employ this framework to adapt to different recommendation tasks such as click-through rate predictions, sequential recommendation and conversational recommender systems.
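The difference between the two decoder-only variants comes down to the attention mask. The sketch below builds both masks as plain nested lists, where `mask[i][j] == 1` means that token `i` may attend to token `j`. The function names are hypothetical and the code illustrates the masking rule only, not any specific model implementation.

```python
from typing import List

Mask = List[List[int]]  # mask[i][j] == 1: token i may attend to token j


def causal_mask(n: int) -> Mask:
    """Causal decoder: every token attends only to itself and earlier tokens."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_mask(n: int, prefix_len: int) -> Mask:
    """Prefix decoder: bidirectional attention inside the prefix,
    unidirectional attention over the generated tokens."""
    return [[1 if (j <= i or j < prefix_len) else 0 for j in range(n)]
            for i in range(n)]
```

In a recommendation prompt, the prefix would typically cover the task description, user history and candidate items, so a prefix decoder lets every prompt token see the whole prompt, while a causal decoder lets each prompt token see only what precedes it. An encoder-decoder model instead encodes the prompt with a separate, fully bidirectional encoder.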

4.1.4. Parameter Scale

To meet the diverse needs of different users, LLMs are typically released at several parameter scales. In general, open-source models offer multiple parameter sizes to choose from, and larger parameter sizes generally mean better capabilities (Touvron et al., 2023a; Zhao et al., 2023). Meanwhile, the corresponding time and space complexity also increase. Considering memory and efficiency constraints in experiments, researchers in the field of LLM-based recommender systems generally use LLMs with no more than 10B parameters, while the performance of larger LLMs remains to be further explored in the field of recommender systems (Kang et al., 2023).
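To make the memory side of this trade-off concrete, the sketch below estimates the memory needed just to store the model weights, as the number of parameters times the bytes per parameter. The helper name and the dtype table are illustrative assumptions, and the estimate ignores activations, the key-value cache and optimizer states.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="fp16"):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for size in (7e9, 13e9, 70e9):
        print(f"{size / 1e9:.0f}B parameters in fp16: {weight_memory_gb(size):.0f} GB")
```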

4.1.5. Context Length

Another property closely related to user needs is the length of the input context. A model that cannot handle long inputs cannot base its decisions on the full context, which limits its capabilities (Zhao et al., 2023; Touvron et al., 2023b). When the user input exceeds the length limit, it is truncated, so sufficient context length is crucial for the user experience (Hou et al., 2023; Dai et al., 2023a; Ma et al., 2023). However, different context lengths imply different model architectures and parameters, and expanding the context length of a model often leads to higher time and memory complexity. To address the length limitation of LLMs, existing methods either selectively discard previous contexts using sliding windows (Xiao et al., 2023), or only sample a portion of the context for retrieval augmentation (Lewis et al., 2020; Lin et al., 2023b), or fall back on small models without emergent abilities. Despite these strategies, the limitation of context length has not yet been truly resolved. Considering economic and efficiency issues, existing LLMs, both open-source and closed-source, only provide a limited number of context length options. In this paper, we mainly focus on several classic lengths, i.e., 2K, 4K, 8K, 16K and 32K (K is the abbreviation for one thousand, similarly hereinafter).
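As a rough illustration of the truncation and sliding-window behaviour discussed above, the sketch below keeps only the most recent interactions that fit into a token budget. The whitespace-based `count_tokens` is a crude, hypothetical stand-in for a real tokenizer, and `fit_history` is an illustrative helper, not part of any cited method.

```python
def count_tokens(text):
    """Crude proxy for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fit_history(history, budget, count=count_tokens):
    """Keep the longest suffix of `history` (oldest to newest) whose total
    token count stays within `budget`; older items are dropped first."""
    kept = []
    used = 0
    for title in reversed(history):  # walk from the most recent item backwards
        cost = count(title)
        if used + cost > budget:
            break
        kept.append(title)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = ["Toy Story", "Heat", "The Mask", "Three Colors: White"]
    print(fit_history(history, budget=6))
```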

4.2. Research Questions and Experimental Setup

4.2.1. Research Questions

In this section, we conduct experiments to verify the impact of different factors of LLMs on recommendation results. Specifically, we focus on the following four research questions.

• RQ1: What are the differences between the recommendation ability of LLMs and traditional recommenders?

• RQ2: How do the different attributes of LLMs, including the public availability, model architectures, parameter scales, and context lengths affect the recommendation performance and inference time?

• RQ3: What are the similarities and differences in recommendation results of LLMs with different tuning strategies? Is the LLM after instruction tuning more suitable for recommendation tasks?

• RQ4: What are the limitations of leveraging LLMs as recommender systems?

4.2.2. Evaluated Models

As for the experimental settings, we consider the following baselines and LLMs.

• Random: The Random baseline recommends the $k$ candidate items ($k=20$ in this section) in a random order, which serves as a reference point for the metric values on each dataset.

• Pop: The Pop method always ranks the candidate items by their number of interactions with users in the training set. We consider it a fully-trained method since it uses the statistical information of the datasets.

• BPR (Rendle et al., 2009): BPR is one of the typical traditional models that utilize matrix factorization for recommendation. It is trained in the pair-wise paradigm, i.e., with the BPR loss, without considering temporal information.
• SASRec (Kang and McAuley, 2018): SASRec is a sequential recommendation model built on the classic self-attention network, i.e., the Transformer (Vaswani et al., 2017), and achieves competitive performance among sequential models.
• ChatGPT: ChatGPT is a closed-source large-scale pre-trained language model developed by OpenAI. Note that OpenAI released interfaces of ChatGPT on March 1 and June 13, 2023, respectively. For recency, we adopt the June 13 version, and do the same for GPT-4.

• GPT-4: GPT-4 is the latest generation of closed-source natural language processing models launched by OpenAI. Experiments have shown that GPT-4 is significantly superior to ChatGPT in multiple tasks.
• Flan-T5 (Chung et al., 2022): Flan-T5 is an open-source language model based on the encoder-decoder architecture T5 released by Google (Raffel et al., 2020). Flan-T5 extends T5 with a multi-task fine-tuning paradigm, i.e., instruction tuning, to enhance generalization across different tasks. There are multiple variants of Flan-T5 in terms of parameters, including Flan-T5-Small (80M), Flan-T5-Base (250M), Flan-T5-Large (780M), Flan-T5-XL (3B) and Flan-T5-XXL (11B). Since the first three models are too small to meet the size requirement for LLMs discussed in this paper (at least 1B parameters; B is short for billion, and the same below), we consider Flan-T5-XL and Flan-T5-XXL for comparison.
• ChatGLM (Zeng et al., 2022): ChatGLM is an open-source bilingual dialogue LLM that supports both Chinese and English, based on the General Language Model (GLM) (Zeng et al., 2022) with the prefix decoder architecture. The team released the second version ChatGLM2 and the third version ChatGLM3 in June 2023 and October 2023, respectively.
• LLaMA (Touvron et al., 2023a): LLaMA is an open-source language model introduced by MetaAI, based on the causal decoder architecture and available in four sizes (7B, 13B, 33B and 65B). Due to its outstanding performance and low computational cost, LLaMA has received much attention from researchers, and Vicuna (Chiang et al., 2023) is one of the most popular variants built on LLaMA. To further improve the abilities of LLaMA, MetaAI released LLaMA2 as the next generation of open-source large language models in July 2023. In addition to the regular version, MetaAI also provides the chat version of LLaMA2 (i.e., LLaMA2-chat) (Touvron et al., 2023b), which is specifically tuned for the dialogue scenario by Reinforcement Learning with Human Feedback (RLHF).
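The two simplest baselines in the list above can be sketched in a few lines. The code below is a minimal illustration that assumes the training data is a list of `(user, item)` pairs; the function names and the tie-breaking rule (ties keep the candidate order) are assumptions, not details given in the text.

```python
import random
from collections import Counter


def random_rank(candidates, seed=None):
    """Random baseline: return the candidates in a random order."""
    rng = random.Random(seed)
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    return shuffled


def pop_rank(candidates, train_interactions):
    """Pop baseline: rank candidates by training-set interaction count.
    `train_interactions` is an iterable of (user, item) pairs."""
    counts = Counter(item for _, item in train_interactions)
    # sorted() is stable, so equally popular items keep their input order.
    return sorted(candidates, key=lambda item: -counts[item])


if __name__ == "__main__":
    train = [("u1", "a"), ("u2", "a"), ("u1", "b")]
    print(pop_rank(["c", "b", "a"], train))
```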

Table 4. Overall performance of different models on recommendation. We consider both the fully-trained settings for traditional models and zero-shot settings for LLMs. Note that there are always ground-truth items in the randomly selected 20 candidates, so the ideal recall@20 equals 1. “IT” stands for “Inference Time”, and we record the average inference time for each user measured in seconds (s). “N/A” is the abbreviation for “Not Applicable” since the inference time of closed-source models is unknown.

			MovieLens-1M				Amazon-Books
model	context length	param. size	recall@20	ndcg@1	ndcg@10	IT (s)	recall@20	ndcg@1	ndcg@10	IT (s)
Fully-trained settings for traditional models
Random	-	0	1.0000	0.0300	0.2081	0.01	1.0000	0.0350	0.2628	0.01
Pop	-	1	1.0000	0.1800	0.4841	0.03	1.0000	0.1000	0.2672	0.03
BPR	-	<1M	1.0000	0.2550	0.5743	0.04	1.0000	0.2950	0.6236	0.04
SASRec	-	<1M	1.0000	0.6400	0.7916	1.07	1.0000	0.6800	0.8305	1.49
Zero-shot settings for LLMs
Closed-source LLMs
ChatGPT	4K	-	0.9583	0.1817	0.3985	N/A	0.9850	0.2467	0.4276	N/A
ChatGPT	16K	-	0.9600	0.1500	0.3735	N/A	0.9800	0.2400	0.4032	N/A
GPT-4	8K	-	0.9900	0.3100	0.5828	N/A	1.0000	0.3300	0.5631	N/A
Open-source LLMs with the encoder-decoder architecture
Flan-T5	0.5K	3B	0.0050	0.0000	0.0016	3.51	0.0000	0.0000	0.0000	4.33
Flan-T5	0.5K	11B	0.0050	0.0000	0.0016	5.21	0.0000	0.0000	0.0000	8.28
Open-source LLMs with the prefix decoder architecture
ChatGLM	2K	6B	0.7750	0.0300	0.1945	19.12	0.7000	0.0350	0.2026	19.52
ChatGLM2	32K	6B	0.1900	0.0450	0.0885	11.06	0.1600	0.0250	0.0680	13.62
ChatGLM3	2K	6B	0.6900	0.0950	0.2762	7.95	0.6550	0.0450	0.2273	10.79
ChatGLM3	32K	6B	0.7750	0.0700	0.2579	14.06	0.7050	0.0550	0.2068	16.89
Open-source LLMs with the causal decoder architecture
LLaMA	2K	7B	0.2700	0.0350	0.1068	9.11	0.2650	0.0200	0.0992	9.35
LLaMA	2K	13B	0.2500	0.0300	0.1028	9.51	0.2250	0.0250	0.0867	9.94
LLaMA	2K	33B	0.3900	0.0400	0.1328	92.88	0.2950	0.0350	0.1015	106.61
LLaMA	2K	65B	0.5300	0.0450	0.1913	171.39	0.3750	0.0500	0.1253	182.51
Vicuna	2K	7B	0.2650	0.0500	0.1078	8.17	0.4400	0.0550	0.1559	11.12
Vicuna	2K	13B	0.4100	0.0550	0.1507	9.26	0.4800	0.0700	0.1813	12.72
LLaMA2	4K	7B	0.2350	0.0500	0.0888	9.97	0.5600	0.0650	0.1648	10.01
LLaMA2	4K	13B	0.4500	0.0700	0.1215	15.64	0.5100	0.0750	0.2150	14.44
LLaMA2	4K	70B	0.7600	0.1250	0.2918	24.38	0.6600	0.1150	0.2912	24.63
LLaMA2 (chat)	4K	7B	0.7950	0.0900	0.2744	9.80	0.7050	0.1500	0.3217	9.86
LLaMA2 (chat)	4K	13B	0.8050	0.1650	0.3866	14.80	0.7500	0.1650	0.3411	12.25
LLaMA2 (chat)	4K	70B	0.9550	0.2430	0.4344	23.07	0.8850	0.2230	0.3827	25.01

4.3. Observations and Discussion

4.3.1. LLMs Compared to Traditional Recommenders (RQ1)

We compare traditional collaborative filtering models in fully-trained settings with the cold-start recommendation performance of LLMs in zero-shot settings. In what follows, we introduce the empirical findings on the recommendation effect of different models from three aspects, i.e., recommendation performance, the impact of historical item sequences and inference time.

$\bullet$ Recommendation performance of LLMs. As shown in Table 4, we provide the fully-trained results of four traditional methods, as well as the zero-shot recommendation performance of various LLMs. For traditional recommenders, BPR (Rendle et al., 2009) based on collaborative filtering is significantly better than Pop based on popularity, and the sequential recommendation model SASRec (Kang and McAuley, 2018), which combines the attention mechanism with temporal information, is significantly better than BPR, which is consistent with the results in existing literature (Rendle et al., 2009; Kang and McAuley, 2018; Zhou et al., 2020). It is worth noting that for the 20 candidate items, LLMs that rely on natural language cannot recall all items in most cases, and several weaker models can only output a dozen items, greatly limiting the accuracy of recommendation results. Therefore, recall@20 indicates the ability of LLMs to memorize, re-rank and output candidate items, and several approaches such as the re-generation method (Li et al., 2023i) and probability distribution outputs (Yue et al., 2023) have been proposed to improve the recall performance. For the traditional models, in contrast, all candidate items can be recalled (i.e., recall@20 equals 1). Because the candidate items are not recalled completely, the recommendation effect of a few LLMs (e.g., Flan-T5 and ChatGLM) is not even as good as the random baseline. For most LLMs, the zero-shot recommendation performance is not as good as the baseline method Pop based on popularity of interactions in the dataset (Liu et al., 2023c; Hou et al., 2023; Dai et al., 2023a). However, powerful LLMs like ChatGPT and LLaMA2-70B (chat) (Touvron et al., 2023b) can achieve better results than Pop in zero-shot settings. Furthermore, GPT-4 can even perform better than the fully-trained matrix factorization model BPR on two datasets, indicating the potential of LLMs to serve as the backbone of recommender systems. In addition, the significant differences between the results of LLMs demonstrate the importance of selecting appropriate LLMs for downstream recommendation tasks (Kang et al., 2023; Liu et al., 2023d, c).
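To see why an incomplete output list lowers recall@20 and NDCG, the sketch below computes both metrics for one user. It assumes a single ground-truth item per user, so the ideal DCG is 1 and NDCG reduces to $1/\log_2(\text{position}+1)$ with a 1-indexed position; this single-target assumption and the function names are illustrative and not stated in the text.

```python
import math


def recall_at_k(ranked, target, k):
    """1.0 if the ground-truth item appears in the top-k output, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k):
    """NDCG@k with one relevant item: 1 / log2(position + 1), 1-indexed."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    position = top.index(target) + 1
    return 1.0 / math.log2(position + 1)


if __name__ == "__main__":
    # An LLM that outputs only part of the candidate list can miss the target.
    llm_output = ["Heat", "Toy Story"]
    print(recall_at_k(llm_output, "The Mask", 20), ndcg_at_k(llm_output, "Toy Story", 10))
```

A traditional model that scores every candidate always returns the full list, which is why its recall@20 is 1 in this setting.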

$\bullet$ The impact of historical item sequences. As for ranking tasks in Fig. 2(a), the recently interacted historical items are used as the user representations. However, there is no standard value for the number of items that represent users. To analyze the impact of historical item sequences on the recommendation performance, we conduct experiments to explore the recommendation effect of the sequential recommender SASRec (Kang and McAuley, 2018) and the closed-source LLM ChatGPT with different numbers of historical interactions. For the fully-trained SASRec, the maximum length of the historical item sequence affects the model framework and prediction results (Kang and McAuley, 2018; Hou et al., 2022). To ensure model uniformity, we fix the model checkpoint with the historical item sequence of 50 as the “SASRec (fixed)” for comparison, and evaluate the recommendation performance (NDCG@10 (Järvelin and Kekäläinen, 2002)) with the number of historical items at 1, 5, 10, 20, 30, 40 and 50, respectively. For ChatGPT and GPT-4, we obtain zero-shot results with different numbers of items to verify whether the powerful LLMs can deal with the long context for recommendation. As illustrated in Fig. 3, with the increasing number of historical items, the results of SASRec improve steadily, while the recommendation performance of LLMs changes little. Consistent conclusions on two datasets can be drawn that even if LLMs can accept more historical items for user representations, increasing the number of historical items does not bring significant gains in the recommendation performance. The performance trends of LLMs show that the longer historical item sequence is not fully utilized by the language model, indicating the importance of selecting appropriate item sequences to represent users, and the inadequacy of LLMs for user interest mining. To improve the mining of user interest for LLMs, approaches such as retrieval augmentation (Lin et al., 2023b) and prompting strategies (Wang et al., 2023c; Yao et al., 2023) can be used, which will be analyzed in Section 5.
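The history-length experiment above amounts to changing how many recent interactions are written into the prompt. A minimal sketch follows; the prompt wording and the function `build_ranking_prompt` are hypothetical and do not reproduce the exact prompt used in the experiments.

```python
def build_ranking_prompt(history, candidates, num_history):
    """Build a zero-shot ranking prompt from the most recent interactions."""
    # Guard against num_history == 0: history[-0:] would return the full list.
    recent = history[-num_history:] if num_history > 0 else []
    watched = ", ".join(recent) if recent else "(no history)"
    listed = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(candidates))
    return (
        f"I have watched the following movies in order: {watched}.\n"
        f"Rank these {len(candidates)} candidate movies by how likely "
        f"I am to watch them next:\n{listed}"
    )


if __name__ == "__main__":
    history = ["Heat", "Toy Story", "The Mask"]
    for n in (1, 3):
        print(build_ranking_prompt(history, ["Dick", "Showgirls"], n))
        print("---")
```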

$\bullet$ Inference time of LLMs. In the actual deployment of recommender algorithms, inference efficiency is a decisive factor for industrial applications (Sun et al., 2024; Wang et al., 2023b). In general, the inference time of models is closely related to the number of parameters. As shown in the IT columns of Table 4, for the lightweight traditional recommenders, the inference time of SASRec is about 1 second for each user. For LLMs, the inference time of closed-source models cannot be measured accurately due to limitations of the API, while most open-source models take nearly 10 seconds or more for one prediction, leading to an unacceptable time delay in practical applications.

4.3.2. LLMs on Recommendations w.r.t. Four Aspects (RQ2)

For different LLMs, differences in public availability and model architecture will lead to different recommendation scenarios, results and inference time (Kang et al., 2023; Liu et al., 2023d). For the same LLM, the parameter scale and context length also affect the efficiency and effectiveness of language models (Zhao et al., 2023). Therefore, we explore the impact of different LLMs on recommendations from four aspects, namely public availability, model architecture, parameter scale and context length as follows.

$\bullet$ Public availability. As shown in Table 4, closed-source models achieve significantly better results than the open-source models in the cold-start scenario, but they cannot outperform fully-trained sequential models. In terms of LLMs, ChatGPT with zero-shot settings has recommendation performance comparable to the fully-trained Pop, especially on the sparse Amazon-Books dataset, indicating the fundamental ability of LLMs on recommendation tasks. Furthermore, the upgraded GPT-4 exceeds ChatGPT by a large margin due to its strong zero-shot generalization ability. The superior zero-shot performance of GPT sheds light on leveraging LLMs for recommendation. However, the open-source models always get poor results compared to GPT-4 in the zero-shot settings, although LLaMA2-chat-70B has recommendation performance comparable to ChatGPT. The reason is that open-source models lack comprehensive cold-start capabilities, and their strength lies in the ability to integrate domain knowledge through strategies such as prompt tuning. In line with previous studies on LLMs (Hou et al., 2023; Kang et al., 2023; Ma et al., 2023; Liu et al., 2023c), employing a closed-source model in cold-start scenarios yields better results, while an open-source model is more flexible and easier to use when tuning is needed (Bao et al., 2023b; Zheng et al., 2023; Li et al., 2023b; Luo et al., 2023b).

$\bullet$ Model architecture. For the model architecture, Flan-T5 (Chung et al., 2022) based on the encoder-decoder architecture has almost no ability to recommend items in the cold-start setting, as its training corpus does not involve specialized instructions for our task. When trained with recommendation prompts, however, the encoder-decoder architecture is suitable for prompt tuning and instruction tuning (Zhang et al., 2023f; Geng et al., 2022; Chu et al., 2023). Similarly, the first and second versions of ChatGLM (Zeng et al., 2022) based on the prefix decoder perform poorly on the zero-shot ranking task, and are not as good as Vicuna and LLaMA2 based on the causal decoder. However, the third version of ChatGLM, i.e., ChatGLM3, has recommendation performance comparable to Vicuna (Chiang et al., 2023) and LLaMA2 (Touvron et al., 2023b), which further indicates the importance of selecting an advanced foundation model. In terms of the series of LLaMA models (Touvron et al., 2023a), LLaMA2 is better than Vicuna, and Vicuna is better than LLaMA, which is related to their training data and release time. Furthermore, the chat version of LLaMA2, i.e., LLaMA2-chat, is a series that uses conversational dialog instructions to fine-tune LLaMA2 (Touvron et al., 2023b), which is more suitable for our tasks in ranking settings. Therefore, the results of LLaMA2-chat are significantly improved compared with LLaMA2, and LLaMA2-70B-chat even achieves better performance than the closed-source model ChatGPT on MovieLens-1M. Generally speaking, researchers prefer to study recommendation tasks based on the causal decoder framework such as LLaMA, and the second version of LLaMA has a better generalization ability than the first version in recommendation tasks.

$\bullet$ Parameter scale. It is widely recognized that the larger the parameter size, the more powerful the LLM (Zhao et al., 2023; Kang et al., 2023; Hou et al., 2023), and the same applies in the field of recommender systems. To compare the effect and efficiency of LLMs w.r.t. the parameter scale, we compare the recommendation performance and inference time of LLaMA (Touvron et al., 2023a), Vicuna (Chiang et al., 2023), LLaMA2 (Touvron et al., 2023b), and LLaMA2-chat at different parameter scales in Fig. 4. As the parameter scale grows, both the recommendation performance and the inference time of LLMs generally increase, and the conclusions are consistent on both datasets (Fig. 4(a) and Fig. 4(b)). Therefore, it is necessary to consider the trade-off between performance and efficiency when choosing the parameter scale. Moreover, the performance improvement with increasing scale is more significant for LLaMA2 than for LLaMA, indicating that the scale effect of LLMs depends on the capabilities of the base model.

$\bullet$ Context length. Different LLMs have different maximum input limitations (Zhao et al., 2023; Touvron et al., 2023a, b), and a longer context input means LLMs can accommodate more historical items for recommendation. However, it remains to be explored whether the maximum input length of LLMs affects the recommendation results when the length limitation is not exceeded. Therefore, we conduct experiments to investigate the differences in the recommendation performance between the two length versions of ChatGPT (4K and 16K) and ChatGLM3 (2K and 32K). As shown in Fig. 5, expanding the length limitation of LLMs does not necessarily lead to better recommendation performance; instead, there is a slight decrease in NDCG@10. Furthermore, when the maximum input of LLMs remains unchanged, increasing the historical input of users results in insignificant gains as shown in Fig. 3. Therefore, the key to the recommendation problem is to enable LLMs to effectively utilize the information within the limited context (Lin et al., 2023b; Yao et al., 2023), and selecting a suitable context length for LLMs as recommender systems deserves careful consideration.

Table 5. Overall performance of LLMs on CTR predictions. There are three settings, i.e., zero-shot setting without fine-tuning, parameter-efficient fine-tuning (PEFT) setting with a few parameters tuned, and fine-tuning (FT) setting with all parameters tuned.

dataset	ml-1m			Amazon-Books
model	zero-shot	PEFT	FT	zero-shot	PEFT	FT
LLaMA-7B	0.4683	0.5479	0.6658	0.6488	0.8262	0.8469
Alpaca-LoRA-7B	0.5264	0.5767	{\ul 0.6702}	0.6558	0.8440	0.8533
LLaMA2-7B	0.5284	{\ul 0.6133}	0.6457	{\ul 0.6754}	{\ul 0.8550}	{\ul 0.8542}
LLaMA2-chat-7B	{\ul 0.5255}	0.6275	0.6731	0.7174	0.8560	0.8660

4.3.3. Comparisons of Tuning Strategies for LLMs (RQ3)

Because LLMs are not customized to recommender systems during the training process, it is insufficient to only consider the zero-shot recommendation performance in cold-start scenarios (Haleem et al., 2022; Kang et al., 2023; Liu et al., 2023d; Luo et al., 2023b). To explore the impact of different training strategies of LLMs on recommendations, we compare the click-through rate prediction performance of four LLaMA-based LLMs on two datasets. Specifically, we consider three training settings of LLMs, i.e., the zero-shot setting without fine-tuning, the Parameter-Efficient Fine-Tuning (PEFT) setting with a few parameters tuned (we use LoRA (Hu et al., 2021) here), and the Fine-Tuning (FT) setting with all parameters tuned, and summarize empirical conclusions from three aspects: overall performance of different settings, the impact of instruction tuning, and the impact of training data.

$\bullet$ Overall performance of different settings. As shown in Table 5, the results of fine-tuning LLMs (PEFT and FT) on only 256 samples are significantly better than the zero-shot performance in cold-start scenarios, and empirical findings are consistent across four LLMs (LLaMA, Alpaca-LoRA, LLaMA2, LLaMA2-chat) on both datasets. Furthermore, considering the two kinds of fine-tuning strategies, the performance of the fine-tuning setting is even better than that of the PEFT setting since more parameters are tuned (Hu et al., 2021). In addition to recommendation effects, training efficiency also deserves attention. Therefore, we compare the training time of the two fine-tuning strategies on two datasets. As shown in Fig. 6, the time for parameter-efficient fine-tuning is significantly less than that for fine-tuning all parameters, which is in agreement with the existing literature (Hu et al., 2021; Zhao et al., 2023). Therefore, the balance between efficiency and effectiveness needs to be further considered.
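The efficiency gap between PEFT and FT follows from how few parameters LoRA trains. LoRA (Hu et al., 2021) freezes a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ and learns a low-rank update $BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, so only $r(d+k)$ parameters are trained instead of $dk$. The sketch below works through this arithmetic; the matrix size 4096 and rank 8 are illustrative assumptions, not the settings used in the experiments.

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d, r = 4096, 8  # illustrative sizes only
    full = full_params(d, d)
    lora = lora_trainable_params(d, d, r)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```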

$\bullet$ The impact of instruction tuning. Before fine-tuning LLMs using the recommendation data, we also examine whether instruction tuning of LLMs on general data can further enhance the subsequent fine-tuning on recommendations. Therefore, we select two pairs of LLMs as controls, namely LLaMA and Alpaca-LoRA, as well as LLaMA2 and LLaMA2-chat. The Alpaca-LoRA model is the tuned version of LLaMA (Touvron et al., 2023a) using the LoRA strategy (Hu et al., 2021) on the Alpaca dataset (Taori et al., 2023). The LLaMA2-chat model is likewise the updated version of LLaMA2 using the RLHF strategy (Touvron et al., 2023b) on the conversational dataset. As illustrated in Table 5 and Fig. 7, the recommendation results of Alpaca-LoRA are better than those of LLaMA, and the performance of LLaMA2-chat outperforms that of LLaMA2, indicating the effectiveness of performing fine-tuning on general instructions. That is to say, enhancing the general knowledge of language models can also improve the domain capabilities of LLMs on specified recommendation tasks (Bao et al., 2023b).

$\bullet$ The impact of training data. For few-shot fine-tuning strategies of LLMs, we conduct experiments to explore the impact of the number of training samples on the recommendation performance. As shown in Fig. 7, we visualize the fine-tuning results corresponding to different numbers of training samples with the LoRA strategy, and also consider the traditional recommender SASRec (Kang and McAuley, 2018). Even with little sampled data, LLMs such as LLaMA-7B can quickly adapt to the task of CTR prediction, achieving noticeable improvements as the number of training samples increases. With only a small number of training samples, we can elicit appreciable recommendation ability from LLMs, reflecting their remarkable emergent ability. However, when the training samples increase from 0 to 256, the performance of the traditional recommendation model SASRec remains at 0.5. Experimental results show that LLMs have advantages over conventional recommenders in few-shot learning and adaptation (Bao et al., 2023b). In addition, the comparison without sampled data in Fig. 7 also highlights the zero-shot cold-start ability of large language models.

4.3.4. Case Study of Limitations (RQ4)

Despite the powerful capabilities of LLMs, there are also some limitations in using LLMs as recommender systems. To illustrate these limitations, we choose two recommendation cases of the powerful closed-source model GPT-4 in Fig. 8 to analyze the possible failure scenarios of LLMs.

$\bullet$ Position bias. On the one hand, the generation results of LLMs have randomness, leading to instability in recommendation results, and the position bias is a typical manifestation. As shown in the first case of Fig. 8, the LLM places the ground-truth movie “The Mask” at the bottom of the ranking results, because “The Mask” is at the end of the candidate item sequence. When the order of candidate items is adjusted, the recommendation results also change, indicating the position bias during recommendations. Existing literature has also found the position bias in LLM-based recommendations (Hou et al., 2023; Ma et al., 2023; Zhang et al., 2023f), and methods such as bootstrapping (Hou et al., 2023) and a Bayesian probability framework (Ma et al., 2023) have been proposed to calibrate unstable results. However, the instability of LLMs is still an unresolved issue.

$\bullet$ Lack of domain knowledge. On the other hand, the lack of domain knowledge in recommendations can lead to misunderstandings by LLMs. From the right example in Fig. 8, it can be seen that the LLM places the second movie in the candidate list, i.e., “Three Colors: White”, at the first position in the re-ranking result, which again indicates the position bias. Furthermore, the LLM places the ground-truth item “Dick” at the bottom of the ranking list. A possible reason is that, relying solely on the movie title “Dick”, LLMs cannot infer that the movie is a comedy film about two teenage girls, which is more similar to “Showgirls”, a movie recently watched by the user. This case demonstrates that LLMs may make inappropriate decisions due to a lack of domain knowledge in the field of recommender systems.

In addition to the two cases in Fig. 8, LLMs are also limited by several factors such as the inference time (Bao et al., 2023a), context length (Lin et al., 2023b), and memory cost (Kang et al., 2023). Therefore, exploring the practical application of LLMs through methods such as knowledge distillation (Sun et al., 2024; Tian et al., 2023) has both academic and industrial values.
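For the position bias discussed above, one simple reading of the bootstrapping idea is to shuffle the candidate order several times, query the ranker once per shuffle, and aggregate the resulting positions. The sketch below is only an illustration under that assumption and is not necessarily the procedure of the cited work; `rank_fn` is a hypothetical stand-in for an LLM call that takes an ordered candidate list and returns a ranked list, and candidates are assumed to be unique.

```python
import random


def bootstrap_rank(candidates, rank_fn, rounds=3, seed=0):
    """Aggregate rankings over several random candidate orders.

    Items that `rank_fn` omits from its output receive the worst position,
    len(candidates), for that round."""
    rng = random.Random(seed)
    total = {item: 0 for item in candidates}
    for _ in range(rounds):
        shuffled = list(candidates)
        rng.shuffle(shuffled)
        ranked = rank_fn(shuffled)
        position = {item: i for i, item in enumerate(ranked)}
        for item in candidates:
            total[item] += position.get(item, len(candidates))
    # A lower summed position means the item was ranked higher on average.
    return sorted(candidates, key=lambda item: total[item])


if __name__ == "__main__":
    # A fully position-biased "ranker" that simply echoes its input order.
    echo = lambda items: list(items)
    print(bootstrap_rank(["The Mask", "Heat", "Toy Story"], echo, rounds=5))
```

The extra rounds multiply the number of model calls, so this kind of calibration adds to the inference cost noted above.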

5. PROMPT ENGINEERING SPECIALIZED FOR RECOMMENDER SYSTEMS

The prompt is an important medium for interactions between humans and language models, and a well-designed prompt can better stimulate the powerful capabilities of LLMs (Liu et al., 2023f; Le Scao and Rush, 2021). Improving the performance of artificial intelligence by designing and refining prompts is known as prompt engineering. When prompting LLMs as recommender systems, although the specific prompts in different studies may not be the same, the format is largely identical with only minor differences (Hou et al., 2023; Gao et al., 2023; Dai et al., 2023a; Wang et al., 2023c). In our work, we concentrate on the general framework of prompt engineering specialized for recommender systems, and summarize four key elements of prompt formats. Firstly, a suitable task description is the primary condition for constructing prompts (Geng et al., 2022; Wang et al., 2023c). Secondly, the characteristic of recommendation tasks lies in the mining and utilization of personalized user interest, and user interest modeling is the essential aspect to reflect the true intentions and preferences of users (Xi et al., 2023; Yao et al., 2023). Thirdly, the purpose of recommender systems is to provide users with appropriate items, and the selection, matching, or generation of candidate items, i.e., candidate item construction, is a worthwhile issue to consider (Lin et al., 2023b). Fourthly, prompting strategies are important for eliciting the planning and reasoning abilities of LLMs, which are the key to distinguishing LLMs from other language models (Zhao et al., 2023). As shown in Fig. 2(a), we refer to prompts in related work (Hou et al., 2023; Zhao et al., 2023) as the basic prompt, and further analyze the effects of task description, user interest modeling, candidate items construction and prompting strategies in the experimental part.

5.1. Task Description

The task description for ProLLM4Rec is a key part of prompts that helps LLMs understand the goal of specific recommendation assignments. Typically, based on the different recommendation objectives, existing recommendation tasks can be divided into the following categories: point-wise optimization for a single item (e.g., binary cross entropy loss (Rendle, 2010)), pair-wise optimization for positive and negative items (e.g., Bayesian personalized ranking loss (Rendle et al., 2009)), and list-wise optimization for multiple items (e.g., variational auto-encoders (Liang et al., 2018)). For ProLLM4Rec, we can adapt our work to multiple tasks by changing the description of prompts and the form of candidate items. That is to say, our framework can not only be used for re-ranking (Hou et al., 2023) and retrieving (Li et al., 2023b) tasks, but also support task forms such as point-wise prediction (Kang et al., 2023), pair-wise comparison (Dai et al., 2023a) and list-wise re-ranking formats.

$\bullet$ Point-wise recommendation (Cheng et al., 2016) regards the recommendation task as a binary classification problem, such as the rating scoring task (Geng et al., 2022; Chu et al., 2023) and click-through rate (CTR) predictions (Kang et al., 2023; Bao et al., 2023b). It generates a score or likelihood of preference for a given user-item pair with feature engineering. The recommender system ranks items solely based on the individual characteristics or historical interactions of users without considering the comparison and mutual influence between items in the input ranking list. When utilizing LLMs for point-wise recommendation, the description mainly involves the user-item information with the range of score or answer list to limit output formats of LLMs.

$\bullet$ Pair-wise recommendation (Rendle et al., 2009) involves comparing two items in pairs to determine the relative preference for a particular user. Instead of focusing on individual items, it evaluates pairs of items and calculates the semantic distance between them. This method often creates item pairs, calculates relative scores, and then ranks or recommends items based on pair-wise comparisons. The description of pair-wise recommendation in prompts consists of a positive item and a negative item, instructing LLMs to give the answer from a binary choice list (Dai et al., 2023a).

$\bullet$ List-wise recommendation (Liang et al., 2018) involves optimizing an entire list of recommended items for a user. Instead of considering items individually or in pairs, this method treats the entire recommendation list as a single entity and captures the interior correlations within the list. It aims to create a ranked list of items that collectively maximizes user satisfaction. For LLM-based list-wise recommendations, the prompt contains a list of items and corresponding instructions, eliciting the ability of LLMs to explore the potential relationship within the item sequence (Hou et al., 2023; Ma et al., 2023).

$\bullet$ Matching (Bao et al., 2023a; Li et al., 2023b) in recommendation involves obtaining a small subset of candidates from the entire item pool for personalized users, and candidate items are not provided in prompts at this stage. Due to the lack of domain knowledge in recommender systems, leveraging LLMs for item retrieval requires additional strategies such as model training, item indexing and grounding approaches (Zheng et al., 2023). The prompt for matching tasks includes the description of a user in different forms, e.g., textual profiles and user-item interaction histories, in order to quickly match suitable items.

$\bullet$ Ranking (Luo et al., 2023b) in recommendation involves sorting or ordering items based on the predicted relevance or probability of interest to the target user. It aims to present the most relevant items at the top of the recommendation list. By analyzing historical user-item interactions, user preferences, item features, and the given candidate item list, LLMs are required to estimate the relevance of items and determine their positions in the recommendation list (Li et al., 2023k; Luo et al., 2023b; Gao et al., 2023).
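To make the task descriptions above concrete, the following minimal sketch builds point-wise, pair-wise and list-wise prompts from item titles. The function names and the exact wording of the templates are illustrative assumptions, not the templates used in our experiments.

```python
from typing import List


def pointwise_prompt(history: List[str], item: str) -> str:
    """Ask for a judgment on a single user-item pair with a restricted answer list."""
    return (
        f"The user has interacted with: {', '.join(history)}.\n"
        f"Will the user like '{item}'? Answer with Yes or No."
    )


def pairwise_prompt(history: List[str], item_a: str, item_b: str) -> str:
    """Ask for a binary choice between two items."""
    return (
        f"The user has interacted with: {', '.join(history)}.\n"
        f"Which item would the user prefer?\nA. {item_a}\nB. {item_b}\n"
        "Answer with A or B."
    )


def listwise_prompt(history: List[str], candidates: List[str]) -> str:
    """Ask for a ranking over the whole candidate list."""
    numbered = "\n".join(
        f"{index}. {title}" for index, title in enumerate(candidates, start=1)
    )
    return (
        f"The user has interacted with: {', '.join(history)}.\n"
        "Rank all candidate items from most to least likely to be chosen next:\n"
        f"{numbered}"
    )
```

The three functions differ only in how many items appear in the prompt and in how the output format is constrained, which mirrors the distinction drawn in the list above.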

5.2. User Interest Modeling

Compared to general tasks solved by LLMs, the distinguishing characteristic of recommendation tasks lies in mining and utilizing the personalized interest of users (Yao et al., 2023; Shu et al., 2023). Users are influenced by multiple factors when making decisions on recommended items, including but not limited to their long-term preferences, short-term intentions, popular market trends, and occasional environmental biases (Cheng et al., 2016; He et al., 2017). Among them, the most essential aspects are the long-term preferences and short-term intentions of users, and capturing them is the purpose of recommender systems (He et al., 2020; Zhou et al., 2020; Hou et al., 2022). In this section, we first classify the user interest type (Section 5.2.1), outline the user representation forms (Section 5.2.2), and summarize existing modeling methods (Section 5.2.3). Then, we discuss approaches to modeling short-term (Section 5.2.4) and long-term interest (Section 5.2.5) of users, and conduct experiments to provide empirical analysis and key findings.

5.2.1. User Interest Type

In the field of recommender systems, there are multiple ways to classify the user interest (Rendle et al., 2009; Guo et al., 2017; Zhou et al., 2020). For example, based on the type of feedback between users and the platform, the user interest can be divided into the explicit interest (e.g., ratings and reviews) and implicit interest (e.g., clicks). Although the explicit feedback can reflect the true intentions of users, the sparse and expensive data limits its application scenarios. In general, we mainly explore user interest modeling in LLMs under implicit feedback scenarios. Another classification of the user interest is based on the time duration and stability, i.e., short-term intentions, long-term preferences, and hybrid interest as follows:

$\bullet$ Short-term intentions refer to the sudden and accidental intentions or tendencies of personalized users in recent interactions, which are prone to change and can be influenced by environmental factors in recommender systems (Lin et al., 2023b; Hou et al., 2023). That is to say, short-term intentions are recent, temporary, and variable (Zhang et al., 2023f).

$\bullet$ Long-term preferences refer to the stable preferences of users towards certain content, themes, and elements, which are not easily changed in the short term and will continue to affect recommendation decisions of users (Shu et al., 2023; Wang et al., 2023c). In contrast to short-term intentions, long-term preferences are persistent, sustained and stable.

$\bullet$ Hybrid interest refers to the combination of long-term preferences and short-term intentions (Friedman et al., 2023; Liu et al., 2023e; Zhiyuli et al., 2023). On the one hand, short-term intentions are dominated by long-term preferences (Yue et al., 2023). On the other hand, long-term preferences also consist of the short-term interest across multiple time periods (Li et al., 2023j; Qiu et al., 2023).

5.2.2. User Representation Forms

In ProLLM4Rec, user interest modeling should consider both the input form for the prompt template and the specific storage form for the interest memory (Wang et al., 2023c). Generally, there are three types of input contents for interest modeling in LLMs: historical item lists, interest descriptions and user embeddings. Corresponding to different representation forms of the user interest, there are also various storage forms in the interest memory.

$\bullet$ Historical item lists refer to representing personalized users based on the sequence of historical interacted items, which is widely used in session-based recommendation and sequential recommendation (Hou et al., 2022; Zhou et al., 2020; Kang and McAuley, 2018). For token-based item indexing, the ID sequence of items is used for user modeling in LLMs, similar to traditional sequential recommendation models (Geng et al., 2022). However, without fine-tuning, LLMs cannot recognize the meaning of corresponding IDs. For description-based item indexing, existing studies generally concatenate the attributes (e.g., titles) of items in temporal order, and then input the item sequence as text descriptions to LLMs as user representations (Li et al., 2023d; Zhiyuli et al., 2023; Hou et al., 2023). Only the IDs of the items that the user has interacted with need to be stored as the interest memory. The list of items can be stored in the form of numpy or tensor arrays. Despite their simplicity and effectiveness, text attributes such as titles can be ambiguous and cannot fully represent items (Li et al., 2023d). At the same time, item sequences inevitably contain noisy information (Wei et al., 2023), and limited sequence lengths make it difficult to accurately express the user interest based on pure lists.

$\bullet$ Interest descriptions refer to representing users based on textual descriptions, which is more applicable to the text input of LLMs (Yao et al., 2023). That is to say, we can use natural language to model the user interest in the form of text and input it into LLMs (Xi et al., 2023). To facilitate the storage and retrieval of textual descriptions, vector stores are commonly used (Touvron et al., 2023b). In LLMs, vector stores in the form of key-value pairs can connect implicit vector embeddings with explicit textual descriptions, improving the retrieval efficiency and comprehension ability (Li et al., 2023b; Petrov and Macdonald, 2023). However, the key challenge lies in how to mine and describe the user interest, which deserves further research and discussion.

$\bullet$ User embeddings refer to concatenating embeddings of users to the input of LLMs as user representations (Zhang et al., 2023b), which is often used for efficient fine-tuning of language models. In this case, we assign each user a unique ID and corresponding embedding, and add the vector representations of users to the input of LLMs for fine-tuning. At the same time, user embeddings can also be trained from small-scale traditional recommendation models (Zhang et al., 2023d), thereby incorporating collaborative filtering features from other users. However, it is worth noting that vector embeddings of user representations are only suitable for scenarios where the domain knowledge can be injected into LLMs, and the black-box property of embeddings also poses challenges to the explainability and understanding of language models.
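As an illustration of the historical item list form with description-based indexing, the sketch below sorts interactions by time, keeps the most recent titles, and joins them into one text string for the prompt. The function name `history_to_text`, the `(timestamp, title)` input format and the `"; "` separator are assumptions made for this example.

```python
from typing import List, Tuple


def history_to_text(interactions: List[Tuple[int, str]], max_items: int = 10) -> str:
    """Turn (timestamp, title) pairs into a temporally ordered text representation.

    Only the `max_items` most recent titles are kept, reflecting the limited
    sequence length that can be placed in a prompt.
    """
    if max_items <= 0:
        return ""
    # Sort from oldest to newest so the text preserves the temporal order.
    ordered = sorted(interactions, key=lambda pair: pair[0])
    recent = ordered[-max_items:]
    return "; ".join(title for _, title in recent)
```

For example, three interactions truncated to two items keep only the two latest titles, with the most recent one last.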

5.2.3. Modeling Methods

In line with existing literature (Zhao et al., 2023; Wang et al., 2023g), we classify the methods for modeling interest in LLMs into three categories, i.e., memory-based methods, retrieval-based methods and generation-based methods.

$\bullet$ Memory-based methods assign external memory to users for storing interest-related historical information (Xi et al., 2023; Shu et al., 2023). As for user representations, the encoded interest of users can be obtained from the memory (Zhang et al., 2023d), and LLMs are instructed to utilize the memorized interest with carefully designed prompts. The iterative updating of memories is therefore the key to memory-based methods. As for personalized memory, there are three important operations (Wang et al., 2023g): (1) Memory reading obtains contents of the interest memory based on the user identifier. (2) Memory writing adds new interest of users based on the latest interactions between users and items. (3) Memory reflection periodically examines the memory and updates existing issues in it. Strategies for memory reflection include but are not limited to self-summarization, self-correction, and reflection based on user feedback (Wang et al., 2023a). It is memory reflection that makes memory not just a stack of historical records but a summary of user interest (Yao et al., 2023). Note that memory-based methods specifically refer to obtaining contents from the memory without further processing procedures.

$\bullet$ Retrieval-based methods add a module for personalized interest retrieval on top of the memory-based methods, utilizing the personalized query and retrieval strategies to obtain user interest from the personalized memory (Wang et al., 2023c; Lin et al., 2023b; Huang et al., 2023). As for the personalized query, there are generally three ways to form a query for retrieval: (1) candidate items as search queries (Lin et al., 2023b) for relevant items, (2) recently interacted items as search queries for similar items (Zhang et al., 2023c), and (3) the user profile as search queries for personalized items (Wang et al., 2023c). For the criteria of retrieval, there are generally multiple trade-offs, including dimensions such as relevance, recency, and diversity with respect to the user interest (Wang et al., 2023g).

$\bullet$ Generation-based methods utilize the understanding and generation capabilities of LLMs to summarize, infer, and derive comprehensive user interest (Yao et al., 2023; Wang and Lim, 2023; Luo et al., 2023b). Generally, a combination of memory-based and retrieval-based methods is required for generation-based methods. For generative language models, the user interest induced by generative retrieval can further stimulate the prompting ability of LLMs (Liu et al., 2023a; Zhang et al., 2023e). The generative approach can also be used for data augmentation, leveraging the general knowledge of LLMs to augment textual descriptions of the user interest. In current recommendations, LLMs mainly utilize the retrieved relevant items to generate summarized interest descriptions, which can also serve as a strategy for the memory reflection (Wang et al., 2023c; Shu et al., 2023).
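The three memory operations described above (reading, writing and reflection) can be sketched as a small class. This is a minimal illustration under stated assumptions: the class name `InterestMemory` is hypothetical, and the `summarize` callable stands in for an LLM call that would produce the reflected summary.

```python
from typing import Callable, Dict, List


class InterestMemory:
    """Per-user interest memory with read, write and reflect operations."""

    def __init__(self, summarize: Callable[[List[str]], str]) -> None:
        self._records: Dict[str, List[str]] = {}
        self._summaries: Dict[str, str] = {}
        # Placeholder for an LLM-based summarizer (an assumption of this sketch).
        self._summarize = summarize

    def write(self, user_id: str, record: str) -> None:
        """Memory writing: append the latest interaction record of a user."""
        self._records.setdefault(user_id, []).append(record)

    def read(self, user_id: str) -> List[str]:
        """Memory reading: fetch stored records by user identifier."""
        return list(self._records.get(user_id, []))

    def reflect(self, user_id: str) -> str:
        """Memory reflection: condense the stored records into a summary."""
        summary = self._summarize(self.read(user_id))
        self._summaries[user_id] = summary
        return summary
```

A retrieval-based method would add a query step on top of `read`, and a generation-based method would replace the placeholder summarizer with an actual LLM prompt.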

5.2.4. Research Questions and Observations for Modeling Short-term Intentions

As discussed in Section 5.2.1, the user interest modeling in prompts for LLMs can be divided into: (1) short-term intentions, and (2) long-term preferences. In this section, we explore methods for modeling short-term intentions w.r.t. the following two research questions.

• RQ1: How to select recent and relevant items for modeling short-term intentions?

• RQ2: Do different modeling forms of short-term intentions yield similar recommendation performance?

To analyze how to select recent and relevant items as the short-term intentions, we adopt the retrieval-based strategy to retrieve items from the long-term interest. Specifically, following the memory-based and retrieval-based strategy, the personalized query, e.g., candidate items, recently interacted items or the user profile, is encoded into vectors in a unified form. The most relevant items are retrieved from memory based on semantic similarity. Meanwhile, we set weights based on the recency of interactions so that the recent items have a higher probability of being retrieved. Finally, the corresponding retrieved text is obtained based on key-value pairs in the memory. Here, we explore the design of the personalized query and memory of users for item retrieval, and study the influence of different variants as follows.
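The retrieval procedure above can be sketched as follows. The text only states that recent items receive a higher probability of being retrieved; the multiplicative exponential decay used here, the `decay` value and the function names are assumptions of this example rather than the exact weighting in our experiments.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],
    top_k: int = 10,
    decay: float = 0.9,
) -> List[str]:
    """Return the texts of the top-k memory entries.

    `memory` holds (vector, text) pairs ordered from oldest to newest.
    Each score is the semantic similarity scaled by a recency weight.
    """
    n = len(memory)
    scored = []
    for position, (vector, text) in enumerate(memory):
        age = n - 1 - position  # 0 for the most recent entry
        scored.append((cosine(query_vec, vector) * decay ** age, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

With two equally similar entries, the more recent one is ranked first because its recency weight is larger.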

• Variants of the personalized query: as for the personalized query, we utilize the interest of users for retrieving items, and design the following two variants to compare strategies for constructing the query.
  – Short-term interest: we utilize the short-term interest for the memory retrieval of the long-term interest. The summarized text based on 10 recently interacted items is employed as the query for retrieving items.
  – Long- and short-term interest: we first summarize the personalized profile of each user based on all items he or she has interacted with, then concatenate the user profile and the summarized text based on 10 recently interacted items for the memory retrieval of the long-term interest.

• Variants of the memory: as for the personalized memory, we consider two variants of the memory contents.
  – Global memory: we utilize descriptions of items in the datasets as the memory. Since all users in a dataset share the same item space, the memory w.r.t. item descriptions is not the personalized memory but global memory, indicating that different users store the same content for the same item.
  – Personalized memory: for each item that a user interacts with, we instruct LLMs to generate personalized descriptions based on the rating score or comment text as personalized memories. Therefore, the description of the same item varies among personalized users, which is different from the global memory.

When the items used for recent interest modeling are fixed, different representation forms of the items may also affect the modeling results. To examine the impact of modeling forms on the short-term interest, we consider the recently interacted items without retrieval methods, and design the following four variants of short-term intentions:

• Recent items: we use titles of the recent 10 items as short-term intentions. It is the default and baseline strategy.

• Recent items + summarized text on recent items: we utilize the least-to-most prompting strategy as mentioned in Section 5, which summarizes the interest text based on the recently interacted 10 items, then concatenate the summarized text and the 10 items for modeling short-term intentions.

• Personalized text of recent items: we construct the personalized memory for each user based on the user-item interactions, and use the personalized descriptions of recently interacted items as short-term intentions.

• Recent items + personalized text of recent items: we employ not only titles of the recent 10 items, but also the personalized text of recent items to prompt LLMs as the short-term interest.

Table 6. Performance comparison on the design of personalized query and memory for item retrieval.

Query	Memory	MovieLens-1M NDCG@1	MovieLens-1M NDCG@10	Amazon-Books NDCG@1	Amazon-Books NDCG@10
short-term interest	global memory	0.2000	0.4295	0.2283	0.4144
long- and short-term interest	global memory	0.2200	0.4473	0.2400	0.3980
short-term interest	personalized memory	0.2400	0.4441	0.3250	0.4636
long- and short-term interest	personalized memory	0.2400	0.4563	0.3300	0.4712

Table 7. Performance comparison on various modeling forms of short-term interest.

Short-term interest	MovieLens-1M NDCG@1	MovieLens-1M NDCG@10	Amazon-Books NDCG@1	Amazon-Books NDCG@10
recent items	0.1817	0.3985	0.2467	0.4276
recent items + summarized text on recent items	0.2783	0.5099	0.2617	0.4729
personalized text of recent items	0.2100	0.4380	0.2500	0.4733
recent items + personalized text of recent items	0.2350	0.4483	0.3000	0.5076
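For readers unfamiliar with the metric reported in these tables, the sketch below computes NDCG@k for a single user. It assumes exactly one ground-truth item per user, in which case the ideal DCG is 1 and the metric reduces to a discounted hit; this single-target setting is an assumption of the example.

```python
import math
from typing import List


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """NDCG@k with one relevant item: 1 / log2(rank + 1) if it is in the top k."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # 1-based position of the target item
    return 1.0 / math.log2(rank + 1)
```

Under this assumption, NDCG@1 equals the fraction of users whose ground-truth item is ranked first, which explains why the NDCG@1 values are always lower than the NDCG@10 values.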

As shown in Tables 6 and 7, we summarize the following findings of modeling short-term intentions for prompting LLMs.

(1) The retrieval strategies for recent and relevant items (RQ1). First we conduct an experiment to explore the retrieval strategies for recent and relevant items, and evaluate the selection of query and memory for item retrieval. As shown in Table 6, we compare four combinations with queries and memories. In terms of the query, combining the design of long-term and short-term interest yields better results than only the short-term interest. When selecting items that represent interest for recommendation, the query needs to consider both recent tendencies and long-term preferences of users. Our results have shown that the personalized long-term interest of users can have a positive effect on retrieval for interest modeling. In terms of the memory, the design of the personalized memory with user interest outperforms the global memory with general descriptions. This result is in line with features of personalized recommender systems (Shu et al., 2023; Wang et al., 2023c), indicating the importance of personalization when designing the user interest for LLMs (Zhang et al., 2023c).

(2) The impact of different short-term intentions modeling forms (RQ2). To examine the effectiveness of various modeling forms w.r.t. the short-term interest, we design four variants based on the 10 most recently interacted items and report their recommendation performance in Table 7. We can see that the personalized memory provides more information to LLMs than simple titles (comparison between the first and third lines), demonstrating the role of personalized memories with the user interest. On the other hand, we cannot completely replace the title with a description, as retaining the title and adding personalized descriptions yields even better results (comparison between the last two lines). Without the personalized memory, the least-to-most prompting strategy, which lets LLMs summarize recent items first, can provide personalized descriptions and improve performance (comparison between the first two lines). The impact of the short-term interest modeling forms on recommendation abilities also differs between the two datasets, i.e., MovieLens-1M and Amazon-Books. For the short-term intentions in the MovieLens-1M dataset, the summarized text induced by LLMs is better than the personalized memory, while in the Amazon-Books dataset, the results are the opposite (comparison between the second and fourth lines). This is because the MovieLens-1M dataset contains only ratings and no review comments, so the personalized memory is less informative.

5.2.5. Research Questions and Observations for Modeling Long-term Preferences

In this section, we conduct experiments for modeling the long-term preferences of users in ProLLM4Rec. Specifically, we focus on the following two research questions.

• RQ1: Do different modeling forms of long-term preferences yield similar recommendation performance?

• RQ2: How to combine short-term intentions and long-term preferences for the hybrid interest?

To evaluate the recommendation performance of different modeling methods for long-term preferences, we design the following variants of interest modeling forms to answer the first research question:

• Recent items: it is the basic modeling form of selecting only recent 10 items for interest modeling.

• Summarized user profile + recent items: it concatenates the user profile summarized by LLMs and the recently interacted 10 items for interest modeling, considering both the long-term and short-term interest.

• Retrieved items from long-term history: it retrieves 10 relevant items from the long-term interest. Since recent items have a higher probability to be retrieved, this variant considers recency as well.

• Retrieved items from long-term history + recent items: it uses both the retrieved 10 items from long-term history and the recent 10 items as the representation of user interest, considering both the long-term and short-term interest. It is worth noting that the retrieved items in this variant do not include the latest 10 items.

• Retrieved personalized text from long-term history: it also retrieves 10 relevant items from the long-term history like “retrieved items from long-term history”, but this variant uses the personalized descriptions of items from the personalized memory instead of pure titles to represent the user interest.

• Summarized text based on retrieved personalized text: it employs the retrieved personalized text from long-term history in the last variant by summarizing the retrieved 10 texts into one description. Not providing 10 specific items for LLMs, this variant only summarizes the user interest through personalized text.

Due to the positive impact of both long-term and short-term interest on the recommendation results of LLMs, we further explore different combination approaches of long-term and short-term interest to answer the second research question. Specifically, we consider the following four variants of the combination strategies.

• Short-term interest only: it only utilizes the recently interacted 10 items as the baseline for comparison.

• Concatenate long- and short-term interest: it combines the long-term and short-term interest by concatenating them together for user interest modeling.

• Short-term as query for long-term interest: it combines the long-term and short-term interest by utilizing the recently interacted 10 items as the query for retrieving relevant items from the long-term history.

• User profile and short-term as query for long-term interest: it retrieves recent and relevant items from the long-term interest, and for the query, both the personalized user profile and short-term interest are utilized.

Table 8. Performance comparison on various modeling forms of long-term interest.

User interest modeling	MovieLens-1M NDCG@1	MovieLens-1M NDCG@10	Amazon-Books NDCG@1	Amazon-Books NDCG@10
recent items	0.1817	0.3985	0.2467	0.4276
summarized user profile + recent items	0.2000	0.4466	0.2500	0.4925
retrieved items from long-term history	0.1950	0.4343	0.2850	0.4650
retrieved items from long-term history + recent items	0.2400	0.4441	0.3250	0.4636
retrieved personalized text from long-term history	0.2550	0.4714	0.3150	0.5241
summarized text based on retrieved personalized text	0.2000	0.4492	0.2500	0.4872

Table 9. Performance comparison on the combination strategies of long- and short-term interest.

Combination strategies	MovieLens-1M NDCG@1	MovieLens-1M NDCG@10	Amazon-Books NDCG@1	Amazon-Books NDCG@10
short-term interest only	0.1817	0.3985	0.2467	0.4276
concatenate long- and short-term interest	0.2000	0.4466	0.2500	0.4925
short-term as query for long-term interest	0.2400	0.4441	0.3250	0.4636
user profile and short-term as query for long-term interest	0.2400	0.4563	0.3300	0.4712

As shown in Tables 8 and 9, key findings of the design for long-term preferences in ProLLM4Rec are summarized as follows.

(1) The impact of different long-term preferences modeling forms (RQ1). As for user interest modeling forms in the first research question, Table 8 illustrates the performance comparison on six modeling forms of the long-term interest. Through the comparison between the first two lines, we can see that adding the long-term preferences of users improves on modeling the short-term interest alone. Besides, modeling items retrieved from the long-term interest also improves the recommendation performance compared to only recent items (comparison between the first and third lines), and combining the retrieved items and recent items yields even better results on most metrics in line four, with only a marginal drop in NDCG@10 on Amazon-Books. On the whole, the first four lines in Table 8 indicate the key role of combining the long-term and short-term interest. Furthermore, retrieved personalized texts from the long-term memory greatly improve the recommendation performance compared to only using recent items. However, when we further summarize the retrieved personalized text, the results deteriorate, indicating that the rich information contained in the original sequence cannot simply be condensed into general descriptions. The last two lines in Table 8 highlight the important role of personalized memories w.r.t. personalized users.

(2) The combination strategies for the hybrid interest (RQ2). As for combination strategies of the long-term and short-term interest, we can see from Table 9 that simply concatenating the long-term and short-term sequences, although the simplest and most commonly used approach, is not the optimal solution. An alternative is to use the short-term interest as the query to retrieve the long-term interest, which improves NDCG@1 on both datasets in our empirical analysis, although concatenation remains competitive on NDCG@10. Overall, the performance comparison on the combination strategies validates the importance of combining the long-term and short-term interest.

5.3. Candidate Items Construction

5.3.1. Procedures of Candidate Items Construction

When using LLMs as a recommendation model, a crucial problem is that LLMs tend to recommend items not in the dataset (Geng et al., 2022). One of the approaches is to provide a limited number of candidate items for model selection. On the one hand, the source of candidate items determines the upper limit of personalized recommendation accuracy (Hou et al., 2023; Zhang et al., 2023f). On the other hand, the form of candidate items reflects the specific recommendation task and exerts huge impacts on performance of LLMs for recommendation (Dai et al., 2023a). As shown in Fig. 9, we will discuss the main procedures of processing candidate items, including the source, representation and grounding of candidate items, and then conduct experiments on the effects of candidate items in our framework ProLLM4Rec.

$\bullet$ Source of Candidate Items. In practical industrial applications, recommendation is a multi-stage process, including two kinds of typical stages, i.e., the recall stage and the re-ranking stage (Bao et al., 2023a; Guo et al., 2017; Rendle et al., 2009). The recall stage in recommender systems is used for preliminary filtering and selection from the entire set of item candidates, while the re-ranking stage is used for readjusting the selected candidates, which is more suitable for LLMs as the re-ranker (Hou et al., 2023; Ma et al., 2023). Therefore, the source of candidate items to be ranked is crucial for the results of LLMs for recommendation. Popular approaches for selection of candidate items include traditional recommendation models (Yue et al., 2023) and retrieval algorithms (Lin et al., 2023b), both of which provide a list of potentially recommended items in a comparably coarse granularity.

$\bullet$ Representation of Candidate Items. In the construction of prompts, we need to use historical items to represent users and provide candidate items for LLMs to rank, both of which involve item indexing (Geng et al., 2022; Lin et al., 2023c). However, there is a gap between the item representation in LLMs and in recommendation, so the indexing of items between LLMs and recommender systems becomes a key difficulty of ProLLM4Rec. There are three typical ways to index items: 1) token-based identifiers, 2) description-based identifiers and 3) hybrid identifiers (Zheng et al., 2023). For token-based identifiers, researchers often use numerical IDs to identify items (Geng et al., 2022). Since token-based identifiers are widely utilized in traditional recommendation models, they are naturally treated as a way of indexing items for aligning LLMs with recommendation tasks. For description-based identifiers, researchers generally use the title or other textual attributes of the item as an identifier, and formalize it into natural language as input (Dai et al., 2023a; Liu et al., 2023c). Due to the richness of natural language, description-based identifiers have high readability and flexibility. For hybrid identifiers, considering collaborative and textual information of users or items, LLMs can perceive both token-based and description-based identifiers (Liao et al., 2023; Li et al., 2023b). Current work using hybrid identifiers can be divided into different types. Some studies insert item sequences with textual descriptions and with ID information into different positions of prompts (Zhang et al., 2023b), while others combine token-based and description-based embeddings for each item in sequences through concatenation (Li et al., 2023b).

$\bullet$ Grounding to Candidate Items. After constructing candidate items for recommendation, how to ground the output of LLMs into recommendation item lists is a crucial problem. Methods for grounding to candidate items are associated with different types of recommendation tasks. For discriminative recommendation, which provides candidate item lists in the prompts, it is easy for LLMs to follow instructions and generate the output within the candidate item lists through elaborately designed prompts (Zhiyuli et al., 2023; Liao et al., 2023). For generative recommendation tasks, which do not explicitly provide a candidate item list in prompts but require further processing of the output, recent work indicates effective approaches for mapping the output of LLMs into candidate item lists (Zheng et al., 2023). For example, GPT4Rec (Li et al., 2023m) first generates hypothetical “search queries” using titles of recently interacted items, and then retrieves similar items from the item pool with the BM25 algorithm. E4SRec (Li et al., 2023b) directly utilizes the nearest neighbor search between the output of LLMs and the vectors in the item linear projection component for grounding to candidate items. In this case, the recommendation process will not suffer from length limitation problems, since these methods do not include candidate item lists in the input of LLMs.

As shown in Figure 1, our framework ProLLM4Rec involves two mapping processes between the recommendation space and language space: (1) when using LLMs to represent users, items in the recommendation space need to be used as historical sequences. (2) When re-ranking candidate items, it is required to provide candidate items in the recommendation space for LLMs. The key issue of semantic space alignment is the indexing and grounding of items. As for the item indexing, pseudo ID-based sequences and description-based natural languages are two typical forms. As for the grounding methods, the generative method has LLMs output the labels of all candidate items, and other mapping strategies such as logits distribution and similarity calculation can also be considered (Yue et al., 2023; Li et al., 2023b).

5.3.2. Research Questions and Experimental Setup

In this section, we explore the effect of candidate items on ProLLM4Rec, and provide empirical findings on the following research questions:

• RQ1: Do different representations of candidate items lead to similar recommendation performance?

• RQ2: Given a list of candidates recalled by traditional recommenders, can LLMs improve the recommendation results by further re-ranking the given candidates?

In this paper, we explore the performance differences between the following two item representation methods:

• ID-based identifiers: we can assign serial numbers or indicators (e.g., ABCD) to the candidate items, and instruct LLMs to only output the re-ranked indicators instead of the complete text.

• Description-based identifiers: we instruct LLMs to directly output the complete text, e.g., titles of candidate items, and the re-ranking results are parsed from the textual output of LLMs.
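The two representation methods imply different parsing steps for the output of LLMs. The sketch below illustrates one possible parser for each; the function names, the single-capital-letter indicator format, and the fuzzy-matching cutoff of 0.8 are assumptions of this example, not the parsing rules used in our experiments.

```python
import difflib
import re
from typing import Dict, List


def ground_by_indicator(output: str, labelled: Dict[str, str]) -> List[str]:
    """Map indicators such as 'B, A, C' back to titles.

    Unknown or repeated indicators are skipped. A stray single capital
    letter in free text would also match, which is a known limitation.
    """
    seen, ranked = set(), []
    for label in re.findall(r"\b[A-Z]\b", output):
        if label in labelled and label not in seen:
            seen.add(label)
            ranked.append(labelled[label])
    return ranked


def ground_by_title(output: str, candidates: List[str], cutoff: float = 0.8) -> List[str]:
    """Map each output line to the closest candidate title, if close enough."""
    ranked: List[str] = []
    for line in output.splitlines():
        # Drop a leading list number such as "1." or "2)".
        text = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if not text:
            continue
        match = difflib.get_close_matches(text, candidates, n=1, cutoff=cutoff)
        if match and match[0] not in ranked:
            ranked.append(match[0])
    return ranked
```

Lines that match no candidate are discarded, which is one simple way of handling items that are not in the candidate list.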

Moreover, researchers tend to adopt a two-stage recommendation paradigm, in which conventional recommenders initially recall multiple candidates, and subsequently, LLMs evaluate and rank these candidates to provide enhanced recommendations. This approach effectively reduces the computational load of LLMs, enabling more efficient deployment. However, there is still no consensus regarding the effectiveness of LLMs in reranking the candidates retrieved by traditional recommendation models (Hou et al., 2023; Ren et al., 2023). In order to address this, we conduct experiments to compare the impact of LLMs in re-ranking the original recommendation results generated by traditional recommenders. We evaluate three classical recommenders, including Pop, BPR (Rendle et al., 2009), and SASRec (Kang and McAuley, 2018), and report the result in Table 11.
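The two-stage paradigm described above can be sketched with a popularity-based recall stage followed by a re-ranking stage. The `rerank` callable is a placeholder for an LLM-based re-ranker, and the function names and the fallback for omitted items are assumptions of this example.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def recall_by_popularity(all_interactions: Iterable[str], seen: Set[str], n: int) -> List[str]:
    """Recall stage: the n most popular items the user has not interacted with."""
    counts = Counter(all_interactions)
    return [item for item, _ in counts.most_common() if item not in seen][:n]


def two_stage_recommend(
    all_interactions: Iterable[str],
    seen: Set[str],
    rerank: Callable[[List[str]], List[str]],
    n_candidates: int = 20,
) -> List[str]:
    """Recall candidates, then let `rerank` reorder them."""
    candidates = recall_by_popularity(all_interactions, seen, n_candidates)
    reranked = rerank(candidates)
    # Keep only recalled items (dropping duplicates and items outside the
    # candidate list), then append any candidates the re-ranker omitted.
    valid = [item for item in dict.fromkeys(reranked) if item in candidates]
    return valid + [item for item in candidates if item not in valid]
```

Because the LLM only sees `n_candidates` items, the cost of the second stage is bounded regardless of the size of the item pool.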

Table 10. Performance comparison on the grounding forms of LLMs w.r.t. candidate items.

Method	MovieLens-1M			Amazon-Books
Method	ndcg@1	ndcg@10	ndcg@20	ndcg@1	ndcg@10	ndcg@20
ID-based identifiers	0.2750	0.5423	0.5725	0.0600	0.1473	0.1685
Description-based identifiers	0.3100	0.5828	0.6043	0.1950	0.4546	0.5079
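For readers who want to reproduce the metric columns, the sketch below computes NDCG@k for one user. It assumes a single held-out ground-truth item per user, which this section does not state explicitly; under that assumption the ideal DCG is 1 and NDCG@1 reduces to a hit indicator. The function name is illustrative.

```python
import math


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k for one user with a single relevant item (assumed protocol)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)
```

Averaging this value over all evaluated users would give one table entry.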

Table 11. Performance comparison on the effects of LLMs w.r.t. the source of candidate items.

Method	MovieLens-1M			Amazon-Books
Method	ndcg@1	ndcg@10	ndcg@20	ndcg@1	ndcg@10	ndcg@20
Pop	0.0000	0.0071	0.0110	0.0000	0.0000	0.0012
Pop (+ LLM)	0.0000	0.0033	0.0086	0.0000	0.0014	0.0014
Impr.	0.00%	-53.52%	-21.82%	0.00%	1400.00%	16.67%
BPR	0.0000	0.0084	0.0135	0.0000	0.0064	0.0076
BPR (+ LLM)	0.0050	0.0146	0.0170	0.0050	0.0082	0.0107
Impr.	5000.00%	73.81%	25.93%	5000.00%	28.13%	40.79%
SASRec	0.0750	0.1600	0.1836	0.0450	0.1129	0.1318
SASRec (+ LLM)	0.0700	0.1475	0.1736	0.0750	0.1242	0.1454
Impr.	-6.67%	-7.81%	-5.45%	66.67%	10.01%	10.32%
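The Impr. rows with a non-zero baseline are consistent with the relative change between the base recommender and its LLM re-ranked counterpart. The sketch below shows that computation; the function name is illustrative. Note that the relative change is undefined when the baseline is 0.0000, so the percentages reported for such cells (e.g., 5000.00%) cannot be derived this way and presumably follow a different convention.

```python
def relative_improvement(base, reranked):
    """Relative change in percent; None when the baseline is zero (undefined)."""
    if base == 0:
        return None
    return (reranked - base) / base * 100.0


# Pop on MovieLens-1M, ndcg@10: 0.0071 before and 0.0033 after re-ranking.
pop_ml_ndcg10 = round(relative_improvement(0.0071, 0.0033), 2)
# SASRec on Amazon-Books, ndcg@1: 0.0450 before and 0.0750 after re-ranking.
sasrec_books_ndcg1 = round(relative_improvement(0.0450, 0.0750), 2)
```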

5.3.3. Observations and Discussion

As shown in Tables 10 and 11, the findings on candidate item construction are as follows.

(1) Grounding forms of candidate items (RQ1). Firstly, we focus on the grounding forms of candidate items. As shown in Table 10, on both the movie dataset and the book dataset, having LLMs output the complete item name grounds the recommendation results more reliably than having them output the assigned identifier. In other words, description-based identifiers outperform ID-based identifiers when grounding candidate items for LLMs (Zhao et al., 2023; Hou et al., 2023). However, whether it is ID-based or description-based, the inference time of a grounding strategy based on generative output increases with the number of candidate items. A feasible solution is to use one prediction to obtain the probability scores of all candidate items from the output logits (Yue et al., 2023). Language models handle textual output better than pure identifiers, but more advanced indexing strategies and grounding methods can also be used to improve the accuracy of the mapping between text and items (Hua et al., 2023), which requires continued exploration in future research (Ren et al., 2023).
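The logits-based alternative mentioned above can be illustrated with a minimal sketch: given the logit that a single forward pass assigns to each candidate's identifier token, a softmax over those logits yields one score per candidate. How the logits are obtained depends on the model and is not shown; the input values and the function name are hypothetical, and this is not the exact procedure of Yue et al. (2023).

```python
import math


def rank_by_logits(identifier_logits):
    """Rank candidates by a softmax over their identifier-token logits.

    identifier_logits: dict mapping an identifier (e.g., "A") to its logit,
    assumed to come from one forward pass of the language model.
    """
    max_logit = max(identifier_logits.values())
    # Subtract the maximum for numerical stability before exponentiating.
    exp = {k: math.exp(v - max_logit) for k, v in identifier_logits.items()}
    total = sum(exp.values())
    probs = {k: v / total for k, v in exp.items()}
    ranking = sorted(probs, key=probs.get, reverse=True)
    return ranking, probs
```

Because all candidates are scored in one pass, the cost no longer grows with the length of a generated ranking, only with the prompt length.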

(2) The evaluation of the re-ranking abilities of LLMs (RQ2). Secondly, we discuss the performance difference between traditional algorithms and the results after LLM re-ranking. As shown in Table 11, LLMs improve the results of the traditional recommendation method BPR on both datasets by a large margin. However, BPR is not specifically designed for sequential recommendation and its original performance is poor. As for the basic method Pop and the typical sequential recommendation model SASRec, the re-ranking effect of LLMs varies between the two datasets. On the MovieLens-1M dataset in the movie domain, our initial prompts cannot instruct LLMs to achieve better results than traditional methods such as Pop and SASRec. A possible reason is that this movie dataset is very dense, so fully-trained collaborative filtering signals are more important than general knowledge of movies. On the Amazon-Books dataset, in contrast, LLM re-ranking further improves the results of Pop and SASRec, possibly because the general knowledge of LLMs about books adapts well to sparse recommendation data. Overall, the experimental results indicate that the re-ranking performance of LLMs varies depending on the specific recommendation method and dataset. Nevertheless, the general knowledge of LLMs can complement the collaborative signals of traditional models, which offers the potential to improve traditional recommendation algorithms (Li et al., 2023a; Zhang et al., 2023b).

5.4. Prompting Strategies

With the construction of prompts for recommendation tasks, prompting strategies are leveraged to further elicit the general abilities of LLMs in text comprehension and problem solving (Zhao et al., 2023; Liu et al., 2023f; Le Scao and Rush, 2021). In this section, we first summarize prompting strategies specialized for recommender systems, and then provide empirical findings on two research questions.

5.4.1. Prompting Strategies Specialized for Recommender Systems

In general tasks with LLMs, researchers use methods such as Chain-of-Thought (CoT) prompting to stimulate the logical reasoning ability of LLMs for solving complex problems (Diao et al., 2023; Liu et al., 2022). As for ProLLM4Rec, leveraging LLMs for recommender systems can also employ prompting strategies to further improve the performance of recommendations (Yao et al., 2023; Liu et al., 2023e). In addition, due to the user and item settings of the recommendation task, the prompting strategy needs to be customized based on the specific user needs in recommender systems. In this paper, we concentrate on typical prompting strategies for recommendation tasks, i.e., zero-shot prompting, few-shot prompting, recency-focused prompting, role prompting, chain-of-thought prompting and self-prompting strategy, and more advanced planning strategies in ProLLM4Rec are left for future exploration (Liu et al., 2023f).

$\bullet$ Zero-shot prompting is the basic scenario and common form among various prompting strategies. Zero-shot prompting directly provides the context information and task description for LLMs without reference examples. The task description and prompt formats vary in line with different recommendation tasks (Wang et al., 2023c; Liu et al., 2023c; Dai et al., 2023a).

$\bullet$ Few-shot prompting, in contrast to zero-shot prompting, provides a few demonstrations in prompts to help LLMs better understand the user intention (Brown et al., 2020). For recommendation tasks, the effectiveness of few-shot prompting depends on how well the provided examples match the current interest of the individual user. Therefore, it is worth discussing how to provide reliable context for subsequent recommendations (Hou et al., 2023).

$\bullet$ Recency-focused prompting was first proposed by Hou et al. (2023), based on the observation that the next item to be predicted correlates more strongly with the most recently interacted item than with other historical items. Therefore, explicitly emphasizing the recently interacted items in prompts is also a practical prompting strategy. In our initial prompt, recency-focused prompting is used by default.

$\bullet$ Role prompting refers to playing a “role-playing” game with the language model. In the recommendation task, we can specify the role of LLMs as the recommender system to serve users in a targeted manner, adding auxiliary information to describe the role of recommenders in detailed prompts such as conversational recommender systems (He et al., 2023; Wang et al., 2023h; Jin et al., 2023).

$\bullet$ Chain-of-thought prompting is also known as the CoT prompting, which is widely applied in reasoning tasks such as question answering and mathematical inference (Wei et al., 2022). CoT prompting can elicit the ability of LLMs to solve problems step by step (Kojima et al., 2022), or further explicitly decompose the reasoning and analysis process of the task using least-to-most prompting strategy (Zhou et al., 2022). In the field of recommender systems, explicit steps can also be provided manually by researchers to assist in solving recommendation tasks (Wang and Lim, 2023). For ProLLM4Rec, we not only focus on the basic CoT prompts, but also explore the role of custom-designed step-by-step prompts.

$\bullet$ Self-prompting strategy means that the knowledge for answering questions can be obtained by prompting LLMs multiple times. LLMs can be asked to generate relevant knowledge, providing necessary information for concepts in the original problem (Liu et al., 2022). Meanwhile, the randomness of LLM generation allows them to produce multiple inference chains, and self-consistency (Wang et al., 2022) applies majority voting over the results from all chains to obtain the final prediction. When applying LLMs to the re-ranking stage of a recommender system, there are many possible combinations of candidate items, and the recommendation results from the same language model with the same prompt can be totally different. Therefore, bootstrapping with multiple trials is a principled way to reduce bias (Hou et al., 2023; Ma et al., 2023; Li et al., 2023i).
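As an illustration of how results from multiple trials might be combined, the sketch below aggregates several re-ranked lists with a Borda-style score, where an item earns more points the higher it is placed in each trial. This is one simple aggregation rule chosen for illustration; the cited works may aggregate differently, and the function name is an assumption.

```python
from collections import defaultdict


def aggregate_rankings(trials, candidates):
    """Combine rankings from repeated LLM calls with a Borda-style count.

    trials: list of ranked item lists, one per trial.
    candidates: the original candidate list; output items not in it are ignored.
    """
    scores = defaultdict(float)
    n = len(candidates)
    for ranking in trials:
        for position, item in enumerate(ranking):
            if item in candidates:
                scores[item] += n - position  # first place earns n points
    # sorted() is stable, so ties keep the original candidate order.
    return sorted(candidates, key=lambda item: -scores[item])
```

Items that a trial omits simply earn no points from that trial, so an occasional malformed output degrades the aggregate only mildly.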

5.4.2. Research Questions and Experimental Setup

In this section, we explore the construction and design of prompting strategies for ProLLM4Rec, and provide empirical findings on the following two research questions:

• RQ1: Do different expressions of prompts lead to similar recommendation performance?
• RQ2: How should the planning module be chosen to stimulate the recommendation ability of LLMs?

As for the strategy of prompt engineering in ProLLM4Rec, we conduct experiments with different variants of prompts compared with the basic prompt as follows:

• Original: it is the basic prompt used in our experiments as illustrated in Figure 2(a).
• w/o Recency-focused prompting: we remove prompts on the recently interacted item.
• w/ Role prompting: we add the role-playing description to the original prompt.
• w/o CoT (step-by-step): we remove the magical spell “let’s think step by step” in the original prompt.
• w/ CoT (least-to-most): we obtain recommendation results by prompting LLMs twice. First, LLMs are prompted to summarize the personal preference of users based on the recently interacted item sequence. Then in the second stage, we concatenate the summarized profile to the original prompt for recommendation.
• w/ ICL (self): we use the last interaction of the same user as the example for In-Context Learning (ICL) (Hou et al., 2023).
• w/ ICL (others): we retrieve an example from historical sequences of other users for the current user, and add the demonstration to the original prompt. For the selection of examples, we encode recent items of users by textual representations, and search for similar user sequences based on the inner product between vectors.
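The two-call structure of the w/ CoT (least-to-most) variant above can be sketched as follows. The `llm` argument stands for any callable that maps a prompt string to a response string, and the wording of both prompts is hypothetical; the actual prompts used in our experiments are not reproduced here.

```python
def least_to_most_recommend(llm, recent_items, original_prompt):
    """Two-stage prompting: summarize preferences first, then recommend.

    llm: callable taking a prompt string and returning a string (assumed).
    recent_items: titles of the recently interacted items, oldest first.
    original_prompt: the basic recommendation prompt, including candidates.
    """
    # Stage 1: ask the model to summarize the user's preferences.
    summary_prompt = (
        "The user recently interacted with these items, in order: "
        + "; ".join(recent_items)
        + ". Summarize the user's preferences in one or two sentences."
    )
    profile = llm(summary_prompt)
    # Stage 2: concatenate the summarized profile to the original prompt.
    rerank_prompt = f"{original_prompt}\nUser preference summary: {profile}"
    return llm(rerank_prompt)
```

Keeping the summary as a separate call makes the intermediate profile inspectable, at the cost of doubling the number of LLM calls per user.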

Table 12. Performance comparison on the prompt engineering of LLMs in ProLLM4Rec.

	MovieLens-1M			Amazon-Books
Method	ndcg@1	ndcg@10	ndcg@20	ndcg@1	ndcg@10	ndcg@20
Original	0.1817	0.3985	0.4629	0.2467	0.4276	0.5054
Zero-shot prompting
w/o Recency-focused prompting	0.1550	0.3781	0.4546	0.2300	0.4256	0.4978
w/ Role prompting	0.2750	0.4821	0.5379	0.2800	0.4814	0.5489
w/o CoT (step-by-step prompting)	0.2400	0.4543	0.5193	0.2650	0.4439	0.5163
w/ CoT (least-to-most prompting)	0.2783	0.5099	0.5603	0.2617	0.4729	0.5403
Few-shot prompting
w/ ICL (self)	0.2000	0.4558	0.5171	0.2500	0.4265	0.5055
w/ ICL (others)	0.1900	0.4015	0.4684	0.2500	0.4039	0.4950

5.4.3. Observations and Discussion

As shown in Table 12, findings on prompting strategies of LLMs are listed as follows:

(1) The impact of different expressions of prompts (RQ1). We compare different prompting strategies for LLMs, and report results grouped into zero-shot prompting and few-shot prompting. In Table 12, “w/o” denotes that we remove the related description from the original prompt, and “w/” denotes that we add the corresponding instruction to the prompt. For the zero-shot prompting strategies, removing the recency-focused prompting sentence (w/o recency-focused) decreases the recommendation performance on every metric, most clearly on ndcg@1 (from 0.1817 to 0.1550 on MovieLens-1M), demonstrating the key role of recently interacted items in recommendation. Although we provide historical items in chronological order based on timestamps, LLMs still need explicit guidance to understand the importance of recent items, indicating that heuristic knowledge of the recommendation field needs to be supplied to LLMs. Meanwhile, adding the role prompting description to the original prompt (w/ role prompting) clearly improves the performance of zero-shot prompting (e.g., ndcg@10 from 0.3985 to 0.4821 on MovieLens-1M), which shows that role-playing and expert-like prompts can better leverage the capabilities of LLMs in specific fields or tasks (Zhao et al., 2023). In addition, an important strategy for zero-shot prompting is chain-of-thought, and we compare the effects of two kinds of CoT prompting. When we remove the basic prompting sentence “let’s think step by step” from the original prompt, the recommendation performance improves, possibly because step-by-step outputs make it harder to extract the results from the generated text. It also indicates that recommendation tasks may require task-specific problem decomposition rather than general prompts, and the superior results of summarizing recent interests before recommendation (the least-to-most prompting strategy) support this finding.

(2) The effects of planning modules of prompts (RQ2). In contrast to zero-shot prompting, the typical representative of few-shot prompting in ProLLM4Rec is in-context learning (ICL) based on contextual examples, and we study the ICL results with demonstrations from the current user (self) and from other users (others), respectively. From the results in Table 12, it can be seen that using examples from the target user is generally better than using demonstrations from other users, reflecting the personalized needs of each user. However, few-shot prompting shows only a limited advantage over the zero-shot prompting strategies, which suggests that ICL is not fully applicable to recommendation scenarios.
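The example selection for w/ ICL (others) described in the experimental setup can be sketched as a nearest-neighbor search by inner product. The sketch assumes that each user's recent items have already been encoded into a vector by some text encoder, which is not shown; the data layout, function names, and demonstration wording are hypothetical.

```python
def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_demonstration(target_vector, pool):
    """Pick the other-user example whose vector is most similar to the target.

    pool: list of dicts with keys "vector", "history", and "next_item".
    """
    return max(pool, key=lambda ex: inner_product(target_vector, ex["vector"]))


def format_demonstration(example):
    """Render the retrieved example as a sentence to add to the prompt."""
    history = ", ".join(example["history"])
    return (
        f"Example: a user who interacted with {history} "
        f"then chose {example['next_item']}."
    )
```

For w/ ICL (self), no retrieval would be needed: the demonstration would be built from the target user's own earlier interactions instead.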

6. Conclusion

This paper aims to provide a comprehensive exploration of Large Language Models (LLMs) to serve as recommender systems. It presents a systematic review of the advancements made in LLM-based recommendations, generalizing related work into multiple scenarios and tasks in terms of LLMs and prompts. We also conduct extensive experiments on two public datasets to investigate empirical findings for recommendation with LLMs. Our objective is to assist researchers in gaining a deeper understanding of the characteristics, strengths, and limitations of LLMs utilized as recommender systems. Considering the significant progress in LLMs, the development of LLM-based recommendations holds the potential to better align the powerful capabilities of LLMs with the evolving needs of intended users in the field of recommender systems. By addressing current challenges, we hope that our work will contribute to the advancement of LLM-based recommendations and serve as an inspiration for future research efforts. Last but not least, we outline promising directions for future research in utilizing LLMs for recommendation as follows.

$\bullet$ Efficiency optimization of LLMs for recommendation. The key limitation of leveraging LLMs in industrial recommender systems is efficiency (Li et al., 2023l; Wu et al., 2023b; Fan et al., 2023), in terms of both time and space. On the one hand, the fine-tuning and inference efficiency of LLMs cannot compare to that of traditional recommendation models (Zheng et al., 2023; Hou et al., 2023). While techniques such as parameter-efficient fine-tuning can aid in keeping LLMs updated in a computationally efficient manner, recommender systems need to iterate continuously over time, i.e., incremental learning. Frequent updates of LLMs inevitably bring spatial and temporal burdens to recommender systems (Shi et al., 2023). On the other hand, the billions of parameters in LLMs also pose challenges for the lightweight deployment of recommendation algorithms (Touvron et al., 2023a, b). Therefore, efficiency optimization of LLMs utilized as recommender systems is one of the prerequisites for large-scale applications, and has broad application prospects and scientific research value (Shi et al., 2023; Wang et al., 2023b; Huang et al., 2023).

$\bullet$ Knowledge distillation of LLMs for recommendation. Since LLMs as recommenders are limited by efficiency, another feasible approach is to distill (Sun et al., 2024; Li et al., 2023a) the recommendation capabilities of LLMs into lightweight models, striking a balance between efficiency and effectiveness. Specifically, knowledge distillation is a classic model compression method adopted in recommender systems (Sun et al., 2024; Tian et al., 2023), with the core idea of guiding lightweight student models to “imitate” teacher models with better performance and more complex structures such as LLMs. In the field of recommender systems, the collaborative optimization of LLMs and recommendation models can also be seen as the distillation process to inject knowledge from LLMs to traditional recommenders, enhancing representations of users and items from semantic features (Li et al., 2023a; Qiu et al., 2023; Xi et al., 2023). Due to the fact that knowledge distillation can improve efficiency while retaining the capabilities of LLMs for recommendation, more applications need to be fully explored.

$\bullet$ Multimodal recommendations with LLMs. In addition to IDs and text, multimodal recommendations with LLMs hold considerable promise and warrant comprehensive exploration with the evolving landscape of media consumption (Yang et al., 2023c; Zhang et al., 2023g). The essence of multimodal recommendations resides in the fusion of textual and visual information for enhanced user engagement (Pan et al., 2022; Harrison et al., 2023), and the dual functionality of LLMs is pivotal in this context. LLMs possess the capability to function as multimodal LLMs, enabling the incorporation and encoding of visual information extracted from images. Furthermore, images can be transformed into textual representations by multimodal encoders first, and LLMs are mainly used for the subsequent integration of diverse modalities (Harrison et al., 2023). In addition, the rich multimodal attributes also provide a basis for diversified recommendation results (Lin et al., 2023c). As the field progresses, an emphasis on the reproducibility, benchmarking, and standardization of evaluation datasets and metrics will be essential to foster a cohesive and informed advancement in multimodal recommender systems leveraging LLMs.

$\bullet$ Fairness-aware recommendations of LLMs. Future research in the domain of fairness-aware recommendations with LLMs presents a compelling avenue for scholarly inquiry (Wang et al., 2023d; Zhang et al., 2023a; Li et al., 2023k; Dai et al., 2023b). In line with existing work, researchers have explored that retrievers in information retrieval are biased towards contents generated by LLMs (Dai et al., 2023b), and LLMs utilized as recommender systems also output unfair recommendation results (Wang et al., 2023d; Zhang et al., 2023a; Li et al., 2023k), necessitating a profound investigation into methodologies that ensure equitable and unbiased outcomes for LLM-based recommendations. It is imperative to scrutinize existing fairness-aware algorithms and develop novel approaches that cater to the intricacies of language models, particularly in understanding how fairness metrics align with user expectations. As LLMs continue to evolve, the research community must collaborate to establish standardized evaluation metrics and benchmarks for fairness-aware recommendations, ensuring the reproducibility and comparability of findings across diverse studies in the era of LLMs. In essence, fairness-aware recommendations with LLMs are poised to contribute substantially to the development of ethical and equitable recommender systems (Wang et al., 2023d; Zhang et al., 2023a).

$\bullet$ General-purpose LLMs in the vertical field of recommender systems. Developing a comprehensive framework for LLMs to address multiple recommendation tasks represents a significant avenue for scholarly exploration (Geng et al., 2022; Chu et al., 2023; Zhang et al., 2023f; Wang et al., 2023c). The endeavor involves formulating a unified and versatile structure that accommodates the intricacies of various recommendation tasks (Dai et al., 2023a), encompassing diverse modalities and user preferences (Lin et al., 2023c; Harrison et al., 2023). In addition, using LLM-based agents to simulate recommendation scenarios and make dynamic decisions themselves is also a feasible application of general-purpose LLMs (Wang et al., 2023i; Zhang et al., 2023c; Wang et al., 2023g). Research efforts should focus on refining the architecture, training methodologies, and adaptability of such a general framework to ensure optimal performance across different recommendation domains. Investigating transfer learning techniques within this framework, enabling the transfer of knowledge between recommendation tasks, is crucial for enhancing efficiency and leveraging shared information (Zhang et al., 2023f; Bao et al., 2023b). Overall, general-purpose LLMs for recommendation have the potential to revolutionize recommender systems by providing a unified and scalable solution capable of addressing the multifaceted challenges posed by diverse recommendation tasks.

$\bullet$ Privacy and ethical concerns. Prior studies (Shen et al., 2023; Weidinger et al., 2021) have highlighted the potential issue of language models generating unreliable or personal content in response to certain prompts and insecure instructions. Recommender systems involve massive amounts of user data (Ni et al., 2019; Guo et al., 2017), and it is crucial to remove private and potentially harmful information stored in LLMs to enhance the privacy and security of LLM-based applications (Carranza et al., 2023; Lei et al., 2023). Notably, researchers have observed that Reinforcement Learning from Human Feedback (RLHF) and model editing techniques (Geva et al., 2022) have the potential to restrain the generation of toxic or harmful content from LLMs, thereby mitigating privacy and ethical concerns associated with privacy-preserving recommender systems (Carranza et al., 2023). Nevertheless, it is imperative to acknowledge that the alignment technology of LLMs may be vulnerable to misuse. There is a concern that LLMs could be manipulated by malicious users to selectively influence agents and amplify specific viewpoints within recommender systems. Consequently, addressing the security and privacy implications of LLM-based recommendations is crucial, and there is a need to formulate public regulations to ensure responsible usage and mitigate potential risks (Weidinger et al., 2021).

Acknowledgements

The authors would like to thank Chuyuan Wang and Chenrui Zhang for participating in discussions of this paper. Lanling Xu is supported by Meituan Group during her research internship. Xin Zhao is the corresponding author.

References

Agrawal et al. (2023) Saurabh Agrawal, John Trenkle, and Jaya Kawale. 2023. Beyond Labels: Leveraging Deep Learning and LLMs for Content Metadata. In RecSys. ACM, 1.
Bao et al. (2023a) Keqin Bao, Jizhi Zhang, Wenjie Wang, Yang Zhang, Zhengyi Yang, Yancheng Luo, Fuli Feng, Xiangnan He, and Qi Tian. 2023a. A bi-step grounding paradigm for large language models in recommendation systems. arXiv preprint arXiv:2308.08434 (2023).
Bao et al. (2023b) Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023b. TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. In RecSys. ACM, 1007–1014.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
Buckland and Gey (1994) Michael Buckland and Fredric Gey. 1994. The relationship between recall and precision. Journal of the American society for information science 45, 1 (1994), 12–19.
Carranza et al. (2023) Aldo Gael Carranza, Rezsa Farahani, Natalia Ponomareva, Alex Kurakin, Matthew Jagielski, and Milad Nasr. 2023. Privacy-Preserving Recommender Systems with Synthetic Query Generation using Differentially Private Large Language Models. arXiv preprint arXiv:2305.05973 (2023).
Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
Chu et al. (2023) Zhixuan Chu, Hongyan Hao, Xin Ouyang, Simeng Wang, Yan Wang, Yue Shen, Jinjie Gu, Qing Cui, Longfei Li, Siqiao Xue, et al. 2023. Leveraging large language models for pre-trained recommender systems. arXiv preprint arXiv:2308.10837 (2023).
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
Colas et al. (2023) Anthony Colas, Jun Araki, Zhengyu Zhou, Bingqing Wang, and Zhe Feng. 2023. Knowledge-grounded Natural Language Recommendation Explanation. arXiv preprint arXiv:2308.15813 (2023).
Dai et al. (2023a) Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023a. Uncovering ChatGPT’s Capabilities in Recommender Systems. In RecSys. ACM, 1126–1132.
Dai et al. (2023b) Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, and Jun Xu. 2023b. Llms may dominate information access: Neural retrievers are biased towards llm-generated texts. arXiv preprint arXiv:2310.20501 (2023).
Di Palma et al. (2023) Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia, and Eugenio Di Sciascio. 2023. Evaluating ChatGPT as a Recommender System: A Rigorous Approach. arXiv preprint arXiv:2309.03613 (2023).
Diao et al. (2023) Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246 (2023).
Du et al. (2023) Yingpeng Du, Di Luo, Rui Yan, Hongzhi Liu, Yang Song, Hengshu Zhu, and Jie Zhang. 2023. Enhancing Job Recommendation through LLM-based Generative Adversarial Networks. arXiv preprint arXiv:2307.10747 (2023).
Fan et al. (2023) Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li. 2023. Recommender systems in the era of large language models (llms). arXiv preprint arXiv:2307.02046 (2023).
Friedman et al. (2023) Luke Friedman, Sameer Ahuja, David Allen, Terry Tan, Hakim Sidahmed, Changbo Long, Jun Xie, Gabriel Schubiner, Ajay Patel, Harsh Lara, et al. 2023. Leveraging Large Language Models in Conversational Recommender Systems. arXiv preprint arXiv:2305.07961 (2023).
Fu et al. (2023) Zichuan Fu, Xiangyang Li, Chuhan Wu, Yichao Wang, Kuicai Dong, Xiangyu Zhao, Mengchen Zhao, Huifeng Guo, and Ruiming Tang. 2023. A Unified Framework for Multi-Domain CTR Prediction via Large Language Models. arXiv:2312.10743 [cs.IR]
Gao et al. (2023) Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-rec: Towards interactive and explainable llms-augmented recommender system. arXiv preprint arXiv:2303.14524 (2023).
Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 30–45.
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
Haleem et al. (2022) Abid Haleem, Mohd Javaid, and Ravi Pratap Singh. 2022. An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges. BenchCouncil transactions on benchmarks, standards and evaluations 2, 4 (2022), 100089.
Harper (2015) F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2015), 1–19.
Harrison et al. (2023) Rachel M Harrison, Anton Dereventsov, and Anton Bibin. 2023. Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging. arXiv preprint arXiv:2309.01026 (2023).
Harte et al. (2023) Jesse Harte, Wouter Zorgdrager, Panos Louridas, Asterios Katsifodimos, Dietmar Jannach, and Marios Fragkoulis. 2023. Leveraging Large Language Models for Sequential Recommendation. In RecSys. ACM, 1096–1102.
He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 639–648.
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web. 173–182.
He et al. (2023) Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian J. McAuley. 2023. Large Language Models as Zero-Shot Conversational Recommenders. In CIKM. ACM, 720–730.
Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
Hou et al. (2023) Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023. Large language models are zero-shot rankers for recommender systems. arXiv preprint arXiv:2305.08845 (2023).
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
Hua et al. (2023) Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023. How to Index Item IDs for Recommendation Foundation Models. arXiv preprint arXiv:2305.06569 (2023).
Huang et al. (2023) Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, and Xing Xie. 2023. Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations. arXiv preprint arXiv:2308.16505 (2023).
Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422–446.
Ji et al. (2023) Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, and Yongfeng Zhang. 2023. Genrec: Large language model for generative recommendation. arXiv e-prints (2023), arXiv–2307.
Jin et al. (2023) Jiarui Jin, Xianyu Chen, Fanghua Ye, Mengyue Yang, Yue Feng, Weinan Zhang, Yong Yu, and Jun Wang. 2023. Lending Interaction Wings to Recommender Systems with Conversational Agents. arXiv preprint arXiv:2310.04230 (2023).
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
Kang et al. (2023) Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction. arXiv preprint arXiv:2305.06474 (2023).
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
Le Scao and Rush (2021) Teven Le Scao and Alexander M Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2627–2636.
Lei et al. (2023) Yuxuan Lei, Jianxun Lian, Jing Yao, Xu Huang, Defu Lian, and Xing Xie. 2023. RecExplainer: Aligning Large Language Models for Recommendation Model Interpretability. arXiv preprint arXiv:2311.10947 (2023).
Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
Li et al. (2023d) Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023d. Text Is All You Need: Learning Language Representations for Sequential Recommendation. arXiv preprint arXiv:2305.13731 (2023).
Li et al. (2023e) Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian J. McAuley. 2023e. Text Is All You Need: Learning Language Representations for Sequential Recommendation. In KDD. ACM, 1258–1267.
Li et al. (2023m) Jinming Li, Wentao Zhang, Tian Wang, Guanglei Xiong, Alan Lu, and Gerard Medioni. 2023m. GPT4Rec: A generative framework for personalized recommendation and user interests interpretation. arXiv preprint arXiv:2304.03879 (2023).
Li et al. (2023f) Lei Li, Yongfeng Zhang, and Li Chen. 2023f. Personalized prompt learning for explainable recommendation. ACM Transactions on Information Systems 41, 4 (2023), 1–26.
Li et al. (2023g) Lei Li, Yongfeng Zhang, and Li Chen. 2023g. Prompt Distillation for Efficient LLM-based Recommendation. In CIKM. ACM, 1348–1357.
Li et al. (2023l) Lei Li, Yongfeng Zhang, Dugang Liu, and Li Chen. 2023l. Large Language Models for Generative Recommendation: A Survey and Visionary Discussions. arXiv preprint arXiv:2309.01157 (2023).
Li et al. (2023c) Ruyu Li, Wenhao Deng, Yu Cheng, Zheng Yuan, Jiaqi Zhang, and Fajie Yuan. 2023c. Exploring the Upper Limits of Text-Based Collaborative Filtering Using Large Language Models: Discoveries and Insights. arXiv preprint arXiv:2305.11700 (2023).
Li et al. (2023a) Xiangyang Li, Bo Chen, Lu Hou, and Ruiming Tang. 2023a. CTRL: Connect Tabular and Language Model for CTR Prediction. arXiv preprint arXiv:2306.02841 (2023).
Li et al. (2023b) Xinhang Li, Chong Chen, Xiangyu Zhao, Yong Zhang, and Chunxiao Xing. 2023b. E4SRec: An Elegant Effective Efficient Extensible Solution of Large Language Models for Sequential Recommendation. arXiv:2312.02443 [cs.IR]
Li et al. (2023h) Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023h. Exploring Fine-tuning ChatGPT for News Recommendation. arXiv preprint arXiv:2311.05850 (2023).
Li et al. (2023i) Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023i. Exploring Fine-tuning ChatGPT for News Recommendation. arXiv:2311.05850 [cs.IR]
Li et al. (2023j) Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023j. PBNR: Prompt-based News Recommender System. arXiv preprint arXiv:2304.07862 (2023).
Li et al. (2023k) Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023k. A Preliminary Study of ChatGPT on News Recommendation: Personalization, Provider Fairness, Fake News. arXiv preprint arXiv:2306.10702 (2023).
Liang et al. (2018) Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 world wide web conference. 689–698.
Liao et al. (2023) Jiayi Liao, Sihang Li, Zhengyi Yang, Jiancan Wu, Yancheng Yuan, Xiang Wang, and Xiangnan He. 2023. LLaRA: Aligning Large Language Models with Sequential Recommenders. arXiv:2312.02445 [cs.IR]
Lin et al. (2023a) Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, et al. 2023a. How Can Recommender Systems Benefit from Large Language Models: A Survey. arXiv preprint arXiv:2306.05817 (2023).
Lin et al. (2023b) Jianghao Lin, Rong Shan, Chenxu Zhu, Kounianhua Du, Bo Chen, Shigang Quan, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023b. ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation. arXiv preprint arXiv:2308.11131 (2023).
Lin et al. (2023c) Xinyu Lin, Wenjie Wang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2023c. A Multi-facet Paradigm to Bridge Large Language Model and Recommendation. arXiv preprint arXiv:2310.06491 (2023).
Lin et al. (2022) Zihan Lin, Changxin Tian, Yupeng Hou, and Wayne Xin Zhao. 2022. Improving graph collaborative filtering with neighborhood-enriched contrastive learning. In Proceedings of the ACM Web Conference 2022. 2320–2329.
Liu et al. (2023e) Dairui Liu, Boming Yang, Honghui Du, Derek Greene, Aonghus Lawlor, Ruihai Dong, and Irene Li. 2023e. RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models. arXiv:2312.10463 [cs.IR]
Liu et al. (2023b) Fan Liu, Yaqi Liu, Zhiyong Cheng, Liqiang Nie, and Mohan Kankanhalli. 2023b. Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models. arXiv:2312.16275 [cs.IR]
Liu et al. (2022) Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. 2022. Generated Knowledge Prompting for Commonsense Reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3154–3169.
Liu et al. (2023c) Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023c. Is ChatGPT a good recommender? A preliminary study. arXiv preprint arXiv:2304.10149 (2023).
Liu et al. (2023d) Junling Liu, Chao Liu, Peilin Zhou, Qichen Ye, Dading Chong, Kang Zhou, Yueqi Xie, Yuwei Cao, Shoujin Wang, Chenyu You, et al. 2023d. LLMRec: Benchmarking large language models on recommendation task. arXiv preprint arXiv:2308.12241 (2023).
Liu et al. (2023f) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023f. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
Liu et al. (2023a) Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023a. A First Look at LLM-Powered Generative News Recommendation. arXiv preprint arXiv:2305.06566 (2023).
Luo et al. (2023b) Sichun Luo, Bowei He, Haohan Zhao, Yinya Huang, Aojun Zhou, Zongpeng Li, Yuanzhang Xiao, Mingjie Zhan, and Linqi Song. 2023b. RecRanker: Instruction Tuning Large Language Model as Ranker for Top-k Recommendation. arXiv:2312.16018 [cs.IR]
Luo et al. (2023a) Yucong Luo, Mingyue Cheng, Hao Zhang, Junyu Lu, Qi Liu, and Enhong Chen. 2023a. Unlocking the Potential of Large Language Models for Explainable Recommendations. arXiv:2312.15661 [cs.IR]
Ma et al. (2023) Tianhui Ma, Yuan Cheng, Hengshu Zhu, and Hui Xiong. 2023. Large Language Models are Not Stable Recommender Systems. arXiv:2312.15746 [cs.IR]
Mysore et al. (2023) Sheshera Mysore, Andrew McCallum, and Hamed Zamani. 2023. Large Language Model Augmented Narrative Driven Recommendations. In RecSys. ACM, 777–783.
Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 188–197.
Pan et al. (2022) Xingyu Pan, Yushuo Chen, Changxin Tian, Zihan Lin, Jinpeng Wang, He Hu, and Wayne Xin Zhao. 2022. Multimodal meta-learning for cold-start sequential recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3421–3430.
Petrov and Macdonald (2023) Aleksandr V Petrov and Craig Macdonald. 2023. Generative Sequential Recommendation with GPTRec. arXiv preprint arXiv:2306.11114 (2023).
Qiu et al. (2023) Junyan Qiu, Haitao Wang, Zhaolin Hong, Yiping Yang, Qiang Liu, and Xingxing Wang. 2023. ControlRec: Bridging the Semantic Gap between Language Model and Personalized Recommendation. arXiv:2311.16441 [cs.IR]
Qiu et al. (2021) Zhaopeng Qiu, Xian Wu, Jingyue Gao, and Wei Fan. 2021. U-BERT: Pre-training user representations for improved recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4320–4327.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
Ren et al. (2023) Xubin Ren, Wei Wei, Lianghao Xia, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2023. Representation Learning with Large Language Models for Recommendation. arXiv preprint arXiv:2310.15950 (2023).
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International conference on data mining. IEEE, 995–1000.
Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. 452–461.
Sanner et al. (2023) Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, and Lucas Dixon. 2023. Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences. In RecSys. ACM, 890–896.
Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. 285–295.
Shen et al. (2023) Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. 2023. In ChatGPT we trust? Measuring and characterizing the reliability of ChatGPT. arXiv preprint arXiv:2304.08979 (2023).
Shi et al. (2023) Tianhao Shi, Yang Zhang, Zhijian Xu, Chong Chen, Fuli Feng, Xiangnan He, and Qi Tian. 2023. Preliminary Study on Incremental Learning for Large Language Model-based Recommender Systems. arXiv:2312.15599 [cs.IR]
Shu et al. (2023) Yubo Shu, Hansu Gu, Peng Zhang, Haonan Zhang, Tun Lu, Dongsheng Li, and Ning Gu. 2023. RAH! RecSys-Assistant-Human: A Human-Central Recommendation Framework with Large Language Models. arXiv preprint arXiv:2308.09904 (2023).
Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence 2009 (2009).
Sun et al. (2024) Wenqi Sun, Ruobing Xie, Junjie Zhang, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2024. Distillation is All You Need for Practically Using Different Pre-trained Recommendation Models. arXiv:2401.00797 [cs.IR]
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Thorat et al. (2015) Poonam B Thorat, Rajeshwari M Goudar, and Sunita Barve. 2015. Survey on collaborative filtering, content-based filtering and hybrid recommendation system. International Journal of Computer Applications 110, 4 (2015), 31–36.
Tian et al. (2023) Zhen Tian, Ting Bai, Zibin Zhang, Zhiyuan Xu, Kangyi Lin, Ji-Rong Wen, and Wayne Xin Zhao. 2023. Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 715–723.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Wang et al. (2023b) Dui Wang, Xiangyu Hou, Xiaohui Yang, Bo Zhang, Renbing Chen, and Daiyue Xue. 2023b. Multiple Key-value Strategy in Recommendation Systems Incorporating Large Language Model. arXiv preprint arXiv:2310.16409 (2023).
Wang and Lim (2023) Lei Wang and Ee-Peng Lim. 2023. Zero-Shot Next-Item Recommendation using Large Pretrained Language Models. arXiv preprint arXiv:2304.03153 (2023).
Wang et al. (2023g) Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023g. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432 (2023).
Wang et al. (2023i) Lei Wang, Jingsen Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, and Ji-Rong Wen. 2023i. RecAgent: A Novel Simulation Paradigm for Recommender Systems. arXiv preprint arXiv:2306.02552 (2023).
Wang et al. (2023d) Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023d. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926 (2023).
Wang et al. (2023e) Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2023e. Generative recommendation: Towards next-generation recommender paradigm. arXiv preprint arXiv:2304.03516 (2023).
Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval. 165–174.
Wang et al. (2023h) Xiaolei Wang, Xinyu Tang, Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. 2023h. Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models. In EMNLP. Association for Computational Linguistics, 10052–10065.
Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
Wang et al. (2023a) Yan Wang, Zhixuan Chu, Xin Ouyang, Simeng Wang, Hongyan Hao, Yue Shen, Jinjie Gu, Siqiao Xue, James Y Zhang, Qing Cui, et al. 2023a. Enhancing recommender systems with large language model reasoning graphs. arXiv preprint arXiv:2308.10835 (2023).
Wang et al. (2023c) Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. 2023c. RecMind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296 (2023).
Wang et al. (2023f) Yu Wang, Zhiwei Liu, Jianguo Zhang, Weiran Yao, Shelby Heinecke, and Philip S. Yu. 2023f. DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation. arXiv:2312.11336 [cs.IR]
Wang (2023) Zhoumeng Wang. 2023. Empowering Few-Shot Recommender Systems with Large Language Models – Enhanced Representations. arXiv:2312.13557 [cs.IR]
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
Wei et al. (2023) Wei Wei, Xubin Ren, Jiabin Tang, Qinyong Wang, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang. 2023. LLMRec: Large Language Models with Graph Augmentation for Recommendation. arXiv preprint arXiv:2311.00423 (2023).
Weidinger et al. (2021) Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 (2021).
Wu et al. (2021b) Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2021b. Empowering news recommendation with pre-trained language models. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1652–1656.
Wu et al. (2021a) Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021a. Self-supervised graph learning for recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 726–735.
Wu et al. (2023b) Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2023b. A Survey on Large Language Models for Recommendation. arXiv preprint arXiv:2305.19860 (2023).
Wu et al. (2022) Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui. 2022. Graph neural networks in recommender systems: a survey. Comput. Surveys 55, 5 (2022), 1–37.
Wu et al. (2023a) Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. 2023a. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica 10, 5 (2023), 1122–1136.
Xi et al. (2023) Yunjia Xi, Weiwen Liu, Jianghao Lin, Jieming Zhu, Bo Chen, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023. Towards Open-World Recommendation with Knowledge Augmentation from Large Language Models. arXiv preprint arXiv:2306.10933 (2023).
Xiao et al. (2023) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453 (2023).
Xu et al. (2023) Lanling Xu, Zhen Tian, Gaowei Zhang, Junjie Zhang, Lei Wang, Bowen Zheng, Yifan Li, Jiakai Tang, Zeyu Zhang, Yupeng Hou, et al. 2023. Towards a More User-Friendly and Easy-to-Use Benchmark Library for Recommender Systems. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2837–2847.
Yang et al. (2023a) Fan Yang, Zheng Chen, Ziyan Jiang, Eunah Cho, Xiaojiang Huang, and Yanbin Lu. 2023a. PALR: Personalization Aware LLMs for Recommendation. arXiv e-prints (2023), arXiv–2305.
Yang et al. (2023c) Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023c. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023).
Yang et al. (2023b) Zhengyi Yang, Jiancan Wu, Yanchen Luo, Jizhi Zhang, Yancheng Yuan, An Zhang, Xiang Wang, and Xiangnan He. 2023b. Large Language Model Can Interpret Latent Space of Sequential Recommender. arXiv preprint arXiv:2310.20487 (2023).
Yao et al. (2023) Jing Yao, Wei Xu, Jianxun Lian, Xiting Wang, Xiaoyuan Yi, and Xing Xie. 2023. Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations. arXiv preprint arXiv:2311.10779 (2023).
Yuan et al. (2023a) Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023a. Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited. In SIGIR. ACM, 2639–2649.
Yuan et al. (2023b) Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023b. Where to go next for recommender systems? id-vs. modality-based recommender models revisited. arXiv preprint arXiv:2303.13835 (2023).
Yue et al. (2023) Zhenrui Yue, Sara Rabhi, Gabriel de Souza Pereira Moreira, Dong Wang, and Even Oldridge. 2023. LlamaRec: Two-Stage Recommendation using Large Language Models for Ranking. arXiv preprint arXiv:2311.02089 (2023).
Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 (2022).
Zhang et al. (2023e) An Zhang, Leheng Sheng, Yuxin Chen, Hao Li, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2023e. On Generative Agents in Recommendation. arXiv preprint arXiv:2310.10108 (2023).
Zhang et al. (2023a) Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023a. Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation. In RecSys. ACM, 993–999.
Zhang et al. (2023c) Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023c. AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems. arXiv preprint arXiv:2310.09233 (2023).
Zhang et al. (2023f) Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023f. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001 (2023).
Zhang et al. (2023d) Wenxuan Zhang, Hongzhi Liu, Yingpeng Du, Chen Zhu, Yang Song, Hengshu Zhu, and Zhonghai Wu. 2023d. Bridging the Information Gap Between Domain-Specific Model and General LLM for Personalized Recommendation. arXiv preprint arXiv:2311.03778 (2023).
Zhang et al. (2021) Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang. 2021. Language models as recommender systems: Evaluations and limitations. (2021).
Zhang et al. (2023b) Yang Zhang, Fuli Feng, Jizhi Zhang, Keqin Bao, Qifan Wang, and Xiangnan He. 2023b. CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation. arXiv preprint arXiv:2310.19488 (2023).
Zhang et al. (2023g) Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2023g. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2023).
Zhao et al. (2022) Wayne Xin Zhao, Yupeng Hou, Xingyu Pan, Chen Yang, Zeyu Zhang, Zihan Lin, Jingsen Zhang, Shuqing Bian, Jiakai Tang, Wenqi Sun, et al. 2022. RecBole 2.0: Towards a More Up-to-Date Recommendation Library. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4722–4726.
Zhao et al. (2021) Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. 2021. RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms. In CIKM. ACM, 4653–4664.
Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
Zheng et al. (2023) Bowen Zheng, Yupeng Hou, Hongyu Lu, Yu Chen, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation. arXiv preprint arXiv:2311.09049 (2023).
Zhiyuli et al. (2023) Aakas Zhiyuli, Yanfang Chen, Xuan Zhang, and Xun Liang. 2023. BookGPT: A General Framework for Book Recommendation Empowered by Large Language Model. arXiv preprint arXiv:2305.15673 (2023).
Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, et al. 2022. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations.
Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.
Zhu et al. (2023) Yaochen Zhu, Liang Wu, Qi Guo, Liangjie Hong, and Jundong Li. 2023. Collaborative Large Language Model for Recommender Systems. arXiv preprint arXiv:2311.01343 (2023).