ReLLa: Retrieval-enhanced Large Language Models for Lifelong
Sequential Behavior Comprehension in Recommendation

Jianghao Lin (Shanghai Jiao Tong University, Shanghai, China), Rong Shan (Shanghai Jiao Tong University, Shanghai, China), Chenxu Zhu (Huawei Noah’s Ark Lab, Shenzhen, China), Kounianhua Du (Shanghai Jiao Tong University, Shanghai, China), Bo Chen (Huawei Noah’s Ark Lab, Shenzhen, China), Shigang Quan (Shanghai Jiao Tong University, Shanghai, China), Ruiming Tang (Huawei Noah’s Ark Lab, Shenzhen, China), Yong Yu (Shanghai Jiao Tong University, Shanghai, China) and Weinan Zhang (Shanghai Jiao Tong University, Shanghai, China)

(2024)

Abstract.

With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, LLM-enhanced recommender systems have received much attention and are being actively explored. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of a long user behavior sequence, even if the length of the context is far from reaching the context limitation of LLMs. To address this issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa), for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on three real-world public datasets to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension. Notably, with less than 10% of the training samples, few-shot ReLLa can outperform traditional CTR models that are trained on the entire training set (e.g., DCNv2, DIN, SIM).
The code is available (PyTorch version: https://github.com/LaVieEnRose365/ReLLa; MindSpore version: https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/ReLLa).

Large Language Models; Recommender Systems; User Modeling

Published in: Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore. Copyright: ACM licensed. DOI: 10.1145/3589334.3645467. ISBN: 979-8-4007-0171-9/24/05. CCS Concepts: Information systems → Recommender systems.

1. Introduction

Figure 1. The illustration of the lifelong sequential behavior incomprehension problem for LLMs. We report the AUC performance of SIM and Vicuna-13B on the MovieLens-1M dataset. While SIM enjoys steady performance improvement as the length of the behavior sequence $K$ grows, Vicuna-13B peaks at $K=15$ and fails to extract useful information from longer sequences (*i.e.*, $K>15$).

Recommender systems play a vital role in various online applications to alleviate the information overload problem and satisfy users’ information needs (Guo et al., 2017; Xi et al., 2023a, b). Meanwhile, large language models (LLMs) have flourished in the natural language processing (NLP) domain, showing impressive capacities in generating human-like texts for a wide range of tasks (Brown et al., 2020; Touvron et al., 2023; Wang et al., 2023; Zhang et al., 2023c). Consequently, recent works have started to explore the potential of LLMs for recommender systems (Lin et al., 2023a; Hou et al., 2023b; Bao et al., 2023). They adopt LLMs directly for various recommendation tasks (e.g., listwise ranking, pointwise scoring), and find that large language models exhibit promising performance in zero-shot and few-shot settings for recommendation (Zhang et al., 2023b; Bao et al., 2023).

In this paper, we focus on adapting and empowering a pure large language model for recommendation tasks in zero-shot and few-shot settings. First, we identify the lifelong sequential behavior incomprehension problem, i.e., LLMs fail to extract useful information from a textual context of a long user behavior sequence for recommendation tasks, even if the length of the context is far from reaching the context limitation of LLMs. This problem is shown in Figure 1, where Vicuna-13B (Chiang et al., 2023; Touvron et al., 2023) is a popular open-source large language model with a context window of 2048 tokens. As we can observe, the traditional recommendation model (i.e., SIM) enjoys steady performance gains as the length of the involved user sequence $K$ grows. However, the performance of Vicuna-13B peaks at length $K=15$ and starts to decrease with longer behavior sequences ($K>15$), even though the number of involved tokens is far less than the context window limitation (i.e., 2048 tokens). In contrast, in common NLP tasks, LLMs usually exhibit exceptional performance when given a context of similar length (around 600+ tokens). Therefore, we argue that this incomprehension problem on long user behavior sequences is specific to LLMs in recommendation domains, where inferring the user’s preference towards a certain candidate item based on the given user profile and behavior history is a rather difficult reasoning task.

To address the lifelong sequential behavior incomprehension problem, we propose a novel framework to develop Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we propose to conduct semantic user behavior retrieval (SUBR) to replace the simply truncated top-$K$ recent behaviors with the top-$K$ semantically relevant behaviors towards the target item. In this way, we improve the quality of data samples and reduce the difficulty for LLMs to extract useful information from user behavior sequences, thereby alleviating the incomprehension problem. For few-shot recommendation, apart from applying SUBR to improve the data quality of samples, we propose to perform retrieval-enhanced instruction tuning (ReiT) to further promote the ability of LLMs to handle inputs with long behavior sequences. We apply SUBR on training samples as a data augmentation technique to obtain a mixed training dataset of both original and retrieval-enhanced training data samples, which increases the robustness and generalization ability of LLMs. More surprisingly, with only few-shot training samples (e.g., 8,192 data instances in the MovieLens-25M dataset), ReLLa can outperform full-shot traditional recommendation models (e.g., DCNv2 (Wang et al., 2021), DIN (Zhou et al., 2018), and SIM (Pi et al., 2020)) that are trained with the entire training set (e.g., nearly 20M samples in the MovieLens-25M dataset).

The main contributions of this paper are threefold:

• To the best of our knowledge, we are the first to identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation, where LLMs generally fail to comprehend a textual context of a long user behavior sequence, even if the length of the context is far from reaching the context limitation.
• We propose a novel ReLLa (Retrieval-enhanced Large Language Models) framework to mitigate the incomprehension problem of LLMs on long user behavior sequences. We design semantic user behavior retrieval (SUBR) to improve the quality of data samples for zero-shot recommendation, and further propose retrieval-enhanced instruction tuning (ReiT) to promote the few-shot recommendation performance with a mixture of original and retrieval-enhanced training samples.
• Extensive experiments on three real-world public datasets validate the effectiveness of our method compared with existing baselines. Note that the baseline models are trained in full-shot settings with the entire training set, while ReLLa is only trained with few-shot samples.

2. Preliminaries

In this paper, we focus on the click-through rate (CTR) prediction, which serves as the core component in recommender systems to estimate a user’s click probability towards a target item given a certain context (Lin et al., 2023b; Guo et al., 2017). The training dataset for CTR prediction is denoted as $\{(x_{i},y_{i})\}_{i=1}^{N}$ , where $N$ is the number of data samples (i.e., $N$ -shot). When adapting a pure large language model for such a pointwise scoring task, we need to clarify the following three key aspects: (1) what is the definition of zero-shot and few-shot recommendations, (2) how to formulate the textual input-output pairs, and (3) how to do pointwise scoring with LLMs.

2.1. Zero-shot and Few-shot Recommendations

Zero-shot recommendation implies that a model is directly employed for the target recommendation task without any tuning on the in-domain training data. Apparently, traditional recommendation models are incapable of accomplishing zero-shot recommendation tasks, since they are randomly initialized. However, LLMs possess a vast volume of open-world knowledge and logical reasoning abilities, which enable them to infer the user’s preference towards a certain target item based on the profile of user/item.

Few-shot recommendation refers to low-resource scenarios with $N$ training data samples, where $N$ denotes the number of shots and is relatively small. This setting demands high data efficiency: an algorithm must fully exploit the limited number of training samples to achieve better recommendation performance.

Extending from the definition of few-shot recommendation, we can therefore define full-shot recommendation as the setting where we train the model based on the entire training set.

2.2. Textual Input-Output Pair Formulation

For LLMs, we need to convert each data sample $x_{i}$ into textual sentences $x_{i}^{text}$ via hard prompt templates. Similarly, the binary label $y_{i}\in\{0,1\}$ is transformed into a pair of binary key answer words $y_{i}^{text}\in\{\text{``Yes''},\text{``No''}\}$ . We give an illustrative example of the input-output pair $(x_{i}^{text},y_{i}^{text})$ in Figure 2, where $x_{i}^{text}$ contains the descriptive texts for user profile, user behavior sequence, target item and task description, respectively.

Notably, the predominant factor that determines the length of the context is the user behavior sequence, the length of which can vary from tens to hundreds. For each input $x_{i}$, we truncate the user behavior sequence to length $K$. For example, the length of the behavior sequence in Figure 2 is $K=4$. While the common sequential CTR prediction settings usually truncate and adopt the most recent $K$ behaviors, ReLLa proposes to conduct semantic user behavior retrieval to construct textual inputs with the $K$ behaviors most relevant to the target item.
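
As a rough illustration of this formulation, the following sketch assembles a textual input from the four parts named above and maps a binary label to its answer word. The template wording, the function names `build_prompt` and `label_to_text`, and the movie-style phrasing are all hypothetical; the actual templates are those given in Figure 2 and Appendix A. The sketch uses the common most-recent-$K$ truncation.

```python
from typing import List


def build_prompt(user_profile: str, behaviors: List[str], target_item: str, k: int) -> str:
    """Assemble a textual input x_i^text (hypothetical template wording)."""
    # Behaviors are assumed to be ordered oldest-first; keep the K most recent ones.
    recent = behaviors[-k:] if k > 0 else []
    history = "\n".join(f"{idx}. {title}" for idx, title in enumerate(recent, start=1))
    return (
        f"User profile: {user_profile}\n"
        f"The user interacted with the following items, in order:\n{history}\n"
        f"Target item: {target_item}\n"
        "Will the user like the target item? Answer Yes or No."
    )


def label_to_text(y: int) -> str:
    """Map the binary label y_i in {0, 1} to the key answer word."""
    return "Yes" if y == 1 else "No"
```

Because the behavior list dominates the prompt length, changing `k` is the main lever on the number of tokens fed to the LLM.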

2.3. Pointwise Scoring with LLMs

The large language model takes as input the discrete tokens of $x_{i}^{text}$, and generates the next token $\hat{y}_{i}^{text}$ as the output, the process of which can be formulated as follows:

$$
\begin{aligned}
s_{i} &= \operatorname{LLM}(x_{i}^{text}) \in \mathbb{R}^{V}, \\
p_{i} &= \operatorname{Softmax}(s_{i}) \in \mathbb{R}^{V}, \\
\hat{y}_{i}^{text} &\sim p_{i},
\end{aligned}
\tag{1}
$$

where $V$ is the vocabulary size, and $\hat{y}_{i}^{text}$ is the next predicted token sampled from the probability distribution $p_{i}$ .

However, CTR prediction requires the model to do pointwise scoring, and the output should be a floating-point number $\hat{y}_{i}\in[0,1]$ instead of a discrete token $\hat{y}_{i}^{text}$. Therefore, following previous works (Bao et al., 2023; Zhang and Wang, 2023), we intercept the estimated scores $s_{i}\in\mathbb{R}^{V}$, and conduct a bidimensional softmax over the corresponding scores of the binary key answer words. Suppose the vocabulary indices for “Yes” and “No” are $a$ and $b$, respectively. The pointwise scoring of LLMs for CTR prediction can be written as:

$$
\hat{y}_{i}=\frac{\exp(s_{i,a})}{\exp(s_{i,a})+\exp(s_{i,b})}\in(0,1). \tag{2}
$$

It is worth noting that such an estimated click-through rate $\hat{y}_{i}$ is only leveraged for evaluation on the testing set. We preserve the common instruction tuning and causal language modeling paradigm for LLMs if training is involved.
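
The bidimensional softmax of Eq. (2) can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the function name `ctr_from_logits` is hypothetical, and the scores are assumed to be available as a plain sequence of floats of length $V$. Subtracting the larger of the two scores before exponentiation is a standard numerical-stability step and does not change the result.

```python
import math
from typing import Sequence


def ctr_from_logits(scores: Sequence[float], yes_idx: int, no_idx: int) -> float:
    """Two-way softmax over the scores of the "Yes" and "No" tokens (Eq. 2)."""
    s_yes, s_no = scores[yes_idx], scores[no_idx]
    shift = max(s_yes, s_no)  # avoids overflow in exp for large scores
    e_yes = math.exp(s_yes - shift)
    e_no = math.exp(s_no - shift)
    return e_yes / (e_yes + e_no)
```

The result equals the sigmoid of $s_{i,a}-s_{i,b}$, so only the gap between the two answer-word scores matters; all other vocabulary entries are ignored.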

3. Methodology

In this section, we introduce the ReLLa (Retrieval-enhanced Large Language Models) framework in detail.

3.1. Overview of ReLLa

In the ReLLa framework, we develop two key techniques for LLMs in zero-shot and few-shot recommendations, respectively.

For zero-shot recommendation, as illustrated in Figure 3, we propose to conduct semantic user behavior retrieval (SUBR) to improve the data quality of data samples. We first leverage the large language model to obtain the semantic vectors for each item. Then, for each textual data sample $x_{i}^{text}$ , we retrieve the most semantically relevant $K$ behaviors, which can substitute the original most recent $K$ behaviors.

For few-shot recommendation, as shown in Figure 5, we propose to perform retrieval-enhanced instruction tuning (ReiT) to promote the ability of LLMs to extract useful information from long behavior sequences. Notably, the semantic user behavior retrieval (SUBR) is adopted as the data augmentation technique to form the mixed training dataset. The mixture of both original and retrieval-enhanced data samples introduces more variety and patterns in the training set, thus increasing the robustness and generalization ability of LLMs for lifelong sequential behavior comprehension.

Although ReLLa is tuned in few-shot settings, we would like to again emphasize that other recommendation baseline models are trained in full-shot settings with the entire training set.

3.2. Semantic User Behavior Retrieval

In zero-shot settings, the parameters of LLMs cannot be tuned according to the in-domain training samples. Hence, as shown in Figure 3, semantic user behavior retrieval (SUBR) aims to improve the quality of each sample by replacing the simply truncated most recent $K$ behaviors with the $K$ behaviors most semantically relevant to the target item. As suggested in previous works (Qin et al., 2020; Pi et al., 2020), the retrieved user behaviors can denoise the user history and convey clearer and more essential user interests for the target item, while preserving the original length of the user sequence as the model input.

Firstly, we conduct semantic item encoding to obtain the semantic vector for each item. For the $t$-th item in the pool, a descriptive text is constructed via a hard prompt template (an example is given in Figure 4), and is then fed through the LLM. We perform average pooling over all the hidden states from the last layer of the LLM, resulting in a vector $u_{t}\in\mathbb{R}^{D}$, where $D$ is the hidden size of the LLM (e.g., 4096 for Vicuna-7B, and 5120 for Vicuna-13B). A principal component analysis (PCA) (Shlens, 2014) module is further employed for both dimension reduction and denoising purposes, yielding the final semantic representation $v_{t}\in\mathbb{R}^{d}$, where we set $d=512$. Now we can measure the semantic relevance between each pair of items via the cosine similarity between their corresponding semantic representations.

Next, we can apply semantic user behavior retrieval on each testing sample to replace the original truncated top-$K$ recent behaviors with the top-$K$ semantically relevant behaviors towards the target item. In this way, we obtain a parallel retrieval-enhanced testing dataset with higher data quality, while keeping the length of the input context roughly unchanged. Therefore, SUBR can improve the zero-shot recommendation performance, and mitigate the incomprehension problem on long user behavior sequences.
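
The retrieval step can be sketched as follows, assuming the semantic representations $v_{t}$ have already been computed (LLM encoding, average pooling and PCA are omitted here). The names `cosine` and `retrieve_top_k` and the dictionary `item_vecs` are hypothetical. Returning the retrieved behaviors in their original chronological order is an assumption of this sketch, not something stated in the text.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def retrieve_top_k(
    history: List[str],
    target: str,
    item_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Pick the K behaviors most similar to the target item."""
    scored = [
        (cosine(item_vecs[item], item_vecs[target]), pos)
        for pos, item in enumerate(history)
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Assumption: keep the retrieved behaviors in chronological order.
    kept_positions = sorted(pos for _, pos in top)
    return [history[pos] for pos in kept_positions]
```

The output has the same length $K$ as a most-recent truncation would, so the prompt length stays roughly unchanged while the content becomes more relevant to the target item.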

3.3. Retrieval-enhanced Instruction Tuning

As for few-shot recommendation, we denote the training dataset as $\{(x_{i}^{text},y_{i}^{text})\}_{i=1}^{N}$, where $N$ is the number of shots (i.e., training samples). While previous works (Bao et al., 2023; Zhang et al., 2023b) directly employ instruction tuning for LLMs over the converted textual input-output pairs, we argue that simple instruction tuning could potentially expose large language models to risks of overfitting and catastrophic forgetting on a limited number of training samples (Korbak et al., 2022; Ramasesh et al., 2021).

To this end, we propose a novel retrieval-enhanced instruction tuning (ReiT), where semantic user behavior retrieval (SUBR) is adopted as the data augmentation technique to construct a mixed training dataset with enriched user behavior patterns. As shown in Figure 5, we apply SUBR on each training sample to obtain its retrieval-enhanced counterpart $\tilde{x}_{i}^{text}$. Next, we merge the original and retrieval-enhanced data instances to construct a mixed training dataset with $2N$ samples in total. Finally, we conduct instruction tuning for LLMs on the mixed training data. The pattern enrichment brought by SUBR can regularize the large language model and prevent it from overfitting, thus promoting its robustness and generalization ability to effectively extract essential knowledge from a long user behavior sequence of length $K$.
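
The construction of the mixed dataset can be sketched as below. This is only an illustration under simplifying assumptions: the function name `build_mixed_dataset` is hypothetical, and `retrieve` stands in for whatever routine produces the retrieval-enhanced text $\tilde{x}_{i}^{text}$ for a given sample. Shuffling the merged data is also an assumption of the sketch.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (x_text, y_text)


def build_mixed_dataset(
    original: List[Sample],
    retrieve: Callable[[str], str],
    seed: int = 0,
) -> List[Sample]:
    """Merge N original samples with their N retrieval-enhanced counterparts."""
    # The label is unchanged: only the behavior sequence in the input differs.
    augmented = [(retrieve(x_text), y_text) for x_text, y_text in original]
    mixed = original + augmented  # 2N samples in total
    random.Random(seed).shuffle(mixed)  # assumption: shuffle before tuning
    return mixed
```

Each user-item pair therefore appears twice with the same label but with two different views of the behavior history.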

We leverage the causal language modeling objective for instruction tuning to retain the original model structure:

$$
\max_{\Theta}\sum_{(x,y)\in\mathcal{M}}\sum_{j=1}^{|y|}\log P_{\Theta}(y_{j}\mid x,y_{<j}), \tag{3}
$$

where $\Theta$ is the parameter of LLM, $\mathcal{M}$ is the mixed training dataset with total $2N$ data samples, $y_{j}$ is the $j$ -th token of the textual output $y$ , and $y_{<j}$ denotes the tokens before $y_{j}$ . There is no randomly initialized prediction layer appended upon LLM for CTR prediction with binary cross-entropy (BCE) loss. The CTR estimation method for pointwise scoring with LLMs discussed in Section 2.3 is only used for evaluation on the testing set.
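
To make the structure of Eq. (3) concrete, the sketch below evaluates the objective from per-token log-probabilities. It assumes, hypothetically, that each sample is given as a pair of the prompt length and the list of log-probabilities $\log P_{\Theta}(\cdot)$ of every token in the concatenated prompt-plus-answer sequence; the function name `reit_objective` is illustrative. The point is that only the answer tokens $y_{j}$ contribute, while the prompt tokens of $x$ serve purely as conditioning context.

```python
from typing import List, Sequence, Tuple


def reit_objective(samples: Sequence[Tuple[int, List[float]]]) -> float:
    """Sum of answer-token log-likelihoods over the mixed dataset (Eq. 3).

    Each sample is (prompt_len, token_logprobs), where token_logprobs[j]
    is the log-probability of token j given all preceding tokens.
    """
    total = 0.0
    for prompt_len, token_logprobs in samples:
        # Skip the prompt positions: they condition the prediction but are not scored.
        total += sum(token_logprobs[prompt_len:])
    return total
```

In practice one would minimize the negative of this quantity with a gradient-based optimizer; the sketch only shows which terms enter the sum.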

While we maintain a mixed training dataset for instruction tuning, the testing set contains pure retrieval-enhanced data samples generated by SUBR, which is the same as zero-shot recommendation as described in Section 3.2. Moreover, we provide further discussion about ReiT to address readers’ possible concerns as follows:

• Will ReiT cause inconsistency between the training and testing data? Data augmentation is a common regularization technique, especially for low-resource few-shot settings in computer vision (CV) (Berthelot et al., 2019; Zhang et al., 2017) or natural language processing (NLP) (Li et al., 2022; Feng et al., 2021). The inconsistency would not exist, as long as the augmentation algorithm is sound and reasonable.
• Which factor actually contributes to the performance improvement of ReiT: the doubled training samples, or the pattern enrichment? Both factors can lead to the final performance enhancement, but we argue that the pattern enrichment as regularization is the more important factor for model robustness. Empirical studies are provided in Section 4.5 to ablate and decouple these two factors.

4. Experiment

In this section, we conduct extensive experiments to answer the following research questions:

RQ1: How does ReLLa perform compared to existing baselines?
RQ2: Does ReLLa promote the lifelong sequential behavior comprehension ability of LLMs for recommendation tasks?
RQ3: How does the number of shots $N$ affect the performance?
RQ4: What are the influences of different components of ReLLa?
RQ5: How does ReLLa help LLMs to better comprehend the user behavior sequence?

Due to the page limitation, we further provide additional experiments in Appendix D to verify the following core points:

• The universality of the lifelong sequential behavior incomprehension problem and the generalization of our proposed ReLLa.
• Analysis of the model parameters and inference time.
• Ablation on PCA dimensionality and distance metrics for SUBR.
• Analysis of potential reasons for the incomprehension problem.

4.1. Experiment Setup

4.1.1. Datasets

We conduct experiments on three real-world datasets, i.e., BookCrossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/), MovieLens-1M (https://grouplens.org/datasets/movielens/1m/), and MovieLens-25M (https://grouplens.org/datasets/movielens/25m/). We show the dataset statistics in Table 1 and give detailed data preprocessing information in Appendix B due to page limitations.

Table 1. The dataset statistics.

Dataset	#Users	#Items	#Samples	#Fields	#Features
BookCrossing	278,858	271,375	17,714	10	912,279
MovieLens-1M	6,040	3,706	970,009	10	16,944
MovieLens-25M	162,541	59,047	25,000,095	6	280,576

4.1.2. Evaluation Metrics

To evaluate the performance of our methods, we leverage AUC (area under the ROC curve), Log Loss (binary cross-entropy loss) and ACC (accuracy score) as the evaluation metrics. In CTR prediction, slightly higher AUC or lower Log Loss (e.g., 0.001) can be regarded as significant improvement (Lian et al., 2018; Wang et al., 2021).
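
For reference, the three metrics can be computed as in the following minimal sketch. The function names are illustrative, the pairwise AUC is written for clarity rather than speed, and the decision threshold of 0.5 for ACC is an assumption rather than a setting stated in the text.

```python
import math
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(labels: Sequence[int], scores: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy, with scores clipped away from 0 and 1."""
    total = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(labels)


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return hits / len(labels)
```

AUC depends only on the ranking of the scores, whereas Log Loss and ACC depend on their calibrated values, which is why the two kinds of metric can move in different directions.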

4.1.3. Baseline Models

The CTR baseline models can be mainly classified into two categories: (1) traditional CTR models that take one-hot encoded IDs as inputs, and (2) LM-based models that incorporate pretrained language models and formulate CTR prediction as either text classification or sequence-to-sequence problem.

Traditional CTR models can be further categorized into (1) feature interaction models, and (2) user behavior models. We select DeepFM (Guo et al., 2017), AutoInt (Song et al., 2019), and DCNv2 (Wang et al., 2021) as representative feature interaction models, and choose GRU4Rec (Hidasi et al., 2016), Caser (Tang and Wang, 2018), SASRec (Kang and McAuley, 2018), DIN (Zhou et al., 2018), and SIM (Pi et al., 2020) as representative user behavior models. We apply average pooling over users’ historical behaviors, and regard the outputs as additional feature fields for the feature interaction models. SIM (Pi et al., 2020) is a classical sequential CTR model that leverages user behavior retrieval techniques to enhance the recommendation performance. We include it for fair comparison, since ReLLa incorporates semantic user behavior retrieval (SUBR). As for LM-based CTR models, we select CTR-BERT (Muhamed et al., 2021a), PTab (Liu et al., 2022), and P5 (Geng et al., 2022a) as the representative baselines. TALLRec (Bao et al., 2023) adopts the simple instruction tuning framework for LLMs, and we therefore include it in our ablation study in Section 4.5.

Table 2. The performance of different models in zero-shot, full-shot and few-shot settings. In the full-shot setting, the baselines are trained on the entire training set. In the few-shot setting, the number of training shots $N$ is selected from $\{256\ (<1\%),\ 1024\ (<10\%)\}$ on the BookCrossing dataset, and $\{8192\ (<1\%),\ 65536\ (<10\%)\}$ on the MovieLens-1M and MovieLens-25M datasets. The best result is given in bold, and the second-best value is underlined. Rel.Impr denotes the relative AUC improvement rate of ReLLa against each baseline. The symbol $\ast$ indicates statistically significant improvement of ReLLa over the best baseline with $p$-value < 0.001.

		BookCrossing				MovieLens-1M				MovieLens-25M
Setting	Model	AUC	Log Loss	ACC	Rel.Impr	AUC	Log Loss	ACC	Rel.Impr	AUC	Log Loss	ACC	Rel.Impr
Zero-shot	Vicuna-7B	0.7011	0.9357	0.5378	3.45%	0.6739	0.9510	0.5644	4.07%	0.7468	0.6348	0.6392	-1.93%
	Vicuna-13B	0.7176	0.9507	0.5649	1.07%	0.6993	0.6291	0.6493	0.29%	0.7503	0.6308	0.6427	-2.39%
	ReLLa (Ours)	0.7253^∗	0.9277^∗	0.5750^∗	-	0.7013^∗	0.6250^∗	0.6507^∗	-	0.7324	0.5858^∗	0.7027^∗	-
Full-shot	DeepFM	0.7496	0.5953	0.6760	1.05%	0.7915	0.5484	0.7225	1.49%	0.8189	0.4867	0.7709	3.52%
	AutoInt	0.7481	0.6840	0.6365	1.26%	0.7929	0.5453	0.7226	1.31%	0.8169	0.4957	0.7689	3.77%
	DCNv2	0.7472	0.6816	0.6472	1.38%	0.7931	0.5464	0.7216	1.29%	0.8190	0.4989	0.7702	3.50%
	GRU4Rec	0.7479	0.5930	0.6777	1.28%	0.7926	0.5453	0.7225	1.35%	0.8186	0.4941	0.7700	3.55%
	Caser	0.7478	0.5990	0.6760	1.30%	0.7918	0.5464	0.7206	1.45%	0.8199	0.4865	0.7707	3.39%
	SASRec	0.7482	0.5934	0.6811	1.24%	0.7934	0.5460	0.7233	1.25%	0.8187	0.4956	0.7691	3.54%
	DIN	0.7477	0.6811	0.6557	1.31%	0.7962	0.5425	0.7252	0.89%	0.8190	0.4906	0.7716	3.50%
	SIM	0.7541	0.5893	0.6777	0.45%	0.7992	0.5387	0.7268	0.51%	0.8344	0.4724	0.7822	1.59%
	CTR-BERT	0.7448	0.5938	0.6704	1.71%	0.7931	0.5457	0.7233	1.29%	0.8079	0.5044	0.7511	4.93%
	PTab	0.7429	0.6154	0.6574	1.97%	0.7955	0.5428	0.7240	0.98%	0.8107	0.5022	0.7551	4.56%
	P5	0.7438	0.6128	0.6563	1.84%	0.7937	0.5478	0.7190	1.21%	0.8092	0.5030	0.7527	4.76%
Few-shot	ReLLa (<1%)	0.7482	0.6265	0.6800	-	0.7927	0.5475	0.7196	-	0.8352	0.4693	0.7779	-
Few-shot	ReLLa (<10%)	0.7575^∗	0.5919	0.6806	-	0.8033^∗	0.5362^∗	0.7280^∗	-	0.8477^∗	0.4524^∗	0.7925^∗	-
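
The Rel.Impr column of Table 2 appears to follow the usual relative-improvement formula, sketched below with a hypothetical function name. Judging from the reported numbers (an inference, not a statement from the text), the zero-shot rows are compared against zero-shot ReLLa and the full-shot rows against few-shot ReLLa (<10%).

```python
def relative_auc_improvement(auc_rella: float, auc_baseline: float) -> float:
    """Relative AUC improvement of ReLLa over a baseline, as a fraction."""
    return (auc_rella - auc_baseline) / auc_baseline


# Vicuna-7B vs. zero-shot ReLLa on BookCrossing (AUC values from Table 2).
zero_shot_gain = relative_auc_improvement(0.7253, 0.7011)

# SIM vs. few-shot ReLLa (<10%) on MovieLens-25M (AUC values from Table 2).
few_shot_gain = relative_auc_improvement(0.8477, 0.8344)
```

Multiplied by 100 and rounded to two decimals, these two values match the 3.45% and 1.59% entries in the table.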

4.1.4. Implementation Details

We select Vicuna-13B (Chiang et al., 2023) released by FastChat (https://github.com/lm-sys/FastChat) as the base LLM for ReLLa. All the experiments are conducted on V100 GPUs. For training resource efficiency, 8-bit quantization and low-rank adaptation (LoRA) (Hu et al., 2021) are adopted for parameter-efficient finetuning (PEFT). We follow previous works (Bao et al., 2023; Chenghao Fan and Tian, 2023) to set the configuration of LoRA, with the LoRA rank set to 8, LoRA alpha to 16, and LoRA dropout to 0.05. The LoRA update matrices are applied on the query and value projection matrices of the attention blocks. During instruction tuning, we adopt the AdamW (Loshchilov and Hutter, 2017) optimizer with weight decay set to 0. The model is trained with a batch size selected from $\{128,256\}$. The initial learning rate is selected from $\{1\times 10^{-3},1.5\times 10^{-3}\}$ and decayed with a linear scheduler. On the BookCrossing dataset, the maximum number of training epochs is set to 10, while on the MovieLens-1M and MovieLens-25M datasets, it is set to 5. The configuration of the baselines is in Appendix C. The hard prompt templates for textual input-output pairs and item descriptions for all three datasets are in Appendix A.
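
The hyperparameters above can be collected into a small configuration object, as in the sketch below. The class `LoraSettings` is hypothetical and is not taken from the released code. Its field names mirror those of `LoraConfig` in the Hugging Face PEFT library, and `q_proj` / `v_proj` are the query and value projection module names used for LLaMA-family models in Hugging Face Transformers; that the released code uses these exact names is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container for the tuning hyperparameters listed in the text."""

    r: int = 8                      # LoRA rank
    lora_alpha: int = 16            # LoRA alpha
    lora_dropout: float = 0.05      # LoRA dropout
    target_modules: Tuple[str, ...] = ("q_proj", "v_proj")  # assumed module names
    load_in_8bit: bool = True       # 8-bit quantization of the base model
    weight_decay: float = 0.0       # AdamW weight decay

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling factor applied to the low-rank update."""
        return self.lora_alpha / self.r


settings = LoraSettings()
```

With rank 8 and alpha 16, the low-rank update is scaled by a factor of 2 before being added to the frozen projection weights.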

Moreover, when constructing the hard prompt template for ReLLa, we remove all the pure ID fields, i.e., User ID and ISBN fields on BookCrossing dataset, User ID, Movie ID, and Zipcode fields on MovieLens-1M dataset, User ID and Movie ID fields on MovieLens-25M dataset. The reason is that LLMs possess limited perceptual abilities for pure ID texts (Lin et al., 2023a). Other fields are leveraged as user profile or item information in the prompt, as described in Section 2.2 and Appendix A. Note that we do not discard any features for other CTR baseline models, i.e., they take all the feature fields and user behavior sequences as inputs.

4.2. Overall Performance (RQ1)

We evaluate the performance of ReLLa in comparison to existing baseline models, and report the results in Table 2. Note that other recommendation baseline models are all trained in full-shot settings with the entire training set. We set the length of user behavior sequence $K$ to 60/30/30 for BookCrossing/MovieLens-1M/MovieLens-25M, respectively.

For zero-shot recommendation, we observe that:

• The performance of Vicuna-7B is notably inferior to that of its 13B version on all three datasets. This demonstrates that a larger LLM possesses stronger language comprehension and logical reasoning abilities, therefore leading to better zero-shot inference capability for user preference.
• ReLLa significantly outperforms Vicuna-13B for all three metrics on the BookCrossing and MovieLens-1M datasets. Although the AUC performance of ReLLa degenerates on MovieLens-25M, ReLLa attains significant improvements in terms of pointwise metrics (i.e., Log Loss and ACC). Such phenomena validate the effectiveness of SUBR in reducing the difficulty for LLMs to extract useful information from user behavior sequences. Also, the AUC degeneration on MovieLens-25M reveals the potential instability of zero-shot LLMs for recommendation.

As for full-shot and few-shot settings, we can draw the following observations from Table 2:

• SIM achieves the best performance among all the baseline models. SIM applies user behavior retrieval to reduce the noise of user sequences, which is essentially beneficial for CTR prediction. Besides, LM-based CTR models (i.e., CTR-BERT, PTab, P5) perform worse than most of the ID-based traditional CTR models, which is consistent with the results reported in (Li et al., 2023a; Rajput et al., 2023). These LM-based methods only incorporate small language models (e.g., BERT (Berthelot et al., 2019), T5 (Raffel et al., 2020)) for pure text-based recommendation, and therefore result in inferior performance.
• ReLLa (few-shot) generally achieves significantly better performance than all the baseline models, except for a few cases, which validates the effectiveness of our proposed retrieval-enhanced instruction tuning (ReiT). It is worth noting that ReLLa utilizes less than 10% of the training samples for finetuning, while the baseline models are trained on the entire training set, e.g., $N=65,536$ for ReLLa and $N=19,349,912$ for SIM on the MovieLens-25M dataset. This demonstrates the superior data efficiency of ReLLa for sequential recommendation tasks.

4.3. Sequential Behavior Comprehension (RQ2)

We vary the length of the user behavior sequence $K$ to investigate its impact on CTR prediction performance, which can demonstrate the comprehension ability of a model towards user behavior sequences. Three different models, including SIM (full-shot), Vicuna-13B (zero-shot) and ReLLa (few-shot), are evaluated with different values of $K$. On the BookCrossing dataset, $K$ ranges in $\{10,20,30,40,50,60\}$, while on the MovieLens-1M and MovieLens-25M datasets, $K$ ranges in $\{5,10,15,20,25,30\}$. The numbers of shots are set to $256$, $8192$, $8192$ for BookCrossing, MovieLens-1M, and MovieLens-25M, respectively (i.e., the <1% few-shot setting). The results are shown in Figure 6, from which we obtain the following observations:

• As a traditional CTR prediction model, SIM (full-shot) (Pi et al., 2020) enjoys steady performance improvement as the length $K$ grows. This is consistent with our common understanding, where longer user behavior sequences can introduce more useful information to better accomplish the recommendation tasks.
• However, the performance of Vicuna-13B (zero-shot) peaks at $K=30/15/15$ on the BookCrossing/MovieLens-1M/MovieLens-25M datasets, and then starts to decrease with longer sequences. It is worth noting that the number of involved tokens (i.e., around 500/700/700 for the three datasets respectively) is actually far from reaching the context limitation of Vicuna-13B (i.e., 2048 tokens). This indicates that it is non-trivial for LLMs to comprehend the textual context of long behavior sequences for recommendation, where a certain amount of in-domain knowledge is required.
• ReLLa mitigates the incomprehension problem of LLMs on long user behavior sequences for recommendation. Compared with Vicuna-13B (zero-shot), whose performance drops when $K>30$ on BookCrossing and $K>15$ on MovieLens-1M and MovieLens-25M, there are no performance turning points for ReLLa. Similar to SIM, the AUC performance of ReLLa improves continuously as $K$ grows, validating the comprehension ability of ReLLa for textual contexts with longer behavior sequences.

4.4. Data Efficiency (RQ3)

Focusing on few-shot settings, we investigate the data efficiency property by varying the number of shots $N$ . In Figure 7, we report the AUC performance of ReLLa and SIM (the best full-shot baseline) with different $N$ s. For BookCrossing dataset, $N$ ranges in $\{128,256,512,1024,2048,4096\}$ . For MovieLens-1M and MovieLens-25M datasets, $N$ ranges in $\{512,1024,2048,4096,8192,65536\}$ . The length of user behavior sequence $K$ is set to $K=60/30/30$ for BookCrossing/MovieLens-1M/MovieLens-25M datasets, respectively.

As depicted in Figure 7, both ReLLa and SIM improve as the number of shots $N$ grows. However, with the same number of shots $N$, ReLLa consistently outperforms SIM by a large margin. Moreover, when $N$ is extremely small (e.g., 128 and 256) on the BookCrossing dataset, SIM even fails to accomplish the CTR prediction task, with an AUC of merely around 0.5. With a limited number of training samples, ReLLa shows remarkable data efficiency and displays considerable few-shot inference ability, owing to the intrinsic logical reasoning abilities and open-world knowledge of LLMs.

Table 3. The performance of different variants of ReLLa. We remove different components of ReLLa to evaluate the contribution of each part to the model. The best result is given in bold, and the second-best value is underlined.

Model Variant	BookCrossing			MovieLens-1M			MovieLens-25M
Model Variant	AUC	Log Loss	ACC	AUC	Log Loss	ACC	AUC	Log Loss	ACC
ReLLa (Ours)	0.7482	0.6265	0.6800	0.7927	0.5475	0.7196	0.8352	0.4693	0.7779
ReLLa (w/o Mixture)	0.7399	0.6002	0.6715	0.7849	0.5693	0.6985	0.8192	0.4904	0.7715
ReLLa (w/o Retrieval)	0.7167	0.9293	0.4898	0.7718	0.5795	0.7039	0.8174	0.4892	0.7685
ReLLa ( $\frac{1}{2}N$ -shot)	0.7415	0.6268	0.6462	0.7862	0.5781	0.6964	0.8231	0.5157	0.7672
ReLLa (w/o IT)	0.7253	0.9277	0.5750	0.7013	0.6250	0.6507	0.7324	0.5858	0.7027
ReLLa (w/o IT & Retrieval)	0.7176	0.9507	0.5649	0.6993	0.6291	0.6493	0.7503	0.6308	0.6427

4.5. Ablation Study (RQ4)

To analyze the efficacy of each component in our proposed ReLLa framework, we design the following model variants of ReLLa. We set $N=256/8192/8192$ (<1% setting) and $K=60/30/30$ for the BookCrossing/MovieLens-1M/MovieLens-25M datasets, respectively.

•

ReLLa (Ours) is the complete version of our proposed method. The training data consists of both original and retrieval-enhanced samples, resulting in a mixed training dataset of $2N$ samples. The testing set only contains pure retrieval-enhanced samples.
•

ReLLa (w/o Mixture). We only maintain the retrieval-enhanced data instances to construct the training dataset of $N$ samples. The testing data is still all retrieval-enhanced samples.
•

ReLLa (w/o Retrieval). We remove the semantic user behavior retrieval for both training and testing samples. That is, training and testing data are all original samples without retrieval enhancements. The training set contains $N$ training samples. This variant indicates the vanilla instruction tuning version over Vicuna-13B, which is similar to TALLRec (Bao et al., 2023).
•

ReLLa ( $\frac{1}{2}N$ -shot). We halve the number of shots $N$ to $\frac{1}{2}N$ , i.e., from 256 to 128 on BookCrossing and from 8192 to 4096 on MovieLens-1M and MovieLens-25M. Therefore, the constructed mixed training set contains $N$ training samples. This variant is intended to decouple and investigate the factors of doubled training samples and pattern enrichment.
•

ReLLa (w/o IT). We remove the instruction tuning, while preserving the retrieval-enhanced samples for testing data. This variant indicates the zero-shot version of our proposed ReLLa.
•

ReLLa (w/o IT & Retrieval). We remove both the instruction tuning and retrieval operation. Therefore, the testing data only contains original data samples. This variant indicates the zero-shot version of vanilla Vicuna-13B.
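The training-set construction behind these variants can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the variant keys, the dictionary sample format, and `toy_retrieve` (a stand-in for SUBR) are all hypothetical, and taking the first half of the samples for the $\frac{1}{2}N$-shot variant is an assumption, since the text does not say how the half is chosen.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]  # hypothetical sample format, e.g. {"history": [...], "label": 1}


def build_training_set(
    original: List[Sample],
    retrieve: Callable[[Sample], Sample],
    variant: str,
) -> List[Sample]:
    """Return the instruction-tuning set for one ablation variant.

    `original` holds the N sampled few-shot instances; `retrieve` stands in
    for semantic user behavior retrieval (SUBR) and maps a sample to its
    retrieval-enhanced counterpart.
    """
    if variant == "full":            # ReLLa (Ours): original + enhanced, 2N samples
        return list(original) + [retrieve(s) for s in original]
    if variant == "wo_mixture":      # enhanced samples only, N samples
        return [retrieve(s) for s in original]
    if variant == "wo_retrieval":    # original samples only, N samples
        return list(original)
    if variant == "half_shot":       # mixture built from N/2 originals, N samples
        half = original[: len(original) // 2]  # assumption: keep the first half
        return list(half) + [retrieve(s) for s in half]
    if variant in ("wo_it", "wo_it_retrieval"):  # zero-shot: no tuning data
        return []
    raise ValueError(f"unknown variant: {variant}")


def toy_retrieve(sample: Sample) -> Sample:
    """Placeholder for SUBR: only tags the sample as retrieval-enhanced."""
    return {**sample, "enhanced": True}


shots = [{"history": [i, i + 1], "label": i % 2} for i in range(8)]  # N = 8
sizes = {
    v: len(build_training_set(shots, toy_retrieve, v))
    for v in ("full", "wo_mixture", "wo_retrieval", "half_shot", "wo_it")
}
print(sizes)
```

With $N=8$ toy samples, the full variant yields $2N$ training instances, the three reduced variants yield $N$ each, and the zero-shot variants yield none, matching the sample counts stated in the list above.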

The performance of these variants is presented in Table 3, from which we can draw the following observations:

•

For ReLLa (w/o Mixture) and ReLLa (w/o Retrieval), the training and testing data comprise exactly the same type of samples, i.e., pure retrieval-enhanced samples and pure original samples, respectively, so there is no data inconsistency between the training and testing phases. Nevertheless, both of them significantly underperform our proposed ReLLa, by at least 1.12%, 0.99% and 1.95% in relative AUC on BookCrossing, MovieLens-1M and MovieLens-25M, respectively. This highlights the importance of the data mixture strategy, the benefits of which can be broken down into two prominent factors: doubled training samples and pattern enrichment. Doubled training samples lead to a more thorough training process, while pattern enrichment can prevent the model from overfitting and therefore increase model robustness.
•

We introduce the variant ReLLa ( $\frac{1}{2}N$ -shot) to further decouple and analyze the two factors mentioned above, i.e., doubled training samples and pattern enrichment. Its total number of training samples is the same as that of ReLLa (w/o Mixture) and ReLLa (w/o Retrieval), except that ReLLa ( $\frac{1}{2}N$ -shot) sees only half of the original training instances. Even so, ReLLa ( $\frac{1}{2}N$ -shot) still outperforms ReLLa (w/o Mixture) and ReLLa (w/o Retrieval) with 0.21%, 0.16% and 0.48% relative AUC improvement, and achieves comparable or better performance in Log Loss and ACC. This indicates that pattern enrichment, acting as a form of regularization, contributes more to the performance improvement than the doubled number of training samples.
•

Finally, comparing ReLLa (w/o IT) and ReLLa (w/o IT & Retrieval), which fall back into zero-shot settings, we can observe that ReLLa (w/o IT) generally achieves significant improvements over ReLLa (w/o IT & Retrieval), except for the AUC metric on MovieLens-25M. This demonstrates that semantic user behavior retrieval (SUBR) improves the quality of data samples and makes the filtered behavior sequence more friendly for LLM to extract useful knowledge.
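The relative AUC improvements quoted in the observations above can be recomputed from Table 3, as in the following short sketch. It assumes the percentages are relative gains over the stronger of the two reference variants on each dataset; the dictionary keys and the helper name `relative_gain` are illustrative only.

```python
# AUC values copied from Table 3, ordered as
# (BookCrossing, MovieLens-1M, MovieLens-25M).
auc = {
    "ReLLa (Ours)": (0.7482, 0.7927, 0.8352),
    "ReLLa (w/o Mixture)": (0.7399, 0.7849, 0.8192),
    "ReLLa (w/o Retrieval)": (0.7167, 0.7718, 0.8174),
    "ReLLa (1/2 N-shot)": (0.7415, 0.7862, 0.8231),
}


def relative_gain(model, baselines):
    """Relative AUC gain (in %) of `model` over the best baseline per dataset."""
    gains = []
    for i in range(3):
        best = max(auc[b][i] for b in baselines)
        gains.append(100.0 * (auc[model][i] - best) / best)
    return gains


refs = ["ReLLa (w/o Mixture)", "ReLLa (w/o Retrieval)"]
full_gain = relative_gain("ReLLa (Ours)", refs)
half_gain = relative_gain("ReLLa (1/2 N-shot)", refs)
print([round(g, 2) for g in full_gain])
print([round(g, 2) for g in half_gain])
```

Because Table 3 reports AUC to four decimal places, gains recomputed this way can differ from the quoted figures in the last digit.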

4.6. Case Study (RQ5)

In this section, we conduct a case study to further analyze how ReLLa helps LLMs better understand long user behavior sequences. As shown in Figure 8, we select a testing sample from the MovieLens-25M dataset and visualize the attention scores of the target item over the user behavior sequence at the last hidden layer of three different models (i.e., Vicuna-13B, ReLLa (zero-shot), and ReLLa (few-shot)). The attention score for each historical item is computed by summing the attention scores of all word tokens in the textual input of the corresponding item. In Figure 8, each historical item is represented as a rectangle with a color ranging from yellow to green. The deeper the green of a rectangle, the larger the attention score of the corresponding historical item, and thus the more it contributes to the final CTR estimation.
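The per-item aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: `token_attention` is taken to be a single vector of attention weights from the target item's position to each input token (for example, already averaged over heads, which the text does not specify), and `item_spans` is a hypothetical list of half-open token ranges, one per historical item.

```python
from typing import List, Sequence, Tuple


def item_attention_scores(
    token_attention: Sequence[float],
    item_spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Sum token-level attention over the tokens that spell out each item.

    Each span is a half-open range (start, end) of token positions.
    """
    return [sum(token_attention[start:end]) for start, end in item_spans]


# Toy example: six tokens forming three historical items of two tokens each.
token_attention = [0.10, 0.20, 0.05, 0.05, 0.30, 0.30]
item_spans = [(0, 2), (2, 4), (4, 6)]

scores = item_attention_scores(token_attention, item_spans)
# Index of the historical item that receives the most attention.
top_item = max(range(len(scores)), key=scores.__getitem__)
print(scores, top_item)
```

In this toy input, the third item collects the most attention mass and would be drawn as the deepest green rectangle in a visualization like Figure 8.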

For Vicuna-13B (zero-shot), the largest attention scores fall on the movies Roman Holiday and Warrior, which have little relationship with the target movie Thor: Ragnarok, and thus the model fails to correctly infer the user’s preference towards the target item. Equipped with semantic user behavior retrieval (SUBR), we can reduce the noise in the user behavior sequence and bring in more relevant items. As shown in Figure 8, ReLLa (zero-shot) is able to pay more attention to superhero movies (e.g., Iron Man 3) that are semantically similar to the target item. However, there are still outliers for ReLLa (zero-shot), e.g., the movie Kick-Ass 2 is generally uncorrelated with Thor: Ragnarok, which is produced by Marvel. Next, by further applying retrieval-enhanced instruction tuning (ReiT), we can observe that the large attention weights of ReLLa (few-shot) all fall onto relevant superhero movies that are also produced by Marvel. Therefore, we can conclude that our proposed SUBR and ReiT help LLMs correctly grasp the correlation between the target item and historical items, and thus better comprehend the user behavior sequence.

5. Related Work

5.1. Traditional CTR Prediction

CTR prediction serves as the key component for various online applications (e.g., recommender systems (Xi et al., 2023a), advertising (Ou et al., 2023), and web search (Lin et al., 2021; Fu et al., 2023a; Dai et al., 2021)). It aims to accurately estimate the user’s click probability towards a certain target item in a given context (Zhang et al., 2021b). Traditional CTR prediction models can be mainly classified into two categories: (1) feature interaction based models, and (2) sequential recommendation models.

The feature interaction based models generally derive from POLY2 (Chang et al., 2010) and FM (Rendle, 2010). Their core idea is to capture the second- or high-order feature interaction patterns across multiple feature fields with different operators (e.g., product (Qu et al., 2016; Wang et al., 2021; Guo et al., 2017; Chen et al., 2021), convolution (Xin et al., 2019; Liu et al., 2019), and attention (Song et al., 2019; Xiao et al., 2017)). For example, DCN (Wang et al., 2017), xDeepFM (Lian et al., 2018), and DCNv2 (Wang et al., 2021) apply a product-based feature crossing operation at each layer for explicit high-order feature interaction modeling. AutoInt (Song et al., 2019) and InterHAt (Li et al., 2020) adopt the attention mechanism for feature interactions, which additionally makes predictions explainable via attention weights.

The sequential recommendation models (Zhou et al., 2019; Pi et al., 2019; Zhou et al., 2018) focus on user behavior modeling and seek to dynamically capture users’ interests towards a target item according to the given behavior history. They leverage different architectures (e.g., RNN (Hidasi and Karatzoglou, 2018; Hidasi et al., 2016), CNN (Tang and Wang, 2018), attention (Zhou et al., 2019, 2018), memory bank (Pi et al., 2019; Ren et al., 2019)) to handle the user behavior sequence for user preference modeling. For instance, GRU4Rec (Hidasi et al., 2016) adopts the gated recurrent unit (GRU) (Chung et al., 2014) to encode the user’s sequential behaviors. Caser (Tang and Wang, 2018) introduces the convolutional neural network (CNN) to model the union-level patterns among user behavior sequences.

5.2. Language Models for Recommendation

As suggested in previous work (Lin et al., 2023a), the adaptation of language models to the field of recommender systems can be generally categorized according to the roles they serve in the recommendation pipeline, i.e., feature engineering (Liu et al., 2023a; Borisov et al., 2022; Li et al., 2023c; Mysore et al., 2023; Carranza et al., 2023; Christakopoulou et al., 2023), feature encoder (Muhamed et al., 2021b; Hou et al., 2022; Yu et al., 2021; Wang et al., 2022b; Hou et al., 2023a; Zhang et al., 2022; Fu et al., 2023b; Yuan et al., 2023; Qiu et al., 2021; Li et al., 2023b), and scoring/ranking function (Liu et al., 2022; Kang et al., 2023; Zhang et al., 2021a; Li et al., 2023e; Bao et al., 2023; Li et al., 2023d; Zhang and Wang, 2023; Mao et al., 2023; Hua et al., 2023a; Geng et al., 2023; Hua et al., 2023b; Zhang et al., 2023a; Hou et al., 2023b; Chen, 2023; Petrov and Macdonald, 2023; Wang and Lim, 2023).

For feature engineering, large language models (LLMs) accept the raw data (e.g., user profiles and item descriptions) as input, and generate supplementary text-based attributes as data augmentations with delicately designed prompts and templates. For example, KAR (Xi et al., 2023b) utilizes the reasoning knowledge on user preferences and the factual knowledge on items by requesting LLMs with factorization prompting techniques. The obtained knowledge can serve as augmented features and promote the recommendation performance in a model-agnostic manner. GENRE (Liu et al., 2023a) employs LLMs to obtain news summarization, synthetic news pieces, and user profiles.

For feature encoder, LLMs are adopted as auxiliary textual feature encoders to (1) enrich the user/item representations with semantic information, and (2) enable cross-domain recommendation with the natural language interface. For instance, U-BERT (Qiu et al., 2021) enhances the user representation by encoding review texts into dense vectors via BERT. UniSRec (Hou et al., 2022) and VQ-Rec (Hou et al., 2023a) apply a fixed BERT as the encoder for item descriptive texts, in order to achieve unified cross-domain sequential recommendation.

For scoring/ranking function, researchers explore the potential of LLMs to directly serve as the core scoring or ranking module for recommendation, instead of an assistant role for conventional recommendation models (e.g., feature engineering or feature encoder). In this case, LLMs are employed to accomplish either the item scoring task (Liu et al., 2022; Kang et al., 2023; Zhang et al., 2021a; Li et al., 2023e; Bao et al., 2023; Li et al., 2023d; Zhang and Wang, 2023; Mao et al., 2023), or item generation task (Hua et al., 2023a; Geng et al., 2023; Hua et al., 2023b; Zhang et al., 2023a; Hou et al., 2023b; Chen, 2023; Petrov and Macdonald, 2023; Wang and Lim, 2023). Also, various works (Geng et al., 2022b; Cui et al., 2022; Zhang et al., 2023b; Liu et al., 2023b; Sun et al., 2023; Dai et al., 2023) attempt to utilize the multi-task capacity of LLMs, and instruct LLMs to solve the multiple tasks (e.g., both scoring and generation) through a unified language interface.

In this paper, we mainly focus on the utilization of LLMs as the scoring/ranking functions, where the pointwise scoring task is adopted for CTR prediction. To the best of our knowledge, we are the first to identify and well formulate the incomprehension problem of LLMs on lifelong user behavior sequences when adopting LLMs for scoring and ranking tasks. A novel ReLLa framework is proposed to mitigate such an issue by introducing the retrieval techniques to promote comprehension ability of LLMs and thus enhance their recommendation performance.

6. Conclusion

In this paper, we focus on adapting and empowering LLMs as the scoring/ranking function for recommendation tasks. We first identify and formulate the incomprehension problem of LLMs on lifelong sequential behaviors, i.e., LLMs fail to extract useful information from the textual context of a long user behavior sequence, even if the length of the context is far from reaching the context limit of LLMs. Hence, we propose a novel ReLLa framework, where semantic user behavior retrieval (SUBR) and retrieval-enhanced instruction tuning (ReiT) are designed to address this issue and thereby improve recommendation performance. Extensive experiments validate the effectiveness of our proposed ReLLa compared with existing baselines. Specifically, using less than 10% of the training samples, few-shot ReLLa can outperform all the full-shot traditional CTR models that are trained on the entire training set. This demonstrates the superior data efficiency of ReLLa, as well as its comprehension ability towards long user behavior sequences.

Acknowledgements.

The Shanghai Jiao Tong University team is partially supported by National Key R&D Program of China (2022ZD0114804), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and National Natural Science Foundation of China (62177033, 62322603). The work is sponsored by Huawei Innovation Research Program. We thank MindSpore (min, 2020), a new deep learning computing framework, for the partial support of this work.

References

min (2020) 2020. MindSpore. https://www.mindspore.cn/
Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. 2023. The falcon series of open language models. arXiv preprint arXiv:2311.16867 (2023).
Bao et al. (2023) Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. arXiv preprint arXiv:2305.00447 (2023).
Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems 32 (2019).
Borisov et al. (2022) Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. 2022. Language models are realistic tabular data generators. arXiv preprint arXiv:2210.06280 (2022).
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
Carranza et al. (2023) Aldo Gael Carranza, Rezsa Farahani, Natalia Ponomareva, Alex Kurakin, Matthew Jagielski, and Milad Nasr. 2023. Privacy-Preserving Recommender Systems with Synthetic Query Generation using Differentially Private Large Language Models. arXiv preprint arXiv:2305.05973 (2023).
Chang et al. (2010) Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. 2010. Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research 11, 4 (2010).
Chen et al. (2021) Bo Chen, Yichao Wang, Zhirong Liu, Ruiming Tang, Wei Guo, Hongkun Zheng, Weiwei Yao, Muyu Zhang, and Xiuqiang He. 2021. Enhancing explicit and implicit feature interactions via information sharing for parallel deep ctr models. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3757–3766.
Chen (2023) Zheng Chen. 2023. PALR: Personalization Aware LLMs for Recommendation. arXiv preprint arXiv:2305.07622 (2023).
Chenghao Fan and Tian (2023) Zhenyi Lu Chenghao Fan and Jie Tian. 2023. Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model. https://github.com/Facico/Chinese-Vicuna
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/
Christakopoulou et al. (2023) Konstantina Christakopoulou, Alberto Lalama, Cj Adams, Iris Qu, Yifat Amir, Samer Chucri, Pierce Vollucci, Fabio Soldo, Dina Bseiso, Sarah Scodel, et al. 2023. Large Language Models for User Interest Journeys. arXiv preprint arXiv:2305.15498 (2023).
Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
Cui et al. (2022) Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-rec: Generative pretrained language models are open-ended recommender systems. arXiv preprint arXiv:2205.08084 (2022).
Dai et al. (2023) Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT’s Capabilities in Recommender Systems. arXiv preprint arXiv:2305.02182 (2023).
Dai et al. (2021) Xinyi Dai, Jianghao Lin, Weinan Zhang, Shuai Li, Weiwen Liu, Ruiming Tang, Xiuqiang He, Jianye Hao, Jun Wang, and Yong Yu. 2021. An adversarial imitation click model for information retrieval. In Proceedings of the Web Conference 2021. 1809–1820.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Feng et al. (2021) Steven Y Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075 (2021).
Fu et al. (2023b) Junchen Fu, Fajie Yuan, Yu Song, Zheng Yuan, Mingyue Cheng, Shenghui Cheng, Jiaqi Zhang, Jie Wang, and Yunzhu Pan. 2023b. Exploring Adapter-based Transfer Learning for Recommender Systems: Empirical Studies and Practical Insights. arXiv preprint arXiv:2305.15036 (2023).
Fu et al. (2023a) Lingyue Fu, Jianghao Lin, Weiwen Liu, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023a. An F-shape Click Model for Information Retrieval on Multi-block Mobile Pages. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 1057–1065.
Geng et al. (2022a) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022a. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
Geng et al. (2022b) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022b. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
Geng et al. (2023) Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. 2023. VIP5: Towards Multimodal Foundation Models for Recommendation. arXiv preprint arXiv:2305.14302 (2023).
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
Hidasi and Karatzoglou (2018) Balázs Hidasi and Alexandros Karatzoglou. 2018. Recurrent neural networks with top-k gains for session-based recommendations. CIKM (2018).
Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In ICLR.
Hou et al. (2023a) Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023a. Learning vector-quantized item representation for transferable sequential recommenders. In Proceedings of the ACM Web Conference 2023. 1162–1171.
Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
Hou et al. (2023b) Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023b. Large language models are zero-shot rankers for recommender systems. arXiv preprint arXiv:2305.08845 (2023).
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
Hua et al. (2023a) Wenyue Hua, Yingqiang Ge, Shuyuan Xu, Jianchao Ji, and Yongfeng Zhang. 2023a. UP5: Unbiased Foundation Model for Fairness-aware Recommendation. arXiv preprint arXiv:2305.12090 (2023).
Hua et al. (2023b) Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023b. How to Index Item IDs for Recommendation Foundation Models. arXiv preprint arXiv:2305.06569 (2023).
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. ICDM (2018).
Kang et al. (2023) Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction. arXiv preprint arXiv:2305.06474 (2023).
Korbak et al. (2022) Tomasz Korbak, Hady Elsahar, German Kruszewski, and Marc Dymetman. 2022. Controlling conditional language models without catastrophic forgetting. In International Conference on Machine Learning. PMLR, 11499–11528.
Li et al. (2022) Bohan Li, Yutai Hou, and Wanxiang Che. 2022. Data augmentation approaches in natural language processing: A survey. AI Open 3 (2022), 71–90.
Li et al. (2023c) Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, and Ying Shan. 2023c. TagGPT: Large Language Models are Zero-shot Multimodal Taggers. arXiv preprint arXiv:2304.03022 (2023).
Li et al. (2023d) Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023d. Text Is All You Need: Learning Language Representations for Sequential Recommendation. arXiv preprint arXiv:2305.13731 (2023).
Li et al. (2023b) Ruyu Li, Wenhao Deng, Yu Cheng, Zheng Yuan, Jiaqi Zhang, and Fajie Yuan. 2023b. Exploring the Upper Limits of Text-Based Collaborative Filtering Using Large Language Models: Discoveries and Insights. arXiv preprint arXiv:2305.11700 (2023).
Li et al. (2023a) Xiangyang Li, Bo Chen, Lu Hou, and Ruiming Tang. 2023a. CTRL: Connect Tabular and Language Model for CTR Prediction. arXiv preprint arXiv:2306.02841 (2023).
Li et al. (2023e) Xinyi Li, Yongfeng Zhang, and Edward C Malthouse. 2023e. PBNR: Prompt-based News Recommender System. arXiv preprint arXiv:2304.07862 (2023).
Li et al. (2020) Zeyu Li, Wei Cheng, Yang Chen, Haifeng Chen, and Wei Wang. 2020. Interpretable click-through rate prediction through hierarchical attention. In Proceedings of the 13th International Conference on Web Search and Data Mining. 313–321.
Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In KDD. 1754–1763.
Lin et al. (2023a) Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, et al. 2023a. How Can Recommender Systems Benefit from Large Language Models: A Survey. arXiv preprint arXiv:2306.05817 (2023).
Lin et al. (2021) Jianghao Lin, Weiwen Liu, Xinyi Dai, Weinan Zhang, Shuai Li, Ruiming Tang, Xiuqiang He, Jianye Hao, and Yong Yu. 2021. A Graph-Enhanced Click Model for Web Search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1259–1268.
Lin et al. (2023b) Jianghao Lin, Yanru Qu, Wei Guo, Xinyi Dai, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023b. MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1384–1395.
Liu et al. (2019) Bin Liu, Ruiming Tang, Yingzhi Chen, Jinkai Yu, Huifeng Guo, and Yuzhou Zhang. 2019. Feature generation by convolutional neural network for click-through rate prediction. In WWW. 1119–1129.
Liu et al. (2022) Guang Liu, Jie Yang, and Ledell Wu. 2022. PTab: Using the Pre-trained Language Model for Modeling Tabular Data. arXiv preprint arXiv:2209.08060 (2022).
Liu et al. (2023b) Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023b. Is chatgpt a good recommender? a preliminary study. arXiv preprint arXiv:2304.10149 (2023).
Liu et al. (2023a) Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023a. A First Look at LLM-Powered Generative News Recommendation. arXiv preprint arXiv:2305.06566 (2023).
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
Mao et al. (2023) Zhiming Mao, Huimin Wang, Yiming Du, and Kam-fai Wong. 2023. UniTRec: A Unified Text-to-Text Transformer and Joint Contrastive Learning Framework for Text-based Recommendation. arXiv preprint arXiv:2305.15756 (2023).
Muhamed et al. (2021a) Aashiq Muhamed, Iman Keivanloo, Sujan Perera, James Mracek, Yi Xu, Qingjun Cui, Santosh Rajagopalan, Belinda Zeng, and Trishul Chilimbi. 2021a. CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models. In NeurIPS Efficient Natural Language and Speech Processing Workshop.
Muhamed et al. (2021b) Aashiq Muhamed, Iman Keivanloo, Sujan Perera, James Mracek, Yi Xu, Qingjun Cui, Santosh Rajagopalan, Belinda Zeng, and Trishul Chilimbi. 2021b. CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models. In NeurIPS Efficient Natural Language and Speech Processing Workshop.
Mysore et al. (2023) Sheshera Mysore, Andrew McCallum, and Hamed Zamani. 2023. Large Language Model Augmented Narrative Driven Recommendations. arXiv preprint arXiv:2306.02250 (2023).
Ni et al. (2021) Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B Hall, Daniel Cer, and Yinfei Yang. 2021. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. arXiv preprint arXiv:2108.08877 (2021).
Ou et al. (2023) Weitong Ou, Bo Chen, Yingxuan Yang, Xinyi Dai, Weiwen Liu, Weinan Zhang, Ruiming Tang, and Yong Yu. 2023. Deep Landscape Forecasting in Multi-Slot Real-Time Bidding. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4685–4695.
Petrov and Macdonald (2023) Aleksandr V Petrov and Craig Macdonald. 2023. Generative Sequential Recommendation with GPTRec. arXiv preprint arXiv:2306.11114 (2023).
Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on long sequential user behavior modeling for click-through rate prediction. In KDD. 2671–2679.
Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692.
Qin et al. (2021) Jiarui Qin, Weinan Zhang, Rong Su, Zhirong Liu, Weiwen Liu, Ruiming Tang, Xiuqiang He, and Yong Yu. 2021. Retrieval & Interaction Machine for Tabular Data Prediction. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1379–1389.
Qin et al. (2020) Jiarui Qin, W. Zhang, Xin Wu, Jiarui Jin, Yuchen Fang, and Y. Yu. 2020. User Behavior Retrieval for Click-Through Rate Prediction. In SIGIR.
Qiu et al. (2021) Zhaopeng Qiu, Xian Wu, Jingyue Gao, and Wei Fan. 2021. U-BERT: Pre-training user representations for improved recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4320–4327.
Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 1149–1154.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
Rajput et al. (2023) Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q Tran, Jonah Samost, et al. 2023. Recommender Systems with Generative Retrieval. arXiv preprint arXiv:2305.05065 (2023).
Ramasesh et al. (2021) Vinay Venkatesh Ramasesh, Aitor Lewkowycz, and Ethan Dyer. 2021. Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019).
Ren et al. (2019) Kan Ren, Jiarui Qin, Yuchen Fang, Weinan Zhang, Lei Zheng, Weijie Bian, Guorui Zhou, Jian Xu, Yong Yu, Xiaoqiang Zhu, et al. 2019. Lifelong Sequential Modeling with Personalized Memorization for User Response Prediction. SIGIR.
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In ICDM.
Shlens (2014) Jonathon Shlens. 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100 (2014).
Song et al. (2019) Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. 2019. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1161–1170.
Sun et al. (2023) Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. arXiv preprint arXiv:2304.09542 (2023).
Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
Wang et al. (2022b) Jie Wang, Fajie Yuan, Mingyue Cheng, Joemon M Jose, Chenyun Yu, Beibei Kong, Xiangnan He, Zhijin Wang, Bo Hu, and Zang Li. 2022b. TransRec: Learning Transferable Recommendation from Mixture-of-Modality Feedback. arXiv preprint arXiv:2206.06190 (2022).
Wang and Lim (2023) Lei Wang and Ee-Peng Lim. 2023. Zero-Shot Next-Item Recommendation using Large Pretrained Language Models. arXiv preprint arXiv:2304.03153 (2023).
Wang et al. (2022a) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022a. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022).
Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17. 1–7.
Wang et al. (2021) Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi. 2021. Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems. In Proceedings of the Web Conference 2021. 1785–1797.
Wang et al. (2023) Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. 2023. Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296 (2023).
Xi et al. (2023a) Yunjia Xi, Jianghao Lin, Weiwen Liu, Xinyi Dai, Weinan Zhang, Rui Zhang, Ruiming Tang, and Yong Yu. 2023a. A Bird’s-eye View of Reranking: from List Level to Page Level. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 1075–1083.
Xi et al. (2023b) Yunjia Xi, Weiwen Liu, Jianghao Lin, Jieming Zhu, Bo Chen, Ruiming Tang, Weinan Zhang, Rui Zhang, and Yong Yu. 2023b. Towards Open-World Recommendation with Knowledge Augmentation from Large Language Models. arXiv preprint arXiv:2306.10933 (2023).
Xiao et al. (2017) Jun Xiao, Hao Ye, Xiangnan He, Hanwang Zhang, Fei Wu, and Tat-Seng Chua. 2017. Attentional factorization machines: learning the weight of feature interactions via attention networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 3119–3125.
Xin et al. (2019) Xin Xin, Bo Chen, Xiangnan He, Dong Wang, Yue Ding, and Joemon M Jose. 2019. CFM: Convolutional Factorization Machines for Context-Aware Recommendation.. In IJCAI, Vol. 19. 3926–3932.
Yu et al. (2021) Yang Yu, Fangzhao Wu, Chuhan Wu, Jingwei Yi, and Qi Liu. 2021. Tiny-newsrec: Effective and efficient plm-based news recommendation. arXiv preprint arXiv:2112.00944 (2021).
Yuan et al. (2023) Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni. 2023. Where to go next for recommender systems? id-vs. modality-based recommender models revisited. arXiv preprint arXiv:2303.13835 (2023).
Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
Zhang et al. (2023a) Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023a. Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation. arXiv preprint arXiv:2305.07609 (2023).
Zhang et al. (2023b) Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023b. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001 (2023).
Zhang et al. (2023c) Kai Zhang, Fubang Zhao, Yangyang Kang, and Xiaozhong Liu. 2023c. Memory-augmented llm personalization with short-and long-term memory coordination. arXiv preprint arXiv:2309.11696 (2023).
Zhang et al. (2021b) Weinan Zhang, Jiarui Qin, Wei Guo, Ruiming Tang, and Xiuqiang He. 2021b. Deep learning for click-through rate estimation. IJCAI (2021).
Zhang et al. (2022) Xinyang Zhang, Yury Malkov, Omar Florez, Serim Park, Brian McWilliams, Jiawei Han, and Ahmed El-Kishky. 2022. TwHIN-BERT: a socially-enriched pre-trained language model for multilingual Tweet representations. arXiv preprint arXiv:2209.07562 (2022).
Zhang et al. (2021a) Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, and Hao Wang. 2021a. Language models as recommender systems: Evaluations and limitations. (2021).
Zhang and Wang (2023) Zizhuo Zhang and Bang Wang. 2023. Prompt learning for news recommendation. arXiv preprint arXiv:2304.05263 (2023).
Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In AAAI, Vol. 33. 5941–5948.
Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1059–1068.

Appendix A Prompt Illustration

We demonstrate several examples to illustrate the hard prompt templates used for ReLLa on all three datasets.

Figure 9 shows the textual input-output pairs without semantic user behavior retrieval (SUBR), where the user behavior sequence is truncated to the most recent $K$ items (e.g., $K=4$ in the figure). As shown in Figure 10, after applying SUBR, the user behavior sequence is replaced by the $K$ historical items that are most relevant to the target item. For example, on the MovieLens-25M dataset, the historical behaviors retrieved by SUBR are all related to superheroes or Marvel, and are thus highly correlated with the target movie “Thor: Ragnarok”. Note that the user behavior sequence generated by SUBR keeps the chronological order of the original lifelong user sequence.
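The difference between the two sequence constructions can be sketched in a few lines. The snippet below is only an illustrative sketch under stated assumptions: the function names `truncate_recent` and `retrieve_top_k` are hypothetical, the item embeddings are toy vectors, rows of `history_emb` are assumed to be in chronological order, and cosine similarity is used as the relevance score.

```python
import numpy as np

def truncate_recent(history, k):
    """Baseline without SUBR: keep the k most recent behaviors."""
    return history[-k:]

def retrieve_top_k(history_emb, target_emb, k):
    """Sketch of retrieval: indices of the k history items most similar
    to the target item, returned in chronological order."""
    # Normalize so that the dot product equals cosine similarity.
    h = history_emb / np.linalg.norm(history_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb)
    sims = h @ t
    k = min(k, len(sims))
    # Indices of the k largest similarities (stable sort for deterministic ties).
    top = np.argsort(-sims, kind="stable")[:k]
    # Sorting the indices restores the original chronological order.
    return np.sort(top)

# Toy example (hypothetical 2-d embeddings, oldest behavior first).
history_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [1.0, 0.2]])
target_emb = np.array([1.0, 0.0])
retrieved = retrieve_top_k(history_emb, target_emb, k=3)
recent = truncate_recent(list(range(len(history_emb))), k=3)
print(retrieved.tolist(), recent)
```

In this toy case the retrieved indices differ from the three most recent ones, while still being listed from oldest to newest.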

Figure 4 demonstrates how we design prompts for item descriptions on the three datasets, which will be encoded by LLM for semantic user behavior retrieval (SUBR).

Appendix B Data Preprocessing

Our experiments are conducted on three real-world public datasets (i.e., BookCrossing, MovieLens-1M and MovieLens-25M), and the statistics of the processed datasets are shown in Table 1. The MovieLens-1M and MovieLens-25M datasets are split into training and testing sets with a ratio of 8:1 according to the global timestamp (Qin et al., 2021). Since the BookCrossing dataset has no timestamps, following previous work (Bao et al., 2023), we divide it into training and testing sets with a ratio of 9:1 by a random split of users. Data samples with a user behavior sequence length of less than 5 are filtered out on all three datasets. We describe more preprocessing details as follows:

• BookCrossing contains user-book integer ratings ranging from 0 to 10. We consider samples with ratings above 5 as positive, and the rest as negative.
• MovieLens-1M contains user-movie integer ratings ranging from 0 to 5. Samples with ratings of 4 and 5 are labeled as positive and the rest as negative (Zhou et al., 2018; Xi et al., 2023b).
• MovieLens-25M has a scoring range from 0 to 5, with increments of 0.5. We label samples with ratings above 3.0 as positive, and the rest as negative.
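The three labeling rules above can be written as a single helper. This is a minimal sketch; the function name `binarize_rating` and the dataset keys are hypothetical and chosen only for illustration.

```python
def binarize_rating(dataset, rating):
    """Map a raw rating to a binary click label (1 = positive, 0 = negative)."""
    if dataset == "BookCrossing":
        return int(rating > 5)      # ratings above 5 are positive
    if dataset == "MovieLens-1M":
        return int(rating >= 4)     # ratings of 4 and 5 are positive
    if dataset == "MovieLens-25M":
        return int(rating > 3.0)    # ratings above 3.0 are positive
    raise ValueError(f"unknown dataset: {dataset}")

print(binarize_rating("MovieLens-25M", 3.5), binarize_rating("MovieLens-25M", 3.0))
```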

Under the few-shot setting with a particular number of shots $N$, we uniformly sample $N$ data instances from the training set, which are then kept fixed during the few-shot tuning of ReLLa. Note that the data instances sampled for a smaller $N$ are all included in the few-shot training sets sampled for a larger $N$.
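One simple way to obtain few-shot sets with this nesting property is to shuffle the training indices once and take prefixes of increasing length. The sketch below is an assumption about how such sampling could be done, not a description of the actual implementation; the function name, the seed, and the toy sizes are hypothetical.

```python
import random

def nested_few_shot_indices(num_train, shot_sizes, seed=42):
    """Return {N: indices} where a smaller N is always a subset of a larger N."""
    if max(shot_sizes) > num_train:
        raise ValueError("shot size exceeds the training set size")
    rng = random.Random(seed)
    perm = list(range(num_train))
    rng.shuffle(perm)  # a prefix of a uniform permutation is a uniform sample
    return {n: perm[:n] for n in sorted(shot_sizes)}

subsets = nested_few_shot_indices(num_train=1000, shot_sizes=[8, 64, 512])
print({n: len(idx) for n, idx in subsets.items()})
```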

Appendix C Baseline Implementation

In this section, we describe the hyperparameter configuration for the baseline models from two different categories: (1) traditional CTR models, and (2) LM-based models.

C.1. Traditional CTR Models

We choose the embedding size from {8, 16, 32} on BookCrossing dataset and {32, 64} on MovieLens-1M and MovieLens-25M datasets. The dropout rate is selected from {0.0, 0.1, 0.2}. The activation function is fixed to ReLU. The learning rate is set to $1\times 10^{-3}$ and AdamW (Loshchilov and Hutter, 2017) optimizer is used. On BookCrossing, the batch size is selected from {32, 64}. On MovieLens-1M and MovieLens-25M, the batch size is selected from {256, 512}. More model-specific hyperparameter settings are shown as follows:

• DeepFM (Guo et al., 2017). On BookCrossing, the size of the DNN layers is selected from {32, 64, 128}. The number of DNN layers is selected from {1, 2, 3}. On MovieLens-1M and MovieLens-25M, we choose the size of the DNN layers from {128, 256} and the number of DNN layers from {3, 6, 9, 12}.
• AutoInt (Song et al., 2019). On BookCrossing, the number of attention layers is selected from {1, 2} and the attention size is set to 32. On MovieLens-1M and MovieLens-25M, the number of attention layers is selected from {3, 6, 9, 12} and the attention size is selected from {64, 128, 256}. The number of attention heads is set to 1 in all cases.
• DCNv2 (Wang et al., 2021). On BookCrossing, the size of the DNN layers is selected from {32, 64, 128}. The numbers of DNN layers and cross layers are selected from {1, 2, 3}. On MovieLens-1M and MovieLens-25M, we choose the size of the DNN layers from {128, 256}, and the numbers of DNN layers and cross layers from {3, 6, 9, 12}.
• GRU4Rec (Hidasi et al., 2016). The number of GRU layers is selected from {1, 2, 3}. On BookCrossing, the GRU hidden size and DNN hidden size are selected from {32, 64}. On MovieLens-1M and MovieLens-25M, the GRU hidden size and DNN hidden size are selected from {64, 128, 256}.
• Caser (Tang and Wang, 2018). The number of vertical convolution kernels is selected from {2, 4, 8}. The number of horizontal convolution kernels is selected from {4, 8, 16}. The number of DNN layers is selected from {1, 2, 3}. The DNN hidden size is selected from {32, 64} on BookCrossing and {64, 128, 256} on MovieLens-1M and MovieLens-25M.
• SASRec (Kang and McAuley, 2018). The number of attention heads is selected from {1, 2, 4}. The number of attention layers is selected from {1, 2, 3}. The attention size is selected from {32, 64, 128} on BookCrossing and {64, 128, 256} on MovieLens-1M and MovieLens-25M. The number of DNN layers is selected from {1, 2, 3}. The DNN hidden size is selected from {32, 64} on BookCrossing and {64, 128, 256} on MovieLens-1M and MovieLens-25M.
• DIN (Zhou et al., 2018). The numbers of DIN attention layers and DNN layers are selected from {1, 2, 3}. The DNN hidden size is selected from {32, 64} on BookCrossing and {64, 128, 256} on MovieLens-1M and MovieLens-25M.
• SIM (Pi et al., 2020). The numbers of attention layers and DNN layers are selected from {1, 2, 3}. The DNN hidden size is selected from {32, 64} on BookCrossing and {64, 128, 256} on MovieLens-1M and MovieLens-25M.

C.2. LM-based Models

The structure of the pretrained language models is kept unchanged, and the AdamW (Loshchilov and Hutter, 2017) optimizer is used for all the baselines. The detailed training settings are as follows:

• CTR-BERT (Muhamed et al., 2021a). We maintain a two-tower model structure based on the BERT (Devlin et al., 2018) model to encode the user and item information respectively. The total number of tuning epochs is set to 10. The batch size is set to 1024. The learning rate is set to $5\times 10^{-5}$ with linear decay. The warmup ratio is 0.05.
• P5 (Geng et al., 2022a) is a unified sequence-to-sequence framework with T5 (Raffel et al., 2020) as the backbone pretrained language model for multiple recommendation tasks. In this paper, we leverage P5 for a single task only (i.e., CTR prediction). The total number of epochs is set to 10 with a batch size of 32. The learning rate is selected from $\{5\times 10^{-4},1\times 10^{-3}\}$ with linear decay. The warmup ratio is 0.05. Following P5’s official implementation, we also perform gradient clipping with a threshold of 1.0.
• PTab (Liu et al., 2022) adopts the common pretrain-finetune scheme based on the BERT (Devlin et al., 2018) model. PTab first further pretrains the BERT model with the classical masked language modeling objective on the textualized CTR data, and then finetunes BERT for downstream CTR prediction as a text classification problem. Following the original paper, we pretrain BERT for 10 epochs with a batch size of 1024. The learning rate for pretraining is set to $5\times 10^{-5}$ with linear decay. The warmup ratio is 0.05. As for finetuning, the total number of tuning epochs is set to 10 with a batch size of 1024. The learning rate for finetuning is initialized at $5\times 10^{-5}$ with linear decay. The warmup ratio is 0.01.
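The learning-rate settings above combine a warmup ratio with linear decay. The sketch below shows one common form of such a schedule. It assumes a linear warmup from zero to the peak learning rate, which the text does not state explicitly; the function name `lr_at_step` and the total step count are hypothetical.

```python
def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_ratio=0.05):
    """Linear warmup to peak_lr (assumed), followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

schedule = [lr_at_step(s, total_steps=1000) for s in (0, 25, 50, 525, 1000)]
print(schedule)
```

With 1,000 total steps and a warmup ratio of 0.05, the peak is reached at step 50 and the rate then decays to zero at the final step.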

Appendix D Additional Experiments

In this section, we further provide additional experiments to verify the following core points:

• The universality of the lifelong sequential behavior incomprehension problem and the generalization of our proposed ReLLa.
• Analysis of the model parameters and inference time.
• Ablation on PCA dimensionality and distance metrics for SUBR.
• Analysis and discussion of the potential reason for the incomprehension problem.

D.1. Universality & Generalization

We validate the universality of the lifelong sequential behavior incomprehension problem and the generalization of our proposed ReLLa by incorporating backbone LLMs of different architectures and sizes, including Falcon-7B (Almazrouei et al., 2023; https://huggingface.co/tiiuae/falcon-7b-instruct), Mistral-7B (Jiang et al., 2023; https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), Vicuna-7B (Chiang et al., 2023; https://huggingface.co/lmsys/vicuna-7b-v1.3), Vicuna-13B (Chiang et al., 2023; https://huggingface.co/lmsys/vicuna-13b-v1.3), and LLaMA-2-70B-Chat (Touvron et al., 2023; https://huggingface.co/meta-llama/Llama-2-70b-chat-hf).

D.1.1. Universality of the Incomprehension Problem

We first analyze the universality of the lifelong sequential behavior incomprehension problem for different LLMs on the MovieLens-1M dataset. We report the zero-shot AUC performance of different LLMs w.r.t. different lengths $K$ of the user behavior sequence, with $K$ selected from {5, 10, 15, 20, 25, 30}. It is worth noting that we downsample the test set to 10,000 samples for LLaMA-2-70B-Chat due to its time consumption, while the other four LLMs are still tested on the whole test set. The results are given in Table 4. We can observe that, for all five LLMs, the sequence length at which the performance peaks is small ($K=5$ for Falcon-7B and $K=15$ for the other four LLMs) and far from reaching the context limit. Therefore, we validate the universality of the lifelong sequential behavior incomprehension problem when adapting LLMs to recommendation domains. Besides, there exist performance differences among the LLMs, which may be related to their inherent instruction-following capabilities.

Table 4. Zero-shot AUC performance w.r.t. different sequence length $K$ for different LLMs on MovieLens-1M dataset. The peak performance for each LLM is given in bold.

LLM	MovieLens-1M
LLM	K=5	K=10	K=15	K=20	K=25	K=30
Falcon-7B	0.5906	0.5741	0.5583	0.5420	0.5468	0.5452
Mistral-7B	0.6566	0.6568	0.6670	0.6623	0.6612	0.6610
Vicuna-7B	0.6630	0.6586	0.6739	0.6527	0.6463	0.6412
Vicuna-13B	0.6807	0.6932	0.6993	0.6918	0.6937	0.6908
LLaMA2-70B	0.6259	0.6348	0.6421	0.6402	0.6339	0.6321
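As a quick check of where each model peaks, the AUC values of Table 4 can be scanned programmatically. The snippet below simply copies the numbers from the table; the variable names are illustrative.

```python
# AUC values copied from Table 4 (MovieLens-1M, zero-shot).
K_VALUES = [5, 10, 15, 20, 25, 30]
AUC = {
    "Falcon-7B":  [0.5906, 0.5741, 0.5583, 0.5420, 0.5468, 0.5452],
    "Mistral-7B": [0.6566, 0.6568, 0.6670, 0.6623, 0.6612, 0.6610],
    "Vicuna-7B":  [0.6630, 0.6586, 0.6739, 0.6527, 0.6463, 0.6412],
    "Vicuna-13B": [0.6807, 0.6932, 0.6993, 0.6918, 0.6937, 0.6908],
    "LLaMA2-70B": [0.6259, 0.6348, 0.6421, 0.6402, 0.6339, 0.6321],
}

# For each model, pick the sequence length K with the highest AUC.
peak_k = {
    model: K_VALUES[max(range(len(aucs)), key=aucs.__getitem__)]
    for model, aucs in AUC.items()
}
print(peak_k)
# {'Falcon-7B': 5, 'Mistral-7B': 15, 'Vicuna-7B': 15, 'Vicuna-13B': 15, 'LLaMA2-70B': 15}
```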

D.1.2. Generalization of ReLLa

We further investigate the generalization of our proposed ReLLa in terms of different backbone LLMs (i.e., model compatibility). We apply semantic user behavior retrieval (SUBR) and retrieval-enhanced instruction tuning (ReiT) to four LLMs, excluding LLaMA-2-70B-Chat due to computational resource constraints. The finetuning configuration is the same as in our previous experiment on Vicuna-13B. We set the length of the user sequence to 30. The few-shot settings <1% and <10% indicate 8,192 and 65,536 training samples, respectively. We also provide the performance of the best baseline model (i.e., SIM) for both full-shot and few-shot settings. We report the results in Table 5, from which we have the following observations and discussions:

• Compared with the original LLMs, ReLLa can improve the recommendation performance in both zero-shot and few-shot settings consistently and significantly.
• Mistral-7B, Vicuna-7B and Vicuna-13B with the ReiT (<10%) setting are able to outperform full-shot SIM, which is trained on the whole training set. This demonstrates the surprisingly high data efficiency of ReLLa.
• Although Falcon-7B with the ReiT (<10%) setting obtains much better performance than SIM under the same few-shot setting (<10%), it fails to defeat full-shot SIM as the other three LLMs do. We think the main reasons are twofold:
  (1) Model Capability. The recommendation performance of an LLM is highly correlated with its instruction-following capability. Falcon-7B itself may not be well suited to recommendation tasks.
  (2) Tuning Strategy. We simply set the hyperparameter configuration and prompt template to be the same as for Vicuna-13B, which might be suboptimal. We will proceed to optimize the performance to see whether few-shot Falcon-7B can defeat full-shot SIM or not.
• While the peak performance of the zero-shot LLMs is reached at $K=5$ or $K=15$, the four LLMs gain much better recommendation performance with a larger $K=30$ when equipped with ReLLa. Hence, we validate the generalization of ReLLa to address the user sequence incomprehension problem and improve the recommendation performance.

Table 5. The model compatibility of ReLLa w.r.t. different backbone LLMs on MovieLens-1M dataset with $K=30$. We also give the performance of SIM, which is the best baseline among traditional recommendation models.

Model		MovieLens-1M
Model		AUC	Log Loss	ACC
SIM	few-shot (<1%)	0.7352	0.6132	0.6743
	few-shot (<10%)	0.7414	0.6129	0.6756
	full-shot	0.7992	0.5387	0.7268
Falcon-7B	zero-shot	0.5906	0.7674	0.5436
	with SUBR	0.5964	0.7709	0.5437
	with ReiT (<1%)	0.7811	0.5589	0.7111
	with ReiT (<10%)	0.7870	0.5658	0.7072
Mistral-7B	zero-shot	0.6670	0.7556	0.4793
	with SUBR	0.6881	0.7321	0.5119
	with ReiT (<1%)	0.7905	0.5488	0.7210
	with ReiT (<10%)	0.8005	0.5388	0.7275
Vicuna-7B	zero-shot	0.6739	0.9510	0.5644
	with SUBR	0.6704	0.7745	0.5655
	with ReiT (<1%)	0.7918	0.5493	0.7196
	with ReiT (<10%)	0.8016	0.5365	0.7274
Vicuna-13B	zero-shot	0.6993	0.6291	0.6493
	with SUBR	0.7013	0.6250	0.6507
	with ReiT (<1%)	0.7927	0.5475	0.7196
	with ReiT (<10%)	0.8033	0.5362	0.7280

D.2. Model Parameter & Inference Time

We provide the complexity analysis on MovieLens-1M dataset by reporting the number of total parameters, the number of trainable parameters, and the averaged inference time per batch for both ReLLa and SIM (the best traditional recommendation baseline). We choose Vicuna-13B as the backbone LLM for ReLLa. The evaluation batch size is set to 512 and 4 for SIM and ReLLa, respectively. The run-time experiment is exclusively conducted on the same server with one GeForce RTX 4090 GPU.

Table 6. Complexity analysis on MovieLens-1M dataset.

Model	# Total Parameter	# Trainable Parameter	Inference Time
SIM	1.44M	1.44M	3.21ms
ReLLa	13B	650M	500ms

We report the results in Table 6. Although ReLLa achieves remarkable success on sequential recommendation in terms of both performance and sample efficiency, we have to admit that the inference speed is slower compared with traditional recommendation models. Hence, ReLLa is currently suitable for real-world applications with a high tolerance for latency, such as conversational search or conversational recommendation. Note that this computational limitation inherently stems from the large-scale property of LLMs. It is not unique to ReLLa but rather a common issue that our research community of LLM for recommendation should make joint efforts to overcome.
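Because the two models are evaluated with different batch sizes, it can help to normalize the figures of Table 6 to a per-sample cost. The snippet below is a back-of-the-envelope sketch that assumes the reported inference times are per batch, with batch sizes of 512 (SIM) and 4 (ReLLa) as stated above; the variable names are illustrative.

```python
# Figures copied from Table 6 and the stated evaluation batch sizes.
SIM = {"batch_time_ms": 3.21, "batch_size": 512}
RELLA = {"batch_time_ms": 500.0, "batch_size": 4}

def per_sample_ms(entry):
    """Average inference time per sample, in milliseconds."""
    return entry["batch_time_ms"] / entry["batch_size"]

sim_ms = per_sample_ms(SIM)      # roughly 0.0063 ms per sample
rella_ms = per_sample_ms(RELLA)  # 125 ms per sample
slowdown = rella_ms / sim_ms     # roughly 2e4 times slower per sample

# Share of ReLLa's parameters that are trainable (650M out of 13B).
trainable_fraction = 650e6 / 13e9

print(f"SIM: {sim_ms:.4f} ms, ReLLa: {rella_ms:.1f} ms, slowdown: {slowdown:.0f}x")
print(f"trainable fraction: {trainable_fraction:.0%}")
```

Under these assumptions, the per-sample gap is about four orders of magnitude, which is consistent with the latency discussion above.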

D.3. Ablation on PCA & Distance Metric

We conduct an ablation study to investigate the impact of the PCA dimensionality and the distance metric for SUBR, respectively. We choose Vicuna-13B as the backbone LLM for ReLLa.

D.3.1. Impact of PCA Dimensionality

To offer a deeper insight into the impact of PCA dimensionality, we evaluate the performance of ReLLa w.r.t. different PCA dimensionalities on the MovieLens-1M dataset under both zero-shot and few-shot (8192-shot, <1%) settings. The results are reported in Table 7. We can observe that PCA dimensionality 512 generally achieves the best performance: it obtains the best Log Loss and ACC in both settings and the best AUC in the few-shot setting, although the smaller dimensionalities yield a slightly higher zero-shot AUC. The dimension reduction brought by PCA also implies a loss of semantic information. Hence, a smaller dimensionality tends to degrade the performance of ReLLa. Since a dimensionality larger than 512 might lead to heavy storage and computing costs, we consider 512 a reasonable choice of PCA dimensionality to balance the performance against the storage and computing costs.

Table 7. Ablation study w.r.t. different PCA dimensionalities for ReLLa on MovieLens-1M dataset under both zero-shot and few-shot (<1%) settings.

Setting	PCA Dim.	MovieLens-1M
Setting	PCA Dim.	AUC	Log Loss	ACC
zero-shot	512	0.7013	0.6250	0.6507
	256	0.7064	0.6377	0.6357
	128	0.7063	0.6379	0.6351
	64	0.7057	0.6375	0.6349
few-shot	512	0.7927	0.5475	0.7196
	256	0.7917	0.5476	0.7098
	128	0.7897	0.5606	0.7099
	64	0.7901	0.5629	0.7099
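For readers unfamiliar with the dimension-reduction step, PCA can be sketched with a plain SVD. The snippet below is a generic illustration rather than the pipeline used for ReLLa: the function name `pca_reduce` is hypothetical, the item embeddings are random toy data, and the sizes (64 reduced to 16) are chosen only to keep the example fast.

```python
import numpy as np

def pca_reduce(embeddings, dim):
    """Project row vectors onto their top-`dim` principal components."""
    # Center the data so that the SVD captures variance around the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(200, 64))  # hypothetical item embeddings
reduced = pca_reduce(item_emb, dim=16)
print(reduced.shape)
```

Keeping fewer components discards the directions with the smallest variance, which is the source of the information loss discussed above.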

D.3.2. Impact of Distance Metric

We empirically choose the cosine distance as the default choice for ReLLa to measure the semantic relevance, as it has been widely used to evaluate textual similarity in the realm of natural language processing (NLP) (Wang et al., 2022a; Reimers and Gurevych, 2019; Ni et al., 2021). To offer a deeper insight, we compare three different distance metrics: (1) cosine distance, (2) L2 distance, and (3) L1 distance. We report the performance of ReLLa w.r.t. different distance metrics in both zero-shot and few-shot (<1%) settings on MovieLens-1M dataset. The results are given in Table 8. We have the following discussions:

• Cosine similarity inherently normalizes the vectors. It focuses on the angular difference between vectors rather than their magnitude. This means that even if two vectors differ greatly in size, they can still have a high cosine similarity if they point in similar directions.
• In high-dimensional spaces, L1 and L2 distances tend to suffer from the “curse of dimensionality”, where the distances between all pairs of points become similar as the number of dimensions increases. This makes these distances less effective measures of similarity in high dimensions. Cosine similarity is less sensitive to this issue and therefore remains an effective measure in high-dimensional spaces.
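The magnitude-invariance argument in the first point can be verified on a toy pair of vectors. The sketch below uses hypothetical helper names and compares the three metrics on two vectors that point in the same direction but differ in scale.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a, b):
    return float(np.linalg.norm(a - b))

def l1_distance(a, b):
    return float(np.abs(a - b).sum())

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction, ten times the magnitude

print(cosine_distance(a, b), l2_distance(a, b), l1_distance(a, b))
```

The cosine distance is zero up to floating-point error, while the L2 and L1 distances are large because they also measure the difference in magnitude.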

Table 8. Ablation study w.r.t. different distance metrics for ReLLa on MovieLens-1M dataset under both zero-shot and few-shot (<1%) settings.

Setting	Distance	MovieLens-1M
Setting	Distance	AUC	Log Loss	ACC
zero-shot	Cosine	0.7013	0.6250	0.6507
	L2	0.6975	0.6356	0.6386
	L1	0.6811	0.6388	0.6339
few-shot	Cosine	0.7927	0.5475	0.7196
	L2	0.7872	0.5762	0.6944
	L1	0.7833	0.5598	0.7119

D.4. Potential Reason for Incomprehension

We give our in-depth analysis and conjecture about the potential reason for the incomprehension problem of LLMs in recommendation. We argue that the incomprehension problem arises from the fact that LLMs struggle to comprehend highly heterogeneous user behavior sequences and thus fail to extract meaningful information from them. The heterogeneity of a sequence can be defined as the diversity of user behaviors within that sequence, such as different item genres. The longer the user’s behavior sequence, the higher the probability of it exhibiting a high level of heterogeneity, and consequently, the greater the difficulty for LLMs to comprehend. Based on the conjecture above, retrieval, as the core technique for ReLLa, is essentially a means of homogenizing the user sequence. It aggregates homogeneous behaviors that are similar to the target item into a certain sequence and removes those unrelated heterogeneous behaviors, thus improving the comprehension capability of LLMs for long user sequences.

We provide an empirical study on MovieLens-1M dataset to demonstrate the homogenizing effect of retrieval. Here, we define the heterogeneity score as the number of unique movie genres in a given sequence of behaviors (i.e., movies). We illustrate two examples as follows:

• $[\text{Fiction, Comedy, Comedy, Family}]$ $\rightarrow$ heterogeneity score=3
• $[\text{Fiction, Fiction, Child, Fiction}]$ $\rightarrow$ heterogeneity score=2
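The heterogeneity score defined above is straightforward to compute. The sketch below assumes each behavior carries a single genre label, as in the two examples; how movies with multiple genres are handled is not specified here, and the function names are illustrative.

```python
def heterogeneity_score(genres):
    """Number of unique genres in a behavior sequence."""
    return len(set(genres))

def average_heterogeneity(sequences):
    """Mean heterogeneity score over a collection of sequences."""
    return sum(heterogeneity_score(s) for s in sequences) / len(sequences)

seq_a = ["Fiction", "Comedy", "Comedy", "Family"]
seq_b = ["Fiction", "Fiction", "Child", "Fiction"]
print(heterogeneity_score(seq_a), heterogeneity_score(seq_b))  # 3 2
print(average_heterogeneity([seq_a, seq_b]))                   # 2.5
```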

In Table 9, we report the averaged heterogeneity score of two sequence types w.r.t. different lengths $K$: (1) sequences constructed from the top-recent behaviors, and (2) sequences consisting of the top-relevant behaviors generated by our proposed semantic user behavior retrieval (SUBR). From Table 9, we give the following discussions:

• We can see that the heterogeneity score gradually increases as the length $K$ grows. We argue that LLMs might fail to comprehend the given sequence once the heterogeneity score exceeds a certain threshold, and therefore we observe the phenomenon that the performance of LLMs only peaks at around $K=15$, as illustrated in Figure 1.
• When equipped with our proposed SUBR, the heterogeneity score of top-relevant sequences largely decreases compared with the top-recent sequences. The lower the heterogeneity level of the sequence, the easier it is for LLMs to comprehend, and consequently, the better performance ReLLa can achieve with a larger $K$.

Table 9. The averaged heterogeneity scores of two sequence types w.r.t. different length $K$ on MovieLens-1M dataset.

Seq. Type	MovieLens-1M
Seq. Type	K=5	K=10	K=15	K=20	K=25	K=30
Top Recent (Origin)	2.91	4.19	5.09	5.80	6.39	6.90
Top Relevant (Retrieval)	2.44	3.37	4.01	4.51	4.94	5.32
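To quantify the homogenizing effect, the relative reduction in heterogeneity can be derived from Table 9. The snippet below copies the table values and computes the reduction for each $K$; the variable names are illustrative.

```python
# Averaged heterogeneity scores copied from Table 9 (MovieLens-1M).
K_VALUES = [5, 10, 15, 20, 25, 30]
TOP_RECENT = [2.91, 4.19, 5.09, 5.80, 6.39, 6.90]
TOP_RELEVANT = [2.44, 3.37, 4.01, 4.51, 4.94, 5.32]

# Relative reduction of the heterogeneity score brought by retrieval.
reduction = {
    k: (recent - relevant) / recent
    for k, recent, relevant in zip(K_VALUES, TOP_RECENT, TOP_RELEVANT)
}
for k, r in reduction.items():
    print(f"K={k}: {r:.1%}")
```

For every $K$ in the table the top-relevant sequences are less heterogeneous than the top-recent ones, and the reduction at $K=30$ is a little under 23%.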
