Enhancing Content-based Recommendation via Large Language Model
Abstract.
In real-world applications, users exhibit different behaviors when they interact with items, including implicit click/like interactions and explicit comment/review interactions. Nevertheless, almost all recommender works focus on describing user preferences through implicit click/like interactions in order to find the synergy among people. For content-based explicit comment/review interactions, some works attempt to mine semantic knowledge from them to enhance recommender models. However, they still neglect the following two points: (1) Content semantics are universal world knowledge; how do we extract multi-aspect semantic information to empower different domains? (2) The user/item ID feature is a fundamental element of recommender models; how do we align the ID and content semantic feature spaces? In this paper, we propose LoID, a ‘plugin’ semantic knowledge transfer method that includes two major components: (1) LoRA-based large language model pretraining to extract multi-aspect semantic information; (2) an ID-based contrastive objective to align the two feature spaces. We conduct extensive experiments against SOTA baselines on real-world datasets, and the detailed results demonstrate significant improvements of our method LoID.
1. Introduction
Background. With the boom of digital information, billions of user requests are produced daily, so recommender systems (RSs) have become an integral part of Internet platforms. To capture users’ interests more accurately, RSs have gone through several milestones, such as logistic regression with hand-crafted features (e.g., FM (Rendle, 2010)), neural networks (e.g., WideDeep (Cheng et al., 2016), YoutubeNet (Covington et al., 2016)), sequential signals (e.g., DIN (Zhou et al., 2018), SIM (Pi et al., 2020)), and multi-hop graph signals (e.g., PinSage (Ying et al., 2018), DGRec (Yang et al., 2023)). In retrospect, these effective methods are based on the collaborative filtering (CF) idea and have extended the boundaries of RSs. However, the CF framework also limits them in cold-start and data-sparsity scenarios. The reason is that CF aims to mine user/item patterns from interaction data in order to discover and recommend high-click candidate items for a user, which makes it hard to understand users’ fine-grained and multi-aspect interests. In fact, beyond the massive user-item interaction logs, users often leave reviews/comments that further explain their feelings about an interaction, which provides an explicit way to understand users’ complex interests in the language semantic space.
Related work. To extract valuable content semantic information, the pioneering works are formulated as a rating prediction task: for a user-item pair in the test set, given the historical user/item contents in the training set (e.g., reviews), predict the possible interaction rating. Early on, DeepCoNN (Zheng et al., 2017) employed two convolutional neural network (CNN) towers to aggregate user and item content tokens separately and measured their dot-product similarity, and D-Attn (Seo et al., 2017) extended DeepCoNN by introducing local and global attention mechanisms to replace the CNNs for token aggregation. Following D-Attn, ALFM (Cheng et al., 2018) and ANR (Chin et al., 2018) focused on extracting fine-grained multi-aspect semantic information and assigning different weights to the aspects. A recent advance is RGCL (Shuai et al., 2022), which used BERT to generate user-item content scores and then leveraged them as edge weights of the user-item graph, on which a multi-hop graph neural network makes the prediction.
Motivation. Although these methods improve model capability with content information, they ignore the following problems:
• Transferring semantic knowledge across domains: In practice, an RS needs to serve several domains simultaneously, such as electronics, clothing, and books. Since different domains express different aspects of users’ interests (e.g., whether food is delicious or a price is appropriate), previous methods need to re-tune their semantic component for each domain, which is time-consuming.
• Enhancing the correlation between content and ID: The user-item content and ID-interaction information can be seen as two different modalities (Li et al., 2023) connected by users. Nevertheless, previous methods use separate components to model the content and ID spaces, while ignoring how to exploit their correlation and align them in a unified space.
Our Work. To alleviate the above problems, we propose LoID, an LLM-based (Brown et al., 2020) model that transfers semantic knowledge across domains based on LoRA (Hu et al., 2021) and aligns content/ID information with contrastive objectives (Radford et al., 2021). It mainly includes two steps:
(1) Pretraining semantic ‘plugin’: On the one hand, ideal semantic information should play a universal role and support all domains. On the other hand, each domain has its own main aspects of semantic information that are related to recommendation. Considering this trade-off, we borrow the LoRA strategy to train a small set of parameters for each domain, which serve as plugins that can be added seamlessly to the target domain without further re-training.
(2) Aligning the content/ID feature spaces: As discussed above, the content and ID information can be seen as two modalities of user-item interaction. To minimize their gap, we introduce the contrastive idea of maximizing the mutual information between content and ID features to align their feature spaces.
Finally, to validate the effectiveness of LoID, we extensively test it on 11 datasets from different domains and show its superior performance.
Contributions. Our contributions are summarized as follows:
• We propose a ‘plugin’ idea to transfer semantic knowledge, which sheds light on building a new paradigm for recommendation.
• We devise a novel content/ID feature alignment objective.
• We conduct detailed analyses with SOTA methods and LLMs.
2. Preliminary
2.1. Problem Statement
This work considers a simple task: a single-domain dataset contains user-item ratings and textual contents, and each user-item interaction comprises four elements: the user, the item, the corresponding rating, and the textual content token list left by the user. Our model aims to predict ratings from historical content and user/item ID information.
Besides, to test the effectiveness of the ‘plugin’ idea, we further consider multi-domain scenarios: given source-domain content information, predict target-domain rating scores.
2.2. Low-Rank Adaptation (LoRA)
Before going on, we first explain LoRA (Hu et al., 2021) to show the basic idea of how to quickly tune an LLM (Devlin et al., 2018). The unit block of an LLM, the transformer (Vaswani et al., 2017), consists of two parts: (masked) attention and MLPs (FFN). The two parts introduce four pre-trained parameter matrices. The challenge is then how to update those matrices without re-training them or introducing a large number of new parameters. To this end, LoRA assigns two small matrices $B$ and $A$ to any pre-trained matrix $W$:
$$W' = W + BA \tag{1}$$
where $W' \in \mathbb{R}^{d \times k}$ is the modified weight matrix, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are low-rank matrices with $r \ll \min(d, k)$. LoRA freezes the original large parameter matrix $W$, introduces $B$ and $A$, and updates $W$ only through the product $BA$. Next, we can fine-tune the LLM on any supervised task without updating all parameters as follows:
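$$\Delta\Theta^{*} = \arg\min_{\Delta\Theta} \mathcal{L}_{\text{Task}}(\Theta, \Delta\Theta) \tag{2}$$
where $\Delta\Theta$ denotes the extra added parameters and $\Theta$ denotes all (frozen) pre-trained parameters (in our work, we introduce only 4M extra parameters to tune the 110M-parameter BERT).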
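To make the low-rank update concrete, the following minimal sketch (plain Python, no external libraries) builds an effective weight as a frozen matrix plus the product of two small factors and compares the parameter counts. The matrix sizes, the rank, the helper names `matmul` and `lora_weight`, and the zero initialization of one factor (the choice made in the LoRA paper) are illustrative assumptions, not the configuration used in this work.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, B, A):
    """Effective weight W + B A; W stays frozen, only B and A would be trained."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

random.seed(0)
d, k, r = 8, 8, 2  # illustrative sizes; the rank r is much smaller than d and k

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen pre-trained matrix
B = [[0.0] * r for _ in range(d)]                                  # zero-initialized factor
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small random factor

W_prime = lora_weight(W, B, A)

full_params = d * k          # parameters touched by full fine-tuning
lora_params = d * r + r * k  # parameters touched by the low-rank update
```

Because one factor starts at zero, the effective weight initially equals the frozen matrix, so tuning starts from the pre-trained behavior. In this toy setting the low-rank factors hold 32 parameters against 64 for the full matrix; the gap widens quickly as the matrix grows while the rank stays small.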
3. Methodology
3.1. Overview
The architecture of our method LoID is illustrated in Figure 1 and includes two processes. In part (a), we first train the LoRA parameters of each source domain as ‘plugins’ that enhance target-domain prediction without further re-training. In part (b), we extract the historical contents of each user/item to obtain user/item semantic representations, and then align the ID and semantic spaces to make target-domain rating predictions. Note that the source-domain LoRA plugin is optional.
3.2. LoRA-based Encoder Pretraining
Source LoRA. To capture users’ behavior data from the source domain, we leverage the LoRA strategy to pre-train on the source domain (for better understanding, we use BERT as the LLM to describe our method). In the pre-training task, we use rating prediction as the objective and predict the rating directly from the tokens:
$$\hat{y} = \mathrm{Predict}(\mathrm{BERT}(T)) \tag{3}$$
where $T$ is the textual content token list, BERT is the LLM forward procedure, and Predict is an MLP that generates the prediction score $\hat{y}$. We then adopt the Mean Squared Error (MSE) loss to optimize the LoRA parameters:
$$\mathcal{L}_{\text{source}} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}_n - y_n\right)^2 \tag{4}$$
where $N$ is the number of samples in the source domain, and $\hat{y}_n$ / $y_n$ represent the predicted/real rating of the $n$-th sample, respectively. After pre-training the LoRA parameters of all source domains, we can use them as plugins to enhance BERT when dealing with other target-domain tasks without further re-training.
3.3. Re-LoRA In Target Domain
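Target LoRA. To enable the model to adapt to the target domain, we again introduce a LoRA module (i.e., Target LoRA) in the target domain and fine-tune its parameters. In the Re-LoRA process, we freeze the parameters of BERT and the source LoRA, so the only part that needs training is the Target LoRA:
$$\Theta_{t} = \underbrace{\Theta + \Delta\Theta_{s}}_{\text{frozen}} + \underbrace{\Delta\Theta_{t}}_{\text{trainable}} \tag{5}$$
where $\Theta_{t}$ denotes the LLM parameters of the target domain, $\Theta$ the pre-trained BERT parameters, and $\Delta\Theta_{s}$ / $\Delta\Theta_{t}$ the source/target LoRA parameters. Next, we explain how to align the language semantic and ID spaces.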
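As a toy illustration of the Re-LoRA idea described above, the sketch below stacks a frozen base weight, a frozen source plugin, and a trainable target update, and then runs gradient descent on the target update only. Everything here is a simplifying assumption made for readability: a scalar-output linear model stands in for BERT, the LoRA updates are rank-1, the gradients are derived by hand for a squared-error loss, and the names `effective_weight` and `train_target` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def effective_weight(w0, plugins):
    """Frozen base weights plus the sum of rank-1 updates b * a of every plugin."""
    w = list(w0)
    for p in plugins:
        w = [wi + p["b"] * ai for wi, ai in zip(w, p["a"])]
    return w

def train_target(w0, source, target, data, lr=0.05, epochs=200):
    """SGD on squared error; only the entries of `target` are updated."""
    for _ in range(epochs):
        for x, y in data:
            err = dot(effective_weight(w0, [source, target]), x) - y
            grad_b = 2 * err * dot(target["a"], x)
            grad_a = [2 * err * target["b"] * xi for xi in x]
            target["b"] -= lr * grad_b
            target["a"] = [ai - lr * gi for ai, gi in zip(target["a"], grad_a)]
    return target

w0 = [0.5, -0.2]                                # frozen base weights (stand-in for BERT)
source = {"b": 1.0, "a": [0.1, 0.3]}            # frozen source-domain plugin
target = {"b": 0.0, "a": [0.3, 0.3]}            # trainable target-domain update
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.5)]   # toy (features, rating) pairs

source_before = {"b": source["b"], "a": list(source["a"])}
train_target(w0, source, target, data)
w_eff = effective_weight(w0, [source, target])
```

After training, the effective weights fit the toy ratings while the base weights and the source plugin are untouched. Passing several frozen plugins in the list corresponds to merging multiple source-domain LoRAs under the same additive assumption.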
3.4. ID-based Contrastive Learning
As the fundamental element of CF-based recommenders, user/item IDs are indispensable for capturing personalized signals. Aligning the semantic and ID spaces is therefore the key to making content information more competitive in industrial RSs.
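User/Item Text Encoder. Considering the computation cost of the LLM, we randomly sample a fixed number of historical contents in the target domain for each user/item to describe the holistic preferences of the user and the holistic properties of the item:
$$\mathbf{e}_u = \mathrm{BERT}(T_u; \Theta_{t}), \qquad \mathbf{e}_i = \mathrm{BERT}(T_i; \Theta_{t}) \tag{6}$$
where $\mathbf{e}_u$ and $\mathbf{e}_i$ are the embeddings of the user’s and item’s content, $T_u$ and $T_i$ are the corresponding textual content token lists, and $\Theta_{t}$ stands for the LLM parameters on the target domain.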
Attention Layer. On top of the user/item historical content semantic information, we devise a novel attention mechanism to exchange the user/item semantic and ID information. Specifically, we integrate the semantic information into the item representation, and vice versa for the user representation. Through this mechanism, the ID embeddings and content tokens are fused, after which we obtain the updated item representation. Similarly, on the user side, we obtain the updated user representation as follows:
(7) |
where and are the updated item and user representations.
Contrastive Loss. The attention mechanism primarily focuses on different parts of the input sequence but does not explicitly consider the relationship between users and items. We therefore conduct contrastive learning to enhance the similarity between interacting users and items. Each updated item/user representation $\tilde{\mathbf{e}}$ is treated as an anchor; we pair it with the original user/item representation as a positive sample $\mathbf{e}^{+}$, and with another representation in the batch as a negative sample $\mathbf{e}^{-}$. The goal is to minimize the distance $d(\tilde{\mathbf{e}}, \mathbf{e}^{+})$ and maximize $d(\tilde{\mathbf{e}}, \mathbf{e}^{-})$, so that the representations of similar items/users are closer in terms of the distance $d(\cdot, \cdot)$:
$$\mathcal{L}_{\text{cl}} = \max\left(d(\tilde{\mathbf{e}}, \mathbf{e}^{+}) - d(\tilde{\mathbf{e}}, \mathbf{e}^{-}) + m,\ 0\right) \tag{8}$$
where $\tilde{\mathbf{e}}$ is the updated item/user representation, $\mathbf{e}^{+}$ is the original feature representation of the user/item associated with the current content (the positive sample), $\mathbf{e}^{-}$ is the negative sample, and $m$ is a margin constant that enforces a ‘safe distance’ constraint on correctly separated samples, ensuring that the learned representations are sufficiently robust.
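The margin-based objective described above can be sketched in a few lines. The snippet assumes a Euclidean distance and picks the next row in the batch as the negative; the paper only states that the negative is another representation in the batch, so both choices, as well as the function names, are illustrative assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

def batch_loss(updated, original, margin=1.0):
    """Anchor: updated[j]; positive: original[j]; negative: original row of the next sample."""
    n = len(updated)
    losses = [
        triplet_margin_loss(updated[j], original[j], original[(j + 1) % n], margin)
        for j in range(n)
    ]
    return sum(losses) / n

anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
loss_small_margin = triplet_margin_loss(anchor, positive, negative, margin=1.0)
loss_large_margin = triplet_margin_loss(anchor, positive, negative, margin=5.0)
```

In the example the positive lies at distance 1 and the negative at distance 5 from the anchor, so a margin of 1 is already satisfied and yields no loss, whereas a margin of 5 is violated by exactly 1.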
3.5. Model Optimization
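Finally, we concatenate the updated user and item representations $\tilde{\mathbf{e}}_u$ and $\tilde{\mathbf{e}}_i$ to obtain the final score, and optimize it with the MSE loss:
$$\hat{y} = \mathrm{Predict}\left([\tilde{\mathbf{e}}_u ; \tilde{\mathbf{e}}_i]\right), \qquad \mathcal{L}_{\text{pred}} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}_n - y_n\right)^2 \tag{9}$$
where $N$ is the number of samples, and $\hat{y}_n$ and $y_n$ represent the predicted and real ratings, respectively.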
Throughout the training process, the total loss comprises the MSE loss of rating prediction and the contrastive loss incurred during contrastive learning:
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\, \mathcal{L}_{\text{cl}} \tag{10}$$
where $\mathcal{L}_{\text{pred}}$ denotes the prediction loss, $\lambda$ is the weight assigned to the contrastive loss, and $\mathcal{L}_{\text{cl}}$ represents the contrastive loss.
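As a small worked example of how the two terms combine, the sketch below adds a weighted contrastive term to the MSE of a toy batch. The weight value `lam=0.2` and the toy numbers are placeholders chosen for illustration, not settings reported by the authors.

```python
def mse(pred, real):
    """Mean squared error between predicted and real ratings."""
    return sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real)

def total_loss(pred, real, contrastive, lam=0.2):
    """Prediction loss plus the contrastive loss scaled by the weight `lam`."""
    return mse(pred, real) + lam * contrastive

# Two toy samples: one rating is off by 1, the other is exact.
prediction_loss = mse([4.0, 3.0], [5.0, 3.0])
loss = total_loss([4.0, 3.0], [5.0, 3.0], contrastive=1.0, lam=0.2)
```

Here the prediction loss is 0.5 and the contrastive term contributes 0.2, so the weight directly controls how strongly the alignment objective competes with rating accuracy.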
4. Experiments
4.1. Experimental Setup
Table 1. Performance (MSE) of LoID and the baselines on eleven datasets.
Datasets | DeepCoNN | D-Attn | ALFM | ANR | BiGI | RGCL | LoID | LoID(Elec) | LoID(Elec & Movie) |
Electronics | 1.659 | 1.744 | 1.563 | 1.445 | 1.1433 | 0.9621 | 0.8662 | - | - |
Movies & TV | 1.207 | 1.246 | 1.193 | 1.112 | 1.3489 | 0.9239 | 0.8353 | 0.8167 | - |
CDs & Vinyl | 0.980 | 1.014 | 0.956 | 0.914 | 1.1831 | 0.8180 | 0.6516 | 0.6295 | 0.6217 |
Amazon Instant Video | 1.178 | 1.213 | 1.075 | 1.009 | 1.2437 | 0.9357 | 0.5177 | 0.4891 | 0.4864 |
Baby | 1.442 | 1.507 | 1.359 | 1.258 | 1.5166 | 1.1414 | 0.7373 | 0.6911 | 0.6876 |
Digital Music | 0.749 | 0.775 | 0.725 | 0.688 | 0.7731 | 0.7735 | 0.4696 | 0.4380 | 0.4365 |
Musical Instruments | 1.160 | 1.224 | 1.072 | 1.034 | 1.3580 | 0.7211 | 0.5830 | 0.5179 | 0.5166 |
Office Products | 1.569 | 1.650 | 1.474 | 1.337 | 1.6902 | 0.7001 | 0.6150 | 0.5909 | 0.5573 |
Patio, Lawn & Garden | 1.622 | 1.696 | 1.510 | 1.403 | 1.7205 | 0.7049 | 0.6622 | 0.6358 | 0.6129 |
Pet Supplies | 1.565 | 1.628 | 1.485 | 1.377 | 1.6185 | 1.2380 | 0.7685 | 0.7304 | 0.7249 |
Video Games | 1.498 | 1.533 | 1.383 | 1.292 | 1.4450 | 1.0826 | 0.7822 | 0.7541 | 0.7409 |
Remark: ‘-’ indicates that this domain serves as the source domain.
4.1.1. Datasets
We select 11 representative category datasets (http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/) from the Amazon dataset (Smith and Linden, 2017). Among them, the smallest dataset, Musical Instruments, includes 339,231 users, 83,046 items, and 500,176 ratings, while the largest dataset, Electronics, comprises 4,201,696 users, 476,002 items, and 7,824,482 ratings. Considering the different data scales, we select the three largest datasets (i.e., Electronics, Movies, and CDs) as our source domains, since they contain relatively rich records of user-item contents and ratings. Following (Chin et al., 2018; Catherine and Cohen, 2017; Chen et al., 2018; Seo et al., 2017), we randomly partition each dataset into training, validation, and test sets with an 8:1:1 ratio.
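A random 8:1:1 partition can be sketched as follows. The function name `split_811`, the fixed seed, and the choice to give any rounding remainder to the test set are assumptions for illustration; the paper does not specify these details.

```python
import random

def split_811(records, seed=0):
    """Shuffle the records and cut them into train/validation/test with an 8:1:1 ratio."""
    rng = random.Random(seed)   # local generator so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_valid = n // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # receives any remainder
    return train, valid, test

records = list(range(100))  # stand-in for (user, item, rating, review) tuples
train, valid, test = split_811(records)
```

With 100 toy records this yields 80, 10, and 10 samples, and every record lands in exactly one of the three sets.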
4.1.2. Baselines
We compare LoID with several baselines, which can be categorized into three classes: (1) Single-aspect methods, such as DeepCoNN (Zheng et al., 2017) and D-Attn (Seo et al., 2017), extract features from the historical semantic information of users and items using CNNs and attention mechanisms. (2) Multi-aspect methods, including ALFM (Cheng et al., 2018) and ANR (Chin et al., 2018), aim to extract multiple semantic aspects and their respective importance. (3) GNN-based methods, such as BiGI (Cao et al., 2021) and RGCL (Shuai et al., 2022), integrate information from adjacent nodes in the collaborative filtering graph.
4.1.3. Parameter Settings
In our method, the embedding size is fixed at 768, the dropout rate is fixed at 0.5, the learning rate is set to 1e-5, the batch size is fixed at 4, and training spans 5 epochs. In the LoRA module, the low-rank hyper-parameter is selected from 16 to 48 with step length 8; the selected from 0.2 to 0.5 with step length 0.1; the number of extracted user/item historical contents is set as . In the following section, we report LoID results under , , and by default.
In the experiments, we directly report the DeepCoNN, D-Attn, ALFM, and ANR results from the original literature (Chin et al., 2018). For BiGI, the feature and hidden layer dimensions are both set to 128, the dropout rate is fixed to 0.3, the step size is 0.001, and the number of GNN layers is set to 2. For RGCL, the number of GCN aggregation units and the number of output units are both set to 64, and the dropout rates of edges, GCN, and nodes are set to 1.0, 0.7, and 0.3, respectively. For our method and all baselines, we use the Adam (Kingma and Ba, 2015) algorithm to update parameters.
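Table 2. Similarity (Sim) between each target domain and each source domain, and the target-domain MSE with its relative improvement over the original LoID.
Target dataset | Electronics (Sim) | Electronics (MSE) | Movies (Sim) | Movies (MSE) | CDs (Sim) | CDs (MSE) |
Amazon Instant Video | 0.07 | 0.489 (+5%) | 0.28 | 0.473 (+9%) | 0.16 | 0.485 (+6%) |
Digital Music | 0.08 | 0.438 (+6%) | 0.15 | 0.436 (+7%) | 0.33 | 0.425 (+9%) |
Musical Instruments | 0.22 | 0.518 (+11%) | 0.11 | 0.546 (+6%) | 0.21 | 0.540 (+7%) |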
4.2. Performance Comparisons
Table 1 shows the performance of LoID on eleven datasets in terms of MSE. Among the baselines, aspect-aware methods such as ALFM and ANR consistently outperform DeepCoNN and D-Attn. We attribute this to the limitations of DeepCoNN and D-Attn, which lack a comprehensive model of the intricate decision-making process in user-item interactions. In addition, RGCL outperforms the single-aspect and multi-aspect methods on all datasets except Digital Music, indicating the strength of graph neural networks. However, BiGI, despite incorporating graph networks, falls short of the single-aspect methods on most datasets; since BiGI does not use review text, this suggests that incorporating user textual semantic information is effective. We observe that LoID achieves statistically significant improvements over all SOTA baselines. This highlights that, when using user contents, ID information should not be overlooked; instead, aligning it with semantic information in the same space improves performance. Moreover, the LoID (Elec) results are superior to those of LoID, showing that information from a different domain further enhances performance. This demonstrates the effectiveness of extracting multi-aspect semantic information to empower various domains, and of our plugin idea.
Multi-LoRA Merging. When we employ multiple LoRAs from different domains, i.e., LoID (Elec & Movie), the results are superior to those of a single LoRA, i.e., LoID (Elec). We believe the reason is that multiple LoRAs bring richer and more extensive user data, which describes historical behavior and preference tendencies more accurately.
4.3. Discussion of Domain Correlation Effect
To explore the semantic relationship between domains, we conduct a more granular study of how to further enhance the transfer of semantic information across domains. We consider whether the recommendation performance is correlated with the similarity between the target and source domains. First, we randomly select 100 reviews from each dataset, encode them with Sentence-BERT (Reimers and Gurevych, 2019), and quantify the cosine similarity between datasets. Then, we use different datasets as source domains and report the improvements over the original LoID (Table 2). For each target domain, we observe that a higher similarity to the source domain comes with a larger performance improvement ratio. This implies that the transfer in our method becomes more effective when domains exhibit higher similarity.
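The trend can be checked directly on the numbers reported in Table 2. The sketch below copies the similarity and MSE values from the table and tests, for each target domain, whether the MSE decreases as the similarity to the source domain increases. With only three source domains per target this is a sanity check rather than a statistical test, and the helper names are hypothetical.

```python
# (similarity to the source domain, MSE on the target domain), copied from Table 2
TABLE2 = {
    "Amazon Instant Video": {"Electronics": (0.07, 0.489), "Movies": (0.28, 0.473), "CDs": (0.16, 0.485)},
    "Digital Music": {"Electronics": (0.08, 0.438), "Movies": (0.15, 0.436), "CDs": (0.33, 0.425)},
    "Musical Instruments": {"Electronics": (0.22, 0.518), "Movies": (0.11, 0.546), "CDs": (0.21, 0.540)},
}

def most_similar_source(rows):
    """Source domain with the highest similarity to the target."""
    return max(rows, key=lambda source: rows[source][0])

def best_source(rows):
    """Source domain that gives the lowest MSE on the target."""
    return min(rows, key=lambda source: rows[source][1])

def mse_decreases_with_similarity(rows):
    """True if sorting by ascending similarity yields strictly decreasing MSE."""
    mses = [m for _, m in sorted(rows.values())]
    return all(a > b for a, b in zip(mses, mses[1:]))

monotone = {target: mse_decreases_with_similarity(rows) for target, rows in TABLE2.items()}
agree = {target: most_similar_source(rows) == best_source(rows) for target, rows in TABLE2.items()}
```

For all three targets the ordering is monotone, and the most similar source domain is also the one that gives the lowest MSE (Movies for Amazon Instant Video, CDs for Digital Music, Electronics for Musical Instruments).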
4.4. Discussion of Different LLMs
Table 3. Precision, Recall, and F1-score of BERT and GPT2-medium, both tuned by LoRA.
Datasets | BERT (Precision) | BERT (Recall) | BERT (F1-score) | GPT2-medium (Precision) | GPT2-medium (Recall) | GPT2-medium (F1-score) |
Amazon Instant Video | 0.7402 | 0.7578 | 0.7472 | 0.615 | 0.5999 | 0.606 |
Digital Music | 0.7692 | 0.7767 | 0.7712 | 0.7244 | 0.7136 | 0.7183 |
Musical Instruments | 0.7376 | 0.7436 | 0.7382 | 0.5534 | 0.5227 | 0.5332 |
This section investigates the impact of different LLMs in the pre-training phase, considering two distinct paradigms, i.e., GPT and BERT. Since a prompt-based GPT cannot predict floating-point numbers directly, we formulate the following prompt:
Input template: Give some example: {content1} is score {score1}, {content2} is score {score2}. Guess the score (The score should be between 1 and 5, where 1 means the lowest score, and 5 means the highest score) of {current content}, we think the score is?
Target template: {score}, {explanation}.
To ensure a fair comparison, we tune both models with LoRA and adopt PRF (Precision, Recall, and F1-score) as the evaluation protocol. Table 3 presents the results, indicating that BERT outperforms GPT2-medium. This superiority may be attributed to BERT being a bi-directional encoder-based LLM, which is more powerful for the regression task than a decoder-based LLM.
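For illustration, the input template above can be filled in programmatically and the integer score read back from a target string of the form ‘{score}, {explanation}’. The sketch generalizes the two in-context examples to a list; the helper names `build_prompt` and `parse_score` and the example reviews are hypothetical and not part of the original pipeline.

```python
INPUT_TEMPLATE = (
    "Give some example: {examples}. "
    "Guess the score (The score should be between 1 and 5, where 1 means the lowest score, "
    "and 5 means the highest score) of {current}, we think the score is?"
)

def build_prompt(examples, current):
    """Fill the input template with (content, score) examples and the review to be scored."""
    shots = ", ".join(f"{content} is score {score}" for content, score in examples)
    return INPUT_TEMPLATE.format(examples=shots, current=current)

def parse_score(target):
    """Read the integer score from a target string shaped like '{score}, {explanation}'."""
    score = int(target.split(",", 1)[0].strip())
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score

prompt = build_prompt(
    [("Great sound", 5), ("Broke in a week", 1)],  # made-up reviews
    "Works as described",
)
score = parse_score("4, sturdy and easy to tune")
```

Mapping the generated text back to a discrete score in this way is what makes classification-style metrics such as Precision, Recall, and F1-score applicable to both models.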
5. Conclusions
This paper introduces a simple yet effective approach named LoID, which includes two major components. (1) ‘Pre-training plugin’: we propose a flexible plugin framework that transfers semantic knowledge across domains without re-training. (2) ‘Aligning semantic/ID space’: we devise a novel attention mechanism to connect the semantic and ID spaces, making our model easy to apply in industrial RSs. Extensive experiments reveal that LoID surpasses existing SOTA methods, and in-depth analyses underscore the effectiveness of our model components. In the future, we will explore visual signals to further improve our model.
References
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances In Neural Information Processing Systems (2020).
- Cao et al. (2021) Jiangxia Cao, Xixun Lin, Shu Guo, Luchen Liu, Tingwen Liu, and Bin Wang. 2021. Bipartite graph embedding via mutual information maximization. In ACM International Conference on Web Search and Data Mining (WSDM).
- Catherine and Cohen (2017) Rose Catherine and William Cohen. 2017. Transnets: Learning to transform for recommendation. In ACM Conference on Recommender Systems (RecSys).
- Chen et al. (2018) Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural attentional rating regression with review-level explanations. In The World Wide Web Conference (WWW).
- Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In ACM Conference on Recommender Systems Workshop.
- Cheng et al. (2018) Zhiyong Cheng, Ying Ding, Lei Zhu, and Mohan Kankanhalli. 2018. Aspect-aware latent factor model: Rating prediction with ratings and reviews. In The World Wide Web Conference (WWW).
- Chin et al. (2018) Jin Yao Chin, Kaiqi Zhao, Shafiq Joty, and Gao Cong. 2018. ANR: Aspect-based neural recommender. In ACM International Conference on Information and Knowledge Management (CIKM).
- Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In ACM Conference on Recommender Systems (RecSys).
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv (2018).
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. ArXiv (2021).
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
- Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv (2023).
- Yang et al. (2023) Liangwei Yang, Shengjie Wang, Yunzhe Tao, Jiankai Sun, Xiaolong Liu, Philip S. Yu, and Taiqing Wang. 2023. DGRec: Graph Neural Network for Recommendation with Diversified Embedding Generation. In ACM International Conference on Web Search and Data Mining (WSDM).
- Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In ACM International Conference on Information and Knowledge Management (CIKM).
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML).
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. ArXiv (2019).
- Rendle (2010) Steffen Rendle. 2010. Factorization machines. In IEEE International Conference on Data Mining.
- Seo et al. (2017) Sungyong Seo, Jing Huang, Hao Yang, and Yan Liu. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In ACM Conference on Recommender Systems (RecSys).
- Shuai et al. (2022) Jie Shuai, Kun Zhang, Le Wu, Peijie Sun, Richang Hong, Meng Wang, and Yong Li. 2022. A review-aware graph contrastive learning framework for recommendation. In ACM International Conference on Research and Development in Information Retrieval (SIGIR).
- Smith and Linden (2017) Brent Smith and Greg Linden. 2017. Two decades of recommender systems at Amazon.com. IEEE Internet Computing (2017).
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances In Neural Information Processing Systems (2017).
- Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In ACM Knowledge Discovery and Data Mining (KDD).
- Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In ACM International Conference on Web Search and Data Mining (WSDM).
- Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In ACM Knowledge Discovery and Data Mining (KDD).