License: arXiv.org perpetual non-exclusive license
arXiv:2401.11705v1 [cs.IR] 22 Jan 2024

Domain-Aware Cross-Attention for Cross-domain Recommendation

Abstract

Cross-domain recommendation (CDR) is an important method to improve recommender system performance, especially when observations in target domains are sparse. However, most existing cross-domain recommendation methods fail to fully utilize the target domain’s special features and are hard to generalize to new domains. Their networks are also complex and ill-suited to rapid industrial deployment. Our method introduces a two-step domain-aware cross-attention that extracts transferable features of the source domain at different granularities, which allows the efficient expression of both domain and user interests. In addition, we simplify the training process so that our model can be easily deployed on new domains. We conduct experiments on both public datasets and industrial datasets, and the experimental results demonstrate the effectiveness of our method. We have also deployed the model in an online advertising system and observed significant improvements in both Click-Through-Rate (CTR) and effective cost per mille (ECPM).

Index Terms—  Cross Domain Recommendations; Cold-start Problem; CTR prediction; Transfer learning

1 Introduction

In major e-commerce platforms such as JD.com and Amazon, precise prediction of Click-Through Rates (CTR) is crucial to increasing business revenue. These platforms encompass multiple business domains, each representing a specific display location or activity page, and are dedicated to presenting attractive items to users, whether through mobile applications or PC websites. With the enrichment of features and the complexity of model structures, deep CTR models have made significant progress in recent years. However, traditional recommender systems typically engage in tasks within a single domain, utilizing data collected from a specific domain to train models and serve for a specific task. With the increase of new domains and the influx of new users, some users have insufficient interactions in these domains, presenting recommender systems with the challenge of cold starts.

Fortunately, the missing interactions of these cold-start users can be captured in other online domains, offering a solution to the cold-start dilemma. For instance, a user may exhibit sparse interactions in one business domain but possess a rich history of clicks in another. Cross-Domain Recommendation (CDR) methods have been developed to address the challenges of cold-start users by transferring knowledge from relatively data-rich source domains. By leveraging user interactions from the source domain, these methods compensate for the data scarcity in the target domain, thereby improving accuracy and personalization across various e-commerce business domains.

According to the general cross-domain recommendation settings, domains are usually divided into source domain and target domain. The source domain typically has richer interactions. This paper primarily investigates how to utilize data from the source domain to enhance the quality of recommendations for cold-start users in the target domain.

Most existing cross-domain recommendation methods employ embedding mapping, operating under the assumption that the relationship between user preferences in the source and target domains is similar across users, and therefore transfer user embeddings from the source to the target domain through a shared preference transfer function [1, 2, 3]. Some approaches argue that the complex relationship between user preferences in the source and target domains varies from user to user, which leads to personalized preference mappings [4]. However, in reality, these methods often overlook the differences in data distribution, item features, and scenario characteristics between the source and target domains. They fail to take into account the actual characteristics of the target scenario, which may lead to negative knowledge transfer.

To address this issue, we introduce a two-step domain-aware cross-attention network. From a practical application standpoint, to align with the recommendation tasks in the target domain and effectively integrate source domain information, this network extracts key behavioral sequence features from the source domain during the cross-domain transfer process. In addition, this attention network applies coarse-grained attention at the domain level and fine-grained attention at the item level to selectively control what to transfer from the source domain, thereby reducing the likelihood of negative transfer. Furthermore, most mainstream cross-domain methods require pre-training embeddings in both source and target domains. This is followed by training a mapping function to project users from the source to the target domain and make recommendations. In contrast to these cross-domain approaches, our method trains a model end-to-end. This greatly facilitates industrial training and deployment. For a newly added domain, we can fine-tune a suitable model on top of the existing one, avoiding training a brand new one from scratch. This approach is more conducive to the industrial maintenance of multiple domains and the rapid deployment of new ones.

The main contributions of our work can be summarized as follows:

  • To address the cold-start problem in CDR, we propose a novel method called Domain-Aware Cross-Attention for Cross-Domain Recommendation (DACDR). This method utilizes a two-step cross-attention network to capture the transferable knowledge of each user from the source domain and to alleviate the negative transfer problem.

  • Our model is industry-friendly: we simplify the model architecture and training steps to alleviate the gradient imbalance issue associated with using behaviors as inputs for the Meta Network. In addition, for new domains, we can rapidly fine-tune the existing model and deploy it online.

  • We conduct extensive experiments on both industrial and public cross-domain datasets to demonstrate the effectiveness and robustness of DACDR. The DACDR method has been deployed on an online recommendation system and achieved gains in both CTR and ECPM.

Fig. 1: The overall architecture of DACDR.

2 Related Work

2.1 Cross-domain Recommendation

Cross-domain recommendation (CDR) aims to improve performance or minimize the number of labeled examples required in a target domain with the help of auxiliary domains. This methodology has garnered widespread attention in recommendation systems as a way to alleviate data sparsity and the cold-start problem in the target domain. Among early approaches, Collective Matrix Factorization (CMF) [5] is a classic method that assumes all domains share a global user embedding matrix and simultaneously factorizes matrices from multiple domains.

In recent years, deep learning-based models have been proposed to enhance knowledge transfer. For instance, CoNet [6] utilizes cross connections between feedforward neural networks to transfer and integrate knowledge. DDTCDR [7] develops a novel latent orthogonal mapping to extract user preferences across multiple domains while preserving user relationships across different latent spaces.

Another set of CDR methods focuses on constructing bridges for user preferences across different domains, which is most pertinent to our research. For example, EMCDR [1] uses matrix factorization to generate latent factors for users and transfers user latent vectors via a linear mapping function, while PTUPCDR [4] learns personalized bridges for each cross-domain user. However, the aforementioned methods do not sufficiently consider the contextual features and data distributions of the target domain, and the approach of transferring preferences based on shared users may not fulfill the business requirements of the target domain in practical recommendation systems. Our research falls into this bridge-based category but differs in how it accounts for the target domain.

2.2 Multi-task / Multi-domain Recommendation

In practical business domains, data for recommendation scenarios are often dispersed and unevenly distributed, posing challenges for model training and maintenance. Training individual models for each scenario not only adds to the complexity of maintenance but also yields unsatisfactory performance in data-sparse situations. Multi-task and multi-domain learning are the most representative attempts to address this. These methods segment data into groups based on tasks or domains and utilize well-designed structures to learn parameters that are specific to each group.

In the realm of multi-task learning, the Share-Bottom structure [8] is proposed to share the input embedding layer at the bottom. Ma et al. [9] introduce the concept of sharing a universal expert across tasks, employing a gating network to ascertain the relevance of this universal expert to diverse tasks. Building on this, Tang et al. [10] propose the PLE, which explicitly delineates shared experts from task-specific ones, thereby mitigating detrimental interference between task-specific and task-shared insights. Concerning multi-domain learning, Sheng et al. [11] propose a STAR topology approach to model the shared centered parameters and domain-specific parameters among domains to keep each domain separate. Nevertheless, in real-world business domains, frequent additions and updates to web pages are commonplace. The above-mentioned multi-domain models still rely on sufficient training data from the target domain and fail to alleviate the cold start problem of new domains.

If multi-domain models are employed, mixing data and networks from different domains can inadvertently lead to negative transfer of user preferences from the source domain to target domains, neglecting the specific recommendations required in the target domain. Consequently, these models exhibit limited efficacy in cold-start recommendations in new domains and fail to fully utilize the combination of target domain scenario information.

3 Method

3.1 Problem Formulation

Typically, domains are divided into the source domain $S$ and the target domain $T$ according to general cross-domain recommendation settings. Each domain comprises a set of users $U=\{u_1,u_2,\dots,u_i\}$, a collection of items $V=\{v_1,v_2,\dots,v_j\}$, and labels $y$, where $y\in\{0,1\}$ indicates whether user $u$ clicked on item $v$ or not. A user $u_i$ usually exhibits $N$ distinct types of behaviors (e.g., browsing, clicking, purchasing) within a domain, which can be organized into a sequence of user behaviors $B=\{b_1,b_2,\dots,b_N\}$.
The multi-behavioral historical sequence for a user $u_i$ is denoted by $S_{u_i}=\{s_{u_i}^{b_1},s_{u_i}^{b_2},\dots,s_{u_i}^{b_N}\}$, where $s_u^b$ represents the sequence of behaviors under a specific action $b$. For each behavior, there exists a sequential series of items $s_u^b=\{v_1,v_2,\dots,v_n\}$, with $n$ indicating the number of items interacted with.

To differentiate between the source and target domains, the user and item sets of the source domain are denoted as $U^s$ and $V^s$, respectively. In the case of the target domain $D^t$, these are represented as $U^t$ and $V^t$. The embeddings for users $u_i^d\in\mathbb{R}^k$ and items $v_j^d\in\mathbb{R}^k$ are symbolized respectively as $e_{u_i}^d$ and $e_{v_j}^d$, where $k$ indicates the dimensionality of the embeddings, and $d\in\{s,t\}$ signifies the domain label. Additionally, $e_d^t$ represents the embedding for the target domain special feature.
In our problem setting, cold-start users are defined as those belonging to $U^s$ but new to $U^t$. The goal of CDR is to utilize the rich behavioral data from the source domain to provide recommendations for users $u$ within the $U^t$ set.
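As a small illustration of the cold-start definition above, the cold-start users can be read as the users that appear in the source domain but have no history in the target domain. The sketch below uses made-up user IDs and a hypothetical helper name, and it treats "new to the target domain" as "absent from the target user set", which is an assumption about how the definition would be applied in practice.

```python
def cold_start_users(source_users, target_users):
    """Users present in the source domain but absent from the target domain."""
    return set(source_users) - set(target_users)

# Made-up user IDs, for illustration only.
source_users = {"u1", "u2", "u3", "u4"}
target_users = {"u2", "u4", "u5"}
cold = cold_start_users(source_users, target_users)
```

Here `u5` is not a cold-start user in this sense, because it has no source-domain behavior to transfer from.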

3.2 Domain-Aware Cross-Attention

The framework of our method is shown in Fig. 1. Transfer learning typically relies on the abundant explicit and implicit data from the source domain to train the transfer function, with the aim of mapping the source user embedding to the target domain. However, as previously mentioned, failing to fully consider the characteristics of the target domain may lead to noise and negative transfer problems when directly transferring user embeddings from the source domain to the target domain. Given the different characteristics and business scopes of different domains, users’ behavioral preferences often diverge across these domains. Hence, it is crucial to consider the contribution of items from the user’s source domain behavior during the cross-domain process. Only an accurate evaluation of the contribution of the source domain can have a positive impact on the target domain recommendation results.

Inspired by the Cross-Attention (CA) mechanism [12], we construct a two-step domain-aware cross-attention network to fully consider such contributions from the source domain, thereby generating features that better match the target domain. We believe that both domain-level and item-level contributions should be considered when evaluating them. Cross-attention asymmetrically combines two separate embedding sequences of the same dimension, whereas the input to self-attention is a single embedding sequence.

$$\text{CA}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \tag{1}$$
$$Q_i=XW_i^{Q},\quad K_i=XW_i^{K},\quad V_i=XW_i^{V} \tag{2}$$
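A minimal NumPy sketch of the scaled dot-product cross-attention in Eq. (1) may help make the shapes concrete. It follows the textual description, in which queries come from one sequence and keys and values from another, with projection shapes matching the $W^Q$, $W^K$, $W^V$ dimensions given in this section. The function names, sizes, and random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_in, kv_in, w_q, w_k, w_v):
    """Scaled dot-product cross-attention, Eq. (1).

    q_in:  (n_q, d_1) sequence that produces the queries
    kv_in: (n_kv, d_2) sequence that produces keys and values
    """
    q = q_in @ w_q                      # (n_q, d_k)
    k = kv_in @ w_k                     # (n_kv, d_k)
    v = kv_in @ w_v                     # (n_kv, d_k)
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))   # (n_q, n_kv)
    return weights @ v, weights

# Illustrative sizes and random inputs (assumptions, not from the paper).
rng = np.random.default_rng(0)
n_q, n_kv, d_1, d_2, d_k = 5, 3, 8, 6, 4
out, weights = cross_attention(
    rng.normal(size=(n_q, d_1)),
    rng.normal(size=(n_kv, d_2)),
    rng.normal(size=(d_1, d_k)),
    rng.normal(size=(d_2, d_k)),
    rng.normal(size=(d_2, d_k)),
)
```

Each row of `weights` is a distribution over the key positions, so every query position receives a weighted average of the value vectors.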

In this network, the first step involves a coarse-grained allocation of attention to the source domain’s behavior sequence using the features of the target domain, and the second step refines the attention allocation by using the target domain item embeddings with the output of the first step. The two-step domain-aware cross-attention network can be expressed by the following formula:

$$e_1=\text{Domain-Level CA}(XW_1^{Q},\ e_d^{t}W_1^{K},\ e_d^{t}W_1^{V}) \tag{3}$$
$$e_z^{t}=\text{Item-Level CA}(e_1W_2^{Q},\ e_{v_i}^{t}W_2^{K},\ e_{v_i}^{t}W_2^{V}) \tag{4}$$

where $X$ represents the behavior sequence feature $\{v_1,v_2,v_3,\dots\}$ of user $u_i$, $e_d^t$ represents the target domain feature embedding, $e_{v_i}^t$ represents the target item embedding of $v_i$, $e_z^t\in\mathbb{R}^k$ denotes the transferable feature embedding of user $u_i$ from the source domain, and $z$ stands for the user behavior sequence or the other side-information sequence. In this way we obtain the transferable knowledge $e_{\text{seq}}^t$ and $e_{\text{side}}^t$, which contain rich source knowledge and are suitable for the target domain.
The projection matrices satisfy $W^Q\in\mathbb{R}^{d_1\times d_k}$ and $W^K,W^V\in\mathbb{R}^{d_2\times d_k}$. It should be noted that this two-step domain-aware attention network assists in identifying source domain user interactions that are relevant to the target domain at both coarse and fine granularity.
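The two steps in Eqs. (3) and (4) can be sketched as two chained cross-attention calls. This is a rough sketch under stated assumptions: the target-domain feature and the target item are each represented as a few feature-field tokens (`m_d` and `m_v` rows), all sizes and parameter names are hypothetical, and the output is left per-position rather than reduced to a single vector, since the reduction is not specified at this point in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ca(q, k, v):
    # Eq. (1): softmax(Q K^T / sqrt(d_k)) V
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def two_step_ca(x, e_d, e_v, p):
    # Step 1, Eq. (3): queries from the source behavior sequence,
    # keys and values from the target-domain feature embedding.
    e1 = ca(x @ p["W1_Q"], e_d @ p["W1_K"], e_d @ p["W1_V"])
    # Step 2, Eq. (4): queries from the step-1 output,
    # keys and values from the target item embedding.
    return ca(e1 @ p["W2_Q"], e_v @ p["W2_K"], e_v @ p["W2_V"])

# Hypothetical sizes: n behaviors, m_d domain fields, m_v item fields.
rng = np.random.default_rng(0)
n, m_d, m_v, d, d_k = 6, 3, 4, 8, 8
shapes = {
    "W1_Q": (d, d_k), "W1_K": (d, d_k), "W1_V": (d, d_k),
    "W2_Q": (d_k, d_k), "W2_K": (d, d_k), "W2_V": (d, d_k),
}
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
x = rng.normal(size=(n, d))        # source behavior sequence
e_d = rng.normal(size=(m_d, d))    # target-domain feature fields (assumed)
e_v = rng.normal(size=(m_v, d))    # target item feature fields (assumed)
e_z = two_step_ca(x, e_d, e_v, p)  # (n, d_k)
```

Running the same function on the behavior sequence and on the side-information sequence would give the two outputs that the text calls $e_{\text{seq}}^t$ and $e_{\text{side}}^t$.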

3.3 Domain Encoder

In the cross-domain transfer process, the existing bridge-based methods [1, 2, 13] train a designed Meta Network $f(\cdot)$, taking the outputs of the Cross Attention Network as Meta bridge parameters and the user embeddings from the source domain as input to calculate the transferred user embeddings:

$$\hat{e}_{u_i}^{t}=f(e_{u_i}^{s};w) \tag{5}$$

where $e_{u_i}^s$ denotes the embedding of user $u_i^s$ in the source domain, $\hat{e}_{u_i}^t$ represents the transformed embedding, and $w$ represents the parameters of the meta network.

To realize a personalized transfer meta network, some existing methods take the user sequence embedding as $w$. However, in practical applications, user-side features typically encompass not only ID embeddings but also a large number of other auxiliary features. All these features are fed into the final FC layers to predict the output of the target domain. When training with these auxiliary features, a problem arises: the update gradients of the user embeddings obtained from the meta network are far smaller than those of the other auxiliary features. To address this issue, we adopt a domain encoder to generate the transformed target user embeddings, rather than multiplying the behavior sequence feature with the user embedding. We flatten the outputs of the two-step Domain-Aware Cross-Attention and concatenate them to generate the transferred user embeddings. The proposed domain encoder network can be formalized as follows:

$$\hat{e}_{u_i}^{t}=g(\text{concat}[e_{\text{seq}}^{t},e_{\text{side}}^{t},e_d^{t}]) \tag{6}$$

where $g(\cdot)$ represents the domain encoder, $e_d^t$ represents the target domain feature embedding, and $e_{\text{side}}^t$ and $e_{\text{seq}}^t$ are the output embeddings from the domain-aware cross-attention.

Consequently, we directly obtain the transferred user embedding, avoiding the gradient problem that the Meta Network may encounter during training.
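The flatten-and-concatenate step of Eq. (6) can be sketched as follows. The form of $g(\cdot)$ is not specified in the text, so the two-layer MLP with ReLU used here is purely an assumption, as are all sizes and variable names.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def domain_encoder(e_seq, e_side, e_d, w1, b1, w2, b2):
    # Flatten each input and concatenate them, as in Eq. (6).
    z = np.concatenate([e_seq.ravel(), e_side.ravel(), e_d.ravel()])
    # g(.) is assumed to be a two-layer MLP; the paper does not fix its form.
    return relu(z @ w1 + b1) @ w2 + b2

# Hypothetical sizes: n positions of width d_k, output dimension k.
rng = np.random.default_rng(0)
n, d_k, k, hidden = 6, 8, 16, 32
e_seq = rng.normal(size=(n, d_k))    # cross-attention output, behavior sequence
e_side = rng.normal(size=(n, d_k))   # cross-attention output, side information
e_d = rng.normal(size=(d_k,))        # target-domain feature embedding
in_dim = e_seq.size + e_side.size + e_d.size
w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=(hidden, k))
b2 = np.zeros(k)
e_u_hat = domain_encoder(e_seq, e_side, e_d, w1, b1, w2, b2)
```

Because the transferred embedding is produced by ordinary feed-forward layers over the attention outputs, its gradients flow through the same kind of path as the other features, which is the motivation given above for dropping the meta network.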

3.4 Prediction and Optimization

With the generated target user embeddings, the output of the model can be formed as follows:

$$\hat{y}=h(\text{concat}[\hat{e}_{u_i}^{t},e_{\text{seq}}^{t},e_{\text{side}}^{t},e_{v_j}^{t},e_d^{t}];w) \tag{7}$$

where $h(\cdot)$ represents the FC layers and $w$ represents the model parameters. In addition, we introduce a task-oriented optimization strategy inspired by [4]. This strategy does not directly minimize the distance between the target user embedding $u^t$ and the generated user embedding $\hat{u}^t$. Instead, it optimizes the generated user embedding for prediction performance in the target domain. The optimization can be formulated as:

$$\mathcal{L}=\frac{1}{T}\sum_{i=1}^{T}\left[-y_i\log(\hat{y}_i)-(1-y_i)\log(1-\hat{y}_i)\right] \tag{8}$$

where $y_i\in\{0,1\}$ is the ground truth of the $i$-th sample, $\hat{y}_i$ is the predicted CTR value, and $T$ is the total number of samples. This task-oriented optimization has two main advantages. First, it can mitigate the impact of insufficient target user embeddings, as it directly employs the actual behavior labels rather than approximate intermediate results. Second, the task-oriented learning process can utilize more training samples than approaches that compare the distance between embeddings, thereby reducing the risk of overfitting.
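The loss in Eq. (8) is the standard binary cross-entropy averaged over the $T$ samples. A minimal pure-Python sketch is given below; the clipping constant `eps` and the example labels and predictions are assumptions added for numerical safety and illustration, and are not part of the paper's formulation.

```python
import math

def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over T samples, as in Eq. (8)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions to avoid log(0); eps is an assumption of this sketch.
        p = min(max(p, eps), 1.0 - eps)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Made-up click labels and predicted CTR values.
labels = [1, 0, 0, 1]
preds = [0.9, 0.2, 0.1, 0.6]
loss = bce_loss(labels, preds)
```

Confident correct predictions contribute little to the loss, while the less confident positive prediction of 0.6 contributes the largest single term.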

3.5 Fine-tuning

In our practical business situations, the addition of new domains is quite common. Each domain has a dedicated item set, and we need to make recommendations that are suitable for this domain within that item set. However, creating a separate model for each new page would lead to severe cold-start issues and complex engineering maintenance. Therefore, serving different domains with fewer models has become an urgent issue to address.

Our model DACDR is very friendly to fine-tuning for new scenes. When a new domain is created, we can fine-tune DACDR by training the target domain features $e_d^t$, the target item embedding $e_{v_i}^t$, and $W^Q,W^K,W^V$ in the two-step cross-attention part, while freezing the remaining network parameters. This allows for the rapid online deployment of the fine-tuned model.
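The selective fine-tuning described above can be sketched as a gradient step that skips frozen parameters. The parameter names, the plain SGD update, and the placeholder gradients below are hypothetical; the text only states which groups of parameters are trained and which are frozen.

```python
import numpy as np

# Hypothetical parameter names; only these are updated for a new domain.
TRAINABLE = {"e_d", "e_v", "W1_Q", "W1_K", "W1_V", "W2_Q", "W2_K", "W2_V"}

def sgd_step(params, grads, lr=0.01, trainable=TRAINABLE):
    """Apply one SGD step, leaving frozen parameters unchanged."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }

rng = np.random.default_rng(0)
names = sorted(TRAINABLE | {"encoder_w", "fc_w"})   # two frozen examples
params = {name: rng.normal(size=(4, 4)) for name in names}
grads = {name: np.ones_like(value) for name, value in params.items()}  # placeholders
updated = sgd_step(params, grads)
```

In a deep learning framework the same effect is usually obtained by passing only the trainable parameters to the optimizer or by disabling gradient tracking on the frozen ones.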

4 Experiments

In this section, we perform experiments on real-world datasets to evaluate the performance of our model.

4.1 Datasets and Implementation

Experiments were conducted on both real-world industrial datasets and public datasets. The public datasets follow the choices made by the majority of current methods [2, 3], namely the Books, Movies and TV, and CDs and Vinyl categories from Amazon (http://jmcauley.ucsd.edu/data/amazon/). The industrial dataset is collected from our e-commerce online advertising system and covers a wealth of information across diverse domains. We extracted a subset of logs spanning 11 days, from September 1, 2023, to September 11, 2023, with the first ten days used for training and the final day reserved for testing.
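The temporal split described above can be sketched as follows. The record layout (a dictionary with a `day` field) is an assumption made for illustration; the actual log schema is not described in the paper.

```python
from datetime import date

TRAIN_START = date(2023, 9, 1)
TEST_DAY = date(2023, 9, 11)  # the final day is held out for testing

def split_logs(logs):
    """Split log records into the first ten days (train) and the last day (test)."""
    train, test = [], []
    for row in logs:
        day = row["day"]  # hypothetical field name
        if TRAIN_START <= day < TEST_DAY:
            train.append(row)
        elif day == TEST_DAY:
            test.append(row)
    return train, test

# One toy record per day, September 1 to September 11.
logs = [{"day": date(2023, 9, d), "clicked": d % 2} for d in range(1, 12)]
train, test = split_logs(logs)
```

Splitting by date rather than at random keeps all test impressions strictly later than the training impressions.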

Table 1: Statistics of datasets used in experiments.
Datasets #Users #Items #Interactions CTR
Source 19.9M 32.8M 36B 4.02%
Target #1 2.2M(94% overlap) 210K 22.1M 1.48%
Target #2 4.5M(69% overlap) 22K 39.1M 2.19%
Target #3 1.5M(78% overlap) 491K 21.9M 1.85%
Movies 123,960 50,052 1,697,533 /
CDs 75,258 64,443 1,097,592 /
Books 603,668 367,982 8,898,041 /

Within our business domains, we focus on data from two areas: the central domain and the target domains. Throughout the experiments, these are denoted as the source domain $S$ and the target domain $T$, respectively. The target domain typically has business goals that diverge from those of the source domain. For instance, a user may prefer to purchase home appliances in the source domain but show more interest in clothing and shoes in Target #1.

Table 1 summarizes the statistics of the public Amazon datasets and our industrial datasets, detailing fundamental characteristics, the sparsity of each task within every domain, and the degree of overlap between users and exposed items across domains. Even though the domains share the same pool of items and contain numerous overlapping users, item exposure and user behavior vary noticeably between domains. This suggests that users exhibit diverse behavioral intentions across multiple domains, engaging in a differentiated consumption ecosystem.

Our metric for evaluation was the AUC (Area under the ROC curve), the most commonly employed indicator in CTR prediction.
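For reference, AUC can be computed from ranks as the probability that a randomly chosen clicked sample is scored above a randomly chosen non-clicked one. The sketch below is a plain-Python illustration of that definition (the function name `auc` is ours), not the evaluation code used in the experiments.

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic; tied scores count as one half."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined without both classes")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example, three of the four positive-negative pairs are ordered correctly, so the value is 0.75.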

4.2 Baselines and Metrics

We compare our model with the following methods.

• DNN [14]: This method is an implementation of YouTube DNN. In DNN-single, we train a model for each domain separately using only the target domain dataset, while in DNN-multi, we train models with datasets from both the source domain and the target domain.

• CMF [5]: CMF is an extension of Matrix Factorization (MF) in which user embeddings are shared across the source and target domains.

• EMCDR [1]: A popular CDR method that employs MF to learn embeddings and uses a neural network to bridge user embeddings from the auxiliary domain to the target domain.

• SSCDR [2]: A semi-supervised bridge-based method for cross-domain recommendation.

• Shared-Bottom [8]: Shared-Bottom is a classical multi-task method that consists of a shared bottom network and several domain-specific tower networks. Each domain has its own tower network while sharing the same bottom network.

• MMoE [9]: MMoE adopts the Mixture-of-Experts (MoE) structure by sharing the expert modules across all domains, while having a gating network trained for each domain.

• PLE [10]: PLE is an optimized version of MMoE that explicitly separates shared experts and task-specific experts and adopts a progressive mechanism to extract features gradually.

• STAR [11]: STAR is the state-of-the-art multi-domain recommendation (MDR) method. It splits the parameters into shared and domain-specific parts and proposes Partitioned Normalization to handle distinct domain statistics.
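To make the contrast between the multi-task baselines concrete, the sketch below illustrates the gating idea behind MMoE: all domains share the same expert outputs, and each domain mixes them with its own softmax gate. It is a toy illustration with made-up numbers and hypothetical names, not the baseline implementation used in our experiments.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mmoe_mix(expert_outputs, gate_logits):
    """Combine shared expert outputs using one domain's gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Two shared experts, each producing a 2-dimensional representation.
expert_outputs = [[1.0, 0.0], [0.0, 1.0]]

# Each domain has its own gate (logits here are made up for illustration).
gates = {"target_1": [0.0, 0.0], "target_2": [2.0, 0.0]}
mixed = {domain: mmoe_mix(expert_outputs, logits) for domain, logits in gates.items()}
```

A Shared-Bottom model corresponds to the special case of a single shared expert, so every domain tower receives the same representation.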

Table 2: Overall Performance comparisons on our online production dataset w.r.t. AUC
Methods Target #1 (AUC) Target #2 (AUC) Target #3 (AUC)
DNN-single 0.6239 0.6746 0.6317
CMF 0.5828 0.6529 0.5965
SSCDR 0.6136 0.6673 0.6085
EMCDR 0.6076 0.6758 0.6244
PTUPCDR 0.6329 0.6846 0.6363
DNN-multi 0.6336 0.6762 0.6479
Shared-Bottom 0.6401 0.6859 0.6563
MMoE 0.6453 0.6878 0.6658
PLE 0.6489 0.6951 0.6597
STAR 0.6437 0.6956 0.6631
DACDR 0.6513 0.7023 0.6672
DACDR w/o DA 0.6498 0.6982 0.6631
DACDR w/o IA 0.6423 0.6938 0.6577
DACDR w/o DA&IA 0.6401 0.6903 0.6518

Table 2 summarizes the performance of our proposed DACDR model and all the baselines on the industrial datasets. DACDR outperforms all baselines in all target domains. We observe that most multi-task models outperform the cross-domain models. The reason is that multi-task models, by directly or indirectly sharing trained embeddings or experts across tasks, gain a certain level of implicit understanding of the target domain. In contrast, our model not only transfers user preferences but also strengthens its understanding of the target domain during the cross-domain process, and therefore exhibits superior performance.

DACDR has been deployed in our online recommender system, where it achieves a 3.5% increase in CTR and a 7.4% increase in effective cost per mille (ECPM). These gains are substantial and may bring in millions in additional revenue within a month.

Table 3: Cold-start performance of 3 public cross-domain tasks w.r.t MAE and RMSE.
Task β Metric CMF SSCDR EMCDR PTUPCDR DACDR Improve
Movie→CDs 20% MAE 1.5209 1.3723 1.3507 1.1293 0.8265 26.81%
Movie→CDs 20% RMSE 2.0158 1.7304 1.6737 1.4734 1.1953 18.86%
Movie→CDs 50% MAE 1.6893 1.5535 1.5312 1.2183 0.9327 23.45%
Movie→CDs 50% RMSE 2.2271 1.9032 1.8832 1.6388 1.4138 13.73%
Movie→CDs 80% MAE 2.4186 2.0728 2.0441 1.5209 1.2420 18.36%
Movie→CDs 80% RMSE 3.0936 2.3490 2.4277 2.1238 1.8964 10.70%
Book→Movie 20% MAE 1.3632 1.1732 1.1328 1.0701 0.9042 15.50%
Book→Movie 20% RMSE 1.7918 1.6327 1.4227 1.3698 1.1654 14.92%
Book→Movie 50% MAE 1.5813 1.3045 1.1836 1.0972 0.9123 16.86%
Book→Movie 50% RMSE 2.0886 1.5737 1.4984 1.4356 1.2119 15.58%
Book→Movie 80% MAE 2.1577 1.3668 1.3545 1.2084 0.9879 18.24%
Book→Movie 80% RMSE 2.6777 1.7198 1.7151 1.6135 1.3411 16.89%
Book→CDs 20% MAE 2.0084 1.9324 1.8951 1.4523 0.9503 34.60%
Book→CDs 20% RMSE 2.3829 2.3424 2.3163 1.9910 1.3800 30.66%
Book→CDs 50% MAE 2.1856 2.1417 2.1113 1.5445 0.9916 35.76%
Book→CDs 50% RMSE 2.5373 2.4937 2.4681 2.0886 1.4391 31.08%
Book→CDs 80% MAE 2.8737 2.5126 2.4388 1.7968 1.1949 33.50%
Book→CDs 80% RMSE 3.3424 2.9302 2.8110 2.4605 1.8245 25.82%

In our experiments, we also test cold-start performance on the public CDR datasets. Following [1, 4], we set the proportion of cold-start users $\beta$ to 80%, 50%, and 20% of the total overlapping users, respectively. Given that the ground truth consists of 5-core ratings, we change the loss function to Mean Squared Error (MSE) loss and the evaluation metrics to Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Additionally, we take the sequential timestamps into account to avoid information leakage. The results are shown in Table 3. Our method outperforms the other bridge-based CDR methods on the public datasets. We attribute this superiority to replacing the three-step training process required by bridge-based methods that use a meta network with direct training of a domain encoder.
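The two error metrics and the relative improvement can be sketched as follows. We assume here that the Improve column of Table 3 is the relative error reduction of DACDR over the strongest baseline (PTUPCDR); this reading reproduces the first MAE row but is our interpretation, and the rating values in the toy example are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_improvement(baseline_error, model_error):
    """Relative error reduction in percent (lower error is better)."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Toy ratings and predictions (made-up numbers).
ratings = [5, 3, 4, 1]
predicted = [4.5, 3.5, 4.0, 2.0]
toy_mae = mae(ratings, predicted)
toy_rmse = rmse(ratings, predicted)

# Movie -> CDs, beta = 20%, MAE: PTUPCDR 1.1293 vs. DACDR 0.8265 (Table 3).
improve = round(relative_improvement(1.1293, 0.8265), 2)  # 26.81
```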

4.3 Ablation Study

To explore the effectiveness of the key modules in our approach, we performed ablation experiments on DACDR by removing the Domain-level cross-Attention (w/o DA) and the Item-level cross-Attention (w/o IA). The results are shown in Table 2. Without domain-level attention, DACDR's performance drops but still exceeds that of most other methods. The same occurs without item-level attention. Removing both domain-level and item-level attention simultaneously results in the most significant decrease in DACDR's performance, demonstrating the effectiveness of the attention modules in CDR.
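The size of each ablation effect can be read off Table 2 with a short calculation. The snippet below only restates the reported AUC values and computes the drop of each variant relative to the full model; the variable names are ours.

```python
# AUC values copied from Table 2 (Target #1, Target #2, Target #3).
auc_table = {
    "DACDR": (0.6513, 0.7023, 0.6672),
    "w/o DA": (0.6498, 0.6982, 0.6631),
    "w/o IA": (0.6423, 0.6938, 0.6577),
    "w/o DA&IA": (0.6401, 0.6903, 0.6518),
}

full = auc_table["DACDR"]

# AUC drop of each ablated variant relative to the full model, per target domain.
drops = {
    name: tuple(round(f - v, 4) for f, v in zip(full, values))
    for name, values in auc_table.items()
    if name != "DACDR"
}

for name, per_target in drops.items():
    print(name, per_target)
```

On all three targets the drop is largest when both modules are removed, and removing item-level attention costs more AUC than removing domain-level attention.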

5 Conclusion

In this paper, we focus on the CDR problem and propose the DACDR model. We build a two-step domain-aware cross-attention network to capture coarse- and fine-grained knowledge from the source domain. From DACDR, we observe the significant role of transformers in cross-domain recommendation, where domain information serves as a condition that guides the model towards better classification. In addition, our model can be easily fine-tuned and deployed, and is effective in handling new scenarios in practical applications. Experimental results on the industrial and public datasets demonstrate the effectiveness of our method.

References

  • [1] Tong Man, Huawei Shen, et al., “Cross-domain recommendation: An embedding and mapping approach,” in IJCAI, 2017, vol. 17, pp. 2464–2470.
  • [2] SeongKu Kang, Junyoung Hwang, et al., “Semi-supervised learning for cross-domain recommendation to cold-start users,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1563–1572.
  • [3] Yongchun Zhu, Kaikai Ge, et al., “Transfer-meta framework for cross-domain recommendation to cold-start users,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1813–1817.
  • [4] Yongchun Zhu, Zhenwei Tang, et al., “Personalized transfer of user preferences for cross-domain recommendation,” in Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022, pp. 1507–1515.
  • [5] Ajit P Singh et al., “Relational learning via collective matrix factorization,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008, pp. 650–658.
  • [6] Guangneng Hu et al., “Conet: Collaborative cross networks for cross-domain recommendation,” in Proceedings of the 27th ACM international conference on information and knowledge management, 2018, pp. 667–676.
  • [7] Pan Li and Alexander Tuzhilin, “Ddtcdr: Deep dual transfer cross domain recommendation,” in Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 331–339.
  • [8] Sebastian Ruder, “An overview of multi-task learning in deep neural networks,” CoRR, vol. abs/1706.05098, 2017.
  • [9] Jiaqi Ma, Zhe Zhao, et al., “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts,” in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 1930–1939.
  • [10] Hongyan Tang, Junning Liu, et al., “Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations,” in Proceedings of the 14th ACM Conference on Recommender Systems, 2020, pp. 269–278.
  • [11] Xiang-Rong Sheng, Liqin Zhao, et al., “One model to serve all: Star topology adaptive recommender for multi-domain ctr prediction,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 4104–4113.
  • [12] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF ICCV, 2021, pp. 357–366.
  • [13] Wenjing Fu et al., “Deeply fusing reviews and contents for cold start users in cross-domain recommendation systems,” in Proceedings of the AAAI, 2019, vol. 33, pp. 94–101.
  • [14] Paul Covington, Jay Adams, and Emre Sargin, “Deep neural networks for youtube recommendations,” in Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.