ECAT: A Entire space Continual and Adaptive Transfer Learning Framework for Cross-Domain Recommendation

Chaoqun Hou, Yuanhang Zhou, Yi Cao, and Tong Liu
Alibaba Group, Hangzhou, China
(2024)
Abstract.

In industrial recommendation systems, there are several mini-apps designed to meet the diverse interests and needs of users. The sample space of each mini-app is merely a small subset of the entire space, making it challenging to train an efficient model. In recent years, many excellent studies on cross-domain recommendation have aimed to mitigate the problem of data sparsity. However, few of them have simultaneously considered the adaptability of both sample transfer and continual representation transfer to the target task. To overcome this issue, we propose an Entire space Continual and Adaptive Transfer learning framework called ECAT, which includes two core components. First, for sample transfer, we propose a two-stage method that realizes a coarse-to-fine process: we perform an initial selection through a graph-guided method, followed by a fine-grained selection using a domain adaptation method. Second, we propose an adaptive knowledge distillation method for continually transferring the representations from a model that is well-trained on the entire space dataset. ECAT enables full utilization of the entire space samples and representations under the supervision of the target task, while avoiding negative transfer. Comprehensive experiments on real-world industrial datasets from Taobao show that ECAT advances state-of-the-art performance on offline metrics, and brings +13.6% CVR and +8.6% orders for Baiyibutie, a famous mini-app of Taobao.

cross domain, continual transfer learning, adaptive knowledge distillation, graph guided
Published in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA. ISBN: 979-8-4007-0431-4/24/07. DOI: 10.1145/3626772.3661348. CCS Concepts: Information systems → Retrieval models and ranking.

1. Introduction

Recommendation systems (RS) play a significant role in e-commerce platforms, and their efficiency is closely related to the accuracy of click-through rate (CTR) prediction. In recent years, thanks to continuous improvements in computational power and the increasing volume of datasets, numerous outstanding single-domain CTR models (Cheng et al., 2016; Guo et al., 2017; Chen et al., 2022; Pi et al., 2020; Zhang et al., 2021) have achieved impressive results. At large e-commerce companies, there are several mini-apps designed to meet the diverse interests and needs of users. However, these mini-apps all encounter a common issue: the target domain has relatively sparse samples, making it challenging to train a complex CTR model, especially the representations of ID categorical features (i.e., item ID and user ID). Take Taobao, for instance: Baiyibutie is a mini-app that contributes billions of daily page views by exclusively selling brand-discounted products, yet its sample size is less than 1% of the entire Taobao domain. Therefore, exploring how cross-domain transfer learning can utilize the abundant information available in data-rich domains to enhance data-sparse domains has emerged as an important research focus in the industry. Traditional cross-domain recommendation (Li et al., 2023; Huan et al., 2023; Chen et al., 2023; Yang et al., 2023; Mu et al., 2023; Tian et al., 2023; Zhao et al., 2023; Gao et al., 2023) can be categorized into two paradigms: sample transfer and parameter transfer from a well-trained source model.

In the sample transfer paradigm, multi-task learning methods (Zhang and Yang, 2021; Xie et al., 2022; Ma et al., 2018; Tang et al., 2020; Sheng et al., 2021; Hu et al., 2018; Ouyang et al., 2020; Zou et al., 2022) are typically employed to enhance performance across all domains by combining the source and the target samples. Although this paradigm has achieved commendable results in many scenarios, it has evident limitations in certain situations. For instance, when the sample size of the source domain is hundreds of times larger than that of the target domain, the training process can easily be dominated by the source domain, resulting in insufficient training in the target domain. Another issue is that introducing source domain samples at such a large scale could significantly increase training complexity. Therefore, the core objective should be to enhance the performance of the target task by selecting only the samples that are valuable to it.

In the parameter transfer paradigm, pre-training & fine-tuning methods (Hu et al., 2019, 2020; Chen et al., 2021) are more efficient and effective. Specifically, the initialization parameters of the target model are obtained by loading a pre-trained source model, followed by fine-tuning with samples from the target domain. However, an evident issue is that merely fine-tuning with sparse samples from the target domain can easily lead the target model to settle into a sub-optimal local minimum. Therefore, it is crucial to measure the value of the source model’s parameters for the target task. Another issue is that few studies consider the setting of Continual Transfer Learning (CTL) (Wang et al., 2020; De Lange et al., 2021; Rusu et al., 2016; Liu et al., 2023), resulting in an inability to continuously utilize the newest information of the source model. CTNet (Liu et al., 2023) accomplishes continual transfer by passing the latest source domain representations through an adapter layer. However, in most real-world RSs, user behavioral sequences hold great potential for boosting the performance of the CTR model, and CTNet ignores the representations of user behavior sequences in the source model. Furthermore, the target model outperforms the source model on certain samples. Therefore, we need the source model to provide valuable incremental information for the samples that the target model cannot handle well. For the samples that the target model can predict more accurately, we should minimize intervention.

To better solve the above issues in cross-domain modeling, we propose an Entire space Continual and Adaptive Transfer learning framework (ECAT). As shown in Figure 1, the ECAT framework mainly includes two parts: sample transfer and continual representation transfer. Specifically, we perform a coarse sample selection through a graph-guided method, followed by a fine-grained selection using a domain adaptation method. During the training process, we continuously transfer valuable information from the source model using an adaptive knowledge distillation method. During the online inference process, the only additional component introduced is the adapter layers, which add very little complexity.

To summarize, the main contributions include:

  • We propose the ECAT framework, which enables full utilization of the source domain samples and representations under the supervision of the target task, while alleviating negative transfer.

  • We propose a two-stage method that realizes a coarse-to-fine process for sample transfer (GST & DA), which enables ECAT to efficiently select samples that are valuable for the target task.

  • We propose an adaptive knowledge distillation method (AKD-CT) for continually transferring the representations from a source model that is well-trained on the entire space dataset, which allows the ECAT framework to adaptively decide whether to incorporate representational information from the source model.

  • We evaluate ECAT on the Taobao industrial dataset. Comprehensive experiments show that ECAT advances state-of-the-art performance on offline metrics, and brings +13.6% CVR and +8.6% orders for Baiyibutie, a mini-app of Taobao.

2. Methods

2.1. Problem Definition

Mathematically, we represent samples from the source domain and the target domain as $D_{s}=(x_{i}^{s},y_{i}^{s})$ and $D_{t}=(x_{i}^{t},y_{i}^{t})$ respectively, where $x^{s}\in R^{d_{s}}$ and $x^{t}\in R^{d_{t}}$. Labels $y_{i}^{s}$ and $y_{i}^{t}\in\{0,1\}$ indicate whether $item_{i}$ was purchased or not. It is worth mentioning that we have established the capability to acquire samples from the entire domain of Taobao. $S$ is a model that is continually trained on $D_{s}$ and is therefore capable of learning new distributions in a timely manner. In this study, our goal is to train a model $T$ using $D_{t}$ while exploiting incremental information from two sources: sample transfer from $D_{s}$ and representation transfer from $S$. Furthermore, we represent $D_{s}$ through a graph $G_{s}=(V_{s},E_{s})$, where $V_{s}=\{u_{1}^{s},i_{1}^{s},...,u_{n}^{s},i_{n}^{s}\}$ denotes the user and item nodes in the graph of the source domain. Edge $e_{ij}\in E_{s}$ denotes that $user_{i}$ has clicked or purchased $item_{j}$. In other words, we can identify the corresponding samples in $D_{s}$ through the nodes and edges of $G_{s}$. Similarly, we define $G_{t}=(V_{t},E_{t})$ according to $D_{t}$.
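To make the graph view of the samples concrete, the following minimal Python sketch builds a user-item graph from a list of interaction samples. It assumes that each sample is a (user ID, item ID, purchase label) triple and that every sample corresponds to a click, so each sample yields one edge. The function name `build_graph` and the toy data are hypothetical and are not taken from ECAT's implementation.

```python
from collections import defaultdict


def build_graph(samples):
    """Build a bipartite user-item graph from (user_id, item_id, label) samples.

    Nodes are tagged tuples ("u", id) or ("i", id) so that user and item IDs
    cannot collide. Each edge stores the indices of the samples it came from,
    which lets us map nodes and edges back to samples later.
    """
    nodes = set()
    edges = defaultdict(list)  # (user_node, item_node) -> sample indices
    for idx, (user_id, item_id, _label) in enumerate(samples):
        u, i = ("u", user_id), ("i", item_id)
        nodes.update((u, i))
        edges[(u, i)].append(idx)
    return nodes, dict(edges)


# Toy source-domain samples: (user, item, purchased?)
d_s = [("alice", "shoe", 1), ("alice", "hat", 0), ("bob", "shoe", 0), ("alice", "shoe", 0)]
v_s, e_s = build_graph(d_s)
```

Storing sample indices on the edges is one simple way to realize the statement that samples can be identified through the nodes and edges of the graph.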

Figure 1. Illustration of the ECAT (Entire space Continual and Adaptive Transfer) framework. ECAT is composed of three parts. First, the Graph-guided module (a) and the Domain Adaptation module (b-left) transfer samples. Second, the target model is trained daily. Third, the Adaptive Knowledge Distillation module (b-right) transfers representations continually.

2.2. Model Overview

We decompose the ECAT framework into two serial stages. First, Figure 1(a) shows a simple yet effective method called Graph-guided Sample Transfer (GST), which aims to select samples from $D_{s}$ whose distribution is similar to that of $D_{t}$. GST can incorporate sample relevance by leveraging prior heuristic insights, or measure it through representation learning via graph neural networks. In this paper, we focus on e-commerce recommendation, where plenty of valuable prior knowledge inherently exists. For example, a direct browse, click or purchase of an item by a user acts as a one-hop link, while a two-hop link can be established between two items through a co-click relationship by users. Moreover, GST is versatile: it can employ strategies suited to the specific domain, or even train a graph representation network model. Second, from left to right, Figure 1(b) illustrates the Domain Adaptation (DA) module, which assesses the incremental value that the GST-selected samples $D_{gst}$ contribute to $T$, and the Adaptive Knowledge Distillation (AKD-CT) module, which assesses the incremental value that representations from the well-trained source model $S$ contribute to $T$. Both stages are detailed in the following subsections.

2.3. Graph guided and Domain Adaptation based Sample Transfer

Graph guided Module: Incorporating the findings of many related studies (Crawshaw, 2020; ** $V_{seed}=V_{t}\cap V_{s}$ onto $G_{s}$, anchoring the alignment through the correspondence of the same IDs between the domains. Subsequently, within $G_{s}=(V_{s},E_{s})$, we expand to include more nodes $V_{generalized}$, which are similar to the target domain, by exploiting one-hop (i.e., click or pay relationships) and two-hop (i.e., co-click or group cluster) connectivity. Finally, within the context of e-commerce recommendation systems, the relevant sample can be identified by specifying a distinct $user_{i}$ and $item_{j}$. In other words, we can convert $G^{'}_{s}=(V_{seed}\cup V_{generalized},E_{s})$ to $D_{gst}$.
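The seed-and-expand procedure can be sketched as follows. This is a minimal illustration under several assumptions that the text does not fix: samples are (user ID, item ID, label) triples, the expansion uses only click edges (no group clusters), and a source sample is kept when both its user and its item fall in the expanded node set. The function name `gst_select` and the toy data are hypothetical.

```python
def gst_select(source_samples, target_samples):
    """Coarse selection of source samples that are close to the target domain."""
    src_users = {u for u, _, _ in source_samples}
    src_items = {i for _, i, _ in source_samples}
    # Seed nodes: IDs present in both domains (V_seed = V_t ∩ V_s).
    seed_users = src_users & {u for u, _, _ in target_samples}
    seed_items = src_items & {i for _, i, _ in target_samples}

    # One-hop expansion along source-domain interaction edges.
    one_hop_items = {i for u, i, _ in source_samples if u in seed_users}
    one_hop_users = {u for u, i, _ in source_samples if i in seed_items}
    # Two-hop expansion: items co-clicked with seed items, and users who
    # co-clicked an item with a seed user.
    two_hop_items = {i for u, i, _ in source_samples if u in one_hop_users}
    two_hop_users = {u for u, i, _ in source_samples if i in one_hop_items}

    keep_users = seed_users | one_hop_users | two_hop_users
    keep_items = seed_items | one_hop_items | two_hop_items
    # Convert the expanded node set back into samples (D_gst).
    return [s for s in source_samples if s[0] in keep_users and s[1] in keep_items]


d_s = [("u1", "a", 1), ("u1", "b", 0), ("u2", "b", 0), ("u3", "c", 1), ("u4", "d", 0)]
d_t = [("u1", "a", 1), ("u9", "z", 0)]
d_gst = gst_select(d_s, d_t)
```

In this toy example, user `u2` is retained because it co-clicked item `b` with the seed user `u1`, while the samples of `u3` and `u4` have no link to the target domain and are dropped.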

Target Domain Module: The structure of the target model $T$ is similar to ETA (Chen et al., 2022), which includes four parts. First, the embedding layer maps features, primarily categorical and numerical features, to representations of a specific dimension. It is worth mentioning that the categorical features are extremely important and require a substantial number of samples for effective training. This is the main reason why we introduce the model $S$, which is well-trained across the entire space. Subsequently, long- and short-term user behavioral sequences are mapped into higher-level semantic representations through the sequence layers. Finally, we obtain the score after successively passing through the classification and logit layers. $L_{y}$ is usually a binary cross-entropy loss function.

(1) $\displaystyle L_{y}=\frac{1}{n_{t}+n_{gst}}\sum_{x_{i}^{t}\in D_{t}\cup D_{gst}}L_{ce}\left[G^{t}(\Phi^{t}(x_{i}^{t})),y_{i}^{t}\right],$

where $n_{t}$ and $n_{gst}$ are the sample sizes of $D_{t}$ and $D_{gst}$ respectively. $G^{t}$ denotes the output obtained when a sample passes sequentially through the layers of $T$, from the representation layers to the logit layer. $\Phi^{t}$ is designed to map samples from different feature dimensions to the same feature space, such as attention maps (Komodakis and Zagoruyko, 2017).
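Equation (1) is a mean cross-entropy over the union of the target samples and the GST-selected samples. A minimal sketch, assuming the model output has already been converted to a probability and that each batch is a list of (predicted probability, label) pairs; the function names `bce` and `l_y` are illustrative only:

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one sample, with clipping for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def l_y(target_batch, gst_batch):
    """Mean cross-entropy over D_t ∪ D_gst, normalized by n_t + n_gst."""
    merged = target_batch + gst_batch
    return sum(bce(p, y) for p, y in merged) / len(merged)


# One target sample and one GST sample, both predicted at 0.5.
loss = l_y([(0.5, 1)], [(0.5, 0)])
```

Because both sets share a single normalizer, the relative sizes of $D_{t}$ and $D_{gst}$ directly determine how much each contributes to the loss, which is one reason the coarse selection matters.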

Domain Adaptation Module: To select samples from $D_{gst}$ that better fit the distribution of $D_{t}$, as shown in Figure 1(b), we refine the sample selection by incorporating a Domain Adaptation module. The training dataset is $D_{da}=D_{t}\cup D_{gst}=(x_{i}^{da},y_{i}^{da})$ and $\Phi(x_{i}^{da})$ denotes domain-independent features. Label $y_{i}^{da}\in\{0,1\}$ indicates whether the sample $x_{i}^{da}$ belongs to $D_{t}$ or $D_{gst}$. The optimization objective $L_{da}$ is a binary cross-entropy loss function.

The DA module is effective due to three key factors. First, to avoid feature bias, we ensure the effectiveness of the discriminator by solely using domain-independent features. Second, to avoid model bias towards the source domain, we select samples similar to the target domain distribution through GST. Third, to prevent the target model from being influenced by irrelevant gradients, we stop the gradients produced by the DA module from propagating into the target model.
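The domain discriminator can be sketched as a small binary classifier over domain-independent features. The sketch below assumes a single logistic layer and the label convention 1 for $D_{t}$ and 0 for $D_{gst}$; the paper specifies neither, and all function names are hypothetical. In an automatic-differentiation framework, the discriminator input would additionally be detached so that $L_{da}$ does not update the target model.

```python
import math


def make_d_da(target_feats, gst_feats):
    """Build D_da with domain labels (assumed: 1 = target domain, 0 = GST sample)."""
    return [(x, 1) for x in target_feats] + [(x, 0) for x in gst_feats]


def discriminator(x, weights, bias):
    """Logistic discriminator: probability that x comes from the target domain."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def l_da(d_da, weights, bias, eps=1e-7):
    """Binary cross-entropy of the discriminator over D_da."""
    total = 0.0
    for x, y in d_da:
        p = min(max(discriminator(x, weights, bias), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(d_da)


d_da = make_d_da([[1.0, 2.0]], [[0.5, 0.1]])
untrained_loss = l_da(d_da, [0.0, 0.0], 0.0)
```

An untrained discriminator outputs 0.5 for every sample, so its loss equals $\ln 2$; a GST sample that still receives a score near 0.5 after training is one the discriminator cannot tell apart from target data.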

Up to this point, we have achieved satisfactory results in sample transfer. However, as time progresses, the target model $T$ will gradually forget the representations obtained through a one-time warm-up from $S$, while the representations of $S$ also continue to update. We address this issue in the next section.

2.4. Adaptive Knowledge Distillation based Continual representation Transfer

Source Domain Module: To provide incremental information, we introduce a source model $S$ that has been well-trained in the entire space. During the training process of $T$, $S$ only executes forward propagation, which entails a low computational complexity. As shown in Figure 1(b-right), $S$ and $T$ have identical architectures.

AKD-CT Module: Inspired by CTNet, ECAT endeavors to enhance the target model performance in the CTL setting. ECAT differs in that it further transfers representations from all layers, from the embedding layers to the logit layers, and in particular the sequence layer representations. To achieve this, we propose an Adaptive Knowledge Distillation based Continual Transfer (AKD-CT) method. Figure 1(b-right) illustrates the training process of AKD-CT, showcasing the representation distillation of the sequence layers. Specifically, we obtain the representations of various behavioral sequence features after passing through the embedding and sequence layers. $e_{seq}^{t}$ and $e_{seq}^{s}$ denote the sequence representations obtained from $T$ and $S$, respectively. Subsequently, $e_{seq}^{t}$ is fed into the adapter layers to obtain $e_{seq}^{t^{\prime}}$, which then distills knowledge from $S$ under the supervision of $L_{di}$. We use a cosine similarity loss to pull $e_{seq}^{t^{\prime}}$ and $e_{seq}^{s}$ closer together. To prevent noise from the distillation process, we stop the gradient from propagating to $T$. Having obtained the incremental information $e_{seq}^{t^{\prime}}$, all that remains is to fuse it appropriately into $T$.
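The adapter and the distillation loss can be sketched as follows. The sketch assumes a single linear adapter layer and the common form $1-\cos(\cdot,\cdot)$ for the cosine similarity loss; the paper does not give the exact form of $L_{di}$, so both choices, as well as the function names, are assumptions.

```python
import math


def adapter(e_t, weight, bias):
    """Single linear adapter layer: e_t' = W e_t + b (assumed architecture)."""
    return [sum(w * v for w, v in zip(row, e_t)) + b for row, b in zip(weight, bias)]


def cosine_similarity(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def l_di(e_t_prime, e_s):
    """Distillation loss: near 0 when aligned, 1 when orthogonal, near 2 when opposite."""
    return 1.0 - cosine_similarity(e_t_prime, e_s)


e_t = [1.0, 0.0]                      # sequence representation from T
identity = [[1.0, 0.0], [0.0, 1.0]]   # toy adapter weights
e_t_prime = adapter(e_t, identity, [0.0, 0.0])
aligned = l_di(e_t_prime, [2.0, 0.0])     # same direction as the source representation
orthogonal = l_di(e_t_prime, [0.0, 3.0])  # unrelated to the source representation
```

Since cosine similarity ignores vector norms, this loss constrains only the direction of the adapted representation, not its scale.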

The design of Adaptive Knowledge Distillation is three-fold. First, considering that $T$ may discriminate certain samples better than $S$, we introduce an adaptive gate network to assess the value of $e_{seq}^{t^{\prime}}$ for $T$. Specifically, we concatenate $e_{seq}^{t^{\prime}}$, $e_{seq}^{t}$ and the entropy from $T$ as the input of the adaptive gate network to generate the fusion weight. Under the supervision of the loss $L_{y}$, the fusion weight indicates the importance of $e_{seq}^{t^{\prime}}$. Second, our objective is not to distill representations from the source model that are merely similar, but rather those that are better suited to $T$. Therefore, each sample is associated with a distillation intensity $w_{i}^{pow}$ that governs the degree to which $e_{seq}^{t^{\prime}}$ approximates $e_{seq}^{s}$. Ideally, the distillation intensity would be higher for samples that $T$ finds hard to predict. After numerous experiments, we adopt the cosine similarity to calculate $w_{i}^{pow}$. Third, our primary task is to enhance the performance of $T$. Therefore, the adapter layers responsible for generating $e_{seq}^{t^{\prime}}$ are also subject to supervision from $L_{y}$.
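The adaptive gate can be sketched as below. The sketch assumes a single sigmoid layer as the gate network, the binary entropy of the predicted probability as the entropy from $T$, and an additive fusion rule; none of these details is specified in the text, so they should be read as illustrative assumptions. The computation of the distillation intensity is left out because its exact formula is not given.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy of T's predicted probability: highest when T is most uncertain."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def gate_weight(e_t_prime, e_t, p_target, gate_w, gate_b):
    """Fusion weight in (0, 1) computed from [e_t' ; e_t ; entropy]."""
    features = list(e_t_prime) + list(e_t) + [binary_entropy(p_target)]
    z = sum(w * v for w, v in zip(gate_w, features)) + gate_b
    return 1.0 / (1.0 + math.exp(-z))


def fuse(e_t_prime, e_t, g):
    """Assumed fusion rule: add the gated incremental representation to T's own."""
    return [t + g * tp for tp, t in zip(e_t_prime, e_t)]


# Toy gate that looks only at the entropy feature (last weight).
gate_w, gate_b = [0.0, 0.0, 0.0, 0.0, 10.0], -5.0
g_uncertain = gate_weight([0.0, 0.0], [0.0, 0.0], 0.5, gate_w, gate_b)
g_confident = gate_weight([0.0, 0.0], [0.0, 0.0], 0.99, gate_w, gate_b)
fused = fuse([1.0, 1.0], [2.0, 3.0], 0.5)
```

With these toy weights the gate opens for the uncertain prediction and stays nearly closed for the confident one, which mirrors the intended behavior of minimizing intervention on samples that $T$ already predicts well.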

In summary, the final loss of ECAT is as follows:

(2) $\displaystyle L_{ECAT}=\mathbf{w}^{da}*L_{y}+\alpha*\mathbf{w}^{pow}*L_{di}+\beta*L_{da},$

where $\alpha$ and $\beta$ are hyperparameters that control the weights of the corresponding losses, $\mathbf{w}^{da}$ is the entropy value from the DA module for each sample, and $\mathbf{w}^{pow}$ is the distillation intensity for each sample.
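A minimal sketch of how the three terms of Equation (2) could be combined for one batch is given below. It assumes that the prediction and distillation losses are available per sample, that the per-sample weights are treated as constants, and that weighted terms are averaged over the batch; the averaging and the function name `ecat_loss` are assumptions, since the equation leaves the reduction implicit.

```python
def ecat_loss(per_sample_l_y, per_sample_l_di, l_da, w_da, w_pow, alpha, beta):
    """Combine prediction, distillation and domain-adaptation losses for one batch.

    per_sample_l_y, per_sample_l_di, w_da and w_pow are lists of equal length;
    l_da, alpha and beta are scalars.
    """
    n = len(per_sample_l_y)
    weighted_y = sum(w * l for w, l in zip(w_da, per_sample_l_y)) / n
    weighted_di = sum(w * l for w, l in zip(w_pow, per_sample_l_di)) / n
    return weighted_y + alpha * weighted_di + beta * l_da


total = ecat_loss(
    per_sample_l_y=[1.0, 3.0],
    per_sample_l_di=[2.0, 0.0],
    l_da=0.5,
    w_da=[1.0, 1.0],
    w_pow=[0.5, 0.5],
    alpha=2.0,
    beta=4.0,
)
```

Here the three contributions are 2.0, 1.0 and 2.0 respectively, which shows how $\alpha$ and $\beta$ rescale the auxiliary objectives relative to the prediction loss.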

3. Experiments

3.1. Experimental Setup

3.1.1. Dataset

In the absence of suitable public benchmarks for evaluating continual cross-domain prediction, we adopt Taobao industrial datasets to comprehensively compare ECAT and the baselines. Specifically, we use Baiyibutie from Taobao as the target domain, which generates millions of CVR samples every day, accounting for less than 1% of the entire space of Taobao. The users and items in the source domain and the target domain partially overlap, but the data distributions are very different. In this study, we utilize target domain samples spanning 90 days, amounting to a total of 120 million samples. Similarly, taking the entire space as the source domain, we have accumulated a total of 66 billion samples. During A/B testing, ECAT serves hundreds of thousands of users daily.
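As a quick consistency check on the dataset figures above, the arithmetic can be worked through as follows. The check assumes that the 66 billion source samples cover the same 90-day window as the target samples, which the text suggests but does not state explicitly.

```python
target_samples = 120_000_000      # target domain, 90 days
source_samples = 66_000_000_000   # entire space (assumed to span the same window)
days = 90

per_day = target_samples / days            # average target samples per day
ratio = target_samples / source_samples    # target share of the entire space

print(f"target samples per day: {per_day:,.0f}")
print(f"target share of entire space: {ratio:.2%}")
```

Under this assumption the target domain produces on the order of a million samples per day and makes up well under 1% of the entire space, in line with the description above.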

3.1.2. Baseline Models

We compare against sample transfer methods, namely simply merging $D_{s}$ and $D_{t}$ and DANN (Ganin and Lempitsky, 2015); representation transfer methods, including Shared Bottom, MMoE (Ma et al., 2018) and PLE (Tang et al., 2020); and a continual learning method, CTNet (Liu et al., 2023).

3.1.3. Implementation Details

To ensure the fairness of the experiments, all single-domain methods employ ETA (Chen et al., 2022) as the architecture, including the target model and the source model. We use AdagradDecayV2 (Duchi et al., 2011) as the optimizer. The learning rate is set to 0.01 and the batch size is 1024. The dimensions of the MLP layers are set to 1024, 512 and 256. Following previous work, we adopt AUC to measure the CVR prediction performance in offline evaluation.
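For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below is a simple quadratic-time illustration with a hypothetical function name, not the evaluation code used in the experiments.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
value = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Because AUC depends only on the ranking of scores, it is insensitive to the calibration of the predicted CVR, so improvements in the third or fourth decimal place reflect changes in ordering rather than in absolute probabilities.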

3.2. Offline Evaluation

As shown in Table 1, our ECAT (GST & DA and AKD-CT) outperforms all baselines. More specifically:

(1) ECAT achieves the best performance (AUC=0.8348) compared to both single-domain and cross-domain methods, with its core advantage being that ECAT simultaneously considers the adaptability of both sample and representation for the target task.

(2) In terms of sample transfer: ECAT enhances the performance of every method, including CTNet (AUC from 0.8307 to 0.8327), by transferring valuable samples from $D_s$ through GST & DA. In contrast, directly merging $D_s$ with $D_t$ degrades performance for every method except Shared Bottom.

(3) In terms of continual representation transfer: ECAT further improves performance through the adaptive capabilities of the AKD-CT module, which, under the supervision of the target task, continuously transfers valuable representation information from $S$. Even with the same sample transfer strategy, AKD-CT (AUC=0.8348) surpasses CTNet (AUC=0.8327).

Table 1. Offline results of various methods. Columns are the sample transfer settings.

| Method | Only $D_t$ | Merge $D_s$ and $D_t$ | GST & DA |
|---|---|---|---|
| Target Model | 0.8284 | 0.8276 | 0.8301 |
| DANN | 0.8289 | 0.8287 | 0.8307 |
| Shared Bottom | 0.8297 | 0.8301 | 0.8312 |
| MMoE | 0.8287 | 0.8286 | 0.8303 |
| PLE | 0.8303 | 0.8298 | 0.8324 |
| CTNet | 0.8307 | 0.8302 | 0.8327 |
| AKD-CT | 0.8327 | 0.8321 | 0.8348 |
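
The effect of each sample transfer setting can be read off Table 1 by subtracting the "Only $D_t$" column from the other two. The short script below does this arithmetic on the values transcribed from the table; the dictionary `TABLE1` and the helper `deltas` are illustrative names, not part of the ECAT code base.

```python
from typing import Dict, Tuple

# AUC values transcribed from Table 1: (only D_t, merge D_s and D_t, GST & DA).
TABLE1: Dict[str, Tuple[float, float, float]] = {
    "Target Model": (0.8284, 0.8276, 0.8301),
    "DANN": (0.8289, 0.8287, 0.8307),
    "Shared Bottom": (0.8297, 0.8301, 0.8312),
    "MMoE": (0.8287, 0.8286, 0.8303),
    "PLE": (0.8303, 0.8298, 0.8324),
    "CTNet": (0.8307, 0.8302, 0.8327),
    "AKD-CT": (0.8327, 0.8321, 0.8348),
}


def deltas(table: Dict[str, Tuple[float, float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return {method: (merge - only_dt, gst_da - only_dt)}, rounded to 4 decimals."""
    return {
        method: (round(merge - only, 4), round(gst - only, 4))
        for method, (only, merge, gst) in table.items()
    }


for method, (d_merge, d_gst) in deltas(TABLE1).items():
    print(f"{method:14s} merge: {d_merge:+.4f}   GST & DA: {d_gst:+.4f}")
```

With these numbers, GST & DA raises AUC for every listed method, while the plain merge lowers it for all methods except Shared Bottom.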

3.3. Research Questions

3.3.1. RQ1: How to prove the adaptive capability of AKD-CT model?


Table 2 shows that (1) AKD-CT performs worse without the gate. The reason is that the gate network assesses the importance of incremental information for $T$, providing valuable incremental representation information for samples with higher uncertainty and lower confidence. (2) Removing the distillation intensity from AKD-CT also lowers AUC, which suggests that an intensity-based strategy facilitates the distillation of representations better suited to $T$.

Table 2. Ablation experiments on adaptive capability.

| Adaptive Setting | AUC |
|---|---|
| AKD-CT | 0.8348 |
| AKD-CT without gate | 0.8331 (-0.20%) |
| AKD-CT without intensity | 0.8342 (-0.07%) |
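
The percentages in Table 2 appear to be relative AUC changes with respect to the full AKD-CT model. A quick check under that assumption (the helper name `relative_change_pct` is illustrative):

```python
def relative_change_pct(full: float, ablated: float) -> float:
    """Relative AUC change of an ablated variant versus the full model, in percent."""
    return (ablated - full) / full * 100.0


full_auc = 0.8348  # AKD-CT, from Table 2
for name, value in [("without gate", 0.8331), ("without intensity", 0.8342)]:
    print(f"AKD-CT {name}: {relative_change_pct(full_auc, value):+.2f}%")
```

The script prints -0.20% and -0.07%, matching the values reported in the table.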

3.3.2. RQ2: How to prove the necessity of CTL setting?


We compare AKD-CT and CTNet under the continual transfer and one-time transfer settings. Table 3 shows that continual transfer matches one-time transfer at $t$ and outperforms it at $t+\Delta t$ and $t+2\Delta t$, which illustrates the necessity of CTL. $\Delta t$ is 30 days in this study.

Table 3. Comparison between different transfer settings.

| Transfer Setting | Method | $t$ | $t+\Delta t$ | $t+2\Delta t$ |
|---|---|---|---|---|
| one-time | Base (PLE) | 0.8354 | 0.8372 | 0.8324 |
| | CTNet | 0.8356 | 0.8373 | 0.8325 |
| | AKD-CT | 0.8366 | 0.8378 | 0.8336 |
| continual | CTNet | 0.8356 | 0.8375 | 0.8327 |
| | AKD-CT | 0.8366 | 0.8382 | 0.8348 |

4. CONCLUSIONS

In this paper, we introduce the ECAT framework for cross-domain prediction, which considers not only the continual transfer of both samples and representations but also the adaptability of incremental information to the target task. Experiments conducted on a large-scale industrial dataset, along with online A/B testing, confirm its effectiveness in real-world applications. Notably, ECAT has been deployed in the recommender system of Taobao to serve numerous marketing channels, including Baiyibutie.

References

  • Chen et al. (2023) Liyue Chen, Linian Wang, Jinyu Xu, Shuai Chen, Weiqiang Wang, Wenbiao Zhao, Qiyu Li, and Leye Wang. 2023. Knowledge-inspired Subdomain Adaptation for Cross-Domain Knowledge Transfer. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 234–244.
  • Chen et al. (2021) Lei Chen, Fajie Yuan, Jiaxi Yang, Xiangnan He, Chengming Li, and Min Yang. 2021. User-specific adaptive fine-tuning for cross-domain recommendations. IEEE Transactions on Knowledge and Data Engineering (2021).
  • Chen et al. (2022) Qiwei Chen, Yue Xu, Changhua Pei, Shanshan Lv, Tao Zhuang, and Junfeng Ge. 2022. Efficient Long Sequential User Data Modeling for Click-Through Rate Prediction. arXiv preprint arXiv:2209.12212 (2022).
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
  • Crawshaw (2020) Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796 (2020).
  • De Lange et al. (2021) Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence 44, 7 (2021), 3366–3385.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research 12, 7 (2011).
  • Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International conference on machine learning. PMLR, 1180–1189.
  • Gao et al. (2023) Jingtong Gao, Xiangyu Zhao, Bo Chen, Fan Yan, Huifeng Guo, and Ruiming Tang. 2023. AutoTransfer: Instance Transfer for Cross-Domain Recommendations. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1478–1487.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • Hu et al. (2018) Guangneng Hu, Yu Zhang, and Qiang Yang. 2018. Conet: Collaborative cross networks for cross-domain recommendation. In Proceedings of the 27th ACM international conference on information and knowledge management. 667–676.
  • Hu et al. (2019) Jian Hu, Hongya Tuo, Chao Wang, Lingfeng Qiao, Haowen Zhong, and Zhongliang Jing. 2019. Multi-Weight Partial Domain Adaptation. In BMVC. 5.
  • Hu et al. (2020) Jian Hu, Hongya Tuo, Chao Wang, Lingfeng Qiao, Haowen Zhong, Junchi Yan, Zhongliang Jing, and Henry Leung. 2020. Discriminative partial domain adversarial network. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16. Springer, 632–648.
  • Huan et al. (2023) Zhaoxin Huan, Ang Li, Xiaolu Zhang, Xu Min, Jieyu Yang, Yong He, and Jun Zhou. 2023. SAMD: An Industrial Framework for Heterogeneous Multi-Scenario Recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4175–4184.
  • Komodakis and Zagoruyko (2017) Nikos Komodakis and Sergey Zagoruyko. 2017. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR.
  • Li et al. (2023) Chenglin Li, Yuanzhen Xie, Chenyun Yu, Bo Hu, Zang Li, Guoqiang Shu, Xiaohu Qie, and Di Niu. 2023. One for All, All for One: Learning and Transferring User Embeddings for Cross-Domain Recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 366–374.
  • Liu et al. (2023) Lixin Liu, Yanling Wang, Tianming Wang, Dong Guan, Jiawei Wu, Jingxu Chen, Rong Xiao, Wenxiang Zhu, and Fei Fang. 2023. Continual Transfer Learning for Cross-Domain Click-Through Rate Prediction at Taobao. In Companion Proceedings of the ACM Web Conference 2023. 346–350.
  • Ma et al. (2018) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
  • Mu et al. (2023) Shanlei Mu, Penghui Wei, Wayne Xin Zhao, Shaoguo Liu, Liang Wang, and Bo Zheng. 2023. Hybrid Contrastive Constraints for Multi-Scenario Ad Ranking. arXiv preprint arXiv:2302.02636 (2023).
  • Ouyang et al. (2020) Wentao Ouyang, Xiuwu Zhang, Lei Zhao, Jinmei Luo, Yu Zhang, Heng Zou, Zhaojie Liu, and Yanlong Du. 2020. Minet: Mixed interest network for cross-domain click-through rate prediction. In Proceedings of the 29th ACM international conference on information & knowledge management. 2669–2676.
  • Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692.
  • Rusu et al. (2016) Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016).
  • Sheng et al. (2021) Xiang-Rong Sheng, Liqin Zhao, Guorui Zhou, Xinyao Ding, Binding Dai, Qiang Luo, Siran Yang, Jingshan Lv, Chi Zhang, Hongbo Deng, et al. 2021. One model to serve all: Star topology adaptive recommender for multi-domain ctr prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4104–4113.
  • Tang et al. (2020) Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems. 269–278.
  • Tian et al. (2023) Yu Tian, Bofang Li, Si Chen, Xubin Li, Hongbo Deng, Jian Xu, Bo Zheng, Qian Wang, and Chenliang Li. 2023. Multi-Scenario Ranking with Adaptive Feature Learning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 517–526.
  • Wang et al. (2020) Hao Wang, Hao He, and Dina Katabi. 2020. Continuously indexed domain adaptation. arXiv preprint arXiv:2007.01807 (2020).
  • Xie et al. (2022) Yufeng Xie, Mingchu Li, Kun Lu, Syed Bilal Hussain Shah, and Xiao Zheng. 2022. Multi-task Learning Model based on Multiple Characteristics and Multiple Interests for CTR prediction. In 2022 IEEE Conference on Dependable and Secure Computing (DSC). IEEE, 1–7.
  • Yang et al. (2023) Xuanhua Yang, Jianxin Zhao, Shaoguo Liu, Liang Wang, and Bo Zheng. 2023. Gradient Coordination for Quantifying and Maximizing Knowledge Transference in Multi-Task Learning. arXiv preprint arXiv:2303.05847 (2023).
  • Zhang et al. (2021) Weinan Zhang, Jiarui Qin, Wei Guo, Ruiming Tang, and Xiuqiang He. 2021. Deep learning for click-through rate estimation. arXiv preprint arXiv:2104.10584 (2021).
  • Zhang and Yang (2021) Yu Zhang and Qiang Yang. 2021. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering 34, 12 (2021), 5586–5609.
  • Zhao et al. (2023) Pengyu Zhao, Xin Gao, Chunxu Xu, and Liang Chen. 2023. M5: Multi-Modal Multi-Interest Multi-Scenario Matching for Over-the-Top Recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5650–5659.
  • Zou et al. (2022) Xinyu Zou, Zhi Hu, Yiming Zhao, Xuchu Ding, Zhongyi Liu, Chenliang Li, and Aixin Sun. 2022. Automatic expert selection for multi-scenario and multi-task search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1535–1544.