arXiv:2312.13484v1 [stat.ML] 20 Dec 2023

Bayesian Transfer Learning

Piotr M. Suder, Jason Xu, and David B. Dunson

Piotr M. Suder: PhD Student, Department of Statistical Science, Duke University. Jason Xu: Assistant Professor, Department of Statistical Science, Duke University. David B. Dunson: Arts and Sciences Distinguished Professor, Departments of Statistical Science and Mathematics, Duke University.

Abstract

Transfer learning is a burgeoning concept in statistical machine learning that seeks to improve inference and/or predictive accuracy on a domain of interest by leveraging data from related domains. While the term “transfer learning” has garnered much recent interest, its foundational principles have existed for years under various guises. Prior literature reviews in computer science and electrical engineering have sought to bring these ideas into focus, primarily surveying general methodologies and works from these disciplines. This article highlights Bayesian approaches to transfer learning, which have received relatively limited attention despite their innate compatibility with the notion of drawing upon prior knowledge to guide new learning tasks. Our survey encompasses a wide range of Bayesian transfer learning frameworks applicable to a variety of practical settings. We discuss how these methods address the problem of finding the optimal information to transfer between domains, which is a central question in transfer learning. We illustrate the utility of Bayesian transfer learning methods via a simulation study where we compare performance against frequentist competitors.

Keywords: Bayesian machine learning, domain adaptation, hierarchical model, meta analysis

1 Introduction

Transfer learning, that is, applying knowledge gained from training on previous tasks and domains toward new tasks, is a burgeoning concept in statistics and machine learning. This natural idea mimics some of the mechanisms of human intelligence, where past experience, skills and knowledge are often utilized in learning new topics. It is appealing to apply the same paradigm in developing machine intelligence to extract knowledge from the rapidly growing body of datasets available to scientists, which are often related to each other in various ways. If the domains between which the transfer of information occurs are sufficiently related, transfer learning can substantially improve the performance of the target model. This is particularly useful when we want to study a small target dataset that does not contain enough datapoints to extract precise inferences or predictions, but have access to a large, related dataset.

For instance, suppose we want to study brain connectomes of Alzheimer’s patients or genomes of people suffering from a rare type of cancer. We may utilize large datasets of brain connectomes or genomes collected from healthy individuals such as the ones provided by the UK Biobank to improve the models fitted to the target data. These related sources may aid in the extraction of, say, a low dimensional latent representation of the complex data we seek to study, which can be useful toward dimensionality reduction in the target domain.

Although the term transfer learning has seen increasing popularity in recent years, some of the ideas undergirding it have been around for much longer, and have appeared under various names. Several recent literature reviews aim to help researchers organize and classify these ideas systematically. To name a few, [70], and more recently [69] and [110], provide general overviews on transfer learning methodology, largely from the computer science and electrical engineering literature. [84] focuses on transfer learning in deep neural networks, an appealing use-case due to the data-hungry nature of deep learning models together with the availability of large datasets for training source models. Areas where deep learning is commonly applied such as computer vision often leverage public datasets such as ImageNet [24] or Open Images V4 [52], with millions of datapoints available for training. Meanwhile, [106] focuses on the phenomenon of negative transfer, where the source domains are too different from the target domain, so that applying transfer learning worsens the performance of the target learner. The existence of the negative transfer phenomenon illustrates the importance of choosing an appropriate amount of information to be transferred (the “strength” of transfer) between domains, which remains one of the key challenges in transfer learning and will be one of the focal topics in this survey.

With the exception of [99], none of these literature reviews substantially focus on Bayesian views. While the work of Xuan et al. [99] explicitly overviews Bayesian transfer learning, its scope is limited to probabilistic graphical models. One can argue that the Bayesian paradigm provides a natural framework for how to incorporate prior information from previous datasets within current inferences, and hence provides a canonical umbrella of approaches for transfer learning. In this paper we provide an overview of some highlights of the Bayesian transfer learning literature. Our focus is on describing how different classical Bayesian approaches can be either directly applied or easily adapted to transfer learning problems. In doing so, we contribute various ideas toward answering a central question of transfer learning: how do we determine and enforce optimal information transfer between domains utilizing various Bayesian modeling approaches? Our aim is to contribute a broad view of Bayesian transfer learning, while presenting approaches that help surmount the problem of negative transfer.

The rest of the paper is organized as follows. In the following section, we give formal definitions of transfer learning and related areas, and discuss alternative names for related ideas in the literature. In Section 3 we provide an overview of general Bayesian approaches to transfer learning with specific examples and some applications. In Sections 4 and 5 we provide a brief taxonomy of transfer learning and point out several areas where some specific Bayesian approaches introduced here can be particularly useful. Finally, in Section 6 we present a simulation study comparing one of the Bayesian methods introduced here with frequentist transfer learning competitors. We conclude with a discussion in Section 7.

2 Definition and related areas

While approaches for transferring information across statistical tasks have a rich history, use of the “transfer learning” terminology is relatively recent. Perhaps as a result, there is not yet a standard technical definition of what qualifies as transfer learning. While some authors adopt a narrow definition of transferring parameters between models [41], others welcome broader, more general definitions [70], [110], [84]. In this section we provide one definition of transfer learning to fix ideas for the rest of the article. We then discuss closely related areas. Here by domain we denote the two-element set of the form $\mathcal{D}=\{\mathcal{X},P\}$, where $\mathcal{X}$ is the feature space and $P$ is the marginal probability distribution of the observations $X\in\mathcal{X}$ collected in a dataset associated with $\mathcal{D}$. Given a domain $\mathcal{D}$ and its associated label space $\mathcal{Y}$, [70] define a task on $\mathcal{D}$ as the set $\mathcal{T}=\{\mathcal{Y},f(\cdot)\}$, where $f$ is a function given by $f=\{(x,y)\mid x\in\mathcal{X},y\in\mathcal{Y}\}$. In this framework, $f$ is the ground truth, the optimal solution to the task which is not directly observed but whose approximation can be learned from the observed data.

Definition 2.1 (Transfer Learning).

Consider the source domains $\mathcal{D}_1,\mathcal{D}_2,\dots,\mathcal{D}_K$ with respective associated source tasks $\mathcal{T}_1,\mathcal{T}_2,\dots,\mathcal{T}_K$, as well as the target domain $\mathcal{D}_0$ with the associated target task $\mathcal{T}_0=\{\mathcal{Y}_0,f_0\}$, where an approximation to $f_0$ can be learned based on the available data $(X_0,Y_0)$ with $X_0\in\mathcal{X}_0$, $Y_0\in\mathcal{Y}_0$.
Suppose that $\mathcal{D}_k\neq\mathcal{D}_0$ or $\mathcal{T}_k\neq\mathcal{T}_0$ for any $k=1,\dots,K$. Transfer learning refers to algorithms which aim at improving the approximation of $f_0$ by incorporating the knowledge from $\mathcal{D}_1,\mathcal{D}_2,\dots,\mathcal{D}_K$ and $\mathcal{T}_1,\mathcal{T}_2,\dots,\mathcal{T}_K$.

In this setting, by knowledge we mean either: (i) the raw data sampled from the source domains, possibly equipped with labels from the source tasks, (ii) learners pre-trained on the data from source domains and tasks, or (iii) oracle models which have complete and true information on the source domains and the source tasks. Among these, cases (i) and (ii) are the most commonly encountered ones in practice.

Our definition follows and generalizes the conventions used by [70]. Note that in the above definition, the source and target domains are not necessarily different, encompassing cases with a common domain but different tasks. Furthermore, when two domains are different, their feature spaces need not differ. In the deep learning literature the target task is sometimes referred to as the downstream task [78].

By allowing the label space to be (a) discrete, (b) one-dimensional, (c) multidimensional, (d) defining only a partition of a dataset associated with $\mathcal{D}$ without giving a specific meaning to how the labels are used for that purpose, this definition comprises, respectively, (a) classification, (b) univariate regression, (c) multivariate regression and dimensionality reduction, and (d) clustering tasks. Finally, allowing the values of $f$ to be probability distributions naturally lends itself to Bayesian posterior learning.

2.1 Related fields

Another closely related problem which follows this paradigm is multitask learning. Just like transfer learning, its goal is to improve learning on a particular domain based on information from related domains and tasks, but it differs in seeking to simultaneously learn each task jointly on all the domains considered. This may improve the performance across tasks by borrowing information across related tasks and domains, in contrast to using a set of tasks and domains only as means to the end of improving performance on a single target task [70]. While some authors regard multitask learning and transfer learning as separate disciplines [70], [100], [91], others either consider them as the same field [99], or draw a distinction between the two according to different criteria, as in [37]. Often a multitask learning method can be easily adapted to transfer learning [70], [100]. As we will see in the following sections, the Bayesian framework elegantly reconciles these notions in many cases.

Continual or lifelong learning [46], [107], [51], [29] is a popular concept in machine learning, which combines aspects of transfer and multitask learning. In this setting an agent faces a sequence of domain-task pairs over time, with the goal being to utilize previously encountered tasks to learn each new task in a more effective way while maintaining the ability to solve the previous tasks [97]. Continual learning attempts to provide a remedy to the phenomenon of catastrophic forgetting [65] in transfer learning, where the model performs worse on the source tasks after being adjusted to the target task; this commonly occurs in deep learning models [8], [2]. This forgetting can be especially problematic when the number of encountered tasks and model parameters becomes large and it becomes difficult to store the previously encountered datasets and models trained on them.

There is additionally a Bayesian literature on metalearning, or learning to learn [101], [71], [73], [105]. Vanschoren [89] defines metalearning as methods aiming to improve the “configuration” (e.g. model hyperparameters, network architecture in case of deep learning methods, etc.) of the model for the target task by training on metadata. Here metadata refers to information obtained from models trained individually with different configurations; for example, one may vary different aspects of the model and measure its performance via cross validation. While some researchers consider metalearning as distinct from transfer learning [41], following Definition 2.1 we consider it to be a special case of transfer learning. In this case the information from the source domains is utilized by training models with various configurations on these domains and then using the metadata generated from them in improving the model for the target task.

Domain adaptation [108], [7], [32], [95] is another popular term, which is sometimes used interchangeably with transfer learning as in [50]. However, since knowledge can also be transferred between different tasks on the same domain, we view it as a particular case of transfer learning. Additional terminology for concepts closely related to transfer learning includes cooperative learning [109], [26], knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, incremental, and cumulative learning [70].

3 Bayesian approaches to transfer learning

Two fundamental questions which need to be addressed are: (i) how information should be transferred, and (ii) which information should be transferred. There are various approaches to answering these questions and they are often related to the models used for solving the source and target tasks. Determining appropriate information transfer between domains is critical, since transferring inappropriate information can result in large bias and suboptimal performance. In extreme cases, one obtains negative transfer [106], which corresponds to the case in which transferring information decreases performance.

Some of the existing approaches rely on expert knowledge about the domains considered and their relationships, some introduce statistical measures of similarity between domains, while others rely on more flexible model-based or validation-based approaches to the optimal choice of parameters controlling information transfer. In this section we discuss different ideas based on Bayesian methodology which can be used to tackle questions (i)-(ii).

3.1 Shared parameters

One of the most prevalent approaches is to use common parameters in the source and target domains. For exposition, throughout this subsection we assume only one source dataset $X_S$ and one target dataset $X_T$. For convenience of notation, by $X_d$ we denote both the datapoints and their associated labels (when applicable) for domain $d\in\{S,T\}$. We parameterize the data likelihood for both source and target domains as $p(X_S\mid\theta_C,\theta_S)$ and $p(X_T\mid\theta_C,\theta_T)$, respectively, where $\theta_C$ is the common vector of parameters, shared by both source and target data, while $\theta_S$ and $\theta_T$ are vectors of parameters unique to the datasets.
Let $\pi(\theta_C)$, $\pi(\theta_S)$, $\pi(\theta_T)$ be the prior distributions for, respectively, $\theta_C$, $\theta_S$, $\theta_T$.

A simple Bayesian transfer learning approach would compute the posterior for $\theta_C$ based on the prior $\pi(\theta_C)$ and the source data $X_S$ via

$$
\begin{aligned}
p(\theta_C\mid X_S) &\propto p(X_S\mid\theta_C)\,\pi(\theta_C) \\
&= \left(\int p(X_S\mid\theta_C,\theta_S)\,\pi(\theta_S)\,d\theta_S\right)\pi(\theta_C),
\end{aligned}
\tag{1}
$$

and then use $\pi^{*}(\theta_C,\theta_T)\propto p(\theta_C\mid X_S)\,\pi(\theta_T)$ as the prior for $(\theta_C,\theta_T)$ in the analysis of $X_T$ to obtain the posterior

$$
p^{*}(\theta_C,\theta_T\mid X_T)\propto p(X_T\mid\theta_C,\theta_T)\,\pi^{*}(\theta_C,\theta_T).
$$

It is straightforward to see that if $X_T \perp\!\!\!\perp X_S \mid (\theta_C,\theta_T)$ and $\theta_S \perp\!\!\!\perp \theta_T$ a priori, then this is equivalent to obtaining a posterior for $(\theta_C,\theta_T)$ based on the data $(X_T,X_S)$ with the prior $\pi(\theta_C)\,\pi(\theta_T)$ on $(\theta_C,\theta_T)$, i.e.

$$
p^{*}(\theta_C,\theta_T\mid X_T)=p(\theta_C,\theta_T\mid X_T,X_S),
\tag{2}
$$

where

$$
p(\theta_C,\theta_T\mid X_T,X_S)\propto p(X_T,X_S\mid\theta_C,\theta_T)\,\pi(\theta_C)\,\pi(\theta_T).
$$

Hence, this approach is equivalent to giving equal weights to the source and target data in computing the posterior of the shared parameters. This is an appropriate approach when the model is correctly specified and the true parameters $\theta_C$ are indeed exactly the same in the source and target populations.
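As a hedged illustration of this equivalence, the following minimal Python sketch uses a toy conjugate model that is not taken from the paper: a single shared parameter (a normal mean with known noise variance), no domain-specific parameters, and simulated data. The function name `normal_update` and all numerical settings are assumptions made for illustration. Under these assumptions, updating on the source data and then on the target data gives the same posterior as a single update on the pooled data.

```python
import random


def normal_update(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) of a normal mean with known noise variance."""
    post_prec = 1.0 / prior_var + len(data) / noise_var
    post_mean = (prior_mean / prior_var + sum(data) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec


# Simulated data (hypothetical): large source sample, small target sample.
rng = random.Random(0)
x_source = [rng.gauss(1.0, 1.0) for _ in range(200)]
x_target = [rng.gauss(1.0, 1.0) for _ in range(10)]

# Step 1: posterior from the source data only.
m_source, v_source = normal_update(0.0, 10.0, x_source, 1.0)
# Step 2: use that posterior as the prior for the target analysis.
m_seq, v_seq = normal_update(m_source, v_source, x_target, 1.0)

# Single update on the pooled data with the original prior.
m_joint, v_joint = normal_update(0.0, 10.0, x_source + x_target, 1.0)

print(m_seq, m_joint)
print(v_seq, v_joint)
```

In this sketch every source observation moves the posterior exactly as much as every target observation, which is why a large source sample can dominate a small target sample.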

However, in practice, it is likely that the assumption of exactly equivalent values of $\theta_C$ is an oversimplification. As the true values of $\theta_C$ vary more widely between the source and target domains, the above approach can have suboptimal performance, particularly when the source data sample size is larger than that of the target, which is often the case. A simple and commonly used heuristic solution is to specify the prior for $\theta_C$ in the target posterior as a variance-inflated version of the posterior $p(\theta_C\mid X_S)$ from the source data analysis.

Shwartz-Ziv et al. [78] apply a related approach to Bayesian deep neural networks (DNNs). First a Gaussian approximation to the posterior for the DNN fitted to the source data is obtained. The authors assume the “feature extractor” layers of the DNN are common to the source and target DNN. The variance of the Gaussian approximation to the source posterior for the weights in these layers is scaled up by a constant factor and then used as a prior for the feature extractor component in the target data DNN. The remaining weights characterizing the “head” of the DNN are given an isotropic Gaussian prior. To learn an appropriate amount of information sharing between the source and target domain, the scaling factor is chosen on held-out validation data from the target training dataset.
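The variance-inflation heuristic and the validation-based choice of the scaling factor can be sketched in a toy conjugate setting. This is an assumed analogue for a normal mean with known noise variance, not the DNN procedure of [78]; the helper names, the candidate grid of factors, the train/validation split, and the simulated data are all hypothetical.

```python
import math
import random


def normal_update(prior_mean, prior_var, data, noise_var):
    """Posterior (mean, variance) of a normal mean with known noise variance."""
    post_prec = 1.0 / prior_var + len(data) / noise_var
    post_mean = (prior_mean / prior_var + sum(data) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec


def log_predictive(mean, var, data, noise_var):
    """Sum of pointwise posterior-predictive log densities N(x | mean, var + noise_var)."""
    s2 = var + noise_var
    return sum(-0.5 * math.log(2 * math.pi * s2) - 0.5 * (x - mean) ** 2 / s2
               for x in data)


rng = random.Random(1)
noise_var = 1.0
x_source = [rng.gauss(0.0, 1.0) for _ in range(500)]  # source mean 0
x_target = [rng.gauss(0.8, 1.0) for _ in range(30)]   # shifted target mean
train, val = x_target[:20], x_target[20:]             # hold out part of the target data

# Source posterior, later inflated and reused as the target prior.
source_mean, source_var = normal_update(0.0, 100.0, x_source, noise_var)

posteriors, scores = {}, {}
for factor in [1.0, 10.0, 100.0, 1000.0, 10000.0]:
    # Inflate the source posterior variance by a constant factor.
    posteriors[factor] = normal_update(source_mean, factor * source_var, train, noise_var)
    # Score the resulting target posterior on held-out target data.
    scores[factor] = log_predictive(*posteriors[factor], val, noise_var)

best_factor = max(scores, key=scores.get)
print(best_factor, scores)
```

A factor of 1 reproduces the equal-weight update, while larger factors let the target data pull the posterior away from the source estimate; the held-out score is one simple way to pick between them.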

An alternative approach to controlling the influence of the source data on the target domain posterior distribution is the power prior [18], [19], [44], [43], [27]. The power prior for the target parameters is proportional to an initial prior multiplied by the source data likelihood raised to a fractional power. The fractional power serves to diminish the information provided by the source data likelihood. In our transfer learning setting, the joint prior for $(\theta_C,\theta_T)$ in the target model is given by

$$
\pi_{a_0}(\theta_C,\theta_T\mid X_S)\propto p(X_S\mid\theta_C)^{a_0}\,\pi(\theta_C)\,\pi(\theta_T),
\tag{3}
$$

where the strength of information transfer ranges from no transfer at $a_0=0$ to “full” transfer at $a_0=1$. In the latter case, the source data are given equal weight to those in the target domain. This setup generalizes the partial borrowing power prior of [19], where the source model parameters are a subset of those used in the target domain.
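To make the role of $a_0$ concrete, here is a small sketch under assumptions that are not in the paper: every parameter is shared (a single normal mean), the noise variance is known, and the initial prior is normal. In that toy case, raising the source likelihood to the power $a_0$ simply multiplies the source sample's contribution to the posterior precision by $a_0$. The function name and the data values are hypothetical.

```python
def power_prior_posterior(a0, x_source, x_target, prior_mean, prior_var, noise_var):
    """Posterior (mean, variance) of a normal mean when the source likelihood
    is raised to the power a0 and the target likelihood enters at full weight."""
    prec = (1.0 / prior_var
            + a0 * len(x_source) / noise_var   # discounted source information
            + len(x_target) / noise_var)       # target information
    num = (prior_mean / prior_var
           + a0 * sum(x_source) / noise_var
           + sum(x_target) / noise_var)
    return num / prec, 1.0 / prec


# Made-up data: the source sample is centred near 0, the target near 1.
x_source = [0.1, -0.3, 0.2, 0.0, -0.1, 0.4]
x_target = [1.1, 0.9]

results = {a0: power_prior_posterior(a0, x_source, x_target, 0.0, 10.0, 1.0)
           for a0 in [0.0, 0.25, 0.5, 0.75, 1.0]}
for a0, (mean, var) in results.items():
    print(f"a0={a0:.2f}  posterior mean={mean:.3f}  posterior variance={var:.3f}")
```

With $a_0=0$ the result coincides with a target-only analysis, and with $a_0=1$ it coincides with pooling both samples; intermediate values shrink the posterior mean toward the source sample and reduce the posterior variance gradually.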

Several appealing theoretical properties of the power prior were established in [44] for the case when all the parameters are shared. In that case the posterior for $\theta$ reduces to

$$
\pi_{a_0}(\theta\mid X_T,X_S)\propto p(X_T\mid\theta)\,p(X_S\mid\theta)^{a_0}\,\pi(\theta),
\tag{4}
$$

where θ𝜃\thetaitalic_θ determines the distribution of both source and target data. Ibrahim et al. [44] show that for a fixed a0subscript𝑎0a_{0}italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, (4) minimizes the weighted sum of Kullback–Leibler (KL) divergences between the posterior with no information transfer and one with full information transfer, i.e.

πa0subscript𝜋subscript𝑎0\displaystyle\pi_{a_{0}}italic_π start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT (θXT,XS)=conditional𝜃subscript𝑋𝑇subscript𝑋𝑆absent\displaystyle(\theta\mid X_{T},X_{S})=( italic_θ ∣ italic_X start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ) =
=argming{(1a0)KL(gf0)+a0KL(gf1)},\displaystyle=\operatorname*{arg\ min}_{g}\{(1-a_{0})KL(g\mid\mid f_{0})+a_{0}% KL(g\mid\mid f_{1})\},= start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT { ( 1 - italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) italic_K italic_L ( italic_g ∣ ∣ italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) + italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_K italic_L ( italic_g ∣ ∣ italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) } ,

where f0subscript𝑓0f_{0}italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and f1subscript𝑓1f_{1}italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT are probability densities given by

$$f_0(\theta)=\pi_0(\theta\mid X_S,X_T)\propto p(X_T\mid\theta)\,\pi(\theta)$$

and

$$f_1(\theta)=\pi_1(\theta\mid X_S,X_T)\propto p(X_T\mid\theta)\,p(X_S\mid\theta)\,\pi(\theta).$$
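To see how $a_0$ interpolates between $f_0$ and $f_1$, the following minimal sketch evaluates the power prior posterior (4) in closed form. The model is an assumption made purely for illustration and does not come from the survey: observations are $\mathcal{N}(\theta,\sigma^2)$ with known $\sigma^2$, and the initial prior is $\theta\sim\mathcal{N}(m_0,s_0^2)$. The function name and the toy data are hypothetical.

```python
def power_prior_posterior(x_target, x_source, a0, sigma2=1.0, m0=0.0, s02=100.0):
    """Posterior mean and variance of theta under the power prior (4).

    Illustrative assumption: x ~ N(theta, sigma2) with known sigma2 and
    initial prior theta ~ N(m0, s02).
    """
    # Raising the source likelihood to the power a0 scales its
    # contribution to the posterior precision and mean by a0.
    precision = 1.0 / s02 + len(x_target) / sigma2 + a0 * len(x_source) / sigma2
    weighted_sum = m0 / s02 + sum(x_target) / sigma2 + a0 * sum(x_source) / sigma2
    return weighted_sum / precision, 1.0 / precision


x_t = [1.2, 0.8, 1.5]                      # small target sample (toy data)
x_s = [0.1, -0.3, 0.4, 0.0, 0.2, -0.1]     # larger source sample (toy data)
for a0 in (0.0, 0.5, 1.0):
    mean, var = power_prior_posterior(x_t, x_s, a0)
    print(f"a0={a0:.1f}: posterior mean={mean:.3f}, variance={var:.3f}")
```

In this toy model, $a_0=0$ reproduces the target-only posterior $f_0$ and $a_0=1$ reproduces the fully pooled posterior $f_1$; intermediate values discount each source observation to a fraction $a_0$ of a target observation.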

As in other transfer learning approaches, choosing the right amount of information to transfer (here governed by the value of $a_0$) is a key challenge. One approach is to treat $a_0$ as fixed and perform sensitivity analysis over a set of values which ideally should include $a_0=0$ and $a_0=1$, as recommended by [43]. In generalized linear models (GLMs) the choice of $a_0$ can be better informed with the help of model selection criteria such as those proposed in [44], [43], [42], and [83]. Ibrahim et al. [44] propose a penalized likelihood-type criterion (PLC) that chooses $a_0\in(0,1]$ to be the minimizer of

$$-2\log\int p(X_T\mid\theta)\,p(X_S\mid\theta)^{a_0}\,\pi(\theta)\,d\theta+\frac{\log(n_S)}{a_0},$$

where $n_S$ is the sample size of the source dataset.
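The criterion can be evaluated on a grid of $a_0$ values whenever the integral is tractable. The sketch below does this for a Bernoulli likelihood with a $\mathrm{Beta}(\alpha,\beta)$ initial prior, where the integral is a ratio of beta functions. This model, the function names, and the counts are assumptions chosen for illustration; the criterion itself is coded exactly as displayed above, not taken from the implementation in [44].

```python
from math import lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def plc(a0, s_t, n_t, s_s, n_s, alpha=1.0, beta=1.0):
    """-2 log(integral) + log(n_S) / a0 for Bernoulli data (illustrative).

    s_t, s_s are success counts and n_t, n_s sample sizes for the target
    and source data; Beta(alpha, beta) is the initial prior on theta.
    """
    # Integral of theta^(s_t + a0*s_s) (1-theta)^(f_t + a0*f_s) * Beta pdf
    log_integral = (
        log_beta(alpha + s_t + a0 * s_s, beta + (n_t - s_t) + a0 * (n_s - s_s))
        - log_beta(alpha, beta)
    )
    return -2.0 * log_integral + log(n_s) / a0


grid = [i / 100 for i in range(1, 101)]   # candidate values of a0 in (0, 1]
best_a0 = min(grid, key=lambda a0: plc(a0, s_t=7, n_t=10, s_s=65, n_s=100))
print("grid minimizer of the criterion:", best_a0)
```

The penalty $\log(n_S)/a_0$ grows without bound as $a_0\to 0$, which is why the grid starts strictly above zero.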

Alternatively, we can treat $a_0$ as random and in turn assign it a prior distribution. We can either directly define the joint prior for $(\theta,a_0)$ as in [43], i.e.

$$\pi(\theta,a_0\mid X_S)\propto p(X_S\mid\theta)^{a_0}\,\pi(\theta)\,\pi(a_0),\tag{5}$$

or the normalized power prior as in [27]

$$\begin{aligned}
\pi(\theta,a_0\mid X_S)&=\pi(\theta\mid X_S,a_0)\,\pi(a_0)\\
&=\frac{p(X_S\mid\theta)^{a_0}\,\pi(\theta)}{\int p(X_S\mid\theta')^{a_0}\,\pi(\theta')\,d\theta'}\,\pi(a_0).
\end{aligned}\tag{6}$$

The normalized power prior first specifies a marginal prior for $a_0$ and then a conditional prior for $\theta$ given $a_0$.

Taking $\pi(a_0)$ to be a beta or Dirichlet distribution, depending on the number of source domains, is a natural choice, with theoretical support established in [44] for fixed $a_0$ that extends to the random $a_0$ case under (5). Other priors with an appropriate support, such as a gamma or Gaussian distribution truncated to $[0,1]$, can also be utilized [18]. However, it is not clear how the data inform an appropriate value for $a_0$, since $a_0$ is not a traditional parameter. It may be that this approach can be used to represent prior uncertainty in $a_0$ but will not adapt to information in the data to concentrate on the optimal amount of borrowing from the source data.
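One way to explore this question numerically is to compute the marginal posterior of $a_0$ implied by the normalized power prior (6) in a model where every integral is available in closed form. The sketch below assumes Bernoulli data, a $\mathrm{Beta}(\alpha,\beta)$ initial prior on $\theta$, and a uniform prior on $a_0$; these choices, the function names, and the counts are illustrative assumptions. Integrating $\theta$ out of the product of the target likelihood and (6) gives a weight for each $a_0$ equal to a ratio of two beta functions.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def a0_posterior(s_t, n_t, s_s, n_s, alpha=1.0, beta=1.0, grid_size=101):
    """Grid approximation to the marginal posterior of a0 under (6).

    Assumes Bernoulli data, a Beta(alpha, beta) initial prior on theta and
    a Uniform(0, 1) prior on a0 (all illustrative choices).
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    log_w = []
    for a0 in grid:
        # Given a0, the normalized power prior on theta is Beta(a_src, b_src).
        a_src = alpha + a0 * s_s
        b_src = beta + a0 * (n_s - s_s)
        # Marginal likelihood of the target data under that Beta prior.
        log_w.append(log_beta(a_src + s_t, b_src + n_t - s_t) - log_beta(a_src, b_src))
    top = max(log_w)
    w = [exp(v - top) for v in log_w]      # subtract the max for stability
    total = sum(w)
    return grid, [v / total for v in w]


def posterior_mean(grid, probs):
    return sum(a * p for a, p in zip(grid, probs))


grid, probs_similar = a0_posterior(s_t=7, n_t=10, s_s=70, n_s=100)
_, probs_conflict = a0_posterior(s_t=7, n_t=10, s_s=20, n_s=100)
print("E[a0 | data], similar source:    ", round(posterior_mean(grid, probs_similar), 3))
print("E[a0 | data], conflicting source:", round(posterior_mean(grid, probs_conflict), 3))
```

This toy calculation only shows the direction in which the posterior of $a_0$ moves for one model and one pair of datasets; it says nothing about whether that posterior concentrates on an optimal amount of borrowing in general.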

3.2 Hierarchical models and random effects

The approaches mentioned in the previous section rely on sharing parameters in the likelihood specification for source and target data. Alternatively, we can allow the parameters of the source and target data models to differ, instead imposing the assumption that they come from a jointly specified or identical prior distribution acting as a bridge for information flow between the domains.

As a simple example, consider the Gaussian linear model. Let the datasets $(\boldsymbol{X}_1,\boldsymbol{y}_1),\ldots,(\boldsymbol{X}_K,\boldsymbol{y}_K)$ denote the source data and $(\boldsymbol{X}_0,\boldsymbol{y}_0)$ be the target data, where $\boldsymbol{X}_d\in\mathbb{R}^{n_d\times p}$ and $\boldsymbol{y}_d\in\mathbb{R}^{n_d}$ for $d\in\{0,1,\ldots,K\}$. Under this model we assume that

$$\boldsymbol{y}_d=\boldsymbol{X}_d\boldsymbol{\beta}_d+\boldsymbol{\epsilon}_d,\quad\boldsymbol{\epsilon}_d\sim\mathcal{N}(0,\sigma_d^2 I_{n_d}),\tag{7}$$

with the prior on the coefficients given by $\boldsymbol{\beta}_d\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ for $d\in\{0,1,\ldots,K\}$. Here, the domain-specific parameters $\boldsymbol{\beta}_d$ are drawn from a common prior distribution $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$, which is often referred to as a random effects distribution. Model (7) is a common type of hierarchical regression model for data nested within groups (domains in our terminology). Data from all the domains are used to inform the random effects mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, inducing borrowing of information.

We can either treat $\sigma_0,\sigma_1,\ldots,\sigma_K$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ as fixed, taking a frequentist approach to inference, or specify hyperpriors for them to obtain a Bayesian hierarchical model. In either case, the random effects covariance $\boldsymbol{\Sigma}$ controls how much information transfer there is, analogously to $a_0$ in the power prior approach. A large covariance implies less shrinkage of the $\boldsymbol{\beta}_d$ values towards the random effects mean $\boldsymbol{\mu}$. In practice, the prior for the random effects mean and covariance will be updated based on information in the data about the variability in the regression coefficients across domains.
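The role of $\boldsymbol{\Sigma}$ can be illustrated with the standard conjugate update for a single domain. With $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ and $\sigma_d^2$ held fixed, the conditional posterior mean of $\boldsymbol{\beta}_d$ in model (7) is $(\boldsymbol{X}_d^T\boldsymbol{X}_d/\sigma_d^2+\boldsymbol{\Sigma}^{-1})^{-1}(\boldsymbol{X}_d^T\boldsymbol{y}_d/\sigma_d^2+\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu})$. The sketch below evaluates it for several scales of $\boldsymbol{\Sigma}$; the simulated data, the isotropic form of $\boldsymbol{\Sigma}$, and the function name are assumptions for illustration.

```python
import numpy as np


def conditional_posterior_mean(X, y, sigma2, mu, Sigma):
    """Posterior mean of beta_d given (mu, Sigma, sigma_d^2) in model (7)."""
    Sigma_inv = np.linalg.inv(Sigma)
    precision = X.T @ X / sigma2 + Sigma_inv
    rhs = X.T @ y / sigma2 + Sigma_inv @ mu
    return np.linalg.solve(precision, rhs)


rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])     # toy coefficients
y = X @ beta_true + rng.normal(size=n)
mu = np.zeros(p)                           # random effects mean (fixed here)
ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("least squares:", np.round(ols, 3))
for scale in (1e-6, 1.0, 1e6):
    post = conditional_posterior_mean(X, y, 1.0, mu, scale * np.eye(p))
    print(f"Sigma = {scale:g} * I ->", np.round(post, 3))
```

A very small $\boldsymbol{\Sigma}$ pulls the estimate to $\boldsymbol{\mu}$ (strong borrowing), while a very large $\boldsymbol{\Sigma}$ returns essentially the domain's own least squares fit (no borrowing).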

For fixed $\sigma_0,\sigma_1,\ldots,\sigma_K$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$, [17] showed a direct analytic relationship between $\boldsymbol{\Sigma}$ and the tuning parameter $a_0$ in the power prior approach, establishing duality between these methods for the Gaussian linear model.

3.3 Shared latent space

Rather than imposing shared parameters on data generating processes for source and target domains, whether explicitly in the likelihoods or at higher levels in a hierarchical model, we can also specify or seek to learn a shared latent space. This approach can be particularly useful in more complex datasets with large numbers of dimensions.

3.3.1 Factor analysis

In the Bayesian context many such examples can be found in the factor analysis literature. Under the classical factor model specification outlined in [60] the $i$-th observation $\boldsymbol{y}_i\in\mathbb{R}^p$ is given by

$$\boldsymbol{y}_i=\boldsymbol{\Lambda}\boldsymbol{\eta}_i+\boldsymbol{\epsilon}_i,$$

where $\boldsymbol{\eta}_i\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_q)$ are the vectors of latent factors, $\boldsymbol{\Lambda}\in\mathbb{R}^{p\times q}$ is the factor loading matrix, $\boldsymbol{\epsilon}_i\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Delta})$ are random noise terms with $\boldsymbol{\Delta}=\textnormal{diag}(\delta_1^2,\dots,\delta_p^2)$, and $\boldsymbol{\eta}_i$, $\boldsymbol{\epsilon}_j$ are independent for any $i,j$. It is commonly assumed that $q\ll p$, i.e. the high dimensional data can be explained using a latent structure of much lower dimensional factors. This model can be equivalently written as a Gaussian distribution with a constrained covariance structure, i.e.

$$\boldsymbol{y}_i\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}),\quad\boldsymbol{\Sigma}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^T+\boldsymbol{\Delta}.$$

The mean-zero assumption on $\boldsymbol{y}_i$ comes from the standard practice of centering the data and does not limit the generality of the model.
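The equivalence between the latent factor form and the constrained covariance form can be checked by simulation. The following sketch draws data from $\boldsymbol{y}_i=\boldsymbol{\Lambda}\boldsymbol{\eta}_i+\boldsymbol{\epsilon}_i$ and compares the sample covariance with $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^T+\boldsymbol{\Delta}$. The dimensions, the loading values, and the noise variances are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 6, 2, 100_000                      # q << p latent factors (toy sizes)
Lambda = 0.5 * rng.normal(size=(p, q))       # factor loading matrix
delta2 = rng.uniform(0.5, 1.5, size=p)       # diagonal of Delta

eta = rng.normal(size=(n, q))                # latent factors, N(0, I_q)
eps = rng.normal(size=(n, p)) * np.sqrt(delta2)  # noise, N(0, Delta)
Y = eta @ Lambda.T + eps                     # rows are the observations y_i

Sigma_model = Lambda @ Lambda.T + np.diag(delta2)
Sigma_sample = np.cov(Y, rowvar=False)
print("max abs deviation:", np.max(np.abs(Sigma_sample - Sigma_model)))
```

The deviation shrinks as $n$ grows, reflecting that the $p(p+1)/2$ free entries of $\boldsymbol{\Sigma}$ are described by only $pq+p$ parameters in $\boldsymbol{\Lambda}$ and $\boldsymbol{\Delta}$.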

In [23] and [90] this setup is generalized to the situation with data coming from multiple domains by letting

$$\boldsymbol{y}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_k),\quad\boldsymbol{\Sigma}_k=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^T+\boldsymbol{\Phi}_k\boldsymbol{\Phi}_k^T+\boldsymbol{\Delta}_k,$$

where $\boldsymbol{\Delta}_k=\textnormal{diag}(\delta_{k,1}^2,\dots,\delta_{k,p}^2)$ is the error variance matrix, $\boldsymbol{\Lambda}\in\mathbb{R}^{p\times q}$, and $\boldsymbol{\Phi}_k\in\mathbb{R}^{p\times q_k}$ for domain $k=1,\dots,K$. Here $\boldsymbol{\Phi}_k\boldsymbol{\Phi}_k^T$ accounts for the domain-specific dependencies between the datapoints and $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^T$ is the underlying shared covariance structure which allows for information transfer between domains.

Analogously to the single domain case above, this model has the equivalent representation

$$\boldsymbol{y}_{k,i}=\boldsymbol{\Lambda}\boldsymbol{\eta}_{k,i}+\boldsymbol{\Phi}_k\boldsymbol{\zeta}_{k,i}+\boldsymbol{\epsilon}_{k,i},\tag{8}$$

with

$$\boldsymbol{\eta}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_q),\quad\boldsymbol{\zeta}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{q_k}),\quad\boldsymbol{\epsilon}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Delta}_k),$$

where $\boldsymbol{\eta}_{k,i}$ is a latent factor in the $q$ dimensional shared subspace, $\boldsymbol{\zeta}_{k,i}$ are $q_k$ dimensional domain-specific latent factors and $\boldsymbol{\epsilon}_{k,i}$ are error terms. Thus, $\boldsymbol{\Lambda}$ is the shared factor loading matrix and $\boldsymbol{\Phi}_k$ are $p\times q_k$ domain-specific factor loading matrices. In this model the transfer of knowledge between domains occurs through information borrowing in the estimation of $\boldsymbol{\Lambda}$.

The above model allows for a lot of flexibility between the domains, but suffers from an identifiability issue known as information switching: the data can be fitted equally well with the shared columns in the factor loading matrices transferred from $\boldsymbol{\Lambda}$ to the $\boldsymbol{\Phi}_k$'s. De Vito et al. [23], [90] solve this problem by restricting the augmented matrix $[\boldsymbol{\Lambda}\ \ \boldsymbol{\Phi}_1\ \ \dots\ \ \boldsymbol{\Phi}_K]$ to be lower-triangular. One limitation of this approach is that it imposes an ordering on the domains. When we have multiple source domains there is often no natural ordering between them, and hence this approach would not be preferred in such a scenario.
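Information switching can be made concrete with a small numerical check. In the sketch below, a loading column that is common to all domains is placed either in every domain-specific matrix $\boldsymbol{\Phi}_k$ or in the shared matrix $\boldsymbol{\Lambda}$; both placements give the same covariances $\boldsymbol{\Sigma}_k$, so the likelihood cannot tell them apart. All dimensions, values, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, K = 5, 2, 3
Lambda = rng.normal(size=(p, q))                     # shared loadings
common_col = rng.normal(size=(p, 1))                 # column present in every domain
Phi_private = [rng.normal(size=(p, 1)) for _ in range(K)]
Delta = [np.diag(rng.uniform(0.5, 1.5, size=p)) for _ in range(K)]


def domain_covariances(shared, specific):
    """Sigma_k = Lambda Lambda^T + Phi_k Phi_k^T + Delta_k for each domain."""
    return [shared @ shared.T + Phi @ Phi.T + D for Phi, D in zip(specific, Delta)]


# Placement A: the common column is stored in each domain-specific matrix.
cov_a = domain_covariances(Lambda, [np.hstack([common_col, P]) for P in Phi_private])
# Placement B: the same column is moved into the shared loading matrix.
cov_b = domain_covariances(np.hstack([Lambda, common_col]), Phi_private)

same = all(np.allclose(a, b) for a, b in zip(cov_a, cov_b))
print("identical covariances under both placements:", same)
```

Because the two placements are indistinguishable from the data, some structural restriction or prior is needed before the split between shared and domain-specific factors can be interpreted.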

A recent paper proposes a different solution to the problem of information switching by restricting the factor loading matrices to be linear transforms of the shared factor loading matrix and imposing a shared covariance of error terms between the domains [15]. That is, they assume $\boldsymbol{\Phi}_k=\boldsymbol{\Lambda}\boldsymbol{A}_k$, where $\boldsymbol{A}_k\in\mathbb{R}^{q\times q_k}$, and $\boldsymbol{\Delta}_k=\boldsymbol{\Delta}=\textnormal{diag}(\delta_1^2,\dots,\delta_p^2)$ for every $k=1,\dots,K$ in (8). The authors show that under any non-degenerate continuous prior on $\boldsymbol{A}_k$ the information switching does not occur almost surely provided that $\sum_{k=1}^{K}q_k\leq q$.
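To see what this restriction implies, it helps to substitute $\boldsymbol{\Phi}_k=\boldsymbol{\Lambda}\boldsymbol{A}_k$ and $\boldsymbol{\Delta}_k=\boldsymbol{\Delta}$ into (8) and into the expression for $\boldsymbol{\Sigma}_k$. The short derivation below uses only the definitions given in the text and is a worked step, not a result quoted from [15].

```latex
\begin{align*}
\boldsymbol{y}_{k,i}
  &= \boldsymbol{\Lambda}\boldsymbol{\eta}_{k,i}
     + \boldsymbol{\Lambda}\boldsymbol{A}_k\boldsymbol{\zeta}_{k,i}
     + \boldsymbol{\epsilon}_{k,i}
   = \boldsymbol{\Lambda}\left(\boldsymbol{\eta}_{k,i}
     + \boldsymbol{A}_k\boldsymbol{\zeta}_{k,i}\right)
     + \boldsymbol{\epsilon}_{k,i}, \\
\boldsymbol{\Sigma}_k
  &= \boldsymbol{\Lambda}\boldsymbol{\Lambda}^T
     + \boldsymbol{\Lambda}\boldsymbol{A}_k\boldsymbol{A}_k^T\boldsymbol{\Lambda}^T
     + \boldsymbol{\Delta}
   = \boldsymbol{\Lambda}\left(\boldsymbol{I}_q
     + \boldsymbol{A}_k\boldsymbol{A}_k^T\right)\boldsymbol{\Lambda}^T
     + \boldsymbol{\Delta}.
\end{align*}
```

Under this restriction the signal in every domain lies in the column space of $\boldsymbol{\Lambda}$, and the domains differ only through the $q\times q$ matrices $\boldsymbol{I}_q+\boldsymbol{A}_k\boldsymbol{A}_k^T$.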

This result provides some guidance for choosing the dimensions of shared and domain-specific latent spaces since these are not known in most practical applications. These dimensions influence the amount of information being transferred between domains since larger values of $q_1,\dots,q_K$ give more flexibility to the domain-specific latent features, thus reducing the influence of shared latent factors. One approach to choosing these dimensions would be to put priors on $q_1,\dots,q_K$ and $q$ and use reversible-jump algorithms outlined in [35]. However, such algorithms can be computationally prohibitive.

Chandra et al. [15] provide an alternative solution by fixing $q,q_1,\dots,q_K$ at an upper bound, and then utilizing appropriate priors to shrink the excess columns in $\boldsymbol{\Lambda},\boldsymbol{A}_1,\dots,\boldsymbol{A}_K$. Specifically, they obtain approximate singular values and eigenvectors of the pooled dataset via the augmented implicitly restarted Lanczos bidiagonalization [3] and choose the smallest $\hat{q}$ which explains 95% of the variability in the data. Then they fix $\hat{q}_k=\hat{q}/K$ for $k=1,\ldots,K$ to ensure that information switching will not occur. The strength of information transfer between domains is thus directly influenced by the amount of shrinkage induced by the priors on the factor loading matrices. In [15], the fixed priors $\textnormal{vec}(\boldsymbol{\Lambda})\sim\textnormal{DL}(1/2)$ and $a_{k,i,j}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,1)$ are used, where $a_{k,i,j}$ is the $(i,j)$-th entry of $\boldsymbol{A}_k$ and DL denotes the Dirichlet-Laplace distribution [6]. Thus a possible extension in this line of work would be to propose some methods for choosing these hyperparameters in an adaptive way depending on how related the domains are.
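The dimension selection step can be sketched as follows. This is not the procedure of [15]: it swaps the Lanczos bidiagonalization for a full SVD from NumPy, measures variability by squared singular values of the centered pooled data, and uses integer division for $\hat{q}/K$. All three are assumptions made for illustration, as are the function name and the simulated data.

```python
import numpy as np


def choose_num_factors(Y, threshold=0.95):
    """Smallest q whose leading singular values explain `threshold` of the variability.

    Variability is measured by squared singular values of the centered data
    (an assumption; the text does not spell out the exact measure).
    """
    Yc = Y - Y.mean(axis=0)
    s = np.linalg.svd(Yc, compute_uv=False)          # full SVD, not Lanczos
    explained = np.cumsum(s**2) / np.sum(s**2)
    # First index at which the cumulative proportion reaches the threshold.
    return int(np.searchsorted(explained, threshold) + 1)


rng = np.random.default_rng(3)
K, n_k, p, q_true = 3, 200, 20, 6
loadings = np.eye(p)[:, :q_true]                     # toy loadings of equal strength
domains = [
    rng.normal(size=(n_k, q_true)) @ loadings.T + 1e-3 * rng.normal(size=(n_k, p))
    for _ in range(K)
]
pooled = np.vstack(domains)

q_hat = choose_num_factors(pooled)
q_k_hat = q_hat // K                                 # integer version of q_hat / K
print("q_hat =", q_hat, " q_k_hat =", q_k_hat)
```

With the integer division used here, $\sum_{k}\hat{q}_k\leq\hat{q}$ holds by construction, matching the condition stated earlier for avoiding information switching; note that $\hat{q}_k$ becomes zero whenever $\hat{q}<K$.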

3.3.2 Mixture models

Latent space models are commonly used as a dimensionality reduction tool, including when dealing with non-standard data structures such as networks [40], [9]. As a vignette showing how mixture models can be used in combination with latent space models for flexible transfer learning, we focus on the approach proposed in [28]. They were motivated by data on brain networks for individuals in different groups.

Specifically, given $n$ observed networks each belonging to one of $K$ groups and consisting of $V$ labelled vertices, denote the $i$-th network together with its group label as $\{y_i,\boldsymbol{\mathcal{L}}(\boldsymbol{A}_i)\}$, where $\boldsymbol{A}_i\in\{0,1\}^{V\times V}$ is the adjacency matrix and $\boldsymbol{\mathcal{L}}(\boldsymbol{A}_i)\in\{0,1\}^{V(V-1)/2}$ denotes the lower triangular entries

$$(A_{i[2,1]},\dots,A_{i[V,1]},A_{i[3,2]},\dots,A_{i[V,2]},\dots,A_{i[V,V-1]})^T$$

of $\boldsymbol{A}_i$. We discard the main diagonal and the upper triangular part of $\boldsymbol{A}_i$ since the network is an undirected graph and self-relationships of the nodes are not of interest. In [28] subjects fall into low and high creativity groups, so we have $K=2$ domains. The network representation $\boldsymbol{\mathcal{L}}(\boldsymbol{A})$ conditional on the group membership $y$ is modeled as
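The vectorization $\boldsymbol{\mathcal{L}}(\cdot)$ stacks the strictly lower triangular entries column by column, in the order given above. A minimal NumPy sketch follows; the function name and the example adjacency matrix are hypothetical.

```python
import numpy as np


def lower_tri_vector(A):
    """Stack the strictly lower-triangular entries of A column by column."""
    V = A.shape[0]
    # Column j contributes rows j+1, ..., V-1 (0-based), matching the
    # ordering (A[2,1], ..., A[V,1], A[3,2], ..., A[V,V-1]) in 1-based notation.
    return np.concatenate([A[j + 1:, j] for j in range(V - 1)])


# Symmetric adjacency matrix of an undirected graph on V = 4 vertices.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]])
print(lower_tri_vector(A))   # [1 0 1 1 0 0]
```

The result has length $V(V-1)/2$, here 6, and loses no information because the matrix is symmetric with an uninformative diagonal.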

$$\mathbb{P}(\boldsymbol{\mathcal{L}}(\boldsymbol{A}_i)=\boldsymbol{a}\mid y=k)=\sum_{h=1}^{H}\nu_h^{(k)}\prod_{l=1}^{V(V-1)/2}\left(\pi_l^{(h)}\right)^{a_l}\left(1-\pi_l^{(h)}\right)^{1-a_l}$$

for any $\boldsymbol{a}\in\{0,1\}^{V(V-1)/2}$ with the probability vector

$$\boldsymbol{\pi}^{(h)}=\left(\pi_1^{(h)},\dots,\pi_{V(V-1)/2}^{(h)}\right)^T\in(0,1)^{V(V-1)/2}$$

in the hhitalic_h-th mixture component given by

πl(h)=[1+exp(ZlDl(h))]1,subscriptsuperscript𝜋𝑙superscriptdelimited-[]1subscript𝑍𝑙subscriptsuperscript𝐷𝑙1{\pi}^{(h)}_{l}=\left[1+\exp(-{Z}_{l}-{D}^{(h)}_{l})\right]^{-1},italic_π start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = [ 1 + roman_exp ( - italic_Z start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT - italic_D start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) ] start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ,

with

𝑫(h)=𝓛(𝑿(h)𝚲(h)𝑿(h)T),h=1,,H,formulae-sequencesuperscript𝑫𝓛superscript𝑿superscript𝚲superscript𝑿𝑇1𝐻\boldsymbol{D}^{(h)}=\boldsymbol{\mathcal{L}}(\boldsymbol{X}^{(h)}\boldsymbol{% \Lambda}^{(h)}\boldsymbol{X}^{(h)T}),\quad h=1,\dots,H,bold_italic_D start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT = bold_caligraphic_L ( bold_italic_X start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT bold_Λ start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT bold_italic_X start_POSTSUPERSCRIPT ( italic_h ) italic_T end_POSTSUPERSCRIPT ) , italic_h = 1 , … , italic_H ,

where 𝑿(h)V×Rsuperscript𝑿superscript𝑉𝑅\boldsymbol{X}^{(h)}\in\mathbb{R}^{V\times R}bold_italic_X start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_V × italic_R end_POSTSUPERSCRIPT, 𝚲(h)=diag(λ1(h),,λR(h))superscript𝚲diagsuperscriptsubscript𝜆1superscriptsubscript𝜆𝑅\boldsymbol{\Lambda}^{(h)}=\textnormal{diag}(\lambda_{1}^{(h)},\dots,\lambda_{% R}^{(h)})bold_Λ start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT = diag ( italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT , … , italic_λ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT ) with λ1(h),,λR(h)0superscriptsubscript𝜆1superscriptsubscript𝜆𝑅0\lambda_{1}^{(h)},\dots,\lambda_{R}^{(h)}\geq 0italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT , … , italic_λ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_h ) end_POSTSUPERSCRIPT ≥ 0, and 𝒁V(V1)/2𝒁superscript𝑉𝑉12\boldsymbol{Z}\in\mathbb{R}^{V(V-1)/2}bold_italic_Z ∈ blackboard_R start_POSTSUPERSCRIPT italic_V ( italic_V - 1 ) / 2 end_POSTSUPERSCRIPT.

The model supposes that there are $H$ different brain structure “types”. The probability of an edge between the $l$-th pair of brain regions follows a logistic model having an intercept $Z_{l}$ characterizing the baseline log odds of a connection and a low-rank deviation that differs according to the individual’s brain type. To enable information transfer across the creativity groups (domains), the model assumes the brain structure types do not differ across the groups (referred to as “common atoms” in the mixture modeling literature). However, the proportion of individuals having brain type $h$, $\nu_{h}^{(y)}$, does differ across domains $y=1,\ldots,K$.
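The mixture of low-rank logistic components can be sketched numerically as follows. This is a minimal illustration rather than the implementation of [28]: the function names are hypothetical, and we assume that $\boldsymbol{\mathcal{L}}(\cdot)$ stacks the strictly lower-triangular entries of its argument into a vector, which matches the dimension $V(V-1)/2$ but is not spelled out in this passage.

```python
import itertools
import numpy as np

def edge_probabilities(Z, X, lam):
    """Edge probabilities pi^(h) for one mixture component.

    Z   : (V*(V-1)/2,) baseline log-odds shared by all components
    X   : (V, R) latent coordinates X^(h)
    lam : (R,) non-negative weights, the diagonal of Lambda^(h)
    """
    V = X.shape[0]
    S = X @ np.diag(lam) @ X.T               # V x V low-rank matrix
    rows, cols = np.tril_indices(V, k=-1)    # ASSUMED ordering used by L(.)
    D = S[rows, cols]                        # strict lower triangle as a vector
    return 1.0 / (1.0 + np.exp(-Z - D))

def network_pmf(a, nu_k, Z, Xs, lams):
    """Probability of the binary edge vector `a` in group k."""
    total = 0.0
    for nu_h, X, lam in zip(nu_k, Xs, lams):
        pi = edge_probabilities(Z, X, lam)
        total += nu_h * np.prod(pi**a * (1.0 - pi)**(1 - a))
    return total

# Toy example with hypothetical sizes: V = 3 regions, R = 2, H = 2 types.
rng = np.random.default_rng(0)
V, R, H = 3, 2, 2
n_pairs = V * (V - 1) // 2
Z = rng.normal(size=n_pairs)
Xs = [rng.normal(size=(V, R)) for _ in range(H)]
lams = [rng.uniform(size=R) for _ in range(H)]
nu_k = np.array([0.3, 0.7])                  # group-specific type proportions

# Sanity check: the pmf sums to one over all 2^{V(V-1)/2} networks.
total = sum(network_pmf(np.array(a), nu_k, Z, Xs, lams)
            for a in itertools.product([0, 1], repeat=n_pairs))
```

Only `nu_k` changes from group to group in this sketch; `Z`, `Xs` and `lams` are shared, which is where the borrowing of information across domains takes place.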

Although the goal in [28] was inference on group differences, this model can be used directly for transfer learning from source domains to a target domain, with the source data enabling more accurate estimation of the shared network types. In addition, the baseline log-odds of an edge between each pair of nodes is also shared across the groups, leading to information sharing about common topological properties of the graphs, including block structures, homophily behaviors and transitive edge patterns [40]. An important application would be transferring information from large brain imaging repositories, such as the Human Connectome Project (HCP) and UK Biobank, to small neuroimaging studies in targeted populations.

In [58] shared kernels are used to model complex distributions of multiple variables across different domains. The motivating applications are studies investigating how DNA methylation profiles vary according to cancer subtype. For samples $i=1,\ldots,n$, data consist of $\boldsymbol{x}_{i}=(x_{i1},\ldots,x_{ip})^{T}$ with $x_{ij}$ denoting the methylation level at site $j$, for $j=1,\ldots,p$ with $p$ very large (e.g., $p=450{,}000$) and $y_{i}\in\{1,\ldots,K\}$ denoting the group membership.
The density of the data in group $k$ for the $j$th variable is $f_{j}^{(k)}(\cdot)=\sum_{h=1}^{H}\nu_{jh}^{(k)}\mathcal{K}(\cdot;\boldsymbol{\theta}_{h})$, with $\mathcal{K}(\boldsymbol{\theta})$ a family of densities parameterized by $\boldsymbol{\theta}$ and $\nu_{jh}^{(k)}$ a probability weight on kernel $h$ specific to site $j$ and group $k$.

Although the motivation in [58] is testing for differences in methylation between different groups, the proposed approach can be directly applied to transfer learning focused on inferring the marginal densities of very high-dimensional data within a particular domain. The data from all the domains are used in inferring the shared kernel parameters $\boldsymbol{\theta}_{1},\ldots,\boldsymbol{\theta}_{H}$ and further borrowing of information occurs through a hierarchical model for the weights $\{\nu_{jh}^{(k)}\}$. Even in using shared kernels, this approach allows highly flexible differences in distribution across the groups.
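To make the shared kernel construction concrete, the sketch below evaluates $f_{j}^{(k)}$ at one site for two groups that use the same kernels but different weights. The Gaussian kernel, the parameter values and the function names are assumptions made purely for illustration; [58] may use a different kernel family.

```python
import math

def gaussian_kernel(x, theta):
    """Gaussian density with theta = (mean, sd); an ASSUMED kernel choice."""
    mean, sd = theta
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def shared_kernel_density(x, weights, thetas, kernel=gaussian_kernel):
    """f_j^(k)(x) = sum_h nu_jh^(k) * K(x; theta_h)."""
    return sum(w * kernel(x, th) for w, th in zip(weights, thetas))

# Kernel parameters are shared by every site and group (hypothetical values).
thetas = [(0.2, 0.05), (0.5, 0.10), (0.8, 0.05)]

# Weights are specific to a site j and a group k (hypothetical values).
nu_group1 = [0.7, 0.2, 0.1]
nu_group2 = [0.1, 0.2, 0.7]

f_group1 = shared_kernel_density(0.2, nu_group1, thetas)
f_group2 = shared_kernel_density(0.2, nu_group2, thetas)
```

Because only the weights differ between the two calls, any data that sharpen the estimates of `thetas` benefit both groups, while the groups remain free to have quite different densities.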

There is a rich literature on alternative Bayesian mixture models for borrowing of information across grouped data, while also allowing distinct characteristics of each group. Suppose we let $\boldsymbol{x}_{i}$ denote feature data for subject $i$ with $y_{i}\in\{1,\ldots,K\}$ denoting the subject’s group membership. Then, a common approach is to incorporate subject-specific parameters $\boldsymbol{\theta}_{i}$ within the likelihood function for $\boldsymbol{x}_{i}$ and then let $\boldsymbol{\theta}_{i}\sim P_{y_{i}}$, with the collection of group-specific random effects distributions $(P_{1},\ldots,P_{K})\sim\Pi$ given an appropriate prior. Popular choices of $\Pi$ include the hierarchical Dirichlet process (HDP) [86] and nested Dirichlet process (NDP) [1], both of which fall within the broad class of hierarchical processes [10]. These approaches characterize each $P_{k}$ as almost surely discrete while incorporating statistical dependence between $P_{k}$ and $P_{l}$ for all $k\neq l$, leading to dependence in clustering.

Alternatively, analogously to the multi-group factor models of [23], [90], and [15], Müller et al. [68] modeled the group-specific random effects distributions $P_{k}$ as a mixture of a common cross-group distribution $P_{0}$ and group-specific distributions $Q_{k}$, with a hyperprior chosen for the mixture weight on $P_{0}$ to allow data adaptivity. This approach, and the above approaches, assume a priori exchangeability across the groups. In the future, it will be interesting to adapt these approaches and develop appropriate extensions explicitly targeting the transfer learning case in which one domain is the particular focus. A relevant recent advance is the graphical Dirichlet process of Chakrabarti et al. [13], which incorporates a known directed acyclic graph (DAG) characterizing the dependence structure across the groups.
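In symbols, the mixture construction attributed to [68] above can be written as follows. The weight symbol $\epsilon$ is notation introduced here for illustration, and we assume a single weight common to all groups; the original paper may use different notation.

$$P_{k}=\epsilon P_{0}+(1-\epsilon)Q_{k},\qquad k=1,\ldots,K,\qquad\epsilon\in[0,1].$$

Setting $\epsilon=1$ gives $P_{1}=\dots=P_{K}=P_{0}$, so that all groups share one random effects distribution, whereas $\epsilon=0$ gives $P_{k}=Q_{k}$ for each group. The hyperprior on $\epsilon$ lets the data determine where between these two extremes the model sits.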

3.4 Network transfer

As an alternative to viewing each source domain as providing exchangeable information about the target domain a priori, there is often expert knowledge about directed relationships between the different domains. Incorporating a network of relationships among the domains in transfer learning is referred to as network transfer, as opposed to direct transfer.

[Figure 1 diagram: the target learner $\mathcal{L}_{T}$ and the source learners $\mathcal{L}_{S_{1}},\dots,\mathcal{L}_{S_{4}}$ appear as nodes in both panels.]
Figure 1: Direct transfer (left) and network transfer learning (right). In direct transfer all the source learners are used directly in supporting the training of the target learner, whereas in network transfer we can have a more complex structure with some of the source learners supporting other source learners rather than the target learner directly.

In Bayesian network meta-analysis [61] the goal is often to compare the efficacy of a pair of treatments based on multiple studies, some of which may involve arms with other treatments. Let $W,X,Y,Z$ be four available treatments among which we want to compare the efficacy of $X$ and $Y$. Suppose that we have the dataset $\mathcal{D}_{XY}$ formed based on studies comparing $X$ and $Y$ and that we also have access to datasets $\mathcal{D}_{XW}$, $\mathcal{D}_{YW}$, $\mathcal{D}_{YZ}$, and $\mathcal{D}_{WZ}$ comparing, respectively, $X$ to $W$, $Y$ to $W$, $Y$ to $Z$, and $W$ to $Z$. We may have several different trials for certain of these comparisons. The knowledge extracted from the trials comparing other treatments can be used to indirectly improve the analysis of the $X$ vs $Y$ trials. Figure 2 shows a graph representing the observed comparisons between the treatments, sometimes referred to as the evidence network [62], and the associated network transfer graph.

[Figure 2 diagram: treatment nodes $X$, $Y$, $Z$, $W$ (left panel) and learner nodes $\mathcal{L}_{XW}$, $\mathcal{L}_{XY}$, $\mathcal{L}_{YZ}$, $\mathcal{L}_{YW}$, $\mathcal{L}_{ZW}$ (right panel).]
Figure 2: Evidence network for the treatment comparison (left) and the network transfer of information between the associated learners in the meta analysis (right).

The general framework for Bayesian network meta-analysis is outlined in [62]. Denote the observed mean difference in the efficacy of treatments $k$ and $l$ in study $i$ by $\delta_{i,k,l}$ and the baseline difference in efficacy between treatments $k$ and $l$ by $d_{k,l}$. We refer to $d_{k,l}$ as effect parameters. In [62] they are divided into basic parameters $\boldsymbol{d}_{b}$ and functional parameters $\boldsymbol{d}_{f}$. Any set of effect parameters can be treated as basic parameters if the edges associated with them create a spanning tree of the evidence network. The functional parameters are the remaining effect parameters.
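The spanning tree condition on the basic parameters can be checked mechanically. The sketch below does so with a small union-find routine; the function names are hypothetical, and the evidence network is the one from the four-treatment example with comparisons $XY$, $XW$, $YW$, $YZ$ and $WZ$.

```python
def is_spanning_tree(nodes, edges):
    """True if `edges` connect all `nodes` without forming a cycle."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:                        # this edge would close a cycle
            return False
        parent[ru] = rv
    # An acyclic edge set with |nodes| - 1 edges is connected, hence a tree.
    return len(edges) == len(nodes) - 1

def valid_basic_set(nodes, evidence_edges, candidate_edges):
    """Candidate basic parameters must be evidence edges forming a spanning tree."""
    evidence = {frozenset(e) for e in evidence_edges}
    if any(frozenset(e) not in evidence for e in candidate_edges):
        return False
    return is_spanning_tree(nodes, candidate_edges)

treatments = ["W", "X", "Y", "Z"]
evidence = [("X", "Y"), ("X", "W"), ("Y", "W"), ("Y", "Z"), ("W", "Z")]

star_at_W = valid_basic_set(treatments, evidence, [("X", "W"), ("Y", "W"), ("Z", "W")])
triangle = valid_basic_set(treatments, evidence, [("X", "W"), ("Y", "W"), ("X", "Y")])
```

Here `star_at_W` corresponds to taking every comparison against $W$ as basic, while `triangle` fails because its edges form a cycle and leave $Z$ unconnected.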

Network meta analysis assumes functional parameters can be represented as linear functions of basic parameters, i.e. $\boldsymbol{d}_{f}=\boldsymbol{F}\boldsymbol{d}_{b}$ for some matrix $\boldsymbol{F}$, which is referred to as evidence consistency. Usually, these relations take the form $d_{j,k}=d_{j,l}-d_{k,l}$ for any treatments $j,k,l$. In our example, we can choose $d_{X,W},d_{Y,W},d_{Z,W}$ to be the basic parameters and then relate the functional parameters to them via $d_{X,Y}=d_{X,W}-d_{Y,W}$ and $d_{Y,Z}=d_{Y,W}-d_{Z,W}$. Leveraging this assumption is analogous to utilizing the shared parameter strategy outlined in Section 3.1.
We can use these identities to increase the precision of estimation of $d_{Y,W}$ which in turn, together with the estimates of $d_{X,W}$, can increase the precision of the estimation of $d_{X,Y}$. This is represented by the network transfer in Figure 2.
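For the running example, the matrix $\boldsymbol{F}$ can be written down explicitly. The sketch below builds it from the relation $d_{j,k}=d_{j,l}-d_{k,l}$ with $l=W$ as the common reference; the numerical values of $\boldsymbol{d}_{b}$ are hypothetical and serve only to show the mapping.

```python
import numpy as np

reference = "W"
basic = [("X", reference), ("Y", reference), ("Z", reference)]   # d_{X,W}, d_{Y,W}, d_{Z,W}
functional = [("X", "Y"), ("Y", "Z")]                            # d_{X,Y}, d_{Y,Z}

# Row for d_{j,k} has +1 in the column of d_{j,W} and -1 in the column of d_{k,W}.
index = {pair: i for i, pair in enumerate(basic)}
F = np.zeros((len(functional), len(basic)))
for row, (j, k) in enumerate(functional):
    F[row, index[(j, reference)]] = 1.0
    F[row, index[(k, reference)]] = -1.0

d_b = np.array([0.5, 0.2, -0.1])   # hypothetical basic effects
d_f = F @ d_b                      # implied functional effects under consistency
```

With these values, `d_f` contains $d_{X,Y}=0.5-0.2$ and $d_{Y,Z}=0.2-(-0.1)$, so any improvement in the estimate of $d_{Y,W}$ feeds into both functional parameters.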

The linear relationship $\boldsymbol{d}_{f}=\boldsymbol{F}\boldsymbol{d}_{b}$ can be used to model the vector of observed differences in treatments $\boldsymbol{\delta}$ conditionally on $\boldsymbol{d}_{b}$ and the covariance of $\boldsymbol{\delta}_{b}$, denoted by $\text{Cov}(\boldsymbol{\delta}_{b})=\boldsymbol{V}_{b}$. Using the Gaussian distribution $\boldsymbol{\delta}\sim\mathcal{N}\left(\left(\boldsymbol{d}_{b}^{T},\boldsymbol{d}_{b}^{T}\boldsymbol{F}^{T}\right)^{T},\boldsymbol{V}\right)$ is standard [61], [25], [57], where

$$\boldsymbol{V}=\begin{pmatrix}\boldsymbol{V}_{b}&\boldsymbol{V}_{b}\boldsymbol{F}^{T}\\ \boldsymbol{F}\boldsymbol{V}_{b}^{T}&\boldsymbol{F}\boldsymbol{V}_{b}\boldsymbol{F}^{T}\end{pmatrix}.$$

This prior can be incorporated within a hierarchical model for the individual observations in each study. Further borrowing of information can be facilitated through placing a common random effects distribution on the basic treatment effect parameters as in [57].
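One way to read the block form of $\boldsymbol{V}$ is as the covariance of the stacked vector $(\boldsymbol{\delta}_{b}^{T},(\boldsymbol{F}\boldsymbol{\delta}_{b})^{T})^{T}$ when $\text{Cov}(\boldsymbol{\delta}_{b})=\boldsymbol{V}_{b}$. The sketch below verifies this numerically for the four-treatment example; the entries of $\boldsymbol{V}_{b}$ and $\boldsymbol{d}_{b}$ are hypothetical.

```python
import numpy as np

F = np.array([[1.0, -1.0, 0.0],     # d_{X,Y} = d_{X,W} - d_{Y,W}
              [0.0, 1.0, -1.0]])    # d_{Y,Z} = d_{Y,W} - d_{Z,W}
V_b = np.diag([0.04, 0.09, 0.16])   # hypothetical covariance of delta_b
d_b = np.array([0.5, 0.2, -0.1])    # hypothetical basic effects

# A maps delta_b to the stacked vector (delta_b, F delta_b).
A = np.vstack([np.eye(3), F])
V = A @ V_b @ A.T

# The same matrix assembled block by block, as in the displayed formula.
V_blocks = np.block([[V_b,         V_b @ F.T],
                     [F @ V_b.T,   F @ V_b @ F.T]])

# V has rank 3, so we sample delta_b first and then apply A.
rng = np.random.default_rng(1)
delta_b = rng.multivariate_normal(d_b, V_b)
delta = A @ delta_b
```

Sampling through `A` avoids working with the rank-deficient $5\times 5$ matrix directly, and it makes explicit that the functional components of `delta` are determined by the basic ones under evidence consistency.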

Additional flexibility in information transfer can come from allowing violations in evidence consistency. Lu et al. [62] provide such a framework via $\boldsymbol{d}_{f}=\boldsymbol{F}\boldsymbol{d}_{b}+\boldsymbol{w}$, where $\boldsymbol{w}$ represents inconsistencies between studies. In our example

$$d_{X,Y}=d_{X,W}-d_{Y,W}+w_{X,Y,W}$$

and

$$d_{Y,Z}=d_{Y,W}-d_{Z,W}+w_{Y,Z,W}.$$

The inferred size of $\boldsymbol{w}$ directly measures how related the domains are and determines how much information transfer should occur between them. There can be various sources of inconsistencies between the pairwise comparisons. They can stem from limitations in the design of individual studies and from changes in the baseline efficacy of treatments over time, for example due to increasing antibiotic resistance. This problem has recently been addressed in [57] where the basic parameters are assumed to vary over time according to a Gaussian process. Thus, the information transfer between domains is corrected for the times at which the associated datasets were collected.

Often the appropriate transfer network joining the domains is not known and needs to be inferred. One can take a brute-force approach to select the best transfer network under some quality measure by checking every possible graph. However, this approach quickly becomes intractable as the number of domains grows, with millions of possible transfer networks on just eight domains. Zhou et al. [109] provide a greedy algorithm which starts from the target learner and at each step includes a source learner yielding the highest conditional marginal likelihood for the target task. Specifically, suppose we have learners $\mathcal{L}_{1},\dots,\mathcal{L}_{K}$ operating on datasets $\boldsymbol{D}^{(1)}=(\boldsymbol{y}^{(1)},\boldsymbol{X}^{(1)}),\dots,\boldsymbol{D}^{(K)}=(\boldsymbol{y}^{(K)},\boldsymbol{X}^{(K)})$, respectively, where $\mathcal{L}_{1}$ is the target learner. Let $G=(V,E)$ be the (connected) network transfer graph with $V=\{1,\dots,K\}$.
Let $\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{K}$ be the parameters in $\mathcal{L}_{1},\dots,\mathcal{L}_{K}$, where for every $(i,j)\in E$ there exist subvectors $\boldsymbol{\theta}_{i,\mathcal{C}_{i}}$ and $\boldsymbol{\theta}_{j,\mathcal{C}_{j}}$ of, respectively, $\boldsymbol{\theta}_{i}$ and $\boldsymbol{\theta}_{j}$ which are restricted to be equal (shared parameter approach). Then at each step, given the chosen set of learners $Q\subset V$, which is referred to as the linkage set, let $N_{G}(Q)$ be the set of neighbors of $Q$, consisting of all learners adjacent to at least one learner in $Q$. We then select a new learner $j^{*}$ to be added to $Q$ via

$$j^{*}=\operatorname*{arg\,max}_{j\in N_{G}(Q)}p\left(\cup_{k\in Q}\boldsymbol{y}^{(k)}\mid\cup_{k\in Q}\boldsymbol{X}^{(k)},\boldsymbol{D}^{(j)}\right).\tag{9}$$

The algorithm terminates once adding a new learner no longer increases the conditional likelihood in (9). The complexity of this algorithm is $O(K^{2})$ under the assumption that the conditional likelihood in (9) can be obtained in constant time. This can be further reduced to $O(K\log K)$ if the likelihood computation is parallelized between the learners in $Q$. Zhou et al. [109] provide theoretical guarantees for the recovery of the optimal transfer subnetwork of $G$.
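The greedy search can be sketched as below. This is an illustration of the general scheme and not the implementation of [109]: the `score` callable is a hypothetical stand-in for the conditional marginal likelihood in (9), and the stopping rule, which compares the best candidate against the likelihood obtained without adding any learner, is our reading of the termination condition.

```python
def greedy_linkage_set(adjacency, target, score):
    """Greedy construction of a linkage set Q that contains `target`.

    adjacency : dict mapping each learner to the set of its neighbours in G
    score     : callable(Q, j) -> float; HYPOTHETICAL stand-in for the
                conditional marginal likelihood of the data in Q given D^(j),
                with j=None meaning that no further learner is added
    """
    Q = [target]
    while True:
        # N_G(Q): learners adjacent to at least one member of Q, excluding Q.
        neighbours = set().union(*(adjacency[k] for k in Q)) - set(Q)
        if not neighbours:
            break
        baseline = score(Q, None)
        j_star = max(sorted(neighbours), key=lambda j: score(Q, j))
        if score(Q, j_star) <= baseline:    # no candidate improves on Q alone
            break
        Q.append(j_star)
    return Q

# Toy graph on four learners with learner 1 as the target (hypothetical).
adjacency = {1: {2, 4}, 2: {1, 3}, 3: {2}, 4: {1}}
helpfulness = {2: 1.0, 3: 0.5, 4: -0.3}    # made-up log-likelihood gains

def toy_score(Q, j):
    return 0.0 if j is None else helpfulness[j]

linkage = greedy_linkage_set(adjacency, target=1, score=toy_score)
```

Each pass of the loop evaluates the score for at most $K$ candidates and there are at most $K$ passes, which is consistent with the $O(K^{2})$ count of likelihood evaluations stated above.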

Having explored a variety of Bayesian approaches to transferring information between domains in a flexible manner, we now discuss which among these are applicable to particular types of transfer learning problems depending on (i) feature spaces of source and target domains; (ii) the availability of labels and samples in both source and target datasets. We note that there is an immense Bayesian literature providing relevant models for transfer learning that we do not mention above. Instead, we have chosen to highlight some approaches that we find particularly interesting and useful.

4 Transfer with Common vs Overlapping Variables

Following criterion (i), transfer learning problems can be dichotomized based on whether or not the observations in the source and target domains live in the same feature spaces. We will refer to the former case as common variables and to the latter as overlapping variables transfer. This classification often determines which of the approaches presented in Section 3 are appropriate or even feasible to use.

4.1 Common variables transfer

Common variables transfer, also known as homogeneous transfer [47], occurs when the source and target data and labels have the same meaning in the different domains, but may follow different distributions. For example, the same variables are collected for each of the study subjects but subjects in different groups may have considerably different attributes; hence, the distribution of the variables being collected may vary across groups. In this common variables transfer case, it typically makes sense to define the same form of likelihood for the data in each domain, though there may be systematic differences in the parameters. All of the methods described in Section 3 can be applied directly to common variables problems.

Examples of common variables transfer include clinical studies with patients divided according to their health status or subtype of disease. Researchers are often interested in improving the accuracy of inference and predictions for a particular group of patients by utilizing the information gathered from the other groups or from healthy individuals. This can be especially useful when dealing with rare diseases where it is often difficult to collect measurements on a large sample of affected patients. In [7] and [38], the authors use RNA sequencing datasets for different types of lung, kidney, head and neck cancer as source domains in order to improve the accuracy of subtype identification for particular types of lung cancer. Bayesian models borrowing strength across classes and types of cancer have been applied in other contexts, including survival analysis [64], [76], and protein network inference [85], [4].

4.2 Overlapping variables transfer

This case is more challenging as the source and target datasets do not consist of measurements on exactly the same variables for different study subjects. In order for transfer learning to apply, there has to be something in common across the domains. A typical setting is when the domains have overlapping but not completely identical sets of variables. For example, there may be a common focus across the domains in studying the impact of key predictors of interest on a response, measured under different covariates.

For regression or classification models, the coefficients for the key predictors are not directly comparable across models adjusting for different covariates. Hence, shared parameter models are not appropriate. Nonetheless, it may make sense to assume that the domain-specific coefficients for the key predictors are drawn from a common random effects distribution, thereby enabling borrowing of information. Shared latent space models are even more natural in this case. By jointly modeling the response, key predictors, and covariates as conditionally independent given latent factors, we induce a parsimonious latent factor regression/classification model [11], [31]. Multi-study variants of such models can seamlessly handle cases in which the covariates differ across domains. The multi-group variants of Bayesian mixture models discussed in Section 3 similarly apply for the joint distributions of key predictors, covariates and response(s) to induce flexible transfer learning.

Bayesian methods enjoy a key advantage in this context, in their ability to rely on joint latent feature models to transfer information across domains with partially overlapping variables. The majority of non-Bayesian approaches such as deep transfer learning ([102], [84], [77], [88], [63], [59]) rely on both domains having observations in the same feature space for pre-training and fine-tuning the learners. We detail some examples of Bayesian transfer with overlapping variables below.

4.2.1 Multi-study latent factor regression

Recall the multi-study latent factor model in equation (8). Previously, we considered the observed data on subject $i$ in study (domain) $k$, $\boldsymbol{y}_{k,i}$, to be $p$-dimensional, with $p$ fixed across subjects and domains. To generalize this, we instead consider the $p$-dimensional data $\boldsymbol{y}_{k,i}$ to be the complete data for subject $i$ in study $k$ that could have potentially been measured. Then, we define $\boldsymbol{m}_{k,i}=(m_{k,i,j},\, j=1,\ldots,p)^{T}$ to be the missingness pattern for the $(k,i)$ subject, with $m_{k,i,j}=1$ if the $j$th variable is not observed for that subject and $m_{k,i,j}=0$ otherwise. A variable is missing for a subject if the study they participated in does not collect that variable, or if the study planned to collect that variable but it was not available.

Let $\boldsymbol{y}_{k,i}^{(obs)}=\{y_{k,i,j},\, j: m_{k,i,j}=0,\, j=1,\ldots,p\}$ denote the $p_{k,i}=\sum_{j=1}^{p}(1-m_{k,i,j})$ dimensional observed data vector for subject $(k,i)$. The Gaussian multi-study latent factor model characterizes the complete data vector as $\boldsymbol{y}_{k,i}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_{k})$. This in turn induces a $p_{k,i}$-dimensional multivariate Gaussian distribution for the observed data vector $\boldsymbol{y}_{k,i}^{(obs)}$ having covariance corresponding to the appropriate sub-matrix of $\boldsymbol{\Sigma}_{k}$.
In fitting Bayesian multi-study factor models, it is not necessary to impute the missing data. Instead, one can simply take into account the differing observed data contributions for each subject in implementing a Gibbs sampler or alternative Markov chain Monte Carlo (MCMC) algorithm for posterior sampling.
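
The sub-matrix argument above can be sketched in a few lines of Python. This is only an illustration under the zero-mean Gaussian model stated in the text; the function name `observed_loglik` and the numerical values are hypothetical, and a real sampler would evaluate this contribution inside its own update steps rather than as a standalone function.

```python
import numpy as np

def observed_loglik(y, m, Sigma):
    """Gaussian log-density of the observed entries of y under N(0, Sigma).

    y     : length-p complete-data vector (missing entries may hold NaN)
    m     : length-p missingness pattern, 1 = not observed, 0 = observed
    Sigma : p x p covariance of the complete data
    """
    obs = np.asarray(m) == 0
    y_obs = np.asarray(y, dtype=float)[obs]
    S_obs = Sigma[np.ix_(obs, obs)]          # sub-matrix for the observed variables
    p_obs = y_obs.size
    _, logdet = np.linalg.slogdet(S_obs)
    quad = y_obs @ np.linalg.solve(S_obs, y_obs)
    return -0.5 * (p_obs * np.log(2 * np.pi) + logdet + quad)

# Illustrative values only (not taken from the paper).
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
y = np.array([0.4, np.nan, -1.0])            # second variable was not collected
m = np.array([0, 1, 0])
ll = observed_loglik(y, m, Sigma)
```

No imputed value for the missing entry enters the computation; only the rows and columns of the covariance indexed by observed variables are used.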

This approach can be used for transfer learning about the covariance structure in multivariate data specific to the target domain. Alternatively, when the focus is on regression, one can concatenate outcomes, predictors of interest and covariates together in the $\boldsymbol{y}_{k,i}$ data vector for subject $(k,i)$. A Gaussian linear regression model for the outcome given the predictors of interest and covariates can be then obtained directly from the covariance $\boldsymbol{\Sigma}_{k}$ using standard multivariate Gaussian theory. This type of approach is straightforward to extend to mixed categorical and continuous data by following the popular approach of linking categorical variables to underlying Gaussian variables.
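
As a minimal sketch of the "standard multivariate Gaussian theory" step, the following Python snippet recovers regression coefficients and the residual variance from a joint covariance by conditioning. The helper name `regression_from_covariance` and the toy covariance are assumptions made for illustration; in practice the covariance would be a posterior draw or estimate of the domain-specific covariance.

```python
import numpy as np

def regression_from_covariance(Sigma, outcome_idx, predictor_idx):
    """Coefficients and residual variance of outcome | predictors under N(0, Sigma)."""
    S_xx = Sigma[np.ix_(predictor_idx, predictor_idx)]
    S_xy = Sigma[np.ix_(predictor_idx, [outcome_idx])].ravel()
    beta = np.linalg.solve(S_xx, S_xy)                 # Sigma_xx^{-1} Sigma_xy
    resid_var = Sigma[outcome_idx, outcome_idx] - S_xy @ beta
    return beta, resid_var

# Build an illustrative joint covariance implied by y = x @ b_true + noise.
b_true = np.array([1.0, -2.0])
S_xx = np.array([[1.0, 0.3], [0.3, 2.0]])
noise_var = 0.5
Sigma = np.zeros((3, 3))
Sigma[1:, 1:] = S_xx
Sigma[0, 1:] = Sigma[1:, 0] = S_xx @ b_true
Sigma[0, 0] = b_true @ S_xx @ b_true + noise_var

beta, resid_var = regression_from_covariance(Sigma, 0, [1, 2])
```

Applying the same function to each posterior draw of the covariance would give posterior draws of the regression coefficients.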

4.2.2 Nonlinear and nonparametric extensions

A limitation of the above multi-study factor analysis model is the assumption of multivariate Gaussianity. It is hence useful to consider extensions that incorporate shared and study-specific latent factors while relaxing these restrictive distribution assumptions, which also imply linear relationships among the variables.

In the single modality case, there is a rich literature on nonlinear factor models. For example, we could let

$$\boldsymbol{y}_{i}=f(\boldsymbol{\eta}_{i})+\boldsymbol{\epsilon}_{i},\quad \boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{p}), \tag{10}$$

where $\boldsymbol{\eta}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{q})$ are the vectors of latent factors and $f(\cdot)$ is an unknown and potentially non-linear function. Gaussian process latent variable models (GP-LVMs) place a GP prior on the function $f$ mapping from the latent to ambient space [56], [30], [80], [98]. Alternatively, the popular class of variational autoencoders (VAEs) characterizes $f$ using deep neural networks and takes a variational approach to inference [67], [45], [12].

While these highly flexible nonlinear latent variable models have exhibited appealing practical performance as black-box models for generating new data that resemble the training data, they are prone to a number of vexing issues in reproducing statistical inferences. One major challenge is the curse of dimensionality resulting from the fact that the function $f$ is an unknown mapping from $q$ to $p$ dimensional space; the space of such functions is immense, necessitating an enormous amount of training data for adequate performance. Furthermore, these models are not identifiable without substantial additional constraints. Another common problem is referred to as posterior collapse [20], [92], [93], in which there is a lack of learning about the latent variables based on the data. While there have been some attempts at addressing these problems, there remains a lack of practically useful methodology to perform reproducible dimensionality reduction.

The above challenges are exacerbated in considering extensions to the multi-study (transfer learning) case. Hence, we recommend starting with more parsimonious nonlinear latent factor models in future work developing such extensions. One promising point of departure is the recently proposed NIFTY framework of Xu et al. [98], which lets

$$\begin{aligned}
\boldsymbol{y}_{i} &= \boldsymbol{\Lambda}\boldsymbol{\eta}_{i}+\boldsymbol{\epsilon}_{i},\quad \boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}),\\
\eta_{ih} &= g_{h}(u_{ik_{h}}),\quad h=1,\ldots,q,
\end{aligned}$$

where $\boldsymbol{\Lambda}$ is a factor loading matrix, $\boldsymbol{\Sigma}=\mbox{diag}(\sigma_{1}^{2},\ldots,\sigma_{p}^{2})$, and $u_{ik}\stackrel{iid}{\sim}U(0,1)$ for $k=1,\ldots,K\leq q$. Each latent factor $\eta_{ih}$ is a transformation of a latent $u_{ik_{h}}$ via an unknown non-decreasing function $g_{h}$. The subscript $k_{h}$ allows the same latent uniforms to be used for multiple factors, inducing dependence.
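
To make the generative structure concrete, here is a small simulation sketch in Python. It only draws data from a model of the stated form; it is not the NIFTY inference procedure. The particular transformations in `g`, the index map `k`, and all dimensions are hypothetical choices, since in the actual framework the non-decreasing functions are unknown and learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, K = 200, 10, 3, 2

# Hypothetical non-decreasing transformations g_h (unknown in the real model).
g = [lambda u: u, lambda u: u ** 3, lambda u: np.sqrt(u)]
k = [0, 0, 1]   # k_h: factors 1 and 2 share the first latent uniform

U = rng.uniform(size=(n, K))                                  # u_{ik} ~ U(0, 1)
eta = np.column_stack([g[h](U[:, k[h]]) for h in range(q)])   # eta_{ih} = g_h(u_{i k_h})
Lambda = rng.normal(size=(p, q))                              # factor loadings
sigma2 = rng.uniform(0.05, 0.2, size=p)                       # diagonal of Sigma
Y = eta @ Lambda.T + rng.normal(size=(n, p)) * np.sqrt(sigma2)
```

Because the first two factors are monotone transformations of the same uniform, they are perfectly rank-dependent, which illustrates how shared uniforms induce dependence among factors.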

This model induces a flexible multivariate density for $\boldsymbol{y}_{i}$ while massively reducing dimensionality relative to model (10). Xu et al. [98] provide theory on identifiability, leveraging pre-training with state-of-the-art nonlinear dimensionality reduction algorithms. They also show excellent performance for a wide variety of complex examples. They were even able to train a realistic generative model for bird songs based on few training examples; audio recordings of bird songs provide an example of massive dimensional data with low intrinsic dimension. NIFTY can exploit the complex low dimensional structure in the data for highly efficient performance.

In conducting inference for latent variable models, the NIFTY authors noticed a common problem of distributional shift. In particular, many of the current models assume that the latent variables are iid $\mathcal{N}(0,1)$ or $U(0,1)$. Inferences on the parameters, such as the induced covariance in the Gaussian linear factor model case, critically depend on this assumption holding not just a priori but also a posteriori. Xu et al. [98] propose a general approach for solving latent variable distributional shift through forcing the posterior distribution of the latent variables to be very close to iid $U(0,1)$.

4.2.3 Mixture models

An alternative direction towards building more flexible models for transfer learning, including in the partially overlapping variables case, is to rely on mixture models, building on the developments in Section 3.3. Such models have the advantage of also clustering subjects within the different domains. In the partially overlapping variables transfer learning case, it is appealing to define a joint model, as motivated above. However, Chandra et al. [14] recently noted a pitfall of mixture models in high dimensional cases in which the posterior tends to concentrate on trivial clusterings of the observations that place all subjects into one cluster or in singleton clusters.

As a solution in the single domain case, they proposed a latent mixture model formulation that lets

$$\begin{aligned}
\boldsymbol{y}_{i} &= \boldsymbol{\Lambda}\boldsymbol{\eta}_{i}+\boldsymbol{\epsilon}_{i},\quad \boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}),\\
\boldsymbol{\eta}_{i} &\sim \sum_{h=1}^{H}\nu_{h}\mathcal{N}(\boldsymbol{\mu}_{h},\boldsymbol{\Delta}_{h}),
\end{aligned} \tag{11}$$

so that a mixture of Gaussians model is used for the latent variables in a linear factor model. They prove that this model solves the above mentioned pitfall in Bayesian clustering in high dimensions. The trick is to model the variation across clusters in a lower dimensional latent space to address the curse of dimensionality.
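
A short simulation sketch in Python may help fix ideas: cluster structure is generated in a low-dimensional latent space and then lifted to the observed dimension through the loading matrix. All numerical settings (weights, means, covariances, dimensions) are hypothetical, and the residual covariance is taken to be diagonal as an assumption for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, H = 300, 50, 2, 3

nu = np.array([0.5, 0.3, 0.2])                          # mixture weights nu_h
mu = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])    # cluster means mu_h (latent space)
Delta = np.stack([0.3 * np.eye(q)] * H)                 # cluster covariances Delta_h

z = rng.choice(H, size=n, p=nu)                         # latent cluster labels
eta = np.array([rng.multivariate_normal(mu[h], Delta[h]) for h in z])
Lambda = rng.normal(size=(p, q))                        # factor loading matrix
Sigma_diag = np.full(p, 0.5)                            # assumed diagonal residual variances
Y = eta @ Lambda.T + rng.normal(size=(n, p)) * np.sqrt(Sigma_diag)
```

Here the clusters differ only through the 2-dimensional latent factors, even though each observation has 50 coordinates.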

With this single domain specification as the starting point, there are multiple promising directions forward in terms of extensions to the multiple domain transfer learning case. One possibility is to define a multi-study factor model as in Chandra et al. [15] but instead of assuming Gaussian shared and study-specific latent factors, use Gaussian mixture models to induce a flexible distribution on the latent factors while also producing separate clusters of subjects in each domain with respect to the shared and study-specific components. An alternative is to rely on the model in equation (11) but with domain-specific distributions for the latent factors defined as

$$f_{k}(\boldsymbol{\eta}_{i})=\int\mathcal{N}(\boldsymbol{\eta}_{i};\boldsymbol{\theta}_{i})\,dP_{k}(\boldsymbol{\theta}_{i}),$$

where $\boldsymbol{\theta}_{i}=\{\boldsymbol{\mu}_{i},\boldsymbol{\Delta}_{i}\}$ are the Gaussian parameters and $P_{k}$ is a mixing distribution on these parameters that is specific to domain $k$.

In the special case in which $P_{k}=P=\sum_{h=1}^{H}\nu_{h}\delta_{\boldsymbol{\theta}_{h}^{*}}$ with $\delta_{\boldsymbol{\theta}}$ a degenerate distribution concentrated at $\boldsymbol{\theta}$, we obtain the original model in (11). However, by using the different priors $(P_{1},\ldots,P_{K})\sim\Pi$ considered in Section 3.3, we can allow differences across the domains while borrowing information; further borrowing is achieved through the implicit assumption of a latent space that is shared across domains, which is induced through the use of a common factor loading matrix $\boldsymbol{\Lambda}$.
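
For completeness, the reduction to model (11) in the special case of a common discrete mixing distribution can be written out in one line. The only notational assumption is that the atoms are written as $\boldsymbol{\theta}_{h}^{*}=\{\boldsymbol{\mu}_{h}^{*},\boldsymbol{\Delta}_{h}^{*}\}$, which play the role of $(\boldsymbol{\mu}_{h},\boldsymbol{\Delta}_{h})$ in (11).

```latex
\begin{aligned}
f_k(\boldsymbol{\eta}_i)
  &= \int \mathcal{N}(\boldsymbol{\eta}_i;\boldsymbol{\theta})\,
     d\Big(\sum_{h=1}^{H}\nu_h\,\delta_{\boldsymbol{\theta}_h^{*}}\Big)(\boldsymbol{\theta}) \\
  &= \sum_{h=1}^{H}\nu_h\,\mathcal{N}(\boldsymbol{\eta}_i;\boldsymbol{\theta}_h^{*})
   = \sum_{h=1}^{H}\nu_h\,\mathcal{N}(\boldsymbol{\eta}_i;\boldsymbol{\mu}_h^{*},\boldsymbol{\Delta}_h^{*}).
\end{aligned}
```

The integral against a point mass simply evaluates the Gaussian kernel at the corresponding atom, so every domain shares the same finite mixture.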

4.2.4 Multiresolution transfer learning

Closely related to the overlapping variables case is the setting in which data are collected for each domain on related random functions or stochastic processes. For example, let $f_{k}:\mathcal{T}\to\mathbb{R}$ denote a latent smooth continuously differentiable function for domain $k$, and suppose that we have

$$y_{k,i}=f_{k}(t_{k,i})+\epsilon_{k,i},\quad \epsilon_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,\sigma^{2}), \tag{12}$$

with $\boldsymbol{y}_{k}=\{y_{k,i},\, i=1,\ldots,n_{k}\}$ the observed data and $\boldsymbol{t}_{k}=\{t_{k,i},\, i=1,\ldots,n_{k}\}$ the observation locations for domain $k$. Often, the observation locations do not line up across domains and certain domains may have lower resolution data than others, with the resolution referring to the density of the observation points $\boldsymbol{t}_{k}$ over the support $\mathcal{T}$.

Model (12) represents a type of functional data analysis (FDA). In many FDA settings, we observe noisy realizations of a subject-specific function at a finite set of points, but here we are considering the case in which we have one function for each domain and are particularly interested in inference on the function for the target domain. There are many applications in which this type of problem arises. We may have domain-specific regression functions $f_{k}$ and want to borrow information across domains in a nonparametric regression context without assuming any common parameters. Alternatively, $\boldsymbol{y}_{k}$ may correspond to a domain $k$-specific time series and we want to borrow information across related time series to enhance prediction for a target series.

A natural Bayesian approach to inference under model (12) is to consider a functional data extension of the hierarchical and random effects modeling approaches highlighted in Section 3. For example, one could use a hierarchical GP that lets $f_{k}\sim\mbox{GP}(f_{0},c)$ with $f_{0}$ in turn given a GP prior [5], [21]. For articles on using GPs in closely related transfer learning settings to that of (12) refer to [94], [103], [39]. These approaches can automatically accommodate the case in which observations are denser in some domains than others.
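
The hierarchical GP idea can be sketched numerically. If the shared mean $f_{0}$ has a zero-mean GP prior with covariance $c_{0}$ and each $f_{k}$ deviates from it with covariance $c$, then marginally $\mathrm{Cov}(f_{k}(t),f_{l}(t'))=c_{0}(t,t')+1\{k=l\}\,c(t,t')$, and the target function can be predicted from all domains jointly. The Python snippet below is a rough illustration under that assumption; the squared-exponential kernels, their hyperparameters, the fixed noise variance, and the toy data are all hypothetical choices, and hyperparameters are held fixed rather than inferred.

```python
import numpy as np

def sq_exp(a, b, variance, lengthscale):
    """Squared-exponential kernel matrix between 1-d location arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def joint_cov(t1, k1, t2, k2):
    """Cov(f_k1(t1), f_k2(t2)) = c0 + 1[k1 == k2] * c after integrating out f0."""
    c0 = sq_exp(t1, t2, 1.0, 0.3)    # covariance of the shared mean f0 (assumed)
    c = sq_exp(t1, t2, 0.1, 0.3)     # domain-specific deviation covariance (assumed)
    return c0 + (k1[:, None] == k2[None, :]) * c

rng = np.random.default_rng(2)
t_src = np.linspace(0, 1, 40)        # dense source observations (domain 0)
t_tgt = np.array([0.1, 0.5, 0.9])    # sparse target observations (domain 1)
t = np.concatenate([t_src, t_tgt])
k = np.concatenate([np.zeros(40, int), np.ones(3, int)])
sigma2 = 0.01
y = np.sin(2 * np.pi * t) + 0.1 * k + rng.normal(scale=np.sqrt(sigma2), size=t.size)

# Posterior mean of the target-domain function on a fine grid.
t_new = np.linspace(0, 1, 25)
k_new = np.ones(25, int)
K_yy = joint_cov(t, k, t, k) + sigma2 * np.eye(t.size)
K_ny = joint_cov(t_new, k_new, t, k)
f_tgt_mean = K_ny @ np.linalg.solve(K_yy, y)
```

With only three target observations, the shape of the target curve is learned almost entirely from the dense source domain, while the target points inform the domain-specific deviation.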

Wilson et al. [94] introduce GP regression networks (GPRN) which use latent GPs to transfer information between different continuous time processes. Specifically, given $K$ time series $\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{K}$, [94] let

$$\boldsymbol{y}_{k}=\sum_{q=1}^{Q}\boldsymbol{W}_{k,q}\odot\boldsymbol{f}_{q}+\boldsymbol{\epsilon}_{k},$$

where $\odot$ is the Hadamard product, $\boldsymbol{f}_{q}\sim\mathcal{GP}(0,\boldsymbol{K}_{q}^{f})$ are latent basis GPs evaluated at the observation times, $\boldsymbol{W}_{k,q}\sim\mathcal{GP}(0,\boldsymbol{K}_{k,q}^{w})$ are domain-specific weights of the latent GPs, and $\boldsymbol{\epsilon}_{k}\sim\mathcal{N}(\boldsymbol{0},\sigma_{k}^{2}\boldsymbol{I})$ represents measurement noise. The weights $\boldsymbol{W}_{k,q}$ determine the strength and structure of information transfer between the domains, analogously to the network of learners introduced in [109]. However, unlike in most works in Section 3.4, GPRN allows for the strength of the transfer within the network of learners to vary over time.
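
The following Python sketch simulates data of this form on a common time grid, to show how time-varying weights mix the latent basis processes. It is a forward simulation only, not the inference scheme of [94]. The squared-exponential kernels, the use of one shared kernel for all weights and one for all basis functions, the jitter term, and the dimensions are simplifying assumptions.

```python
import numpy as np

def sq_exp(t, lengthscale, jitter=1e-6):
    """Squared-exponential kernel on a 1-d grid, with jitter for numerical stability."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2) + jitter * np.eye(t.size)

rng = np.random.default_rng(3)
K, Q, n = 3, 2, 100
t = np.linspace(0, 1, n)

L_f = np.linalg.cholesky(sq_exp(t, 0.1))   # basis GPs vary quickly (assumed)
L_w = np.linalg.cholesky(sq_exp(t, 0.5))   # weights vary slowly (assumed)

f = L_f @ rng.normal(size=(n, Q))                               # f[:, q] is f_q on the grid
W = np.einsum('ij,jkq->ikq', L_w, rng.normal(size=(n, K, Q)))   # W[:, k, q] is W_{k,q}
sigma = np.array([0.05, 0.1, 0.2])                              # domain-specific noise s.d.

signal = np.einsum('ikq,iq->ik', W, f)     # sum_q W_{k,q} (Hadamard) f_q, per domain
Y = signal + rng.normal(size=(n, K)) * sigma
```

Each column of `Y` is one domain's series; since `W` changes along the time axis, the contribution of each basis process to each domain changes over time.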

Although [94] simplify inference by assuming identical measurement times across domains, [39] extend the approach to allow different measurement times and increase flexibility by using deep GPs. By using a shared latent space approach with GPs serving as the basis for the latent space, these models are able to achieve transfer both across resolutions and different classes of observations, corresponding to different air pollutants in the application presented in [39].

5 Availability of labelled data

As we have seen, the Bayesian paradigm provides a fertile ground for developing a rich variety of techniques relevant to transfer learning in both supervised [78], [18], [43], [47], [48], [49], [64] and unsupervised settings [23], [90], [15], [82], [81]. When the focus is on prediction, there are often challenges presented by the limited availability of labelled data. The term semi-supervised learning refers to the case in which labels are only available for a subset of the samples. The joint modeling approaches for overlapping variable transfer learning described in the previous section can trivially handle semi-supervised settings, as missing labels are just one type of missing data.

Particularly challenging are cases of one-shot and few-shot learning, which refer to having only a single or a few labelled samples in the target domain, respectively. For articles proposing Bayesian approaches to handle semi-supervised learning and these cases of a tiny number of labelled target data, refer to [101], [71], [75], [53], [55], [54]. In such cases, performance is critically dependent on borrowing of information from labelled data in the source domains. Common examples include classification based on images or audio ([53], [55], [54]). For example, we may have many labelled examples of different individuals' handwriting but only a tiny number for the individual of interest.

In the transfer learning literature, inductive transfer learning refers to the case where target domain labels are available, while transductive transfer learning has labels available only in the source domains [70]. Of course, if there are certain systematic differences between the source and target domains, accurate transductive transfer may be impossible [22]. Nonetheless, there is a rich PAC-Bayesian literature on this topic ([32], [33], [34], [79]) which specifies conditions allowing for successful training of the target learner in the absence of target labels in classification settings. These works provide theoretical upper bounds on the expected error on the target domains of a Gibbs classifier depending on various measures of divergence between the distributions of predictors and labels of both domains as well as the properties of the set of voter algorithms from which the Gibbs classifier is constructed.

6 Simulation illustration

In order to illustrate Bayesian transfer learning in practice, we run a simulation experiment focused on the problem of transfer learning targeting the covariance and/or precision matrix of a high-dimensional multivariate Gaussian distribution. In particular, there are data collected on the same set of variables for subjects in different groups and we would like to allow the covariance/precision to vary across groups, while borrowing information. This is a natural setting for both Bayesian multi-study factor analysis models and frequentist competitors focused on transfer learning in precision matrix estimation.

Specifically, we compare Bayesian Subspace Factor Analysis (SUFA) [15] presented in Section 3.3 with the frequentist Trans-CLIME method proposed by Li et al. [74]. Trans-CLIME is a transfer learning extension of constrained $L_{1}$ minimization for inverse matrix estimation (CLIME) [87]. We also include the frequentist estimator proposed by Guo et al. [36], which Li et al. [74] refer to as multitask graphical lasso (MT-Glasso). While [15] focused on inferring the covariance, SUFA can just as easily be used to infer any functional of the covariance including the precision. We focus our comparisons on precision estimation, as this was the emphasis in [74] and [36].

We generate the synthetic data in two domains, $T$ (target) and $S$ (source), with sample sizes $n_{T}=40$ and $n_{S}=80$, respectively, and consider the data dimensions $p\in\{40,60,80,\dots,280\}$. We generate 50 replicated datasets for each value of $p$. We report the average Frobenius and $L_{1}$ norm of the errors between the estimated and true values for the target precision matrix. For each value of $p$ considered, we generate the data using a fixed true precision matrix across all the simulation replicates. We present the results in Figure 3.
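
The error metrics and replication loop can be sketched as follows in Python. Several things here are assumptions rather than details from the study: the text does not say whether the $L_{1}$ norm is the induced matrix norm or the elementwise sum, so the sketch uses the induced norm (maximum absolute column sum); the tridiagonal true precision is a hypothetical choice; and the ridge-regularized inverse is a naive placeholder estimator, not SUFA, Trans-CLIME or MT-Glasso.

```python
import numpy as np

def precision_errors(Omega_hat, Omega_true):
    """Frobenius and induced L1 (max absolute column sum) norms of the error."""
    diff = Omega_hat - Omega_true
    return np.linalg.norm(diff, 'fro'), np.linalg.norm(diff, 1)

rng = np.random.default_rng(4)
p, n_T, n_rep = 40, 40, 50

# Hypothetical true precision: tridiagonal and diagonally dominant, hence positive definite.
Omega_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_true = np.linalg.inv(Omega_true)
Sigma_true = (Sigma_true + Sigma_true.T) / 2      # remove round-off asymmetry

errors = []
for _ in range(n_rep):
    X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n_T)
    S = X.T @ X / n_T                             # target sample covariance
    Omega_hat = np.linalg.inv(S + 0.1 * np.eye(p))  # placeholder estimator only
    errors.append(precision_errors(Omega_hat, Omega_true))

avg_frob, avg_l1 = np.mean(errors, axis=0)        # averages over replicates
```

Swapping the placeholder line for any estimator that returns a $p\times p$ precision matrix gives the corresponding averaged errors for that method; an elementwise $L_{1}$ error would instead be `np.abs(diff).sum()`.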

Figure 3: Frobenius and $L_{1}$ norm errors of target precision matrix estimation for SUFA, Trans-CLIME and MT-Glasso over varying dimension $p$.

As we can see, in terms of the $L_{1}$ norm of the error, the performance tends to improve significantly with growing dimension for SUFA and MT-Glasso, and to an extent for Trans-CLIME as well. This somewhat counterintuitive phenomenon, known as the blessing of dimensionality [104], is commonly encountered in the covariance and precision estimation literature [66], [96], [104], [72]. For the Frobenius norm error we see more instability in the performance of MT-Glasso and Trans-CLIME. Intuitively, the $L_{1}$ norm might be more favorable for these two methods, since they are both built on $L_{1}$ minimization.

Since Trans-CLIME does not guarantee positive definiteness and invertibility of the estimated precision matrices, it is problematic to use it for covariance estimation; even after selecting only the synthetic datasets for which Trans-CLIME did produce invertible precision matrices, the resulting covariance estimates were highly unstable and overall significantly worse than those given by SUFA. MT-Glasso produces invertible precision matrices but also yielded covariance estimates further from the truth than SUFA estimates.

Hence, we find that this particular Bayesian approach to transfer learning based on a shared latent space model is competitive with frequentist counterparts even on tasks it was not built for (estimating the precision instead of the covariance). Indeed, as we have partially illustrated, since we have a posterior for the covariance that has support on the space of positive semidefinite covariances, we can provide Bayes estimates and posterior credible intervals providing uncertainty quantification for any functional of the covariance of interest. Hence, from a single Bayesian analysis, we can provide multiple results of interest that are all internally coherent.
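
The point about coherent inference on functionals can be illustrated with a short Python sketch: given posterior draws of a covariance matrix, any functional is computed draw by draw and then summarized. The draws below are synthetic stand-ins generated from a perturbed factor structure, not output from SUFA or any real sampler, and the choice of partial correlation as the functional is just an example.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, n_draws = 5, 2, 1000

# Synthetic stand-in for MCMC output: Sigma = Lambda Lambda^T + diag(psi).
# A real analysis would use the sampler's posterior draws here instead.
Lambda0 = rng.normal(size=(p, q))
Sigma_draws = np.empty((n_draws, p, p))
for s in range(n_draws):
    Lam = Lambda0 + 0.05 * rng.normal(size=(p, q))
    psi = rng.uniform(0.8, 1.2, size=p)
    Sigma_draws[s] = Lam @ Lam.T + np.diag(psi)

# Functionals are computed per draw, so all summaries come from one posterior.
Omega_draws = np.linalg.inv(Sigma_draws)                 # precision matrices (batched inverse)
d = np.sqrt(np.einsum('sii->si', Omega_draws))           # sqrt of precision diagonals
partial_corr_draws = -Omega_draws / (d[:, :, None] * d[:, None, :])

Omega_mean = Omega_draws.mean(axis=0)                    # posterior mean of the precision
ci_lo, ci_hi = np.percentile(partial_corr_draws[:, 0, 1], [2.5, 97.5])  # 95% interval
```

Because each precision draw is the inverse of a positive definite covariance draw, the derived quantities are automatically valid, which is the internal coherence referred to above.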

7 Discussion

Transfer learning is a timely problem given the abundance of datasets from related domains. In many applications, there is simply not enough data from the domain of interest to support reliable inference and accurate predictions as we seek to fit increasingly complex models. Hence, it becomes critical to cleverly borrow information from available “source” datasets.

Choosing the appropriate strength and structure of information transfer between the domains remains one of the key challenges. The Bayesian paradigm offers a wide variety of approaches to transfer learning, including shared parameters, hierarchical and random effects models, shared latent space, and network transfer methods. There is a rich literature developing and applying these approaches in transfer learning settings, even though most often “transfer learning” is not mentioned in the associated papers.

This article has focused on providing a flavor of some of the interesting directions that are possible in Bayesian transfer learning, but has not attempted a comprehensive overview of the massive relevant literature. Most of the transfer learning literature has focused on the simplest “common variables” case, where data from different domains consist of the same variables measured across different subjects. Bayesian ideas applied to transfer learning can be particularly useful in the more challenging settings presented in Section 4, where methods developed for the common variables case largely do not apply.

We have purposely focused much of our attention on shared latent space-type models for Bayesian transfer learning, ranging from multi-study factor analysis to multi-group Bayesian nonparametric models. We focused on these areas because the associated models are not as well known to the broad community but are very practically useful, including in challenging high-dimensional and complex structured data cases. In addition, there have been interesting recent developments that we have highlighted, while sketching out some promising directions for ongoing research. This includes extending Bayesian continuous latent factor modeling approaches to transfer learning settings. Our view is that more careful statistical models will tend to dominate over-parametrized black boxes, such as VAEs, in many settings.

An additional interesting area for future research is Bayesian transfer learning involving deep neural networks. While recent papers have taken some early steps in this area [78], [16], there is plenty of potential for further impactful developments, especially given the importance of transfer learning for training deep neural networks, which are data-hungry.

8 Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement No. 856506), United States National Institutes of Health (R01ES035625), Office of Naval Research (N00014-21-1-2510), National Institute of Environmental Health Sciences (R01ES027498), and National Science Foundation (DMS-2230074). Suder’s contributions were supported in part by the Duke Summer Research Fellowship.

References

  • [1] Rodríguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association.
  • [2] Avrahami, O., Lischinski, D. and Fried, O. (2021). GAN cocktail: Mixing GANs without dataset access. European Conference on Computer Vision.
  • [3] Baglama, J. and Reichel, L. (2005). Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM Journal on Scientific Computing.
  • [4] Baladandayuthapani, V., Talluri, R., Ji, Y., Coombes, K., Lu, Y., Hennessy, B., Davies, M. and Mallick, B. (2014). Bayesian sparse graphical models for classification with application to protein expression data. The Annals of Applied Statistics.
  • [5] Behseta, S., Kass, R. E. and Wallstrom, G. L. (2005). Hierarchical models for assessing variability among functions. Biometrika.
  • [6] Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association.
  • [7] Boluki, S., Qian, X. and Dougherty, E. R. (2021). Optimal Bayesian supervised domain adaptation for RNA sequencing data. Bioinformatics.
  • [8] Boschini, M., Bonicelli, L., Porrello, A., Bellitto, G., Pennisi, M., Palazzo, S., Spampinato, C. and Calderara, S. (2022). Transfer without forgetting. European Conference on Computer Vision.
  • [9] Bu, F., Kagaayi, J., Grabowski, K., Ratmann, O. and Xu, J. (2023). Inferring HIV transmission patterns from viral deep-sequence data via latent typed point processes. arXiv preprint arXiv:2302.11567.
  • [10] Camerlenghi, F., Lijoi, A., Orbanz, P. and Prünster, I. (2019). Distribution theory for hierarchical processes. The Annals of Statistics.
  • [11] Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q. and West, M. (2008). High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association.
  • [12] Cemgil, T., Ghaisas, S., Dvijotham, K., Gowal, S. and Kohli, P. (2020). The autoencoding variational autoencoder. Advances in Neural Information Processing Systems.
  • [13] Chakrabarti, A., Ni, Y., Morris, E. R. A., Salinas, M. L., Chapkin, R. S. and Mallick, B. K. (2023). Graphical Dirichlet process for clustering non-exchangeable grouped data. arXiv preprint arXiv:2302.09111.
  • [14] Chandra, N. K., Canale, A. and Dunson, D. B. (2023). Escaping the curse of dimensionality in Bayesian model-based clustering. Journal of Machine Learning Research.
  • [15] Chandra, N. K., Dunson, D. B. and Xu, J. (2023). Inferring covariance structure from multiple data sources via subspace factor analysis. arXiv preprint arXiv:2305.04113.
  • [16] Chandra, R. and Kapoor, A. (2020). Bayesian neural multi-source transfer learning. Neurocomputing.
  • [17] Chen, M.-H. and Ibrahim, J. (2006). The relationship between the power prior and hierarchical models. Bayesian Analysis.
  • [18] Chen, M.-H. and Ibrahim, J. G. (2000). Power prior distributions for regression models. Statistical Science.
  • [19] Chen, M.-H., Ibrahim, J. G., Lam, P., Yu, A. and Zhang, Y. (2011). Bayesian design of noninferiority trials for medical devices using historical data. Biometrics.
  • [20] Dai, B., Wang, Z. and Wipf, D. (2020). The usual suspects? Reassessing blame for VAE posterior collapse. International Conference on Machine Learning.
  • [21] Kowal, D. R., Matteson, D. S. and Ruppert, D. (2019). Functional autoregression for sparsely sampled data. Journal of Business & Economic Statistics.
  • [22] Ben-David, S., Lu, T., Luu, T. and Pal, D. (2010). Impossibility theorems for domain adaptation. International Conference on Artificial Intelligence and Statistics.
  • [23] De Vito, R., Bellio, R., Trippa, L. and Parmigiani, G. (2019). Multi-study factor analysis. Biometrics.
  • [24] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition.
  • [25] Dias, S., Welton, N. J., Sutton, A. J. and Ades, A. E. (2014). NICE DSU technical support document 2: A generalised linear modelling framework for pairwise and network meta-analysis of randomised controlled trials. National Institute for Health and Care Excellence (NICE).
  • [26] Ding, D. Y., Li, S., Narasimhan, B. and Tibshirani, R. (2022). Cooperative learning for multiview analysis. Proceedings of the National Academy of Sciences.
  • [27] Duan, Y., Ye, K. and Smith, E. P. (2006). Evaluating water quality using power priors to incorporate historical information. Environmetrics.
  • [28] Durante, D. and Dunson, D. B. (2018). Bayesian inference and testing of group differences in brain networks. Bayesian Analysis.
  • [29] Ebrahimi, S., Elhoseiny, M., Darrell, T. and Rohrbach, M. (2020). Uncertainty-guided continual learning with Bayesian neural networks. International Conference on Learning Representations.
  • [30] Eleftheriadis, S., Rudovic, O. and Pantic, M. (2014). Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition. IEEE Transactions on Image Processing.
  • [31] Ferrari, F. and Dunson, D. B. (2021). Bayesian factor analysis for inference on interactions. Journal of the American Statistical Association.
  • [32] Germain, P., Habrard, A., Laviolette, F. and Morvant, E. (2013). A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. International Conference on Machine Learning.
  • [33] Germain, P., Habrard, A., Laviolette, F. and Morvant, E. (2016). A new PAC-Bayesian perspective on domain adaptation. International Conference on Machine Learning.
  • [34] Germain, P., Habrard, A., Laviolette, F. and Morvant, E. (2020). PAC-Bayes and domain adaptation. Neurocomputing.
  • [35] Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika.
  • [36] Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika.
  • [37] Gönen, M. and Margolin, A. A. (2014). Kernelized Bayesian transfer learning. AAAI Conference on Artificial Intelligence.
  • [38] Hajiramezanali, E., Zamani Dadaneh, S., Karbalayghareh, A., Zhou, M. and Qian, X. (2018). Bayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. Advances in Neural Information Processing Systems.
  • [39] Hamelijnck, O., Damoulas, T., Wang, K. and Girolami, M. (2019). Multi-resolution multi-task Gaussian processes. Advances in Neural Information Processing Systems.
  • [40] Hoff, P. (2007). Modeling homophily and stochastic equivalence in symmetric relational data. Advances in Neural Information Processing Systems.
  • [41] Hospedales, T., Antoniou, A., Micaelli, P. and Storkey, A. (2022). Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [42] Ibrahim, J. G., Chen, M.-H. and Sinha, D. (2001). Bayesian semiparametric models for survival data with a cure fraction. Biometrics.
  • [43] Ibrahim, J. G., Chen, M.-H., Gwon, Y. and Chen, F. (2015). The power prior: theory and applications. Statistics in Medicine.
  • [44] Ibrahim, J. G., Chen, M.-H. and Sinha, D. (2003). On optimality properties of the power prior. Journal of the American Statistical Association.
  • [45] Ilse, M., Tomczak, J. M., Louizos, C. and Welling, M. (2020). DIVA: Domain invariant variational autoencoders. Conference on Medical Imaging with Deep Learning.
  • [46] Kapoor, S., Karaletsos, T. and Bui, T. D. (2021). Variational auto-regressive Gaussian processes for continual learning. International Conference on Machine Learning.
  • [47] Karbalayghareh, A., Qian, X. and Dougherty, E. R. (2018). Optimal Bayesian transfer learning. IEEE Transactions on Signal Processing.
  • [48] Karbalayghareh, A., Qian, X. and Dougherty, E. R. (2018). Optimal Bayesian transfer regression. IEEE Signal Processing Letters.
  • [49] Karbalayghareh, A., Qian, X. and Dougherty, E. R. (2021). Optimal Bayesian transfer learning for count data. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
  • [50] Kouw, W. M. and Loog, M. (2019). An introduction to domain adaptation and transfer learning. arXiv preprint arXiv:1812.11806.
  • [51] Kumar, A., Chatterjee, S. and Rai, P. (2021). Bayesian structural adaptation for continual learning. International Conference on Machine Learning.
  • [52] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T. and Ferrari, V. (2020). The open images dataset V4. International Journal of Computer Vision.
  • [53] Lake, B. M., Salakhutdinov, R., Gross, J. and Tenenbaum, J. B. (2011). One shot learning of simple visual concepts. Cognitive Science.
  • [54] Lake, B. M., Salakhutdinov, R. and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science.
  • [55] Lake, B. M., Salakhutdinov, R. R. and Tenenbaum, J. (2013). One-shot learning by inverting a compositional causal process. Advances in Neural Information Processing Systems.
  • [56] Lawrence, N. D. and Moore, A. J. (2007). Hierarchical Gaussian process latent variable models. International Conference on Machine Learning.
  • [57] LeBlanc, P. M. and Banks, D. (2023). Time-varying Bayesian network meta-analysis. arXiv preprint arXiv:2211.08312.
  • [58] Lock, E. F. and Dunson, D. B. (2015). Shared kernel Bayesian screening. Biometrika.
  • [59] Long, M., Cao, Y., Wang, J. and Jordan, M. I. (2015). Learning transferable features with deep adaptation networks. International Conference on Machine Learning.
  • [60] Lopes, H. F. and West, M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica.
  • [61] Lu, G. and Ades, A. E. (2004). Combination of direct and indirect evidence in mixed treatment comparisons. Statistics in Medicine.
  • [62] Lu, G. and Ades, A. E. (2006). Assessing evidence inconsistency in mixed treatment comparisons. Journal of the American Statistical Association.
  • [63] Luo, Z., Zou, Y., Hoffman, J. and Fei-Fei, L. (2017). Label efficient learning of transferable representations across domains and tasks. Advances in Neural Information Processing Systems.
  • [64] Maity, A., Bhattacharya, A., Mallick, B. and Baladandayuthapani, V. (2019). Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. Biometrics.
  • [65] McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation.
  • [66] Molstad, A. J., Ekvall, K. O. and Suder, P. M. (2022). Direct covariance matrix estimation with compositional data. arXiv preprint arXiv:2212.09833.
  • [67] Moran, G. E., Sridhar, D., Wang, Y. and Blei, D. (2022). Identifiable deep generative models via sparse decoding. Transactions on Machine Learning Research.
  • [68] Müller, P., Quintana, F. and Rosner, G. (2004). A method for combining inference across related nonparametric Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
  • [69] Niu, S., Liu, Y., Wang, J. and Song, H. (2020). A decade survey of transfer learning (2010–2020). IEEE Transactions on Artificial Intelligence.
  • [70] Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering.
  • [71] Patacchiola, M., Turner, J., Crowley, E. J., O'Boyle, M. and Storkey, A. J. (2020). Bayesian meta-learning for the few-shot setting via deep kernels. Advances in Neural Information Processing Systems.
  • [72] Li, Q., Cheng, G., Fan, J. and Wang, Y. (2018). Embracing the blessing of dimensionality in factor models. Journal of the American Statistical Association.
  • [73] Ravi, S. and Beatson, A. (2019). Amortized Bayesian meta-learning. International Conference on Learning Representations.
  • [74] Li, S., Cai, T. T. and Li, H. (2022). Transfer learning in large-scale Gaussian graphical models with false discovery rate control. Journal of the American Statistical Association.
  • [75] Salakhutdinov, R., Tenenbaum, J. and Torralba, A. (2012). One-shot learning with a hierarchical nonparametric Bayesian model. Proceedings of ICML Workshop on Unsupervised and Transfer Learning.
  • [76] Samorodnitsky, S., Hoadley, K. and Lock, E. (2020). A pan-cancer and polygenic Bayesian hierarchical model for the effect of somatic mutations on survival. Cancer Informatics.
  • [77] Shin, H.-C., Roth, H. R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D. and Summers, R. M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging.
  • [78] Shwartz-Ziv, R., Goldblum, M., Souri, H., Kapoor, S., Zhu, C., LeCun, Y. and Wilson, A. G. (2022). Pre-train your loss: Easy Bayesian transfer learning with informative priors. Advances in Neural Information Processing Systems.
  • [79] Sicilia, A., Atwell, K., Alikhani, M. and Hwang, S. J. (2022). PAC-Bayesian domain adaptation bounds for multiclass learners. Uncertainty in Artificial Intelligence.
  • [80] Song, G., Wang, S., Huang, Q. and Tian, Q. (2019). Harmonized multimodal learning with Gaussian process latent variable models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [81] {barticle}[author] \bauthor\bsnmSotiropoulos, \bfnmStamatios N\binitsS. N., \bauthor\bsnmHernández-Fernández, \bfnmMoisés\binitsM., \bauthor\bsnmVu, \bfnmAn T\binitsA. T., \bauthor\bsnmAndersson, \bfnmJesper L\binitsJ. L., \bauthor\bsnmMoeller, \bfnmSteen\binitsS., \bauthor\bsnmYacoub, \bfnmEssa\binitsE., \bauthor\bsnmLenglet, \bfnmChristophe\binitsC., \bauthor\bsnmUgurbil, \bfnmKamil\binitsK., \bauthor\bsnmBehrens, \bfnmTimothy E J\binitsT. E. J. and \bauthor\bsnmJbabdi, \bfnmSaad\binitsS. (\byear2016). \btitleFusion in diffusion MRI for improved fibre orientation estimation: An application to the 3T and 7T data of the Human Connectome Project. \bjournalNeuroimage. \endbibitem
  • [82] {barticle}[author] \bauthor\bsnmSotiropoulos, \bfnmStamatios N\binitsS. N., \bauthor\bsnmJbabdi, \bfnmSaad\binitsS., \bauthor\bsnmAndersson, \bfnmJesper L\binitsJ. L., \bauthor\bsnmWoolrich, \bfnmMark W\binitsM. W., \bauthor\bsnmUgurbil, \bfnmKamil\binitsK. and \bauthor\bsnmBehrens, \bfnmTimothy E J\binitsT. E. J. (\byear2013). \btitleRubiX: combining spatial resolutions for Bayesian inference of crossing fibers in diffusion MRI. \bjournalIEEE Transactions on Medical Imaging. \endbibitem
  • [83] {barticle}[author] \bauthor\bsnmSpiegelhalter, \bfnmDavid J.\binitsD. J., \bauthor\bsnmBest, \bfnmNicola G.\binitsN. G., \bauthor\bsnmCarlin, \bfnmBradley P.\binitsB. P. and \bauthor\bsnmVan Der Linde, \bfnmAngelika\binitsA. (\byear2002). \btitleBayesian measures of model complexity and fit. \bjournalJournal of the Royal Statistical Society: Series B (Statistical Methodology). \endbibitem
  • [84] {barticle}[author] \bauthor\bsnmTan, \bfnmChuanqi\binitsC., \bauthor\bsnmSun, \bfnmFuchun\binitsF., \bauthor\bsnmKong, \bfnmTao\binitsT., \bauthor\bsnmZhang, \bfnmWenchang\binitsW., \bauthor\bsnmYang, \bfnmChao\binitsC. and \bauthor\bsnmLiu, \bfnmChunfang\binitsC. (\byear2018). \btitleA survey on deep transfer learning. \bjournalArtificial Neural Networks and Machine Learning. \endbibitem
  • [85] {barticle}[author] \bauthor\bsnmTan, \bfnmLinda S. L.\binitsL. S. L., \bauthor\bsnmJasra, \bfnmAjay\binitsA., \bauthor\bsnmIorio, \bfnmMaria De\binitsM. D. and \bauthor\bsnmEbbels, \bfnmTimothy M. D.\binitsT. M. D. (\byear2017). \btitleBayesian inference for multiple Gaussian graphical models with application to metabolic association networks. \bjournalThe Annals of Applied Statistics. \endbibitem
  • [86] {barticle}[author] \bauthor\bsnmTeh, \bfnmYee Whye\binitsY. W., \bauthor\bsnmJordan, \bfnmMichael I.\binitsM. I., \bauthor\bsnmBeal, \bfnmMatthew J.\binitsM. J. and \bauthor\bsnmBlei, \bfnmDavid M.\binitsD. M. (\byear2006). \btitleHierarchical Dirichlet processes. \bjournalJournal of the American Statistical Association. \endbibitem
  • [87] {barticle}[author] \bauthor\bsnmTony Cai, \bfnmWeidong Liu\binitsW. L. and \bauthor\bsnmLuo, \bfnmXi\binitsX. (\byear2011). \btitleA constrained L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT minimization approach to sparse precision matrix estimation. \bjournalJournal of the American Statistical Association. \endbibitem
  • [88] Tzeng, E., Hoffman, J., Darrell, T. and Saenko, K. (2015). Simultaneous Deep Transfer Across Domains and Tasks. IEEE International Conference on Computer Vision.
  • [89] Vanschoren, J. (2019). Meta-Learning. In Automated Machine Learning: Methods, Systems, Challenges.
  • [90] De Vito, R., Bellio, R., Trippa, L. and Parmigiani, G. (2021). Bayesian multistudy factor analysis for high-throughput biological data. The Annals of Applied Statistics.
  • [91] Wang, B. and Pineau, J. (2015). Online boosting algorithms for anytime transfer and multitask learning. AAAI Conference on Artificial Intelligence.
  • [92] Wang, Y., Blei, D. and Cunningham, J. P. (2021). Posterior collapse and latent variable non-identifiability. Advances in Neural Information Processing Systems.
  • [93] Wang, Z. and Ziyin, L. (2022). Posterior collapse of a linear latent variable model. Advances in Neural Information Processing Systems.
  • [94] Wilson, A. G., Knowles, D. A. and Ghahramani, Z. (2012). Gaussian process regression networks. International Conference on Machine Learning.
  • [95] Wood, F. and Teh, Y. W. (2009). A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. International Conference on Artificial Intelligence and Statistics.
  • [96] Xu, J. and Lange, K. (2022). A proximal distance algorithm for likelihood-based sparse covariance estimation. Biometrika.
  • [97] Xu, J. and Zhu, Z. (2018). Reinforced continual learning. Advances in Neural Information Processing Systems.
  • [98] Xu, M., Herring, A. H. and Dunson, D. B. (2023). Identifiable and interpretable nonparametric factor analysis. arXiv preprint arXiv:2311.08254.
  • [99] Xuan, J., Lu, J. and Zhang, G. (2021). Bayesian transfer learning: An overview of probabilistic graphical models for transfer learning. arXiv preprint arXiv:2109.13233.
  • [100] Yang, Q., Zhang, Y., Dai, W. and Pan, S. J. (2020). Transfer learning. Cambridge University Press.
  • [101] Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y. and Ahn, S. (2018). Bayesian model-agnostic meta-learning. Advances in Neural Information Processing Systems.
  • [102] Yosinski, J., Clune, J., Bengio, Y. and Lipson, H. (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems.
  • [103] Yousefi, F., Smith, M. T. and Álvarez, M. (2019). Multi-task learning for aggregated data using Gaussian processes. Advances in Neural Information Processing Systems.
  • [104] Cao, Y., Lin, W. and Li, H. (2019). Large covariance estimation for compositional data via composition-adjusted thresholding. Journal of the American Statistical Association.
  • [105] Zhang, Q., Fang, J., Meng, Z., Liang, S. and Yilmaz, E. (2021). Variational continual Bayesian meta-learning. Advances in Neural Information Processing Systems.
  • [106] Zhang, W., Deng, L., Zhang, L. and Wu, D. (2023). A survey on negative transfer. IEEE/CAA Journal of Automatica Sinica.
  • [107] Zhao, T., Wang, Z., Masoomi, A. and Dy, J. (2022). Deep Bayesian unsupervised lifelong learning. Neural Networks.
  • [108] Zhou, A. and Levine, S. (2021). Bayesian adaptation for covariate shift. Advances in Neural Information Processing Systems.
  • [109] Zhou, J., Ding, J., Tan, K. M. and Tarokh, V. (2021). Model linkage selection for cooperative learning. Journal of Machine Learning Research.
  • [110] Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H. and He, Q. (2021). A comprehensive survey on transfer learning. Proceedings of the IEEE.