Bayesian Transfer Learning

Piotr M. Suder, Jason Xu, and David B. Dunson

Piotr M. Suder: PhD Student, Department of Statistical Science, Duke University. Jason Xu: Assistant Professor, Department of Statistical Science, Duke University. David B. Dunson: Arts and Sciences Distinguished Professor, Departments of Statistical Science and Mathematics, Duke University.

Abstract

Transfer learning is a burgeoning concept in statistical machine learning that seeks to improve inference and/or predictive accuracy on a domain of interest by leveraging data from related domains. While the term “transfer learning” has garnered much recent interest, its foundational principles have existed for years under various guises. Prior literature reviews in computer science and electrical engineering have sought to bring these ideas into focus, primarily surveying general methodologies and works from these disciplines. This article highlights Bayesian approaches to transfer learning, which have received relatively limited attention despite their innate compatibility with the notion of drawing upon prior knowledge to guide new learning tasks. Our survey encompasses a wide range of Bayesian transfer learning frameworks applicable to a variety of practical settings. We discuss how these methods address the problem of finding the optimal information to transfer between domains, which is a central question in transfer learning. We illustrate the utility of Bayesian transfer learning methods via a simulation study where we compare performance against frequentist competitors.

Keywords: Bayesian machine learning, domain adaptation, hierarchical model, meta analysis

1 Introduction

Transfer learning, the practice of applying knowledge gained from training on previous tasks and domains toward new tasks, is a burgeoning concept in statistics and machine learning. This natural idea mimics some of the mechanisms of human intelligence, where past experience, skills and knowledge are often utilized in learning new topics. It is appealing to apply the same paradigm in developing machine intelligence to extract knowledge from the rapidly growing body of datasets available to scientists, which are often related to each other in various ways. If the domains between which the transfer of information occurs are sufficiently related, transfer learning can substantially improve the performance of the target model. This is particularly useful when the target dataset we want to study is small and does not contain enough datapoints to extract precise inferences or predictions, but we have access to a large, related dataset.

For instance, suppose we want to study brain connectomes of Alzheimer’s patients or genomes of people suffering from a rare type of cancer. We may utilize large datasets of brain connectomes or genomes collected from healthy individuals such as the ones provided by the UK Biobank to improve the models fitted to the target data. These related sources may aid in the extraction of, say, a low dimensional latent representation of the complex data we seek to study, which can be useful toward dimensionality reduction in the target domain.

Although the term transfer learning has seen increasing popularity in recent years, some of the ideas undergirding it have been around for much longer, and have appeared under various names. Several recent literature reviews aim to help researchers organize and classify these ideas systematically. To name a few, [70], and more recently [69] and [110], provide general overviews on transfer learning methodology, largely from the computer science and electrical engineering literature. [84] focuses on transfer learning in deep neural networks, an appealing use-case due to the data-hungry nature of deep learning models together with the availability of large datasets for training source models. Areas where deep learning is commonly applied such as computer vision often leverage public datasets such as ImageNet [24] or Open Images V4 [52], with millions of datapoints available for training. Meanwhile, [106] focuses on the phenomenon of negative transfer, where the source domains are too different from the target domain, so that applying transfer learning worsens the performance of the target learner. The existence of the negative transfer phenomenon illustrates the importance of choosing an appropriate amount of information to be transferred (the “strength” of transfer) between domains, which remains one of the key challenges in transfer learning and will be one of the focal topics in this survey.

With the exception of [99], none of these literature reviews substantially focus on Bayesian views. While the work of Xuan et al. [99] explicitly overviews Bayesian transfer learning, its scope is limited to probabilistic graphical models. One can argue that the Bayesian paradigm provides a natural framework for how to incorporate prior information from previous datasets within current inferences, and hence provides a canonical umbrella of approaches for transfer learning. In this paper we provide an overview of some highlights of the Bayesian transfer learning literature. Our focus is on describing how different classical Bayesian approaches can be either directly applied or easily adapted to transfer learning problems. In doing so, we contribute various ideas toward answering a central question of transfer learning: how do we determine and enforce optimal information transfer between domains utilizing various Bayesian modeling approaches? Our aim is to contribute a broad view of Bayesian transfer learning, while presenting approaches that help surmount the problem of negative transfer.

The rest of the paper is organized as follows. In the following section, we give formal definitions of transfer learning and related areas, and discuss alternative names for related ideas in the literature. In Section 3 we provide an overview of general Bayesian approaches to transfer learning with specific examples and some applications. In Sections 4 and 5 we provide a brief taxonomy of transfer learning and point out several areas where some specific Bayesian approaches introduced here can be particularly useful. Finally, in Section 6 we present a simulation study comparing one of the Bayesian methods introduced here with frequentist transfer learning competitors. We conclude with a discussion in Section 7.

2 Definition and related areas

While approaches for transferring information across statistical tasks have a rich history, use of the “transfer learning” terminology is relatively recent. Perhaps as a result, there is not yet a standard technical definition of what qualifies as transfer learning. While some authors adopt a narrow definition of transferring parameters between models [41], others welcome broader, more general definitions [70], [110], [84]. In this section we provide one definition of transfer learning to fix ideas for the rest of the article. We then discuss closely related areas. Here by domain we denote the two-element set of the form $\mathcal{D}=\{\mathcal{X},P\}$ , where $\mathcal{X}$ is the feature space and $P$ is the marginal probability distribution of the observations $X\in\mathcal{X}$ collected in a dataset associated with $\mathcal{D}$ . Given a domain $\mathcal{D}$ and its associated label space $\mathcal{Y}$ , [70] define a task on $\mathcal{D}$ as the set $\mathcal{T}=\{\mathcal{Y},f(\cdot)\}$ , where $f$ is a function given by $f=\{(x,y)\mid x\in\mathcal{X},y\in\mathcal{Y}\}$ . In this framework, $f$ is the ground truth, the optimal solution to the task which is not directly observed but whose approximation can be learned from the observed data.

Definition 2.1 (Transfer Learning).

Consider the source domains $\mathcal{D}_{1},\mathcal{D}_{2},\dots,\mathcal{D}_{K}$ with respective associated source tasks $\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{K}$ , as well as the target domain $\mathcal{D}_{0}$ with the associated target task $\mathcal{T}_{0}=\{\mathcal{Y}_{0},f_{0}\}$ , where an approximation to $f_{0}$ can be learned based on the available data $(X_{0},Y_{0})$ with $X_{0}\in\mathcal{X}_{0},Y_{0}\in\mathcal{Y}_{0}$ . Suppose that $\mathcal{D}_{k}\neq\mathcal{D}_{0}$ or $\mathcal{T}_{k}\neq\mathcal{T}_{0}$ for any $k=1,\dots,K$ . Transfer learning refers to algorithms which aim at improving the approximation of $f_{0}$ by incorporating the knowledge from $\mathcal{D}_{1},\mathcal{D}_{2},\dots,\mathcal{D}_{K}$ and $\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{K}$ .

In this setting, by knowledge we mean either: (i) the raw data sampled from the source domains, possibly equipped with labels from the source tasks, (ii) learners pre-trained on the data from source domains and tasks, or (iii) oracle models which have complete and true information on the source domains and the source tasks. Among these, cases (i) and (ii) are the most commonly encountered ones in practice.

Our definition follows and generalizes the conventions used by [70]. Note that in the above definition, the source and target domains are not necessarily different, encompassing cases with a common domain but different tasks. Furthermore, when two domains are different, their feature spaces need not differ. In the deep learning literature the target task is sometimes referred to as the downstream task [78].

By allowing the label space to be (a) discrete, (b) one-dimensional, (c) multidimensional, (d) defining only a partition of a dataset associated with $\mathcal{D}$ without giving a specific meaning to how the labels are used for that purpose, this definition comprises, respectively, (a) classification, (b) univariate regression, (c) multivariate regression and dimensionality reduction, and (d) clustering tasks. Finally, allowing the values of $f$ to be probability distributions naturally lends itself to Bayesian posterior learning.

2.1 Related fields

Another closely related problem which follows this paradigm is multitask learning. Just like transfer learning, its goal is to improve learning on a particular domain based on information from related domains and tasks, but it differs in seeking to simultaneously learn each task jointly on all the domains considered. This may improve the performance across tasks by borrowing information across related tasks and domains, in contrast to using a set of tasks and domains only as means to the end of improving performance on a single target task [70]. While some authors regard multitask learning and transfer learning as separate disciplines [70], [100], [91], others either consider them as the same field [99], or draw a distinction between the two according to different criteria, as in [37]. Often a multitask learning method can be easily adapted to transfer learning [70], [100]. As we will see in the following sections, the Bayesian framework elegantly reconciles these notions in many cases.

Continual or lifelong learning [46], [107], [51], [29] is a popular concept in machine learning, which combines aspects of transfer and multitask learning. In this setting an agent faces a sequence of domain-task pairs over time, with the goal being to utilize previously encountered tasks to learn each new task in a more effective way while maintaining the ability to solve the previous tasks [97]. Continual learning attempts to provide a remedy to the phenomenon of catastrophic forgetting [65] in transfer learning, where the model performs worse on the source tasks after being adjusted to the target task; this commonly occurs in deep learning models [8], [2]. This forgetting can be especially problematic when the number of encountered tasks and model parameters becomes large and it becomes difficult to store the previously encountered datasets and models trained on them.

There is additionally a Bayesian literature on metalearning, or learning to learn [101], [71], [73], [105]. Vanschoren [89] defines metalearning as methods aiming to improve the “configuration” (e.g. model hyperparameters, network architecture in case of deep learning methods, etc.) of the model for the target task by training on metadata. Here metadata refers to information obtained from models trained individually with different configurations; for example, one may vary different aspects of the model and measure its performance via cross validation. While some researchers consider metalearning as distinct from transfer learning [41], following Definition 2.1 we consider it to be a special case of transfer learning. In this case the information from the source domains is utilized by training models with various configurations on these domains and then using the metadata generated from them in improving the model for the target task.

Domain adaptation [108], [7], [32], [95] is another popular term, which is sometimes used interchangeably with transfer learning as in [50]. However, since knowledge can also be transferred between different tasks on the same domain, we view it as a particular case of transfer learning. Additional terminology for concepts closely related to transfer learning includes cooperative learning [109], [26], knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, incremental, and cumulative learning [70].

3 Bayesian approaches to transfer learning

Two fundamental questions which need to be addressed are: (i) how information should be transferred, and (ii) which information should be transferred. There are various approaches to answering these questions and they are often related to the models used for solving the source and target tasks. Determining appropriate information transfer between domains is critical, since transferring inappropriate information can result in large bias and suboptimal performance. In extreme cases, one obtains negative transfer [106], which corresponds to the case in which transferring information decreases performance.

Some of the existing approaches rely on expert knowledge about the domains considered and their relationships, some introduce statistical measures of similarity between domains, while others rely on more flexible model-based or validation-based approaches to the optimal choice of parameters controlling information transfer. In this section we discuss different ideas based on Bayesian methodology which can be used to tackle questions (i)-(ii).

3.1 Shared parameters

One of the most prevalent approaches is to use common parameters in the source and target domains. For exposition, throughout this subsection we assume only one source dataset $X_{S}$ and one target dataset $X_{T}$ . For convenience of notation, by $X_{d}$ we denote both the datapoints and their associated labels (when applicable) for domain $d\in\{S,T\}$ . We parameterize the data likelihood for both source and target domains as $p(X_{S}\mid\theta_{C},\theta_{S})$ and $p(X_{T}\mid\theta_{C},\theta_{T})$ , respectively, where $\theta_{C}$ is the common vector of parameters, shared by both source and target data, while $\theta_{S}$ and $\theta_{T}$ are vectors of parameters unique to the datasets. Let $\pi(\theta_{C})$ , $\pi(\theta_{S})$ , $\pi(\theta_{T})$ be the prior distributions for, respectively, $\theta_{C}$ , $\theta_{S}$ , $\theta_{T}$ .

A simple Bayesian transfer learning approach would compute the posterior for $\theta_{C}$ based on the prior $\pi(\theta_{C})$ and the source data $X_{S}$ via

\begin{aligned}p(\theta_{C}\mid X_{S})&\propto p(X_{S}\mid\theta_{C})\pi(\theta_{C})\\&=\left(\int p(X_{S}\mid\theta_{C},\theta_{S})\pi(\theta_{S})d\theta_{S}\right)\pi(\theta_{C}),\end{aligned}

(1)

and then use $\pi^{*}(\theta_{C},\theta_{T})\propto p(\theta_{C}\mid X_{S})\pi(\theta_{T})$ as the prior for $(\theta_{C},\theta_{T})$ in the analysis of $X_{T}$ to obtain the posterior

p^{*}(\theta_{C},\theta_{T}\mid X_{T})\propto p(X_{T}\mid\theta_{C},\theta_{T})\pi^{*}(\theta_{C},\theta_{T}).

It is straightforward to see that if $X_{T}\perp\!\!\!\perp X_{S}\mid(\theta_{C},\theta_{T})$ and $\theta_{S}\perp\!\!\!\perp\theta_{T}$ a priori, then this is equivalent to obtaining a posterior for $(\theta_{C},\theta_{T})$ based on the data $(X_{T},X_{S})$ with the prior $\pi(\theta_{C})\pi(\theta_{T})$ on $(\theta_{C},\theta_{T})$ , i.e.

p^{*}(\theta_{C},\theta_{T}\mid X_{T})=p(\theta_{C},\theta_{T}\mid X_{T},X_{S}),

(2)

where

p(\theta_{C},\theta_{T}\mid X_{T},X_{S})\propto p(X_{T},X_{S}\mid\theta_{C},\theta_{T})\pi(\theta_{C})\pi(\theta_{T}).

Hence, this approach is equivalent to giving equal weights to the source and target data in computing the posterior of the shared parameters. This is an appropriate approach when the model is correctly specified and the true parameters $\theta_{C}$ are indeed exactly the same in the source and target populations.
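The equivalence in (2) can be checked numerically in a deliberately simplified setting. The sketch below assumes a scalar shared parameter $\theta_{C}$ (a Gaussian mean with known noise variance), no domain-specific parameters, and a conjugate Gaussian prior; the helper `normal_posterior` and all numerical values are illustrative choices, not part of any method surveyed here.

```python
import numpy as np

def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Conjugate update for a Gaussian mean with known noise variance."""
    n = len(data)
    post_prec = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + np.sum(data) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec

rng = np.random.default_rng(0)
theta_true, noise_var = 1.5, 4.0  # illustrative values, shared by both domains
x_source = rng.normal(theta_true, np.sqrt(noise_var), size=500)
x_target = rng.normal(theta_true, np.sqrt(noise_var), size=20)

# Step 1: posterior for the shared parameter from the source data alone.
m_s, v_s = normal_posterior(0.0, 10.0, x_source, noise_var)
# Step 2: use that posterior as the prior when analysing the target data.
m_seq, v_seq = normal_posterior(m_s, v_s, x_target, noise_var)
# Reference: one update on the pooled data with the original prior.
pooled = np.concatenate([x_source, x_target])
m_joint, v_joint = normal_posterior(0.0, 10.0, pooled, noise_var)

print(m_seq, m_joint)
print(v_seq, v_joint)
```

Both routes give the same posterior mean and variance up to floating-point error, which is the scalar analogue of (2): every source observation counts exactly as much as a target observation.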

However, in practice, it is likely that the assumption of exactly equivalent values of $\theta_{C}$ is an oversimplification. As the true values of $\theta_{C}$ vary more widely between the source and target domains, the above approach can have suboptimal performance, particularly when the source data sample size is larger than that of the target, which is often the case. A simple and commonly used heuristic solution is to specify the prior for $\theta_{C}$ in the target posterior as a variance-inflated version of the posterior $p(\theta_{C}\mid X_{S})$ from the source data analysis.

Shwartz-Ziv et al. [78] apply a related approach to Bayesian deep neural networks (DNNs). First a Gaussian approximation to the posterior for the DNN fitted to the source data is obtained. The authors assume the “feature extractor” layers of the DNN are common to the source and target DNN. The variance of the Gaussian approximation to the source posterior for the weights in these layers is scaled up by a constant factor and then used as a prior for the feature extractor component in the target data DNN. The remaining weights characterizing the “head” of the DNN are given an isotropic Gaussian prior. To learn an appropriate amount of information sharing between the source and target domain, the scaling factor is chosen on held-out validation data from the target training dataset.
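The idea of inflating the source posterior variance and tuning the inflation factor on held-out target data can be illustrated without a neural network. The following is a toy sketch under strong assumptions: a scalar Gaussian mean replaces the feature extractor weights, the candidate factors in `scales` are arbitrary, and the held-out score is the Gaussian posterior predictive log density. It is not the procedure of [78], only the same selection logic in a conjugate model.

```python
import numpy as np

def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Conjugate update for a Gaussian mean with known noise variance."""
    post_prec = 1.0 / prior_var + len(data) / noise_var
    post_mean = (prior_mean / prior_var + np.sum(data) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec

def log_predictive(data, post_mean, post_var, noise_var):
    """Sum of log predictive densities N(x | post_mean, post_var + noise_var)."""
    var = post_var + noise_var
    return float(np.sum(-0.5 * np.log(2 * np.pi * var)
                        - 0.5 * (data - post_mean) ** 2 / var))

rng = np.random.default_rng(1)
noise_var = 1.0
x_source = rng.normal(0.0, 1.0, size=2000)  # large source sample
x_target = rng.normal(0.8, 1.0, size=40)    # small target sample, shifted mean
x_train, x_val = x_target[:20], x_target[20:]

# Source posterior, later reused as a variance-inflated prior.
m_s, v_s = normal_posterior(0.0, 100.0, x_source, noise_var)

scales = [1.0, 10.0, 100.0, 1000.0, 10000.0]  # hypothetical candidate factors
scores = {}
for c in scales:
    m, v = normal_posterior(m_s, c * v_s, x_train, noise_var)
    scores[c] = log_predictive(x_val, m, v, noise_var)

best_scale = max(scores, key=scores.get)
print(best_scale, scores)
```

A factor of 1 reproduces the equal-weight analysis, while very large factors approach ignoring the source data, so the validation score picks a point on that spectrum.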

An alternative approach to controlling the influence of the source data on the target domain posterior distribution is the power prior [18], [19], [44], [43], [27]. The power prior for the target parameters is proportional to an initial prior multiplied by the source data likelihood raised to a fractional power. The fractional power serves to diminish the information provided by the source data likelihood. In our transfer learning setting, the joint prior for $(\theta_{C},\theta_{T})$ in the target model is given by

\pi_{a_{0}}(\theta_{C},\theta_{T}\mid X_{S})\propto p(X_{S}\mid\theta_{C})^{a_{0}}\pi(\theta_{C})\pi(\theta_{T}),

(3)

where the strength of information transfer ranges from no transfer at $a_{0}=0$ to “full” transfer at $a_{0}=1$. In the latter case, the source data are given the same weight as the target data. This setup generalizes the partial borrowing power prior of [19], where the source model parameters are a subset of those used in the target domain.

Several appealing theoretical properties of the power prior were established in [44] for the case when all the parameters are shared. In that case the posterior for $\theta$ reduces to

\pi_{a_{0}}(\theta\mid X_{T},X_{S})\propto p(X_{T}\mid\theta)p(X_{S}\mid\theta)^{a_{0}}\pi(\theta),

(4)

where $\theta$ determines the distribution of both source and target data. Ibrahim et al. [44] show that for a fixed $a_{0}$ , (4) minimizes the weighted sum of Kullback–Leibler (KL) divergences between the posterior with no information transfer and one with full information transfer, i.e.

\pi_{a_{0}}(\theta\mid X_{T},X_{S})=\operatorname*{arg\,min}_{g}\{(1-a_{0})\mathrm{KL}(g\,\|\,f_{0})+a_{0}\mathrm{KL}(g\,\|\,f_{1})\},

where $f_{0}$ and $f_{1}$ are probability densities given by

f_{0}(\theta)=\pi_{0}(\theta\mid X_{S},X_{T})\propto p(X_{T}\mid\theta)\pi(\theta)

and

f_{1}(\theta)=\pi_{1}(\theta\mid X_{S},X_{T})\propto p(X_{T}\mid\theta)p(X_{S}\mid\theta)p(\theta)^{0}\pi(\theta).
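For intuition, the posterior (4) has a closed form when $\theta$ is a Gaussian mean with known noise variance and a Gaussian prior: raising the source likelihood to the power $a_{0}$ is the same as counting each source observation as a fraction $a_{0}$ of a target observation. The sketch below assumes this conjugate setting; the function name and all numerical values are illustrative.

```python
import numpy as np

def power_prior_posterior(x_target, x_source, a0, prior_mean, prior_var, noise_var):
    """Posterior (4) for a Gaussian mean: the source likelihood is raised to a0."""
    precision = 1.0 / prior_var + (len(x_target) + a0 * len(x_source)) / noise_var
    mean = (prior_mean / prior_var
            + (np.sum(x_target) + a0 * np.sum(x_source)) / noise_var) / precision
    return mean, 1.0 / precision

rng = np.random.default_rng(2)
x_source = rng.normal(0.0, 1.0, size=500)  # source mean differs from target mean
x_target = rng.normal(0.5, 1.0, size=25)

for a0 in (0.0, 0.1, 0.5, 1.0):
    mean, var = power_prior_posterior(x_target, x_source, a0, 0.0, 10.0, 1.0)
    print(f"a0={a0:.1f}  posterior mean={mean:.3f}  posterior sd={np.sqrt(var):.3f}")
```

At $a_{0}=0$ the result coincides with $f_{0}$ (target data only) and at $a_{0}=1$ with $f_{1}$ (pooled data); intermediate values interpolate between the two, with the posterior variance shrinking as $a_{0}$ grows.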

Just like in the other transfer learning approaches, choosing the right amount of information to be transferred—in this case governed by the value of $a_{0}$ —is a key challenge. One approach is to treat $a_{0}$ as fixed and perform sensitivity analysis over a set of values which ideally should include $a_{0}=0$ and $a_{0}=1$ , as recommended by [43]. In generalized linear models (GLMs) the choice of $a_{0}$ can be better informed with the help of model selection criteria such as those proposed in [44], [43], [42], and [83]. Ibrahim et al. [44] propose a penalized likelihood-type criterion (PLC) that chooses $a_{0}\in(0,1]$ to be the minimizer of

-2\log\int p(X_{T}\mid\theta)p(X_{S}\mid\theta)^{a_{0}}\pi(\theta)d\theta+\frac{\log(n_{S})}{a_{0}},

where $n_{S}$ is the sample size of the source dataset.
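As a rough illustration of how the criterion above could be evaluated, the sketch below computes it over a grid of $a_{0}$ values for a scalar Gaussian mean, approximating the integral over $\theta$ by a Riemann sum on a fine grid. The model, the prior $\mathcal{N}(0,10)$, the grid ranges and the simulated data are all assumptions made for the example; [44] develop the criterion for GLMs.

```python
import numpy as np

def log_lik(theta_grid, data, noise_var):
    """Gaussian log-likelihood of `data` at each grid value of theta."""
    resid = data[:, None] - theta_grid[None, :]
    return np.sum(-0.5 * np.log(2 * np.pi * noise_var)
                  - 0.5 * resid ** 2 / noise_var, axis=0)

def plc(a0, ll_target, ll_source, log_prior, d_theta, n_source):
    """-2 log marginal (Riemann sum, computed stably) plus log(n_S) / a0."""
    log_integrand = ll_target + a0 * ll_source + log_prior
    m = log_integrand.max()
    log_marginal = m + np.log(np.sum(np.exp(log_integrand - m)) * d_theta)
    return -2.0 * log_marginal + np.log(n_source) / a0

rng = np.random.default_rng(3)
noise_var = 1.0
x_source = rng.normal(0.0, 1.0, size=200)
x_target = rng.normal(0.2, 1.0, size=30)

theta = np.linspace(-20.0, 20.0, 8001)
d_theta = theta[1] - theta[0]
log_prior = -0.5 * np.log(2 * np.pi * 10.0) - 0.5 * theta ** 2 / 10.0  # N(0, 10)

ll_t = log_lik(theta, x_target, noise_var)
ll_s = log_lik(theta, x_source, noise_var)

a0_grid = np.linspace(0.05, 1.0, 20)  # a0 must be strictly positive
values = {float(a): plc(a, ll_t, ll_s, log_prior, d_theta, len(x_source))
          for a in a0_grid}
best_a0 = min(values, key=values.get)
print(best_a0)
```

The penalty term $\log(n_{S})/a_{0}$ grows without bound as $a_{0}\to 0$, which is why the grid starts above zero.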

Alternatively, we can treat $a_{0}$ as random and in turn assign it a prior distribution. We can either directly define the joint prior for $(\theta,a_{0})$ as in [43], i.e.

\pi(\theta,a_{0}\mid X_{S})\propto p(X_{S}\mid\theta)^{a_{0}}\pi(\theta)\pi(a_{0}),

(5)

or the normalized power prior as in [27]

\pi(\theta,a_{0}\mid X_{S})=\pi(\theta\mid X_{S},a_{0})\pi(a_{0})=\frac{p(X_{S}\mid\theta)^{a_{0}}\pi(\theta)}{\int p(X_{S}\mid\theta^{\prime})^{a_{0}}\pi(\theta^{\prime})d\theta^{\prime}}\pi(a_{0}).

(6)

The normalized power prior first specifies a marginal prior for $a_{0}$ and then a conditional prior for $\theta$ given $a_{0}$ .
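The practical difference between (5) and (6) is the $a_{0}$-dependent normalizing constant in the denominator of (6). A minimal sketch of its effect, assuming a scalar Gaussian mean, a $\mathcal{N}(0,10)$ prior on $\theta$, a uniform prior on a discrete grid of $a_{0}$ values, and grid-based integration over $\theta$ (all choices made only for illustration), computes the marginal posterior of $a_{0}$ given the target data under both specifications.

```python
import numpy as np

def log_lik(theta_grid, data, noise_var):
    """Gaussian log-likelihood of `data` at each grid value of theta."""
    resid = data[:, None] - theta_grid[None, :]
    return np.sum(-0.5 * np.log(2 * np.pi * noise_var)
                  - 0.5 * resid ** 2 / noise_var, axis=0)

def log_integral(log_f, d_theta):
    """Stable log of a Riemann-sum approximation to the integral of exp(log_f)."""
    m = log_f.max()
    return m + np.log(np.sum(np.exp(log_f - m)) * d_theta)

def normalize(log_w):
    """Turn unnormalized log weights into probabilities on the a0 grid."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

rng = np.random.default_rng(6)
x_source = rng.normal(0.0, 1.0, size=100)
x_target = rng.normal(0.3, 1.0, size=30)

theta = np.linspace(-20.0, 20.0, 8001)
d_theta = theta[1] - theta[0]
log_prior = -0.5 * np.log(2 * np.pi * 10.0) - 0.5 * theta ** 2 / 10.0
ll_t = log_lik(theta, x_target, 1.0)
ll_s = log_lik(theta, x_source, 1.0)

a0_grid = np.linspace(0.0, 1.0, 21)  # uniform prior over these grid points

# Joint specification (5): integrate theta out of p(X_T|theta) p(X_S|theta)^a0 pi(theta).
log_joint = np.array([log_integral(ll_t + a * ll_s + log_prior, d_theta) for a in a0_grid])
# Normalized specification (6): additionally divide by the integral of p(X_S|theta)^a0 pi(theta).
log_norm = np.array([log_integral(a * ll_s + log_prior, d_theta) for a in a0_grid])

post_joint = normalize(log_joint)
post_normalized = normalize(log_joint - log_norm)
print(post_joint.round(3))
print(post_normalized.round(3))
```

Comparing the two arrays shows how much the normalizing constant changes what the data appear to say about $a_{0}$.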

Taking $\pi(a_{0})$ to be a beta or Dirichlet distribution, depending on the number of source domains, is a natural choice, with theoretical support established in [44] under fixed $a_{0}$ that extends to the random $a_{0}$ case under (5). Other priors with appropriate support, such as a gamma or Gaussian distribution truncated to $[0,1]$, can also be utilized [18]. However, it is not clear how the data inform an appropriate value for $a_{0}$, since $a_{0}$ is not a traditional parameter. It may be that this approach can represent prior uncertainty in $a_{0}$ but will not adapt to information in the data to concentrate on the optimal amount of borrowing from the source data.

3.2 Hierarchical models and random effects

The approaches mentioned in the previous section rely on sharing parameters in the likelihood specification for source and target data. Alternatively, we can allow the parameters of the source and target data models to differ, instead imposing the assumption that they come from a jointly specified or identical prior distribution acting as a bridge for information flow between the domains.

As a simple example, consider the Gaussian linear model. Let the datasets $(\boldsymbol{X}_{1},\boldsymbol{y}_{1}),\ldots,(\boldsymbol{X}_{K},\boldsymbol{y}_{K})$ denote the source data and $(\boldsymbol{X}_{0},\boldsymbol{y}_{0})$ be the target data, where $\boldsymbol{X}_{d}\in\mathbb{R}^{n_{d}\times p}$ and $\boldsymbol{y}_{d}\in\mathbb{R}^{n_{d}}$ for $d\in\{0,1,\ldots,K\}$. Under this model we assume that

\boldsymbol{y}_{d}=\boldsymbol{X}_{d}\boldsymbol{\beta}_{d}+\boldsymbol{\epsilon}_{d},\quad\boldsymbol{\epsilon}_{d}\sim\mathcal{N}(0,\sigma_{d}^{2}I_{n_{d}}),

(7)

with the prior on the coefficients given by $\boldsymbol{\beta}_{d}\sim\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ for $d\in\{0,1,\ldots,K\}$ . Here, the domain-specific parameters $\boldsymbol{\beta}_{d}$ are drawn from a common prior distribution $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ , which is often referred to as a random effects distribution. Model (7) is a common type of hierarchical regression model for data nested within groups (domains in our terminology). Data from all the domains are used to inform the random effects mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ , inducing borrowing of information.

We can either treat $\sigma_{0},\sigma_{1},\ldots,\sigma_{K}$ , $\boldsymbol{\mu}$ , and $\boldsymbol{\Sigma}$ as fixed, taking a frequentist approach to inference, or specify hyperpriors for them to obtain a Bayesian hierarchical model. In either case, the random effects covariance $\boldsymbol{\Sigma}$ controls how much information transfer there is, analogously to $a_{0}$ in the power prior approach. Large covariance implies less shrinkage of the $\boldsymbol{\beta}_{d}$ values towards the random effects mean $\boldsymbol{\mu}$ . In practice, the prior for the random effects mean and covariance will be updated based on information in the data about the variability in the regression coefficients across domains.
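The role of $\boldsymbol{\Sigma}$ as the knob controlling transfer is easiest to see when $\sigma_{0}^{2}$, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are held fixed, since the posterior of $\boldsymbol{\beta}_{0}$ under (7) is then Gaussian with a closed-form mean. The sketch below assumes that fixed-hyperparameter case; `mu` is a stand-in for a random effects mean that would in practice be informed by the source domains, and the simulated data are illustrative.

```python
import numpy as np

def posterior_beta(X, y, sigma2, mu, Sigma):
    """Posterior mean and covariance of beta_d for fixed sigma_d^2, mu and Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    cov = np.linalg.inv(X.T @ X / sigma2 + Sigma_inv)
    mean = cov @ (X.T @ y / sigma2 + Sigma_inv @ mu)
    return mean, cov

rng = np.random.default_rng(3)
n0, p = 15, 3
X0 = rng.normal(size=(n0, p))
y0 = X0 @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=1.0, size=n0)
mu = np.zeros(p)  # assumed random effects mean (hypothetical value)

ols = np.linalg.solve(X0.T @ X0, X0.T @ y0)
tight, _ = posterior_beta(X0, y0, 1.0, mu, 1e-6 * np.eye(p))  # strong shrinkage
loose, _ = posterior_beta(X0, y0, 1.0, mu, 1e6 * np.eye(p))   # almost no shrinkage

print("OLS:        ", ols.round(3))
print("small Sigma:", tight.round(3))
print("large Sigma:", loose.round(3))
```

With a very small $\boldsymbol{\Sigma}$ the target coefficients collapse onto $\boldsymbol{\mu}$, and with a very large $\boldsymbol{\Sigma}$ they revert to the target-only least squares fit.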

For fixed $\sigma_{0},\sigma_{1},\ldots,\sigma_{K}$ , $\boldsymbol{\mu}$ , and $\boldsymbol{\Sigma}$ , [17] showed a direct analytic relationship between $\boldsymbol{\Sigma}$ and the tuning parameter $a_{0}$ in the power prior approach, establishing duality between these methods for the Gaussian linear model.

3.3 Shared latent space

Rather than imposing shared parameters on data generating processes for source and target domains, whether explicitly in the likelihoods or at higher levels in a hierarchical model, we can also specify or seek to learn a shared latent space. This approach can be particularly useful in more complex datasets with large numbers of dimensions.

3.3.1 Factor analysis

In the Bayesian context many such examples can be found in the factor analysis literature. Under the classical factor model specification outlined in [60] the $i$ -th observation $\boldsymbol{y}_{i}\in\mathbb{R}^{p}$ is given by

\boldsymbol{y}_{i}=\boldsymbol{\Lambda}\boldsymbol{\eta}_{i}+\boldsymbol{\epsilon}_{i},

where $\boldsymbol{\eta}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{q})$ are the vectors of latent factors, $\boldsymbol{\Lambda}\in\mathbb{R}^{p\times q}$ is the factor loading matrix, $\boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Delta})$ are random noise terms with $\boldsymbol{\Delta}=\textnormal{diag}(\delta_{1}^{2},\dots,\delta_{p}^{2})$, and $\boldsymbol{\eta}_{i}$, $\boldsymbol{\epsilon}_{j}$ are independent for any $i,j$. It is commonly assumed that $q\ll p$, i.e. the high dimensional data can be explained using a latent structure of much lower dimensional factors. This model can be equivalently written as a Gaussian distribution with a constrained covariance structure, i.e.

\boldsymbol{y}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}),\quad\boldsymbol{\Sigma}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}+\boldsymbol{\Delta}.

The mean-zero assumption on $\boldsymbol{y}_{i}$ comes from the standard practice of centering the data and does not limit the generality of the model.
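The equivalence between the latent factor representation and the constrained covariance $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}+\boldsymbol{\Delta}$ can be checked by simulation. The following sketch draws data from the factor model with arbitrary illustrative dimensions and parameter values and compares the sample covariance with the model-implied one.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, n = 6, 2, 200_000                    # illustrative dimensions, q << p
Lambda = rng.normal(size=(p, q))           # factor loading matrix
delta2 = rng.uniform(0.5, 1.5, size=p)     # diagonal of Delta

eta = rng.normal(size=(n, q))                    # latent factors, N(0, I_q)
eps = rng.normal(size=(n, p)) * np.sqrt(delta2)  # noise, N(0, Delta)
Y = eta @ Lambda.T + eps                         # rows are y_i = Lambda eta_i + eps_i

Sigma_model = Lambda @ Lambda.T + np.diag(delta2)
Sigma_sample = Y.T @ Y / n                 # mean is zero by construction
max_abs_err = np.max(np.abs(Sigma_sample - Sigma_model))
print(max_abs_err)
```

The discrepancy is pure Monte Carlo error and shrinks as `n` grows.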

In [23] and [90] this setup is generalized to the situation with data coming from multiple domains by letting

\boldsymbol{y}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_{k}),\quad\boldsymbol{\Sigma}_{k}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}+\boldsymbol{\Phi}_{k}\boldsymbol{\Phi}_{k}^{T}+\boldsymbol{\Delta}_{k},

where $\boldsymbol{\Delta}_{k}=\textnormal{diag}(\delta_{k,1}^{2},\dots,\delta_{k,p}^{2})$ is the error variance matrix, $\boldsymbol{\Lambda}\in\mathbb{R}^{p\times q}$, $\boldsymbol{\Phi}_{k}\in\mathbb{R}^{p\times q_{k}}$ for domain $k=1,\dots,K$. Here $\boldsymbol{\Phi}_{k}\boldsymbol{\Phi}_{k}^{T}$ accounts for the domain-specific dependencies between the datapoints and $\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}$ is the underlying shared covariance structure which allows for information transfer between domains.

Analogously to the single domain case above, this model has the equivalent representation

\boldsymbol{y}_{k,i}=\boldsymbol{\Lambda}\boldsymbol{\eta}_{k,i}+\boldsymbol{\Phi}_{k}\boldsymbol{\zeta}_{k,i}+\boldsymbol{\epsilon}_{k,i},

(8)

with

\boldsymbol{\eta}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{q}),\quad\boldsymbol{\zeta}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{q_{k}}),\quad\boldsymbol{\epsilon}_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Delta}_{k}),

where $\boldsymbol{\eta}_{k,i}$ is a latent factor in the $q$ dimensional shared subspace, $\boldsymbol{\zeta}_{k,i}$ are $q_{k}$ dimensional domain-specific latent factors and $\boldsymbol{\epsilon}_{k,i}$ are error terms. Thus, $\boldsymbol{\Lambda}$ is the shared factor loading matrix and $\boldsymbol{\Phi}_{k}$ are $p\times q_{k}$ domain-specific factor loading matrices. In this model the transfer of knowledge between domains occurs through information borrowing in the estimation of $\boldsymbol{\Lambda}$ .

The above model allows for a lot of flexibility between the domains, but suffers from an identifiability issue known as information switching: the data can be fitted equally well when shared columns of the factor loading matrix are moved from $\boldsymbol{\Lambda}$ into each of the $\boldsymbol{\Phi}_{k}$. De Vito et al. [23], [90] solve this problem by restricting the augmented matrix $[\boldsymbol{\Lambda}\ \boldsymbol{\Phi}_{1}\ \dots\ \boldsymbol{\Phi}_{K}]$ to be lower-triangular. One limitation of this approach is that it imposes an ordering on the domains. Often when we have multiple source domains there is no natural ordering between them, and hence this approach would not be preferred in such a scenario.
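Information switching can be seen directly from the covariance decomposition $\boldsymbol{\Sigma}_{k}=\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{T}+\boldsymbol{\Phi}_{k}\boldsymbol{\Phi}_{k}^{T}+\boldsymbol{\Delta}_{k}$: moving a column out of $\boldsymbol{\Lambda}$ and appending it to every $\boldsymbol{\Phi}_{k}$ leaves each $\boldsymbol{\Sigma}_{k}$ unchanged. The sketch below verifies this with randomly generated matrices; dimensions and values are illustrative only.

```python
import numpy as np

def domain_cov(Lambda, Phi_k, delta2_k):
    """Sigma_k = Lambda Lambda^T + Phi_k Phi_k^T + Delta_k."""
    return Lambda @ Lambda.T + Phi_k @ Phi_k.T + np.diag(delta2_k)

rng = np.random.default_rng(5)
p, q, K = 5, 3, 2
q_k = [2, 1]
Lambda = rng.normal(size=(p, q))
Phis = [rng.normal(size=(p, qk)) for qk in q_k]
delta2 = [rng.uniform(0.5, 1.5, size=p) for _ in range(K)]

# Move the last shared column into every domain-specific loading matrix.
Lambda_alt = Lambda[:, :-1]
Phis_alt = [np.hstack([Phi, Lambda[:, -1:]]) for Phi in Phis]

same = [
    np.allclose(domain_cov(Lambda, Phis[k], delta2[k]),
                domain_cov(Lambda_alt, Phis_alt[k], delta2[k]))
    for k in range(K)
]
print(same)
```

Both parameterizations imply identical Gaussian likelihoods in every domain, so the data alone cannot tell a shared factor from one replicated in all domain-specific parts.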

A recent paper proposes a different solution to the problem of information switching by restricting the domain-specific factor loading matrices to be linear transforms of the shared factor loading matrix and imposing a shared covariance of error terms between the domains [15]. That is, they assume $\boldsymbol{\Phi}_{k}=\boldsymbol{\Lambda}\boldsymbol{A}_{k}$, where $\boldsymbol{A}_{k}\in\mathbb{R}^{q\times q_{k}}$ and $\boldsymbol{\Delta}_{k}=\boldsymbol{\Delta}=\textnormal{diag}(\delta_{1}^{2},\dots,\delta_{p}^{2})$ for every $k=1,\dots,K$ in (8). The authors show that under any non-degenerate continuous prior on $\boldsymbol{A}_{k}$, information switching almost surely does not occur provided that $\sum_{k=1}^{K}q_{k}\leq q$.

This result provides some guidance for choosing the dimensions of shared and domain-specific latent spaces since these are not known in most practical applications. These dimensions influence the amount of information being transferred between domains since larger values of $q_{1},\dots,q_{K}$ give more flexibility to the domain-specific latent features, thus reducing the influence of shared latent factors. One approach to choosing these dimensions would be to put priors on $q_{1},\dots,q_{K}$ and $q$ and use reversible-jump algorithms outlined in [35]. However, such algorithms can be computationally prohibitive.

Chandra et al. [15] provide an alternative solution by fixing $q,q_{1},\dots,q_{K}$ at an upper bound, and then utilizing appropriate priors to shrink the excess columns in $\boldsymbol{\Lambda},\boldsymbol{A}_{1},\dots,\boldsymbol{A}_{K}$. Specifically, they obtain approximate singular values and eigenvectors of the pooled dataset via the augmented implicitly restarted Lanczos bidiagonalization [3] and choose the smallest $\hat{q}$ which explains 95% of the variability in the data. Then they fix $\hat{q}_{k}=\hat{q}/K$ for $k=1,\ldots,K$ to ensure that information switching will not occur. The strength of information transfer between domains is thus directly influenced by the amount of shrinkage induced by the priors on the factor loading matrices. In [15], the fixed priors $\textnormal{vec}(\boldsymbol{\Lambda})\sim\textnormal{DL}(1/2)$ and $a_{k,i,j}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,1)$ are used, where $a_{k,i,j}$ is the $(i,j)$-th entry of $\boldsymbol{A}_{k}$ and DL denotes the Dirichlet-Laplace distribution [6]. Thus a possible extension in this line of work would be to develop methods for choosing these hyperparameters adaptively, depending on how related the domains are.
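The dimension-selection rule described above can be sketched as follows. This is a simplified stand-in rather than the implementation of [15]: it uses a full SVD from `np.linalg.svd` instead of the augmented implicitly restarted Lanczos bidiagonalization, measures explained variability by cumulative squared singular values of the centered pooled data, and rounds $\hat{q}/K$ down to an integer. The function name `choose_dims` and the simulated data are hypothetical.

```python
import numpy as np

def choose_dims(datasets, var_explained=0.95):
    """Smallest q_hat explaining `var_explained` of pooled variability, and q_hat // K."""
    pooled = np.vstack(datasets)
    pooled = pooled - pooled.mean(axis=0)            # center the pooled data
    s = np.linalg.svd(pooled, compute_uv=False)      # singular values, descending
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    q_hat = int(np.searchsorted(cum, var_explained) + 1)
    q_k_hat = q_hat // len(datasets)                 # integer division is an assumption
    return q_hat, q_k_hat

rng = np.random.default_rng(7)
p, true_rank = 20, 4
loadings = rng.normal(size=(p, true_rank))
datasets = [
    rng.normal(size=(n_k, true_rank)) @ loadings.T + 0.01 * rng.normal(size=(n_k, p))
    for n_k in (150, 100)                            # K = 2 domains
]

q_hat, q_k_hat = choose_dims(datasets)
print(q_hat, q_k_hat)
```

Because $K\lfloor\hat{q}/K\rfloor\leq\hat{q}$, the resulting dimensions satisfy the condition $\sum_{k=1}^{K}q_{k}\leq q$ under which information switching is ruled out.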

3.3.2 Mixture models

Latent space models are commonly used as a dimensionality reduction tool, including when dealing with non-standard data structures such as networks [40], [9]. As a vignette showing how mixture models can be used in combination with latent space models for flexible transfer learning, we focus on the approach proposed in [28]. They were motivated by data on brain networks for individuals in different groups.

Specifically, given $n$ observed networks each belonging to one of $K$ groups and consisting of $V$ labelled vertices, denote the $i$ -th network together with its group label as $\{y_{i},\boldsymbol{\mathcal{L}}(\boldsymbol{A}_{i})\}$ , where $\boldsymbol{A}_{i}\in\{0,1\}^{V\times V}$ is the adjacency matrix and $\boldsymbol{\mathcal{L}}(\boldsymbol{A}_{i})\in\{0,1\}^{V(V-1)/2}$ denotes the lower triangular entries

(A_{i[2,1]},\dots,A_{i[V,1]},A_{i[3,2]},\dots,A_{i[V,2]},\dots,A_{i[V,V-1]})^{T}

of $\boldsymbol{A}_{i}$ . We discard the main diagonal and the upper triangular part of $\boldsymbol{A}_{i}$ since the network is an undirected graph and self-relationships of the nodes are not of interest. In [28] subjects fall into a low and high creativity group, so we have $K=2$ domains. The network representation $\boldsymbol{\mathcal{L}}(\boldsymbol{A})$ conditional on the group membership $y$ is modeled as

\mathbb{P}(\boldsymbol{\mathcal{L}}(\boldsymbol{A}_{i})=\boldsymbol{a}\mid y=k)=\sum_{h=1}^{H}\nu_{h}^{(k)}\prod_{l=1}^{V(V-1)/2}\left(\pi_{l}^{(h)}\right)^{a_{l}}\left(1-\pi_{l}^{(h)}\right)^{1-a_{l}}

for any $\boldsymbol{a}\in\{0,1\}^{V(V-1)/2}$ with the probability vector

\boldsymbol{\pi}^{(h)}=\left(\pi_{1}^{(h)},\dots,\pi_{V(V-1)/2}^{(h)}\right)^{T}\in(0,1)^{V(V-1)/2}

in the $h$ -th mixture component given by

{\pi}^{(h)}_{l}=\left[1+\exp(-{Z}_{l}-{D}^{(h)}_{l})\right]^{-1},

with

\boldsymbol{D}^{(h)}=\boldsymbol{\mathcal{L}}(\boldsymbol{X}^{(h)}\boldsymbol{\Lambda}^{(h)}\boldsymbol{X}^{(h)T}),\quad h=1,\dots,H,

where $\boldsymbol{X}^{(h)}\in\mathbb{R}^{V\times R}$ , $\boldsymbol{\Lambda}^{(h)}=\textnormal{diag}(\lambda_{1}^{(h)},\dots,\lambda_{R}^{(h)})$ with $\lambda_{1}^{(h)},\dots,\lambda_{R}^{(h)}\geq 0$ , and $\boldsymbol{Z}\in\mathbb{R}^{V(V-1)/2}$ .

The model supposes that there are $H$ different brain structure “types”. The probability of an edge between the $l$ -th pair of brain regions follows a logistic model having an intercept $Z_{l}$ characterizing the baseline log odds of a connection and a low-rank deviation that differs according to the individual’s brain type. To enable information transfer across the creativity groups (domains), the model assumes the brain structure types do not differ across the groups (referred to as “common atoms” in the mixture modeling literature). However, the proportion of individuals having brain type $h$ , $\nu_{h}^{(y)}$ , does differ across domains $y=1,\ldots,K$ .
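To make the structure of this likelihood concrete, the sketch below evaluates the edge probabilities $\boldsymbol{\pi}^{(h)}$ and the log-probability of one network under group-specific weights. It is an illustration under stated assumptions, not the code of [28]: the function names are hypothetical and all parameters are treated as known inputs rather than sampled from a posterior.

```python
import numpy as np

def lower_tri(M):
    """Stack the strictly lower-triangular entries column by column."""
    V = M.shape[0]
    return np.concatenate([M[j + 1:, j] for j in range(V - 1)])

def edge_probs(Z, X, lam):
    """Edge probabilities pi^(h) for one mixture component."""
    D = lower_tri(X @ np.diag(lam) @ X.T)          # low-rank deviation D^(h)
    return 1.0 / (1.0 + np.exp(-Z - D))            # logistic link

def network_log_prob(a, nu_k, pis):
    """log P(L(A) = a | y = k) under the common-atoms mixture."""
    terms = [
        np.log(nu) + np.sum(a * np.log(pi) + (1 - a) * np.log1p(-pi))
        for nu, pi in zip(nu_k, pis)
    ]
    return np.logaddexp.reduce(terms)              # log-sum-exp over components
```

Only `nu_k` changes between groups; the list `pis` of component-level edge probabilities is shared, which is where the borrowing of information across domains enters.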

Although the goal in [28] was inference on group differences, this model can be used directly for transfer learning from source domains to a target domain, with the source data enabling more accurate estimation of the shared network types. In addition, the baseline log-odds of an edge between each pair of nodes is also shared across the groups, leading to information sharing about common topological properties of the graphs, including block structures, homophily behaviors and transitive edge patterns [40]. An important application would be transferring information from large brain imaging repositories, such as the Human Connectome Project (HCP) and UK Biobank, to small neuroimaging studies in targeted populations.

In [58] shared kernels are used to model complex distributions of multiple variables across different domains. The motivating applications are studies investigating how DNA methylation profiles vary according to cancer subtype. For samples $i=1,\ldots,n$ , data consist of $\boldsymbol{x}_{i}=(x_{i1},\ldots,x_{ip})^{T}$ with $x_{ij}$ denoting the methylation level at site $j$ , for $j=1,\ldots,p$ with $p$ very large (e.g., $p=450,000$ ) and $y_{i}\in\{1,\ldots,K\}$ denoting the group membership. The density of the data in group $k$ for the $j$ th variable is $f_{j}^{(k)}(\cdot)=\sum_{h=1}^{H}\nu_{jh}^{(k)}\mathcal{K}(\cdot;\boldsymbol{\theta}_{h})$ , with $\mathcal{K}(\cdot;\boldsymbol{\theta})$ a family of densities parameterized by $\boldsymbol{\theta}$ and $\nu_{jh}^{(k)}$ a probability weight on kernel $h$ specific to site $j$ and group $k$ .

Although the motivation in [58] is testing for differences in methylation between different groups, the proposed approach can be directly applied to transfer learning focused on inferring the marginal densities of very high-dimensional data within a particular domain. The data from all the domains are used in inferring the shared kernel parameters $\boldsymbol{\theta}_{1},\ldots,\boldsymbol{\theta}_{H}$ and further borrowing of information occurs through a hierarchical model for the weights $\{\nu_{jh}^{(k)}\}$ . Even in using shared kernels, this approach allows highly flexible differences in distribution across the groups.

There is a rich literature on alternative Bayesian mixture models for borrowing of information across grouped data, while also allowing distinct characteristics of each group. Suppose we let $\boldsymbol{x}_{i}$ denote feature data for subject $i$ with $y_{i}\in\{1,\ldots,K\}$ denoting the subject’s group membership. Then, a common approach is to incorporate subject-specific parameters $\boldsymbol{\theta}_{i}$ within the likelihood function for $\boldsymbol{x}_{i}$ and then let $\boldsymbol{\theta}_{i}\sim P_{y_{i}}$ , with the collection of group-specific random effects distributions $(P_{1},\ldots,P_{K})\sim\Pi$ given an appropriate prior. Popular choices of $\Pi$ include the hierarchical Dirichlet process (HDP) [86] and nested Dirichlet process (NDP) [1], both of which fall within the broad class of hierarchical processes [10]. These approaches characterize each $P_{k}$ as almost surely discrete while incorporating statistical dependence between $P_{k}$ and $P_{l}$ for all $k\neq l$ , leading to dependence in clustering.

Alternatively, analogously to the multi-group factor models of [23], [90], and [15], Müller et al. [68] modeled the group-specific random effects distributions $P_{k}$ as a mixture of a common cross-group distribution $P_{0}$ and group-specific distributions $Q_{k}$ , with a hyperprior chosen for the mixture weight on $P_{0}$ to allow data adaptivity. This approach, and the above approaches, assume a priori exchangeability across the groups. In the future, it will be interesting to adapt these approaches and develop appropriate extensions explicitly targeting the transfer learning case in which one domain is the particular focus. A relevant recent advance is the graphical Dirichlet process of Chakrabarti et al. [13], which incorporates a known directed acyclic graph (DAG) characterizing the dependence structure across the groups.

3.4 Network transfer

As an alternative to viewing each source domain as providing exchangeable information about the target domain a priori, there is often expert knowledge about directed relationships between the different domains. Incorporating a network of relationships among the domains in transfer learning is referred to as network transfer, as opposed to direct transfer.

Figure 1: Direct transfer (left) and network transfer learning (right). In direct transfer all the source learners are used directly in supporting the training of the target learner, whereas in network transfer we can have a more complex structure with some of the source learners supporting other source learners rather than the target learner directly.

In Bayesian network meta-analysis [61] the goal is often to compare the efficacy of a pair of treatments based on multiple studies, some of which may involve arms with other treatments. Let $W,X,Y,Z$ be four available treatments among which we want to compare the efficacy of $X$ and $Y.$ Suppose that we have the dataset $\mathcal{D}_{XY}$ formed based on studies comparing $X$ and $Y$ and that we also have access to datasets $\mathcal{D}_{XW}$ , $\mathcal{D}_{YW}$ , $\mathcal{D}_{YZ}$ , and $\mathcal{D}_{WZ}$ comparing, respectively, $X$ to $W$ , $Y$ to $W$ , $Y$ to $Z$ , and $W$ to $Z$ . We may have several different trials for some of these comparisons. The knowledge extracted from the trials comparing other treatments can be used to indirectly improve the analysis of the $X$ vs $Y$ trials. Figure 2 shows a graph representing the observed comparisons between the treatments, sometimes referred to as the evidence network [62], and the associated network transfer graph.

Figure 2: Evidence network for the treatment comparison (left) and the network transfer of information between the associated learners in the meta analysis (right).

The general framework for Bayesian network meta-analysis is outlined in [62]. Denote the observed mean difference in the efficacy of treatments $k$ and $l$ in study $i$ by $\delta_{i,k,l}$ and the baseline difference in efficacy between treatments $k$ and $l$ by $d_{k,l}$ . We refer to $d_{k,l}$ as effect parameters. In [62] they are divided into basic parameters $\boldsymbol{d}_{b}$ and functional parameters $\boldsymbol{d}_{f}$ . Any set of effect parameters can be treated as basic parameters if the edges associated with them create a spanning tree of the evidence network. The functional parameters are the remaining effect parameters.

Network meta-analysis assumes functional parameters can be represented as linear functions of basic parameters, i.e. $\boldsymbol{d}_{f}=\boldsymbol{F}\boldsymbol{d}_{b}$ for some matrix $\boldsymbol{F}$ , which is referred to as evidence consistency. Usually, these relations take the form $d_{j,k}=d_{j,l}-d_{k,l}$ for any treatments $j,k,l$ . In our example, we can choose ${d}_{X,W},{d}_{Y,W},{d}_{Z,W}$ to be the basic parameters and then relate the functional parameters to them via $d_{X,Y}=d_{X,W}-d_{Y,W}$ and $d_{Y,Z}=d_{Y,W}-d_{Z,W}$ . Leveraging this assumption is analogous to utilizing the shared parameter strategy outlined in Section 3.1. We can use these identities to increase the precision of estimation of $d_{Y,W}$ , which in turn, together with the estimates of $d_{X,W}$ , can increase the precision of the estimation of $d_{X,Y}$ . This is represented by the network transfer in Figure 2.

The linear relationship $\boldsymbol{d}_{f}=\boldsymbol{F}\boldsymbol{d}_{b}$ can be used to model the vector of observed differences in treatments $\boldsymbol{\delta}$ conditionally on $\boldsymbol{d}_{b}$ and the covariance of $\boldsymbol{\delta}_{b}$ , denoted by $\text{Cov}(\boldsymbol{\delta}_{b})=\boldsymbol{V}_{b}$ . Using the Gaussian distribution $\boldsymbol{\delta}\sim\mathcal{N}\left(\left(\boldsymbol{d}_{b}^{T},\boldsymbol{d}_{b}^{T}\boldsymbol{F}^{T}\right)^{T},\boldsymbol{V}\right)$ is standard [61], [25], [57], where

\boldsymbol{V}=\begin{pmatrix}\boldsymbol{V}_{b}&\boldsymbol{V}_{b}\boldsymbol{F}^{T}\\ \boldsymbol{F}\boldsymbol{V}_{b}^{T}&\boldsymbol{F}\boldsymbol{V}_{b}\boldsymbol{F}^{T}\end{pmatrix}.

This prior can be incorporated within a hierarchical model for the individual observations in each study. Further borrowing of information can be facilitated through placing a common random effects distribution on the basic treatment effect parameters as in [57].
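For the four-treatment example, the consistency matrix $\boldsymbol{F}$ and the implied Gaussian moments can be written out explicitly. The sketch below is illustrative only: the ordering of the parameters, the numerical values and the function name `consistency_moments` are assumptions made for the example.

```python
import numpy as np

# Basic parameters d_b = (d_XW, d_YW, d_ZW); functional d_f = (d_XY, d_YZ).
F = np.array([[1.0, -1.0, 0.0],    # d_XY = d_XW - d_YW
              [0.0, 1.0, -1.0]])   # d_YZ = d_YW - d_ZW

def consistency_moments(d_b, V_b, F):
    """Mean and covariance of delta = (delta_b, delta_f) under evidence consistency."""
    mean = np.concatenate([d_b, F @ d_b])
    V = np.block([[V_b, V_b @ F.T],
                  [F @ V_b, F @ V_b @ F.T]])
    return mean, V

# Made-up values for illustration
d_b = np.array([0.5, 0.2, -0.1])
V_b = np.diag([0.1, 0.2, 0.3])
mean, V = consistency_moments(d_b, V_b, F)
```

Since $\boldsymbol{V}=\boldsymbol{M}\boldsymbol{V}_{b}\boldsymbol{M}^{T}$ with $\boldsymbol{M}=(\boldsymbol{I},\boldsymbol{F}^{T})^{T}$ , its rank is at most the number of basic parameters, so all uncertainty about the functional comparisons is inherited from the basic ones.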

Additional flexibility in information transfer can come from allowing violations in evidence consistency. Lu et al. [62] provide such a framework via $\boldsymbol{d}_{f}=\boldsymbol{F}\boldsymbol{d}_{b}+\boldsymbol{w}$ , where $\boldsymbol{w}$ represents inconsistencies between studies. In our example

d_{X,Y}=d_{X,W}-d_{Y,W}+w_{X,Y,W}

and

d_{Y,Z}=d_{Y,W}-d_{Z,W}+w_{Y,Z,W}.

The inferred size of $\boldsymbol{w}$ directly measures how related the domains are and determines how much information transfer should occur between them. There can be various sources of inconsistencies between the pairwise comparisons. They can stem from limitations in the design of individual studies and from changes in the baseline efficacy of treatments over time, for example due to increasing antibiotic resistance. This problem has recently been addressed in [57] where the basic parameters are assumed to vary over time according to a Gaussian process. Thus, the information transfer between domains is corrected for the times at which the associated datasets were collected.

Often the appropriate transfer network joining the domains is not known and needs to be inferred. One can take a brute-force approach to select the best transfer network under some quality measure by checking every possible graph. However, this approach quickly becomes intractable as the number of domains grows, with millions of possible transfer networks on just eight domains. Zhou et al. [109] provide a greedy algorithm which starts from the target learner and at each step includes a source learner yielding the highest conditional marginal likelihood for the target task. Specifically, suppose we have learners $\mathcal{L}_{1},\dots,\mathcal{L}_{K}$ operating on datasets $\boldsymbol{D}^{(1)}=(\boldsymbol{y}^{(1)},\boldsymbol{X}^{(1)}),\dots,\boldsymbol{D}^{(K)}=(\boldsymbol{y}^{(K)},\boldsymbol{X}^{(K)})$ , respectively, where $\mathcal{L}_{1}$ is the target learner. Let ${G}=(V,E)$ be the (connected) network transfer graph with $V=\{1,\dots,K\}$ . Let $\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{K}$ be the parameters in $\mathcal{L}_{1},\dots,\mathcal{L}_{K}$ , where for every $(i,j)\in E$ there exist subvectors $\boldsymbol{\theta}_{i,\mathcal{C}_{i}}$ and $\boldsymbol{\theta}_{j,\mathcal{C}_{j}}$ of, respectively, $\boldsymbol{\theta}_{i}$ and $\boldsymbol{\theta}_{j}$ which are restricted to be equal (shared parameter approach). Then at each step, given the chosen set of learners $Q\subset V$ , which is referred to as the linkage set, let $N_{G}(Q)$ be the set of neighbors of $Q$ , consisting of all learners adjacent to at least one learner in $Q$ . We then select a new learner $j^{*}$ to be added to $Q$ via

j^{*}=\operatorname*{arg\ max}_{j\in N_{G}(Q)}p(\cup_{k\in Q}\boldsymbol{y}^{(k)}\mid\cup_{k\in Q}\boldsymbol{X}^{(k)},\boldsymbol{D}^{(j)}).

(9)

The algorithm terminates once adding a new learner no longer increases the conditional likelihood in (9). The complexity of this algorithm is $O(K^{2})$ under the assumption that the conditional likelihood in (9) can be obtained in constant time. This can be further reduced to $O(K\log K)$ if the likelihood computation is parallelized between the learners in $Q$ . Zhou et al. [109] provide theoretical guarantees for the recovery of the optimal transfer subnetwork of $G$ .
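The greedy search described above can be sketched in a few lines. This is an illustrative outline rather than the algorithm of [109] as published: `log_score(Q, j)` is a hypothetical user-supplied function standing in for the log of the conditional marginal likelihood in (9), `log_score(Q, None)` is assumed to return the corresponding value when no learner is added, and the graph is passed as an adjacency dictionary.

```python
def greedy_linkage_set(neighbors, target, log_score):
    """Grow the linkage set Q from the target learner.

    neighbors: dict mapping each learner to the learners adjacent to it in G.
    log_score: hypothetical callable (Q, j) -> float; j=None means "add nothing".
    """
    Q = [target]
    while True:
        # N_G(Q): learners adjacent to at least one member of Q, not yet in Q
        candidates = sorted({j for i in Q for j in neighbors[i]} - set(Q))
        if not candidates:
            break
        current = tuple(Q)
        scores = {j: log_score(current, j) for j in candidates}
        best = max(candidates, key=scores.get)
        # stop once the best candidate no longer improves on adding nothing
        if scores[best] <= log_score(current, None):
            break
        Q.append(best)
    return Q

# Toy usage with made-up scores (not real marginal likelihoods)
toy_graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
toy_gain = {2: 1.0, 3: -0.5, 4: 0.3}
toy_score = lambda Q, j: 0.0 if j is None else toy_gain[j]
linkage = greedy_linkage_set(toy_graph, target=1, log_score=toy_score)
```

Each pass scores at most $K$ candidates and there are at most $K$ passes, which is where the $O(K^{2})$ count of likelihood evaluations comes from.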

Having explored a variety of Bayesian approaches to transferring information between domains in a flexible manner, we now discuss which among these are applicable to particular types of transfer learning problems depending on (i) feature spaces of source and target domains; (ii) the availability of labels and samples in both source and target datasets. We note that there is an immense Bayesian literature providing relevant models for transfer learning that we do not mention above. Instead, we have chosen to highlight some approaches that we find particularly interesting and useful.

4 Transfer with Common vs Overlapping Variables

Following criterion (i), transfer learning problems can be dichotomized based on whether or not the observations in the source and target domains live in the same feature spaces. We will refer to the former case as common variables and to the latter as overlapping variables transfer. This classification often determines which of the approaches presented in Section 3 are appropriate or even feasible to use.

4.1 Common variables transfer

Common variables transfer, also known as homogeneous transfer [47], occurs when the source and target data and labels have the same meaning in the different domains, but may follow different distributions. For example, the same variables are collected for each of the study subjects but subjects in different groups may have considerably different attributes; hence, the distribution of the variables being collected may vary across groups. In this common variables transfer case, it typically makes sense to define the same form of likelihood for the data in each domain, though there may be systematic differences in the parameters. All of the methods described in Section 3 can be applied directly to common variables problems.

Examples of common variables transfer include clinical studies with patients divided according to their health status or subtype of disease. Researchers are often interested in improving the accuracy of inference and predictions for a particular group of patients by utilizing the information gathered from the other groups or from healthy individuals. This can be especially useful when dealing with rare diseases where it is often difficult to collect measurements on a large sample of affected patients. In [7] and [38], the authors use RNA sequencing datasets for different types of lung, kidney, head and neck cancer as source domains in order to improve the accuracy of subtype identification for particular types of lung cancer. Bayesian models borrowing strength across classes and types of cancer have been applied in other contexts, including survival analysis [64], [76], and protein network inference [85], [4].

4.2 Overlapping variables transfer

This case is more challenging as the source and target datasets do not consist of measurements on exactly the same variables for different study subjects. In order for transfer learning to apply, there has to be something in common across the domains. A typical setting is when the domains have overlapping but not completely identical sets of variables. For example, there may be a common focus across the domains in studying the impact of key predictors of interest on a response, measured under different covariates.

For regression or classification models, the coefficients for the key predictors are not directly comparable across models adjusting for different covariates. Hence, shared parameter models are not appropriate. Nonetheless, it may make sense to assume that the domain-specific coefficients for the key predictors are drawn from a common random effects distribution, thereby enabling borrowing of information. Shared latent space models are even more natural in this case. By jointly modeling the response, key predictors, and covariates as conditionally independent given latent factors, we induce a parsimonious latent factor regression/classification model [11], [31]. Multi-study variants of such models can seamlessly handle cases in which the covariates differ across domains. The multi-group variants of Bayesian mixture models discussed in Section 3 similarly apply for the joint distributions of key predictors, covariates and response(s) to induce flexible transfer learning.

Bayesian methods enjoy a key advantage in this context, in their ability to rely on joint latent feature models to transfer information across domains with partially overlapping variables. The majority of non-Bayesian approaches such as deep transfer learning ([102], [84], [77], [88], [63], [59]) rely on both domains having observations in the same feature space for pre-training and fine-tuning the learners. We detail some examples of Bayesian transfer with overlapping variables below.

4.2.1 Multi-study latent factor regression

Recall the multi-study latent factor model in equation (8). Previously, we considered the observed data on subject $i$ in study (domain) $k$ , $\boldsymbol{y}_{k,i}$ , to be $p$ -dimensional, with $p$ fixed across subjects and domains. To generalize this, we instead consider the $p$ -dimensional data $\boldsymbol{y}_{k,i}$ to be the complete data for subject $i$ in study $k$ that could have potentially been measured. Then, we define $\boldsymbol{m}_{k,i}=(m_{k,i,j},j=1,\ldots,p)^{T}$ to be the missingness pattern for the $(k,i)$ subject, with $m_{k,i,j}=1$ if the $j$ th variable is not observed for that subject and $m_{k,i,j}=0$ otherwise. A variable is missing for a subject if the study they participated in does not collect that variable, or if the study planned to collect that variable but it was not available.

Let $\boldsymbol{y}_{k,i}^{(obs)}=\{y_{k,i,j},j:m_{k,i,j}=0,j=1,\ldots,p\}$ denote the $p_{k,i}=\sum_{j=1}^{p}(1-m_{k,i,j})$ dimensional observed data vector for subject $(k,i)$ . The Gaussian multi-study latent factor model characterizes the complete data vector as $\boldsymbol{y}_{k,i}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}_{k})$ . This in turn induces a $p_{k,i}$ -dimensional multivariate Gaussian distribution for the observed data vector $\boldsymbol{y}_{k,i}^{(obs)}$ having covariance corresponding to the appropriate sub-matrix of $\boldsymbol{\Sigma}_{k}$ . In fitting Bayesian multi-study factor models, it is not necessary to impute the missing data. Instead, one can simply take into account the differing observed data contributions for each subject in implementing a Gibbs sampler or alternative Markov chain Monte Carlo (MCMC) algorithm for posterior sampling.
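The observed-data contribution of a single subject can be computed directly from the relevant sub-matrix of $\boldsymbol{\Sigma}_{k}$ , with no imputation step. The sketch below is a minimal illustration with a hypothetical function name; it treats $\boldsymbol{\Sigma}_{k}$ as given, whereas in practice it would be evaluated at each posterior draw of the factor model parameters.

```python
import numpy as np

def observed_loglik(y, m, Sigma):
    """Gaussian log-density of the observed entries of one subject's data.

    y: length-p complete-data vector (entries with m == 1 are ignored).
    m: length-p missingness pattern, 1 = not observed, 0 = observed.
    Sigma: p x p covariance of the complete data in the subject's study.
    """
    obs = np.asarray(m) == 0
    y_obs = np.asarray(y)[obs]
    S = Sigma[np.ix_(obs, obs)]                 # sub-matrix for observed variables
    _, logdet = np.linalg.slogdet(S)
    quad = y_obs @ np.linalg.solve(S, y_obs)
    return -0.5 * (obs.sum() * np.log(2 * np.pi) + logdet + quad)
```

Summing this quantity over subjects and studies gives the log-likelihood that a Metropolis-type sampler would use, with each subject contributing only through the variables actually measured.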

This approach can be used for transfer learning about the covariance structure in multivariate data specific to the target domain. Alternatively, when the focus is on regression, one can concatenate outcomes, predictors of interest and covariates together in the $\boldsymbol{y}_{k,i}$ data vector for subject $(k,i)$ . A Gaussian linear regression model for the outcome given the predictors of interest and covariates can be then obtained directly from the covariance $\boldsymbol{\Sigma}_{k}$ using standard multivariate Gaussian theory. This type of approach is straightforward to extend to mixed categorical and continuous data by following the popular approach of linking categorical variables to underlying Gaussian variables.
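The step from the joint covariance to a regression model uses the usual Gaussian conditioning formulas: with the outcome in position `out`, the coefficients are $\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$ and the residual variance is the Schur complement. The following sketch assumes a zero-mean model as in the text; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def regression_from_cov(Sigma, out=0):
    """Coefficients and residual variance of E[y_out | remaining variables]."""
    p = Sigma.shape[0]
    rest = np.array([j for j in range(p) if j != out])
    S_xx = Sigma[np.ix_(rest, rest)]
    s_xy = Sigma[rest, out]
    beta = np.linalg.solve(S_xx, s_xy)          # Sigma_xx^{-1} Sigma_xy
    resid_var = Sigma[out, out] - s_xy @ beta   # Schur complement
    return beta, resid_var

# Toy check: build a joint covariance from known coefficients
beta_true = np.array([1.0, -2.0])
S_xx = np.array([[1.0, 0.3], [0.3, 2.0]])
s_xy = S_xx @ beta_true
s_yy = beta_true @ S_xx @ beta_true + 0.5
Sigma_joint = np.block([[np.array([[s_yy]]), s_xy[None, :]],
                        [s_xy[:, None], S_xx]])
beta_hat, resid_var = regression_from_cov(Sigma_joint, out=0)
```

Applying this map to each posterior draw of $\boldsymbol{\Sigma}_{k}$ yields posterior draws of the target-domain regression coefficients.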

4.2.2 Nonlinear and nonparametric extensions

A limitation of the above multi-study factor analysis model is the assumption of multivariate Gaussianity. It is hence useful to consider extensions that incorporate shared and study-specific latent factors while relaxing these restrictive distribution assumptions, which also imply linear relationships among the variables.

In the single modality case, there is a rich literature on nonlinear factor models. For example, we could let

\boldsymbol{y}_{i}=f(\boldsymbol{\eta}_{i})+\boldsymbol{\epsilon}_{i},\quad\boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}_{p}),

(10)

where $\boldsymbol{\eta}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{I}_{q})$ are the vectors of latent factors and $f(\cdot)$ is an unknown and potentially non-linear function. Gaussian process latent variable models (GP-LVMs) place a GP prior on the function $f$ mapping from the latent to ambient space [56], [30], [80], [98]. Alternatively, the popular class of variational autoencoders (VAEs) characterize $f$ using deep neural networks and take a variational approach to inference [67], [45], [12].

While these highly flexible nonlinear latent variable models have exhibited appealing practical performance as black-box models for generating new data that resemble the training data, they are prone to a number of vexing issues in reproducing statistical inferences. One major challenge is the curse of dimensionality resulting from the fact that the function $f$ is an unknown mapping from $q$ to $p$ dimensional space; the space of such functions is immense, necessitating an enormous amount of training data for adequate performance. Furthermore, these models are not identifiable without substantial additional constraints. Another common problem is referred to as posterior collapse [20], [92], [93], in which there is a lack of learning about the latent variables based on the data. While there have been some attempts at addressing these problems, there remains a lack of practically useful methodology to perform reproducible dimensionality reduction.

The above challenges are exacerbated in considering extensions to the multi-study (transfer learning) case. Hence, we recommend starting with more parsimonious nonlinear latent factor models in future work developing such extensions. One promising point of departure is the recently proposed NIFTY framework of Xu et al. [98], which lets

\begin{aligned}\boldsymbol{y}_{i}&=\boldsymbol{\Lambda}\boldsymbol{\eta}_{i}+\boldsymbol{\epsilon}_{i},\quad\boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}),\\ \eta_{ih}&=g_{h}(u_{ik_{h}}),\quad h=1,\ldots,q,\end{aligned}

where $\boldsymbol{\Lambda}$ is a factor loading matrix, $\boldsymbol{\Sigma}=\textnormal{diag}(\sigma_{1}^{2},\ldots,\sigma_{p}^{2})$ , and $u_{ik}\overset{\mathrm{iid}}{\sim}U(0,1)$ for $k=1,\ldots,K\leq q$ . Each latent factor $\eta_{ih}$ is a transformation of a latent $u_{ik_{h}}$ via an unknown non-decreasing function $g_{h}$ . The subscript $k_{h}$ allows the same latent uniforms to be used for multiple factors, inducing dependence.
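A forward simulation helps to show how the shared uniforms induce dependence between factors. The sketch below is not the NIFTY implementation of [98]: the functions $g_{h}$ are unknown and estimated in that framework, whereas here they are fixed, hypothetical non-decreasing functions chosen purely for illustration, and the function name `simulate_nifty` is an assumption.

```python
import numpy as np

def simulate_nifty(n, Lambda, sigma2, g_list, k_index, rng):
    """Draw n observations from y = Lambda eta + eps with eta_h = g_h(u_{k_h}).

    Lambda: p x q loading matrix; sigma2: length-p residual variances.
    g_list: q non-decreasing functions on (0, 1) (hypothetical choices).
    k_index: length-q list giving which latent uniform feeds each factor.
    """
    K = max(k_index) + 1
    U = rng.uniform(size=(n, K))                               # latent uniforms
    eta = np.column_stack([g(U[:, k]) for g, k in zip(g_list, k_index)])
    eps = rng.normal(size=(n, Lambda.shape[0])) * np.sqrt(sigma2)
    return eta @ Lambda.T + eps, eta

rng = np.random.default_rng(2)
Lambda = rng.normal(size=(6, 3))
g_list = [lambda u: u, lambda u: u**2, np.sqrt]                # all non-decreasing
Y, eta = simulate_nifty(200, Lambda, np.full(6, 0.1), g_list, [0, 0, 1], rng)
```

In this toy configuration the first two factors are driven by the same uniform, so the second is a deterministic function of the first, while the third factor is independent of both.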

This model induces a flexible multivariate density for $\boldsymbol{y}_{i}$ while massively reducing dimensionality relative to model (10). In their paper, Xu et al. [98] provide theory on identifiability that leverages pre-training with state-of-the-art nonlinear dimensionality reduction algorithms. They also showed excellent performance for a wide variety of complex examples. They were even able to train a realistic generative model for bird songs based on few training examples; audio recordings of bird songs provide an example of massive dimensional data with low intrinsic dimension. NIFTY can exploit the complex low dimensional structure in the data for highly efficient performance.

In conducting inference for latent variable models, the NIFTY authors noticed a common problem of distributional shift. In particular, many of the current models assume that the latent variables are iid $\mathcal{N}(0,1)$ or $U(0,1)$ . Inferences on the parameters, such as the induced covariance in the Gaussian linear factor model case, critically depend on this assumption holding not just a priori but also a posteriori. Xu et al. [98] propose a general approach for solving latent variable distributional shift through forcing the posterior distribution of the latent variables to be very close to iid $U(0,1).$

4.2.3 Mixture models

An alternative direction towards building more flexible models for transfer learning, including in the partially overlapping variables case, is to rely on mixture models, building on the developments in Section 3.3. Such models have the advantage of also clustering subjects within the different domains. In the partially overlapping variables transfer learning case, it is appealing to define a joint model, as motivated above. However, Chandra et al. [14] recently noted a pitfall of mixture models in high dimensional cases in which the posterior tends to concentrate on trivial clusterings of the observations that place all subjects into one cluster or in singleton clusters.

As a solution in the single domain case, they proposed a latent mixture model formulation that lets

\begin{aligned}\boldsymbol{y}_{i}&=\boldsymbol{\Lambda}\boldsymbol{\eta}_{i}+\boldsymbol{\epsilon}_{i},\quad\boldsymbol{\epsilon}_{i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\boldsymbol{0},\boldsymbol{\Sigma}),\\ \boldsymbol{\eta}_{i}&\sim\sum_{h=1}^{H}\nu_{h}\mathcal{N}(\boldsymbol{\mu}_{h},\boldsymbol{\Delta}_{h}),\end{aligned}

(11)

so that a mixture of Gaussians model is used for the latent variables in a linear factor model. They prove that this model solves the above mentioned pitfall in Bayesian clustering in high dimensions. The trick is to model the variation across clusters in a lower dimensional latent space to address the curse of dimensionality.
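The generative structure in (11) can be simulated directly: cluster labels are drawn first, the latent factors are drawn from the corresponding Gaussian component, and the observations are a noisy linear map of the factors. The sketch below is a hypothetical illustration with made-up parameter values, not the sampler of [14].

```python
import numpy as np

def simulate_latent_mixture(n, Lambda, sigma2, nu, mus, Deltas, rng):
    """Draw (y, z) from model (11); z holds the latent cluster labels."""
    H, q = len(nu), Lambda.shape[1]
    z = rng.choice(H, size=n, p=nu)                 # cluster memberships
    eta = np.empty((n, q))
    for h in range(H):
        idx = np.flatnonzero(z == h)
        if idx.size:                                # skip empty clusters
            eta[idx] = rng.multivariate_normal(mus[h], Deltas[h], size=idx.size)
    eps = rng.normal(size=(n, Lambda.shape[0])) * np.sqrt(sigma2)
    return eta @ Lambda.T + eps, z

rng = np.random.default_rng(3)
Lambda = rng.normal(size=(50, 2))                   # p = 50 observed, q = 2 latent
mus = [np.array([-3.0, 0.0]), np.array([3.0, 0.0])]
Deltas = [np.eye(2), np.eye(2)]
Y, z = simulate_latent_mixture(100, Lambda, np.full(50, 0.5), [0.4, 0.6], mus, Deltas, rng)
```

The clusters differ only through the $q$ -dimensional factors, so the number of cluster-specific parameters does not grow with $p$ .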

With this single domain specification as the starting point, there are multiple promising directions forward in terms of extensions to the multiple domain transfer learning case. One possibility is to define a multi-study factor model as in Chandra et al. [15] but instead of assuming Gaussian shared and study-specific latent factors, use Gaussian mixture models to induce a flexible distribution on the latent factors while also producing separate clusters of subjects in each domain with respect to the shared and study-specific components. An alternative is to rely on the model in equation (11) but with domain-specific distributions for the latent factors defined as

f_{k}(\boldsymbol{\eta}_{i})=\int\mathcal{N}(\boldsymbol{\eta}_{i};\boldsymbol{\theta}_{i})dP_{k}(\boldsymbol{\theta}_{i}),

where $\boldsymbol{\theta}_{i}=\{\boldsymbol{\mu}_{i},\boldsymbol{\Delta}_{i}\}$ are the Gaussian parameters and $P_{k}$ is a mixing distribution on these parameters that is specific to domain $k$ .

In the special case in which $P_{k}=P=\sum_{h=1}^{H}\nu_{h}\delta_{\boldsymbol{\theta}_{h}^{*}}$ with $\delta_{\boldsymbol{\theta}}$ a degenerate distribution concentrated at $\boldsymbol{\theta}$ , we obtain the original model in (11). However, by using the different priors $(P_{1},\ldots,P_{K})\sim\Pi$ considered in Section 3.3 we can allow differences across the domains while borrowing information. Further borrowing is achieved through the implicit assumption of a latent space that is shared across domains, which is induced through the use of a common factor loading matrix $\boldsymbol{\Lambda}$ .

4.2.4 Multiresolution transfer learning

Closely related to the overlapping variables case is the setting in which data are collected for each domain on related random functions or stochastic processes. For example, let $f_{k}:\mathcal{T}\to\mathbb{R}$ denote a latent smooth continuously differentiable function for domain $k$ , and suppose that we have

y_{k,i}=f_{k}(t_{k,i})+\epsilon_{k,i},\quad\epsilon_{k,i}\overset{\mathrm{iid}}{\sim}\mathcal{N}(0,\sigma^{2}),

(12)

with $\boldsymbol{y}_{k}=\{y_{k,i},i=1,\ldots,n_{k}\}$ the observed data and $\boldsymbol{t}_{k}=\{t_{k,i},i=1,\ldots,n_{k}\}$ the observation locations for domain $k$ . Often, the observation locations do not line up across domains and certain domains may have lower resolution data than others, with the resolution referring to the density of the observation points $\boldsymbol{t}_{k}$ over the support $\mathcal{T}.$

Model (12) represents a type of functional data analysis (FDA). In many FDA settings, we observe noisy realizations of a subject-specific function at a finite set of points, but here we are considering the case in which we have one function for each domain and are particularly interested in inference on the function for the target domain. There are many applications in which this type of problem arises. We may have domain-specific regression functions $f_{k}$ and want to borrow information across domains in a nonparametric regression context without assuming any common parameters. Alternatively, $\boldsymbol{y}_{k}$ may correspond to a domain $k$ -specific time series and we want to borrow information across related time series to enhance prediction for a target series.

A natural Bayesian approach to inference under model (12) is to consider a functional data extension of the hierarchical and random effects modeling approaches highlighted in Section 3. For example, one could use a hierarchical GP that lets $f_{k}\sim\mbox{GP}(f_{0},c)$ with $f_{0}$ in turn given a GP prior [5], [21]. For articles on using GPs in closely related transfer learning settings to that of (12) refer to [94], [103], [39]. These approaches can automatically accommodate the case in which observations are denser in some domains than others.
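Under a hierarchical GP with $f_{0}\sim\mbox{GP}(0,c_{0})$ and $f_{k}\mid f_{0}\sim\mbox{GP}(f_{0},c)$ , the functions are jointly Gaussian with $\text{Cov}\{f_{k}(s),f_{l}(t)\}=c_{0}(s,t)+1(k=l)\,c(s,t)$ , so prediction for the target domain reduces to standard GP regression on the pooled data. The sketch below is a minimal illustration under several assumptions not taken from the cited works: a zero prior mean for $f_{0}$ , squared-exponential kernels, and fixed, hand-picked hyperparameters.

```python
import numpy as np

def se_kernel(s, t, var, length):
    """Squared-exponential kernel matrix between location vectors s and t."""
    d = s[:, None] - t[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def hier_gp_predict(t_list, y_list, t_new, target=0,
                    shared=(1.0, 0.3), specific=(0.25, 0.3), noise_var=0.01):
    """Posterior mean of the target function f_target at locations t_new.

    t_list, y_list: per-domain observation locations and values; the
    locations may differ in number and position across domains.
    shared / specific: (variance, length-scale) of c_0 and c.
    """
    t_all = np.concatenate(t_list)
    y_all = np.concatenate(y_list)
    dom = np.concatenate([np.full(len(t), k) for k, t in enumerate(t_list)])
    same = dom[:, None] == dom[None, :]             # same-domain indicator
    K = se_kernel(t_all, t_all, *shared) + same * se_kernel(t_all, t_all, *specific)
    K = K + noise_var * np.eye(len(t_all))          # observation noise
    from_target = dom[None, :] == target
    K_new = (se_kernel(t_new, t_all, *shared)
             + from_target * se_kernel(t_new, t_all, *specific))
    return K_new @ np.linalg.solve(K, y_all)
```

Setting the variance of the shared component to zero removes all cross-domain covariance, in which case the source data have no effect on the target prediction; larger shared variance relative to the domain-specific variance corresponds to stronger transfer.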

Wilson et al. [94] introduce GP regression networks (GPRN) which use latent GPs to transfer information between different continuous time processes. Specifically, given $K$ time series $\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{K}$ , [94] let

$$\boldsymbol{y}_{k}=\sum_{q=1}^{Q}\boldsymbol{W}_{k,q}\odot\boldsymbol{f}_{q}+\boldsymbol{\epsilon}_{k},$$

where $\odot$ is the Hadamard product, $\boldsymbol{f}_{q}\sim\mathcal{GP}(0,\boldsymbol{K}_{q}^{f})$ are latent basis GPs evaluated at the observation times, $\boldsymbol{W}_{k,q}\sim\mathcal{GP}(0,\boldsymbol{K}_{k,q}^{w})$ are domain-specific weights of the latent GPs, and $\boldsymbol{\epsilon}_{k}\sim\mathcal{N}(\boldsymbol{0},\sigma_{k}^{2}\boldsymbol{I})$ represents measurement noise. The weights $\boldsymbol{W}_{k,q}$ determine the strength and structure of information transfer between the domains, analogously to the network of learners introduced in [109]. However, unlike in most works in Section 3.4, GPRN allows for the strength of the transfer within the network of learners to vary over time.
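As an illustration of the generative structure above, the following sketch draws one synthetic dataset from the GPRN model at a common grid of observation times. It is not the inference procedure of [94]: the squared-exponential kernels, the use of a single kernel for all basis GPs and a single kernel for all weight GPs, and the values of $K$, $Q$ and $\sigma_{k}$ are assumptions made purely for illustration.

```python
import numpy as np

def sq_exp(t, variance, lengthscale):
    """Squared-exponential kernel matrix on a 1-D grid of times."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def draw_gp(K, rng, jitter=1e-6):
    """One draw from N(0, K), with a small jitter for numerical stability."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return L @ rng.standard_normal(K.shape[0])

rng = np.random.default_rng(1)
n, n_domains, Q = 60, 3, 2          # hypothetical sizes
t = np.linspace(0, 1, n)            # common observation times
Kf = sq_exp(t, 1.0, 0.1)            # kernel of the latent basis GPs f_q
Kw = sq_exp(t, 1.0, 0.5)            # kernel of the weight GPs W_{k,q} (smoother)
sigma = np.array([0.05, 0.1, 0.2])  # domain-specific noise levels sigma_k

f = np.array([draw_gp(Kf, rng) for _ in range(Q)])  # shape (Q, n)
W = np.array([[draw_gp(Kw, rng) for _ in range(Q)]
              for _ in range(n_domains)])           # shape (K, Q, n)

# signal[k] = sum_q W[k, q] * f[q], the Hadamard products summed over q
signal = np.einsum("kqn,qn->kn", W, f)
y = signal + sigma[:, None] * rng.standard_normal((n_domains, n))
```

Because each weight $\boldsymbol{W}_{k,q}$ is itself a function of time, the contribution of a shared basis process $\boldsymbol{f}_{q}$ to domain $k$ can grow, shrink or change sign along the series, which is the time-varying transfer described above.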

Although [94] simplify inference by assuming identical measurement times across domains, [39] extend the approach to allow different measurement times and increase flexibility by using deep GPs. By using a shared latent space approach with GPs serving as the basis for the latent space, these models are able to achieve transfer both across resolutions and different classes of observations, corresponding to different air pollutants in the application presented in [39].

5 Availability of labelled data

As we have seen, the Bayesian paradigm provides a fertile ground for developing a rich variety of techniques relevant to transfer learning in both supervised [78], [18], [43], [47], [48], [49], [64] and unsupervised settings [23], [90], [15], [82], [81]. When the focus is on prediction, there are often challenges presented by the limited availability of labelled data. The term semi-supervised learning refers to the case in which labels are only available for a subset of the samples. The joint modeling approaches for overlapping variable transfer learning described in the previous section can trivially handle semi-supervised settings, as missing labels are just one type of missing data.

Particularly challenging are cases of one-shot and few-shot learning, which refer to having only a single or a few labelled samples in the target domain, respectively. For articles proposing Bayesian approaches to handle semi-supervised learning and these cases of a tiny number of labelled target data, refer to [101], [71], [75], [53], [55], [54]. In such cases, performance is critically dependent on borrowing of information from labelled data in the source domains. Common examples include classification based on images or audio ([53], [55], [54]). For example, we may have many labelled examples of different individuals' handwriting but only a tiny number for the individual of interest.

In the transfer learning literature, inductive transfer learning refers to the case where target domain labels are available, while transductive transfer learning has labels available only in the source domains [70]. Of course, if there are certain systematic differences between the source and target domains, accurate transductive transfer may be impossible [22]. Nonetheless, there is a rich PAC-Bayesian literature on this topic ([32], [33], [34], [79]) which specifies conditions allowing for successful training of the target learner in the absence of target labels in classification settings. These works provide theoretical upper bounds on the expected error of a Gibbs classifier on the target domain. The bounds depend on various measures of divergence between the distributions of predictors and labels in the two domains, as well as on the properties of the set of voter algorithms from which the Gibbs classifier is constructed.

6 Simulation illustration

In order to illustrate Bayesian transfer learning in practice, we run a simulation experiment focused on the problem of transfer learning targeting the covariance and/or precision matrix of a high-dimensional multivariate Gaussian distribution. In particular, there are data collected on the same set of variables for subjects in different groups and we would like to allow the covariance/precision to vary across groups, while borrowing information. This is a natural setting for both Bayesian multi-study factor analysis models and frequentist competitors focused on transfer learning in precision matrix estimation.

Specifically, we compare Bayesian Subspace Factor Analysis (SUFA) [15], presented in Section 3.3, with the frequentist Trans-CLIME method proposed by Li et al. [74]. Trans-CLIME is a transfer learning extension of constrained $L_{1}$ minimization for inverse matrix estimation (CLIME) [87]. We also include the frequentist estimator proposed by Guo et al. [36], which Li et al. [74] refer to as multitask graphical lasso (MT-Glasso). While [15] focused on inferring the covariance, SUFA can just as easily be used to infer any functional of the covariance, including the precision. We focus our comparisons on precision estimation, as this was the emphasis in [74] and [36].

We generate the synthetic data in two domains, $T$ (target) and $S$ (source), with sample sizes $n_{T}=40$ and $n_{S}=80$ , respectively, and consider the data dimensions $p\in\{40,60,80,\dots,280\}$ . We generate 50 replicated datasets for each value of $p$ . We report the average Frobenius and $L_{1}$ norm of the errors between the estimated and true values for the target precision matrix. For each value of $p$ considered, we generate the data using a fixed true precision matrix across all the simulation replicates. We present the results in Figure 3.

Figure 3: Frobenius and $L_{1}$ norm errors of target precision matrix estimation for SUFA, Trans-CLIME and MT-Glasso over varying dimension $p$.
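The evaluation loop behind such a comparison can be sketched as follows. This is only an illustration of how the two error metrics are averaged over replicates for a fixed true precision matrix; it does not implement SUFA, Trans-CLIME or MT-Glasso. The tridiagonal true precision matrix, the ridge-regularized placeholder estimator, and the reading of the $L_{1}$ norm as the induced matrix norm (maximum absolute column sum) are all assumptions introduced here, and every function name is hypothetical.

```python
import numpy as np

def precision_errors(omega_hat, omega_true):
    """Frobenius and (induced) L1 norm of the estimation error."""
    diff = omega_hat - omega_true
    return np.linalg.norm(diff, "fro"), np.linalg.norm(diff, 1)

def tridiagonal_precision(p, rho=0.4):
    """Hypothetical sparse, positive definite true precision matrix."""
    return np.eye(p) + rho * (np.eye(p, k=1) + np.eye(p, k=-1))

def ridge_precision(x, lam=0.5):
    """Placeholder estimator: inverse of the ridge-regularized sample covariance."""
    s = np.cov(x, rowvar=False)
    return np.linalg.inv(s + lam * np.eye(s.shape[0]))

def average_errors(p, n_target=40, n_rep=50, seed=0):
    """Average both error norms over replicates, with the truth held fixed."""
    rng = np.random.default_rng(seed)
    omega = tridiagonal_precision(p)
    chol = np.linalg.cholesky(np.linalg.inv(omega))
    errors = []
    for _ in range(n_rep):
        x = rng.standard_normal((n_target, p)) @ chol.T  # rows ~ N(0, omega^{-1})
        errors.append(precision_errors(ridge_precision(x), omega))
    return np.mean(errors, axis=0)  # (average Frobenius, average L1)

frob, l1 = average_errors(p=40, n_rep=5)
```

Replacing `ridge_precision` with an estimator that also takes source-domain data would turn this skeleton into a transfer learning comparison of the kind reported in Figure 3.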

As we can see, in terms of the $L_{1}$ norm of the error, the performance tends to improve significantly with growing dimension for SUFA and MT-Glasso, and to an extent for Trans-CLIME as well. This somewhat counterintuitive phenomenon, known as the blessing of dimensionality [104], is commonly encountered in the covariance and precision estimation literature [66], [96], [104], [72]. For the Frobenius norm error, we see more instability in the performance of MT-Glasso and Trans-CLIME. Intuitively, the $L_{1}$ norm might be more favorable for these two methods, since they are both built on $L_{1}$ minimization.

Since Trans-CLIME does not guarantee positive definiteness and invertibility of the estimated precision matrices, it is problematic to use it for covariance estimation; even after selecting only the synthetic datasets for which Trans-CLIME did produce invertible precision matrices, the resulting covariance estimates were highly unstable and overall significantly worse than those given by SUFA. MT-Glasso produces invertible precision matrices but also yielded covariance estimates further from the truth than SUFA estimates.

Hence, we find that this particular Bayesian approach to transfer learning based on a shared latent space model is competitive with frequentist counterparts even on tasks it was not built for (estimating the precision instead of the covariance). Indeed, as we have partially illustrated, since we have a posterior for the covariance that has support on the space of positive semidefinite covariances, we can provide Bayes estimates and posterior credible intervals providing uncertainty quantification for any functional of the covariance of interest. Hence, from a single Bayesian analysis, we can provide multiple results of interest that are all internally coherent.
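The point about coherent inference on arbitrary functionals can be illustrated with a short sketch. Given posterior draws of a covariance matrix, every functional of interest is computed draw by draw and then summarized. The draws below come from a conjugate inverse-Wishart posterior for zero-mean Gaussian data, used here only as a stand-in for the output of a sampler such as SUFA; the prior settings, toy data and function names are assumptions.

```python
import numpy as np

def inverse_wishart_draws(scale, dof, n_draws, rng):
    """Draws from IW(dof, scale) by inverting Wishart(dof, scale^{-1}) draws.

    Requires an integer dof that is at least the dimension.
    """
    p = scale.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(scale))
    z = rng.standard_normal((n_draws, dof, p)) @ L.T  # rows ~ N(0, scale^{-1})
    wishart = np.einsum("sij,sik->sjk", z, z)         # sums of outer products
    return np.linalg.inv(wishart)

def summarize(draws, level=0.95):
    """Posterior mean and equal-tailed credible interval, entry by entry."""
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return draws.mean(axis=0), lo, hi

rng = np.random.default_rng(0)
p, n = 4, 60
x = rng.standard_normal((n, p))     # toy data with identity covariance
scale_post = np.eye(p) + x.T @ x    # prior scale Psi_0 = I (assumption)
dof_post = (p + 2) + n              # prior dof nu_0 = p + 2 (assumption)
sigma_draws = inverse_wishart_draws(scale_post, dof_post, 2000, rng)

# Any functional of the covariance is evaluated on the same set of draws.
omega_draws = np.linalg.inv(sigma_draws)          # precision matrix
logdet_draws = np.linalg.slogdet(sigma_draws)[1]  # log-determinant

omega_mean, omega_lo, omega_hi = summarize(omega_draws)
logdet_mean, logdet_lo, logdet_hi = summarize(logdet_draws)
```

Since the precision and log-determinant summaries are derived from the same covariance draws, they are automatically consistent with one another, which is the internal coherence referred to above.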

7 Discussion

Transfer learning is a timely problem given the abundance of datasets from related domains. In many applications, there is simply not enough data from the domain of interest to support reliable inference and accurate predictions as we seek to fit increasingly complex models. Hence, it becomes critical to cleverly borrow information from available “source” datasets.

Choosing the appropriate strength and structure of information transfer between the domains remains one of the key challenges. The Bayesian paradigm offers a wide variety of approaches to transfer learning, including shared parameters, hierarchical and random effects models, shared latent space, and network transfer methods. There is a rich literature developing and applying these approaches in transfer learning settings, even though most often “transfer learning” is not mentioned in the associated papers.

This article has focused on providing a flavor for some of the interesting directions that are possible in terms of Bayesian transfer learning, but has not attempted a comprehensive overview of the massive relevant literature. Most of the transfer learning literature has focused on the simplest “common variables” case, where data from different domains consist of the same variables measured across different subjects. Bayesian ideas applied to transfer learning can be particularly useful in the more challenging settings presented in Section 4, where these existing methods largely do not apply.

We have purposely focused much of our attention on shared latent space-type models for Bayesian transfer learning, ranging from multi-study factor analysis to multi-group Bayesian nonparametric models. We focused on these areas because the associated models are not as well known to the broad community but are very practically useful, including in challenging high-dimensional and complex structured data cases. In addition, there have been interesting recent developments that we have highlighted, while sketching out some promising directions for ongoing research. This includes extending Bayesian continuous latent factor modeling approaches to transfer learning settings. Our view is that more careful statistical models will tend to dominate over-parametrized black boxes, such as VAEs, in many settings.

An additional interesting area for future research is Bayesian transfer learning involving deep neural networks. While recent papers have taken some early steps in this area [78], [16], there is plenty of potential for further impactful developments, especially given the importance of transfer learning to the training of deep neural networks, which are notoriously data-hungry.

8 Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement No. 856506), United States National Institutes of Health (R01ES035625), Office of Naval Research (N00014-21-1-2510), National Institute of Environmental Health Sciences (R01ES027498), and National Science Foundation (DMS-2230074). Suder’s contributions were supported in part by the Duke Summer Research Fellowship.

References

[1] {barticle}[author] \bauthor\bsnmAbel Rodríguez, \bfnmDavid B Dunson\binitsD. B. D. and \bauthor\bsnmGelfand, \bfnmAlan E\binitsA. E. (\byear2008). \btitleThe nested Dirichlet process. \bjournalJournal of the American Statistical Association. \endbibitem
[2] {barticle}[author] \bauthor\bsnmAvrahami, \bfnmOmri\binitsO., \bauthor\bsnmLischinski, \bfnmDani\binitsD. and \bauthor\bsnmFried, \bfnmOhad\binitsO. (\byear2021). \btitleGAN cocktail: Mixing GANs without dataset access. \bjournalEuropean Conference on Computer Vision. \endbibitem
[3] {barticle}[author] \bauthor\bsnmBaglama, \bfnmJames\binitsJ. and \bauthor\bsnmReichel, \bfnmLothar\binitsL. (\byear2005). \btitleAugmented implicitly restarted Lanczos bidiagonalization methods. \bjournalSIAM Journal on Scientific Computing. \endbibitem
[4] {barticle}[author] \bauthor\bsnmBaladandayuthapani, \bfnmVeerabhadran\binitsV., \bauthor\bsnmTalluri, \bfnmRajesh\binitsR., \bauthor\bsnmJi, \bfnmYuan\binitsY., \bauthor\bsnmCoombes, \bfnmKevin\binitsK., \bauthor\bsnmLu, \bfnmYiling\binitsY., \bauthor\bsnmHennessy, \bfnmBryan\binitsB., \bauthor\bsnmDavies, \bfnmMichael\binitsM. and \bauthor\bsnmMallick, \bfnmBani\binitsB. (\byear2014). \btitleBayesian sparse graphical models for classification with application to protein expression data. \bjournalThe Annals of Applied Statistics. \endbibitem
[5] {barticle}[author] \bauthor\bsnmBehseta, \bfnmSam\binitsS., \bauthor\bsnmKass, \bfnmRobert E\binitsR. E. and \bauthor\bsnmWallstrom, \bfnmGarrick L\binitsG. L. (\byear2005). \btitleHierarchical models for assessing variability among functions. \bjournalBiometrika. \endbibitem
[6] {barticle}[author] \bauthor\bsnmBhattacharya, \bfnmAnirban\binitsA., \bauthor\bsnmPati, \bfnmDebdeep\binitsD., \bauthor\bsnmPillai, \bfnmNatesh S.\binitsN. S. and \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. (\byear2015). \btitleDirichlet–Laplace priors for optimal shrinkage. \bjournalJournal of the American Statistical Association. \endbibitem
[7] {barticle}[author] \bauthor\bsnmBoluki, \bfnmShahin\binitsS., \bauthor\bsnmQian, \bfnmXiaoning\binitsX. and \bauthor\bsnmDougherty, \bfnmEdward R\binitsE. R. (\byear2021). \btitleOptimal Bayesian supervised domain adaptation for RNA sequencing data. \bjournalBioinformatics. \endbibitem
[8] {barticle}[author] \bauthor\bsnmBoschini, \bfnmMatteo\binitsM., \bauthor\bsnmBonicelli, \bfnmLorenzo\binitsL., \bauthor\bsnmPorrello, \bfnmAngelo\binitsA., \bauthor\bsnmBellitto, \bfnmGiovanni\binitsG., \bauthor\bsnmPennisi, \bfnmMatteo\binitsM., \bauthor\bsnmPalazzo, \bfnmSimone\binitsS., \bauthor\bsnmSpampinato, \bfnmConcetto\binitsC. and \bauthor\bsnmCalderara, \bfnmSimone\binitsS. (\byear2022). \btitleTransfer without forgetting. \bjournalEuropean Conference on Computer Vision. \endbibitem
[9] {barticle}[author] \bauthor\bsnmBu, \bfnmFan\binitsF., \bauthor\bsnmKagaayi, \bfnmJoseph\binitsJ., \bauthor\bsnmGrabowski, \bfnmKate\binitsK., \bauthor\bsnmRatmann, \bfnmOliver\binitsO. and \bauthor\bsnmXu, \bfnmJason\binitsJ. (\byear2023). \btitleInferring HIV transmission patterns from viral deep-sequence data via latent typed point processes. arXiv preprint arXiv 2302.11567. \endbibitem
[10] {barticle}[author] \bauthor\bsnmCamerlenghi, \bfnmFederico\binitsF., \bauthor\bsnmLijoi, \bfnmAntonio\binitsA., \bauthor\bsnmOrbanz, \bfnmPeter\binitsP. and \bauthor\bsnmPrünster, \bfnmIgor\binitsI. (\byear2019). \btitleDistribution theory for hierarchical processes. \bjournalThe Annals of Statistics. \endbibitem
[11] {barticle}[author] \bauthor\bsnmCarvalho, \bfnmCarlos M\binitsC. M., \bauthor\bsnmChang, \bfnmJeffrey\binitsJ., \bauthor\bsnmLucas, \bfnmJoseph E\binitsJ. E., \bauthor\bsnmNevins, \bfnmJoseph R\binitsJ. R., \bauthor\bsnmWang, \bfnmQuanli\binitsQ. and \bauthor\bsnmWest, \bfnmMike\binitsM. (\byear2008). \btitleHigh-dimensional sparse factor modeling: Applications in gene expression genomics. \bjournalJournal of the American Statistical Association. \endbibitem
[12] {barticle}[author] \bauthor\bsnmCemgil, \bfnmTaylan\binitsT., \bauthor\bsnmGhaisas, \bfnmSumedh\binitsS., \bauthor\bsnmDvijotham, \bfnmKrishnamurthy\binitsK., \bauthor\bsnmGowal, \bfnmSven\binitsS. and \bauthor\bsnmKohli, \bfnmPushmeet\binitsP. (\byear2020). \btitleThe autoencoding variational autoencoder. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[13] {bmisc}[author] \bauthor\bsnmChakrabarti, \bfnmArhit\binitsA., \bauthor\bsnmNi, \bfnmYang\binitsY., \bauthor\bsnmMorris, \bfnmEllen Ruth A.\binitsE. R. A., \bauthor\bsnmSalinas, \bfnmMichael L.\binitsM. L., \bauthor\bsnmChapkin, \bfnmRobert S.\binitsR. S. and \bauthor\bsnmMallick, \bfnmBani K.\binitsB. K. (\byear2023). \btitleGraphical Dirichlet process for clustering non-exchangeable grouped data. arXiv preprint arXiv 2302.09111. \endbibitem
[14] {barticle}[author] \bauthor\bsnmChandra, \bfnmNoirrit Kiran\binitsN. K., \bauthor\bsnmCanale, \bfnmAntonio\binitsA. and \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. (\byear2023). \btitleEsca** the curse of dimensionality in Bayesian model-based clustering. \bjournalJournal of Machine Learning Research. \endbibitem
[15] {barticle}[author] \bauthor\bsnmChandra, \bfnmNoirrit Kiran\binitsN. K., \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. and \bauthor\bsnmXu, \bfnmJason\binitsJ. (\byear2023). \btitleInferring covariance structure from multiple data sources via subspace factor analysis. arXiv preprint arXiv 2305.04113. \endbibitem
[16] {barticle}[author] \bauthor\bsnmChandra, \bfnmRohitash\binitsR. and \bauthor\bsnmKapoor, \bfnmArpit\binitsA. (\byear2020). \btitleBayesian neural multi-source transfer learning. \bjournalNeurocomputing. \endbibitem
[17] {barticle}[author] \bauthor\bsnmChen, \bfnmMing-Hui\binitsM.-H. and \bauthor\bsnmIbrahim, \bfnmJoseph\binitsJ. (\byear2006). \btitleThe relationship between the power prior and hierarchical models. \bjournalBayesian Analysis. \endbibitem
[18] {barticle}[author] \bauthor\bsnmChen, \bfnmMing-Hui\binitsM.-H. and \bauthor\bsnmIbrahim, \bfnmJoseph G.\binitsJ. G. (\byear2000). \btitlePower prior distributions for regression models. \bjournalStatistical Science. \endbibitem
[19] {barticle}[author] \bauthor\bsnmChen, \bfnmMing-Hui\binitsM.-H., \bauthor\bsnmIbrahim, \bfnmJoseph G.\binitsJ. G., \bauthor\bsnmLam, \bfnmPeter\binitsP., \bauthor\bsnmYu, \bfnmAlan\binitsA. and \bauthor\bsnmZhang, \bfnmYuanye\binitsY. (\byear2011). \btitleBayesian design of noninferiority trials for medical devices using historical data. \bjournalBiometrics. \endbibitem
[20] {barticle}[author] \bauthor\bsnmDai, \bfnmBin\binitsB., \bauthor\bsnmWang, \bfnmZiyu\binitsZ. and \bauthor\bsnmWipf, \bfnmDavid\binitsD. (\byear2020). \btitleThe usual suspects? Reassessing blame for VAE posterior collapse. \bjournalInternational Conference on Machine Learning. \endbibitem
[21] {barticle}[author] \bauthor\bsnmDaniel R. Kowal, \bfnmDavid S. Matteson\binitsD. S. M. and \bauthor\bsnmRuppert, \bfnmDavid\binitsD. (\byear2019). \btitleFunctional autoregression for sparsely sampled data. \bjournalJournal of Business & Economic Statistics. \endbibitem
[22] {barticle}[author] \bauthor\bsnmDavid, \bfnmShai Ben\binitsS. B., \bauthor\bsnmLu, \bfnmTyler\binitsT., \bauthor\bsnmLuu, \bfnmTeresa\binitsT. and \bauthor\bsnmPal, \bfnmDavid\binitsD. (\byear2010). \btitleImpossibility theorems for domain adaptation. \bjournalInternational Conference on Artificial Intelligence and Statistics. \endbibitem
[23] {barticle}[author] \bauthor\bsnmDe Vito, \bfnmRoberta\binitsR., \bauthor\bsnmBellio, \bfnmRuggero\binitsR., \bauthor\bsnmTrippa, \bfnmLorenzo\binitsL. and \bauthor\bsnmParmigiani, \bfnmGiovanni\binitsG. (\byear2019). \btitleMulti-study factor analysis. \bjournalBiometrics. \endbibitem
[24] {barticle}[author] \bauthor\bsnmDeng, \bfnmJia\binitsJ., \bauthor\bsnmDong, \bfnmWei\binitsW., \bauthor\bsnmSocher, \bfnmRichard\binitsR., \bauthor\bsnmLi, \bfnmLi-Jia\binitsL.-J., \bauthor\bsnmLi, \bfnmKai\binitsK. and \bauthor\bsnmFei-Fei, \bfnmLi\binitsL. (\byear2009). \btitleImageNet: A large-scale hierarchical image database. \bjournalIEEE Conference on Computer Vision and Pattern Recognition. \endbibitem
[25] {bbook}[author] \bauthor\bsnmDias, \bfnmSofia\binitsS., \bauthor\bsnmWelton, \bfnmNicky J\binitsN. J., \bauthor\bsnmSutton, \bfnmAlex J\binitsA. J. and \bauthor\bsnmAdes, \bfnmA E\binitsA. E. (\byear2014). \btitleNICE DSU technical support document 2: A generalised linear modelling framework for pairwise and network meta-analysis of randomised controlled trials. \bpublisherNational Institute for Health and Care Excellence (NICE). \endbibitem
[26] {barticle}[author] \bauthor\bsnmDing, \bfnmDaisy Yi\binitsD. Y., \bauthor\bsnmLi, \bfnmShuangning\binitsS., \bauthor\bsnmNarasimhan, \bfnmBalasubramanian\binitsB. and \bauthor\bsnmTibshirani, \bfnmRobert\binitsR. (\byear2022). \btitleCooperative learning for multiview analysis. \bjournalProceedings of the National Academy of Sciences. \endbibitem
[27] {barticle}[author] \bauthor\bsnmDuan, \bfnmYuyan\binitsY., \bauthor\bsnmYe, \bfnmKeying\binitsK. and \bauthor\bsnmSmith, \bfnmEric P.\binitsE. P. (\byear2006). \btitleEvaluating water quality using power priors to incorporate historical information. \bjournalEnvironmetrics. \endbibitem
[28] {barticle}[author] \bauthor\bsnmDurante, \bfnmDaniele\binitsD. and \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. (\byear2018). \btitleBayesian inference and testing of group differences in brain networks. \bjournalBayesian Analysis. \endbibitem
[29] {barticle}[author] \bauthor\bsnmEbrahimi, \bfnmSayna\binitsS., \bauthor\bsnmElhoseiny, \bfnmMohamed\binitsM., \bauthor\bsnmDarrell, \bfnmTrevor\binitsT. and \bauthor\bsnmRohrbach, \bfnmMarcus\binitsM. (\byear2020). \btitleUncertainty-guided continual learning with Bayesian neural networks. \bjournalInternational Conference on Learning Representations. \endbibitem
[30] {barticle}[author] \bauthor\bsnmEleftheriadis, \bfnmStefanos\binitsS., \bauthor\bsnmRudovic, \bfnmOgnjen\binitsO. and \bauthor\bsnmPantic, \bfnmMaja\binitsM. (\byear2014). \btitleDiscriminative shared gaussian processes for multiview and view-invariant facial expression recognition. \bjournalIEEE Transactions on Image Processing. \endbibitem
[31] {barticle}[author] \bauthor\bsnmFerrari, \bfnmFederico\binitsF. and \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. (\byear2021). \btitleBayesian factor analysis for inference on interactions. \bjournalJournal of the American Statistical Association. \endbibitem
[32] {barticle}[author] \bauthor\bsnmGermain, \bfnmPascal\binitsP., \bauthor\bsnmHabrard, \bfnmAmaury\binitsA., \bauthor\bsnmLaviolette, \bfnmFrançois\binitsF. and \bauthor\bsnmMorvant, \bfnmEmilie\binitsE. (\byear2013). \btitleA PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. \bjournalInternational Conference on Machine Learning. \endbibitem
[33] {barticle}[author] \bauthor\bsnmGermain, \bfnmPascal\binitsP., \bauthor\bsnmHabrard, \bfnmAmaury\binitsA., \bauthor\bsnmLaviolette, \bfnmFrançois\binitsF. and \bauthor\bsnmMorvant, \bfnmEmilie\binitsE. (\byear2016). \btitleA new PAC-Bayesian perspective on domain adaptation. \bjournalInternational Conference on Machine Learning. \endbibitem
[34] {barticle}[author] \bauthor\bsnmGermain, \bfnmPascal\binitsP., \bauthor\bsnmHabrard, \bfnmAmaury\binitsA., \bauthor\bsnmLaviolette, \bfnmFrançois\binitsF. and \bauthor\bsnmMorvant, \bfnmEmilie\binitsE. (\byear2020). \btitlePAC-Bayes and domain adaptation. \bjournalNeurocomputing. \endbibitem
[35] {barticle}[author] \bauthor\bsnmGreen, \bfnmPeter J.\binitsP. J. (\byear1995). \btitleReversible jump Markov chain Monte Carlo computation and Bayesian model determination. \bjournalBiometrika. \endbibitem
[36] {barticle}[author] \bauthor\bsnmGuo, \bfnmJian\binitsJ., \bauthor\bsnmLevina, \bfnmElizaveta\binitsE., \bauthor\bsnmMichailidis, \bfnmGeorge\binitsG. and \bauthor\bsnmZhu, \bfnmJi\binitsJ. (\byear2011). \btitleJoint estimation of multiple graphical models. \bjournalBiometrika. \endbibitem
[37] {barticle}[author] \bauthor\bsnmGönen, \bfnmMehmet\binitsM. and \bauthor\bsnmMargolin, \bfnmA. A.\binitsA. A. (\byear2014). \btitleKernelized Bayesian transfer learning. \bjournalAAAI Conference on Artificial Intelligence. \endbibitem
[38] {barticle}[author] \bauthor\bsnmHajiramezanali, \bfnmEhsan\binitsE., \bauthor\bsnmZamani Dadaneh, \bfnmSiamak\binitsS., \bauthor\bsnmKarbalayghareh, \bfnmAlireza\binitsA., \bauthor\bsnmZhou, \bfnmMingyuan\binitsM. and \bauthor\bsnmQian, \bfnmXiaoning\binitsX. (\byear2018). \btitleBayesian multi-domain learning for cancer subtype discovery from next-generation sequencing count data. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[39] {barticle}[author] \bauthor\bsnmHamelijnck, \bfnmOliver\binitsO., \bauthor\bsnmDamoulas, \bfnmTheodoros\binitsT., \bauthor\bsnmWang, \bfnmKangrui\binitsK. and \bauthor\bsnmGirolami, \bfnmMark\binitsM. (\byear2019). \btitleMulti-resolution multi-task Gaussian processes. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[40] {barticle}[author] \bauthor\bsnmHoff, \bfnmPeter\binitsP. (\byear2007). \btitleModeling homophily and stochastic equivalence in symmetric relational data. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[41] {barticle}[author] \bauthor\bsnmHospedales, \bfnmT.\binitsT., \bauthor\bsnmAntoniou, \bfnmA.\binitsA., \bauthor\bsnmMicaelli, \bfnmP.\binitsP. and \bauthor\bsnmStorkey, \bfnmA.\binitsA. (\byear2022). \btitleMeta-learning in neural networks: A survey. \bjournalIEEE Transactions on Pattern Analysis and Machine Intelligence. \endbibitem
[42] {barticle}[author] \bauthor\bsnmIbrahim, \bfnmJ. G.\binitsJ. G., \bauthor\bsnmChen, \bfnmM. H.\binitsM. H. and \bauthor\bsnmSinha, \bfnmD.\binitsD. (\byear2001). \btitleBayesian semiparametric models for survival data with a cure fraction. \bjournalBiometrics. \endbibitem
[43] {barticle}[author] \bauthor\bsnmIbrahim, \bfnmJoseph G.\binitsJ. G., \bauthor\bsnmChen, \bfnmMing-Hui\binitsM.-H., \bauthor\bsnmGwon, \bfnmYeong**\binitsY. and \bauthor\bsnmChen, \bfnmFang\binitsF. (\byear2015). \btitleThe power prior: theory and applications. \bjournalStatistics in Medicine. \endbibitem
[44] {barticle}[author] \bauthor\bsnmIbrahim, \bfnmJoseph G\binitsJ. G., \bauthor\bsnmChen, \bfnmMing-Hui\binitsM.-H. and \bauthor\bsnmSinha, \bfnmDebajyoti\binitsD. (\byear2003). \btitleOn optimality properties of the power prior. \bjournalJournal of the American Statistical Association. \endbibitem
[45] {barticle}[author] \bauthor\bsnmIlse, \bfnmMaximilian\binitsM., \bauthor\bsnmTomczak, \bfnmJakub M.\binitsJ. M., \bauthor\bsnmLouizos, \bfnmChristos\binitsC. and \bauthor\bsnmWelling, \bfnmMax\binitsM. (\byear2020). \btitleDIVA: Domain invariant variational autoencoders. \bjournalConference on Medical Imaging with Deep Learning. \endbibitem
[46] {barticle}[author] \bauthor\bsnmKapoor, \bfnmSanyam\binitsS., \bauthor\bsnmKaraletsos, \bfnmTheofanis\binitsT. and \bauthor\bsnmBui, \bfnmThang D\binitsT. D. (\byear2021). \btitleVariational auto-regressive Gaussian processes for continual learning. \bjournalInternational Conference on Machine Learning. \endbibitem
[47] {barticle}[author] \bauthor\bsnmKarbalayghareh, \bfnmAlireza\binitsA., \bauthor\bsnmQian, \bfnmXiaoning\binitsX. and \bauthor\bsnmDougherty, \bfnmEdward R.\binitsE. R. (\byear2018). \btitleOptimal Bayesian transfer learning. \bjournalIEEE Transactions on Signal Processing. \endbibitem
[48] {barticle}[author] \bauthor\bsnmKarbalayghareh, \bfnmAlireza\binitsA., \bauthor\bsnmQian, \bfnmXiaoning\binitsX. and \bauthor\bsnmDougherty, \bfnmEdward R.\binitsE. R. (\byear2018). \btitleOptimal Bayesian transfer regression. \bjournalIEEE Signal Processing Letters. \endbibitem
[49] {barticle}[author] \bauthor\bsnmKarbalayghareh, \bfnmAlireza\binitsA., \bauthor\bsnmQian, \bfnmXiaoning\binitsX. and \bauthor\bsnmDougherty, \bfnmEdward R.\binitsE. R. (\byear2021). \btitleOptimal Bayesian transfer learning for count data. \bjournalIEEE/ACM Transactions on Computational Biology and Bioinformatics. \endbibitem
[50] {barticle}[author] \bauthor\bsnmKouw, \bfnmWouter M.\binitsW. M. and \bauthor\bsnmLoog, \bfnmMarco\binitsM. (\byear2019). \btitleAn introduction to domain adaptation and transfer learning. arXiv preprint arXiv 1812.11806. \endbibitem
[51] {barticle}[author] \bauthor\bsnmKumar, \bfnmAbhishek\binitsA., \bauthor\bsnmChatterjee, \bfnmSunabha\binitsS. and \bauthor\bsnmRai, \bfnmPiyush\binitsP. (\byear2021). \btitleBayesian structural adaptation for continual learning. \bjournalInternational Conference on Machine Learning. \endbibitem
[52] {barticle}[author] \bauthor\bsnmKuznetsova, \bfnmAlina\binitsA., \bauthor\bsnmRom, \bfnmHassan\binitsH., \bauthor\bsnmAlldrin, \bfnmNeil\binitsN., \bauthor\bsnmUijlings, \bfnmJasper\binitsJ., \bauthor\bsnmKrasin, \bfnmIvan\binitsI., \bauthor\bsnmPont-Tuset, \bfnmJordi\binitsJ., \bauthor\bsnmKamali, \bfnmShahab\binitsS., \bauthor\bsnmPopov, \bfnmStefan\binitsS., \bauthor\bsnmMalloci, \bfnmMatteo\binitsM., \bauthor\bsnmKolesnikov, \bfnmAlexander\binitsA., \bauthor\bsnmDuerig, \bfnmTom\binitsT. and \bauthor\bsnmFerrari, \bfnmVittorio\binitsV. (\byear2020). \btitleThe open images dataset V4. \bjournalInternational Journal of Computer Vision. \endbibitem
[53] {barticle}[author] \bauthor\bsnmLake, \bfnmBrenden M.\binitsB. M., \bauthor\bsnmSalakhutdinov, \bfnmRuslan\binitsR., \bauthor\bsnmGross, \bfnmJason\binitsJ. and \bauthor\bsnmTenenbaum, \bfnmJoshua B.\binitsJ. B. (\byear2011). \btitleOne shot learning of simple visual concepts. \bjournalCognitive Science. \endbibitem
[54] {barticle}[author] \bauthor\bsnmLake, \bfnmBrenden M.\binitsB. M., \bauthor\bsnmSalakhutdinov, \bfnmRuslan\binitsR. and \bauthor\bsnmTenenbaum, \bfnmJoshua B.\binitsJ. B. (\byear2015). \btitleHuman-level concept learning through probabilistic program induction. \bjournalScience. \endbibitem
[55] {barticle}[author] \bauthor\bsnmLake, \bfnmBrenden M\binitsB. M., \bauthor\bsnmSalakhutdinov, \bfnmRuss R\binitsR. R. and \bauthor\bsnmTenenbaum, \bfnmJosh\binitsJ. (\byear2013). \btitleOne-shot learning by inverting a compositional causal process. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[56] {barticle}[author] \bauthor\bsnmLawrence, \bfnmNeil D.\binitsN. D. and \bauthor\bsnmMoore, \bfnmAndrew J.\binitsA. J. (\byear2007). \btitleHierarchical Gaussian process latent variable models. \bjournalInternational Conference on Machine Learning. \endbibitem
[57] {bmisc}[author] \bauthor\bsnmLeBlanc, \bfnmPatrick M.\binitsP. M. and \bauthor\bsnmBanks, \bfnmDavid\binitsD. (\byear2023). \btitleTime-varying Bayesian network meta-analysis. arXiv preprint arXiv 2211.08312. \endbibitem
[58] {barticle}[author] \bauthor\bsnmLock, \bfnmEric F.\binitsE. F. and \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. (\byear2015). \btitleShared kernel Bayesian screening. \bjournalBiometrika. \endbibitem
[59] {barticle}[author] \bauthor\bsnmLong, \bfnmMingsheng\binitsM., \bauthor\bsnmCao, \bfnmYue\binitsY., \bauthor\bsnmWang, \bfnmJianmin\binitsJ. and \bauthor\bsnmJordan, \bfnmMichael I.\binitsM. I. (\byear2015). \btitleLearning Transferable Features with Deep Adaptation Networks. \bjournalInternational Conference on Machine Learning. \endbibitem
[60] {barticle}[author] \bauthor\bsnmLopes, \bfnmHedibert Freitas\binitsH. F. and \bauthor\bsnmWest, \bfnmMike\binitsM. (\byear2004). \btitleBayesian model assessment in factor analysis. \bjournalStatistica Sinica. \endbibitem
[61] {barticle}[author] \bauthor\bsnmLu, \bfnmG.\binitsG. and \bauthor\bsnmAdes, \bfnmA. E.\binitsA. E. (\byear2004). \btitleCombination of direct and indirect evidence in mixed treatment comparisons. \bjournalStatistics in Medicine. \endbibitem
[62] {barticle}[author] \bauthor\bsnmLu, \bfnmGuobing\binitsG. and \bauthor\bsnmAdes, \bfnmA. E.\binitsA. E. (\byear2006). \btitleAssessing evidence inconsistency in mixed treatment comparisons. \bjournalJournal of the American Statistical Association. \endbibitem
[63] {barticle}[author] \bauthor\bsnmLuo, \bfnmZelun\binitsZ., \bauthor\bsnmZou, \bfnmYuliang\binitsY., \bauthor\bsnmHoffman, \bfnmJudy\binitsJ. and \bauthor\bsnmFei-Fei, \bfnmLi F\binitsL. F. (\byear2017). \btitleLabel Efficient Learning of Transferable Representations across Domains and Tasks. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[64] {barticle}[author] \bauthor\bsnmMaity, \bfnmArnab\binitsA., \bauthor\bsnmBhattacharya, \bfnmAnirban\binitsA., \bauthor\bsnmMallick, \bfnmBani\binitsB. and \bauthor\bsnmBaladandayuthapani, \bfnmVeerabhadran\binitsV. (\byear2019). \btitleBayesian data integration and variable selection for pan‐cancer survival prediction using protein expression data. \bjournalBiometrics. \endbibitem
[65] {bincollection}[author] \bauthor\bsnmMcCloskey, \bfnmMichael\binitsM. and \bauthor\bsnmCohen, \bfnmNeal J.\binitsN. J. (\byear1989). \btitleCatastrophic interference in connectionist networks: The sequential learning problem. \bseriesPsychology of Learning and Motivation. \endbibitem
[66] {bmisc}[author] \bauthor\bsnmMolstad, \bfnmAaron J.\binitsA. J., \bauthor\bsnmEkvall, \bfnmKarl Oskar\binitsK. O. and \bauthor\bsnmSuder, \bfnmPiotr M.\binitsP. M. (\byear2022). \btitleDirect covariance matrix estimation with compositional data. arXiv preprint arXiv:2212.09833. \endbibitem
[67] {barticle}[author] \bauthor\bsnmMoran, \bfnmGemma Elyse\binitsG. E., \bauthor\bsnmSridhar, \bfnmDhanya\binitsD., \bauthor\bsnmWang, \bfnmYixin\binitsY. and \bauthor\bsnmBlei, \bfnmDavid\binitsD. (\byear2022). \btitleIdentifiable deep generative models via sparse decoding. \bjournalTransactions on Machine Learning Research. \endbibitem
[68] {barticle}[author] \bauthor\bsnmMüller, \bfnmPeter\binitsP., \bauthor\bsnmQuintana, \bfnmFernando\binitsF. and \bauthor\bsnmRosner, \bfnmGary\binitsG. (\byear2004). \btitleA method for combining inference across related nonparametric Bayesian models. \bjournalJournal of the Royal Statistical Society: Series B (Statistical Methodology). \endbibitem
[69] {barticle}[author] \bauthor\bsnmNiu, \bfnmShuteng\binitsS., \bauthor\bsnmLiu, \bfnmYongxin\binitsY., \bauthor\bsnmWang, \bfnmJian\binitsJ. and \bauthor\bsnmSong, \bfnmHoubing\binitsH. (\byear2020). \btitleA decade survey of transfer learning (2010–2020). \bjournalIEEE Transactions on Artificial Intelligence. \endbibitem
[70] {barticle}[author] \bauthor\bsnmPan, \bfnmSinno Jialin\binitsS. J. and \bauthor\bsnmYang, \bfnmQiang\binitsQ. (\byear2010). \btitleA survey on transfer learning. \bjournalIEEE Transactions on Knowledge and Data Engineering. \endbibitem
[71] {barticle}[author] \bauthor\bsnmPatacchiola, \bfnmMassimiliano\binitsM., \bauthor\bsnmTurner, \bfnmJack\binitsJ., \bauthor\bsnmCrowley, \bfnmElliot J.\binitsE. J., \bauthor\bsnmO'Boyle, \bfnmMichael\binitsM. and \bauthor\bsnmStorkey, \bfnmAmos J\binitsA. J. (\byear2020). \btitleBayesian meta-learning for the few-shot setting via deep kernels. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[72] {barticle}[author] \bauthor\bsnmLi, \bfnmQuefeng\binitsQ., \bauthor\bsnmCheng, \bfnmGuang\binitsG., \bauthor\bsnmFan, \bfnmJianqing\binitsJ. and \bauthor\bsnmWang, \bfnmYuyan\binitsY. (\byear2018). \btitleEmbracing the blessing of dimensionality in factor models. \bjournalJournal of the American Statistical Association. \endbibitem
[73] {barticle}[author] \bauthor\bsnmRavi, \bfnmSachin\binitsS. and \bauthor\bsnmBeatson, \bfnmAlex\binitsA. (\byear2019). \btitleAmortized Bayesian meta-learning. \bjournalInternational Conference on Learning Representations. \endbibitem
[74] {barticle}[author] \bauthor\bsnmLi, \bfnmSai\binitsS., \bauthor\bsnmCai, \bfnmT. Tony\binitsT. T. and \bauthor\bsnmLi, \bfnmHongzhe\binitsH. (\byear2022). \btitleTransfer learning in large-scale Gaussian graphical models with false discovery rate control. \bjournalJournal of the American Statistical Association. \endbibitem
[75] {barticle}[author] \bauthor\bsnmSalakhutdinov, \bfnmRuslan\binitsR., \bauthor\bsnmTenenbaum, \bfnmJoshua\binitsJ. and \bauthor\bsnmTorralba, \bfnmAntonio\binitsA. (\byear2012). \btitleOne-shot learning with a hierarchical nonparametric Bayesian model. \bjournalProceedings of ICML Workshop on Unsupervised and Transfer Learning. \endbibitem
[76] {barticle}[author] \bauthor\bsnmSamorodnitsky, \bfnmSarah\binitsS., \bauthor\bsnmHoadley, \bfnmKatherine\binitsK. and \bauthor\bsnmLock, \bfnmEric\binitsE. (\byear2020). \btitleA pan-cancer and polygenic Bayesian hierarchical model for the effect of somatic mutations on survival. \bjournalCancer Informatics. \endbibitem
[77] {barticle}[author] \bauthor\bsnmShin, \bfnmHoo-Chang\binitsH.-C., \bauthor\bsnmRoth, \bfnmHolger R\binitsH. R., \bauthor\bsnmGao, \bfnmMingchen\binitsM., \bauthor\bsnmLu, \bfnmLe\binitsL., \bauthor\bsnmXu, \bfnmZiyue\binitsZ., \bauthor\bsnmNogues, \bfnmIsabella\binitsI., \bauthor\bsnmYao, \bfnmJianhua\binitsJ., \bauthor\bsnmMollura, \bfnmDaniel\binitsD. and \bauthor\bsnmSummers, \bfnmRonald M\binitsR. M. \btitleDeep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. \bjournalIEEE Transactions on Medical Imaging. \endbibitem
[78] {barticle}[author] \bauthor\bsnmShwartz-Ziv, \bfnmRavid\binitsR., \bauthor\bsnmGoldblum, \bfnmMicah\binitsM., \bauthor\bsnmSouri, \bfnmHossein\binitsH., \bauthor\bsnmKapoor, \bfnmSanyam\binitsS., \bauthor\bsnmZhu, \bfnmChen\binitsC., \bauthor\bsnmLeCun, \bfnmYann\binitsY. and \bauthor\bsnmWilson, \bfnmAndrew G\binitsA. G. (\byear2022). \btitlePre-train your loss: Easy Bayesian transfer learning with informative priors. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[79] {barticle}[author] \bauthor\bsnmSicilia, \bfnmAnthony\binitsA., \bauthor\bsnmAtwell, \bfnmKatherine\binitsK., \bauthor\bsnmAlikhani, \bfnmMalihe\binitsM. and \bauthor\bsnmHwang, \bfnmSeong Jae\binitsS. J. (\byear2022). \btitlePAC-Bayesian domain adaptation bounds for multiclass learners. \bjournalUncertainty in Artificial Intelligence. \endbibitem
[80] {barticle}[author] \bauthor\bsnmSong, \bfnmGuoli\binitsG., \bauthor\bsnmWang, \bfnmShuhui\binitsS., \bauthor\bsnmHuang, \bfnmQingming\binitsQ. and \bauthor\bsnmTian, \bfnmQi\binitsQ. (\byear2019). \btitleHarmonized multimodal learning with Gaussian process latent variable models. \bjournalIEEE Transactions on Pattern Analysis and Machine Intelligence. \endbibitem
[81] {barticle}[author] \bauthor\bsnmSotiropoulos, \bfnmStamatios N\binitsS. N., \bauthor\bsnmHernández-Fernández, \bfnmMoisés\binitsM., \bauthor\bsnmVu, \bfnmAn T\binitsA. T., \bauthor\bsnmAndersson, \bfnmJesper L\binitsJ. L., \bauthor\bsnmMoeller, \bfnmSteen\binitsS., \bauthor\bsnmYacoub, \bfnmEssa\binitsE., \bauthor\bsnmLenglet, \bfnmChristophe\binitsC., \bauthor\bsnmUgurbil, \bfnmKamil\binitsK., \bauthor\bsnmBehrens, \bfnmTimothy E J\binitsT. E. J. and \bauthor\bsnmJbabdi, \bfnmSaad\binitsS. (\byear2016). \btitleFusion in diffusion MRI for improved fibre orientation estimation: An application to the 3T and 7T data of the Human Connectome Project. \bjournalNeuroimage. \endbibitem
[82] {barticle}[author] \bauthor\bsnmSotiropoulos, \bfnmStamatios N\binitsS. N., \bauthor\bsnmJbabdi, \bfnmSaad\binitsS., \bauthor\bsnmAndersson, \bfnmJesper L\binitsJ. L., \bauthor\bsnmWoolrich, \bfnmMark W\binitsM. W., \bauthor\bsnmUgurbil, \bfnmKamil\binitsK. and \bauthor\bsnmBehrens, \bfnmTimothy E J\binitsT. E. J. (\byear2013). \btitleRubiX: combining spatial resolutions for Bayesian inference of crossing fibers in diffusion MRI. \bjournalIEEE Transactions on Medical Imaging. \endbibitem
[83] {barticle}[author] \bauthor\bsnmSpiegelhalter, \bfnmDavid J.\binitsD. J., \bauthor\bsnmBest, \bfnmNicola G.\binitsN. G., \bauthor\bsnmCarlin, \bfnmBradley P.\binitsB. P. and \bauthor\bsnmVan Der Linde, \bfnmAngelika\binitsA. (\byear2002). \btitleBayesian measures of model complexity and fit. \bjournalJournal of the Royal Statistical Society: Series B (Statistical Methodology). \endbibitem
[84] {barticle}[author] \bauthor\bsnmTan, \bfnmChuanqi\binitsC., \bauthor\bsnmSun, \bfnmFuchun\binitsF., \bauthor\bsnmKong, \bfnmTao\binitsT., \bauthor\bsnmZhang, \bfnmWenchang\binitsW., \bauthor\bsnmYang, \bfnmChao\binitsC. and \bauthor\bsnmLiu, \bfnmChunfang\binitsC. (\byear2018). \btitleA survey on deep transfer learning. \bjournalArtificial Neural Networks and Machine Learning. \endbibitem
[85] {barticle}[author] \bauthor\bsnmTan, \bfnmLinda S. L.\binitsL. S. L., \bauthor\bsnmJasra, \bfnmAjay\binitsA., \bauthor\bsnmDe Iorio, \bfnmMaria\binitsM. and \bauthor\bsnmEbbels, \bfnmTimothy M. D.\binitsT. M. D. (\byear2017). \btitleBayesian inference for multiple Gaussian graphical models with application to metabolic association networks. \bjournalThe Annals of Applied Statistics. \endbibitem
[86] {barticle}[author] \bauthor\bsnmTeh, \bfnmYee Whye\binitsY. W., \bauthor\bsnmJordan, \bfnmMichael I.\binitsM. I., \bauthor\bsnmBeal, \bfnmMatthew J.\binitsM. J. and \bauthor\bsnmBlei, \bfnmDavid M.\binitsD. M. (\byear2006). \btitleHierarchical Dirichlet processes. \bjournalJournal of the American Statistical Association. \endbibitem
[87] {barticle}[author] \bauthor\bsnmCai, \bfnmTony\binitsT., \bauthor\bsnmLiu, \bfnmWeidong\binitsW. and \bauthor\bsnmLuo, \bfnmXi\binitsX. (\byear2011). \btitleA constrained $L_{1}$ minimization approach to sparse precision matrix estimation. \bjournalJournal of the American Statistical Association. \endbibitem
[88] {barticle}[author] \bauthor\bsnmTzeng, \bfnmEric\binitsE., \bauthor\bsnmHoffman, \bfnmJudy\binitsJ., \bauthor\bsnmDarrell, \bfnmTrevor\binitsT. and \bauthor\bsnmSaenko, \bfnmKate\binitsK. (\byear2015). \btitleSimultaneous Deep Transfer Across Domains and Tasks. \bjournalIEEE International Conference on Computer Vision. \endbibitem
[89] {binbook}[author] \bauthor\bsnmVanschoren, \bfnmJoaquin\binitsJ. (\byear2019). \btitleMeta-Learning. In \bbooktitleAutomated Machine Learning: Methods, Systems, Challenges. \endbibitem
[90] {barticle}[author] \bauthor\bsnmDe Vito, \bfnmRoberta\binitsR., \bauthor\bsnmBellio, \bfnmRuggero\binitsR., \bauthor\bsnmTrippa, \bfnmLorenzo\binitsL. and \bauthor\bsnmParmigiani, \bfnmGiovanni\binitsG. (\byear2021). \btitleBayesian multistudy factor analysis for high-throughput biological data. \bjournalThe Annals of Applied Statistics. \endbibitem
[91] {barticle}[author] \bauthor\bsnmWang, \bfnmBoyu\binitsB. and \bauthor\bsnmPineau, \bfnmJoelle\binitsJ. (\byear2015). \btitleOnline boosting algorithms for anytime transfer and multitask learning. \bjournalAAAI Conference on Artificial Intelligence. \endbibitem
[92] {barticle}[author] \bauthor\bsnmWang, \bfnmYixin\binitsY., \bauthor\bsnmBlei, \bfnmDavid\binitsD. and \bauthor\bsnmCunningham, \bfnmJohn P\binitsJ. P. (\byear2021). \btitlePosterior collapse and latent variable non-identifiability. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[93] {barticle}[author] \bauthor\bsnmWang, \bfnmZihao\binitsZ. and \bauthor\bsnmZiyin, \bfnmLiu\binitsL. (\byear2022). \btitlePosterior collapse of a linear latent variable model. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[94] {barticle}[author] \bauthor\bsnmWilson, \bfnmAndrew Gordon\binitsA. G., \bauthor\bsnmKnowles, \bfnmDavid A.\binitsD. A. and \bauthor\bsnmGhahramani, \bfnmZoubin\binitsZ. (\byear2012). \btitleGaussian process regression networks. \bjournalInternational Conference on Machine Learning. \endbibitem
[95] {barticle}[author] \bauthor\bsnmWood, \bfnmFrank\binitsF. and \bauthor\bsnmTeh, \bfnmYee Whye\binitsY. W. (\byear2009). \btitleA hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. \bjournalInternational Conference on Artificial Intelligence and Statistics. \endbibitem
[96] {barticle}[author] \bauthor\bsnmXu, \bfnmJason\binitsJ. and \bauthor\bsnmLange, \bfnmKenneth\binitsK. (\byear2022). \btitleA proximal distance algorithm for likelihood-based sparse covariance estimation. \bjournalBiometrika. \endbibitem
[97] {barticle}[author] \bauthor\bsnmXu, \bfnmJu\binitsJ. and \bauthor\bsnmZhu, \bfnmZhanxing\binitsZ. (\byear2018). \btitleReinforced continual learning. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[98] {bmisc}[author] \bauthor\bsnmXu, \bfnmMaoran\binitsM., \bauthor\bsnmHerring, \bfnmAmy H.\binitsA. H. and \bauthor\bsnmDunson, \bfnmDavid B.\binitsD. B. (\byear2023). \btitleIdentifiable and interpretable nonparametric factor analysis. arXiv preprint arXiv:2311.08254. \endbibitem
[99] {barticle}[author] \bauthor\bsnmXuan, \bfnmJunyu\binitsJ., \bauthor\bsnmLu, \bfnmJie\binitsJ. and \bauthor\bsnmZhang, \bfnmGuangquan\binitsG. (\byear2021). \btitleBayesian transfer learning: An overview of probabilistic graphical models for transfer learning. arXiv preprint arXiv:2109.13233. \endbibitem
[100] {bbook}[author] \bauthor\bsnmYang, \bfnmQiang\binitsQ., \bauthor\bsnmZhang, \bfnmYu\binitsY., \bauthor\bsnmDai, \bfnmWenyuan\binitsW. and \bauthor\bsnmPan, \bfnmSinno Jialin\binitsS. J. (\byear2020). \btitleTransfer learning. \bpublisherCambridge University Press. \endbibitem
[101] {barticle}[author] \bauthor\bsnmYoon, \bfnmJaesik\binitsJ., \bauthor\bsnmKim, \bfnmTaesup\binitsT., \bauthor\bsnmDia, \bfnmOusmane\binitsO., \bauthor\bsnmKim, \bfnmSungwoong\binitsS., \bauthor\bsnmBengio, \bfnmYoshua\binitsY. and \bauthor\bsnmAhn, \bfnmSungjin\binitsS. (\byear2018). \btitleBayesian model-agnostic meta-learning. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[102] {barticle}[author] \bauthor\bsnmYosinski, \bfnmJason\binitsJ., \bauthor\bsnmClune, \bfnmJeff\binitsJ., \bauthor\bsnmBengio, \bfnmYoshua\binitsY. and \bauthor\bsnmLipson, \bfnmHod\binitsH. (\byear2014). \btitleHow transferable are features in deep neural networks? \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[103] {barticle}[author] \bauthor\bsnmYousefi, \bfnmFariba\binitsF., \bauthor\bsnmSmith, \bfnmMichael T\binitsM. T. and \bauthor\bsnmÁlvarez, \bfnmMauricio\binitsM. (\byear2019). \btitleMulti-task learning for aggregated data using Gaussian processes. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[104] {barticle}[author] \bauthor\bsnmCao, \bfnmYuanpei\binitsY., \bauthor\bsnmLin, \bfnmWei\binitsW. and \bauthor\bsnmLi, \bfnmHongzhe\binitsH. (\byear2019). \btitleLarge covariance estimation for compositional data via composition-adjusted thresholding. \bjournalJournal of the American Statistical Association. \endbibitem
[105] {barticle}[author] \bauthor\bsnmZhang, \bfnmQiang\binitsQ., \bauthor\bsnmFang, \bfnmJinyuan\binitsJ., \bauthor\bsnmMeng, \bfnmZaiqiao\binitsZ., \bauthor\bsnmLiang, \bfnmShangsong\binitsS. and \bauthor\bsnmYilmaz, \bfnmEmine\binitsE. (\byear2021). \btitleVariational continual Bayesian meta-learning. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[106] {barticle}[author] \bauthor\bsnmZhang, \bfnmWen\binitsW., \bauthor\bsnmDeng, \bfnmLingfei\binitsL., \bauthor\bsnmZhang, \bfnmLei\binitsL. and \bauthor\bsnmWu, \bfnmDongrui\binitsD. (\byear2023). \btitleA survey on negative transfer. \bjournalIEEE/CAA Journal of Automatica Sinica. \endbibitem
[107] {barticle}[author] \bauthor\bsnmZhao, \bfnmTingting\binitsT., \bauthor\bsnmWang, \bfnmZifeng\binitsZ., \bauthor\bsnmMasoomi, \bfnmAria\binitsA. and \bauthor\bsnmDy, \bfnmJennifer\binitsJ. (\byear2022). \btitleDeep Bayesian unsupervised lifelong learning. \bjournalNeural Networks. \endbibitem
[108] {barticle}[author] \bauthor\bsnmZhou, \bfnmAurick\binitsA. and \bauthor\bsnmLevine, \bfnmSergey\binitsS. (\byear2021). \btitleBayesian adaptation for covariate shift. \bjournalAdvances in Neural Information Processing Systems. \endbibitem
[109] {barticle}[author] \bauthor\bsnmZhou, \bfnmJiaying\binitsJ., \bauthor\bsnmDing, \bfnmJie\binitsJ., \bauthor\bsnmTan, \bfnmKean Ming\binitsK. M. and \bauthor\bsnmTarokh, \bfnmVahid\binitsV. (\byear2021). \btitleModel linkage selection for cooperative learning. \bjournalJournal of Machine Learning Research. \endbibitem
[110] {barticle}[author] \bauthor\bsnmZhuang, \bfnmFuzhen\binitsF., \bauthor\bsnmQi, \bfnmZhiyuan\binitsZ., \bauthor\bsnmDuan, \bfnmKeyu\binitsK., \bauthor\bsnmXi, \bfnmDongbo\binitsD., \bauthor\bsnmZhu, \bfnmYongchun\binitsY., \bauthor\bsnmZhu, \bfnmHengshu\binitsH., \bauthor\bsnmXiong, \bfnmHui\binitsH. and \bauthor\bsnmHe, \bfnmQing\binitsQ. (\byear2021). \btitleA comprehensive survey on transfer learning. \bjournalProceedings of the IEEE. \endbibitem