Meta-Learning with Versatile Loss Geometries
for Fast Adaptation Using Mirror Descent

Abstract

Utilizing task-invariant prior knowledge extracted from related tasks, meta-learning is a principled framework that empowers learning a new task especially when data records are limited. A fundamental challenge in meta-learning is how to quickly “adapt” the extracted prior in order to train a task-specific model within a few optimization steps. Existing approaches deal with this challenge using a preconditioner that enhances convergence of the per-task training process. Though effective in representing locally a quadratic training loss, these simple linear preconditioners can hardly capture complex loss geometries. The present contribution addresses this limitation by learning a nonlinear mirror map, which induces a versatile distance metric to enable capturing and optimizing a wide range of loss geometries, hence facilitating the per-task training. Numerical tests on few-shot learning datasets demonstrate the superior expressiveness and convergence of the advocated approach.

Index Terms— Meta-learning, bilevel optimization, mirror descent, loss geometries

1 Introduction

The success of deep learning relies heavily on large-scale and high-dimensional models, which require extensive training on large amounts of data. However, this “data-driven learning” approach is not feasible in applications where data are scarce due to costly data collection and labeling. Examples of such applications include drug discovery [1], machine translation [2], and robot manipulation [3].

In contrast, meta-learning offers a powerful approach for learning a task in data-limited setups. Specifically, meta-learning extracts task-invariant prior information from a collection of given tasks, that can subsequently aid learning of a new, albeit related task. Although this new task may have limited training data, the prior serves as a strong inductive bias that effectively transfers knowledge to aid its learning. In image classification for instance, a feature extractor learned from a collection of given tasks can act as a common prior, and thus benefit a variety of other image classification tasks.

Depending on how this “data-limited learning” is performed, meta-learning algorithms can be categorized into neural network (NN)- and optimization-based ones. In NN-based ones, the per-task learning is viewed as an NN mapping from its training data to task-specific model parameters [4, 5]. The prior information is encoded in the NN weights, which are shared and optimized across tasks. With the universality of NNs in approximating complex mappings granted, their black-box structure challenges their reliability and interpretability. On the other hand, optimization-based meta-learning alternatives interpret “data-limited learning” as a cascade of a few optimization iterations (a.k.a. adaptation) over the model parameters. The prior here is captured by the shared hyperparameters of the iterative optimizer. A representative of these alternatives is the model-agnostic meta-learning (MAML) [6], which views the prior as a learnable task-invariant initialization of the optimizer. By starting from an informative initial point, the model parameters can rapidly converge to local minima within a few gradient descent (GD) steps. Building upon MAML, a series of variants have been proposed to learn different priors [7, 8, 9].

While optimization-based meta-learning has proven effective empirically, recent studies suggest that its generalization and stability heavily rely on convergence of per-task optimization [7, 9]. This motivates one to grow the number of descent iterations. However, this can be infeasible as the overall complexity of meta-learning scales linearly with the number of GD steps [7]. Besides, using accelerated first-order optimizers, such as Adam [10], introduces extra backpropagation complexity when optimizing the prior. To improve the per-task convergence without markedly adding to the complexity, another line of research focuses on second-order optimization using a learnable precondition matrix of simple form [11, 12, 13, 14, 15, 16]. In fact, the precondition matrix captures the local quadratic curvature of the training loss, and linearly transforms the gradient based on this curvature. To acquire more expressive preconditioners, recent advances suggest replacing the linear matrix multiplication with a nonlinear NN transformation [17]. However, convergence of this NN-manipulated GD remains uncharted territory.

The present work advocates learning a generic distance metric induced by a strictly increasing nonlinear mirror map, which enables efficient optimization over generic loss geometries. All in all, our contribution is three-fold.

i) Broadening linear preconditioners with guaranteed per-task convergence.

ii) Blockwise inverse autoregressive flow (blockIAF) ensuring monotonicity and scalability of the mirror map.

iii) Numerical tests showing superior performance and improved convergence compared to linear preconditioners.

2 Problem setup

To enable “data-limited learning” of a new task, meta-learning forms task-invariant priors using a collection of given tasks indexed by $t=1,\ldots,T$. Each task comprises a dataset $\mathcal{D}_{t}:=\{(\mathbf{x}_{t}^{n},y_{t}^{n})\}_{n=1}^{N_{t}}$ consisting of $N_{t}$ data-label pairs, which are split into a training subset $\mathcal{D}^{\mathrm{trn}}_{t}$ and a disjoint validation subset $\mathcal{D}^{\mathrm{val}}_{t}$. The new task, indexed by $\star$, contains a training subset $\mathcal{D}^{\mathrm{trn}}_{\star}$ and a set of test data $\{\mathbf{x}_{\star}^{n}\}_{n=1}^{N_{\star}^{\mathrm{tst}}}$ for which the corresponding labels $\{y_{\star}^{n}\}_{n=1}^{N_{\star}^{\mathrm{tst}}}$ are to be predicted. The key premise of meta-learning is that all the aforementioned tasks share related model structures or data distributions. Thus, one can postulate a large model shared across all tasks, along with distinct model parameters $\boldsymbol{\phi}_{t}\in\mathbb{R}^{d}$ per individual task. But since the cardinality $N_{t}^{\mathrm{trn}}:=|\mathcal{D}^{\mathrm{trn}}_{t}|$ can be much smaller than $d$, learning a task by directly optimizing $\boldsymbol{\phi}_{t}$ over $\mathcal{D}^{\mathrm{trn}}_{t}$ is impractical. Fortunately, since $T$ is considerably large, a task-invariant prior can be learned using $\{\mathcal{D}^{\mathrm{val}}_{t}\}_{t=1}^{T}$ to render per-task learning well posed.

Letting $\boldsymbol{\theta}\in\mathbb{R}^{d^{\prime}}$ denote the vector parameter of the prior, the meta-learning objective can be formulated as a bilevel optimization problem. The lower-level trains each task-specific model by optimizing $\boldsymbol{\phi}_{t}$ using $\mathcal{D}^{\mathrm{trn}}_{t}$ and $\boldsymbol{\theta}$ from the upper-level. The upper-level adjusts $\boldsymbol{\theta}$ by evaluating the optimized $\boldsymbol{\phi}_{t}$ on the validation sets $\{\mathcal{D}^{\mathrm{val}}_{t}\}_{t=1}^{T}$ . The two levels depend on each other and yield the following nested objective

		$\displaystyle\min_{\boldsymbol{\theta}}\sum_{t=1}^{T}\mathcal{L}(\boldsymbol{\phi}_{t}^{*}(\boldsymbol{\theta});\mathcal{D}^{\mathrm{val}}_{t})$		(1a)
		$\displaystyle\text{s.t.}~~\boldsymbol{\phi}_{t}^{*}(\boldsymbol{\theta})=\operatornamewithlimits{argmin}_{\boldsymbol{\phi}_{t}}\mathcal{L}(\boldsymbol{\phi}_{t};\mathcal{D}^{\mathrm{trn}}_{t})+\mathcal{R}(\boldsymbol{\phi}_{t};\boldsymbol{\theta}),~\forall t$		(1b)

where $\mathcal{L}$ is the loss function capturing each task-specific model fit, and $\mathcal{R}$ is the regularizer accounting for the task-invariant prior. From the Bayesian viewpoint, $\mathcal{L}$ and $\mathcal{R}$ represent the negative log-likelihood (nll), $-\log p(\mathbf{y}_{t}^{\mathrm{trn}}|\boldsymbol{\phi}_{t};\mathbf{X}_{t}^{\mathrm{trn}})$, and the negative log-prior (nlp), $-\log p(\boldsymbol{\phi}_{t};\boldsymbol{\theta})$, where $\mathbf{X}_{t}^{\mathrm{trn}}:=[\mathbf{x}_{t}^{1},\ldots,\mathbf{x}_{t}^{N_{t}^{\mathrm{trn}}}]$ and $\mathbf{y}_{t}^{\mathrm{trn}}:=[y_{t}^{1},\ldots,y_{t}^{N_{t}^{\mathrm{trn}}}]^{\top}$ (${}^{\top}$ denotes transpose). Bayes’ rule then implies that $\boldsymbol{\phi}_{t}^{*}=\operatornamewithlimits{argmin}-\log p(\boldsymbol{\phi}_{t}|\mathbf{y}_{t}^{\mathrm{trn}};\mathbf{X}_{t}^{\mathrm{trn}},\boldsymbol{\theta})$ is the maximum a posteriori (MAP) estimator.

Reaching the global optimum $\boldsymbol{\phi}_{t}^{*}$ is generally infeasible because the task-specific model is nonlinear. Hence, a prudent remedy is to rely on an approximate solution $\hat{\boldsymbol{\phi}}_{t}\approx\boldsymbol{\phi}_{t}^{*}$ obtained by a tractable optimizer. For instance, MAML replaces (1b) with a $K$-step GD minimizing the nll:

		$\displaystyle\boldsymbol{\phi}_{t}^{(k)}(\boldsymbol{\theta})=\boldsymbol{\phi}_{t}^{(k-1)}(\boldsymbol{\theta})-\alpha\nabla\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)}(\boldsymbol{\theta});\mathcal{D}^{\mathrm{trn}}_{t}),~\forall t$		(2)

where $k=1,\ldots,K$ indexes iterations; the initialization is $\boldsymbol{\phi}_{t}^{(0)}=\boldsymbol{\phi}^{(0)}=\boldsymbol{\theta}$; the approximate solution is $\hat{\boldsymbol{\phi}}_{t}(\boldsymbol{\theta})=\boldsymbol{\phi}_{t}^{(K)}(\boldsymbol{\theta})$; and $\alpha$ denotes the step size. Although $\mathcal{R}(\boldsymbol{\phi}_{t};\boldsymbol{\theta})=0$ in MAML, it has been shown that the GD solver satisfies [18]

		$\displaystyle\hat{\boldsymbol{\phi}}_{t}(\boldsymbol{\theta})\approx\boldsymbol{\phi}_{t}^{*}(\boldsymbol{\theta})=\operatornamewithlimits{argmin}_{\boldsymbol{\phi}_{t}}\mathcal{L}(\boldsymbol{\phi}_{t};\mathcal{D}^{\mathrm{trn}}_{t})+\frac{1}{2}\|\boldsymbol{\phi}_{t}-\boldsymbol{\theta}\|_{\mathbf{\Lambda}_{t}}^{2},~\forall t$

where the precision matrix $\mathbf{\Lambda}_{t}$ is determined by $\nabla^{2}\mathcal{L}(\boldsymbol{\theta};\mathcal{D}^{\mathrm{trn}}_{t})$, $\alpha$, and $K$. This indicates that MAML’s optimization strategy (2) is approximately tantamount to an implicit Gaussian prior probability density function (pdf) $p(\boldsymbol{\phi}_{t};\boldsymbol{\theta})=\mathcal{N}(\boldsymbol{\theta},\mathbf{\Lambda}_{t}^{-1})$, with the task-invariant initialization serving as the mean vector. Alongside implicit priors, their explicit counterparts have also been investigated with various prior pdfs [7, 9].
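As a minimal sketch of the $K$-step GD solver (2), the snippet below adapts a shared initialization to a single task. The toy quadratic loss, the helper names `grad_quadratic` and `k_step_gd`, and all numerical values are assumptions made purely for illustration; they are not taken from the experiments reported here.

```python
def grad_quadratic(phi, a, b):
    """Gradient of the toy loss L(phi) = 0.5 * sum_i a_i * (phi_i - b_i)^2 (assumed)."""
    return [ai * (pi - bi) for pi, ai, bi in zip(phi, a, b)]


def k_step_gd(theta, grad_fn, alpha, K):
    """Run K plain GD steps as in (2), starting from the shared initialization theta."""
    phi = list(theta)  # phi^(0) = theta
    for _ in range(K):
        grad = grad_fn(phi)
        phi = [p - alpha * g for p, g in zip(phi, grad)]
    return phi  # phi^(K), the approximate per-task solution


# Hypothetical task: curvatures a and minimizer b differ per coordinate.
a = [1.0, 10.0]
b = [1.0, -2.0]
theta = [0.0, 0.0]
phi_hat = k_step_gd(theta, lambda p: grad_quadratic(p, a, b), alpha=0.05, K=5)
```

On this toy loss, each coordinate contracts toward its minimizer by a factor $(1-\alpha a_{i})$ per step, so with a single step size the flat coordinate lags far behind the steep one after $K=5$ steps. This imbalance is what a preconditioner is meant to correct.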

For both implicit and explicit priors, numerical studies [11, 13] and theoretical analyses [7, 9] demonstrate that the gradient error for optimizing $\boldsymbol{\theta}$ in (1a) relies on the convergence accuracy of $\hat{\boldsymbol{\phi}}_{t}$ relative to a stationary point. In addition, employing a large $K$ or complicated optimizers could prohibitively escalate the overall complexity for solving (2). As a consequence, attention has been directed towards preconditioned GD (PGD) solvers, as in the update

		$\displaystyle\boldsymbol{\phi}_{t}^{(k)}(\boldsymbol{\theta})=\boldsymbol{\phi}_{t}^{(k-1)}(\boldsymbol{\theta})-\alpha\mathbf{P}(\boldsymbol{\theta}_{P})\nabla\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)}(\boldsymbol{\theta});\mathcal{D}^{\mathrm{trn}}_{t})$		(3)

where $\boldsymbol{\theta}_{P}$ parametrizes $\mathbf{P}\in\mathbb{R}^{d\times d}$, and the prior parameter is augmented as $\boldsymbol{\theta}:=[\boldsymbol{\phi}^{(0)\top},\boldsymbol{\theta}_{P}^{\top}]^{\top}$. To ensure (3) incurs affordable complexity after preconditioning, $\mathbf{P}$ must have a simple enough structure so that $\mathbf{P}(\boldsymbol{\theta}_{P})\nabla\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)};\mathcal{D}^{\mathrm{trn}}_{t})$ incurs computational complexity $\mathcal{O}(d)$. Examples of such structures include diagonal [11, 12], block-diagonal [13, 14], and NN-based [15] matrices. A more generic preconditioner can be formed by replacing the linear transformation $\mathbf{P}(\boldsymbol{\theta}_{P})\nabla\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)};\mathcal{D}^{\mathrm{trn}}_{t})$ with a nonlinear NN $f(\nabla\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)};\mathcal{D}^{\mathrm{trn}}_{t});\boldsymbol{\theta}_{P})$ [17], but unfortunately convergence of this alternative iterate may not be guaranteed.

Essentially, GD conducts a per-step greedy search with a quadratic loss approximation. To see this, let $\text{lin}(\mathcal{L}(\boldsymbol{\phi}_{t}),\bar{\boldsymbol{\phi}}_{t}):=\mathcal{L}(\bar{\boldsymbol{\phi}}_{t};\mathcal{D}^{\mathrm{trn}}_{t})+(\boldsymbol{\phi}_{t}-\bar{\boldsymbol{\phi}}_{t})^{\top}\nabla\mathcal{L}(\bar{\boldsymbol{\phi}}_{t};\mathcal{D}^{\mathrm{trn}}_{t})$. Using this linearization of $\mathcal{L}$ at $\bar{\boldsymbol{\phi}}_{t}\in\mathbb{R}^{d}$, the GD update reduces to (cf. (2))

		$\displaystyle\boldsymbol{\phi}_{t}^{(k)}=\operatornamewithlimits{argmin}_{\boldsymbol{\phi}_{t}}\text{lin}(\mathcal{L}(\boldsymbol{\phi}_{t}),\boldsymbol{\phi}_{t}^{(k-1)})+\frac{1}{2\alpha}\|\boldsymbol{\phi}_{t}-\boldsymbol{\phi}_{t}^{(k-1)}\|_{2}^{2}$		(4)

where dependencies on $\boldsymbol{\theta}$ are dropped hereafter for notational brevity. The term $\frac{1}{2\alpha}\|\boldsymbol{\phi}_{t}-\boldsymbol{\phi}_{t}^{(k-1)}\|_{2}^{2}$ implies the isotropic approximation $\nabla^{2}\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)};\mathcal{D}^{\mathrm{trn}}_{t})\approx\frac{1}{\alpha}\mathbf{I}_{d}$, while (3) refines the approximation as a more informative matrix $\frac{1}{\alpha}\mathbf{P}^{-1}$ (if invertible). This quadratic local approximation is particularly effective when $K$ is large and $\alpha$ is small, so that $\boldsymbol{\phi}_{t}^{(k)}$ gradually approaches a stationary point. In meta-learning however, the standard setup relies on a small $K$ (e.g., $1$ or $5$) and a sufficiently large $\alpha$, so that the model can quickly adapt to the task with low complexity. This tradeoff highlights the need for learning more expressive loss geometries.

3 Loss Geometries using Mirror Descent

Instead of quadratic approximations of the local loss induced by certain norms (e.g., $\|\cdot\|_{2}$ and $\|\cdot\|_{\mathbf{P}^{-1}}$), our key idea is a data-driven distance metric that captures a broader spectrum of loss geometries. This is accomplished by learning the so-termed “mirror map,” which will be introduced first. All the proofs are deferred to Appendix A.

3.1 Modeling the loss geometry using the mirror map

To generalize the (P)GD, we will replace the $\ell_{2}$ -norm in (4) with a generic metric $D_{h}$ to arrive at

		$\displaystyle\boldsymbol{\phi}_{t}^{(k)}=\operatornamewithlimits{argmin}_{\boldsymbol{\phi}_{t}}\text{lin}(\mathcal{L}(\boldsymbol{\phi}_{t}),\boldsymbol{\phi}_{t}^{(k-1)})+\frac{1}{\alpha}D_{h}(\boldsymbol{\phi}_{t},\boldsymbol{\phi}_{t}^{(k-1)})$		(5)

where $D_{h}(\boldsymbol{\phi}_{t},\boldsymbol{\phi}_{t}^{(k-1)}):=h(\boldsymbol{\phi}_{t})-\text{lin}(h(\boldsymbol{\phi}_{t}),\boldsymbol{\phi}_{t}^{(k-1)})$ is the Bregman divergence, and the associated distance-generating function $h:\mathbb{R}^{d}\mapsto\mathbb{R}$ is strongly convex to ensure the existence and uniqueness of the minimizer. As a result, $\nabla h$ is strictly increasing, and thus invertible. (Footnote 1: When $\nabla h$ is discontinuous but $h$ is proper, the inverse $(\nabla h)^{-1}$ is defined as $\nabla h^{*}(\mathbf{z}):=\operatornamewithlimits{argmax}_{\boldsymbol{\phi}}\boldsymbol{\phi}^{\top}\mathbf{z}-h(\boldsymbol{\phi})$, where $h^{*}(\mathbf{z}):=\sup_{\boldsymbol{\phi}}\boldsymbol{\phi}^{\top}\mathbf{z}-h(\boldsymbol{\phi})$ is the Fenchel conjugate of $h$.) Then, applying the optimality condition leads to the mirror descent (MD) update

		$\displaystyle\boldsymbol{\phi}_{t}^{(k)}=(\nabla h)^{-1}\big(\nabla h(\boldsymbol{\phi}_{t}^{(k-1)})-\alpha\nabla\mathcal{L}(\boldsymbol{\phi}_{t}^{(k-1)};\mathcal{D}^{\mathrm{trn}}_{t})\big).$		(6)

The invertible $\nabla h$, dubbed mirror map, connects $\boldsymbol{\phi}_{t}$ in the primal space to $\nabla\mathcal{L}$ in the dual space under the endowed metric $D_{h}$. As a special case, when choosing $h(\cdot)=\frac{1}{2}\|\cdot\|_{2}^{2}$, it is easy to verify that (6) boils down to (2) due to the self-duality of the $\ell_{2}$-norm. Likewise, (3) can be obtained with $h(\cdot)=\frac{1}{2}\|\cdot\|_{\mathbf{P}^{-1}}^{2}$, where $\nabla h$ reduces to a linear mapping. Function $h$ reflects our prior knowledge about the geometry of $\mathcal{L}$. In particular, letting $h(\cdot)=\mathcal{L}(\cdot;\mathcal{D}^{\mathrm{trn}}_{t})$ (even when $\mathcal{L}$ is not strongly convex) in (5) gives $\boldsymbol{\phi}_{t}^{(k)}=\operatornamewithlimits{argmin}_{\boldsymbol{\phi}_{t}}\mathcal{L}(\boldsymbol{\phi}_{t};\mathcal{D}^{\mathrm{trn}}_{t})$, which is precisely the original nll minimization solved in (2) and (3). Thus, an ideal choice of $h$ would yield $h\approx\mathcal{L}$ (up to a constant) within a sufficiently large region around $\boldsymbol{\phi}_{t}^{(k-1)}$.
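For completeness, the step from (5) to (6) and the preconditioned special case can be written out as follows. This is a short worked derivation under the stated assumptions that $h$ is differentiable and, in the special case, that $\mathbf{P}$ is symmetric positive definite; the argument $\mathcal{D}^{\mathrm{trn}}_{t}$ of $\nabla\mathcal{L}$ is abbreviated in the last line.

```latex
% First-order optimality of the objective in (5) with respect to \phi_t
\begin{align*}
\mathbf{0} &= \nabla\mathcal{L}(\boldsymbol{\phi}_t^{(k-1)};\mathcal{D}_t^{\mathrm{trn}})
  + \tfrac{1}{\alpha}\big(\nabla h(\boldsymbol{\phi}_t^{(k)}) - \nabla h(\boldsymbol{\phi}_t^{(k-1)})\big) \\
\Longrightarrow\quad
\nabla h(\boldsymbol{\phi}_t^{(k)}) &= \nabla h(\boldsymbol{\phi}_t^{(k-1)})
  - \alpha\nabla\mathcal{L}(\boldsymbol{\phi}_t^{(k-1)};\mathcal{D}_t^{\mathrm{trn}})
\end{align*}
% Applying (\nabla h)^{-1} to both sides gives (6).
% Special case h = (1/2)||.||^2_{P^{-1}}, assuming P is symmetric positive definite:
\[
\nabla h(\boldsymbol{\phi}) = \mathbf{P}^{-1}\boldsymbol{\phi},\quad
(\nabla h)^{-1}(\mathbf{z}) = \mathbf{P}\mathbf{z}
\;\Longrightarrow\;
\boldsymbol{\phi}_t^{(k)}
= \mathbf{P}\big(\mathbf{P}^{-1}\boldsymbol{\phi}_t^{(k-1)} - \alpha\nabla\mathcal{L}(\boldsymbol{\phi}_t^{(k-1)})\big)
= \boldsymbol{\phi}_t^{(k-1)} - \alpha\mathbf{P}\nabla\mathcal{L}(\boldsymbol{\phi}_t^{(k-1)})
\]
```

Setting $\mathbf{P}=\mathbf{I}_{d}$ in the last line recovers the plain GD step (2).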

Different from past works that rely on a simple preselected $h$ to model loss geometries, we here acquire a data-driven $h$ by learning a strictly increasing $\nabla h$ that best fits the given tasks. Interestingly, (6) can be reformulated to yield an update of the dual vector $\mathbf{z}_{t}:=\nabla h(\boldsymbol{\phi}_{t})$ as

		$\displaystyle\mathbf{z}_{t}^{(k)}=\mathbf{z}_{t}^{(k-1)}-\alpha\nabla\mathcal{L}\big((\nabla h)^{-1}(\mathbf{z}_{t}^{(k-1)});\mathcal{D}^{\mathrm{trn}}_{t}\big)$		(7)

with $\mathbf{z}_{t}^{(0)}=\nabla h(\boldsymbol{\phi}^{(0)})$ and $\hat{\boldsymbol{\phi}}_{t}=(\nabla h)^{-1}(\mathbf{z}_{t}^{(K)})$ . Hence, it suffices to learn a strictly increasing $(\nabla h)^{-1}$ and a task-invariant dual initialization $\mathbf{z}^{(0)}:=\nabla h(\boldsymbol{\phi}^{(0)})$ , thus removing the need for directly calculating $\nabla h$ .
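The dual update (7) can be sketched as below. The elementwise inverse mirror map `scale_map` (that is, $g(\mathbf{z})=\mathbf{s}\odot\mathbf{z}$ with $\mathbf{s}>0$), the toy quadratic loss, and all names and values are assumptions chosen for illustration; a learned nonlinear $(\nabla h)^{-1}$ would take the place of `scale_map`. With this linear choice the iterates should coincide with PGD (3) using $\mathbf{P}=\text{diag}(\mathbf{s})$, which the snippet computes as a reference.

```python
def grad_quadratic(phi, a, b):
    """Gradient of the toy loss L(phi) = 0.5 * sum_i a_i * (phi_i - b_i)^2 (assumed)."""
    return [ai * (pi - bi) for pi, ai, bi in zip(phi, a, b)]


def dual_mirror_descent(z0, inv_mirror, grad_fn, alpha, K):
    """K steps of the dual update (7); returns the primal estimate inv_mirror(z^(K))."""
    z = list(z0)
    for _ in range(K):
        phi = inv_mirror(z)      # map the dual vector back to primal parameters
        grad = grad_fn(phi)      # the loss gradient is evaluated in the primal space
        z = [zi - alpha * gi for zi, gi in zip(z, grad)]
    return inv_mirror(z)


# Hypothetical task and a hypothetical linear inverse mirror map g(z) = s * z.
a, b = [1.0, 10.0], [1.0, -2.0]
s = [1.0, 0.1]


def scale_map(z):
    return [si * zi for si, zi in zip(s, z)]


z0 = [0.0, 0.0]  # dual initialization; here it maps to phi^(0) = [0, 0]
phi_md = dual_mirror_descent(z0, scale_map, lambda p: grad_quadratic(p, a, b),
                             alpha=0.5, K=3)

# Reference: PGD (3) with P = diag(s), started from the same primal point.
phi_pgd = [0.0, 0.0]
for _ in range(3):
    grad = grad_quadratic(phi_pgd, a, b)
    phi_pgd = [p - 0.5 * si * gi for p, si, gi in zip(phi_pgd, s, grad)]
```

Only the inverse map is ever evaluated inside the loop, which mirrors the observation that $\nabla h$ itself need not be computed.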

3.2 Learning the inverse mirror map via blockIAF

Inspired by this observation, a prudent option is to model $(\nabla h)^{-1}$ as an inverse autoregressive flow (IAF) [19]. The notable benefit of IAF lies in its efficient parallelization of forward computation, that makes it considerably faster than computing its inverse. However, directly applying the dimension-wise IAF to the high-dimensional $\mathbf{z}_{t}\in\mathbb{R}^{d}$ will incur prohibitively high complexity of $\Omega(d^{2})$ . For this reason, we introduce a novel blockIAF model that effectively reduces complexity by performing block-wise (nonlinear) autoregression on a low-dimensional space encoding $\mathbf{z}_{t}$ . To this end, let $\{\mathcal{B}_{i}\}_{i=1}^{B}$ be a partition of the index set $\{1,\ldots,d\}$ , and $[\mathbf{z}_{t}]_{\mathcal{B}_{i}}$ denote the subvector of $\mathbf{z}_{t}$ restricted to the block $\mathcal{B}_{i}$ . The blockIAF model transforms $\mathbf{z}_{t}$ to $\boldsymbol{\phi}_{t}$ through

	$\displaystyle[\boldsymbol{\phi}_{t}]_{\mathcal{B}_{i}}$	$\displaystyle=[\mathbf{z}_{t}]_{\mathcal{B}_{i}}\odot\sigma(\boldsymbol{\alpha}_{i})+\boldsymbol{\mu}_{i}$		(8a)
	$\displaystyle[\boldsymbol{\alpha}_{i}^{\top},\boldsymbol{\mu}_{i}^{\top}]^{\top}$	$\displaystyle=d_{i}\big(\{e_{j}([\mathbf{z}_{t}]_{\mathcal{B}_{j}})\}_{j=1}^{i-1}\big),~i=1,\ldots,B$		(8b)

where nonlinearity $\sigma$ is positive and upper bounded (e.g., logistic function), $\sigma(\boldsymbol{\alpha}_{i}),\boldsymbol{\mu}_{i}\in\mathbb{R}^{|\mathcal{B}_{i}|}$ are the scale and shift of $[\mathbf{z}_{t}]_{\mathcal{B}_{i}}$, $e_{i}$ and $d_{i}$ denote learnable encoder and decoder for the $i$-th block, and $\odot$ is the Hadamard (element-wise) product. In our implementation, $\{e_{i}\}_{i=1}^{B-1}$ and $\{d_{i}\}_{i=1}^{B}$ are multilayer perceptrons (MLPs) with ReLU activations. To further reduce complexity, all linear layers in MLPs are implemented by tensor mode product [13]. This technique is equivalent to a low-rank Kronecker approximation to MLPs’ weight matrices. This lowers the per-step MD complexity to $\mathcal{O}(d)$.
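The structure of (8) can be illustrated with the following minimal sketch. The functions `encode` and `decode` are hypothetical fixed stand-ins for the learnable MLPs $e_{j}$ and $d_{i}$, and the partition and input values are assumed for illustration; none of this reproduces the implementation used in the experiments. The sketch evaluates the forward map and then inverts it block by block to expose the autoregressive dependency.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(block):
    """Hypothetical encoder e_j: compress a block into a scalar code (MLP stand-in)."""
    return math.tanh(sum(block) / len(block))


def decode(codes, size):
    """Hypothetical decoder d_i: map the codes of preceding blocks to (alpha_i, mu_i)."""
    s = sum(codes)  # empty sum is 0, so the first block gets constant scale and shift
    alpha = [0.5 * s + 0.1 * n for n in range(size)]
    mu = [0.3 * s - 0.2 * n for n in range(size)]
    return alpha, mu


def block_iaf_forward(z, blocks):
    """Evaluate (8a)-(8b): scale and shift of block i depend only on blocks j < i of z."""
    phi = [0.0] * len(z)
    codes = []
    for idx in blocks:
        alpha, mu = decode(codes, len(idx))
        for n, j in enumerate(idx):
            phi[j] = z[j] * sigmoid(alpha[n]) + mu[n]
        # All of z is known up front, so these codes could also be computed in parallel.
        codes.append(encode([z[j] for j in idx]))
    return phi


def block_iaf_inverse(phi, blocks):
    """Invert (8a) block by block; block i needs the recovered z of all blocks j < i."""
    z = [0.0] * len(phi)
    codes = []
    for idx in blocks:
        alpha, mu = decode(codes, len(idx))
        for n, j in enumerate(idx):
            z[j] = (phi[j] - mu[n]) / sigmoid(alpha[n])
        codes.append(encode([z[j] for j in idx]))
    return z


blocks = [[0, 1], [2, 3, 4], [5]]  # assumed partition of {0, ..., 5}
z = [0.4, -1.2, 2.0, 0.0, -0.7, 1.5]
phi = block_iaf_forward(z, blocks)
z_rec = block_iaf_inverse(phi, blocks)
```

The forward pass only reads the given $\mathbf{z}_{t}$, whereas the inverse must proceed sequentially over blocks, consistent with the remark that the forward computation of an IAF is the cheaper direction.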

The following theorem characterizes two important properties of the proposed blockIAF model.

Theorem 1.

Let $g:\mathbb{R}^{d}\mapsto\mathbb{R}^{d}$ be the blockIAF model (8). For any partition $\{\mathcal{B}_{i}\}_{i=1}^{B}$, $g$ is strictly increasing, that is

		$\displaystyle(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})^{\top}(g(\mathbf{z}_{t})-g(\mathbf{z}_{t}^{\prime}))>0,~~\forall\mathbf{z}_{t}\neq\mathbf{z}_{t}^{\prime}.$		(9)

Moreover, there exists a constant $C>0$ such that

		$\displaystyle\nabla(g^{-1})(\boldsymbol{\phi}_{t})\succeq C.$		(10)

Theorem 1 asserts that with $(\nabla h)^{-1}=g$ , one ensures the desired strict monotonicity, and strong convexity of the induced $h$ (by noting that $\nabla^{2}h=\nabla(g^{-1})$ ). As a result, the per-task optimization (7) enjoys the standard convergence guarantee of MD. Although the convergence rate of MD is in the same order as GD, it outperforms GD markedly in the constant factor when $d$ is large [20], and relies on more relaxed assumptions [21].

Table 1: Comparison of meta-learning algorithms with different loss geometry models on the $5$-class miniImageNet dataset. Maximum and mean accuracies within its $95\%$ confidence interval are in bold. (No model ensembling for a fair comparison.)

Method	Lower-level optimizer	Loss geometry model	$5$-class accuracy ($1$-shot)	$5$-class accuracy ($5$-shot)
MAML [6]	GD	identity matrix	$48.70\pm 1.84\%$	$63.11\pm 0.92\%$
MetaSGD [11]	PGD	diag. matrix	$50.47\pm 1.87\%$	$64.03\pm 0.94\%$
MT-net [14]	PGD	block diag. matrix	$51.70\pm 1.84\%$	$-$
WarpGrad [15]	PGD	NN-based low-rank matrix	$52.3\pm 0.8\%$	$68.4\pm 0.6\%$
MetaCurvature [13]	PGD	block diag. & Kron. (low-rank) matrix	$54.23\pm 0.88\%$	$67.99\pm 0.73\%$
MetaKFO [17]	NN-transformed GD	NN-based gradient transformation	$-$	$64.9\%$
ECML [16]	PGD	Gauss-Newton approximation	$48.94\pm 0.80\%$	$65.26\pm 0.67\%$
This paper’s method	MD	blockIAF-based mirror map	$\mathbf{56.10\pm 1.43\%}$	$\mathbf{69.59\pm 0.71\%}$

Fig. 1, panel (a): $\mathcal{L}(\boldsymbol{\phi}_{\star}^{(k)};\mathcal{D}^{\mathrm{trn}}_{\star})$ versus $k$.

The meta-learning objective (1) is solved using alternating optimization. With $\boldsymbol{\theta}_{g}$ denoting the blockIAF parameters, let $\boldsymbol{\theta}:=[\mathbf{z}^{(0)\top},\boldsymbol{\theta}_{g}^{\top}]^{\top}$ be the prior parameter vector. In the $r$-th iteration of (1a), the optimizer has access to $\boldsymbol{\theta}^{(r-1)}$ provided by its last iteration, and a batch of randomly sampled tasks $\mathcal{T}^{(r)}\subset\{1,\ldots,T\}$. The optimizer first solves for $\hat{\boldsymbol{\phi}}_{t}(\boldsymbol{\theta}^{(r-1)})$ for each $t\in\mathcal{T}^{(r)}$ leveraging the $K$-step MD (7). Then, $\boldsymbol{\theta}^{(r-1)}$ is updated using mini-batch stochastic GD with step size $\beta$:

		$\displaystyle\boldsymbol{\theta}^{(r)}=\boldsymbol{\theta}^{(r-1)}-\beta\frac{T}{|\mathcal{T}^{(r)}|}\sum_{t\in\mathcal{T}^{(r)}}\nabla_{\boldsymbol{\theta}^{(r-1)}}\mathcal{L}(\hat{\boldsymbol{\phi}}_{t}(\boldsymbol{\theta}^{(r-1)});\mathcal{D}^{\mathrm{val}}_{t}).$

A summary of the algorithm can be found in Appendix B.
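The alternating scheme above can be sketched on a toy problem as follows. This is only an illustration under strong assumptions: the tasks are hypothetical quadratics, the prior consists solely of the initialization (no blockIAF parameters), the inner solver is plain GD rather than MD, and central finite differences stand in for backpropagation through the inner loop. All names and values are made up for the example.

```python
import random


def inner_gd(theta, b, alpha=0.5, K=1):
    """K-step GD on the toy training loss 0.5 * ||phi - b||^2, started at theta."""
    phi = list(theta)
    for _ in range(K):
        phi = [p - alpha * (p - bi) for p, bi in zip(phi, b)]
    return phi


def val_loss(theta, b):
    """Validation loss of the adapted parameters (toy: same target b as in training)."""
    phi_hat = inner_gd(theta, b)
    return 0.5 * sum((p - bi) ** 2 for p, bi in zip(phi_hat, b))


def fd_grad(f, theta, eps=1e-5):
    """Central finite differences, used here in place of backpropagation."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        dn = list(theta)
        dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad


tasks = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 1.0]]  # hypothetical targets b_t
T, batch_size, beta, R = len(tasks), 2, 0.1, 100
rng = random.Random(0)
theta = [5.0, 5.0]
loss_init = sum(val_loss(theta, b) for b in tasks)

for r in range(R):
    batch = rng.sample(range(T), batch_size)  # sampled task batch for this iteration
    grad_sum = [0.0] * len(theta)
    for t in batch:
        g = fd_grad(lambda th: val_loss(th, tasks[t]), theta)
        grad_sum = [gs + gi for gs, gi in zip(grad_sum, g)]
    # Mini-batch update with the T / |batch| scaling of the displayed rule.
    theta = [th - beta * (T / batch_size) * gs for th, gs in zip(theta, grad_sum)]

loss_final = sum(val_loss(theta, b) for b in tasks)
```

In this toy setting each outer step moves $\boldsymbol{\theta}$ a fixed fraction of the way toward the mean target of the sampled batch, so the summed validation loss ends well below its initial value.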

4 Numerical tests

Here we compare the empirical performance of optimization-based meta-learning using different lower-level optimizers, on the standard few-shot classification dataset miniImageNet [22], where “shots” signify the per-class training data for each $t$ . The task-specific model is a standard $4$ -layer convolutional NN (CNN) [22, 6]. Each layer comprises a $3\times 3$ convolution of $64$ channels, batch normalization, ReLU activation, and $2\times 2$ max pooling module. After the convolutional layers, a linear regressor with softmax activation is appended to perform classification. Subset $\mathcal{B}_{i}$ is formed by the weight indices of the $i$ -th CNN layer. The autoregression in (8b) implies that “how to optimize weights of the $i$ -th layer” depends on “how weights of previous layers have been optimized.” This choice enables blockIAF to model the optimization dependency of high-level features (e.g., textures and patterns) on low-level ones (e.g., colors and edges). Test setups and hyperparameters can be found in Appendix C.

Table 1 lists various loss geometry models, where classification accuracy on new tasks is the figure of merit. For fairness, MAML is the backbone of all methods. By utilizing a more versatile loss geometry model, our approach outperforms the state-of-the-art ones by a large margin.

To further gauge the performance gain achieved by our novel approach, Fig. 1 visualizes the convergence of $\mathcal{L}(\boldsymbol{\phi}_{\star}^{(k)};\mathcal{D}^{\mathrm{trn}}_{\star})$ averaged over $1,000$ random new tasks. The proposed method results in faster convergence to a lower and more stable nll compared with all three competitors. Moreover, Fig. 1(a) reveals that both the proposed method and MetaCurvature improve the initialization compared to MAML and MetaSGD. This confirms that convergence and generalization of (1a) rely on the convergence accuracy of $\hat{\boldsymbol{\phi}}_{t}$ [7, 9]. Fig. 1(b) further illustrates that although the initial gradients of different methods have comparable norms $\|\nabla\mathcal{L}(\boldsymbol{\phi}_{\star}^{(0)};\mathcal{D}^{\mathrm{trn}}_{\star})\|_{2}$, our method can make better use of the gradient, leading to a rapid reduction of the nll as well as its gradient norm at $k=1$. This improved gradient utilization highlights our method’s superior modeling of loss geometries.

5 Conclusions and outlook

Versatile loss geometry models can accelerate the lower-level convergence in meta-learning. A novel blockIAF model is introduced to learn the inverse mirror map $(\nabla h)^{-1}$ induced by a strongly convex $h$. The resultant algorithm generalizes preconditioning-based meta-learning, captures versatile loss geometries, and improves lower-level convergence. Effectiveness of the novel approach was validated on a standard few-shot dataset. Future research includes bi-level convergence guarantees for the proposed method, and development of more expressive yet scalable inverse mirror maps.

References

[1] Han Altae-Tran, Bharath Ramsundar, Aneesh S. Pappu, and Vijay Pande, “Low data drug discovery with one-shot learning,” ACS Central Science, vol. 3, no. 4, pp. 283–293, 2017.
[2] Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li, “Meta-learning for low-resource neural machine translation,” arXiv preprint arXiv:1808.08437, 2018.
[3] Ignasi Clavera, Anusha Nagabandi, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn, “Learning to adapt in dynamic, real-world environments through meta-reinforcement learning,” in Proc. Int. Conf. Learn. Represent., 2019.
[4] Sachin Ravi and Hugo Larochelle, “Optimization as a model for few-shot learning,” in Proc. Int. Conf. Learn. Represent., 2017.
[5] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel, “A simple neural attentive meta-learner,” in Proc. Int. Conf. Learn. Represent., 2018.
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. Int. Conf. Mach. Learn., 2017, vol. 70, pp. 1126–1135.
[7] Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine, “Meta-learning with implicit gradients,” in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32.
[8] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto, “Meta-learning with differentiable convex optimization,” in Proc. IEEE/CVF Conf. on Comp. Vis. and Pat. Recog., 2019.
[9] Yilang Zhang, Bingcong Li, Shijian Gao, and Georgios B. Giannakis, “Scalable Bayesian meta-learning through generalized implicit gradients,” in Proc. AAAI Conf. Artif. Intel., 2023, vol. 37(9), pp. 11298–11306.
[10] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent., 2015.
[11] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li, “Meta-SGD: Learning to learn quickly for few-shot learning,” arXiv preprint arXiv:1707.09835, 2017.
[12] Boyan Gao, Henry Gouk, Hae Beom Lee, and Timothy M Hospedales, “Meta mirror descent: Optimiser learning for fast convergence,” arXiv preprint arXiv:2203.02711, 2022.
[13] Eunbyung Park and Junier B Oliva, “Meta-curvature,” in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32.
[14] Yoonho Lee and Seungjin Choi, “Gradient-based meta-learning with learned layerwise metric and subspace,” in Proc. Int. Conf. Mach. Learn., 2018, vol. 80, pp. 2927–2936.
[15] Sebastian Flennerhag, Andrei A. Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin, and Raia Hadsell, “Meta-learning with warped gradient descent,” in Proc. Int. Conf. Learn. Represent., 2020.
[16] Markus Hiller, Mehrtash Harandi, and Tom Drummond, “On enforcing better conditioned meta-learning for rapid few-shot adaptation,” in Proc. Adv. Neural Inf. Process. Syst., 2022, vol. 35, pp. 4059–4071.
[17] Sébastien M. R. Arnold, Shariq Iqbal, and Fei Sha, “When MAML can adapt fast and how to assist when it cannot,” in Proc. Int. Conf. Artif. Intel. and Stats., 2021, vol. 130, pp. 244–252.
[18] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths, “Recasting gradient-based meta-learning as hierarchical Bayes,” in Proc. Int. Conf. Learn. Represent., 2018.
[19] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, “Improved variational inference with inverse autoregressive flow,” in Proc. Adv. Neural Inf. Process. Syst., 2016, vol. 29.
[20] Aharon Ben-Tal, Tamar Margalit, and Arkadi Nemirovski, “The ordered subsets mirror descent optimization method with applications to tomography,” SIAM Journal on Optimization, vol. 12, no. 1, pp. 79–108, 2001.
[21] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Stephen Boyd, and Peter W Glynn, “Stochastic mirror descent in variationally coherent optimization problems,” in Proc. Adv. Neural Inf. Process. Syst., 2017, vol. 30.
[22] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra, “Matching networks for one shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2016, vol. 29.
[23] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis, “Compressed sensing using generative models,” in Proc. Int. Conf. Mach. Learn., Doina Precup and Yee Whye Teh, Eds., 2017, vol. 70, pp. 537–546.

Appendix

Appendix A Proof of Theorem 1

Theorem 1 (Restated).

Let $g:\mathbb{R}^{d}\mapsto\mathbb{R}^{d}$ denote the blockIAF model (8). For any partition $\{\mathcal{B}_{i}\}_{i=1}^{B}$, $g$ is strictly increasing, that is

		$\displaystyle(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})^{\top}(g(\mathbf{z}_{t})-g(\mathbf{z}_{t}^{\prime}))>0,~~\forall\mathbf{z}_{t}\neq\mathbf{z}_{t}^{\prime}.$

Moreover, there exists a constant $C>0$ such that

		$\displaystyle\nabla(g^{-1})(\boldsymbol{\phi}_{t})\succeq C.$

Proof.

Let $\pi:=[\mathcal{B}_{1},\ldots,\mathcal{B}_{B}]$ denote a permutation of $\{1,\ldots,d\}$, and $\mathbf{Q}_{\pi}:=\big[[\mathbf{I}_{d}]_{\mathcal{B}_{1}},\ldots,[\mathbf{I}_{d}]_{\mathcal{B}_{B}}\big]\in\mathbb{R}^{d\times d}$ the permutation matrix under $\pi$, where $[\mathbf{I}_{d}]_{\mathcal{B}_{i}}$ is the submatrix of the identity $\mathbf{I}_{d}\in\mathbb{R}^{d\times d}$ restricted to the columns indexed by $\mathcal{B}_{i}$.

Consider the partial derivatives (cf. (8))

		$\displaystyle\frac{\partial[g(\mathbf{z}_{t})]_{\mathcal{B}_{i}}}{\partial[\mathbf{z}_{t}]_{\mathcal{B}_{j}}}=\frac{\partial[\boldsymbol{\phi}_{t}(\mathbf{z}_{t})]_{\mathcal{B}_{i}}}{\partial[\mathbf{z}_{t}]_{\mathcal{B}_{j}}}=\begin{cases}\text{a $|\mathcal{B}_{i}|\times|\mathcal{B}_{j}|$ matrix},&\text{if}~i>j\\ \text{diag}(\sigma(\boldsymbol{\alpha}_{i})),&\text{if}~i=j\\ \mathbf{0}_{d},&\text{otherwise}\end{cases}.$		(11)

It can be verified that the Jacobian $\nabla_{[\mathbf{z}_{t}]_{\pi}}[g(\mathbf{z}_{t})]_{\pi}$ of the permuted parameters is block-upper-triangular, with the $i$-th diagonal block given by $\frac{\partial[g(\mathbf{z}_{t})]_{\mathcal{B}_{i}}}{\partial[\mathbf{z}_{t}]_{\mathcal{B}_{i}}}=\text{diag}(\sigma(\boldsymbol{\alpha}_{i}))\succ 0$. It thus holds that $\nabla_{[\mathbf{z}_{t}]_{\pi}}[g(\mathbf{z}_{t})]_{\pi}\succ 0$, or equivalently, $\mathbf{Q}_{\pi}^{\top}\nabla g(\mathbf{z}_{t})\mathbf{Q}_{\pi}\succ 0$, which implies that

\nabla g(\mathbf{z}_{t})\succ 0,\quad\forall\mathbf{z}_{t}\in\mathbb{R}^{d}.

(12)

Letting $\tilde{g}(\alpha):=g(\alpha\mathbf{z}_{t}+(1-\alpha)\mathbf{z}_{t}^{\prime})$, it holds for all $\mathbf{z}_{t}\neq\mathbf{z}_{t}^{\prime}$ that

\begin{aligned}
(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})^{\top}(g(\mathbf{z}_{t})-g(\mathbf{z}_{t}^{\prime}))&=(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})^{\top}(\tilde{g}(1)-\tilde{g}(0))\\
&=(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})^{\top}\int_{0}^{1}\tilde{g}^{\prime}(\alpha)d\alpha\\
&=\int_{0}^{1}(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})^{\top}\nabla g(\alpha\mathbf{z}_{t}+(1-\alpha)\mathbf{z}_{t}^{\prime})(\mathbf{z}_{t}-\mathbf{z}_{t}^{\prime})d\alpha>0
\end{aligned}

(13)

where the inequality follows from (12).

Next, with $\sigma$ upper bounded as $\sigma\leq 1/C$, we show that $\nabla(g^{-1})(\boldsymbol{\phi}_{t})\succeq C$ for some constant $C>0$. To obtain the inverse $g^{-1}$, notice that (8a) can be readily rewritten as

[\mathbf{z}_{t}]_{\mathcal{B}_{i}}=([\boldsymbol{\phi}_{t}]_{\mathcal{B}_{i}}-\boldsymbol{\mu}_{i})\odot 1/\sigma(\boldsymbol{\alpha}_{i}),

where $/$ denotes element-wise division. Similar to (11), it can be easily verified that the Jacobian $\nabla_{[\boldsymbol{\phi}_{t}]_{\pi}}[(g^{-1})(\boldsymbol{\phi}_{t})]_{\pi}$ is also block-upper-triangular, with $i$-th diagonal block $\frac{\partial[(g^{-1})(\boldsymbol{\phi}_{t})]_{\mathcal{B}_{i}}}{\partial[\boldsymbol{\phi}_{t}]_{\mathcal{B}_{i}}}=\text{diag}^{-1}(\sigma(\boldsymbol{\alpha}_{i}))\succeq C$.

As a result, we have that

\nabla(g^{-1})(\boldsymbol{\phi}_{t})\succeq\min_{i=1,\ldots,B}1/\|\sigma(\boldsymbol{\alpha}_{i})\|_{\infty}\succeq C

which completes the proof. ∎
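The block-wise inversion used in the last part of the proof can be illustrated with a small numerical sketch. It assumes the per-block affine form $[\boldsymbol{\phi}_{t}]_{\mathcal{B}_{i}}=\boldsymbol{\mu}_{i}+\sigma(\boldsymbol{\alpha}_{i})\odot[\mathbf{z}_{t}]_{\mathcal{B}_{i}}$ implied by the inverse formula above, takes $\sigma$ to be the sigmoid, and lets $\boldsymbol{\mu}_{i},\boldsymbol{\alpha}_{i}$ depend only on the preceding blocks. The partition `BLOCKS` and the toy function `block_params` are hypothetical stand-ins for the learned MLPs, not the authors' model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical partition of d = 4 coordinates into B = 2 blocks.
BLOCKS = [[0, 1], [2, 3]]

def block_params(i, z):
    """Toy (mu_i, alpha_i) that depend only on the blocks preceding block i."""
    context = sum(z[j] for block in BLOCKS[:i] for j in block)  # 0 for the first block
    size = len(BLOCKS[i])
    mu = [0.5 * math.tanh(context) + 0.1 * k for k in range(size)]
    alpha = [math.sin(context) + 0.3 * k for k in range(size)]
    return mu, alpha

def forward(z):
    """Compute phi_Bi = mu_i + sigmoid(alpha_i) * z_Bi for every block."""
    phi = [0.0] * len(z)
    for i, block in enumerate(BLOCKS):
        mu, alpha = block_params(i, z)
        for k, j in enumerate(block):
            phi[j] = mu[k] + sigmoid(alpha[k]) * z[j]
    return phi

def inverse(phi):
    """Recover z_Bi = (phi_Bi - mu_i) / sigmoid(alpha_i), one block at a time."""
    z = [0.0] * len(phi)
    for i, block in enumerate(BLOCKS):
        # Only the already-recovered blocks enter mu_i and alpha_i.
        mu, alpha = block_params(i, z)
        for k, j in enumerate(block):
            z[j] = (phi[j] - mu[k]) / sigmoid(alpha[k])
    return z
```

Under the sigmoid assumption, every entry of $\sigma(\boldsymbol{\alpha}_{i})$ lies in $(0,1)$, so each diagonal entry $1/\sigma(\cdot)$ of the inverse Jacobian exceeds $1$ in this toy setting.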

Appendix B Summary of the algorithm

Input: $\{\mathcal{D}_{t}\}_{t=1}^{T}$, step sizes $\alpha$ and $\beta$, maximum numbers of iterations $K$ and $R$, and blockIAF mirror map $\nabla h$

Initialization: randomly initialize $\boldsymbol{\theta}^{(0)}=[\mathbf{z}^{(0)\top},\boldsymbol{\theta}_{g}^{\top}]^{\top}$

for $r=1,\ldots,R$ do
  Randomly sample a mini-batch of tasks $\mathcal{T}^{(r)}\subset\{1,\ldots,T\}$;
  for $t\in\mathcal{T}^{(r)}$ do
    Initialize $\mathbf{z}_{t}^{(0)}=\mathbf{z}^{(0)}$;
    for $k=1,\ldots,K$ do
      Map $\boldsymbol{\phi}_{t}^{(k-1)}(\boldsymbol{\theta}^{(r-1)})=(\nabla h)^{-1}(\mathbf{z}_{t}^{(k-1)}(\boldsymbol{\theta}^{(r-1)});\boldsymbol{\theta}_{g}^{(r-1)})$;
      Descend $\mathbf{z}_{t}^{(k)}(\boldsymbol{\theta}^{(r-1)})=\mathbf{z}_{t}^{(k-1)}(\boldsymbol{\theta}^{(r-1)})-\alpha\nabla\mathcal{L}\big(\boldsymbol{\phi}_{t}^{(k-1)}(\boldsymbol{\theta}^{(r-1)});\mathcal{D}^{\mathrm{trn}}_{t}\big)$;
    end for
    Map $\hat{\boldsymbol{\phi}}_{t}(\boldsymbol{\theta}^{(r-1)})=(\nabla h)^{-1}(\mathbf{z}_{t}^{(K)}(\boldsymbol{\theta}^{(r-1)});\boldsymbol{\theta}_{g}^{(r-1)})$;
  end for
  Update $\boldsymbol{\theta}^{(r)}=\boldsymbol{\theta}^{(r-1)}-\beta\frac{T}{|\mathcal{T}^{(r)}|}\sum_{t\in\mathcal{T}^{(r)}}\nabla_{\boldsymbol{\theta}^{(r-1)}}\mathcal{L}(\hat{\boldsymbol{\phi}}_{t}(\boldsymbol{\theta}^{(r-1)});\mathcal{D}^{\mathrm{val}}_{t})$;
end for

Output: $\hat{\boldsymbol{\theta}}=\boldsymbol{\theta}^{(R)}$

Algorithm 1: Meta-learning with MD and blockIAF
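The per-task loop of Algorithm 1 (the "Map" and "Descend" steps, followed by the final "Map") can be sketched in plain Python on a toy problem. This is only an illustration under strong assumptions: the inverse mirror map $(\nabla h)^{-1}$ is replaced by a hypothetical elementwise positive scaling `inverse_mirror_map` instead of blockIAF, the loss is a simple quadratic with a hand-coded gradient, and the upper-level update of $\boldsymbol{\theta}$ is omitted.

```python
def inverse_mirror_map(z, scale):
    """Toy stand-in for (grad h)^{-1}: an elementwise positive scaling."""
    return [s * zi for s, zi in zip(scale, z)]

def loss(phi, target):
    """Toy quadratic training loss 0.5 * ||phi - target||^2."""
    return 0.5 * sum((p - c) ** 2 for p, c in zip(phi, target))

def loss_grad(phi, target):
    """Gradient of the toy loss with respect to phi."""
    return [p - c for p, c in zip(phi, target)]

def adapt(z0, scale, target, alpha, num_steps):
    """Run K steps in the dual variable z, then map back to phi."""
    z = list(z0)
    for _ in range(num_steps):
        phi = inverse_mirror_map(z, scale)                 # Map
        grad = loss_grad(phi, target)                      # gradient evaluated at phi
        z = [zi - alpha * gi for zi, gi in zip(z, grad)]   # Descend
    return inverse_mirror_map(z, scale)                    # final Map

# Illustrative values only; they are not taken from the experiments.
z0 = [0.0, 0.0]
scale = [0.5, 2.0]
target = [1.0, -1.0]
phi_init = inverse_mirror_map(z0, scale)
phi_hat = adapt(z0, scale, target, alpha=0.1, num_steps=5)
```

With this scaling map, each coordinate of $\boldsymbol{\phi}$ effectively takes a gradient step with its own step size $\alpha s_i$, which is the preconditioning special case; a nonlinear map such as blockIAF generalizes this behavior.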

Appendix C Numerical setups

This section elaborates further on the dataset and setups of the numerical tests.

The miniImageNet dataset is a few-shot classification dataset comprising natural images from $100$ classes, each containing $600$ samples. All images are cropped and resized to $84\times 84$, as suggested by [4]. The $100$ classes are divided into $3$ disjoint groups of sizes $64$, $20$, and $16$, which are available to the meta-training, meta-validation, and meta-testing phases, respectively. The task setups follow the standard $M$-class $N$-shot few-shot learning protocol [4, 6]. In particular, $\mathcal{D}^{\mathrm{trn}}_{t}$ per task $t$ contains $M$ classes randomly drawn from the dataset, each with $N$ labeled samples, so that $|\mathcal{D}^{\mathrm{trn}}_{t}|=MN$ for all $t$. Likewise, $\mathcal{D}^{\mathrm{val}}_{t}$ is constructed in the same manner as $\mathcal{D}^{\mathrm{trn}}_{t}$, except that each class comprises $15$ labeled samples.
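The construction of one $M$-class $N$-shot task can be sketched as follows. The function name `sample_task`, the placeholder item identifiers, and the choice to draw the training and validation samples of a class without overlap are assumptions made for illustration; the actual data pipeline of the released code may differ.

```python
import random

def sample_task(class_to_items, num_classes, num_shots, num_query=15, rng=None):
    """Sample one M-class N-shot task as (train, val) lists of (item, label)."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(class_to_items), num_classes)
    train, val = [], []
    for label, cls in enumerate(classes):
        # Draw N + 15 distinct items so the two sets do not overlap (assumption).
        items = rng.sample(class_to_items[cls], num_shots + num_query)
        train += [(x, label) for x in items[:num_shots]]
        val += [(x, label) for x in items[num_shots:]]
    return train, val

# Placeholder data: 64 meta-training classes with 600 item ids each.
dataset = {c: [f"class{c}_img{i}" for i in range(600)] for c in range(64)}
train, val = sample_task(dataset, num_classes=5, num_shots=5, rng=random.Random(0))
```

For the 5-class 5-shot setup, this yields $MN=25$ training samples and $5\times 15=75$ validation samples per task.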

The hyperparameters used in the tests are the same as those used by MAML [6], and are listed in Table 2. Our implementation relies on PyTorch, and the code is available at https://github.com/zhangyilang/MetaMirrorDescent.

Table 2: Hyperparameter setup for the numerical tests.

Hyperparameter	Notation	Value
Lower-level iterations	$K$	$5$
Lower-level learning rate	$\alpha$	$10^{-2}$
Upper-level iterations	$R$	$60{,}000$
Upper-level learning rate	$\beta$	$10^{-3}$
Upper-level SGD batch size	$|\mathcal{T}^{(r)}|$	$4$

All the MLPs used in blockIAF have three fully-connected layers with ReLU nonlinearity, and the weight matrix of each layer is Kronecker factorized [13]. Let $\text{size}_{i}:=d_{i,1}\times d_{i,2}\times\ldots\times d_{i,O_{i}}$ be the size of the original tensor corresponding to the vector $[\boldsymbol{\phi}_{t}]_{\mathcal{B}_{i}}$, where $O_{i}$ is the order of the tensor, and $\prod_{j=1}^{O_{i}}d_{i,j}=|\mathcal{B}_{i}|$. Each layer of the encoder $e_{i}$ halves every dimension of its input tensor. This implies that the output tensor of the $l$-th layer of $e_{i}$ has size $\lfloor\frac{\text{size}_{i}}{2^{l}}\rfloor:=\lfloor\frac{d_{i,1}}{2^{l}}\rfloor\times\lfloor\frac{d_{i,2}}{2^{l}}\rfloor\times\ldots\times\lfloor\frac{d_{i,O_{i}}}{2^{l}}\rfloor,~l=1,2,3$. The decoder $d_{i}$ first vectorizes and concatenates the embeddings provided by $\{e_{j}\}_{j=1}^{i-1}$, maps this concatenated embedding vector to size $\lfloor\frac{\text{size}_{i}}{8}\rfloor$, and then recovers a tensor of $\text{size}_{i}$ by reversing the size operations of $e_{i}$; that is, its $l$-th layer changes the tensor size from $\lfloor\frac{\text{size}_{i}}{2^{4-l}}\rfloor$ to $\lfloor\frac{\text{size}_{i}}{2^{3-l}}\rfloor$.
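The layer sizes described above can be tabulated with a short helper. The function names are hypothetical and the example tensor size $64\times 32$ is chosen purely for illustration; the sketch only evaluates the floor formulas and does not build any network.

```python
def halve(size, times):
    """Apply floor(d / 2**times) to every dimension d of the tensor size."""
    return tuple(d // (2 ** times) for d in size)

def encoder_output_sizes(size):
    """Output size of the l-th encoder layer, for l = 1, 2, 3."""
    return [halve(size, l) for l in (1, 2, 3)]

def decoder_layer_sizes(size):
    """(input size, output size) of the l-th decoder layer, for l = 1, 2, 3."""
    return [(halve(size, 4 - l), halve(size, 3 - l)) for l in (1, 2, 3)]

# Example with a hypothetical 64 x 32 weight tensor.
enc = encoder_output_sizes((64, 32))
dec = decoder_layer_sizes((64, 32))
```

For the $64\times 32$ example, the encoder outputs sizes $32\times 16$, $16\times 8$, and $8\times 4$, and the decoder layers map $8\times 4\to 16\times 8\to 32\times 16\to 64\times 32$.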

Appendix D Complexity analysis

Next, a complexity comparison is carried out to assess the computational efficiency of the introduced blockIAF model. The computational cost is measured on the 5-class 5-shot miniImageNet setup. In this test, the blockIAF-based mirror map incurs a $9.1\%$ increase in forward and backpropagation time compared to the basic GD update in MAML. This modest increase supports the claimed low complexity of the proposed approach.
