Adapters Strike Back

Jan-Martin O. Steitz1           Stefan Roth1,2
1 Department of Computer Science, TU Darmstadt           2 hessian.AI
Abstract

Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However, they have often been found to be outperformed by other adaptation mechanisms, including low-rank adaptation. In this paper, we provide an in-depth study of adapters, their internal structure, as well as various implementation choices. We uncover pitfalls for using adapters and suggest a concrete, improved adapter architecture, called Adapter+, that not only outperforms previous adapter implementations but surpasses a number of other, more complex adaptation mechanisms in several challenging settings. Despite this, our suggested adapter is highly robust and, unlike previous work, requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark, even without a per-task hyperparameter optimization. Code is available at https://github.com/visinf/adapter_plus.

1 Introduction

[Plot: accuracy (%) against number of parameters (M). Legend: Adapter+ (ours), Adapter+ opt. (ours), SPT-Adapter [21], SSF [39], FacT-TK [32], Consolidator [20], VPT [31], LoRA [29], and re-evaluations ($\circlearrowright$) of SPT-Adapter, SSF, FacT-TK, and VPT. Reference markers: fine-tuning accuracy, linear probing.]
Figure 1: Parameter-accuracy characteristics of adaptation methods on the VTAB [65] test sets. We report original results and re-evaluations ($\circlearrowright$) after a complete training schedule with suitable data normalization. Our Adapter+ clearly has the best parameter-accuracy trade-off. The vertical, dashed line shows the minimal possible number of tunable parameters when only the classifiers are trained, i.e., using linear probing (61% accuracy).

Transfer learning from an off-the-shelf model, pre-trained on a large dataset like ImageNet [55], to a downstream task by fully fine-tuning the model’s parameters is a common paradigm. A typical CNN architecture, like a ResNet [24], has several tens of millions of parameters. However, since the introduction of transformers [59] into the realm of computer vision [13, 4, 5, 52, 63, 51], model sizes have grown exponentially from around a hundred million parameters for a vision transformer (ViT) [13] to more than a billion parameters [10, 46]. This leads to huge storage requirements when fine-tuning on multiple downstream tasks because a complete set of the model’s parameters needs to be saved per task. Additionally, large models require correspondingly large datasets [e.g., 56] to be trained to their full potential, yet tend to overfit easily if the target dataset in transfer learning is too small. One solution is linear probing [12], where only the linear classifier is trained, but this usually yields inferior results compared to full fine-tuning.

As a consequence, there is a growing interest in parameter-efficient tuning methods. The main idea is to freeze the parameters of the pre-trained model and add a comparatively small number of parameters to the model, which are then tuned together with the classifier to adapt the model to the downstream task at hand. Representative methods with different underlying concepts include VPT [31], which prepends trainable tokens to the sequence of image tokens in the attention to learn a prompt, LoRA [29], where the attention weights are updated with learnable low-rank decomposition matrices, and Adapters [28], which are small bottleneck modules that are added to every transformer layer of the network. Adapters were first proposed for CNNs by Rebuffi et al. [53] and various formulations [49, 28, 22] exist for the now common ViT architecture.

[Plot: three bar charts of accuracy (%) for the VTAB subgroups Natural, Specialized, and Structured, each comparing LoRA, VPT $\circlearrowright$, SSF $\circlearrowright$, FacT $\circlearrowright$, Consol., SPT $\circlearrowright$, and Adapter+.]
Figure 2: Average accuracy for VTAB subgroups on the test sets. For methods marked with $\circlearrowright$, we report results of our re-evaluation after a complete training schedule with suitable data normalization to ensure a fair comparison. Adapter+ is evaluated with rank $r \in [1..32]$.

Recent work on parameter-efficient transfer learning [e.g., 31, 20, 67, 39, 21, 32] presents adapters as a baseline method for the adaptation to downstream tasks in computer vision. However, we identified various common issues in their implementations, which we find to have a negative influence on the adaptation performance. For further details, refer to the supplemental material. Additionally, while adapters have been well studied in natural language processing (NLP), there is no study that broadly examines the different adapter configurations for vision transformers. As a result, adapters have seemed to underperform in comparison to recent parameter-efficient adaptation methods, e.g., reported accuracies of adapters on VTAB of 73.9% in [67] and 60.8% in [31].

In this work, we therefore revisit the idea of adapters and investigate how they can perform at their best in connection with ViTs. Our contribution is threefold: (1) We present the first in-depth and systematic study on the effects of the adapter position in the transformer and of the adapter’s inner structure with ViTs, and evaluate different variants of parameter initialization. (2) We further propose a learnable, channel-wise scaling as an extension to plain adapters, which proves to be beneficial for computer vision tasks. (3) Finally, we present Adapter+, an adapter configuration with an excellent parameter-accuracy trade-off compared to other work, as shown in Fig. 1. Adapter+ reaches a state-of-the-art average accuracy of 77.6% on VTAB [65] without any hyperparameter optimization per task, which is 3.7 percentage points (pp) above previous adapter baselines. We also reach a state-of-the-art accuracy of 90.7% on FGVC [31] with the lowest number of parameters compared to other methods. Finally, Adapter+ shows the best robustness in terms of accuracy across the VTAB subgroups, see Fig. 2.

2 Related work

One possibility to adapt a pre-trained network to a novel task, apart from full fine-tuning, is to only selectively tune some of the parameters, e.g., only training the classifier [12]. Cai et al. [3] proposed to tune only the biases of an otherwise frozen network to adapt it to a downstream task. BitFit [64] then showed the efficacy of this method for NLP transformers.

Modular adaptation.   The concept of adding small, trainable modules with only a few parameters to an otherwise frozen network was first proposed for adapting CNNs by Rebuffi et al. [53] and called adapters. Other approaches replaced all convolutions in the network with depth-wise separable convolutions and only tuned their spatial parts [19], learned binary masks to prune a pre-trained network per target task [41], or created a student network by augmenting the original network with adapter-like modules and skip connections, which then mimicked a teacher network by disabling parts of its pre-trained and added modules [43].

Following the rise of transformers in NLP [59, 11, 50], Houlsby et al. [28] proposed adapter modules in the form of bottlenecks for transformer layers. Pfeiffer et al. [49] conducted an architecture search on NLP tasks to find a more parameter-efficient configuration of adapter modules that only acts on the transformer’s feed-forward network (FFN), thus saving roughly half of the parameters over [28].

Prompt tuning.   Inspired by changing the output of a network for NLP with hand-crafted textual prompts, which modifies the attention over the original input tokens, Lester et al. [37] proposed prompt tuning: A set of learnable tokens is added to the input sequence and trained with backpropagation to prompt a frozen language model to perform downstream tasks. Li and Liang [38] extended prompt tuning by adding learnable tokens at every transformer layer of the model, which they termed prefix tuning. Jia et al. [31] applied prompt tuning to vision transformers, then called visual prompt tuning (VPT), by prepending such trainable tokens to the sequence of image patch embeddings (VPT-Shallow). They also showed a variant resembling prefix tuning with stronger adaptation capabilities that adds tokens at every layer of the network (VPT-Deep).

Low-rank approaches.   Also focusing on the attention part of the transformer layers, Hu et al. [29] proposed low-rank adaptation (LoRA) where the attention weights are updated with low-rank decomposition matrices. The matrices can be merged with the attention weights for inference. The structure of LoRA is very similar to an adapter, which can be seen as a superset of LoRA acting on the transformer’s FFN. He et al. [22] proposed a formalism to unify LoRA, adapters, and prefix tuning [38]. It allowed them to combine the beneficial aspects of all three methods into a scaled parallel adapter (Scaled PA) for NLP tasks. AdaptFormer [6] then applied the concept of Scaled PA to vision transformers.

Other related work.   Newer approaches for vision transformers proposed different techniques to further enhance the parameter-accuracy trade-off in adaptation. NOAH [67] performs an architecture search for a combination of adapters, LoRA, and VPT for each task. SSF [39] scales and shifts the features in the network after every operation, i.e., attention, FFN, layer normalization, with task-specific, trainable modules. Jie and Deng [32] aggregate the weights of a ViT into a single 3D tensor. Task-specific weight updates of this tensor are learned as a matrix decomposed into parameter-efficient factors, hence they termed their method factor-tuning (FacT). SPT [21] measures the importance of the weights of a pre-trained network for a downstream task. Based on a desired parameter budget, the most important parameters are chosen for tuning and adapters or LoRA are used for weight matrices that contain enough parameters of importance. Consolidator [20] adapts weights in multiple orderings of channel-wise groups. The updates for all groups are merged for efficient storage and inference.

Despite these new developments, we show that the simple concept of adapters exhibits an even better parameter-accuracy trade-off in combination with vision transformers – if done right and with the addition of a channel-wise scaling.

3 Adapters for vision transformers

3.1 Vision transformer basics

In this work, we concentrate on the parameter-efficient adaptation of vision transformers (ViT) [13]. The ViT is closely modeled after the transformer model for natural language processing (NLP) proposed by Vaswani et al. [59]. A learned linear projection embeds non-overlapping and flattened patches of the input image into a sequence of $n$ tokens $\bm{x}\in\mathbb{R}^{n\times d}$, where $d$ is called the hidden dimension of the transformer. A positional encoding is added to the embeddings and the sequence is prepended with a trainable [CLS] token. The sequence length and the dimension of the tokens stay fixed throughout the architecture. The sequence is sent through consecutive transformer layers that each consist of a multi-head self-attention and a feed-forward network (FFN). For the self-attention, the tokens are projected to queries, keys, and values ($\bm{Q}$, $\bm{K}$, and $\bm{V}$) and the output of each of the $M$ attention heads is calculated as

$$\text{Attention}(\bm{x})=\text{Softmax}\left(\frac{\bm{Q}(\bm{x})\,\bm{K}(\bm{x})^{\mathsf{T}}}{\sqrt{d'}}\right)\bm{V}(\bm{x}), \tag{1}$$

with $d'=d/M$ being the inner dimension of the head. The FFN consists of a multilayer perceptron with two linear layers (with weights $\bm{W}_{i}$ and biases $\bm{b}_{i}$) and a GELU [26] non-linearity as activation in between:

$$\text{FFN}(\bm{x})=\text{GELU}(\bm{x}\bm{W}_{1}+\bm{b}_{1})\bm{W}_{2}+\bm{b}_{2}. \tag{2}$$

Both attention and FFN are employed with a preceding layer normalization (LN) [1] and a skip connection and, therefore, transform an input sequence $\bm{x}$ sequentially as

$$\bm{x}\mapsto\text{Attention}(\text{LN}(\bm{x}))+\bm{x} \tag{3a}$$
$$\bm{x}\mapsto\text{FFN}(\text{LN}(\bm{x}))+\bm{x}\,. \tag{3b}$$

To keep the notation concise, we will omit the LNs of attention and FFN in the following; each attention and FFN is assumed to be always preceded by an LN.
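The layer described by Eqs. 1 to 3b can be sketched in a few lines of NumPy. This is only an illustrative toy, not the implementation used in the experiments: the output projection `w_out` after concatenating the heads, the tanh approximation of GELU, the omission of learnable LN parameters and of the query/key/value biases, and all sizes are assumptions made for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN over the hidden dimension; learnable gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # Common tanh approximation of the GELU non-linearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention_head(x, w_q, w_k, w_v):
    # Eq. 1 for a single head; each projection maps d -> d' = d / M.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def attention(x, heads, w_out):
    # Concatenate the M head outputs and project back to d (w_out is assumed).
    return np.concatenate([attention_head(x, *h) for h in heads], axis=-1) @ w_out

def ffn(x, w1, b1, w2, b2):
    # Eq. 2: two linear layers with a GELU in between.
    return gelu(x @ w1 + b1) @ w2 + b2

def transformer_layer(x, heads, w_out, ffn_params):
    x = attention(layer_norm(x), heads, w_out) + x  # Eq. 3a
    x = ffn(layer_norm(x), *ffn_params) + x         # Eq. 3b
    return x

rng = np.random.default_rng(0)
n, d, M = 5, 8, 2  # toy sizes: 5 tokens, hidden dimension 8, 2 heads
heads = [tuple(rng.normal(size=(d, d // M)) for _ in range(3)) for _ in range(M)]
w_out = rng.normal(size=(d, d))
ffn_params = (rng.normal(size=(d, 4 * d)), np.zeros(4 * d),
              rng.normal(size=(4 * d, d)), np.zeros(d))
x = rng.normal(size=(n, d))
y = transformer_layer(x, heads, w_out, ffn_params)
```

The output `y` has the same shape as the input `x`, which reflects the statement that sequence length and token dimension stay fixed throughout the architecture.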

3.2 Adapters and their inner structure

[Figure panels: (a) Inner structure, (b) Pre, (c) Post, (d) Parallel, (e) Intermediate]
Figure 3: Illustrations of (a) the inner structure of an adapter with feed-forward layers (FF), activation layer (Act), and optional layer normalization (LN) and scaling, (b)–(e) different possible adapter positions to connect the adapter to the FFN section of the transformer layer. Modules with trainable parameters are shown in red and frozen modules in blue.

Adapters [28] are small modules that are added to the transformer layers. They allow tailoring a network to a new task or domain, where instead of tuning the parameters of the whole network, only the adapter parameters and the classifier are trained. Adapters take the form of bottlenecks with an inner dimension of $r\ll d$. We call $r$ the rank of the adapter. In detail, a down-projection to dimension $r$ with weights $\bm{W}_{\text{down}}\in\mathbb{R}^{d\times r}$ and biases $\bm{b}_{\text{down}}\in\mathbb{R}^{r}$ is followed by a non-linear activation function $\sigma(\cdot)$, typically a GELU [26] as used throughout the ViT, and an up-projection with weights $\bm{W}_{\text{up}}\in\mathbb{R}^{r\times d}$ and biases $\bm{b}_{\text{up}}\in\mathbb{R}^{d}$ back to the hidden dimension $d$ of the transformer layer. This yields a base adapter module

$$\text{Adapter}_{\text{base}}(\bm{x})=\sigma(\bm{x}\bm{W}_{\text{down}}+\bm{b}_{\text{down}})\bm{W}_{\text{up}}+\bm{b}_{\text{up}}\,. \tag{4}$$

The base adapter module can be further enhanced with a normalization layer, e.g., a layer normalization (LN) [1]. Additionally, the output of the bottleneck can be scaled by $s$ as

$$\text{Adapter}(\bm{x})=s\cdot\text{Adapter}_{\text{base}}\big(\text{LN}(\bm{x})\big)\,. \tag{5}$$

For layer-wise scaling, the factor $s$ is taken to be a scalar, i.e., $s\in\mathbb{R}$, and can be either fixed as a hyperparameter or learned during training. Layer-wise scaling was proposed by He et al. [22] and Hu et al. [29] but deemed not effective compared to a fixed scaling for tasks in NLP. Here, we additionally propose to use a channel-wise, learned scaling where $\bm{s}\in\mathbb{R}^{d}$. We investigate its capabilities in Sec. 4.3. In most cases, the adapter is used with a skip connection, hence the complete feature transformation becomes

$$\bm{x}\mapsto\text{Adapter}(\bm{x})+\bm{x}\,. \tag{6}$$

The complete inner structure of an adapter including its skip connection is visualized in Fig. 3(a).
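As a minimal sketch of Eqs. 4 to 6, the following NumPy snippet implements the bottleneck with optional normalization and scaling. The function names, the toy sizes, the tanh approximation of GELU, and the example scaling values are assumptions chosen for illustration; in practice, the scaling would be a trained parameter. Layer-wise and channel-wise scaling differ only in the shape of `s`: a scalar versus a vector of length $d$ that broadcasts over the tokens.

```python
import numpy as np

def gelu(x):
    # Common tanh approximation of the GELU non-linearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapter_base(x, w_down, b_down, w_up, b_up):
    # Eq. 4: down-projection to rank r, activation, up-projection back to d.
    return gelu(x @ w_down + b_down) @ w_up + b_up

def adapter(x, params, s=1.0, norm=None):
    # Eq. 5: optional normalization of the input and scaling of the output.
    # s may be a scalar (layer-wise) or an array of shape (d,) (channel-wise).
    h = norm(x) if norm is not None else x
    return s * adapter_base(h, *params)

def adapter_with_skip(x, params, s=1.0, norm=None):
    # Eq. 6: the adapter output is added to its input.
    return adapter(x, params, s, norm) + x

rng = np.random.default_rng(0)
n, d, r = 4, 16, 2  # toy sizes with r much smaller than d
params = (rng.normal(scale=0.01, size=(d, r)), np.zeros(r),
          rng.normal(scale=0.01, size=(r, d)), np.zeros(d))
x = rng.normal(size=(n, d))

s_layer = 0.5                         # one scalar for the whole layer
s_channel = np.linspace(0.0, 1.0, d)  # one factor per channel
out_layer = adapter_with_skip(x, params, s=s_layer)
out_channel = adapter_with_skip(x, params, s=s_channel)
```

With the channel-wise factors above, the first channel has a scale of zero, so the adapter leaves that channel of `x` untouched while fully modifying the last one.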

3.3 Adapter positions

Although the architecture of bottleneck adapters for transformers is rather simple, there are various ways to plug them into the transformer layer. Previous work has not yet investigated which position is optimal for use with a ViT [13]. Here, we evaluate four possible adapter positions, shown in Figs. 3(b), 3(c), 3(d) and 3(e). We postulate that it is easier for an adapter to learn to modify features previously transformed by a frozen module in the network than to anticipate what changes to the features are needed for a frozen module that follows the adapter. Put differently, we argue that the adapter should follow a frozen module.

Pre-Adapter.   The first adapter position we analyze applies the adapter to the output $\bm{x}$ of the attention section of the transformer layer before it is passed into the FFN, but with the skip connection of the attention already added (Fig. 3(b)). The feature transformation of the FFN section with the adapter attached, therefore, becomes

$$\bm{x}\mapsto\text{FFN}\bigl(\text{Adapter}(\bm{x})+\bm{x}\bigr)+\bigl(\text{Adapter}(\bm{x})+\bm{x}\bigr)\,. \tag{7}$$

Note that the two occurrences of $\text{Adapter}(\bm{x})$ in Eq. 7 refer to the same instantiation. In this configuration, the adapter has the full information from the feature transformation happening in the attention but needs to estimate the transformation that will be happening in the FFN that follows. As a result, especially the last FFN before the linear classifier will be hard to adapt. To the best of our knowledge, this adapter position has not been considered in the literature.

Post-Adapter.   In this case, the adapter is positioned at the very end of the transformer layer on the output of the FFN with its skip connection added as

$$\bm{x}\mapsto\text{Adapter}\bigl(\text{FFN}(\bm{x})+\bm{x}\bigr)+\bigl(\text{FFN}(\bm{x})+\bm{x}\bigr)\,, \tag{8}$$

where the FFNs refer to the same instantiation (Fig. 3(c)). That way, the adapter has access to the feature transformation happening in the FFN and the unmodified features via the skip connection. This position has been proposed by Pfeiffer et al. [49] as the result of an architecture search, but only for adapting transformers for NLP tasks and not for a ViT.

Parallel-Adapter.   Next, we consider a parallel setting as proposed by [22], where the adapter is located parallel to the FFN and both share a skip connection (Fig. 3(d)):

$$\bm{x}\mapsto\text{FFN}(\bm{x})+\text{Adapter}(\bm{x})+\bm{x}\,. \tag{9}$$

Therefore, both adapter and FFN work on the output of the attention section of the transformer layer and the adapter needs to learn the necessary residual transformation to the one produced by the frozen FFN.

Intermediate-Adapter.   Finally, we consider the original adapter position as proposed by Houlsby et al. [28]. The adapter is plugged behind the FFN but before the skip connection of the FFN is added (Fig. 3(e)). The adapter additionally possesses its own skip connection:

$$\bm{x}\mapsto\text{Adapter}\bigl(\text{FFN}(\bm{x})\bigr)+\text{FFN}(\bm{x})+\bm{x}\,. \tag{10}$$

Note that the two occurrences of $\text{FFN}(\bm{x})$ in Eq. 10 refer to the same instantiation. The adapter sees the transformed features coming from the FFN but cannot access the features added later on by the skip connection of the FFN.
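The four positions of Eqs. 7 to 10 differ only in how the same two functions are composed. The sketch below makes this explicit with `ffn` and `adapter` passed in as callables. The function names and the scalar toy functions are assumptions for illustration, and `ffn` is taken to include its preceding LN, following the notation above.

```python
def pre_adapter(x, ffn, adapter):
    # Eq. 7: the adapter (with its skip) runs first; its result is reused
    # both as FFN input and in the FFN skip connection.
    h = adapter(x) + x
    return ffn(h) + h

def post_adapter(x, ffn, adapter):
    # Eq. 8: the adapter acts on the FFN output with the FFN skip already added.
    h = ffn(x) + x
    return adapter(h) + h

def parallel_adapter(x, ffn, adapter):
    # Eq. 9: adapter and FFN see the same input and share one skip connection.
    return ffn(x) + adapter(x) + x

def intermediate_adapter(x, ffn, adapter):
    # Eq. 10: the adapter acts on the FFN output before the FFN skip is added.
    f = ffn(x)
    return adapter(f) + f + x

# Toy scalar stand-ins, chosen only so the compositions are easy to check by hand.
toy_ffn = lambda v: 2.0 * v
toy_adapter = lambda v: v + 1.0
zero_adapter = lambda v: 0.0

x = 3.0
results = {f.__name__: f(x, toy_ffn, toy_adapter)
           for f in (pre_adapter, post_adapter, parallel_adapter, intermediate_adapter)}
```

When the adapter outputs zero, all four compositions reduce to the frozen FFN section `ffn(x) + x`, which is the situation an initialization with zero adapter output aims for at the start of training.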

3.4 Initialization of adapter parameters

Since training a deep learning model is a non-convex optimization problem, the initialization of parameters is important. In this work, we evaluate three different variants of parameter initializations for adapters proposed in the literature. All of them have the goal to initialize the adapters in a way that minimizes the initial influence of the adapters at the start of their training. This is a sensible goal since adapters extend an already pre-trained frozen network.

Houlsby initialization.   Houlsby et al. [28] propose to draw the weights of the projection matrices from a zero-centered Gaussian distribution with a standard deviation of $\sigma=0.01$, truncated at $2\sigma$, and use zero for their biases.

BERT initialization.   For the BERT model [11], the initialization works similarly to [28] but the Gaussian distribution has a standard deviation of $\sigma=0.02$ and is not truncated. This form of initialization is used by Pfeiffer et al. [49].

LoRA initialization.   LoRA [29] initializes the weights and biases of the down-projection with a uniform Kaiming He initialization [23]; the weights and biases of the up-projection are initialized to zero. Therefore, the output of the adapter at the beginning of training equals zero and the adapter initially does not contribute.
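The three variants can be sketched in NumPy as follows. Several details are assumptions of this sketch rather than statements about the referenced implementations: truncation is done by resampling out-of-range values, the uniform He bound is taken as $\sqrt{6/d}$ with the fan-in $d$ of the down-projection, the down-projection bias is drawn from the same uniform range, and the dictionary keys are illustrative names.

```python
import numpy as np

def truncated_normal(rng, shape, std, limit):
    # Zero-centered Gaussian; values outside [-limit, limit] are resampled.
    w = rng.normal(0.0, std, size=shape)
    mask = np.abs(w) > limit
    while mask.any():
        w[mask] = rng.normal(0.0, std, size=int(mask.sum()))
        mask = np.abs(w) > limit
    return w

def init_houlsby(rng, d, r, std=0.01):
    # Gaussian weights truncated at 2 * std, zero biases.
    return {"w_down": truncated_normal(rng, (d, r), std, 2 * std),
            "b_down": np.zeros(r),
            "w_up": truncated_normal(rng, (r, d), std, 2 * std),
            "b_up": np.zeros(d)}

def init_bert(rng, d, r, std=0.02):
    # Untruncated Gaussian weights with std 0.02, zero biases.
    return {"w_down": rng.normal(0.0, std, size=(d, r)),
            "b_down": np.zeros(r),
            "w_up": rng.normal(0.0, std, size=(r, d)),
            "b_up": np.zeros(d)}

def init_lora(rng, d, r):
    # Uniform He-style down-projection (assumed bound), zero up-projection.
    bound = np.sqrt(6.0 / d)
    return {"w_down": rng.uniform(-bound, bound, size=(d, r)),
            "b_down": rng.uniform(-bound, bound, size=r),
            "w_up": np.zeros((r, d)),
            "b_up": np.zeros(d)}

rng = np.random.default_rng(0)
d, r = 768, 8
houlsby = init_houlsby(rng, d, r)
bert = init_bert(rng, d, r)
lora = init_lora(rng, d, r)

# With a zero up-projection, the adapter output is exactly zero for any input.
x = rng.normal(size=(3, d))
lora_out = np.maximum(x @ lora["w_down"] + lora["b_down"], 0.0) @ lora["w_up"] + lora["b_up"]
```

The last two lines use a ReLU as a stand-in activation, which does not matter here because the zero up-projection makes the output vanish regardless of the non-linearity.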

3.5 Data normalization in pre-processing

Data normalization is common practice during image pre-processing. It is typically done by shifting and scaling of each input pixel $x_{ij}$ for each channel $c$ as

$$\hat{x}_{ijc}=(x_{ijc}-\mu_{c})/\sigma_{c}\,. \tag{11}$$

Most widely used are the mean $\bm{\mu}=(0.485,0.456,0.406)^{\mathsf{T}}$ and standard deviation $\bm{\sigma}=(0.229,0.224,0.225)^{\mathsf{T}}$ of the ImageNet dataset [55], commonly referred to as ImageNet normalization. Another option is using 0.5 for every element of $\bm{\mu}$ and $\bm{\sigma}$, which is commonly referred to as Inception normalization because it is used for the Inception family of CNN architectures, starting with Inception-v3 [58]. The ImageNet normalization aims to center the input data around $0$ with a standard deviation of $1$. The Inception normalization, on the other hand, transforms the input values such that they lie strictly in the range $[-1,1]$.

Because we try to adapt to a target domain on a very low parameter budget, it is important to use the data normalization the network saw during its pre-training. Otherwise, the parameter-efficient transfer method of choice needs to first compensate for the shift in input data statistics and loses parts of its capacity to adapt to the target domain.
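To illustrate how far apart the two normalizations are, the following sketch applies Eq. 11 to a pure black and a pure white pixel. It assumes pixel values have already been scaled to $[0,1]$; the function and constant names are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
INCEPTION_MEAN = (0.5, 0.5, 0.5)
INCEPTION_STD = (0.5, 0.5, 0.5)

def normalize_pixel(rgb, mean, std):
    # Eq. 11 applied to one pixel: shift and scale each channel c.
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))

BLACK = (0.0, 0.0, 0.0)
WHITE = (1.0, 1.0, 1.0)

# Inception normalization maps [0, 1] exactly onto [-1, 1].
inception_lo = normalize_pixel(BLACK, INCEPTION_MEAN, INCEPTION_STD)  # (-1.0, -1.0, -1.0)
inception_hi = normalize_pixel(WHITE, INCEPTION_MEAN, INCEPTION_STD)  # (1.0, 1.0, 1.0)

# ImageNet normalization yields a wider, channel-dependent range.
imagenet_lo = normalize_pixel(BLACK, IMAGENET_MEAN, IMAGENET_STD)
imagenet_hi = normalize_pixel(WHITE, IMAGENET_MEAN, IMAGENET_STD)
```

Under ImageNet normalization, the same inputs span roughly $-2.1$ to $2.6$, so a network pre-trained with one scheme and adapted with the other sees inputs with shifted and rescaled statistics.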

4 Experiments

4.1 Datasets

In order to carry out a detailed study of the utility of adapters in the context of ViT models, we experiment with two standard benchmarks for task adaptation.

VTAB.   The Visual Task Adaptation Benchmark (VTAB) [65] consists of 19 tasks, which are further grouped into three categories: Natural, Specialized, and Structured. The Natural group contains natural images captured using standard photographic equipment. The Specialized group is built from datasets of images captured with specialized equipment, from remote sensing and medical domains. Lastly, the Structured group is for evaluating the understanding of the scene structure. Here, the majority of the datasets are compiled from synthetic images with scenes that are easy to assess for humans but have a large domain gap to natural image datasets. Each task of VTAB consists of 800 training and 200 validation images. The test sets have the same number of images as the test sets in the original datasets.

FGVC.   Following Jia et al. [31], we compile five datasets for fine-grained visual classification (FGVC): CUB-200-2011 [61], NABirds [27], Oxford Flowers [45], Stanford Dogs [34], and Stanford Cars [17]. Because VTAB benchmarks task adaptation in a low-data regime in terms of the number of available training images, we use FGVC to evaluate adaptation methods in settings where training data is abundant. Where validation sets are not available in FGVC, we follow Jia et al. [31] to create the validation splits.

For further details regarding the dataset properties of VTAB and FGVC, see the supplemental material.

4.2 Experimental settings

For all our experiments, we use a ViT-B/16 network [13] that was pre-trained on ImageNet-21k [55]. We follow its pre-training settings, in particular regarding input data normalization. We train all models with an AdamW [40] optimizer with a learning rate of $10^{-3}$, a weight decay of $10^{-4}$, and a batch size of 64, following [67]. For full fine-tuning, we use a learning rate of $10^{-4}$, which we found leads to better results. We use a cosine learning rate schedule with a linear warm-up over the first 10 epochs and train for 100 epochs in total. We use stochastic depth with linearly increasing drop rates as a function of network depth from 0 to 0.1 for the frozen network and with a drop rate of 0.1 for the adapters during training. Apart from data normalization (cf. Sec. 3.5), we resize input images to 224×224 px for VTAB and use a random resized crop to 224×224 px and horizontal flipping for FGVC. For the ablations and to determine hyperparameters, we evaluate on the validation splits. We include the validation sets in the training data for producing final results.
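The learning rate schedule and the stochastic depth rates described above can be sketched as follows. The text does not specify whether the schedule is updated per epoch or per step, nor the starting value of the warm-up, so the per-epoch update and the warm-up starting at one tenth of the base rate are assumptions of this sketch. The 12 layers correspond to the depth of a ViT-B.

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=10, total_epochs=100):
    # Linear warm-up over the first epochs, then cosine decay towards zero.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def frozen_drop_rates(num_layers=12, max_rate=0.1):
    # Stochastic depth rates for the frozen network, increasing linearly with depth.
    return [max_rate * i / (num_layers - 1) for i in range(num_layers)]

ADAPTER_DROP_RATE = 0.1  # constant drop rate used for the adapters

schedule = [lr_at_epoch(e) for e in range(100)]
rates = frozen_drop_rates()
```

In this sketch, the learning rate reaches its base value at the end of the warm-up and decreases monotonically afterwards, while the drop rates rise from 0 in the first layer to 0.1 in the last.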

4.3 Exploring adapter configurations

Adapter position.

Table 1: Adapter position. We report the average accuracy in % (± std. dev.) on the VTAB val sets for different adapter positions. $\text{Adapter}_{\text{base}}$ with Houlsby initialization and rank $r=8$ is used in all experiments.

| Position | Natural | Specialized | Structured | Average |
|---|---|---|---|---|
| Pre | 82.4 ± 0.4 | 86.2 ± 0.8 | 57.5 ± 0.5 | 75.3 ± 0.3 |
| Intermediate | 83.0 ± 0.4 | 85.0 ± 0.8 | 57.2 ± 0.5 | 75.1 ± 0.3 |
| Parallel | 83.0 ± 0.3 | 86.2 ± 0.6 | 57.7 ± 0.6 | 75.6 ± 0.3 |
| Post | 83.0 ± 0.3 | 85.7 ± 0.4 | 59.1 ± 0.3 | 76.0 ± 0.2 |

We first evaluate the four possible positions to connect an adapter to the FFN section of the transformer layer, as described in Sec. 3.3. In our ablation, we use $\text{Adapter}_{\text{base}}$ (cf. Eq. 4) with rank $r=8$ and use the Houlsby initialization. In this experiment, the adapters neither have a layer normalization nor use scaling.

The results on the VTAB validation set for all four adapter positions are presented in Tab. 1. The Post-Adapter yields the best result with 76.0% average accuracy over all VTAB subgroups. It confirms our hypothesis that the adapter should follow the frozen FFN module because it can then post-hoc modify the features flowing through the network. The parallel configuration comes in second with 75.6% average accuracy, receiving the same input as the FFN but having to learn a residual modification to the FFN instead of a subsequent one. Pre-Adapter and Intermediate-Adapter are subpar compared to the other positions. They either do not have access to the feature transformation happening afterwards in the FFN or to the features of the skip connection containing the output of the attention.

Table 2: Inner adapter structure. We evaluate the different components of the adapter structure, e.g., normalization layer (Norm), layer-wise and channel-wise learnable scaling on the VTAB val sets. The difference to $\text{Adapter}_{\text{base}}$ (first row) is shown in $\Delta_{\text{base}}$.

| Bias | Norm | Scaling | Initialization | Accuracy (%) | $\Delta_{\text{base}}$ |
|---|---|---|---|---|---|
| ✓ | | | Houlsby | 76.0 ± 0.2 | 0.0 |
| | | | Houlsby | 75.6 ± 0.4 | -0.4 |
| ✓ | | | LoRA | 75.5 ± 0.3 | -0.5 |
| ✓ | | | BERT | 75.8 ± 0.3 | -0.2 |
| ✓ | ✓ | | Houlsby | 75.9 ± 0.3 | -0.1 |
| ✓ | ✓ | layer | Houlsby | 75.9 ± 0.3 | -0.1 |
| ✓ | | layer | Houlsby | 76.2 ± 0.3 | +0.2 |
| ✓ | ✓ | channel | Houlsby | 75.8 ± 0.3 | -0.2 |
| ✓ | | channel | Houlsby | 76.5 ± 0.2 | +0.5 |

Inner structure.   Next, we investigate the impact of the inner structure of adapters including their initialization. Tab. 2 shows our findings with average accuracies calculated over the three VTAB subgroups. Removing the biases from the linear layers leads to a decrease in accuracy of 0.4 percentage points (pp). We find that the Houlsby initialization of the adapter parameters is best while BERT and LoRA initializations reduce the accuracy by 0.2 pp and 0.5 pp, respectively. Adding layer normalization (LN) to the adapter is slightly detrimental for all settings, both with scaling and without, while additionally adding $2d$ parameters per layer. We find that a learned scaling is in general beneficial for image-classification tasks. Adding layer-wise scaling leads to a gain of 0.2 pp. The inclusion of a learned, channel-wise scaling, as proposed here, gives the strongest improvement of 0.5 pp, reaching an accuracy of 76.5% on the VTAB validation set while only adding half of the parameters compared to LN.
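The parameter overhead of each inner component can be tallied with a short sketch. It assumes an adapter made of a down-projection ($d \to r$) and an up-projection ($r \to d$), which is consistent with the rank and the $2d$ LN parameters mentioned above. The hidden size `d = 768` and the 12 layers are assumptions (typical for a ViT-B backbone) and are not stated in this section; the function name is hypothetical.

```python
from typing import Optional


def adapter_param_count(d: int, r: int, bias: bool = True,
                        norm: bool = False,
                        scaling: Optional[str] = None) -> int:
    """Count trainable parameters of one bottleneck adapter (illustrative)."""
    params = 2 * d * r              # down-projection (d x r) and up-projection (r x d)
    if bias:
        params += r + d             # one bias vector per linear layer
    if norm:
        params += 2 * d             # LN gain and shift
    if scaling == "layer":
        params += 1                 # a single learned scalar
    elif scaling == "channel":
        params += d                 # one learned scalar per channel
    elif scaling is not None:
        raise ValueError(f"unknown scaling: {scaling}")
    return params


d, r, num_layers = 768, 8, 12       # assumed backbone dimensions
base = adapter_param_count(d, r)
with_norm = adapter_param_count(d, r, norm=True)
with_channel = adapter_param_count(d, r, scaling="channel")

extra_norm = with_norm - base        # 2d
extra_channel = with_channel - base  # d, i.e., half of the LN overhead
total_channel = num_layers * with_channel
```

Under these assumptions, the channel-wise scaling costs $d$ parameters per layer versus $2d$ for LN, and the adapters total roughly 0.17M parameters. This is of the same order as the 0.20M reported for Adapter+ in Tab. 3; the remainder may be the classification head, which this sketch does not count.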

What makes a great adapter?   From our systematic exploration of possible adapter configurations, we conclude that adapter modules in the Post-Adapter position with a learnable, channel-wise scaling and Houlsby initialization work best for computer vision tasks. We call our proposed adapter configuration Adapter+. The addition of layer normalization, as suggested by Pfeiffer et al. [49], is not necessary and even leads to detrimental effects in our setting.

Configurations from previous work.

Table 3: Comparison of Adapter+ with adapter configurations from previous work. We report the average accuracy in % (± std. dev.) of each subgroup and across all groups on the VTAB val sets.
Configuration # Param (M) Natural Specialized Structured Average
Houlsby [28], $r=8$ 0.39 82.9 ± 0.2 85.5 ± 0.3 58.9 ± 0.8 75.8 ± 0.3
Houlsby [28], $r=4$ 0.24 82.9 ± 0.4 84.9 ± 0.3 58.3 ± 0.6 75.4 ± 0.3
Pfeiffer [49] 0.21 82.9 ± 0.3 86.1 ± 0.9 58.4 ± 0.7 75.8 ± 0.4
AdaptFormer [6] 0.19 83.0 ± 0.4 85.0 ± 0.2 57.4 ± 0.5 75.2 ± 0.2
Adapter+ 0.20 83.0 ± 0.2 86.8 ± 0.6 59.7 ± 0.4 76.5 ± 0.2

Different configurations of adapters have been established in previous work. We compare their configurations to our systematic approach with rank $r=8$ on the VTAB validation sets. Using our own implementations already leads to better results than reported in the literature and enables us to compare on equal footing. Houlsby et al. [28] use an Intermediate-Adapter with their proposed initialization both at the FFN section as well as at the attention part of the transformer layer. Additionally, they adapt the LN parameters of the backbone. We therefore additionally evaluate their setting with $r=4$ to compare on roughly the same parameter budget. Pfeiffer et al. [49] suggest a Post-Adapter like us but with a BERT initialization, and they employ a layer normalization inside the adapter. AdaptFormer [6] has the same configuration as a scaled parallel adapter (Scaled PA) [22], which was proposed for NLP tasks, the only difference being the layer-wise scaling $s$. Scaled PA uses a fixed scaling of $s=4$ for the adapters whereas AdaptFormer suggests using $s=0.1$ for vision tasks. Optimizing $s$ for VTAB may lead to better results. Our results are presented in Tab. 3. We see a clear advantage of our Adapter+ configuration, gaining at least 0.7 pp over all previous adapter realizations considered despite having the second lowest number of trainable parameters.
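The configurations compared here can be summarized as a small data structure. This is a sketch that records only what the paragraph above states; the class and field names are hypothetical, and properties that the text does not mention are left as `None` or omitted rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AdapterConfig:
    name: str
    position: str                   # where the adapter connects to the FFN section
    initialization: Optional[str]   # None: not stated in this section
    extras: Tuple[str, ...] = ()    # further properties named in the text


CONFIGS = (
    AdapterConfig("Houlsby", "intermediate", "houlsby",
                  ("adapter also at the attention part",
                   "LN parameters of the backbone are adapted")),
    AdapterConfig("Pfeiffer", "post", "bert",
                  ("layer normalization inside the adapter",)),
    AdapterConfig("AdaptFormer", "parallel", None,
                  ("layer-wise scaling s = 0.1",)),
    AdapterConfig("Adapter+", "post", "houlsby",
                  ("learned channel-wise scaling",)),
)

post_adapters = [c.name for c in CONFIGS if c.position == "post"]
```

Laid out this way, Adapter+ shares its position with the Pfeiffer configuration and its initialization with the Houlsby configuration, and differs from both through the learned channel-wise scaling.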

Table 4: Detailed results on the VTAB test sets. We report original results and re-evaluations (↻) in % after a complete training schedule with suitable data normalization. Grayed out numbers are not included in the ranking for best and second best results. †: Early stopping based on the test set, •: unsuitable data normalization, ↯: per-task hyperparameter optimization. $^{1}$Average across the average accuracies of the VTAB groups, following previous work. $^{2}$No complete code release for Consolidator, hence training and evaluation details are unknown.
# Param (M) | Natural: Cifar100 [35], Caltech101 [15], DTD [8], Flower102 [45], Pets [47], SVHN [44], Sun397 [62], Average | Specialized: Camelyon [60], EuroSAT [25], Resisc45 [7], Retinopathy [14], Average | Structured: Clevr-Count [33], Clevr-Dist. [33], DMLab [2], KITTI-Dist. [18], dSpr-Loc. [42], dSpr-Ori [42], sNORB-Azi. [36], sNORB-Ele. [36], Average | Global Average$^{1}$

Full 85.8 73.2 92.6 70.4 97.9 86.2 90.6 39.6 78.6 87.1 96.6 87.5 74.0 86.3 66.6 61.0 49.8 79.7 82.6 51.9 33.5 37.0 57.8 74.2
Linear 0.04 78.1 88.1 69.0 99.1 90.0 36.0 56.9 73.9 79.8 90.7 73.7 73.7 79.5 32.4 30.5 35.9 61.9 11.2 26.2 14.3 24.5 29.6 61.0
LoRA [29] 0.29 83.0 91.7 71.6 99.2 90.9 83.8 56.7 82.4 86.2 95.7 83.5 71.9 84.3 77.7 62.3 49.0 80.2 82.2 51.7 31.0 47.0 60.1 75.6
VPT-Deep ↯• [31] 0.60 78.8 90.8 65.8 98.0 88.3 78.1 49.6 78.5 81.8 96.1 83.4 68.4 82.4 68.5 60.0 46.5 72.8 73.6 47.9 32.9 37.8 55.0 72.0
VPT-Deep ↯↻ 0.60 83.0 93.0 71.2 99.0 91.3 84.1 56.0 82.5 84.9 96.6 82.5 74.5 84.6 77.5 58.7 49.7 79.6 86.2 56.1 37.9 50.7 62.1 76.4
NOAH ↯†◐ [67] 0.43 69.6 92.7 70.2 99.1 90.4 86.1 53.7 80.2 84.4 95.4 83.9 75.8 84.9 82.8 68.9 49.9 81.7 81.8 48.3 32.8 44.2 61.3 75.5
SSF ↯† [39] 0.24 69.0 92.6 75.1 99.4 91.8 90.2 52.9 81.6 87.4 95.9 87.4 75.5 86.6 75.9 62.3 53.3 80.6 77.3 54.9 29.5 37.9 59.0 75.7
SSF ↯↻ 0.24 61.9 92.3 73.4 99.4 92.0 90.8 52.0 80.3 86.5 95.8 87.5 72.8 85.7 77.4 57.6 53.4 77.0 78.2 54.3 30.3 36.1 58.0 74.6
FacT-TK$_{8}$ †• [32] 0.05 70.3 88.7 69.8 99.0 90.4 84.2 53.5 79.4 82.8 95.6 82.8 75.7 84.2 81.1 68.0 48.0 80.5 74.6 44.0 29.2 41.1 58.3 74.0
FacT-TK$_{8}$ ↻ 0.05 74.9 92.7 73.7 99.1 91.3 85.5 57.7 82.1 86.8 94.9 84.1 70.9 84.2 81.9 64.1 49.2 77.2 83.8 53.1 28.2 44.7 60.3 75.5
FacT-TK$_{\leq 32}$ †• [32] 0.10 70.6 90.6 70.8 99.1 90.7 88.6 54.1 80.6 84.8 96.2 84.5 75.7 85.3 82.6 68.2 49.8 80.7 80.8 47.4 33.2 43.0 60.7 75.6
FacT-TK$_{\leq 32}$ ↻ 0.10 74.6 93.7 73.6 99.3 90.6 88.7 57.5 82.6 87.6 95.4 85.5 70.4 84.7 84.3 62.6 51.9 79.2 85.5 52.0 36.4 46.6 62.3 76.5
Consolidator$^{2}$ [20] 0.30 74.2 90.9 73.9 99.4 91.6 91.5 55.5 82.4 86.9 95.7 86.6 75.9 86.3 81.2 68.2 51.6 83.5 79.8 52.3 31.9 38.5 60.9 76.5
SPT-Adapter †• [21] 0.23 72.9 93.2 72.5 99.3 91.4 84.6 55.2 81.3 85.3 96.0 84.3 75.5 85.3 82.2 68.0 49.3 80.0 82.4 51.9 31.7 41.2 60.8 75.8
SPT-Adapter ↻ 0.22 74.7 94.1 73.0 99.1 91.2 84.5 57.5 82.0 85.7 94.9 85.7 70.2 84.1 81.3 63.2 49.1 80.7 83.5 52.0 26.4 41.5 59.7 75.3
SPT-Adapter †• [21] 0.43 72.9 93.2 72.5 99.3 91.4 88.8 55.8 82.0 86.2 96.1 85.5 75.5 85.8 83.0 68.0 51.9 81.2 82.4 51.9 31.7 41.2 61.4 76.4
SPT-Adapter ↻ 0.43 74.9 93.2 71.6 99.2 91.1 87.9 57.2 82.2 87.0 95.4 86.5 72.4 85.3 81.1 63.2 50.3 80.2 84.4 51.4 31.5 42.2 60.5 76.0
Adapter+, $r=1$ 0.07 85.4 92.4 73.1 99.1 91.3 83.1 58.1 83.2 87.2 96.6 85.3 72.6 85.5 80.7 60.6 50.9 79.9 83.3 55.6 27.1 43.0 60.1 76.3
Adapter+, $r=2$ 0.09 85.4 93.0 72.7 99.2 90.6 85.3 58.0 83.5 87.9 96.8 85.5 71.4 85.4 83.2 61.0 51.6 80.1 86.1 56.3 30.7 46.5 61.9 76.9
Adapter+, $r=4$ 0.13 84.8 93.8 72.7 99.2 90.6 86.5 57.4 83.6 87.5 96.9 85.9 71.5 85.4 83.4 61.6 53.6 81.4 87.3 55.3 34.4 48.1 63.1 77.4
Adapter+, $r=8$ 0.20 84.6 94.2 72.3 99.3 90.7 87.6 56.7 83.6 87.7 97.0 86.7 72.3 85.9 83.2 60.9 53.8 80.3 88.1 55.6 35.7 47.7 63.1 77.6
Adapter+, $r=16$ 0.35 83.7 94.2 71.5 99.3 90.6 88.2 55.8 83.3 87.5 97.0 87.4 72.9 86.2 82.9 60.9 53.7 80.8 88.4 55.2 37.3 46.9 63.3 77.6
Adapter+, $r\in[1..4]$ 0.11 85.4 93.8 72.7 99.1 90.6 86.5 58.1 83.7 87.5 96.8 85.9 71.4 85.4 83.4 61.0 53.6 81.4 87.3 55.3 34.4 48.1 63.1 77.4
Adapter+, $r\in[1..8]$ 0.16 85.4 93.8 72.7 99.1 90.7 87.6 58.1 83.9 87.7 96.8 86.7 72.3 85.9 83.4 60.9 53.8 80.3 88.1 55.3 35.7 47.7 63.1 77.7
Adapter+, $r\in[1..32]$ 0.27 85.4 93.8 72.7 99.1 90.7 88.2 58.1 84.0 87.5 96.8 87.8 73.9 86.5 83.4 60.9 53.8 80.3 87.2 55.3 37.9 47.7 63.3 77.9

4.4 Main results

VTAB.

We evaluate Adapter+ on the VTAB test sets and compare to other methods in Tab. 4. We provide results for full fine-tuning and tuning only the linear classifier while freezing the rest of the backbone [12] as a baseline of classical fine-tuning methods. As competing parameter-efficient tuning methods, we include LoRA [29], VPT [31], NOAH [67], SSF [39], FacT [32], Consolidator [20], and SPT [21].
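The global average in Tab. 4 is a mean over the three group averages rather than over the 19 individual tasks. The sketch below recomputes it for the Adapter+, $r=8$ row from the per-task entries of the table. Because it starts from rounded table values, the last digit of intermediate results may differ slightly from the unrounded computation in the paper.

```python
from statistics import mean

# Per-task test accuracies (%) of Adapter+ with r = 8, copied from Tab. 4.
adapter_plus_r8 = {
    "Natural": [84.6, 94.2, 72.3, 99.3, 90.7, 87.6, 56.7],
    "Specialized": [87.7, 97.0, 86.7, 72.3],
    "Structured": [83.2, 60.9, 53.8, 80.3, 88.1, 55.6, 35.7, 47.7],
}

# Step 1: average within each group.
group_means = {group: mean(accs) for group, accs in adapter_plus_r8.items()}

# Step 2: average the group averages, so every group has equal weight.
global_average = mean(list(group_means.values()))

# For contrast: a flat mean over all 19 tasks weights Structured most heavily.
flat_average = mean([acc for accs in adapter_plus_r8.values() for acc in accs])
```

Rounded to one decimal, `global_average` reproduces the 77.6% reported in the table, whereas the flat mean over tasks would come out about two points lower because the harder Structured group contributes eight of the 19 tasks.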

Wherever possible, we re-evaluate the other methods with a suitable data normalization for the pre-trained backbone and after the full training schedule to enable a fair comparison. For LoRA, we use our own implementation because the original work does not cover VTAB. For VPT, we adopt the number of tokens per task from their hyperparameter optimization but find that we do not need to tune learning rate and weight decay per task. Additionally, deviating from the original implementation, we optimize with AdamW [40] instead of SGD [54] and change to an appropriate data normalization. We present the original results from [31] on VTAB together with our re-evaluation. Our improved implementation of VPT increases the average accuracy by 4.4 pp from 72.0% to 76.4%. SSF, FacT, and SPT released code to evaluate on VTAB. For FacT and SPT, we change the data normalization to match the backbone; SSF already uses the correct one. We re-run the provided code and present the results after a full training schedule. For completeness, we also report the results from the original publications. However, we found that the code releases of [39, 32, 21] use early stopping based on the best result on the test set. We argue that tuning hyperparameters such as the number of training epochs on the test set goes against established practices in machine learning; rather, the validation set should be used for early stopping. Yet, due to the limited size of the training and validation sets in VTAB, it is not feasible to report test results without also training on the validation data. Hence, we chose to complete a full training schedule of 100 epochs instead of using early stopping. Training SSF for the full schedule leads to a decrease in average accuracy of 1.1 pp relative to the original publication, and re-evaluating SPT leads to a decrease of up to 0.5 pp, even with a corrected data normalization. FacT, on the other hand, benefits from our re-evaluation, since the accuracy decrease from training a complete schedule is offset by improvements from applying the appropriate data normalization. There was no complete code release with configurations to train Consolidator on VTAB at the time of writing; hence, we report results as-is.
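Why the reporting protocol matters can be seen in a small sketch that contrasts three ways of turning a training run into a single test number. The accuracy curves are made-up illustrative values, not results from the paper, and the function names are hypothetical.

```python
from typing import Sequence


def report_last_epoch(test_acc: Sequence[float]) -> float:
    # Full schedule: report whatever the model achieves after the final epoch.
    return test_acc[-1]


def report_best_test_epoch(test_acc: Sequence[float]) -> float:
    # Early stopping on the test set: the epoch is chosen using test labels.
    return max(test_acc)


def report_val_selected_epoch(val_acc: Sequence[float],
                              test_acc: Sequence[float]) -> float:
    # Early stopping on a validation set: test labels play no role in the choice.
    best_epoch = max(range(len(val_acc)), key=lambda e: val_acc[e])
    return test_acc[best_epoch]


# Hypothetical per-epoch accuracies (%) of a single run.
val_curve = [70.0, 74.0, 73.0, 72.5]
test_curve = [69.0, 72.0, 73.5, 72.8]

last = report_last_epoch(test_curve)
best_on_test = report_best_test_epoch(test_curve)
val_selected = report_val_selected_epoch(val_curve, test_curve)
```

By construction, the best-test-epoch number can never be lower than the last-epoch number, so it is an optimistic estimate. Selecting on a validation set avoids this, but requires holding out data that VTAB is too small to spare.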

Adapter+ shows the best parameter-accuracy trade-off among all methods evaluated. This can also be clearly seen in Fig. 1. Additionally, Adapter+ sets a new state of the art with an average accuracy of up to 77.6% over all VTAB subgroups even without any per-task hyperparameter optimization. If we determine the optimal rank $r$ per task on the validation set, we can further improve the accuracy to 77.9%. Optimizing the rank leads to a better parameter-accuracy trade-off than using a fixed rank across all tasks.
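Per-task rank selection can be sketched as follows: for each task, pick the rank with the highest validation accuracy and report the test accuracy obtained with that rank. The task names and all numbers are hypothetical, and breaking ties in favor of the smaller rank is an assumption made here to keep the parameter count low, not a rule stated in the paper.

```python
from typing import Dict


def select_rank(val_acc_by_rank: Dict[int, float]) -> int:
    # Highest validation accuracy wins; ties go to the smaller rank.
    return max(val_acc_by_rank, key=lambda r: (val_acc_by_rank[r], -r))


# Hypothetical validation and test accuracies (%) per task and rank.
val_acc = {
    "task_a": {1: 80.0, 2: 81.5, 4: 81.5},
    "task_b": {1: 60.0, 2: 59.0, 4: 62.0},
}
test_acc = {
    "task_a": {1: 79.0, 2: 80.5, 4: 81.0},
    "task_b": {1: 58.5, 2: 59.5, 4: 61.0},
}

chosen_rank = {task: select_rank(accs) for task, accs in val_acc.items()}
reported = {task: test_acc[task][rank] for task, rank in chosen_rank.items()}
```

Note that the test accuracies are only looked up after the rank has been fixed on the validation data, so the selection itself never touches the test set.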

In Fig. 2, we compare the average accuracy on the subgroups of VTAB. Wherever possible, we present the results of re-evaluating methods after the last training epoch and matching the data normalization to the backbone. The average accuracies of Adapter+ with $r\in[1..32]$ are consistently higher than those of the competing methods. Note that the accuracies of other methods except SPT differ drastically across the different VTAB subgroups. Adapter+, on the other hand, shows a high degree of robustness to the domain shifts between groups.

FGVC.

Table 5: Detailed results on the FGVC test sets. We report original results and re-evaluations (↻) in % after a complete training schedule with suitable data normalization. Grayed out numbers are not included in the ranking for best and second best results.

# Param (M) | CUB200 [61] | NABirds [27] | Oxford Flowers [45] | Stanford Dogs [34] | Stanford Cars [17] | Average

Full 86.0 88.0 81.5 99.2 85.6 90.6 89.0
Linear 0.18 88.9 81.8 99.5 92.6 52.8 83.1
VPT-Deep [31] 0.85 88.5 84.2 99.0 90.2 83.6 89.1
VPT-Deep ↻ 0.85 90.1 83.3 99.6 90.3 85.0 89.7
SSF [39] 0.39 89.5 85.7 99.6 89.6 89.2 90.7
SSF ↻ 0.39 88.9 85.0 99.6 88.9 88.9 90.3
SPT-Adapter [21] 0.40 89.1 83.3 99.2 91.1 86.2 89.8
SPT-LoRA [21] 0.52 88.6 83.4 99.5 91.4 87.3 90.1
Adapter+, $r\in[1..32]$ 0.34 90.0 83.2 99.6 91.6 89.1 90.7
Adapter+ (best epoch) 0.34 90.4 85.0 99.7 92.6 89.1 91.4

Next, we present our results on the FGVC benchmark in Tab. 5. Of the contenders, only SSF [39] has released code and hyperparameter configurations for training on FGVC at the time of writing. As we know from the code releases for VTAB, the reported numbers show the accuracy for early stopping based on the test set. Therefore, we expect a similar evaluation for FGVC. While we do not endorse early stopping based on the test set, we additionally provide numbers for that setting in Tab. 5 for the sake of comparability. Even when training for a complete schedule, Adapter+ shows the best average accuracy with 90.7% over all five datasets in FGVC, 0.4 pp over the second best method under similar evaluation. When early stopping with the test set, Adapter+ reaches 91.4% average accuracy, 0.7 pp over the second best method and 2.4 pp better than full fine-tuning. This demonstrates that Adapter+ also yields state-of-the-art results for task adaptation when training data is abundant while having the best parameter efficiency.
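The averages and margins quoted above can be recomputed from the entries of Tab. 5. The sketch below uses the rounded table values, so it only checks consistency to one decimal place.

```python
from statistics import mean

# Per-dataset test accuracies (%) of Adapter+ from Tab. 5, in table order:
# CUB200, NABirds, Oxford Flowers, Stanford Dogs, Stanford Cars.
adapter_plus_full_schedule = [90.0, 83.2, 99.6, 91.6, 89.1]
adapter_plus_best_epoch = [90.4, 85.0, 99.7, 92.6, 89.1]

# Average column of the reference rows in Tab. 5.
ssf_reevaluated_avg = 90.3   # SSF after a complete schedule
ssf_original_avg = 90.7      # SSF as originally reported
full_finetuning_avg = 89.0

avg_full = round(mean(adapter_plus_full_schedule), 1)
avg_best = round(mean(adapter_plus_best_epoch), 1)

margin_full_schedule = round(avg_full - ssf_reevaluated_avg, 1)
margin_early_stopping = round(avg_best - ssf_original_avg, 1)
margin_vs_full_finetuning = round(avg_best - full_finetuning_avg, 1)
```

The three margins come out as 0.4, 0.7, and 2.4 pp, matching the figures in the text. Each comparison pairs numbers obtained under the same protocol: full schedule against full schedule, and early stopping against early stopping.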

4.5 Ablations

Data normalization.

We showcase the effect of using an unsuitable data normalization for the chosen ViT in Tab. 6. The gap between ImageNet and Inception normalization (see Sec. 3.5) is largest for VPT [31], with a 3.4 pp difference in average accuracy, which explains around two-thirds of the gain for our re-evaluation as shown in Fig. 1. We suspect that VPT has less of an ability to scale and shift the data because the learnable tokens only act on the attention mechanism. LoRA [29], FacT [32], and adapters all employ linear layers that can directly scale and shift the features of the frozen backbone and thus compensate better for improper data normalization. It is worth mentioning that our Adapter+ is the most robust to improper normalization out of the methods evaluated, with a gap of only 2.6 pp average accuracy.
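To make the mismatch concrete, the sketch below normalizes the same pixel with both schemes. It assumes the conventional constants: the per-channel ImageNet statistics, and a mean and standard deviation of 0.5 for the Inception scheme. The exact values used in the paper are defined in Sec. 3.5, which is not reproduced here.

```python
from typing import Tuple

RGB = Tuple[float, float, float]

# Conventional constants for pixel values scaled to [0, 1] (assumed).
IMAGENET_MEAN: RGB = (0.485, 0.456, 0.406)
IMAGENET_STD: RGB = (0.229, 0.224, 0.225)
INCEPTION_MEAN: RGB = (0.5, 0.5, 0.5)
INCEPTION_STD: RGB = (0.5, 0.5, 0.5)


def normalize(pixel: RGB, mean: RGB, std: RGB) -> RGB:
    # Channel-wise standardization: (value - mean) / std.
    r, g, b = ((p - m) / s for p, m, s in zip(pixel, mean, std))
    return (r, g, b)


white: RGB = (1.0, 1.0, 1.0)
black: RGB = (0.0, 0.0, 0.0)

white_inception = normalize(white, INCEPTION_MEAN, INCEPTION_STD)
black_inception = normalize(black, INCEPTION_MEAN, INCEPTION_STD)
white_imagenet = normalize(white, IMAGENET_MEAN, IMAGENET_STD)
```

The Inception scheme maps inputs to the range [-1, 1], while the ImageNet scheme sends a white pixel to values above 2 with a different offset and scale per channel. A frozen backbone pre-trained with one scheme therefore receives shifted and rescaled inputs under the other, which the tuned parameters then have to compensate for.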

Training regularization.   We investigate the importance of training regularization methods like stochastic depth [30] and dropout [16] for training adapters on a frozen ViT backbone and evaluate on the VTAB validation sets. We use linearly increasing drop rates as a function of network depth from 0 to 0.1 for the frozen layers of the ViT model, and a drop rate of 0.1 when using dropout or stochastic depth for the adapter modules. The results in Tab. 7 show a clear benefit for using stochastic regularization for the frozen layers as well as the adapters during training. Using dropout in the adapters is only slightly better than no regularization for adapters, with a gain of only 0.1 pp. With an increase in accuracy of 0.7 pp, stochastic depth is the preferred regularization method for adapters. However, our results show that the more important part is the stochastic depth regularization for the frozen modules of the ViT backbone. Disabling it in training leads to a loss of 1.5 pp accuracy compared to a training where stochastic depth is used throughout the model.
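The drop-rate schedule and the stochastic depth mechanism can be sketched in a few lines. The depth of 12 layers is an assumption (typical for a ViT-B backbone), and rescaling the surviving branch by $1/(1-p)$ during training follows a common implementation convention rather than a detail given in this section. All names are hypothetical.

```python
import random
from typing import Callable, List


def linear_drop_rates(depth: int, max_rate: float = 0.1) -> List[float]:
    # Drop rate grows linearly from 0 at the first layer to max_rate at the last.
    if depth == 1:
        return [0.0]
    return [max_rate * i / (depth - 1) for i in range(depth)]


def residual_with_stochastic_depth(x: float, branch: Callable[[float], float],
                                   drop_rate: float, training: bool,
                                   rng: random.Random) -> float:
    # At inference time, or with a zero rate, the branch is always applied.
    if not training or drop_rate == 0.0:
        return x + branch(x)
    # During training, skip the whole residual branch with probability drop_rate.
    if rng.random() < drop_rate:
        return x
    # Rescale the surviving branch so its expected contribution is unchanged.
    return x + branch(x) / (1.0 - drop_rate)


rates = linear_drop_rates(depth=12, max_rate=0.1)  # assumed 12 transformer layers
adapter_rate = 0.1                                 # fixed rate for adapter modules

rng = random.Random(0)
double = lambda v: 2.0 * v
eval_output = residual_with_stochastic_depth(1.0, double, adapter_rate, False, rng)
train_output = residual_with_stochastic_depth(1.0, double, adapter_rate, True, rng)
```

In training mode, the output is either the untouched input (branch dropped) or the input plus the rescaled branch; in evaluation mode, the branch is always applied without rescaling.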

Table 6: Effects of ImageNet vs. Inception data normalization. All methods are evaluated on the VTAB val sets. In column $\Delta_{\text{Average}}$ we report the increase in accuracy in pp across all VTAB subgroups.
ImageNet norm: Natural, Specialized, Structured, Average | Inception norm: Natural, Specialized, Structured, Average | $\Delta_{\text{Average}}$


VPT 79.2 83.0 53.8 72.0 82.2 86.2 57.9 75.4 3.4
LoRA 78.4 84.1 53.2 71.9 82.0 85.8 56.4 74.7 2.8
FacT-TK 78.0 83.3 56.1 72.4 81.6 85.6 58.1 75.1 2.7
Adapter+ 80.5 85.0 56.0 73.9 83.0 86.8 59.7 76.5 2.6

Table 7: Influence of training regularization. We evaluate accuracy in % with Adapter$_{\text{base}}$ with rank $r=8$ on the VTAB val sets.
ViT regularization | Adapter: Stochastic Depth | Adapter: Dropout | Adapter: None
Stochastic Depth | 76.0 | 75.4 | 75.3
None | 74.5 | 74.3 | 73.7

5 Conclusion

Applied at the right position and with an optimal inner structure, the simple concept of adapters produces state-of-the-art results for task adaptation. To understand how adapters can “strike back”, we conducted the first systematic and in-depth study on how to best construct adapters and integrate them with vision transformers. This allowed us to determine the optimal connection point for the adapter in the transformer layer. Further, we proposed to use a learnable, channel-wise scaling and showed its benefit for computer vision tasks. Our insights led us to the creation of Adapter+ that yields the highest accuracy and the best parameter-accuracy trade-off on VTAB (77.6%, 0.2M) without any per-task hyperparameter optimization and on FGVC (90.7%, 0.34M), showing its superiority over more complicated methods.

Acknowledgements.   This work has been funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center.

References

  • Ba et al. [2016] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv:1607.06450 [stat.ML], 2016.
  • Beattie et al. [2016] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. arXiv:1612.03801 [cs.AI], 2016.
  • [3] Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. TinyTL: Reduce memory, not parameters for efficient on-device learning. In NeurIPS*2020.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9630–9640, 2021.
  • [6] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. AdaptFormer: Adapting vision transformers for scalable visual recognition. In NeurIPS*2022.
  • Cheng et al. [2017] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE, 105(10):1865–1883, 2017.
  • Cimpoi et al. [2014] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
  • [9] Ekin Dogus Cubuk, Barret Zoph, Jonathon Shlens, and Quoc Le. RandAugment: Practical automated data augmentation with a reduced search space. In NeurIPS*2020.
  • Dehghani et al. [2023] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, and Neil Houlsby. Scaling vision transformers to 22 billion parameters. In ICML, pages 7480–7512, 2023.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
  • Donahue et al. [2014] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • Dugas et al. [2015] Emma Dugas, Jorge Jared, and Will Cukierski. Diabetic retinopathy detection. Kaggle, 2015.
  • Fei-Fei et al. [2006] Li Fei-Fei, Robert Fergus, and Pietro Perona. One-shot learning of object categories. IEEE T. Pattern Anal. Mach. Intell., 28(4):594–611, 2006.
  • Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
  • Gebru et al. [2017] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, and Li Fei-Fei. Fine-grained car detection for visual census estimation. In AAAI, pages 4502–4508, 2017.
  • Geiger et al. [2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. Int. J. Robotics Res., 32(11):1231–1237, 2013.
  • Guo et al. [2019] Yunhui Guo, Yandong Li, Liqiang Wang, and Tajana Rosing. Depthwise convolution is all you need for learning multiple visual domains. In AAAI, pages 8368–8375, 2019.
  • Hao et al. [2023] Tianxiang Hao, Hui Chen, Yuchen Guo, and Guiguang Ding. Consolidator: Mergable adapter with group connections for visual adaptation. In ICLR, 2023.
  • He et al. [2023] Haoyu He, Jianfei Cai, Jing Zhang, Dacheng Tao, and Bohan Zhuang. Sensitivity-aware visual parameter-efficient tuning. In ICCV, 2023.
  • He et al. [2022] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In ICLR, 2022.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • Helber et al. [2019] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 12(7):2217–2226, 2019.
  • Hendrycks and Gimpel [2023] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415 [cs.LG], 2023.
  • Horn et al. [2015] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge J. Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In CVPR, pages 595–604, 2015.
  • Houlsby et al. [2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799, 2019.
  • Hu et al. [2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  • Huang et al. [2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, pages 646–661, 2016.
  • Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, pages 709–727, 2022.
  • Jie and Deng [2023] Shibo Jie and Zhi-Hong Deng. FacT: Factor-tuning for lightweight adaptation on vision transformer. In AAAI, pages 1060–1068, 2023.
  • Johnson et al. [2017] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pages 1988–1997, 2017.
  • Khosla et al. [2011] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-grained Visual Classification, 2011.
  • Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Canadian Institute for Advanced Research, 2009.
  • LeCun et al. [2004] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, pages 97–104, 2004.
  • Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, pages 3045–3059, 2021.
  • Li and Liang [2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL/IJCNLP, pages 4582–4597, 2021.
  • [39] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. In NeurIPS*2022.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  • Mallya et al. [2018] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In ECCV, pages 72–88, 2018.
  • Matthey et al. [2017] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset, 2017.
  • Morgado and Vasconcelos [2019] Pedro Morgado and Nuno Vasconcelos. NetTailor: Tuning the architecture, not just the weights. In CVPR, pages 3044–3054, 2019.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • Nilsback and Zisserman [2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729, 2008.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193 [cs.CV], 2023.
  • Parkhi et al. [2012] Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, pages 3498–3505, 2012.
  • [48] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS*2019, pages 8024–8035.
  • Pfeiffer et al. [2021] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-destructive task composition for transfer learning. In EACL, pages 487–503, 2021.
  • Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • Ranftl et al. [2021] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12159–12168, 2021.
  • [53] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, pages 506–516, 2017.
  • Rosenblatt [1958] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev., 65(6):386–408, 1958.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211–252, 2015.
  • [56] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
  • Steiner et al. [2022] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. Trans. Mach. Learn. Res., 2022.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  • [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • Veeling et al. [2018] Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In MICCAI, pages 210–218, 2018.
  • Wah et al. [2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
  • Xiao et al. [2010] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.
  • [63] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, José M. Álvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, pages 12077–12090, 2021.
  • Zaken et al. [2022] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL, 2022.
  • Zhai et al. [2020] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, and Neil Houlsby. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv:1910.04867 [cs.CV], 2020.
  • Zhang et al. [2018] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
  • Zhang et al. [2022] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. arXiv:2206.04673 [cs.CV], 2022.
Supplementary Material

In this appendix, we provide further details and results, which could not be included in the main paper due to space limitations.

Appendix A Why did adapters underperform for ViTs?

First, we want to shed more light on why adapters do not rank well in the literature for parameter-efficient transfer learning for vision tasks. By comparing the numbers reported for adapters on VTAB in the publications referenced in Tab. 4 of the main paper, we found that they essentially stem from only two sources.

The first source is VPT [31], where results for an adapter with a reduction factor of 256, amongst other configurations, are reported. For a ViT-B/16 with a hidden dimension of $d = 768$, this is equal to an adapter with rank $r = 3$. Despite citing Pfeiffer et al. [49], who suggest a Post-Adapter position, the actual implementation in the code base (https://github.com/KMnP/vpt) equals an Intermediate-Adapter that performs worse on VTAB (see Sec. 3.3 of the main paper). The initialization used for the adapter parameters most resembles a LoRA initialization but sets the adapter parameters to zero everywhere. Therefore, there is no randomization in the initialization of the adapter parameters, and different seeds only affect the initialization of the classifier. Additionally, the intermediate features in the adapter bottlenecks then become all zero, leading to identical gradients in the up-projections at the start of training, which hinders optimization. As a result, the adapter baseline used by VPT only reaches 60.0% average accuracy on the VTAB test sets. This is a gap of 17.6 percentage points (pp) compared to our Adapter+ with rank $r = 8$ (77.6% average accuracy). Even when considering the loss of around 2 to 3 pp caused by an unsuitable data normalization in the VPT implementation, this is still a very significant gap. The numbers for an adapter with rank $r = 3$ from VPT are also reported in [39] as a baseline.
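The effect of an all-zero initialization can be checked on a toy adapter. The following is a minimal sketch, not the VPT code: it assumes a bottleneck adapter of the form $\bm{W}_\text{up}\,\sigma(\bm{W}_\text{down}\bm{x} + \bm{b}_\text{down}) + \bm{b}_\text{up}$ with a GELU activation, and the function names, toy dimensions, and input values are hypothetical. Under these assumptions, the bottleneck features are zero, so all entries of both projection matrices receive the same (zero) gradient in the first step and only the output bias gets a training signal.

```python
import math


def gelu(z):
    """GELU activation (assumed here; note that gelu(0) == 0)."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def gelu_grad(z):
    """Derivative of GELU: Phi(z) + z * phi(z)."""
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf


def adapter_grads(x, g, w_down, b_down, w_up):
    """Gradients of L = sum_i g[i] * y[i], where
    y = W_up @ gelu(W_down @ x + b_down) + b_up and g is the upstream gradient."""
    pre = [sum(w * xk for w, xk in zip(row, x)) + b
           for row, b in zip(w_down, b_down)]
    h = [gelu(p) for p in pre]                      # bottleneck features
    grad_w_up = [[gi * hj for hj in h] for gi in g]  # shape d x r
    grad_b_up = list(g)                              # shape d
    back = [sum(w_up[i][j] * g[i] for i in range(len(g)))
            for j in range(len(h))]                  # W_up^T @ g
    delta = [bj * gelu_grad(pj) for bj, pj in zip(back, pre)]
    grad_w_down = [[dj * xk for xk in x] for dj in delta]  # shape r x d
    return h, grad_w_up, grad_b_up, grad_w_down


# Toy sizes (hypothetical): hidden dimension d = 4, rank r = 2.
d, r = 4, 2
x = [0.5, -1.0, 2.0, 0.1]   # arbitrary input token
g = [1.0, -2.0, 0.5, 3.0]   # arbitrary upstream gradient

# All-zero initialization of every adapter parameter.
w_down = [[0.0] * d for _ in range(r)]
b_down = [0.0] * r
w_up = [[0.0] * r for _ in range(d)]

h, grad_w_up, grad_b_up, grad_w_down = adapter_grads(x, g, w_down, b_down, w_up)
print("bottleneck features:", h)
print("grad W_up:", grad_w_up)
print("grad W_down:", grad_w_down)
print("grad b_up:", grad_b_up)
```

With a random down-projection and a zero up-projection instead, the bottleneck features are non-zero and the up-projection gradients differ per entry.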

The second source for adapter baseline results is the NOAH pre-print [67]. There, an adapter with rank $r = 8$ is used. Its implementation (https://github.com/ZhangYuanhan-AI/NOAH) performs the following feature transformation:

$$\bm{x} \mapsto \text{Adapter}\bigl(\text{FFN}(\bm{x})\bigr) + \bm{x}\,. \tag{12}$$

This is closest to the Intermediate-Adapter (cf. Eq. 10 of the main paper) but misses the skip connection bypassing the adapter and containing $\text{FFN}(\bm{x})$. Thus, the adapter does not learn a residual function to an identity mapping but instead must learn a more complex mapping to transform its input. Therefore, the adapter becomes harder to train [24], leading to an average accuracy of 73.9% on the VTAB test sets or 3.7 pp behind our Adapter+. The NOAH adapter results have also proliferated to the publications of FacT [32] and SPT [21]. The adapter implementation from NOAH is also used in the code released for Consolidator (https://github.com/THU-MIG/Consolidator) [20], but their results are produced with rank $r = 16$, giving a slightly better average accuracy of 74.3%, or 3.3 pp less than Adapter+.
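The role of the missing skip connection can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: `ffn` is a hypothetical stand-in for the frozen feed-forward network, `zero_adapter` models an adapter whose output is zero, and the Intermediate-Adapter is written as $\text{Adapter}(\text{FFN}(\bm{x})) + \text{FFN}(\bm{x}) + \bm{x}$, which is inferred from the description above rather than copied from Eq. 10 of the main paper.

```python
def ffn(x):
    # Hypothetical stand-in for the frozen FFN of a transformer layer.
    return [2.0 * v + 1.0 for v in x]


def zero_adapter(z):
    # Adapter that outputs zero, i.e. it has not learned anything yet.
    return [0.0 for _ in z]


def add(*vectors):
    # Element-wise sum of equally sized vectors.
    return [sum(vals) for vals in zip(*vectors)]


def frozen_block(x):
    # Pre-trained block without any adapter: FFN(x) + x.
    return add(ffn(x), x)


def noah_block(x, adapter):
    # Eq. (12): Adapter(FFN(x)) + x, without the skip carrying FFN(x).
    return add(adapter(ffn(x)), x)


def intermediate_block(x, adapter):
    # Assumed Intermediate-Adapter: Adapter(FFN(x)) + FFN(x) + x.
    z = ffn(x)
    return add(adapter(z), z, x)


x = [1.0, -2.0]
print("frozen block:      ", frozen_block(x))
print("intermediate block:", intermediate_block(x, zero_adapter))
print("Eq. (12) block:    ", noah_block(x, zero_adapter))
```

With an adapter that outputs zero, the assumed Intermediate-Adapter reproduces the pre-trained block, whereas the transformation of Eq. (12) discards $\text{FFN}(\bm{x})$ entirely. To recover the pre-trained behavior, the adapter in Eq. (12) would first have to learn to pass $\text{FFN}(\bm{x})$ through its bottleneck.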

In summary, the examined baseline implementations differ from the configurations proposed by Houlsby et al. [28] and Pfeiffer et al. [49] and introduce issues that lead to their underperformance. In our paper, we show that adapters are capable of reaching 77.6% average accuracy for rank $r = 8$ and 77.9% for our optimized version of Adapter+, uplifting adapters from an easy-to-beat baseline to a state-of-the-art transfer method.

Appendix B Dataset properties

In Tabs. 8 and 9, we show the statistics of each task in VTAB [65] and FGVC [31] with regard to the number of classes and the number of images in the train, validation, and test splits. The tables are largely “borrowed” from [31].

Table 8: Dataset details for VTAB.

| Group | Task | # Classes | Train | Val | Test |
| --- | --- | --- | --- | --- | --- |
| Natural | CIFAR-100 [35] | 100 | 800 | 200 | 10 000 |
| Natural | Caltech-101 [15] | 102 | 800 | 200 | 6 084 |
| Natural | DTD [8] | 47 | 800 | 200 | 1 880 |
| Natural | Oxford Flowers [45] | 102 | 800 | 200 | 6 149 |
| Natural | Pets [47] | 37 | 800 | 200 | 3 669 |
| Natural | SVHN [44] | 10 | 800 | 200 | 26 032 |
| Natural | Sun397 [62] | 397 | 800 | 200 | 21 750 |
| Specialized | Patch Camelyon [60] | 2 | 800 | 200 | 32 768 |
| Specialized | EuroSAT [25] | 10 | 800 | 200 | 5 400 |
| Specialized | RESISC45 [7] | 45 | 800 | 200 | 6 300 |
| Specialized | Diabetic Retinopathy [14] | 5 | 800 | 200 | 42 670 |
| Structured | CLEVR-Count [33] | 8 | 800 | 200 | 15 000 |
| Structured | CLEVR-Distance [33] | 6 | 800 | 200 | 15 000 |
| Structured | DMLab [2] | 6 | 800 | 200 | 22 735 |
| Structured | KITTI-Distance [18] | 4 | 800 | 200 | 711 |
| Structured | dSprites-Location [42] | 16 | 800 | 200 | 73 728 |
| Structured | dSprites-Orientation [42] | 16 | 800 | 200 | 73 728 |
| Structured | smallNORB-Azimuth [36] | 18 | 800 | 200 | 12 150 |
| Structured | smallNORB-Elevation [36] | 9 | 800 | 200 | 12 150 |

Table 9: Dataset details for FGVC. For datasets marked with *, we follow [31] to randomly sample train and validation splits because validation sets are not available from the original datasets.

| Dataset | # Classes | Train | Val | Test |
| --- | --- | --- | --- | --- |
| CUB-200-2011* [61] | 200 | 5 394 | 600 | 5 794 |
| NABirds* [27] | 555 | 21 536 | 2 393 | 6 084 |
| Oxford Flowers [45] | 102 | 1 020 | 1 020 | 6 149 |
| Stanford Dogs* [34] | 120 | 10 800 | 1 200 | 8 580 |
| Stanford Cars* [17] | 196 | 7 329 | 815 | 8 041 |

Appendix C More experimental settings

For all experiments conducted with our implementation, we average the results over three seeds. This includes the (re-)evaluations of LoRA and VPT. We built our implementation on PyTorch [48], PyTorch Lightning (https://lightning.ai/pytorch-lightning), and timm (https://github.com/huggingface/pytorch-image-models). We run experiments with bfloat16 mixed precision on an NVIDIA RTX A6000 GPU.

For our experiments in the main paper, we report results for a fixed adapter rank $r$ as well as ranks optimized per task. For the per-task optimization of Adapter+, we use a hyperparameter sweep over the set of ranks $r \in \{1, 2, 4, 8, 16, 32\}$. We evaluate on the validation sets of VTAB and FGVC and choose the per-task ranks from the specified range(s) to steer the average number of parameters. The ranks we used to produce the results on the VTAB and FGVC test sets (see Tabs. 4 and 5 in the main paper) are shown in detail in Tab. 10 and Tab. 11, respectively.
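One plausible reading of this per-task selection can be sketched as follows. This is a hedged illustration rather than our released code: the helper `select_rank`, the tie-breaking rule in favor of the smaller rank, and the validation accuracies in `toy_val_acc` are all hypothetical placeholders and not results from the paper.

```python
SWEEP_RANKS = (1, 2, 4, 8, 16, 32)  # ranks covered by the sweep


def select_rank(val_acc_by_rank, max_rank):
    """Return the swept rank <= max_rank with the best validation accuracy.

    Ties are broken in favor of the smaller rank (an assumption made here
    to keep the number of parameters low).
    """
    permitted = [r for r in SWEEP_RANKS
                 if r <= max_rank and r in val_acc_by_rank]
    if not permitted:
        raise ValueError("no swept rank within the permitted range")
    return max(permitted, key=lambda r: (val_acc_by_rank[r], -r))


# Placeholder validation accuracies for one hypothetical task
# (illustrative only, NOT taken from the paper).
toy_val_acc = {1: 70.0, 2: 71.5, 4: 72.0, 8: 73.0, 16: 72.5, 32: 73.0}

# One selection per permitted range [1..4], [1..8], and [1..32].
chosen = {max_rank: select_rank(toy_val_acc, max_rank)
          for max_rank in (4, 8, 32)}
print(chosen)
```

Restricting the permitted range to smaller ranks lowers the average number of parameters across tasks, at the cost of excluding higher ranks that may perform better on some tasks.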

Appendix D Calculation of no. of trainable parameters

Suppose we have a ViT with a hidden dimension $d$, $N$ transformer layers, and adapters with rank $r$. The total number of learnable parameters for Adapter$_{\text{base}}$ modules (cf. Eq. 4 of the main paper) attached to the FFN of every transformer layer then amounts to $N(2dr + r + d)$. Including layer normalization in the adapter modules amounts to $2Nd$ additional parameters. The addition of learned, layer-wise scaling amounts to $N$ extra parameters and choosing learned, channel-wise scaling instead adds $Nd$ extra parameters. Adapter+ (see Sec. 4.3 of the main paper) thus amounts to $N(2dr + 2d + r)$ total parameters. Additionally, for a task with $c$ classes, we add a classifier with $dc + c$ learnable parameters.
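As a worked example, the formulas above can be evaluated for a concrete configuration. The sketch below is a minimal illustration with hypothetical function names. It assumes $N = 12$ layers, the standard depth of ViT-B/16 (not stated explicitly in this appendix), together with $d = 768$, rank $r = 8$, and a task with $c = 100$ classes.

```python
def adapter_base_params(d, r, n_layers):
    """N(2dr + r + d): two projection matrices plus their biases per layer."""
    return n_layers * (2 * d * r + r + d)


def adapter_plus_params(d, r, n_layers):
    """N(2dr + 2d + r): Adapter_base plus channel-wise scaling (d per layer)."""
    return adapter_base_params(d, r, n_layers) + n_layers * d


def classifier_params(d, c):
    """dc + c: linear classifier with bias for c classes."""
    return d * c + c


d, n_layers, r = 768, 12, 8  # ViT-B/16 with rank-8 adapters (N = 12 assumed)

base = adapter_base_params(d, r, n_layers)
plus = adapter_plus_params(d, r, n_layers)
head = classifier_params(d, c=100)

print(f"Adapter_base: {base:,}")         # 12 * (12288 + 8 + 768)
print(f"Adapter+:     {plus:,}")         # 12 * (12288 + 1536 + 8)
print(f"Classifier:   {head:,}")         # 768 * 100 + 100
print(f"Total:        {plus + head:,}")
```

Under these assumptions, the adapters alone account for roughly 0.17 M parameters, and the classifier contribution varies with the number of classes of the task.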

Table 10: Adapter rank $r$ for each VTAB task for optimized versions of Adapter+ with different ranges of permitted ranks.

| Group | Task | $r \in [1..4]$ | $r \in [1..8]$ | $r \in [1..32]$ |
| --- | --- | --- | --- | --- |
| Natural | CIFAR-100 [35] | 1 | 1 | 1 |
| Natural | Caltech-101 [15] | 4 | 4 | 4 |
| Natural | DTD [8] | 2 | 2 | 2 |
| Natural | Flowers [45] | 1 | 1 | 1 |
| Natural | Pets [47] | 4 | 8 | 8 |
| Natural | SVHN [44] | 4 | 8 | 16 |
| Natural | Sun397 [62] | 1 | 1 | 1 |
| Specialized | Camelyon [60] | 4 | 8 | 16 |
| Specialized | EuroSAT [25] | 2 | 2 | 2 |
| Specialized | RESISC45 [7] | 4 | 8 | 32 |
| Specialized | Retinopathy [14] | 2 | 8 | 32 |
| Structured | CLEVR-Count [33] | 4 | 4 | 4 |
| Structured | CLEVR-Dist. [33] | 2 | 8 | 8 |
| Structured | DMLab [2] | 4 | 8 | 8 |
| Structured | KITTI-Dist. [18] | 4 | 8 | 8 |
| Structured | dSpr-Loc. [42] | 4 | 8 | 32 |
| Structured | dSpr-Ori. [42] | 4 | 4 | 4 |
| Structured | sNORB-Azi. [36] | 4 | 8 | 32 |
| Structured | sNORB-Ele. [36] | 4 | 8 | 8 |
| | # Param (M) | 0.11 | 0.16 | 0.27 |

Table 11: Adapter rank $r$ for each FGVC dataset for optimized versions of Adapter+ with different ranges of permitted ranks.

| Permitted ranks | # Param (M) | CUB-200 [61] | NABirds [27] | Oxford Flowers [45] | Stanford Dogs [34] | Stanford Cars [17] |
| --- | --- | --- | --- | --- | --- | --- |
| $r \in [1..32]$ | 0.34 | 2 | 2 | 1 | 1 | 32 |

Appendix E Vision transformer pre-training

As we add only very few parameters to an otherwise frozen backbone, the generalization capability of the feature representations produced by the backbone is important. For ViTs, there are a number of off-the-shelf models available with differences in their training procedures. Here, we examine three different pre-trainings as examples: (1) Original: the ViT-B/16 weights used in the main paper, pre-trained with supervision on ImageNet-21k [55] following the training procedure of the original ViT publication [13] (https://storage.googleapis.com/vit_models/imagenet21k/ViT-B_16.npz), (2) ImageNet-1k: the same ViT weights further fine-tuned on ImageNet-1k [55] (https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/ViT-B_16-224.npz), and (3) AugReg: weights from a pre-training with stronger data augmentation in the form of Mixup [66] and RandAugment [9] following [57] (https://storage.googleapis.com/vit_models/augreg/B_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.0-sd_0.0.npz).

In Tab. 12, we summarize our results for Adapter+ with rank $r = 8$ evaluated on the VTAB validation sets. We notice that additional fine-tuning on ImageNet-1k gives a slight edge (83.4% average accuracy over 83.0% for second best) in adaptation for tasks that contain natural images. However, the fine-tuning is detrimental for the Specialized and Structured groups. Not fine-tuning on ImageNet-1k is particularly beneficial for the Structured group, with a large increase of 3.7 pp. The AugReg training setting improves the transfer to the Specialized group but is worse than the other settings for natural images. Overall, the original supervised training on ImageNet-21k generalizes best across all tasks in VTAB with an average accuracy of 76.5%, which is 0.3 pp better than AugReg training and 1.2 pp better than ImageNet-1k fine-tuning.

Table 12: Influence of ViT pre-training. We use Adapter+ with rank $r = 8$ for the evaluation and report the average accuracy in % for each subgroup and across all groups on the VTAB val sets.

| Pre-training | Natural | Specialized | Structured | Average |
| --- | --- | --- | --- | --- |
| ImageNet-1k | 83.4 | 86.5 | 56.0 | 75.3 |
| AugReg | 81.6 | 87.2 | 59.7 | 76.2 |
| Original | 83.0 | 86.8 | 59.7 | 76.5 |
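
The numbers in Tab. 12 are consistent with the Average column being the unweighted mean of the three group accuracies, so that each group counts equally regardless of its number of tasks. The short check below is a sketch under that assumption. The variable names are hypothetical and the values are copied from Tab. 12.

```python
# Group accuracies (Natural, Specialized, Structured) from Tab. 12.
GROUP_ACC = {
    "ImageNet-1k": (83.4, 86.5, 56.0),
    "AugReg": (81.6, 87.2, 59.7),
    "Original": (83.0, 86.8, 59.7),
}


def group_average(acc):
    """Unweighted mean over the group accuracies, rounded to one decimal."""
    return round(sum(acc) / len(acc), 1)


averages = {name: group_average(acc) for name, acc in GROUP_ACC.items()}

# Gaps in percentage points (pp) quoted in the text.
gap_augreg = round(averages["Original"] - averages["AugReg"], 1)
gap_in1k = round(averages["Original"] - averages["ImageNet-1k"], 1)
gap_structured = round(GROUP_ACC["Original"][2] - GROUP_ACC["ImageNet-1k"][2], 1)

print(averages)
print(gap_augreg, gap_in1k, gap_structured)
```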

Appendix F Generality of the conclusions

Using DINO [5] as an example of a ViT trained with self-supervision, we show in Tab. 13 that the best-to-worst order of adapter positions is consistent with that of a supervised backbone in terms of average accuracy, albeit with a higher standard deviation. The ranking also stays the same for the comparison of Adapter+ with adapter configurations from previous work as presented in Tab. 14. This shows that our conclusions generalize beyond backbones with supervised pre-training to backbones based on self-supervised pre-training.

Table 13: Adapter position with DINO backbone. We report average accuracy in % (± std. dev.) on the VTAB val sets for different adapter positions. Adapter$_{\text{base}}$ with Houlsby initialization and rank $r = 8$ is used in all experiments.

| Position | Natural | Specialized | Structured | Average |
| --- | --- | --- | --- | --- |
| Pre | 76.8 ± 0.4 | 86.2 ± 0.6 | 53.6 ± 0.7 | 72.2 ± 0.3 |
| Intermediate | 76.8 ± 0.4 | 85.8 ± 0.8 | 52.6 ± 0.9 | 71.8 ± 0.4 |
| Parallel | 76.7 ± 0.3 | 86.8 ± 0.4 | 54.1 ± 0.7 | 72.5 ± 0.3 |
| Post | 76.9 ± 0.2 | 86.3 ± 0.5 | 55.3 ± 0.7 | 72.8 ± 0.3 |

Table 14: Comparison of Adapter+ with adapter configurations from previous work with DINO backbone. We report the average accuracy in % (± std. dev.) of each subgroup and across all groups on the VTAB val sets.

| Configuration | # Param | Natural | Specialized | Structured | Average |
| --- | --- | --- | --- | --- | --- |
| Houlsby [28], $r = 8$ | 0.39 | 77.4 ± 0.4 | 86.5 ± 0.7 | 52.9 ± 0.8 | 72.3 ± 0.4 |
| Houlsby [28], $r = 4$ | 0.24 | 77.2 ± 0.5 | 86.2 ± 0.5 | 53.2 ± 0.8 | 72.2 ± 0.3 |
| Pfeiffer [49] | 0.21 | 76.8 ± 0.4 | 86.2 ± 0.3 | 54.4 ± 1.0 | 72.5 ± 0.4 |
| AdaptFormer [6] | 0.19 | 76.5 ± 0.4 | 85.8 ± 0.4 | 53.0 ± 0.5 | 71.8 ± 0.3 |
| Adapter+ | 0.20 | 76.7 ± 0.3 | 86.4 ± 0.5 | 55.4 ± 0.8 | 72.8 ± 0.3 |