
License: CC BY-NC-SA 4.0
arXiv:2403.15194v1 [cs.CV] 22 Mar 2024

Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion

Sofia Casarin¹, Cynthia I. Ugwu¹, Sergio Escalera²,³, Oswald Lanz¹
¹Free University of Bozen-Bolzano, Bolzano, Italy
²Computer Vision Center, Barcelona, Spain
³Universitat de Barcelona, Barcelona, Spain
{scasarin, cugwu, olanz}@unibz.it, [email protected]
  
Abstract

The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally, emphasis has been on scaling model architectures, resulting in large and complex neural networks, which can be difficult to train with limited computational resources. However, independently of the model size, data quality (i.e. amount and variability) is still a major factor that affects model generalization. In this work, we propose a novel technique to exploit available data through automatic data augmentation for the tasks of image classification and semantic segmentation. We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos. Compared to previous approaches, DAS is extremely fast and flexible, allowing the search on very large search spaces in less than a GPU day. Our intuition is that the increased receptive field in the temporal dimension provided by DAS could also benefit the spatial receptive field. More specifically, we leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependent transformations. As a result, compared to standard augmentation alternatives, we improve accuracy on the ImageNet, Cifar10, Cifar100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when plugging DAS into different light-weight video backbones.

1 Introduction

Creating models with significantly increased capacity, in an attempt to achieve incremental performance improvements, has been the prevailing approach in designing Convolutional Neural Network (CNN) classifiers. As a result, CNNs with increased depth [45, 16, 40, 60] and Vision Transformers (ViTs) [13] have been proposed over the years. Specifically, ViTs have demonstrated promising results on a wide variety of computer vision tasks, including image classification and semantic segmentation [48, 13, 36, 56].

(a) Conceptual representation of the proposed approach. We reshape the Receptive Field (RF) by applying affine transformations optimized through our Differentiable Augmentation Search (DAS). On the top right you can see how fusing with random transformation would not lead to benefits as, when concatenating in time, the employed shift mechanism would fuse features related to random parts. On the bottom, the augmentations guided by DAS obtain specific shapes of the RF so that more context is kept.
[Panel labels: rotate, translate, zoom, found by DAS]

(b) RF visualization (ResNet-50, with GSF fusion) when different single or composed transformations are applied. The last column shows our DAS selected operation for CIFAR-10 and CIFAR-100, which combines translation, rotation and zoom. More details in Tab. 4.
Figure 1: (a) overviews our approach and (b) shows a real example of the obtained receptive fields. The employed transformations are fundamental to shaping the receptive field, as shown in (b). The images augmented with DAS (Sec. 3.1) are concatenated in time and processed through a video network that partially shifts and fuses the features (Sec. 3.2).

However, while these techniques have demonstrated remarkable success, such high-capacity models require more computational resources for effective training and inference, which makes them economically impractical to train and deploy in many practical applications. Moreover, the over-parametrization of ViTs and deep CNNs makes the networks prone to overfitting, thus requiring strong regularization to achieve ideal performance. As a result, an alternative trend relies on exploiting the power of data. In this work, we tackle this problem and propose a novel way of augmenting data with the goal of expanding the receptive field of CNNs.
Data augmentation techniques, usually employed to enhance the generalization capabilities of machine learning models, rely on meticulous design and require domain-specific knowledge. To address this, various automatic data augmentation methods, inspired by Neural Architecture Search (NAS), have been extensively used in supervised learning [12, 11, 34, 27] to search for accurate augmentation strategies. However, many of these approaches either require much time [11, 17, 47], forcing the search to be performed on a proxy task, or do not propose an effective search strategy but rather a well-defined search space, as in [12]. The former make the strong assumption that the proxy task provides a predictive indication of the larger task [12]. For the latter, the dramatic reduction in parameter space, which allows a simple grid search, implies strong prior knowledge in the definition of the search space. As a result, as we empirically show, including “noisy” transformations would imply a much longer search time to achieve the same performance. By harnessing affine transformations optimized through our new Differentiable Augmentation Search (DAS) strategy, we generate a series of images that exhibit motion within a specific region and treat them analogously to video sequences. Inspired by transformation-based models [49] that, operating in the space of affine transforms, re-phrase the frame prediction problem as modelling transformations within frames, we introduce a novel methodology to extend the Receptive Field (Fig. 1(a)). Starting from the hypothesis that enlarging the receptive field (RF) in the temporal dimension could also benefit the spatial RF, we establish the viability of this approach and empirically show its merits by employing video networks to process such image sequences.
In this context, to avoid an unnecessary increase in computational overhead for image classification and semantic segmentation, we exploit a feature shift mechanism [42], which in video action recognition achieves performance comparable to a 3D CNN while keeping the complexity of a 2D CNN. In summary, our contributions are:

  • We re-formulate automatic data augmentation in a differentiable manner and propose DAS. By defining a continuous search space of image transformations and exploiting a perturbation-based approach for transformation selection, we provide a very general and easy-to-deploy alternative both to slower existing reinforcement learning methods and to random methods that are more sensitive to the search space definition.

  • We propose a new way of handling 2D data by repeatedly transforming and concatenating images as frames in a video, obtaining a new perspective to exploit the richness of data. To this aim, we address the question of how the increase of the receptive field in a third dimension impacts the original 2D spatial receptive field.

  • As a result, we successfully expand the receptive field for image classification and segmentation tasks. This allows us to reach the state-of-the-art ResNet-152 results on ImageNet with a temporally expanded ResNet-50 that has fewer than half the parameters, and to surpass DeepLabv3 by 1.3% on Pascal-VOC and ResNeSt models by 1.1% on CityScapes.

2 Related Work

In our work, we are interested in image classification and semantic image segmentation tasks. A large focus of the computer vision community has been on engineering better network architectures to improve performance. ImageNet progress was initially dominated by CNNs [22, 40, 44, 21, 46, 18], and more recently Vision Transformers [13, 30] have reached competitive results due to their wider receptive fields. Semantic segmentation requires either global features or contextual interactions to accurately classify, at the pixel level, objects at multiple scales. To this end, atrous convolutions [3, 5], spatial pyramid pooling modules [61], and networks with attention modules [60, 14, 54] with almost 1 billion parameters were proposed. While these methods enlarge the receptive field and achieve good performance, they are extremely data hungry and often require strong regularization techniques, such as data augmentation.

Figure 2: Our method takes an input image and processes it through a DAS cell. The cell, shown in more detail in Fig. 3, applies all possible transformations defined in the search space and generates an input video. The video is processed by a video network equipped with a temporal shift mechanism, whose goal is to shift the features of adjacent frames. As can be observed in the pink box, the features are shifted and combined as if a $3\times 3\times 3$ kernel were applied. As the content derives from transformations of the same image, the result over the original 2D image is a reshaping of the RF. Finally, the predictions for the video input are combined, so that the performance on the original 2D task is given back as feedback to the DAS cell.

2.1 Automatic Data Augmentation

Traditionally, data augmentation requires manual design and domain knowledge. Random cropping, image mirroring, and color distortion are common in natural image datasets like CIFAR-10 and ImageNet. Elastic distortion and affine transformations such as translation and rotation are more common in datasets like MNIST [23] and SVHN [35]. Many methods have been influenced by NAS [63] to find the best dataset-specific set of augmentation policies or strategies [11, 24, 47, 25, 17]. AutoAugment (AA) [11], the first automatic augmentation method, uses reinforcement learning to predict accurate problem-dependent augmentation policies. Despite its success, its main drawback is that the search must run on a smaller version of the dataset because of the multi-level search space and the repetitive training, which makes the strong assumption that the proxy task provides a predictive indication of the larger task. Methods like [27, 12, 34] drastically reduce the parameter space for data augmentation, which allows them to be trained on the full dataset. RandAugment (RA) [12] uses only two hyperparameters: one controls the number of augmentations to combine for each image, and the other controls the magnitude of the operations. The downside of RA is that it performs a grid search over a set of augmentation operations, incurring up to $\times 80$ overhead over a single training [34]. UniformAugment (UA) [27] and TrivialAugment (TA) [34] instead propose parameter-free algorithms in which hyperparameters are sampled uniformly in the augmentation space. Similarly to UA, we define a continuous augmentation space. However, unlike UA, RA, and AA, we neither randomly sample the ad hoc hyperparameters nor perform the search on a reduced dataset; instead, we use a continuous search strategy, which makes our method fast without the need to drastically reduce the search space or to introduce designer bias.

2.2 Enhancing the Receptive Field

In the context of CNNs, the general trend that led to very deep networks is motivated by the search for a larger receptive field. The concept of RF is indeed important for understanding and diagnosing how deep CNNs work, as it determines what information a single neuron has access to. Over the years, many attempts were made to expand it, either by increasing the factors that determine its theoretical size, e.g. the depth of the network and the kernel size, or by providing more context information.

Spatial Domain

In the context of image classification and semantic segmentation, using context information from the whole image can significantly improve performance. Mostajabi et al. [33] demonstrated that “zoom-out” features achieve impressive performance on the semantic segmentation task. Liu et al. [29] noticed that, although the features from the top layers of a neural network should theoretically have very large receptive fields, in practice the empirical size is much smaller and consequently not enough to capture the global context. To address this, they propose to use global averaging to pool the context features from layers of the neural network, proving empirically that such a method results in a larger empirical RF. In [38], Richter et al. question the actual need for a very deep network and propose a method for layer pruning based on the analysis of the size of the RF.

Temporal Domain

How to effectively expand the receptive field in the temporal domain has been investigated in video understanding research [51, 32, 26, 41, 42]. 3D CNNs can capture three-dimensional features; however, they are more computationally intensive than video networks based on 2D CNNs. In [26], the authors implement the Temporal Shift Module (TSM), where a 2D CNN is used as a backbone to extract spatial information. The temporal features are integrated by shifting a fixed number of channels forward and backward along the temporal dimension. Subsequently, Sudhakaran et al. [42] proposed the Gate-Shift-Fuse (GSF) module, a spatiotemporal feature extraction module that leverages a learnable shift of the channels along the temporal dimension, as well as channel weighting to fuse the shifted features. TSM and GSF are lightweight and therefore good video network candidates for our image-to-video pipeline in Fig. 2.
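The channel shift used by TSM-style modules can be illustrated with a minimal sketch in plain Python. This is not the implementation of [26] or [42]: the function name `temporal_shift`, the parameter `fold` (number of channels shifted in each direction), the `frames[t][c]` layout with one scalar per channel, and the zero padding at the sequence boundaries are all assumptions made for illustration. GSF additionally learns the gating and fusion, which is not shown here.

```python
def temporal_shift(frames, fold):
    """Shift `fold` channels forward and `fold` channels backward in time.

    frames -- features indexed as frames[t][c] (time, channel); hypothetical layout
    fold   -- number of channels moved in each direction (hypothetical parameter)
    Positions with no source frame are zero-padded (an assumption).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1      # read from the previous frame: features move forward
            elif c < 2 * fold:
                src = t + 1      # read from the next frame: features move backward
            else:
                src = t          # remaining channels stay in place
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# Three frames, three channels; one channel is shifted each way.
features = [[1, 10, 100], [2, 20, 200], [3, 30, 300]]
shifted = temporal_shift(features, fold=1)
```

After the shift, each frame holds channels from its two temporal neighbours, which is why a following 2D convolution mixes information from three adjacent frames.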

In contrast to prior methods that enhance the spatial receptive field by concatenating global or zoomed-out features, or that use deeper networks, our method goes further by aggregating features derived from diverse transformations and processing them as a video. We employ distinct transformations to generate variations of the input. Given that data augmentation requires domain-specific knowledge, we adopt automatic data augmentation.

3 Methods

In this section, we first introduce our new differentiable transformation search process, which is responsible for reshaping the Receptive Field in image classification and image segmentation problems (Sec. 3.1). We highlight the benefit of a continuous search space and motivate the importance of a perturbation-based selection technique. We then detail in Sec. 3.2 how we use DAS to exploit the power of data, showing that, compared with traditional methods that expand the RF, our approach results in a “reshaping” rather than an expansion. The pipeline of the proposed architecture is shown in Fig. 2.

3.1 Differentiable Augmentation Search

In order to extend images to video and to properly reshape the RF (Fig. 1(a)), a set of optimal transformations needs to be found. We propose a new general and differentiable approach, later deployed with a restricted search space for our use case, that searches for the transformations generating the “best video” to process for a given video network and downstream task. Inspired by Differentiable Neural Architecture Search [28, 53], we define a continuous search space of transformations, which leads to a differentiable learning objective for the joint optimization of the transformations to be applied and the weights of the architecture. Following [28, 64], we search for a computation cell as the initial block that generates the input for the chosen video network.
A cell, depicted in Fig. 3, is a directed acyclic graph consisting of an ordered sequence of $N$ nodes. Each node $x^{(i)}$ in the cell is a transformed image, and each directed edge $(i,j)$ is associated with a data augmentation technique. For the deployment of our Differentiable Augmentation Search, we define two search spaces. The first, similar to that of AA and RA [11, 12], comprises the augmentations Shear X/Y, Translate X/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Color, Brightness, Sharpness, Cutout, and Identity, the last of which applies no transformation. Its cell has two input nodes and a single output node, and the size of this search space is $13^{14}$. The second search space, deployed for our purpose, includes Translate X/Y, Scale, Rotate, and Identity. It employs a cell with a single input node and a single output node, and has size $5^{10}$. Given the set $\mathcal{T}$ of candidate transformations, the categorical choice of applying a transformation is relaxed to a softmax over all possible transformations $t\in\mathcal{T}$. For a given edge $(i,j)$, the transformation applied to the input $x$ is expressed as:

$$\bar{t}^{(i,j)}(x)=\sum_{t\in\mathcal{T}}\frac{\exp\big(\tau_{t}^{(i,j)}\big)}{\sum_{t'\in\mathcal{T}}\exp\big(\tau_{t'}^{(i,j)}\big)}\cdot t(x), \tag{1}$$

where the weights of the transformations are parameterized by a vector $\tau^{(i,j)}$ of dimension $|\mathcal{T}|$. The well-known bi-level optimization problem [10] of NAS is reformulated as:

$$\begin{gathered}\min_{\boldsymbol{\tau}}\quad\mathcal{L}_{\text{val}}(\boldsymbol{\tau},\boldsymbol{w}^{*}(\boldsymbol{\tau}))\\ \text{s.t.}\quad\boldsymbol{w}^{*}(\boldsymbol{\tau})=\arg\min_{\boldsymbol{w}}\mathcal{L}_{\text{train}}(\boldsymbol{w},\boldsymbol{\tau}),\end{gathered} \tag{2}$$

where the model weights $\boldsymbol{w}$ and the transformation parameters $\boldsymbol{\tau}$ are jointly optimized via gradient updates following the standard DARTS procedure. We stress that, unlike in the original DARTS, there is no actual architecture optimization here: the CNN that processes the data is fixed and chosen a priori. What we search for is a cell in which only transformations are applied, and we treat this cell as a “stem” block placed before the backbone. At the end of the search phase, the best transformations are not chosen by selecting the largest $\tau$ value, because [53] showed that this rests on a wrong assumption, namely that $\tau$, taken as the strength of a transformation, agrees with the discretization accuracy at convergence. We instead deploy a perturbation-based approach, in which the importance of a transformation is evaluated in terms of its contribution to the network performance. For each transformation on a given edge, we mask it while preserving all other transformations, and then re-evaluate the cell+CNN. The operation whose removal causes the greatest reduction in validation accuracy is identified as the pivotal operation on that edge. To see why the largest $\tau$ is misleading, consider a cell from a simplified search space composed of only two transformations:

$$\mathbf{I}=\begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}\qquad\mathbf{T}=\begin{bmatrix}1&0&t_{x}\\0&1&t_{y}\\0&0&1\end{bmatrix} \tag{3}$$

Let $x_{I}$ denote the output of $\mathbf{I}\cdot x$ and $x_{T}$ the translated output $\mathbf{T}\cdot x$. Assume $m^{*}$ to be the optimal feature map, shared across all edges according to the unrolled estimation view. The current estimation of $m^{*}$ can then be written as:

$$\overline{m}(x)=\frac{e^{\tau_{T}}}{e^{\tau_{T}}+e^{\tau_{I}}}\,x_{T}+\frac{e^{\tau_{I}}}{e^{\tau_{T}}+e^{\tau_{I}}}\,x_{I} \tag{4}$$

The optimal $\tau^{*}_{T}$ and $\tau^{*}_{I}$ minimizing the variance $\text{var}(\overline{m}(x)-m^{*})$ between the optimal feature map $m^{*}$ and the current estimation $\overline{m}(x)$ are:

$$\tau^{*}_{I}\propto\text{var}(x_{T}-m^{*}) \tag{5}$$
$$\tau^{*}_{T}\propto\text{var}(x_{I}-m^{*}) \tag{6}$$

We refer to the Supplementary material for a detailed proof. As in the original paper, Eq. 5 and Eq. 6 show that, for the transformation search space as well, the relative magnitudes of $\tau_{I}$ and $\tau_{T}$ come down to which of $x_{I}$ or $x_{T}$ is closer to $m^{*}$ in variance. Since $x_{I}$ comes from the mixed output of a previous edge, and the goal of every edge is to estimate $m^{*}$ (unrolled estimation), $x_{I}$ is also directly estimating $m^{*}$. In contrast, $x_{T}$ is the output of a single transformation rather than the complete mixed output of edge $e$, so even at convergence it will deviate from $m^{*}$. Therefore, if we take the largest $\tau$ as indicating the best transformation, the algorithm will naturally be led to choose identities. On the one hand, including the identity in the search space is fundamental to remove the a priori bias that a transformation is always needed; on the other hand, as we show in Sec. 4.2, a cell consisting of only identity transformations leads to poor performance. This motivates our transformation selection choice.
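To make the relaxation of Eq. 1 and its two-operation special case in Eq. 4 concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The names `softmax`, `mixed_transform`, `identity`, and `shift_right`, the list-of-rows image format, and the zero fill at the image border are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_transform(x, transforms, tau):
    """Softmax-weighted sum of all candidate transformations of x (Eq. 1).

    x          -- image as a list of rows (lists of floats)
    transforms -- callables mapping an image to an image of the same shape
    tau        -- one logit per transformation for this edge
    """
    weights = softmax(tau)
    outputs = [t(x) for t in transforms]
    rows, cols = len(x), len(x[0])
    return [
        [sum(w * out[r][c] for w, out in zip(weights, outputs)) for c in range(cols)]
        for r in range(rows)
    ]


# Toy candidates (hypothetical): identity and a one-pixel shift with zero fill.
def identity(img):
    return [row[:] for row in img]


def shift_right(img):
    return [[0.0] + row[:-1] for row in img]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
# Equal logits give equal weights of 0.5, i.e. Eq. 4 with tau_I == tau_T.
mixed = mixed_transform(image, [identity, shift_right], tau=[0.0, 0.0])
```

Because the output is a smooth function of `tau`, gradients of a downstream loss can flow back to the logits, which is what makes the search differentiable.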

Figure 3: Cell structure in DAS. Multiple operations are defined on each edge, collectively applied to the image, and optimized during training: as the gradients are updated through multiple steps, the $\tau$ values associated with each operation change. Fig. 3b depicts one step of this process through thicker edges. In the end, the cell is discretized through a perturbation-based approach, and the final operations are chosen and composed (black edges).
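The perturbation-based discretization of a single edge described in Sec. 3.1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `select_by_perturbation` and `evaluate` are hypothetical names, `evaluate` stands in for a full validation pass of the cell+CNN with one operation masked, and the accuracies in the toy table are made up for illustration and are not results from the paper.

```python
def select_by_perturbation(op_names, evaluate):
    """Return the operation whose removal hurts validation accuracy the most.

    op_names -- candidate transformations on one edge
    evaluate -- hypothetical callable: evaluate(masked_op) returns the validation
                accuracy of the cell+CNN with `masked_op` removed
                (None means nothing is removed)
    """
    baseline = evaluate(None)
    # Accuracy drop caused by masking each operation in turn.
    drops = {op: baseline - evaluate(op) for op in op_names}
    best = max(drops, key=drops.get)
    return best, drops


# Made-up accuracies, for illustration only.
toy_accuracy = {None: 0.90, "identity": 0.89, "translate": 0.84, "rotate": 0.87}
best_op, drops = select_by_perturbation(
    ["identity", "translate", "rotate"],
    lambda op: toy_accuracy[op],
)
```

In this toy table, masking `translate` costs the most accuracy, so it would be kept on the edge, regardless of which operation has the largest $\tau$.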

3.2 Temporal Data Augmentation

Our goal is to expand and reshape the spatial receptive field. To this end, we generate videos by picking from a set of transforms, summarized in our search space defined by Translate X/Y, Scale, Rotate, and Identity. We included in our search space transformations commonly used to model motion in videos [49]. We added scaling, motivated by the general difficulty of segmenting objects at different scales, and identity, to remove the bias that the image always benefits from applying a transformation.
Each transformation is applied $T$ times, where $T$ is a hyperparameter that defines the length of the generated video. For example, given an image $I(x,y)$, a frame $V(t,x,y)$ of a video applying a vertical translation $\delta y$ is obtained as:

$$V(t,x,y)=I(x,\,y-t\,\delta y),\quad t=1,\dots,T. \tag{7}$$
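Eq. 7 can be sketched directly in plain Python. The function name `translate_video`, the `image[y][x]` row layout, the restriction to integer shifts, and the constant `fill` value for pixels that fall outside the source image are assumptions; the padding rule is not stated in this section.

```python
def translate_video(image, dy, num_frames, fill=0.0):
    """Build frames V(t, x, y) = I(x, y - t*dy) for t = 1..num_frames (Eq. 7).

    image      -- list of rows, indexed image[y][x]
    dy         -- integer vertical shift per frame, in pixels
    num_frames -- video length T
    fill       -- value used where the source row is outside the image (assumption)
    """
    height, width = len(image), len(image[0])
    frames = []
    for t in range(1, num_frames + 1):
        frame = []
        for y in range(height):
            src_y = y - t * dy
            if 0 <= src_y < height:
                frame.append(image[src_y][:])
            else:
                frame.append([fill] * width)
        frames.append(frame)
    return frames


image = [[1, 2], [3, 4], [5, 6]]
video = translate_video(image, dy=1, num_frames=2)
```

Each successive frame moves the content one more step along the y-axis, so the stack of frames looks like a short clip of the image in motion.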

As Fig. 2 highlights, the video is given as input to a video network that extracts the features and produces a prediction for the original classification or segmentation task. For the image classification task, we average the predictions for each frame, while for the semantic segmentation task we first “undo” the transformation to preserve locality. Although a general video network can process the video input, as we show in our experiments, to reduce complexity and remain efficient in our tasks of interest we deploy a 2D backbone with an integrated temporal-shift mechanism, i.e. GSF. This technique is well established in video understanding, achieving the performance of a 3D CNN while maintaining the complexity of a 2D CNN. As the authors of [26, 42] claim, for each inserted GSF, the temporal RF is enlarged by 2, as if running a convolution with a kernel size of 3 along the temporal dimension. Therefore, part of the features among 3 adjacent frames is mixed. Let us now focus on the content of those frames. Fig. 4 gives a proof of concept of our claim for the operations included in our search space. As the content of adjacent frames is nothing but the same image either translated, scaled, or rotated, the increase in size of the temporal RF can be mapped to an increase in size and a reshaping of the spatial RF. Let us consider, for simplicity, the case of a single-path network. The theoretical RF size [1] can be expressed as depending only on the stride $s$ and kernel size $k$:

$$r_{0}=\sum_{l=1}^{L}\Big((k_{l}-1)\prod_{j=1}^{l-1}s_{j}\Big)+1 \tag{8}$$
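As a quick check of Eq. 8, the following sketch computes the theoretical RF of a single-path network from its per-layer kernel sizes and strides. The function name `theoretical_rf` and the example layer configurations are hypothetical and do not correspond to any backbone used in the paper.

```python
def theoretical_rf(kernels, strides):
    """Theoretical receptive field of a single-path CNN (Eq. 8).

    kernels -- kernel size k_l of each layer, first layer first
    strides -- stride s_l of each layer, same order
    """
    rf = 1
    jump = 1  # product of the strides of all previous layers
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


rf_first_layer = theoretical_rf([3], [1])          # one 3x3 conv, stride 1
rf_small_stack = theoretical_rf([7, 3, 3], [2, 2, 1])
```

For a single $3\times 3$ convolution with stride 1 the function returns 3, which matches the per-frame value $r_{1}^{f0}=3$ used in the discussion of the first layer. For the three-layer example it returns $1+6+4+8=19$.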

Let us now consider the situation where one GSF module has been inserted after a 2D convolution with kernel $3\times 3$ and stride $s=1$. Assuming this is the first convolutional layer, the theoretical RF for each frame has a size of 3. Therefore $r_{1}^{f0}=r_{1}^{f1}=r_{1}^{f2}=3$. As Fig. 4 shows, the same region now covers different features, which depend on the applied transformation. As these features are mixed through the GSF mechanism, if we map back to the space of frame $f_{0}$, the spatial receptive field will have a size equal to $RF_{1}=3\times r_{1}^{f0}-\bigcap_{i=0}^{2}A(r_{1}^{fi})$, where $\bigcap_{i=0}^{2}A(r_{1}^{fi})$ denotes the intersecting area among the receptive fields of the different frames.
If we define as L=(x1,y1)𝐿subscript𝑥1subscript𝑦1L=(x_{1},y_{1})italic_L = ( italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) the bottom left corner, and R=(x2,y2)𝑅subscript𝑥2subscript𝑦2R=(x_{2},y_{2})italic_R = ( italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) the top right corner, the intersection area, for the translation case, between two RFs can be simply found as:

$$A = \big(\min(x_2, x_2') - \max(x_1, x_1')\big) \times \big(\min(y_2, y_2') - \max(y_1, y_1')\big) \qquad (9)$$

where, for a translation of $t_x$ along the x-axis and $t_y$ along the y-axis, $x_i' = x_i + t_x$ and $y_i' = y_i + t_y$. For a counter-clockwise rotation of $\theta$ degrees around the point (0,0) (as depicted in Fig. 4), $x_i' = x_i\cos(-\theta) - y_i\sin(-\theta)$ and $y_i' = x_i\sin(-\theta) + y_i\cos(-\theta)$. The intersection area calculation, however, cannot be generalized in this case, as the shape of the intersection polygon depends on the rotation angle. Finally, for the scaling operation, $RF_1 = \gamma^2 \times (r_1^{f0}) \times \gamma^2 \times (r_1^{f0})$.
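The translation and scaling cases can be checked with a small geometric sketch. It uses the standard overlap of two axis-aligned rectangles given their bottom-left and top-right corners, and counts the area covered by three frames on the unit pixel grid. The helper names, the integer pixel coordinates, and the choice of shifting the second and third frame by one and two steps respectively are assumptions made for illustration; the rotation case is omitted because the intersection polygon depends on the angle.

```python
def shifted(rect, tx, ty):
    """Translate an axis-aligned rectangle (x1, y1, x2, y2) by (tx, ty)."""
    x1, y1, x2, y2 = rect
    return (x1 + tx, y1 + ty, x2 + tx, y2 + ty)


def intersection_area(a, b):
    """Overlap area of two axis-aligned rectangles; 0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def union_area(rects):
    """Area covered by at least one rectangle (integer coordinates only)."""
    covered = set()
    for x1, y1, x2, y2 in rects:
        for x in range(x1, x2):
            for y in range(y1, y2):
                covered.add((x, y))
    return len(covered)


def scaled_rf(r, gamma):
    """RF size under scaling by gamma, following the expression in the text."""
    return (gamma ** 2 * r) * (gamma ** 2 * r)


r, t = 3, 1
f0 = (0, 0, r, r)
f1 = shifted(f0, t, t)          # assumption: second frame shifted once
f2 = shifted(f0, 2 * t, 2 * t)  # assumption: third frame shifted twice

print(intersection_area(f0, f1))  # 4
print(union_area([f0, f1, f2]))   # 19
print(scaled_rf(r, 2))            # 144
```

Under these assumptions, three 3×3 RFs shifted by one pixel along both axes cover 19 pixels in total, and a scaling factor of 2 gives 144.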
Note that, as some components of the RFs overlap (particularly visible for the case of scaling), what happens is a reshaping of the RF rather than a resizing, as some parts will contribute more than others. Indeed, we need to recall that there is a difference between the theoretical RF and the Empirical Receptive Field (ERF), defined as $\frac{\partial y(0,0,0)}{\partial x^0(i,j,k)}$, i.e. how much $y(0,0,0)$ changes as $x^0(i,j,k)$ changes by a small amount. Here, $(i,j,k)$ is the voxel on a $p$th layer, $y$ is the output and $x^0$ is the input layer. If we consider the previous example of a 2D convolution with $k=3\times 3$, $s=1$, and $t_x=t_y=1$, $\theta=30^\circ$, $\gamma=2$, the new receptive field size is $RF_1=19$ for the translation, $RF_1=14,19$ for the rotation, and $RF_1=144$ for the scaling. As we show in Sec. 4.2, our method reshapes the RF at the cost of a negligible increase in the number of parameters with respect to standard 2D backbones processing images.
However, in terms of memory occupation a trade-off is necessary. We empirically find that expanding the image $\times 5$ times balances accuracy and memory occupation well.
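The difference between the theoretical RF and the ERF can be illustrated with a toy model. The sketch below estimates the partial derivative of a single output value with respect to every input pixel by finite differences. The two-layer averaging network is a hypothetical stand-in, not one of the backbones used in this work, and it is 2D rather than 3D for brevity.

```python
def conv2d_valid(img, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists (stride 1, no padding)."""
    k = len(kernel)
    n_rows = len(img) - k + 1
    n_cols = len(img[0]) - k + 1
    return [
        [
            sum(kernel[a][b] * img[i + a][j + b] for a in range(k) for b in range(k))
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


def network(img):
    """Two stacked 3x3 averaging convolutions (hypothetical toy model)."""
    box = [[1 / 9] * 3 for _ in range(3)]
    return conv2d_valid(conv2d_valid(img, box), box)[0][0]


def empirical_rf(size=5, eps=1e-3):
    """Finite-difference estimate of d(output) / d(input pixel) for every pixel."""
    base = [[0.0] * size for _ in range(size)]
    y0 = network(base)
    erf = []
    for i in range(size):
        row = []
        for j in range(size):
            bumped = [r[:] for r in base]
            bumped[i][j] += eps
            row.append((network(bumped) - y0) / eps)
        erf.append(row)
    return erf


erf = empirical_rf()
for row in erf:
    # Scale by 81 so the weights print as small integers.
    print(" ".join(f"{81 * v:4.1f}" for v in row))
```

The theoretical RF of this toy model is the full 5×5 input, yet the central pixel influences the output nine times more than a corner pixel, which is the sense in which some parts of the RF contribute more than others.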

Figure 4: RF reshape for translation (top row) and rotation (bottom row). In our representation we assume that GSF was inserted once, therefore the temporal RF is expanded by 2 and three adjacent frames are considered. The first column shows the features of the original image ($f_0$), while the second and third columns show the features of the frames obtained by applying the transformation. The last column shows the effect of processing the input with a temporal shift mechanism.

To sum up, we propose a method to expand the spatial receptive field using data augmentation techniques to generate videos. To this aim, we propose a new auto-augmentation method, namely DAS, which, to the best of our knowledge, is the first differentiable auto-augmentation approach proposed in the field of image classification. The "fake videos" are processed by a 2D backbone with an integrated temporal shift mechanism, which achieves high performance in video-understanding tasks while keeping the complexity of 2D CNNs. We finally demonstrate that the increased temporal RF from video processing corresponds to an augmented spatial RF, with the extent determined by the specific transformation used. As we will show in Sec. 4, such a procedure results in lightweight models that reach state-of-the-art performance and allow reducing the parameters that are typically increased to expand the receptive field, such as kernel size and depth of a network.
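The image-to-video idea summarized above can be sketched in a few lines. Here an image is expanded into a short clip whose first frame is the original and whose remaining frames are transformed copies. The function names and the hand-picked translation policy are hypothetical; in our method the transformations are found by the DAS search rather than fixed by hand.

```python
def translate(image, tx, ty, fill=0):
    """Shift a 2D image (list of rows) by (tx, ty) pixels, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + ty, x + tx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def image_to_video(image, transforms):
    """Build a 'fake video': frame 0 is the image, the others are transformed copies."""
    return [image] + [t(image) for t in transforms]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Hypothetical hand-picked policy, used only to show the data layout.
policy = [
    lambda img: translate(img, 1, 0),
    lambda img: translate(img, 0, 1),
]

video = image_to_video(image, policy)
print(len(video))  # 3
print(video[1])    # [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
```

The resulting list of frames is what a 2D backbone with a temporal shift mechanism would consume as a clip.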

4 Experiments

In this section, we investigate the performance of our temporal expansion approach and our automatic augmentation approach for two visual tasks: image classification and semantic segmentation. Best results are highlighted in bold in the tables. We also validate our approach through a series of ablation studies. We refer to the Supplementary materials for all the implementation details, more qualitative results, and additional experiments on the robustness of DAS.

| Method | # Params (M) | FLOPs | Top-1 Acc. (%) |
|---|---|---|---|
| ResNet-50 [16] | 25.6 | 4.11G | 76.30 |
| SE-ResNet-50 [19] | 28.1 | - | 76.90 |
| Inception-v3 [45] | 27.2 | 11.46G | 77.12 |
| BnInception [20] | - | - | 77.41 |
| Oct-ResNet-50 [6] | 25.6 | - | 77.30 |
| ResNeXt-50 (32×4d) [59] | 25.0 | 8.52G | 77.80 |
| Res2Net-50 (14w×8s) [15] | - | - | 78.10 |
| ResNet-101 [16] | 44.6 | 7.86G | 77.40 |
| ResNet-152 [16] | 60.2 | 11.60G | 78.30 |
| SE-ResNet-152 [19] | 67.2 | - | 78.40 |
| ResNeXt-101 (32×4d) [59] | 88.8 | 32.95G | 78.80 |
| AttentionNeXt-56 [50] | 31.9 | - | 78.8 |
| ViT-L-32 [13] | 304 | 61.55G | 79.66 |
| FFC-ResNet-50 [9] | 26.7 | 5.49G | 77.80 |
| FFC-ResNeXt-50 [9] | 28.0 | 5.66G | 78.00 |
| FFC-ResNet-101 [9] | 46.1 | 9.23G | 78.80 |
| FFC-ResNet-152 [9] | 62.6 | 12.96G | 78.90 |
| (Ours) - BnInception | - | - | 78.12 |
| (Ours) - Inception-v3 | 27.3 | 11.51G | 78.66 |
| (Ours) - ResNet-50 | 25.7 | 4.2G | 79.45 |
| (Ours) - ResNet-101 | 44.4 | 7.96G | 80.05 |
| (Ours) - ResNet-152 | 60.4 | 11.66G | 80.13 |

Table 1: Plugging our method into state-of-the-art networks on ImageNet. The first two sets are top-1 accuracy scores obtained by various state-of-the-art methods, which we transcribe from the corresponding papers. Deeper models are listed in the second set. The third set reports the performances of plugging [9], and last set shows the effect of employing DAS+2D backbone+GSF.
Figure 5: ImageNet results for different ResNet depths.

4.1 Comparison with SOTAs

| Method | FLOPs | mIOU (%) |
|---|---|---|
| Adelaide_VeryDeep_FCN_VOC [57] | - | 79.10 |
| DeepLabv2-CRF [4] | - | 79.70 |
| CentraleSupelec Deep G-CRF [2] | - | 80.20 |
| HikSeg_COCO [43] | - | 81.40 |
| SegModel [39] | - | 81.80 |
| TuSimple [52] | - | 83.10 |
| Large Kernel Matters [37] | 3.7G | 83.60 |
| ResNet-38_MS_COCO [58] | 12.11G | 84.90 |
| PSPNet [61] | 16.55G | 85.40 |
| DeepLabv3 [3] | 49.68G | 85.70 |
| (Ours) - ResNet-38_MS_COCO | 12.25G | 85.70 |
| (Ours) - PSPNet | 16.67G | 86.10 |
| (Ours) - DeepLabv3 | 49.89G | 87.00 |

Table 2: Model comparison with SOTA in PASCAL-VOC-2012.

Image Classification

We use ImageNet, Cifar10, Cifar100 and Tiny-ImageNet for the image classification experiments. Tab. 1 compares our approach with other state-of-the-art models on ImageNet. The results in the lower part of the table are obtained by running the search procedure to find the optimal transformation for each backbone independently, i.e. BnInception, Inception-v3, and ResNet-50/101/152. The best transformations found are given in the Supplementary materials. For this set of experiments we employ video backbones trained from scratch, in which we incorporate the best temporal fusion method found in our ablation studies. We observe an improvement with our approach for every employed 2D backbone using roughly the same number of parameters, with the strongest accuracy boost of 3.15 % for the ResNet-50 architecture. Compared to ViT-L-32, which achieves 79.66 %, we use 1/5 of the parameters and FLOPs. In general, we observe stronger benefits for ResNet-like architectures than for Inception-like ones. Moreover, as Fig. 5 highlights, the accuracy of such architectures drops less steeply with our method than without it when the depth is reduced. This is reasonable and desired, given the proof of concept provided in Fig. 4. At equal depth, we also improve by a fair margin over FFC [9], a popular method that expands the RF by employing fast Fourier convolutions. Similar behaviours are observed when we compare DAS+temporal expansion with other methods expanding the RF on the Cifar10, Cifar100, Tiny-ImageNet and ImageNet datasets, in Tab. 4. Particularly noteworthy are the drop in performance when a bigger kernel is used, probably attributable to stronger overfitting, and the comparison with the popular dilated convolution ("Dilated" column), well known for expanding the RF. We improve with respect to such a method by a fair margin over each dataset.
We attribute this result to the effective reshaping, rather than enlargement, of the RF.
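The figures quoted in this paragraph can be recomputed directly from Tab. 1. The short sketch below does so for the backbones that appear both with and without our method. The dictionary layout is an illustrative choice, and pairing the ViT-L-32 comparison with the ResNet-152 variant is an assumption, since the text does not name the backbone.

```python
# (params in M, FLOPs in G, top-1 in %) transcribed from Tab. 1.
baseline = {
    "Inception-v3": (27.2, 11.46, 77.12),
    "ResNet-50": (25.6, 4.11, 76.30),
    "ResNet-101": (44.6, 7.86, 77.40),
    "ResNet-152": (60.2, 11.60, 78.30),
}
ours = {
    "Inception-v3": (27.3, 11.51, 78.66),
    "ResNet-50": (25.7, 4.2, 79.45),
    "ResNet-101": (44.4, 7.96, 80.05),
    "ResNet-152": (60.4, 11.66, 80.13),
}
vit_l_32 = (304.0, 61.55, 79.66)

# Top-1 gain of each backbone when our method is plugged in.
gains = {name: round(ours[name][2] - baseline[name][2], 2) for name in baseline}
for name, gain in gains.items():
    print(f"{name:13s} +{gain:.2f} top-1")

best = max(gains, key=gains.get)
print(best, gains[best])  # ResNet-50 3.15

# Assumption: the ViT-L-32 comparison refers to our ResNet-152 variant.
param_ratio = ours["ResNet-152"][0] / vit_l_32[0]
flop_ratio = ours["ResNet-152"][1] / vit_l_32[1]
print(f"{param_ratio:.2f} {flop_ratio:.2f}")  # 0.20 0.19
```

Both ratios come out close to one fifth, and the largest gain is the 3.15 points obtained with ResNet-50.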

Image Semantic Segmentation

In Tab. 2 and 3 we show the performance of our method on the Pascal-VOC-2012 and Cityscapes datasets for the image segmentation task. We compare our approach with popular methods employed in semantic segmentation, and observe a considerable gain, especially for DAS combined with DeepLab (with a ResNet-101 backbone) when compared with its 2D counterpart: an improvement of 1.3 % in mIOU. The best set of transformations was searched for every backbone and dataset. Indeed, we empirically show in Sec. 4.2 how applying non-optimized transformations impacts the performance. For Cityscapes, the best results were achieved with the ResNeSt backbone, with an 85.1 % mIOU, which also confirms a general trend we observed for other datasets: we experience a much higher improvement when dealing with ResNet architectures than with Inception-like modules. This is probably due to the nature of the fusion mechanism we employ, which, when placed on the skip connections, better preserves the spatial information. Fig. 6 shows some qualitative results on Cityscapes. One can observe a better reconstruction of details when comparing 6(c) and 6(d). This is particularly visible for the reconstruction of people (see first and third row) and of street lamps. We noticed an over-segmentation on the PASCAL-VOC-2012 dataset in some cases when applying our methodology. However, our method still outperforms all compared state-of-the-art alternatives. We refer the reader to the Supplementary material for some qualitative results on the PASCAL dataset and for additional ones on Cityscapes.
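For the three Pascal-VOC-2012 models that Tab. 2 reports both with and without our method, the mIOU gain and the relative FLOPs overhead can be derived as follows. This is only a convenience calculation on the transcribed values; the variable names are illustrative.

```python
# ((FLOPs in G, mIOU in %) of the 2D model, same for the model with our method), from Tab. 2.
pascal_voc = {
    "ResNet-38_MS_COCO": ((12.11, 84.90), (12.25, 85.70)),
    "PSPNet": ((16.55, 85.40), (16.67, 86.10)),
    "DeepLabv3": ((49.68, 85.70), (49.89, 87.00)),
}

miou_gain = {}
flops_overhead = {}
for name, ((flops_2d, miou_2d), (flops_ours, miou_ours)) in pascal_voc.items():
    miou_gain[name] = round(miou_ours - miou_2d, 2)
    # Relative extra compute, in percent of the 2D model's FLOPs.
    flops_overhead[name] = 100 * (flops_ours - flops_2d) / flops_2d
    print(f"{name:18s} mIOU +{miou_gain[name]:.2f}  FLOPs +{flops_overhead[name]:.2f}%")
```

On these entries the gains range from 0.7 to 1.3 mIOU points, while the added FLOPs stay below 1.2 % of the corresponding 2D model.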

| Method | Backbone | FLOPs | mIOU (%) |
|---|---|---|---|
| FCN [31] | ResNet-101 | - | 77.02 |
| SETR [62] | ViT-Large | - | 78.10 |
| Maskformer [7] | ResNet-101 | 73G | 78.50 |
| NonLocal [55] | ResNet-101 | - | 79.40 |
| PSPNet [61] | ResNet-101 | 63G | 79.77 |
| Mask2former [8] | ResNet-101 | - | 80.10 |
| DeepLab-v3 [3] | ResNet-101 | 92.65G | 81.30 |
| DeepLab-v3 [3] | Xception-65 | - | 82.10 |
| DeepLab-v3 [3] | ResNeSt 101 | - | 82.90 |
| DeepLab-v3 [3] | ResNeSt 200 | 285G | 83.30 |
| (Ours) PSPNet | ResNet-50 | 32G | 81.21 |
| (Ours) ResNeSt | Xception-65 | - | 82.40 |
| (Ours) DeepLab-v3 | ResNet-101 | 93G | 82.60 |
| (Ours) ResNeSt | ResNet-200 | 286G | 85.10 |

Table 3: Model comparison with SOTA in CityScapes.

| Dataset | Model | Baseline | Dilated | (5×5) | Ours |
|---|---|---|---|---|---|
| Cifar10 | R-18 | 94.12 | 94.31 | 92.16 | 95.12 |
| | R-50 | 95.66 | 96.23 | 95.51 | 95.74 |
| | WR-28 | 96.33 | - | - | 96.40 |
| Cifar100 | R-18 | 71.20 | 72.18 | 70.08 | 73.10 |
| | R-50 | 74.82 | 76.12 | 75.60 | 76.71 |
| | WR-28 | 81.40 | - | - | 83.06 |
| Tiny | R-18 | 61.00 | 62.10 | 61.03 | 62.87 |
| | R-50 | 63.16 | 64.11 | 61.12 | 65.91 |
| | R-101 | 64.21 | 65.30 | 63.60 | 67.51 |
| ImageNet | R-50 | 76.30 | 77.28 | 76.10 | 79.45 |
| | R-101 | 77.40 | 78.15 | 76.11 | 80.05 |
| | R-152 | 78.30 | 79.22 | 78.00 | 80.13 |

Table 4: Comparison in terms of accuracy with other methods expanding the RF. "WR" stands for WideResNet.

4.2 Ablation

In this section, we present and discuss ablation studies on Cifar10 and Pascal-VOC-2012. First, we study the effect of each component on the original task accuracy (Tab. 5). We find that the improved accuracy is not a simple matter of augmentation techniques, nor a mere contribution of the fusion mechanism. Indeed, both elements are needed, as simply stacking copies of the same image ("Replica" column) does not imply an expansion of the RF, but actually causes a drop in performance, probably due to much higher overfitting. Next, in Tab. 6 we ablate the temporal fusion mechanism. As expected, TSM and GSF have comparable performance and number of parameters, with GSF being slightly better. Surprisingly, 3D CNNs perform poorly, perhaps because of the small amount of data compared to the needs of 3D CNNs. Next, in Tab. 7 we study the impact of the auto-augmentation method with our search space. We find that DAS performs best, given a fixed search-time budget of 24 hours decided a priori. We see comparable performance of the three methods on the Cifar10 dataset, but we attribute such behaviour to the relative simplicity of the task itself. Indeed, on Pascal-VOC we observe a much larger difference. Finally, in Tab. 8 we ablate the specificity of the found genotypes for a given dataset, providing quantitative results for the proof of concept provided in Fig. 1(a), i.e. the effect of random transformations. We check the performance on PASCAL-VOC and Cifar-10 when the genotype used to generate the video derives from the search phase for CityScapes (Random 1) and for Cifar-100 (Random 2). Interestingly, we see a huge drop in performance when employing a "random genotype", confirming the need for a carefully designed search procedure.
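Why a stack of identical frames gains nothing from temporal fusion can be seen with a simplified, TSM-style channel shift [26]. This is a sketch under stated assumptions: each frame is reduced to a feature vector at one spatial location, the feature values are made up, and the gating and fusion steps of GSF are not modelled.

```python
def temporal_shift(video, fold):
    """TSM-style shift on a clip given as video[t][c] (T frames, C channels).

    The first `fold` channels take their value from the previous frame, the next
    `fold` channels from the following frame, and the rest are left untouched.
    Positions with no neighbouring frame are zero-padded.
    """
    n_frames, n_channels = len(video), len(video[0])
    out = [[0] * n_channels for _ in range(n_frames)]
    for t in range(n_frames):
        for c in range(n_channels):
            if c < fold:
                src = t - 1
            elif c < 2 * fold:
                src = t + 1
            else:
                src = t
            if 0 <= src < n_frames:
                out[t][c] = video[src][c]
    return out


# Identity transforms: every frame carries the same features.
replica = [[5, 6, 7, 8]] * 3
# Hypothetical features of transformed frames at the same location.
augmented = [[5, 6, 7, 8], [1, 6, 2, 8], [9, 3, 7, 4]]

print(temporal_shift(replica, 1)[1])    # [5, 6, 7, 8]
print(temporal_shift(augmented, 1)[1])  # [5, 3, 2, 8]
```

With replicas, the middle frame is unchanged by the shift, so no new spatial content is mixed in; with transformed frames, the shifted channels bring in features that originate from other image regions.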

| | Baseline | Augment | Replica | Ours |
|---|---|---|---|---|
| Pascal-VOC | 85.40 | 85.51 | 83.40 | 86.10 |
| Cifar-10 | 94.12 | 94.23 | 90.11 | 95.12 |

Table 5: Ablation on different model components. Augment indicates the usage of additional augmentation derived from our search space without temporal fusion. Replica stands for the stack of image copies, that is, video produced with identity transforms and temporal fusion.

| | TSM | 3D | GSF |
|---|---|---|---|
| Pascal-VOC | 85.86 % | 77.00 % | 86.10 % |
| | 51.32 M | 107 M | 51.43 M |
| | 16.55 G | 26.78 G | 16.67 G |
| Cifar-10 | 94.77 % | 92.10 % | 95.12 % |
| | 11.18 M | 33.17 M | 11.20 M |
| | 37.12 M | 49 M | 37.24 M |

Table 6: Ablation on the temporal fusion method given PSPNet backbone for Pascal-VOC and ResNet-18 for Cifar10.

| | AA | RA | DAS |
|---|---|---|---|
| Pascal-VOC | 82.15 | 84.00 | 87.00 |
| Cifar-10 | 94.35 | 92.98 | 95.12 |

Table 7: Ablation on the auto-augmentation technique given the same searching budget time.
Figure 6: Zoom in of the qualitative results for Cityscapes. Columns: (a) Image, (b) Ground Truth, (c) DeepLab-v3, (d) Ours. The fourth column depicts our results with DAS and DeepLab-v3 integrated with the temporal shift mechanism.

| | Random 1 | Random 2 | DAS |
|---|---|---|---|
| Pascal-VOC | 77.25 | 82.11 | 86.10 |
| Cifar-10 | 93.14 | 94.15 | 95.12 |

Table 8: Ablation on the effectiveness of looking for the best transformation. Random 1 is the best transformation found for CityScapes, Random 2 for Cifar100.

5 Conclusions

We proposed DAS, a differentiable auto-augmentation method for the tasks of image classification and semantic segmentation. We defined a very flexible continuous search space, and employed a perturbation-based selection method to overcome the limitations of previous approaches [11, 12]. We showed that by applying the optimal transformations found by DAS to generate variations of images, we can effectively reshape the RF. This is achieved by processing the new input as videos with a CNN integrated with a temporal shift mechanism that performs feature mixing in time. Our method proposes a new way of handling 2D data to exploit their richness, and investigates how the increase of the receptive field in the temporal dimension impacts the original spatial receptive field. We observed an improvement in terms of accuracy with respect to standard augmentation techniques, for both image classification and segmentation tasks, using different backbones on different datasets. We also successfully reshaped the receptive field, as shown in Fig. 1(b), which in terms of qualitative results translated into more detailed segmentation masks.

Limitations and future work

Our method is compact, fast and accurate, but currently limited by its memory footprint. The memory requirement in training grows with the number of generated frames, which restricted us to searching transformations for 8-frame videos in our experiments. DAS with longer videos might provide further performance boosts. Possible future work includes experimenting with other backbone families, such as transformers. Another avenue of research could be the adaptation of DAS to video-to-video data augmentation, which would require a proper definition of the search space of transformations. Another interesting direction could be Deep-DAS, where optimal feature map transforms are searched at multiple network layers.
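The linear growth of memory with the number of generated frames can be made concrete with a back-of-the-envelope estimate. The sketch below only accounts for the input clip, assuming float32 values and a 3×224×224 resolution, both of which are illustrative choices; intermediate activations of a frame-wise 2D backbone are expected to scale with the number of frames in the same way.

```python
def clip_memory_mib(frames, channels=3, height=224, width=224, bytes_per_value=4):
    """Memory of one input clip in MiB (float32 by default)."""
    return frames * channels * height * width * bytes_per_value / 2 ** 20


single = clip_memory_mib(1)
for frames in (1, 5, 8, 16):
    mem = clip_memory_mib(frames)
    # The ratio to a single image equals the number of frames.
    print(f"{frames:2d} frames: {mem:6.2f} MiB ({mem / single:.0f}x)")
```

An 8-frame clip therefore costs eight times the memory of a single image per sample, which is why longer videos quickly become expensive at training time.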

References

  • Araujo et al. [2019] André Araujo, Wade Norris, and Jack Sim. Computing receptive fields of convolutional neural networks. Distill, 2019. https://distill.pub/2019/computing-receptive-fields.
  • Chandra and Kokkinos [2016] Siddhartha Chandra and Iasonas Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 402–418. Springer, 2016.
  • Chen et al. [2017] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • Chen et al. [2018a] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018a.
  • Chen et al. [2018b] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018b.
  • Chen et al. [2019] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3435–3444, 2019.
  • Cheng et al. [2021] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
  • Cheng et al. [2022] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
  • Chi et al. [2020] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. Advances in Neural Information Processing Systems, 33:4479–4488, 2020.
  • Colson et al. [2007] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of operations research, 153:235–256, 2007.
  • Cubuk et al. [2019] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 113–123, 2019.
  • Cubuk et al. [2020] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Fu et al. [2019] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3146–3154, 2019.
  • Gao et al. [2019] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. IEEE transactions on pattern analysis and machine intelligence, 43(2):652–662, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Ho et al. [2019] Daniel Ho, Eric Liang, Xi Chen, Ion Stoica, and Pieter Abbeel. Population based augmentation: Efficient learning of augmentation policy schedules. In International conference on machine learning, pages 2731–2741. PMLR, 2019.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
  • Jian et al. [2016] S Jian, H Kaiming, R Shaoqing, and Z Xiangyu. Deep residual learning for image recognition. In IEEE Conference on Computer Vision & Pattern Recognition, pages 770–778, 2016.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lim et al. [2019] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. Advances in Neural Information Processing Systems, 32, 2019.
  • Lin et al. [2019a] Chen Lin, Minghao Guo, Chuming Li, Xin Yuan, Wei Wu, Junjie Yan, Dahua Lin, and Wanli Ouyang. Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6579–6588, 2019a.
  • Lin et al. [2019b] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019b.
  • LingChen et al. [2020] Tom Ching LingChen, Ava Khonsari, Amirreza Lashkari, Mina Rafi Nazari, Jaspreet Singh Sambee, and Mario A Nascimento. Uniformaugment: A search-free probabilistic data augmentation approach. arXiv preprint arXiv:2003.14348, 2020.
  • Liu et al. [2018] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations, 2018.
  • Liu et al. [2016] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. abs/1506.04579, 2016.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021.
  • Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • Luo and Yuille [2019] Chenxu Luo and Alan L Yuille. Grouped spatial-temporal aggregation for efficient action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5512–5521, 2019.
  • Mostajabi et al. [2015] Mohammadreza Mostajabi, Payman Yadollahpour, and Gregory Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
  • Müller and Hutter [2021] Samuel G Müller and Frank Hutter. Trivialaugment: Tuning-free yet state-of-the-art data augmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 774–782, 2021.
  • Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Peng et al. [2017] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
  • Richter et al. [2021] Mats L. Richter, Schöning J, Wiedenroth A., and Krumnack U. Should you go deeper? optimizing convolutional neural network architectures without training. arXiv preprint arXiv:2106.12307v2, 2021.
  • Shen et al. [2017] Falong Shen, Rui Gan, Shuicheng Yan, and Gang Zeng. Semantic segmentation via structured patch prediction, context crf and guidance crf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1953–1961, 2017.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sudhakaran et al. [2020] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. Gate-shift networks for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1102–1111, 2020.
  • Sudhakaran et al. [2023] Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. Gate-shift-fuse for video action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Sun et al. [2016] Haiming Sun, Di Xie, and Shiliang Pu. Mixed context networks for semantic segmentation. arXiv preprint arXiv:1610.05854, 2016.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • Tan and Le [2019] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  • Tian et al. [2020] Keyu Tian, Chen Lin, Ming Sun, Luping Zhou, Junjie Yan, and Wanli Ouyang. Improving auto-augment via augmentation-wise weight sharing. Advances in Neural Information Processing Systems, 33:19088–19098, 2020.
  • Touvron et al. [2021] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 32–42, 2021.
  • van Amersfoort et al. [2017] Joost R. van Amersfoort, A. Kannan, Marc’Aurelio Ranzato, Arthur Szlam, Du Tran, and Soumith Chintala. Transformation-based models of video sequences. ArXiv, abs/1701.08435, 2017.
  • Wang et al. [2017] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2017.
  • Wang et al. [2016] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20–36. Springer, 2016.
  • Wang et al. [2018a] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 1451–1460. IEEE, 2018a.
  • Wang et al. [2021] Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. Rethinking architecture selection in differentiable nas. In International Conference on Learning Representations, 2021.
  • [54] W Wang, J Dai, Z Chen, Z Huang, Z Li, X Zhu, X Hu, T Lu, L Lu, H Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint arXiv:2211.05778, 2022.
  • Wang et al. [2018b] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018b.
  • Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR, 2022.
  • Wu et al. [2016] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
  • Wu et al. [2019] Zifeng Wu, Chunhua Shen, and Anton Van Den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 90:119–133, 2019.
  • Xie et al. [2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  • Zhang et al. [2022] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, and Alexander Smola. Resnest: Split-attention networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2735–2745, 2022.
  • Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • Zheng et al. [2021] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021.
  • Zoph and Le [2016] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
  • Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.

Supplementary Material

Appendix A Overview

This supplementary material is organized as follows. We first give the proof of Eq. 5 and Eq. 6, which motivates our perturbation-based choice in the final selection of transformations (Sec. B). We then present the implementation details of our method, giving an overview of the hyper-parameters used (Sec. C), and provide the details of the best transformations found by DAS for the Image-to-Video setup (Sec. D). We conduct an additional ablation study to further demonstrate the effectiveness of our approach by testing it under a re-shuffle operation, which motivates the need for GSF as the temporal shift mechanism (Sec. E). For completeness, we report the results of DAS in the “standard” Image-to-Image setup, where auto-augmentation methods are usually employed, and compare with two SOTA approaches, i.e. AutoAugment (AA) and RandAugment (RA), conducting ablation studies to highlight the differences with respect to DAS (Sec. F). We conclude with more qualitative results for the semantic segmentation datasets (Sec. G) and with an additional ablation on Cifar100 that tests the robustness of performance with reduced data (Sec. H).

Appendix B Proof of Equations 5 and 6

Let $\theta_I = \mathrm{Softmax}(\tau_I)$ and $\theta_T = \mathrm{Softmax}(\tau_T)$. Then the mixed operation in Eq. 4 can be rewritten as $\overline{m}(x) = \theta_T x_T + \theta_I x_I$, where $x_T = T(x)$ is the transformed input and $x_I = x$ is the untransformed one. The objective can be formally stated as:

\begin{equation}
\min_{\theta_I,\theta_T} \text{var}(\overline{m}(x)-m^{*}) \quad\quad \text{s.t.} \quad\quad \theta_I+\theta_T=1 \tag{10}
\end{equation}

This constrained optimization problem can be solved with Lagrange multipliers:

\begin{align}
L(\theta_I,\theta_T,\lambda) &= \text{var}(\overline{m}(x)-m^{*}) - \lambda(\theta_I+\theta_T-1) \tag{11}\\
&= \text{var}(\theta_T T(x)+\theta_I x-m^{*}) - \lambda(\theta_I+\theta_T-1) \tag{12}\\
&= \text{var}(\theta_T T(x)+\theta_I x-(\theta_I+\theta_T)m^{*}) - \lambda(\theta_I+\theta_T-1) \tag{13}\\
&= \text{var}[\theta_T(T(x)-m^{*})+\theta_I(x-m^{*})] - \lambda(\theta_I+\theta_T-1) \tag{14}\\
&= \text{var}(\theta_T(T(x)-m^{*})) + \text{var}(\theta_I(x-m^{*})) + 2\,\text{cov}[\theta_T(T(x)-m^{*}),\theta_I(x-m^{*})] - \lambda(\theta_I+\theta_T-1) \tag{15}\\
&= \theta_T^{2}\,\text{var}(T(x)-m^{*}) + \theta_I^{2}\,\text{var}(x-m^{*}) + 2\theta_T\theta_I\,\text{cov}[T(x)-m^{*},x-m^{*}] - \lambda(\theta_I+\theta_T-1) \tag{16}
\end{align}

Setting the partial derivatives to zero,

\begin{align}
\frac{\partial L}{\partial \lambda} &= -(\theta_I+\theta_T-1) = 0 \tag{17}\\
\frac{\partial L}{\partial \theta_T} &= 2\theta_T\,\text{var}(T(x)-m^{*}) + 2\theta_I\,\text{cov}[T(x)-m^{*},x-m^{*}] - \lambda = 0 \tag{18}\\
\frac{\partial L}{\partial \theta_I} &= 2\theta_I\,\text{var}(x-m^{*}) + 2\theta_T\,\text{cov}[T(x)-m^{*},x-m^{*}] - \lambda = 0 \tag{19}
\end{align}

we obtain a system of equations. Eliminating $\lambda$ from Eqs. 18 and 19 gives

\begin{equation}
\theta_T\,\text{var}(T(x)-m^{*}) + \theta_I\,\text{cov}[T(x)-m^{*},x-m^{*}] = \theta_I\,\text{var}(x-m^{*}) + \theta_T\,\text{cov}[T(x)-m^{*},x-m^{*}] \tag{20}
\end{equation}

Substituting $\theta_I$ with $(1-\theta_T)$, we get:

\begin{equation}
\theta_T^{*} = \frac{\text{var}(x-m^{*}) - \text{cov}[T(x)-m^{*},x-m^{*}]}{z} \tag{21}
\end{equation}

where $z = \text{var}(T(x)-m^{*}) + \text{var}(x-m^{*}) - 2\,\text{cov}[T(x)-m^{*},x-m^{*}]$. Similarly, we obtain

\begin{equation}
\theta_I^{*} = \frac{\text{var}(T(x)-m^{*}) - \text{cov}[T(x)-m^{*},x-m^{*}]}{z} \tag{22}
\end{equation}

Given that $\theta_i = \frac{e^{\tau_i}}{e^{\tau_i}+e^{\tau_j}}$, with $i = I, T$, we obtain:

\begin{align}
\tau_T^{*} &= \log[\text{var}(x-m^{*}) - \text{cov}(T(x)-m^{*},x-m^{*})] + C \tag{23}\\
\tau_I^{*} &= \log[\text{var}(T(x)-m^{*}) - \text{cov}(T(x)-m^{*},x-m^{*})] + C \tag{24}
\end{align}

where the only difference between $\tau_I$ and $\tau_T$ is the first term inside the logarithm. Since the covariance term and the constant $C$ are shared, $\tau_I^{*} > \tau_T^{*}$ exactly when $\text{var}(T(x)-m^{*}) > \text{var}(x-m^{*})$, i.e. when the transformed input deviates more from $m^{*}$ than the untransformed input does. Therefore, if we choose the operation associated with the largest $\tau$, assuming it is related to the strength of the transformation, we will always end up choosing identity operations. This proof also applies to search spaces with more than two operations, as the transformation $T$ previously defined as a translation can be seen as the composition of multiple transformations.
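The closed-form solution of Eqs. 21 to 24 can be checked numerically. The sketch below is only an illustration under stated assumptions: the arrays `a` and `b` are synthetic stand-ins for $T(x)-m^{*}$ and $x-m^{*}$ (they are not taken from the paper), and the grid search is merely a sanity check of the formula, not part of DAS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): b plays the role of x - m*, a of T(x) - m*.
n = 10_000
b = rng.normal(0.0, 1.0, size=n)            # error of the identity branch
a = 0.5 * b + rng.normal(0.0, 2.0, size=n)  # error of the transformed branch

var_a = np.var(a)                        # var(T(x) - m*)
var_b = np.var(b)                        # var(x - m*)
cov_ab = np.cov(a, b, bias=True)[0, 1]   # cov[T(x) - m*, x - m*]

z = var_a + var_b - 2.0 * cov_ab
theta_T = (var_b - cov_ab) / z           # Eq. 21
theta_I = (var_a - cov_ab) / z           # Eq. 22

# Eqs. 23 and 24 up to the shared constant C.
tau_T = np.log(var_b - cov_ab)
tau_I = np.log(var_a - cov_ab)

def mixed_variance(t):
    """Variance of t * a + (1 - t) * b, i.e. the objective with theta_T = t."""
    return np.var(t * a + (1.0 - t) * b)

# Brute-force minimiser on a grid over [0, 1], used only to check Eq. 21.
grid = np.linspace(0.0, 1.0, 1001)
best = grid[np.argmin([mixed_variance(t) for t in grid])]

print(f"theta_T (closed form) = {theta_T:.4f}, grid minimiser = {best:.4f}")
print(f"theta_I = {theta_I:.4f}, tau_I > tau_T: {tau_I > tau_T}")
```

In this toy setting the transformed branch is the noisier one, so the identity branch receives the larger weight and the larger $\tau$, which is the behaviour the proof describes.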

Appendix C Implementation details

Our hyper-parameters are summarized in Tab. 9. We kept the same hyper-parameters during the search phase and the training from scratch, the only difference being the additional optimizer needed for the Architect neural network. This network, responsible for the topology optimization, was trained with the Adam optimizer, with a learning rate of 3e-4 and a weight decay of 1e-3. For all image datasets we applied standard augmentation techniques, such as random horizontal flip, random crop, and cutout, to the inputs of the DAS cell. Every image, after being augmented, undergoes the temporal expansion, achieved through image replication. Transformations are then applied inside the DAS cell to each frame, so that smoothness and continuity are preserved during video generation. The image replication module acts as a “stem” module to allow multiple cells with multiple nodes. Inference cost is not affected, as DAS is involved only during training.
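The temporal expansion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the cyclic horizontal shift is a stand-in for one transformation of the search space, and the linearly growing per-frame magnitude is an assumption about how smoothness across frames could be obtained.

```python
import numpy as np

def replicate_to_video(image, num_segments=8):
    """Stack num_segments copies of an (H, W, C) image into a (T, H, W, C) clip."""
    return np.repeat(image[None, ...], num_segments, axis=0)

def translate_x(frame, shift):
    """Hypothetical transformation: cyclic horizontal shift by `shift` pixels."""
    return np.roll(frame, shift, axis=1)

def apply_progressively(video, transform, max_magnitude):
    """Apply `transform` to every frame with a magnitude that grows linearly
    with the frame index (assumed schedule, chosen to keep the clip smooth)."""
    num_frames = video.shape[0]
    magnitudes = np.linspace(0, max_magnitude, num_frames).round().astype(int)
    return np.stack([transform(f, int(m)) for f, m in zip(video, magnitudes)])

# Toy 32x32 RGB image; 8 frames matches "Number of segments" in Tab. 9.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
video = replicate_to_video(image, num_segments=8)
clip = apply_progressively(video, translate_x, max_magnitude=7)
print(video.shape, clip.shape)
```

The first frame of `clip` is the untouched image and each later frame is shifted by one more pixel, so neighbouring frames differ only slightly.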

| Optimization | Cifar10 | Cifar100 | Tiny ImageNet | ImageNet | Pascal-VOC | CityScapes |
|---|---|---|---|---|---|---|
| Image size | (32,32) | (32,32) | (64,64) | (224,224) | (380,380) | (1024,1024) |
| Optimizer | SGD | SGD | SGD | SGD | SGD | SGD |
| Batch size | 96 | 96 | 64 | 32 | 32 | 16 |
| Learning rate scheduler | step decay | step decay | step decay | step decay | poly | poly |
| Base learning rate | 0.1 | 0.1 | 0.1 | 0.1 | 0.03 | 0.03 |
| Weight decay | 1e-4 | 1e-4 | 5e-4 | 1e-4 | 1e-4 | 1e-4 |
| Epochs | 90 | 90 | 90 | 100 | 80 | 130 |
| Number of segments | 8 | 8 | 8 | 8 | 8 | 8 |

Table 9: Hyperparameters employed for our experiments.
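For reference, the settings of Tab. 9 can be written as a configuration dictionary. The values below are transcribed from the table; the dictionary layout and key names are illustrative. The `poly_lr` helper shows one common form of the “poly” schedule, and its exponent `power=0.9` is an assumption, since the schedule's exact form and exponent are not stated here.

```python
# Values transcribed from Tab. 9; key names are illustrative.
COMMON = dict(optimizer="SGD", num_segments=8)

HPARAMS = {
    "Cifar10":       dict(image_size=(32, 32),     batch_size=96, scheduler="step", base_lr=0.1,  weight_decay=1e-4, epochs=90),
    "Cifar100":      dict(image_size=(32, 32),     batch_size=96, scheduler="step", base_lr=0.1,  weight_decay=1e-4, epochs=90),
    "Tiny ImageNet": dict(image_size=(64, 64),     batch_size=64, scheduler="step", base_lr=0.1,  weight_decay=5e-4, epochs=90),
    "ImageNet":      dict(image_size=(224, 224),   batch_size=32, scheduler="step", base_lr=0.1,  weight_decay=1e-4, epochs=100),
    "Pascal-VOC":    dict(image_size=(380, 380),   batch_size=32, scheduler="poly", base_lr=0.03, weight_decay=1e-4, epochs=80),
    "CityScapes":    dict(image_size=(1024, 1024), batch_size=16, scheduler="poly", base_lr=0.03, weight_decay=1e-4, epochs=130),
}

def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay from base_lr to 0; power=0.9 is an assumed default."""
    return base_lr * (1.0 - step / total_steps) ** power

cfg = {**COMMON, **HPARAMS["Pascal-VOC"]}
for epoch in (0, 40, 79):
    print(epoch, round(poly_lr(cfg["base_lr"], epoch, cfg["epochs"]), 5))
```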

Appendix D Example of cells found by DAS

We provide the graph visualization of the cells found by DAS for Cifar100 (Fig. 7(a)), ImageNet (Fig. 7(b)), Pascal-VOC (Fig. 7(c)) and Cityscapes (Fig. 7(d)). We do not report the results for Cifar10 and Tiny ImageNet, as the cells found are the same as those for Cifar100 and ImageNet, respectively. This justifies the results previously introduced in Tab. 8 for Cifar-10.

Figure 7: Best transformations found by DAS. (a) Cifar100. (b) ImageNet. (c) Pascal-VOC. (d) Cityscapes.

Appendix E Additional ablations on DAS for Image-to-Video

Tab. 10 compares the results previously shown in Tab. 5 of the main paper with an additional experiment that demonstrates the need for the GSF component. To this end, the frames of the video input (obtained with the best transformations found by DAS) are randomly shuffled, with the goal of losing temporal continuity. This experiment aims to show that both components, DAS and GSF, are needed, but it does not imply a limitation of DAS in the definition of the search space. As the optimization of the DAS cell occurs during the training of the network, even given a huge search space with non-continuous transformations, DAS will optimize towards the transformations that lead to the highest validation accuracy for that architecture. As a result, as we show with further experiments in Sec. F, the approach stays robust even under noisy transformations. The experiments are run with a PSPNet with a ResNet-50 backbone for the Pascal-VOC dataset and with a ResNet-18 for the Cifar10 dataset. For each dataset, we show the accuracy (first row), the number of parameters (second row), and the number of FLOPs (third row), with an input size of $32\times 32$ and $400\times 400$ for Cifar10 and Pascal-VOC, respectively. Finally, Fig. 8 compares the receptive field (RF) of our method (left) with that of a standard 2D CNN (right) for an ImageNet sample.

Figure 8: Receptive Field shape difference between our method (left) and standard 2D CNNs (right).

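| | Baseline | DAS Aug (S) | Re-shuffle | Ours |
|---|---|---|---|---|
| Pascal, accuracy | 85.40 | 85.51 | 85.44 | 86.10 |
| Pascal, # parameters | 51.32 M | 51.32 M | 51.43 M | 51.43 M |
| Pascal, FLOPs | 16.55 Gflops | 16.55 Gflops | 16.67 Gflops | 16.67 Gflops |
| Cifar10, accuracy | 94.12 | 94.23 | 94.15 | 95.12 |
| Cifar10, # parameters | 11.18 M | 11.18 M | 11.20 M | 11.20 M |
| Cifar10, FLOPs | 37.12 Mflops | 37.12 Mflops | 37.12 Mflops | 37.12 Mflops |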
Table 10: Additional ablation experiments. Baseline was obtained with the 2D backbone with standard augmentation techniques. “DAS Aug (S)” stands for the inclusion of additional DAS augmentations in Space S, meaning that the data is processed by a 2D backbone. “Re-shuffle” processes the input in the same way as DAS Aug (S) but stacks the transformations in the temporal dimension to create a video, and subsequently re-shuffles the frames of the video. “Ours” processes the input obtained with DAS with temporal continuity preserved. The backbone for the last two experiments is 2D+temporal shift.

The small difference between the “re-shuffle” experiment and both the baseline and DAS Aug (S) on the Pascal-VOC and Cifar-10 datasets is probably due to perturbations. The temporal shift mechanism, i.e. GSF, is designed to learn to shift features among adjacent frames. However, if those features are not consistent across the time dimension, GSF correctly learns not to route gated features. As a result, the experiment reduces to processing data augmented as in DAS Aug (S) with a 2D backbone integrated with a temporal shift mechanism that learns not to shift.
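The re-shuffle operation used in this ablation amounts to a random permutation of the frames along the temporal axis. A minimal sketch, assuming a `(T, H, W, C)` clip layout and a toy clip whose frame `t` is filled with the value `t` (both are illustrative choices, not taken from the paper):

```python
import numpy as np

def reshuffle_frames(video, rng):
    """Return a copy of a (T, ...) clip with its frames in random order,
    together with the permutation that was applied."""
    order = rng.permutation(video.shape[0])
    return video[order], order

def mean_adjacent_diff(clip):
    """Mean absolute difference between consecutive frames (a crude smoothness proxy)."""
    return float(np.mean(np.abs(np.diff(clip, axis=0))))

rng = np.random.default_rng(0)
# Toy clip: frame t is constant and equal to t, so the temporal order is easy to read off.
video = np.stack([np.full((4, 4, 3), t, dtype=np.float32) for t in range(8)])
shuffled, order = reshuffle_frames(video, rng)

print("frame order after shuffling:", order.tolist())
print("smoothness before / after:", mean_adjacent_diff(video), mean_adjacent_diff(shuffled))
```

The shuffled clip contains exactly the same frames, so any change in accuracy can only come from the lost temporal continuity, which is what the ablation isolates.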

Appendix F Experiments on DAS for Image-to-Image

F.1 Comparison with SOTAs

Tab. 11 compares our Differentiable Augmentation Search with other SOTA auto-augmentation techniques, i.e. AA [1] and RA [2], for the image-to-image task. This means that no temporal expansion is performed, and that we define a search space comparable to those usually deployed for finding standard data augmentations. Similar to AA and RA, our search space contains the following set of transformations: Shear X/Y, Translate X/Y, Rotate, AutoContrast, Invert, Equalize, Solarize, Posterize, Color, Brightness, Sharpness, Cutout, and Identity, which corresponds to applying no transformation. We run experiments on Cifar-10, Cifar-100, SVHN, and ImageNet. For this set of experiments, we did not fix a time budget for the search. Following the RA setup, for comparison purposes, we employed a Wide-ResNet-28-2 for the first three datasets and a ResNet-50 model for ImageNet.

| Method | Search space | Cifar-10 (WRN) | Cifar-100 (WRN) | SVHN (WRN) | ImageNet (ResNet) |
|---|---|---|---|---|---|
| Baseline | 0 | 94.90 | 75.40 | 96.70 | 76.30 |
| AA | $10^{32}$ | 95.90 | 78.50 | 98.00 | 77.60 |
| RA | $10^{2}$ | 95.80 | 78.30 | **98.30** | 77.60 |
| DAS | $10^{13}$ | **96.10** | **78.90** | **98.30** | **77.90** |
Table 11: Comparison among different auto-augmentation methods. WRN stands for Wide-ResNet-28-2, while ResNet is the ResNet-50 model. Best results are bolded.

DAS outperforms previous auto-augmentation methods on all datasets except SVHN, where it matches the performance of RA.

F.2 Advantages of DAS

We now examine the importance of our differentiable algorithm by highlighting the main drawback of each of the two cited competitors. AA is extremely competitive in terms of accuracy, surpassing RA on Cifar-10 and Cifar-100 and matching it on ImageNet. However, AA is extremely slow, requiring 15000 GPU hours to search for the optimal policy on a reduced ImageNet. RA, in contrast, is extremely efficient, as it reduces the search space to $10^{2}$ different choices, but we argue that it is not robust to irrelevant transformations. Indeed, the authors of [2] show that introducing color transformations in the Cifar-10 experiments degrades validation accuracy on average. This implies that the search space must be designed carefully and cannot include transformations that may harm performance on the dataset. This behaviour stems from their search space definition, where each transformation is selected with uniform probability $1/K$. As the number $K$ of transformations in the search space increases, this probability decreases and the time required to find the best transformations increases. Under a fixed search-time budget, this in turn results in higher variability when the procedure is run multiple times. Evidence supporting this is displayed in Fig. 9, where we fix a search budget of 24 hours for Cifar-10 and exacerbate this behaviour by progressively adding noise transformations.
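The effect of uniform sampling on a growing search space can be illustrated with a short simulation. This is a sketch under stated assumptions: the operation names follow the list in Sec. F.1 (with X/Y variants counted separately), the `Noise*` operations are hypothetical placeholders for irrelevant transformations, and drawing a sequence of `num_layers` operations per image is an assumption about the sampling scheme, not a description of the experiments behind Fig. 9.

```python
import random

BASE_OPS = [
    "ShearX", "ShearY", "TranslateX", "TranslateY", "Rotate", "AutoContrast",
    "Invert", "Equalize", "Solarize", "Posterize", "Color", "Brightness",
    "Sharpness", "Cutout", "Identity",
]

def hit_probability(num_ops, num_layers):
    """Probability that num_layers independent uniform draws from num_ops
    operations reproduce one specific sequence of operations."""
    return (1.0 / num_ops) ** num_layers

def empirical_hit_rate(ops, target, trials, seed=0):
    """Fraction of uniformly sampled sequences that equal `target`."""
    rng = random.Random(seed)
    hits = sum(
        tuple(rng.choice(ops) for _ in target) == tuple(target)
        for _ in range(trials)
    )
    return hits / trials

# Progressively add hypothetical noise operations and watch the probability shrink.
for num_noise in (0, 5, 10):
    ops = BASE_OPS + [f"Noise{i}" for i in range(num_noise)]
    print(len(ops), hit_probability(len(ops), num_layers=2))

rate = empirical_hit_rate(BASE_OPS, target=("Rotate",), trials=20_000)
print("empirical rate for one op:", rate, "expected:", 1 / len(BASE_OPS))
```

Each added operation lowers the chance of drawing any given useful sequence, which is consistent with the longer search and higher run-to-run variability discussed above.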

Figure 9: Results on Cifar-10 for each auto-augmentation technique. Experiments are run 5 times, with the shaded area representing the variance.
Figure 10: Top-1 test accuracy on the Cifar-100 dataset for different portions of the training set removed.

Appendix G Segmentation results

We provide more segmentation results on the Pascal-VOC (Fig. 11) and CityScapes (Fig. 12) datasets. Each figure shows the original image (first column), the ground truth (second column), the results of DeepLabv3 (third column), and the results of our method (fourth column). Squares highlight the details where the difference between the results is most visible. In general, our method reconstructs details more faithfully, e.g. the back part of the airplane, the details of the motorcycle, and the plants in Pascal-VOC, and the street lamps in Cityscapes. With respect to the baseline, fewer regions are also misclassified, as can be seen for the portion of the table in the sixth row of the Pascal-VOC results, the traffic lights in the third row of the Cityscapes results, and the sidewalk in the sixth row of the Cityscapes results.

Figure 11: VOC qualitative results. Original image (a), Ground Truth (b), DeepLabv3 (c) and Ours (d) images are displayed.
Figure 12: City qualitative results. Original image (a), Ground Truth (b), DeepLabv3 (c) and Ours (d) images are displayed.

Appendix H Generalizability with reduced training data

To strengthen our point, we run a further ablation on Cifar100, shown in Fig. 10. When the size of the training set is reduced, our method barely suffers any performance degradation compared to standard augmentations (Aug) and to DAS augmentations not concatenated in time (DAS Aug S). This makes it particularly useful in scenarios where little data is available: compared to collecting new data, representing an image as a video costs far less.
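A reduced-data ablation of this kind needs a way to drop a given portion of the training set. The sketch below shows one possible protocol, removing the same fraction from every class; whether the data were removed per class or at random over the whole set is not stated here, so the stratification and the helper name are assumptions. The label array mimics the Cifar100 training split (100 classes, 500 images each).

```python
import numpy as np

def remove_fraction_per_class(labels, removed_fraction, rng):
    """Return the sorted indices kept after dropping `removed_fraction`
    of the samples of every class (assumed stratified protocol)."""
    labels = np.asarray(labels)
    kept = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)  # random choice of which samples of this class survive
        num_keep = int(round(len(idx) * (1.0 - removed_fraction)))
        kept.append(idx[:num_keep])
    return np.sort(np.concatenate(kept))

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(100), 500)  # Cifar100-like training labels
kept_idx = remove_fraction_per_class(labels, removed_fraction=0.4, rng=rng)
print(len(kept_idx), "samples kept out of", len(labels))
```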

References

  • [1] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 113–123, 2019.
  • [2] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020.