ATOM: Attention Mixer for Efficient Dataset Distillation

Samir Khaki1,  Ahmad Sajedi1∗,  Kai Wang2,  Lucy Z. Liu3,  Yuri A. Lawryshyn1,
and  Konstantinos N. Plataniotis1
1University of Toronto         2 National University of Singapore         3Royal Bank of Canada (RBC)
{samir.khaki, ahmad.sajedi}@mail.utoronto.ca
Code: https://github.com/DataDistillation/ATOM
∗Equal contribution
Abstract

Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy levels akin to those achieved by models trained on the entirety of the original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their convincing efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges in dataset distillation, we propose the ATtentiOn Mixer (ATOM) module to efficiently distill large datasets using a mixture of channel and spatial-wise attention in the feature matching process. Spatial-wise attention helps guide the learning process based on consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, thus making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR10/100 and TinyImagenet. Notably, our method significantly improves performance in scenarios with a low number of images per class, thereby enhancing its potential. Furthermore, we maintain the improvement on cross-architectures and applications such as neural architecture search.

1 Introduction

Figure 1: The ATOM Framework utilizes inherent information to capture both context and location, resulting in significantly improved performance in dataset distillation. We display the performance of various components within the ATOM framework, showcasing a 5.8% enhancement from the base distribution matching performance on CIFAR10 at IPC50. Complete numerical details can be found in Table 4.

Efficient deep learning has surged in recent years due to the increasing computational costs associated with training and inference pipelines [26, 63, 76, 53, 75, 51, 71, 52, 54, 2]. This growth can be attributed to the escalating complexity of model architectures and the ever-expanding scale of datasets. In response to this increasing computational burden, two distinct approaches have emerged as potential avenues for addressing the issue: the model-centric and data-centric approaches. The model-centric approach is primarily concerned with mitigating computational costs by refining the architecture of deep learning models. Techniques such as pruning, quantization, knowledge distillation, and architectural simplification are key strategies employed within this paradigm [26, 68, 71, 65, 29, 30, 50, 49]. In contrast, the data-centric approach adopts a different perspective, focusing on exploring and leveraging the inherent redundancy within datasets. Rather than modifying model architectures, this approach seeks to identify or construct a smaller dataset that retains the essential information necessary for maintaining performance levels. Coreset selection was a commonly adopted method for addressing this gap [47, 6, 4, 55, 60]. In particular, works such as Herding [66] and K-Center [55] offered a heuristic-based approach to intelligently select an informative subset of data. However, as heuristic-based methods, their downstream performance is limited by the information contained solely in the subset. More recently, Shapley data selection [17] sought the optimal subset of data by measuring the downstream performance for every subset combination achievable in the dataset. Beyond being inefficient, this approach still yields downstream performance that is limited by the diversity of the samples selected. Therefore, Dataset Distillation (DD) [63] has emerged as a front-runner wherein a synthetic dataset can be learned.

Dataset distillation aims to distill large-scale datasets into a smaller representation, such that downstream models trained on this condensed dataset will retain competitive performance with those trained on the larger original one [63, 76, 7]. Recently, many techniques have been introduced to address this challenge, including gradient matching [76, 74, 38], feature/distribution matching [75, 51, 77], and trajectory matching [7, 14, 21]. However, many of these methods suffer from complex and computationally heavy distillation pipelines [76, 7, 21] or inferior performance [75, 51, 76]. A promising approach, DataDAM [51], effectively tackled the computational challenges present in prior distillation techniques by employing untrained neural networks, in contrast to bi-level optimization methods. However, despite its potential, DataDAM faced several significant limitations: (1) it obscured relevant class-content-based information existing channel-wise in intermediate layers; (2) it only achieved marginal enhancements on previous dataset distillation algorithms; and (3) it exhibited inferior cross-architecture generalization.

In this work, we introduce the ATtentiOn Mixer, dubbed ATOM, as an efficient dataset distillation pipeline that strikes an impressive balance between computational efficiency and superior performance. Drawing upon spatial attention matching techniques from prior studies like DataDAM [51], we expand our receptive field of information in the matching process. Our key contribution lies in mixing spatial information with channel-wise contextual information. Intuitively, different convolutional filters focus on different localizations of the input feature; thus, channel-wise attention aids in the distillation matching process by compressing and aggregating information from multiple regions, as evident from the performance improvements displayed in Figure 1. ATOM not only combines localization and context, but it also produces distilled images that are more generalizable to various downstream architectures, implying that the distilled features are true representations of the original dataset. Moreover, our approach demonstrates consistent improvements across all settings on a comprehensive distillation test suite. In summary, the key contributions of this study can be outlined as follows:

[C1]: We provide further insight into the intricacies of attention matching, ultimately introducing the use of channel-wise attention matching for capturing a higher level of information in the feature-matching process. Our mixing module combines spatial localization awareness of a particular class with distinctive contextual information derived channel-wise.

[C2]: Empirically, we show superior performance against previous dataset distillation methods, including feature matching and attention matching works, without bi-level optimization on common computer vision datasets.

[C3]: We extend our findings by demonstrating superior performance in cross-architecture and neural architecture search. In particular, we provide a channel-only setting that maintains the majority of the performance while incurring a lower computational cost.

2 Related Works

Coreset Selection. Coreset selection, an early data-centric approach, aimed to efficiently choose a representative subset from a full dataset to enhance downstream training performance and efficiency. Various methods have been proposed in the past, including geometry-based approaches [1, 10, 55, 57, 66], loss-based techniques as mentioned in [59, 46], decision-boundary-focused methods [42, 16], bilevel optimization strategies [32, 33], and gradient-matching algorithms outlined in [43, 31]. Notable among them are Random, which randomly selects samples as the coreset; Herding, which picks samples closest to the cluster center; K-Center, which selects multiple center points to minimize the maximum distance between data points and their nearest center; and Forgetting, which identifies informative training samples based on learning difficulties [6, 4, 55, 59]. While these selection-based methods have shown moderate success in efficient training, they inherently possess limitations in capturing rich information. Since each image in the selected subset is treated independently, they lack the rich features that could have been captured if the diversity within classes had been considered. These limitations have motivated the emergence of dataset distillation within the field.
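
To make the K-Center idea described above concrete, the following is a minimal sketch of the classic greedy farthest-point heuristic, which repeatedly adds the sample that is farthest from its nearest already-selected center. It is an illustration only: the function name `k_center_greedy`, the choice of Euclidean distance, and the fixed starting index are assumptions for this example, and the cited coreset works may differ in all of these details.

```python
import numpy as np

def k_center_greedy(features, k, first=0):
    """Greedy farthest-point selection of k centers (illustrative sketch).

    features: (N, D) array of per-sample feature vectors.
    Returns the list of selected sample indices.
    """
    selected = [first]
    # Distance from every sample to its nearest selected center so far.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))      # farthest sample becomes a center
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

# Toy 1-D example: three nearby points and one far-away point.
points = np.array([[0.0], [1.0], [2.0], [10.0]])
centers = k_center_greedy(points, k=2)
print(centers)  # [0, 3]
```

Because the result is always a subset of real samples, such a selection cannot synthesize new information, which is the limitation that motivates dataset distillation.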

Figure 2: (a) An overview of the proposed ATOM framework. By mixing attention, ATOM is able to capture both spatial localization and class context. (b) Demonstration of the internal architecture for spatial- and channel-wise attention in the ATOM Module. The spatial-wise attention computes attention at specific locales through different filters, resulting in a matrix output, whereas the channel-wise attention calculates attention between each filter, naturally producing a vectorized output.

Dataset Distillation. Dataset distillation has emerged as a learnable method of synthesizing a smaller, information-rich dataset from a large-scale real dataset. This approach offers a more efficient training paradigm, commonly applied in various downstream applications such as continual learning [9, 51, 76, 20, 70], neural architecture search [27, 58], and federated learning [28, 69, 39, 40]. The seminal work, initially proposed by Wang et al. [63], introduced bilevel optimization, comprising an outer loop for learning the pixel-level synthetic dataset and an inner loop for training the matching network. Following this, several studies adopted surrogate objectives to tackle unrolled optimization problems in meta-learning. For example, gradient matching methods [76, 74, 38, 34, 15] learn images by aligning network gradients derived from real and synthetic datasets. Trajectory matching [7, 14, 11, 21] improves performance by minimizing differences in model training trajectories between original and synthetic samples. Meanwhile, feature matching strategies [75, 61, 51, 77, 73] aim to align feature distributions between real and synthetic data within diverse latent spaces. Despite significant advancements in this field, existing methods still struggle to find a trade-off between the computational costs associated with the distillation pipeline and the model's performance. A recent work, DataDAM [51], used spatial attention to improve the performance of feature-matching-based methods by selectively matching features based on their spatial attention scores. However, although this method operates without bilevel optimization, it only marginally improves performance on larger test suites. In this study, we delve deeper into the potential of attention-based methods and demonstrate superior performance compared to DataDAM and previous benchmarks across various computer vision datasets. 
Additionally, we achieve a lower computational cost compared to conventional attention-matching approaches by leveraging information in a channel-wise manner.

Attention Mechanism. Attention mechanisms have been widely adopted in deep learning to enhance performance across various tasks [3, 64, 72]. Initially applied in natural language processing [3], they have since been extended to computer vision, with global attention models [64] improving image classification and convolutional block attention modules [67] enhancing feature map selection. Additionally, attention aids model compression in knowledge distillation [72]. Attention mechanisms are lauded for their ability to efficiently incorporate global contextual information into feature representations. When applied to feature maps, attention can take the form of either spatial or channel-based methods. Spatial methods focus on identifying the informative regions ("where"), while channel-based methods complementarily emphasize the informative features ("what"). Both spatial localization and channel information are crucial for identifying class characteristics. Recently, Sajedi et al. proposed DataDAM [51], which concentrates only on spatial attention, capturing class correlations within image localities for efficient training purposes. However, motivated by the inherent obfuscation of the content in the attention maps, we propose an Attention Mixer module that uses a unique combination of spatial and channel-wise attention to capture localization and information content.

3 Methodology

Given the larger source dataset $\mathcal{T}=\{(\bm{x}_i,y_i)\}_{i=1}^{|\mathcal{T}|}$ containing $|\mathcal{T}|$ real image-label pairs, we generate a smaller learnable synthetic dataset $\mathcal{S}=\{(\bm{s}_j,y_j)\}_{j=1}^{|\mathcal{S}|}$ with $|\mathcal{S}|$ synthetic image-label pairs. Following previous works [76, 74, 61, 51, 7], we use random sampling to initialize our synthetic dataset. For every class $k$, we obtain a batch of real and synthetic data ($B^{\mathcal{T}}_k$ and $B^{\mathcal{S}}_k$, respectively) and use a neural network $\phi_{\bm{\theta}}(\cdot)$ with randomly initialized weights $\bm{\theta}$ [22] to extract intermediate and output features. We illustrate our method in Figure 2, where an $L$-layer neural network $\phi_{\bm{\theta}}(\cdot)$ is used to extract features from the real and synthetic sets.
The collection of feature maps from the real and synthetic sets can be expressed as $\phi_{\bm{\theta}}(\mathcal{T}_k)=[\bm{f}^{\mathcal{T}_k}_{\bm{\theta},1},\cdots,\bm{f}^{\mathcal{T}_k}_{\bm{\theta},L}]$ and $\phi_{\bm{\theta}}(\mathcal{S}_k)=[\bm{f}^{\mathcal{S}_k}_{\bm{\theta},1},\cdots,\bm{f}^{\mathcal{S}_k}_{\bm{\theta},L}]$, respectively.
The feature $\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l}$ comprises a multi-dimensional array within $\mathbb{R}^{|B^{\mathcal{T}}_k|\times C_l\times W_l\times H_l}$, obtained from the real dataset at the $l^{\text{th}}$ layer, where $C_l$ denotes the number of channels and $H_l\times W_l$ represents the spatial dimensions. Correspondingly, a feature $\bm{f}^{\mathcal{S}_k}_{\bm{\theta},l}$ is derived for the synthetic dataset.
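
As a rough illustration of the random-sampling initialization mentioned above, the sketch below draws a fixed number of images per class (IPC) from the real set and copies them into the synthetic set. The helper name `init_synthetic_set`, the array layout `(N, C, H, W)`, and the toy data are assumptions made for this example; they are not taken from the released ATOM code.

```python
import numpy as np

def init_synthetic_set(images, labels, ipc, rng):
    """Initialize a synthetic set with `ipc` randomly sampled real images per class."""
    syn_images, syn_labels = [], []
    for k in np.unique(labels):
        class_idx = np.flatnonzero(labels == k)            # indices of class k
        chosen = rng.choice(class_idx, size=ipc, replace=False)
        syn_images.append(images[chosen].copy())           # copies become learnable pixels
        syn_labels.append(np.full(ipc, k, dtype=labels.dtype))  # labels stay fixed
    return np.concatenate(syn_images), np.concatenate(syn_labels)

# Toy stand-in for a real dataset: 10 classes, 20 images each, CIFAR-like shape.
rng = np.random.default_rng(0)
real_x = rng.standard_normal((200, 3, 32, 32)).astype(np.float32)
real_y = np.repeat(np.arange(10), 20)

syn_x, syn_y = init_synthetic_set(real_x, real_y, ipc=5, rng=rng)
print(syn_x.shape, syn_y.shape)  # (50, 3, 32, 32) (50,)
```

In an actual pipeline, `syn_x` would then be treated as the trainable parameters, while `syn_y` is kept constant.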

We now introduce the Attention Mixer Module (ATOM), which generates attention maps for the intermediate features derived from both the real and synthetic datasets. Leveraging a feature-based mapping function $A(\cdot)$, ATOM takes the intermediate feature maps as input and produces a corresponding attention map for each feature. Formally, we express this as: $A\big(\phi_{\bm{\theta}}(\mathcal{T}_k)\big)=[\bm{a}^{\mathcal{T}_k}_{\bm{\theta},1},\cdots,\bm{a}^{\mathcal{T}_k}_{\bm{\theta},L-1}]$ and $A\big(\phi_{\bm{\theta}}(\mathcal{S}_k)\big)=[\bm{a}^{\mathcal{S}_k}_{\bm{\theta},1},\cdots,\bm{a}^{\mathcal{S}_k}_{\bm{\theta},L-1}]$ for the real and synthetic sets, respectively.
Previous works [51, 72] have shown that spatial attention, which aggregates the absolute values of feature maps across the channel dimension, can emphasize common spatial locations associated with high neuron activation. The implication of this is retaining the most informative regions, thus generating an efficient feature descriptor. In this work, we also consider the effect of channel-wise attention, which emphasizes the most significant information captured by each channel based on the magnitude of its activation. Since different filters explore different regions or locations of the input feature, channel-wise activation yields the best aggregation of the global information. Ultimately, we convert the feature map $\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l}$ of the $l^{\text{th}}$ layer into an attention map $\bm{a}^{\mathcal{T}_k}_{\bm{\theta},l}$ representing spatial or channel-wise attention using the corresponding mapping functions $A_s(\cdot)$ or $A_c(\cdot)$, respectively. Formally, we can denote the spatial and channel-wise attention maps as:

$$A_s(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})=\sum_{i=1}^{C_l}\big|(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})_i\big|^{p_s}, \tag{1}$$
$$A_c(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})=\sum_{i=1}^{H_l * W_l}\big|(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})^{\star}_i\big|^{p_c}, \tag{2}$$

where $(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})_i=\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l}(:,i,:,:)$ is the feature map of channel $i$ from the $l^{\text{th}}$ layer, and the power and absolute value operations are applied element-wise; meanwhile, the symbol $\star$ flattens the feature map along the spatial dimension $\big((\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})^{\star}\in\mathbb{R}^{|B^{\mathcal{T}}_k|\times C_l\times W_l * H_l}\big)$, such that $(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})^{\star}_i=(\bm{f}^{\mathcal{T}_k}_{\bm{\theta},l})^{\star}(:,:,i)$. By leveraging both types of attention, we can better encapsulate the relevant information in the intermediate features, as investigated in Figure 4. Further, the effect of the power parameters for spatial and channel-wise attention, i.e., $p_s$ and $p_c$, is studied in Table 4.
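
The two attention maps in Equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: features are stored as `(batch, channels, height, width)` arrays, the function names are chosen for this example, and the default exponent of 2 is only a placeholder rather than the setting used in the experiments.

```python
import numpy as np

def spatial_attention(f, p_s=2):
    """Eq. (1): sum |f|^p_s over the channel axis.

    f: (B, C, H, W) feature map  ->  (B, H, W) attention map.
    """
    return (np.abs(f) ** p_s).sum(axis=1)

def channel_attention(f, p_c=2):
    """Eq. (2): flatten the spatial axes, then sum |f|^p_c over positions.

    f: (B, C, H, W) feature map  ->  (B, C) attention vector.
    """
    f_star = f.reshape(f.shape[0], f.shape[1], -1)   # (B, C, H*W)
    return (np.abs(f_star) ** p_c).sum(axis=2)

# All-ones features make the sums easy to verify by hand.
feat = np.ones((2, 3, 4, 5))
a_s = spatial_attention(feat)
a_c = channel_attention(feat)
print(a_s.shape, a_c.shape)  # (2, 4, 5) (2, 3)
```

The shapes show the difference in cost: the spatial map keeps one value per location, whereas the channel map keeps only one value per filter.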

Given our generated spatial and channel attention maps for the intermediate features, we apply standard normalization such that we can formulate a matching loss between the synthetic and real datasets. We denote our generalized loss $\mathcal{L}_{\text{ATOM}}$ as:

$$\mathop{\mathbb{E}}_{\bm{\theta}\sim P_{\bm{\theta}}}\bigg[\sum_{k=1}^{K}\sum_{l=1}^{L-1}\Big\lVert\mathbb{E}_{\mathcal{T}_k}\Big[\frac{\bm{z}^{\mathcal{T}_k}_{\bm{\theta},l}}{\lVert\bm{z}^{\mathcal{T}_k}_{\bm{\theta},l}\rVert_2}\Big]-\mathbb{E}_{\mathcal{S}_k}\Big[\frac{\bm{z}^{\mathcal{S}_k}_{\bm{\theta},l}}{\lVert\bm{z}^{\mathcal{S}_k}_{\bm{\theta},l}\rVert_2}\Big]\Big\rVert^2\bigg], \tag{3}$$

where, in the case of spatial attention, we denote $\bm{z}^{\mathcal{T}_{k}}_{\bm{\theta},l}=vec(\bm{a}^{\mathcal{T}_{k}}_{\bm{\theta},l})\in\mathbb{R}^{|B^{\mathcal{T}}_{k}|\times(W_{l}\times H_{l})}$ and $\bm{z}^{\mathcal{S}_{k}}_{\bm{\theta},l}=vec(\bm{a}^{\mathcal{S}_{k}}_{\bm{\theta},l})\in\mathbb{R}^{|B^{\mathcal{S}}_{k}|\times(W_{l}\times H_{l})}$ to represent the vectorized spatial attention map pairs at the $l^{\text{th}}$ layer for the real and synthetic datasets, respectively. Meanwhile, for channel-based attention, we have $\bm{z}^{\mathcal{T}_{k}}_{\bm{\theta},l}=vec(\bm{a}^{\mathcal{T}_{k}}_{\bm{\theta},l})\in\mathbb{R}^{|B^{\mathcal{T}}_{k}|\times(C_{l})}$ and $\bm{z}^{\mathcal{S}_{k}}_{\bm{\theta},l}=vec(\bm{a}^{\mathcal{S}_{k}}_{\bm{\theta},l})\in\mathbb{R}^{|B^{\mathcal{S}}_{k}|\times(C_{l})}$ to represent the flattened channel attention map pairs at the $l^{\text{th}}$ layer for the real and synthetic datasets, respectively. The parameter $K$ is the number of categories in a dataset, and $P_{\bm{\theta}}$ denotes the distribution of network parameters. We estimate the expectation terms in Equation 3 empirically if ground-truth data distributions are not available.
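The shapes above can be illustrated with a minimal NumPy sketch. It assumes that an attention map is obtained by summing absolute feature values raised to a power $p$ over the channel axis (spatial attention) or over the spatial axes (channel attention); this definition, the `(B, C, H, W)` memory layout, the function names, and the tensor sizes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def spatial_attention(feats: np.ndarray, p: float = 4.0) -> np.ndarray:
    """Collapse the channel axis: (B, C, H, W) -> vectorized map of shape (B, H*W)."""
    attention = np.sum(np.abs(feats) ** p, axis=1)   # (B, H, W)
    return attention.reshape(feats.shape[0], -1)     # vec(.) applied per sample

def channel_attention(feats: np.ndarray, p: float = 4.0) -> np.ndarray:
    """Collapse the spatial axes: (B, C, H, W) -> vector of shape (B, C)."""
    return np.sum(np.abs(feats) ** p, axis=(2, 3))

rng = np.random.default_rng(0)
# Hypothetical layer-l feature maps: 128 real and 10 synthetic samples of one class.
feats_real = rng.standard_normal((128, 16, 8, 8))
feats_syn = rng.standard_normal((10, 16, 8, 8))

z_real_spatial = spatial_attention(feats_real)   # |B_k^T| x (W_l * H_l)
z_syn_spatial = spatial_attention(feats_syn)     # |B_k^S| x (W_l * H_l)
z_real_channel = channel_attention(feats_real)   # |B_k^T| x C_l
z_syn_channel = channel_attention(feats_syn)     # |B_k^S| x C_l
```

The real and synthetic batches may differ in size, since only their batch-averaged attention vectors are compared.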

Following previous works [51, 75, 61, 77, 73], we leverage the features in the final layer to regularize our matching process. In particular, the features of the penultimate layer represent a high-level abstraction of information from the input images in an embedded representation and can thus be used to inject semantic information in the matching process [51, 48, 75, 19]. Thus, we employ $\mathcal{L}_{\text{MMD}}$ as described in [51, 75] out-of-the-box.

Finally, we learn the synthetic dataset by solving the following optimization problem with the SGD optimizer:

$$\mathcal{S}^{*}=\operatorname*{arg\,min}_{\mathcal{S}}\>\big(\mathcal{L}_{\text{ATOM}}+\lambda\mathcal{L}_{\text{MMD}}\big), \qquad (4)$$

where $\lambda$ is the task balance parameter inherited from [51]. In particular, we highlight that $\mathcal{L}_{\text{MMD}}$ brings semantic information from the final layer, while $\mathcal{L}_{\text{ATOM}}$ mixes the spatial and channel-wise attention information from the intermediate layers. Note that our approach assigns a fixed label to each synthetic sample and keeps it constant during training. A summary of the learning algorithm can be found in Algorithm 1.
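A minimal sketch of how the two terms of Equation 4 could be combined for a single class is given below. It assumes that the attention term compares batch means of L2-normalized attention vectors, summed over layers, and that the MMD term is the squared distance between mean final-layer embeddings; both forms, and all function names and array sizes, are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np

def normalized_mean(z: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Batch mean of the L2-normalized rows of z, where z has shape (B, D)."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return (z / (norms + eps)).mean(axis=0)

def atom_loss(z_real_layers, z_syn_layers) -> float:
    """Assumed form: squared distance between mean normalized attention vectors,
    summed over all (layer, attention type) pairs."""
    return float(sum(
        np.sum((normalized_mean(zr) - normalized_mean(zs)) ** 2)
        for zr, zs in zip(z_real_layers, z_syn_layers)
    ))

def mmd_loss(f_real: np.ndarray, f_syn: np.ndarray) -> float:
    """Assumed form: squared distance between mean final-layer embeddings."""
    return float(np.sum((f_real.mean(axis=0) - f_syn.mean(axis=0)) ** 2))

def total_loss(z_real_layers, z_syn_layers, f_real, f_syn, lam: float = 0.01) -> float:
    """L = L_ATOM + lambda * L_MMD for one class."""
    return atom_loss(z_real_layers, z_syn_layers) + lam * mmd_loss(f_real, f_syn)

rng = np.random.default_rng(0)
# Hypothetical inputs: one spatial (D=64) and one channel (D=16) attention matrix.
z_real = [rng.random((128, 64)), rng.random((128, 16))]
z_syn = [rng.random((10, 64)), rng.random((10, 16))]
f_real, f_syn = rng.standard_normal((128, 32)), rng.standard_normal((10, 32))

loss = total_loss(z_real, z_syn, f_real, f_syn, lam=0.01)
```

With $\lambda = 0.01$, the attention term dominates unless the final-layer embeddings differ substantially in scale.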

Algorithm 1 Attention Mixer for Dataset Distillation

Input: Real training dataset $\mathcal{T}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{|\mathcal{T}|}$
Required: Initialized synthetic samples for $K$ classes, Deep neural network $\phi_{\bm{\theta}}$ parameterized with $\bm{\theta}$, Probability distribution over randomly initialized weights $P_{\bm{\theta}}$, Learning rate $\eta_{\mathcal{S}}$, Task balance parameter $\lambda$, Number of training iterations $I$.

1: Initialize synthetic dataset $\mathcal{S}$
2: for $i=1,2,\cdots,I$ do
3:     Sample $\bm{\theta}$ from $P_{\bm{\theta}}$
4:     Sample mini-batch pairs $B_{k}^{\mathcal{T}}$ and $B_{k}^{\mathcal{S}}$ from the real and synthetic sets for each class $k$
5:     Compute $\mathcal{L}_{\text{ATOM}}$ and $\mathcal{L}_{\text{MMD}}$
6:     Calculate $\mathcal{L}=\mathcal{L}_{\text{ATOM}}+\lambda\mathcal{L}_{\text{MMD}}$
7:     Update the synthetic dataset using $\mathcal{S}\leftarrow\mathcal{S}-\eta_{\mathcal{S}}\nabla_{\mathcal{S}}\mathcal{L}$
8: end for

Output: Synthetic dataset $\mathcal{S}=\{(\bm{s}_{i},y_{i})\}_{i=1}^{|\mathcal{S}|}$
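The control flow of Algorithm 1 can be sketched on toy data as follows. This is only an illustration under strong assumptions: the "network" is a single random linear map with ReLU, the loss terms use the assumed forms of mean normalized attention matching and mean-embedding matching, synthetic samples are initialized from random real samples, and finite differences stand in for automatic differentiation. Because the loss is a sum over classes and each term depends only on that class's synthetic samples, the per-class update below equals one gradient step on the summed loss.

```python
import numpy as np

def features(theta, x):
    """Toy stand-in for the network: random linear map + ReLU -> (B, C, N) feature maps."""
    c, n, d = theta.shape
    return np.maximum(x @ theta.reshape(c * n, d).T, 0.0).reshape(len(x), c, n)

def mean_unit_rows(z, eps=1e-8):
    """Batch mean of L2-normalized rows."""
    return (z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)).mean(axis=0)

def loss_fn(theta, x_real, x_syn, lam, p=4):
    f_real, f_syn = features(theta, x_real), features(theta, x_syn)
    l_atom = 0.0
    for axis in (1, 2):  # sum over channels -> spatial map; sum over positions -> channel map
        z_real = (np.abs(f_real) ** p).sum(axis=axis)
        z_syn = (np.abs(f_syn) ** p).sum(axis=axis)
        l_atom += np.sum((mean_unit_rows(z_real) - mean_unit_rows(z_syn)) ** 2)
    e_real = f_real.reshape(len(x_real), -1).mean(axis=0)
    e_syn = f_syn.reshape(len(x_syn), -1).mean(axis=0)
    return l_atom + lam * np.sum((e_real - e_syn) ** 2)   # L_ATOM + lambda * L_MMD

def numerical_grad(f, s, h=1e-5):
    """Central finite differences; a real implementation would use autodiff."""
    grad = np.zeros_like(s)
    for idx in np.ndindex(s.shape):
        old = s[idx]
        s[idx] = old + h
        up = f(s)
        s[idx] = old - h
        down = f(s)
        s[idx] = old
        grad[idx] = (up - down) / (2 * h)
    return grad

def distill(real_x, real_y, num_classes, ipc=1, iters=5, lr=1.0, lam=0.01, batch=16, seed=0):
    rng = np.random.default_rng(seed)
    d = real_x.shape[1]
    syn_y = np.repeat(np.arange(num_classes), ipc)        # labels stay fixed
    syn_x = np.stack([real_x[rng.choice(np.flatnonzero(real_y == k))] for k in syn_y])
    for _ in range(iters):
        theta = rng.standard_normal((4, 3, d)) / np.sqrt(d)   # sample theta ~ P_theta
        for k in range(num_classes):
            real_idx = np.flatnonzero(real_y == k)
            x_real = real_x[rng.choice(real_idx, size=min(batch, real_idx.size), replace=False)]
            syn_idx = np.flatnonzero(syn_y == k)
            x_syn = syn_x[syn_idx]                        # fancy indexing returns a copy
            grad = numerical_grad(lambda s: loss_fn(theta, x_real, s, lam), x_syn)
            syn_x[syn_idx] = x_syn - lr * grad            # S <- S - eta * grad_S L
    return syn_x, syn_y

rng = np.random.default_rng(1)
real_y = np.repeat(np.arange(2), 50)
real_x = rng.standard_normal((100, 6)) + 2.0 * real_y[:, None]   # two toy classes
syn_x, syn_y = distill(real_x, real_y, num_classes=2, ipc=2)
```

Note that no network is ever trained: a fresh $\bm{\theta}$ is drawn at each iteration and only the synthetic samples receive gradient updates.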

4 Experiments

4.1 Experimental Setup

Datasets. Our method is evaluated on the CIFAR-10 and CIFAR-100 datasets [35], which maintain a resolution of $32\times 32$, aligning with state-of-the-art benchmarks. Furthermore, we resize the Tiny ImageNet [37] dataset to $64\times 64$ for additional experimentation. The supplementary materials provide more detailed dataset information.

| Method | CIFAR-10 IPC 1 | CIFAR-10 IPC 10 | CIFAR-10 IPC 50 | CIFAR-100 IPC 1 | CIFAR-100 IPC 10 | CIFAR-100 IPC 50 | Tiny ImageNet IPC 1 | Tiny ImageNet IPC 10 | Tiny ImageNet IPC 50 |
|---|---|---|---|---|---|---|---|---|---|
| Ratio % | 0.02 | 0.2 | 1 | 0.2 | 2 | 10 | 0.2 | 2 | 10 |
| Random | $14.4_{\pm 2.0}$ | $26.0_{\pm 1.2}$ | $43.4_{\pm 1.0}$ | $4.2_{\pm 0.3}$ | $14.6_{\pm 0.5}$ | $30.0_{\pm 0.4}$ | $1.4_{\pm 0.1}$ | $5.0_{\pm 0.2}$ | $15.0_{\pm 0.4}$ |
| Herding [66] | $21.5_{\pm 1.2}$ | $31.6_{\pm 0.7}$ | $40.4_{\pm 0.6}$ | $8.3_{\pm 0.3}$ | $17.3_{\pm 0.3}$ | $33.7_{\pm 0.5}$ | $2.8_{\pm 0.2}$ | $6.3_{\pm 0.2}$ | $16.7_{\pm 0.3}$ |
| K-Center [55] | $21.5_{\pm 1.3}$ | $14.7_{\pm 0.9}$ | $27.0_{\pm 1.4}$ | $8.4_{\pm 0.3}$ | $17.3_{\pm 0.3}$ | $30.5_{\pm 0.3}$ | - | - | - |
| Forgetting [59] | $13.5_{\pm 1.2}$ | $23.3_{\pm 1.0}$ | $23.3_{\pm 1.1}$ | $4.5_{\pm 0.2}$ | $15.1_{\pm 0.3}$ | - | $1.6_{\pm 0.1}$ | $5.1_{\pm 0.2}$ | $15.0_{\pm 0.3}$ |
| DD [63] | - | $36.8_{\pm 1.2}$ | - | - | - | - | - | - | - |
| LD [5] | $25.7_{\pm 0.7}$ | $38.3_{\pm 0.4}$ | $42.5_{\pm 0.4}$ | $11.5_{\pm 0.4}$ | - | - | - | - | - |
| DC [76] | $28.3_{\pm 0.5}$ | $44.9_{\pm 0.5}$ | $53.9_{\pm 0.5}$ | $12.8_{\pm 0.3}$ | $25.2_{\pm 0.3}$ | $30.6_{\pm 0.6}$ | $5.3_{\pm 0.1}$ | $12.9_{\pm 0.1}$ | $12.7_{\pm 0.4}$ |
| DCC [38] | $32.9_{\pm 0.8}$ | $49.4_{\pm 0.5}$ | $61.6_{\pm 0.4}$ | $13.3_{\pm 0.3}$ | $30.6_{\pm 0.4}$ | - | - | - | - |
| DSA [74] | $28.8_{\pm 0.7}$ | $52.1_{\pm 0.5}$ | $60.6_{\pm 0.5}$ | $13.9_{\pm 0.3}$ | $32.3_{\pm 0.3}$ | $42.8_{\pm 0.4}$ | $5.7_{\pm 0.1}$ | $16.3_{\pm 0.2}$ | $15.1_{\pm 0.2}$ |
| DM [75] | $26.0_{\pm 0.8}$ | $48.9_{\pm 0.6}$ | $63.0_{\pm 0.4}$ | $11.4_{\pm 0.3}$ | $29.7_{\pm 0.3}$ | $43.6_{\pm 0.4}$ | $3.9_{\pm 0.2}$ | $12.9_{\pm 0.4}$ | $25.3_{\pm 0.2}$ |
| GLaD [8] | $28.0_{\pm 0.8}$ | $46.7_{\pm 0.5}$ | $59.9_{\pm 0.7}$ | - | - | - | - | - | - |
| CAFE [61] | $30.3_{\pm 1.1}$ | $46.3_{\pm 0.6}$ | $55.5_{\pm 0.6}$ | $12.9_{\pm 0.3}$ | $27.8_{\pm 0.3}$ | $37.9_{\pm 0.3}$ | - | - | - |
| CAFE+DSA [61] | $31.6_{\pm 0.8}$ | $50.9_{\pm 0.5}$ | $62.3_{\pm 0.4}$ | $14.0_{\pm 0.3}$ | $31.5_{\pm 0.2}$ | $42.9_{\pm 0.2}$ | - | - | - |
| VIG [41] | $26.5_{\pm 1.2}$ | $54.6_{\pm 0.1}$ | $35.6_{\pm 0.6}$ | $17.8_{\pm 0.1}$ | $29.3_{\pm 0.1}$ | - | - | - | - |
| KIP [44] | $29.8_{\pm 1.0}$ | $46.1_{\pm 0.7}$ | $53.2_{\pm 0.7}$ | $12.0_{\pm 0.2}$ | $29.0_{\pm 0.3}$ | - | - | - | - |
| MTT [7] | $31.9_{\pm 1.2}$ | $56.4_{\pm 0.7}$ | $65.9_{\pm 0.6}$ | $13.8_{\pm 0.6}$ | $33.1_{\pm 0.4}$ | $42.9_{\pm 0.3}$ | $6.2_{\pm 0.4}$ | $17.3_{\pm 0.2}$ | $26.5_{\pm 0.3}$ |
| DAM [51] | $32.0_{\pm 1.2}$ | $54.2_{\pm 0.8}$ | $67.0_{\pm 0.4}$ | $14.5_{\pm 0.5}$ | $34.8_{\pm 0.5}$ | $49.4_{\pm 0.3}$ | $8.3_{\pm 0.4}$ | $18.7_{\pm 0.3}$ | $28.7_{\pm 0.3}$ |
| ATOM (Ours) | $\mathbf{34.8}_{\pm 1.0}$ | $\mathbf{57.9}_{\pm 0.7}$ | $\mathbf{68.8}_{\pm 0.5}$ | $\mathbf{18.1}_{\pm 0.4}$ | $\mathbf{35.7}_{\pm 0.4}$ | $\mathbf{50.2}_{\pm 0.3}$ | $\mathbf{9.1}_{\pm 0.2}$ | $\mathbf{19.5}_{\pm 0.4}$ | $\mathbf{29.1}_{\pm 0.3}$ |

Full Dataset: CIFAR-10 $84.8_{\pm 0.1}$, CIFAR-100 $56.2_{\pm 0.3}$, Tiny ImageNet $37.6_{\pm 0.4}$
Table 1: Comparison with previous dataset distillation methods on CIFAR-10, CIFAR-100 and Tiny ImageNet. The works DD and LD use AlexNet [36] for the CIFAR-10 dataset. All other methods use ConvNet for training and evaluation. Bold entries are the best results.
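The "Ratio %" row of Table 1 follows from the number of synthetic images (IPC times the number of classes) divided by the size of the real training set. The short sketch below reproduces it, assuming the standard training-set sizes of these benchmarks (50,000 images for CIFAR-10 and CIFAR-100, 100,000 images and 200 classes for Tiny ImageNet); the dictionary and function names are illustrative.

```python
# (number of classes, number of training images); standard benchmark sizes, assumed here.
DATASETS = {
    "CIFAR-10": (10, 50_000),
    "CIFAR-100": (100, 50_000),
    "Tiny ImageNet": (200, 100_000),
}

def ratio_percent(ipc: int, num_classes: int, train_size: int) -> float:
    """Size of the synthetic set as a percentage of the real training set."""
    return 100.0 * ipc * num_classes / train_size

ratios = {
    name: [ratio_percent(ipc, k, n) for ipc in (1, 10, 50)]
    for name, (k, n) in DATASETS.items()
}

for name, values in ratios.items():
    print(name, values)
```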

Network Architectures. We employ a ConvNet architecture [18] for distillation, following prior studies. The default ConvNet comprises three convolutional blocks, each consisting of a 128-kernel $3\times 3$ convolutional layer, instance normalization, ReLU activation, and $3\times 3$ average pooling with a stride of 2. To accommodate the increased resolutions in Tiny ImageNet, we append a fourth convolutional block. Network parameters are initialized using normal initialization [22] in all experiments.
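To see why a fourth block is appended for the $64\times 64$ inputs, the spatial size after each block can be traced with the usual pooling-size formula. The sketch below assumes that the convolution preserves resolution and that the pooling uses a padding of 1 (neither detail is stated in the text), so the resulting shapes are indicative only.

```python
def pooled_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output size of a pooling layer; padding=1 is an assumption, not stated in the text."""
    return (size + 2 * padding - kernel) // stride + 1

def final_feature_shape(resolution: int, num_blocks: int, channels: int = 128):
    """Shape (C, H, W) after the last block, assuming the conv keeps the resolution."""
    size = resolution
    for _ in range(num_blocks):
        size = pooled_size(size)
    return (channels, size, size)

cifar_shape = final_feature_shape(32, num_blocks=3)          # CIFAR-10/100
tiny_three_blocks = final_feature_shape(64, num_blocks=3)    # Tiny ImageNet, default depth
tiny_four_blocks = final_feature_shape(64, num_blocks=4)     # Tiny ImageNet, extra block
```

Under these assumptions the extra block brings the Tiny ImageNet feature map down to the same spatial size as for CIFAR.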

Evaluation Protocol. We evaluate the methods using standard measures from previous studies [75, 76, 61, 74, 51]. Five sets of synthetic images are generated from a real training dataset with 1, 10, and 50 images per class. Then, 20 neural network models are trained on each synthetic set using an SGD optimizer with a fixed learning rate of 0.01. Each experiment reports the mean and standard deviation values for 100 models to assess the efficacy of distilled datasets. Furthermore, computational costs are assessed by calculating run-time per step over 100 iterations, as well as peak GPU memory usage during 100 iterations of training.
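The aggregation in this protocol (5 synthetic sets, 20 models each, 100 accuracies in total) can be sketched as below. The function `train_and_evaluate` is a hypothetical placeholder that returns a random number instead of training a model, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator is reported.

```python
import random
import statistics

NUM_SYNTHETIC_SETS = 5
MODELS_PER_SET = 20

def train_and_evaluate(set_id: int, model_seed: int) -> float:
    """Hypothetical placeholder: a real version would train a network on synthetic
    set `set_id` with SGD (learning rate 0.01) and return its test accuracy in %."""
    rng = random.Random(1000 * set_id + model_seed)
    return rng.uniform(0.0, 100.0)   # placeholder value, not a real result

accuracies = [
    train_and_evaluate(set_id, model_seed)
    for set_id in range(NUM_SYNTHETIC_SETS)
    for model_seed in range(MODELS_PER_SET)
]

mean_accuracy = statistics.mean(accuracies)
std_accuracy = statistics.stdev(accuracies)   # sample standard deviation (assumed)
print(f"{mean_accuracy:.1f} +/- {std_accuracy:.1f} over {len(accuracies)} models")
```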

Implementation Details. We use the SGD optimizer with a fixed learning rate of 1 to learn synthetic datasets containing 1, 10, and 50 IPCs over 8000 iterations, with the task balance parameter $\lambda$ set to 0.01. Previous works have shown that $p_{s}=4$ is sufficient for spatial attention matching [51]. As such, we set our default case as $p_{c}=p_{s}=4$. This is further ablated in Table 4. We adopt differentiable augmentation for both training and evaluating the synthetic set, following [76, 51]. For dataset preprocessing, we utilize the Kornia implementation of Zero Component Analysis (ZCA) with default parameters, following previous works [44, 7, 51]. All experiments are performed on a single A100 GPU with 80 GB of memory. Further hyperparameter details can be found in the supplementary materials.
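For readers unfamiliar with ZCA whitening, a generic NumPy sketch of the transform is given below. It is not the Kornia implementation used in the experiments; the function names, the `eps` value, and the toy data are assumptions, and images would first be flattened to vectors of length $C \cdot H \cdot W$.

```python
import numpy as np

def zca_fit(x: np.ndarray, eps: float = 1e-6):
    """Fit ZCA whitening on rows of x with shape (N, D); returns (mean, whitening matrix)."""
    mean = x.mean(axis=0, keepdims=True)
    centered = x - mean
    cov = centered.T @ centered / (x.shape[0] - 1)
    u, s, _ = np.linalg.svd(cov)
    whitening = (u / np.sqrt(s + eps)) @ u.T      # U diag(1 / sqrt(s + eps)) U^T
    return mean, whitening

def zca_transform(x: np.ndarray, mean: np.ndarray, whitening: np.ndarray) -> np.ndarray:
    """Apply the fitted whitening; the matrix is symmetric, so right-multiplication suffices."""
    return (x - mean) @ whitening

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 0.5]])
data = rng.standard_normal((500, 3)) @ mixing     # correlated toy features

mean, whitening = zca_fit(data)
whitened = zca_transform(data, mean, whitening)
whitened_cov = np.cov(whitened, rowvar=False)     # close to the identity matrix
```

The statistics fitted on the real training set would then be applied unchanged to the test images.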

Competitive Methods. In this paper, we compare the empirical results of ATOM on three computer vision datasets: CIFAR-10/100 and Tiny ImageNet. We evaluate ATOM against four coreset selection approaches and thirteen distillation methods for training set synthesis. The coreset selection methods include Random selection [47], Herding [6, 4], K-Center [55], and Forgetting [60]. We also compare our approach with state-of-the-art distillation methods, including Dataset Distillation [63] (DD), Flexible Dataset Distillation [5] (LD), Dataset Condensation [76] (DC), Dataset Condensation with Contrastive (DCC) [38], Dataset Condensation with Differentiable Siamese Augmentation [74] (DSA), Distribution Matching [75] (DM), Deep Generative Priors [8] (GLaD), Aligning Features [61] (CAFE), VIG [41], Kernel Inducing Points [44, 45] (KIP), Matching Training Trajectories [7] (MTT), and Attention Matching [51] (DAM).

4.2 Comparison with State-of-the-art Methods

Performance Comparison. In this section, we present a comparative analysis of our method against coreset and dataset distillation approaches. ATOM consistently outperforms these studies, especially at smaller distillation ratios, as shown in Table 1. Since the goal of dataset distillation is to generate a more compact synthetic set, we emphasize our significant performance improvements at low IPCs. We achieve an improvement of almost 4% over the previous attention matching framework, DataDAM [51], when evaluated on CIFAR-100 at IPC1. Notably, our performance on CIFAR-100 at IPC50 is 50.2%, which is nearly 90% of the baseline accuracy while using a mere 10% of the original dataset. These examples motivate the development of dataset distillation works, as downstream models can achieve relatively competitive performance with their baselines at a fraction of the training costs. Our primary objective in this study is to investigate the impact of channel-wise attention within the feature-matching process. Compared to prior attention-based and feature-based methodologies, our findings underscore the significance of channel-wise attention and the ATOM module, as also validated in the ablation studies in Figure 4.
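For reference, the two headline figures quoted in this paragraph follow directly from the CIFAR-100 entries of Table 1:

```latex
\begin{align*}
\text{CIFAR-100, IPC1:}\quad  & 18.1 - 14.5 = 3.6 \ \text{percentage points (ATOM vs. DAM [51])},\\
\text{CIFAR-100, IPC50:}\quad & \frac{50.2}{56.2} \approx 0.893 \ \text{of the full-dataset accuracy}.
\end{align*}
```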

Cross-architecture Generalization. In this section, we assess the generalization capacity of our refined dataset by training various unseen deep neural networks on it and then evaluating their performance on downstream classification tasks. Following established benchmarks [76, 75, 61, 51], we examine classic CNN architectures such as AlexNet [36], VGG-11 [56], ResNet-18 [23], and additionally, a standard Vision Transformer (ViT) [13]. Specifically, we utilize synthetic images learned from CIFAR-10 with IPC50 using ConvNet as the reference model and subsequently train the aforementioned networks on the refined dataset to assess their performance on downstream tasks. The results, as depicted in Table 2, indicate that ATOM demonstrates superior generalization across a spectrum of architectures. Notably, it achieves a significant performance boost of over 4% compared to the prior state-of-the-art on ResNet-18 [23]. This implies that the channel-wise attention mechanism effectively identifies features not only relevant to ConvNet but also to a wider range of deep neural networks, thereby enhancing the refined dataset with this discerned information.

| Method | ConvNet | AlexNet | VGG-11 | ResNet-18 | ViT | Avg. |
|---|---|---|---|---|---|---|
| DC [76] | $53.9_{\pm 0.5}$ | $28.8_{\pm 0.7}$ | $38.8_{\pm 1.1}$ | $20.9_{\pm 1.0}$ | $30.1_{\pm 0.5}$ | $34.5_{\pm 0.8}$ |
| CAFE [61] | $62.3_{\pm 0.4}$ | $43.2_{\pm 0.4}$ | $48.8_{\pm 0.5}$ | $43.3_{\pm 0.7}$ | $22.7_{\pm 0.7}$ | $44.1_{\pm 0.5}$ |
| DSA [74] | $60.6_{\pm 0.5}$ | $53.7_{\pm 0.6}$ | $51.4_{\pm 1.0}$ | $47.8_{\pm 0.9}$ | $43.3_{\pm 0.4}$ | $51.4_{\pm 0.7}$ |
| DM [75] | $63.0_{\pm 0.4}$ | $60.1_{\pm 0.5}$ | $57.4_{\pm 0.8}$ | $52.9_{\pm 0.4}$ | $45.2_{\pm 0.4}$ | $55.7_{\pm 0.5}$ |
| KIP [44] | $56.9_{\pm 0.4}$ | $53.2_{\pm 1.6}$ | $53.2_{\pm 0.5}$ | $47.6_{\pm 0.8}$ | $18.3_{\pm 0.6}$ | $45.8_{\pm 0.8}$ |
| MTT [7] | $66.2_{\pm 0.6}$ | $43.9_{\pm 0.9}$ | $48.7_{\pm 1.3}$ | $60.0_{\pm 0.7}$ | $47.7_{\pm 0.6}$ | $53.3_{\pm 0.8}$ |
| DAM [51] | $67.0_{\pm 0.4}$ | $63.9_{\pm 0.9}$ | $64.8_{\pm 0.5}$ | $60.2_{\pm 0.7}$ | $48.2_{\pm 0.8}$ | $60.8_{\pm 0.7}$ |
| ATOM (Ours) | $\mathbf{68.8}_{\pm 0.4}$ | $\mathbf{64.1}_{\pm 0.7}$ | $\mathbf{66.4}_{\pm 0.6}$ | $\mathbf{64.5}_{\pm 0.6}$ | $\mathbf{49.5}_{\pm 0.7}$ | $\mathbf{62.7}_{\pm 0.6}$ |
Table 2: Cross-architecture testing performance (%) on CIFAR-10 with 50 images per class. The ConvNet architecture is employed for distillation. Bold entries are the best results.

| Method | Run Time (Sec.) IPC1 | Run Time (Sec.) IPC10 | Run Time (Sec.) IPC50 | GPU memory (MB) IPC1 | GPU memory (MB) IPC10 | GPU memory (MB) IPC50 |
|---|---|---|---|---|---|---|
| DC [76] | $0.16_{\pm 0.01}$ | $3.31_{\pm 0.02}$ | $15.74_{\pm 0.10}$ | 3515 | 3621 | 4527 |
| DSA [74] | $0.22_{\pm 0.02}$ | $4.47_{\pm 0.12}$ | $20.13_{\pm 0.58}$ | 3513 | 3639 | 4539 |
| DM [75] | $0.08_{\pm 0.02}$ | $0.08_{\pm 0.02}$ | $0.08_{\pm 0.02}$ | 3323 | 3455 | 3605 |
| MTT [7] | $0.36_{\pm 0.23}$ | $0.40_{\pm 0.20}$ | OOM | 2711 | 8049 | OOM |
| DAM [51] | $0.09_{\pm 0.01}$ | $0.08_{\pm 0.01}$ | $0.16_{\pm 0.04}$ | 3452 | 3561 | 3724 |
| $\textbf{ATOM}^{\dagger}$ (Ours) | $0.08_{\pm 0.02}$ | $0.08_{\pm 0.02}$ | $0.13_{\pm 0.03}$ | 3152 | 3263 | 4151 |
| ATOM (Ours) | $0.10_{\pm 0.02}$ | $0.10_{\pm 0.01}$ | $0.17_{\pm 0.02}$ | 3601 | 4314 | 5134 |
Table 3: Comparisons of training time and GPU memory usage for prior dataset distillation methods. Run time is averaged per step over 100 iterations, while GPU memory usage is reported as peak memory during the same 100 iterations of training on an A100 GPU for CIFAR-10. Methods that surpass the GPU memory threshold and fail to run are denoted as OOM (out-of-memory). $\textbf{ATOM}^{\dagger}$ represents our method with only channel attention, hence offering a better tradeoff in computational complexity.

Distillation Cost Analysis. In this section, we examine the training costs required for the distillation process. Although the main goal of dataset distillation is to reduce training costs across different applications such as neural architecture search and continual learning, the distillation technique itself must be efficient, enabling smooth operation on consumer-grade hardware. Approaches such as DC, DSA, and MTT introduce additional computational overhead due to bi-level optimization and training an expert model. In contrast, our method, akin to DM and DAM, capitalizes on randomly initialized networks, obviating the need for training and thereby reducing the computational cost per step involved in the matching stage. As illustrated in Table 3, utilizing solely the channel-based $\texttt{ATOM}^{\dagger}$ decreases the computational burden of matching compared to the default ATOM configuration. This efficiency is crucial, as channel-wise attention offers a more effective distillation process while maintaining superior performance (refer to Figure 4).
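The measurement protocol behind Table 3 (mean run time per step and peak memory over the same 100 iterations) can be sketched as follows. This is a minimal CPU-only illustration using the Python standard library: `tracemalloc` stands in for GPU peak-memory tracking, and `dummy_step` is a hypothetical placeholder for one matching step. Neither is taken from the ATOM code base, and because tracing adds overhead, time and memory would normally be measured in separate passes.

```python
import time
import tracemalloc


def profile_steps(step_fn, n_iters=100):
    """Return (mean seconds per step, peak traced memory in bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed / n_iters, peak


def dummy_step():
    # Hypothetical stand-in for one distillation (matching) step.
    feats = [float(i) ** 2 for i in range(1_000)]
    return sum(feats)


mean_time, peak_bytes = profile_steps(dummy_step, n_iters=100)
```

On a GPU, the same loop structure would apply, with the framework's own peak-memory counter replacing `tracemalloc`.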

Convergence Speed Analysis. In Figure 3, we plot the evolution of downstream test accuracy for the synthetic images on CIFAR10 with IPC50. Compared with the previous methods DM [75] and DataDAM [51], the ATOM framework converges faster and reaches a significantly higher steady-state accuracy. This convergence analysis supports the practicality of our method and the consistency with which we outperform previous baselines.

Figure 3: Test accuracy evolution of synthetic image learning on CIFAR10 with IPC50 for ATOM (ours), DM [75] and DataDAM [51].

4.3 Ablation Studies and Analysis

Figure 4: Sample learned synthetic images for CIFAR-10/100 ($32\times32$ resolution) with IPC10 and TinyImageNet ($64\times64$ resolution) with IPC1.

Evaluation of loss components in ATOM. In Table 4, we evaluate the effect of different attention-matching mechanisms with respect to pure feature matching in intermediate layers and distribution matching in the final layer ($\mathcal{L}_{\text{MMD}}$). The results clearly demonstrate that attention matching improves the performance of the distillation process. In particular, the attention-matching process improves on feature matching by $8.0\%$. Further, channel attention appears to capture the majority of relevant information from the intermediate features, as evidenced by an improvement of over $1.5\%$ relative to spatial attention matching. Ultimately, this provides an incentive to favor channel attention in the distillation process.

$\mathcal{L}_{\text{MMD}}$ Feature Map Spatial Atn. Channel Atn. Performance (%)
- - - $63.0_{\pm 0.4}$
- - $60.8_{\pm 0.6}$
- - $67.0_{\pm 0.7}$
- - $68.6_{\pm 0.3}$
- $\mathbf{68.8}_{\pm 0.5}$
Table 4: Evaluation of loss components and attention components in ATOM using CIFAR-10 with IPC50.
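To make the role of the $\mathcal{L}_{\text{MMD}}$ term concrete, the sketch below assumes the mean-embedding form used in distribution matching [75]: the squared distance between the average final-layer features of a real batch and a synthetic batch. The function names, the toy features, and the way the term is combined with an attention-matching loss through the task balance $\lambda$ are illustrative assumptions, not the released implementation.

```python
def mean_embedding(features):
    """Average a batch of feature vectors (list of equal-length lists)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def mmd_loss(real_feats, syn_feats):
    """Squared L2 distance between mean real and mean synthetic embeddings."""
    mu_real = mean_embedding(real_feats)
    mu_syn = mean_embedding(syn_feats)
    return sum((r - s) ** 2 for r, s in zip(mu_real, mu_syn))


def total_loss(attention_loss, real_feats, syn_feats, lam=0.01):
    """Assumed combination: attention-matching term plus lam-weighted MMD term."""
    return attention_loss + lam * mmd_loss(real_feats, syn_feats)


real = [[1.0, 2.0], [3.0, 4.0]]  # toy final-layer features, real batch
syn = [[2.0, 2.0], [2.0, 2.0]]   # toy final-layer features, synthetic batch
loss = total_loss(attention_loss=0.5, real_feats=real, syn_feats=syn)
```

Here the mean embeddings are (2, 3) and (2, 2), so the MMD term equals 1 and contributes only 0.01 to the total under the low-resolution setting of $\lambda$.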

Evaluating attention balance in ATOM. In this section, we evaluate the balance between spatial and channel-wise attention through the power value $p$. Referencing Equation 1 and Equation 2, modulating the values of $p_s$ and $p_c$ ultimately affects the balance of spatial and channel-wise attention in $\mathcal{L}_{\text{ATOM}}$. In Table 5, we examine the impact of different exponentiation powers $p$ in the attention-matching mechanisms. Specifically, we conduct a grid search to investigate how varying the exponent of spatial ($p_s$) and channel ($p_c$) attention influences subsequent performance. Our findings reveal that optimal performance (nearly $1\%$ improvement over our default) occurs when the exponent for channel attention significantly exceeds that of spatial attention. This suggests that assigning a higher exponent places greater emphasis on channel-attention matching over spatial-wise matching. This aligns with our observations from the loss component ablation, where channel-wise matching was found to encapsulate the majority of information within the feature map. Consequently, we deduce that prioritizing channel-wise matching will enhance downstream performance.

Channel Attention $p_c$ | Spatial Attention $p_s$
 | 1 | 2 | 4 | 8
1 | 57.4% | 57.5% | 57.0% | 56.2%
2 | 58.2% | 57.5% | 57.2% | 56.3%
4 | 58.4% | 58.5% | 57.9% | 57.6%
8 | 58.8% | 58.7% | 58.2% | 57.8%
Table 5: Evaluation of power values in the spatial and channel attention computations for $\mathcal{L}_{\text{ATOM}}$ using CIFAR-10 with IPC10.
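The effect of the power values can be illustrated with a small sketch. It assumes that both attention maps aggregate absolute feature values raised to a power, in the spirit of attention transfer [72]: spatial attention sums over channels at each location, and channel attention sums over locations within each channel. The function names and the toy feature map are hypothetical, and the exact normalization in Equation 1 and Equation 2 may differ.

```python
def spatial_attention(fmap, p):
    """Collapse channels: one value per spatial location, sum over c of |f|^p."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [sum(abs(fmap[c][h][w]) ** p for c in range(channels)) for w in range(width)]
        for h in range(height)
    ]


def channel_attention(fmap, p):
    """Collapse space: one value per channel, sum over (h, w) of |f|^p."""
    return [sum(abs(v) ** p for row in ch for v in row) for ch in fmap]


def l2_normalize(values):
    """Scale a vector to unit Euclidean norm."""
    norm = sum(v * v for v in values) ** 0.5
    return [v / norm for v in values]


# Toy feature map with C=2 channels on a 2x2 spatial grid (indexed [c][h][w]).
fmap = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[3.0, 0.0], [1.0, 1.0]],
]

chan_p1 = l2_normalize(channel_attention(fmap, p=1))
chan_p4 = l2_normalize(channel_attention(fmap, p=4))
```

With $p=1$ the two channels receive raw weights 4 and 5, whereas with $p=4$ they receive 18 and 83, so after normalization a larger power concentrates the attention on the channel with the strongest activation.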

Visualization of Synthetic Images. We include samples of our distilled images in Figure 4. The images appear to be interleaved with artifacts that assimilate the background and object information into a mixed collage-like appearance. The synthetic images effectively capture the correlation between background and object elements, suggesting their potential for generalizability across various architectures, as empirically verified in Table 2. Additional visualizations are available in the supplementary material.

4.4 Applications

Neural Architecture Search. In Table 6, we leverage our distilled synthetic datasets as proxy sets to accelerate Neural Architecture Search. In line with previous state-of-the-art methods [51, 76, 74], we define an architectural search space comprising 720 ConvNets on the CIFAR-10 dataset. We start from a foundational ConvNet and construct a uniform grid, varying in depth $D \in$ {1, 2, 3, 4}, width $W \in$ {32, 64, 128, 256}, activation function $A \in$ {Sigmoid, ReLU, LeakyReLU}, normalization technique $N \in$ {None, BatchNorm, LayerNorm, InstanceNorm, GroupNorm}, and pooling operation $P \in$ {None, MaxPooling, AvgPooling}. Additionally, we benchmark our approach against several baselines, including Random, DSA [74], DM [75], CAFE [61], DAM [51], and Early-Stopping. Our method achieves the highest Spearman’s correlation (0.75) while remaining competitive in performance, thereby reinforcing the robustness of ATOM and its potential in neural architecture search.

 | Random | DSA | DM | CAFE | DAM | ATOM | Early-stopping | Full Dataset
Performance (%) | 88.9 | 87.2 | 87.2 | 83.6 | 89.0 | 88.9 | 88.9 | 89.2
Correlation | 0.70 | 0.66 | 0.71 | 0.59 | 0.72 | 0.75 | 0.69 | 1.00
Time cost (min) | 206.4 | 206.4 | 206.6 | 206.4 | 206.4 | 206.4 | 206.2 | 5168.9
Storage (imgs) | 500 | 500 | 500 | 500 | 500 | 500 | $5\times 10^{4}$ | $5\times 10^{4}$
Table 6: Neural architecture search on CIFAR-10 with IPC50.
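The size of the search space follows directly from the grid described above: 4 depths, 4 widths, 3 activations, 5 normalization options, and 3 pooling options. A minimal sketch of the enumeration is given below; the dictionary keys and variable names are illustrative choices rather than identifiers from the released code.

```python
from itertools import product

DEPTHS = (1, 2, 3, 4)
WIDTHS = (32, 64, 128, 256)
ACTIVATIONS = ("Sigmoid", "ReLU", "LeakyReLU")
NORMS = ("None", "BatchNorm", "LayerNorm", "InstanceNorm", "GroupNorm")
POOLINGS = ("None", "MaxPooling", "AvgPooling")

# One configuration per point of the uniform grid.
search_space = [
    {"depth": d, "width": w, "activation": a, "norm": n, "pooling": p}
    for d, w, a, n, p in product(DEPTHS, WIDTHS, ACTIVATIONS, NORMS, POOLINGS)
]

num_candidates = len(search_space)  # 4 * 4 * 3 * 5 * 3
```

Each entry would then be instantiated as a ConvNet variant and trained on the proxy set to obtain its validation score.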

5 Limitations

Many studies in dataset distillation encounter a constraint known as re-distillation costs [62, 24, 25]. This limitation becomes apparent when adjusting the number of images per class (IPC) or the distillation ratios. Like most other distillation methods, our approach must re-distill the dataset for each updated configuration, which limits flexibility with respect to configuration changes and storage allocation. Additionally, we observed in Table 2 that dataset distillation methods often struggle to generalize to transformer architectures. Although ATOM outperforms other methods, there is still a noticeable performance drop compared to convolutional neural networks. This suggests that the effectiveness of transformers for downstream training might be constrained by the distilled data.

6 Conclusion

In this work, we introduced an Attention Mixer (ATOM) for efficient dataset distillation. Previous approaches have struggled with marginal performance gains, obfuscating channel-wise information, and high computational overheads. ATOM addresses these issues by effectively combining information from different attention mechanisms, facilitating a more informative distillation process with untrained neural networks. Our approach utilizes a broader receptive field to capture spatial information while preserving distinct content information at the channel level, thus better aligning synthetic and real datasets. By capturing information across intermediate layers, ATOM facilitates multi-scale distillation. We demonstrated the superior performance of ATOM on standard distillation benchmarks and its favorable performance across multiple architectures. We conducted several ablation studies to justify the design choices behind ATOM. Furthermore, we applied our distilled data to Neural Architecture Search, showing a superior correlation with the real large-scale dataset. In the future, we aim to extend attention mixing to various downstream tasks, including image segmentation and localization. We also hope to address limitations of ATOM, such as re-distillation costs and cross-architecture generalization to transformers.

References

  • Agarwal et al. [2020] Sharat Agarwal, Himanshu Arora, Saket Anand, and Chetan Arora. Contextual diversity for active learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16, pages 137–153. Springer, 2020.
  • Amer et al. [2021] Hossam Amer, Ahmed H Salamah, Ahmad Sajedi, and En-hui Yang. High performance convolution using sparsity and patterns for inference in deep convolutional neural networks. arXiv preprint arXiv:2104.08314, 2021.
  • Bahdanau et al. [2015] Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, 2015.
  • Belouadah and Popescu [2020] Eden Belouadah and Adrian Popescu. Scail: Classifier weights scaling for class incremental learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1266–1275, 2020.
  • Bohdal et al. [2020] Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales. Flexible dataset distillation: Learn labels instead of images. arXiv preprint arXiv:2006.08572, 2020.
  • Castro et al. [2018] Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European conference on computer vision (ECCV), pages 233–248, 2018.
  • Cazenavette et al. [2022] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4750–4759, 2022.
  • Cazenavette et al. [2023] George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Generalizing dataset distillation via deep generative prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3739–3748, 2023.
  • Chen et al. [2024] Xuxi Chen, Yu Yang, Zhangyang Wang, and Baharan Mirzasoleiman. Data distillation can be like vodka: Distilling more times for better quality. In The Twelfth International Conference on Learning Representations, 2024.
  • Chen et al. [2010] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 109–116, 2010.
  • Cui et al. [2023] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. Scaling up dataset distillation to imagenet-1k with constant memory. In International Conference on Machine Learning, pages 6565–6590. PMLR, 2023.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Du et al. [2023] Jiawei Du, Yidi Jiang, Vincent YF Tan, Joey Tianyi Zhou, and Haizhou Li. Minimizing the accumulated trajectory error to improve dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3749–3758, 2023.
  • Du et al. [2024] Jiawei Du, Qin Shi, and Joey Tianyi Zhou. Sequential subset matching for dataset distillation. Advances in Neural Information Processing Systems, 36, 2024.
  • Ducoffe and Precioso [2018] Melanie Ducoffe and Frederic Precioso. Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841, 2018.
  • Ghorbani et al. [2022] Amirata Ghorbani, James Zou, and Andre Esteva. Data shapley valuation for efficient batch active learning. In 2022 56th Asilomar Conference on Signals, Systems, and Computers, pages 1456–1462. IEEE, 2022.
  • Gidaris and Komodakis [2018] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4367–4375, 2018.
  • Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Gu et al. [2024] Jianyang Gu, Kai Wang, Wei Jiang, and Yang You. Summarizing stream data for memory-restricted online continual learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.
  • Guo et al. [2024] Ziyao Guo, Kai Wang, George Cazenavette, HUI LI, Kaipeng Zhang, and Yang You. Towards lossless dataset distillation via difficulty-aligned trajectory matching. In The Twelfth International Conference on Learning Representations, 2024.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2024a] Yang He, Lingao Xiao, and Joey Tianyi Zhou. You only condense once: Two rules for pruning condensed datasets. Advances in Neural Information Processing Systems, 36, 2024a.
  • He et al. [2024b] Yang He, Lingao Xiao, Joey Tianyi Zhou, and Ivor Tsang. Multisize dataset condensation. In The Twelfth International Conference on Learning Representations, 2024b.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
  • Jia et al. [2023] Yuqi Jia, Saeed Vahidian, Jingwei Sun, Jianyi Zhang, Vyacheslav Kungurtsev, Neil Zhenqiang Gong, and Yiran Chen. Unlocking the potential of federated learning: The symphony of dataset distillation via deep generative latents. arXiv preprint arXiv:2312.01537, 2023.
  • Khaki and Luo [2023] Samir Khaki and Weihan Luo. Cfdp: Common frequency domain pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4714–4723, 2023.
  • Khaki and Plataniotis [2024] Samir Khaki and Konstantinos N Plataniotis. The need for speed: Pruning transformers with one recipe. arXiv preprint arXiv:2403.17921, 2024.
  • Killamsetty et al. [2021a] Krishnateja Killamsetty, Sivasubramanian Durga, Ganesh Ramakrishnan, Abir De, and Rishabh Iyer. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pages 5464–5474. PMLR, 2021a.
  • Killamsetty et al. [2021b] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8110–8118, 2021b.
  • Killamsetty et al. [2021c] Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer. Retrieve: Coreset selection for efficient and robust semi-supervised learning. Advances in Neural Information Processing Systems, 34:14488–14501, 2021c.
  • Kim et al. [2022] Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, and Hyun Oh Song. Dataset condensation via efficient synthetic-data parameterization. In International Conference on Machine Learning, pages 11102–11118. PMLR, 2022.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • Le and Yang [2015] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
  • Lee et al. [2022] Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. In International Conference on Machine Learning, pages 12352–12364. PMLR, 2022.
  • Liu et al. [2023a] Ping Liu, Xin Yu, and Joey Tianyi Zhou. Meta knowledge condensation for federated learning. In The Eleventh International Conference on Learning Representations, 2023a.
  • Liu et al. [2023b] Songhua Liu, Jingwen Ye, Runpeng Yu, and Xinchao Wang. Slimmable dataset condensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3759–3768, 2023b.
  • Loo et al. [2023] Noel Loo, Ramin Hasani, Mathias Lechner, and Daniela Rus. Dataset distillation with convexified implicit gradients. In International Conference on Machine Learning, pages 22649–22674. PMLR, 2023.
  • Margatina et al. [2021] Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650–663, 2021.
  • Mirzasoleiman et al. [2020] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pages 6950–6960. PMLR, 2020.
  • Nguyen et al. [2021a] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. In International Conference on Learning Representations, 2021a.
  • Nguyen et al. [2021b] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel-ridge regression. In International Conference on Learning Representations, 2021b.
  • Paul et al. [2021] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607, 2021.
  • Rebuffi et al. [2017] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • Saito et al. [2018] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3723–3732, 2018.
  • Sajedi and Plataniotis [2021] Ahmad Sajedi and Konstantinos N Plataniotis. On the efficiency of subclass knowledge distillation in classification tasks. arXiv preprint arXiv:2109.05587, 2021.
  • Sajedi et al. [2022] Ahmad Sajedi, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Subclass knowledge distillation with known subclass labels. In 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pages 1–5. IEEE, 2022.
  • Sajedi et al. [2023a] Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z Liu, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Datadam: Efficient dataset distillation with attention matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17097–17107, 2023a.
  • Sajedi et al. [2023b] Ahmad Sajedi, Samir Khaki, Konstantinos N. Plataniotis, and Mahdi S. Hosseini. End-to-end supervised multilabel contrastive learning, 2023b.
  • Sajedi et al. [2023c] Ahmad Sajedi, Yuri A Lawryshyn, and Konstantinos N Plataniotis. A new probabilistic distance metric with application in gaussian mixture reduction. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023c.
  • Sajedi et al. [2024] Ahmad Sajedi, Samir Khaki, Yuri A Lawryshyn, and Konstantinos N Plataniotis. Probmcl: Simple probabilistic contrastive learning for multi-label visual classification. arXiv preprint arXiv:2401.01448, 2024.
  • Sener and Savarese [2018] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sinha et al. [2020] Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, and Augustus Odena. Small-gan: Speeding up gan training using core-sets. In International Conference on Machine Learning, pages 9005–9015. PMLR, 2020.
  • Such et al. [2020] Felipe Petroski Such, Aditya Rawal, Joel Lehman, Kenneth Stanley, and Jeffrey Clune. Generative teaching networks: Accelerating neural architecture search by learning to generate synthetic training data. In International Conference on Machine Learning, pages 9206–9216. PMLR, 2020.
  • Toneva et al. [2018] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2018.
  • Toneva et al. [2019] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019.
  • Wang et al. [2022] Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12196–12205, 2022.
  • Wang et al. [2023] Kai Wang, Jianyang Gu, Daquan Zhou, Zheng Zhu, Wei Jiang, and Yang You. Dim: Distilling dataset into generative model. arXiv preprint arXiv:2303.04707, 2023.
  • Wang et al. [2018a] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018a.
  • Wang et al. [2018b] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018b.
  • Wang et al. [2021] Yi Ru Wang, Samir Khaki, Weihang Zheng, Mahdi S. Hosseini, and Konstantinos N. Plataniotis. Conetv2: Efficient auto-channel size optimization for cnns. In 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 998–1003, 2021.
  • Welling [2009] Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128, 2009.
  • Woo et al. [2018] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • Wu et al. [2016] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4820–4828, 2016.
  • Xiong et al. [2023] Yuanhao Xiong, Ruochen Wang, Minhao Cheng, Felix Yu, and Cho-Jui Hsieh. Feddm: Iterative distribution matching for communication-efficient federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16323–16332, 2023.
  • Yang et al. [2024] Enneng Yang, Li Shen, Zhenyi Wang, Tongliang Liu, and Guibing Guo. An efficient dataset condensation plugin and its application to continual learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Yu et al. [2017] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7370–7379, 2017.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.
  • Zhang et al. [2024] Hansong Zhang, Shikun Li, Pengju Wang, Dan Zeng, and Shiming Ge. M3D: Dataset condensation by minimizing maximum mean discrepancy. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.
  • Zhao and Bilen [2021] Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pages 12674–12685. PMLR, 2021.
  • Zhao and Bilen [2023] Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6514–6523, 2023.
  • Zhao et al. [2021] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. In International Conference on Learning Representations, 2021.
  • Zhao et al. [2023] Ganlong Zhao, Guanbin Li, Yipeng Qin, and Yizhou Yu. Improved distribution matching for dataset condensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7856–7865, 2023.
  • Zhou et al. [2023] Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You, and Jiashi Feng. Dataset quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17205–17216, 2023.
ATOM: Attention Mixer for Efficient Dataset Distillation

Supplementary Material

7 Implementation Details

7.1 Datasets

We conducted experiments on three main datasets: CIFAR10/100 [35] and TinyImageNet [37]. These datasets are considered single-label multi-class; hence, each image has exactly one class label. CIFAR10 and CIFAR100 are conventional computer vision benchmarking datasets comprising $32\times32$ colored natural images. They consist of 10 coarse-grained labels (CIFAR10) and 100 fine-grained labels (CIFAR100), each with 50,000 training samples and 10,000 test samples. The CIFAR10 classes include “Airplane”, “Car”, “Bird”, “Cat”, “Deer”, “Dog”, “Frog”, “Horse”, “Ship”, and “Truck”. The TinyImageNet dataset, a subset of ImageNet-1K [12] with 200 classes, contains 100,000 high-resolution training images and 10,000 test images resized to $64\times64$. The experiments on these datasets make up the benchmarking for many previous dataset distillation works [76, 61, 51, 7, 8, 78].
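As a worked example of what the images-per-class (IPC) settings mean for these datasets, the sketch below computes the size of the synthetic set and its ratio to the training set from the class counts and training-set sizes stated above. The helper names are illustrative.

```python
DATASETS = {
    "CIFAR10": {"classes": 10, "train_images": 50_000},
    "CIFAR100": {"classes": 100, "train_images": 50_000},
    "TinyImageNet": {"classes": 200, "train_images": 100_000},
}


def synthetic_set_size(dataset, ipc):
    """Number of synthetic images: one group of `ipc` images per class."""
    return DATASETS[dataset]["classes"] * ipc


def distillation_ratio(dataset, ipc):
    """Fraction of the original training set kept after distillation."""
    return synthetic_set_size(dataset, ipc) / DATASETS[dataset]["train_images"]


cifar10_ipc50 = synthetic_set_size("CIFAR10", 50)
tiny_ipc10_ratio = distillation_ratio("TinyImageNet", 10)
```

For CIFAR10 at IPC 50 this gives 500 images, matching the storage reported for the distilled proxy sets in Table 6, which is 1% of the 50,000 training images.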

7.2 Dataset Pre-processing

We applied the standardized preprocessing techniques to all datasets, following the guidelines provided in DM [75] and DataDAM [51]. Following previous works, we apply the default Differentiable Siamese Augmentation (DSA) [74] scheme during distillation and evaluation. Specifically for the CIFAR10/100 datasets, we integrated Kornia zero-phase component analysis (ZCA) whitening, following the parameters outlined in [7, 51]. Similar to DataDAM [51], we opted against ZCA for TinyImagenet due to the computational bottlenecks associated with full-scale ZCA transformation on a larger dataset with double the resolution. Note that we visualized the distilled images by directly applying the inverse transformation based on the corresponding data pre-processing, without any additional modifications.

7.3 Hyperparameters

Our method conveniently introduces only one additional hyperparameter: the power term in channel attention, i.e., $p_c$. All the other hyperparameters used in our method are directly inherited from the published work, DataDAM [51]. Therefore, we include an updated hyperparameter table in Table 7 aggregating our power term with the remaining pre-set hyperparameters. In the main paper, we discussed the effect of power terms on both channel- and spatial-wise attention and ultimately found that higher channel attention paired with lower spatial attention works best. However, our default, as stated in the main draft, is $p_c = p_s = 4$. Regarding the distillation and train-val settings, we use the SGD optimizer with a learning rate of 1.0 for learning the synthetic images and a learning rate of 0.01 for training neural network models (for downstream evaluation). For CIFAR10/100 (low-resolution), we use a 3-layer ConvNet; meanwhile, for TinyImagenet (medium-resolution), we use a 4-layer ConvNet, following previous works in the field [75, 51, 7]. Our batch size for learning the synthetic images was set to 128 due to the computational overhead of a larger matching set.
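The learning-rate settings can be summarized in a short sketch. It assumes the values listed in Table 7: an image learning rate of 1.0 for IPC up to 50 and 10.0 above, and a StepLR-style schedule for the network with decay rate 0.5 and step size 15 applied to the base rate of 0.01. The function names are hypothetical, and the total number of evaluation epochs is not specified here.

```python
def image_lr(ipc):
    """Learning rate for the synthetic images, chosen by IPC (per Table 7)."""
    return 1.0 if ipc <= 50 else 10.0


def step_lr(base_lr, epoch, step_size=15, gamma=0.5):
    """Learning rate at `epoch` when it is multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Network learning rate at a few illustrative epochs.
network_lr_schedule = [step_lr(0.01, epoch) for epoch in (0, 14, 15, 30)]
```

Under these assumptions the network learning rate stays at 0.01 for the first 15 epochs, then halves to 0.005, and halves again to 0.0025 at epoch 30.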

7.4 Neural Architecture Search Details

Following previous works [51, 76, 74, 75], we define a search space consisting of 720 ConvNets on the CIFAR10 dataset. Models are evaluated on CIFAR10 using our IPC 50 distilled set as a proxy under the neural architecture search (NAS) framework. The architecture search space is constructed as a uniform grid that varies in depth $D \in$ {1, 2, 3, 4}, width $W \in$ {32, 64, 128, 256}, activation function $A \in$ {Sigmoid, ReLU, LeakyReLU}, normalization technique $N \in$ {None, BatchNorm, LayerNorm, InstanceNorm, GroupNorm}, and pooling operation $P \in$ {None, MaxPooling, AvgPooling} to create varying versions of the standard ConvNet. These candidate architectures are then evaluated based on their validation performance and ranked accordingly. In the main paper, Table 6 measures various costs and performance metrics associated with each distillation method. Overall, distillation reduces the computational cost of the search; however, ATOM achieves the highest correlation, which is by far the most “important” metric in this search, as it indicates that our proxy set best estimates the original dataset.
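The correlation reported in Table 6 is a Spearman rank correlation between the performance of candidate architectures trained on the proxy set and on the full dataset. A minimal standard-library sketch of that statistic is shown below; the five accuracy values are made-up placeholders for illustration, not results from the paper.

```python
def ranks(values):
    """Rank values from 1..n, assigning the average rank to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies of five candidate architectures.
proxy_acc = [0.41, 0.55, 0.48, 0.60, 0.52]  # trained on the distilled proxy set
full_acc = [0.70, 0.82, 0.80, 0.88, 0.78]   # trained on the full dataset
rho = spearman(proxy_acc, full_acc)
```

A value close to 1 means the proxy set orders the architectures almost exactly as the full dataset does, which is what makes it a reliable substitute during the search.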

8 Additional Visualizations

We include additional visualizations of our synthetic datasets in Figures 5, 6, and 7. The first two show CIFAR10/100 at IPC 50, while the third shows TinyImageNet at IPC 10. The images clearly exhibit learned artifacts from the distillation process that are, in turn, helpful during downstream classification tasks.

Figure 5: Distilled Image Visualization: CIFAR-10 dataset with IPC 50.
Figure 6: Distilled Image Visualization: CIFAR-100 dataset with IPC 50.
Figure 7: Distilled Image Visualization: TinyImageNet dataset with IPC 10.
| Category | Parameter Name | Description | Options/Range | Value |
|---|---|---|---|---|
| Optimization | Learning Rate $\eta_{\mathcal{S}}$ (images) | Step size towards global/local minima | $(0, 10.0]$ | IPC $\leq$ 50: 1.0; IPC $>$ 50: 10.0 |
| | Learning Rate $\eta_{\theta}$ (network) | Step size towards global/local minima | $(0, 1.0]$ | 0.01 |
| | Optimizer (images) | Updates synthetic set to approach global/local minima | SGD with Momentum | Momentum: 0.5; Weight Decay: 0.0 |
| | Optimizer (network) | Updates model to approach global/local minima | SGD with Momentum | Momentum: 0.9; Weight Decay: 5e-4 |
| | Scheduler (images) | - | - | - |
| | Scheduler (network) | Decays the learning rate over epochs | StepLR | Decay rate: 0.5; Step size: 15.0 |
| | Iteration Count | Number of iterations for learning synthetic data | $[1, \infty)$ | 8000 |
| Loss Function | Task Balance $\lambda$ | Regularization Multiplier | $[0, \infty)$ | Low Resolution: 0.01; High Resolution: 0.02 |
| | Spatial Power Value $p_s$ | Exponential power for amplification of spatial attention | $[1, \infty)$ | 4 |
| | Channel Power Value $p_c$ | Exponential power for amplification of channel attention | $[1, \infty)$ | 4 |
| | Loss Configuration | Type of error function used to measure distribution discrepancy | - | Mean Squared Error |
| | Normalization Type | Type of normalization used in the SAM module on attention maps | - | L2 |
| DSA Augmentations | Color | Randomly adjust (jitter) the color components of an image | brightness | 1.0 |
| | | | saturation | 2.0 |
| | | | contrast | 0.5 |
| | Crop | Crops an image with padding | ratio crop pad | 0.125 |
| | Cutout | Randomly covers input with a square | cutout ratio | 0.5 |
| | Flip | Flips an image with probability p | $(0, 1.0]$ | 0.5 |
| | Scale | Shifts pixels either column-wise or row-wise | scaling ratio | 1.2 |
| | Rotate | Rotates image by certain angle | $0^{\circ} - 360^{\circ}$ | $[-15^{\circ}, +15^{\circ}]$ |
| Encoder Parameters | Conv Layer Weights | The weights of convolutional layers | $\mathbb{R}$ bounded by kernel size | Uniform Distribution |
| | Activation Function | The non-linear function at the end of each layer | - | ReLU |
| | Normalization Layer | Type of normalization layer used after convolutional blocks | - | InstanceNorm |
Table 7: Hyperparameters Details – boilerplate obtained from DataDAM [51].
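The optimization entries of Table 7 can be collected into a small configuration object. This is a minimal sketch under stated assumptions rather than the released configuration: the class and function names (`AtomHyperparameters`, `image_learning_rate`, `task_balance`, `network_lr_at_epoch`) are hypothetical, and the scheduler is assumed to follow the usual StepLR convention of multiplying the learning rate by the decay rate once every `step_size` epochs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomHyperparameters:
    """Default values transcribed from Table 7 and the accompanying text."""
    network_lr: float = 0.01            # learning rate for downstream networks
    image_momentum: float = 0.5         # SGD momentum for synthetic images
    image_weight_decay: float = 0.0
    network_momentum: float = 0.9       # SGD momentum for network training
    network_weight_decay: float = 5e-4
    lr_decay_rate: float = 0.5          # StepLR decay rate (network only)
    lr_step_size: int = 15              # StepLR step size in epochs
    iterations: int = 8000              # iterations for learning synthetic data
    spatial_power: int = 4              # p_s
    channel_power: int = 4              # p_c
    image_batch_size: int = 128         # batch size for learning synthetic images


def image_learning_rate(ipc: int) -> float:
    """Synthetic-image learning rate: 1.0 for IPC <= 50, 10.0 otherwise."""
    return 1.0 if ipc <= 50 else 10.0


def task_balance(high_resolution: bool) -> float:
    """Task balance lambda: 0.01 for low resolution, 0.02 for high resolution."""
    return 0.02 if high_resolution else 0.01


def network_lr_at_epoch(epoch: int, hp: AtomHyperparameters = AtomHyperparameters()) -> float:
    """Network learning rate under the assumed StepLR convention."""
    return hp.network_lr * hp.lr_decay_rate ** (epoch // hp.lr_step_size)


hp = AtomHyperparameters()
for epoch in (0, 15, 30):
    print(epoch, network_lr_at_epoch(epoch, hp))
```

Keeping the IPC- and resolution-dependent values in functions rather than fields makes the two conditional rows of the table explicit instead of hiding them behind a single default.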