License: arXiv.org perpetual non-exclusive license
arXiv:2402.15430v2 [cs.CV] 11 Apr 2024

Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales

Shuren Qi, Yushu Zhang, Chao Wang, Zhihua Xia, Xiaochun Cao, and Jian Weng

S. Qi, Y. Zhang, and C. Wang are with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China (e-mail: shurenqi, yushu, [email protected]). Z. Xia and J. Weng are with the College of Cyber Security, Jinan University, Guangzhou, China (e-mail: [email protected], [email protected]). X. Cao is with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China (e-mail: [email protected]).

Abstract

Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. In this regard, a promising paradigm considers embedding task-required invariant structures, e.g., geometric invariance, in the fundamental image representation. However, such invariant representations typically exhibit limited discriminability, which restricts their application in larger-scale trustworthy vision tasks. For this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture yet in a fully interpretable manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this theoretical framework into a given task. With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on texture, digit, and parasite classification experiments. Furthermore, at the application level, our representations are explored in real-world forensics tasks on adversarial perturbations and Artificial Intelligence Generated Content (AIGC). Such applications reveal that the proposed strategy not only realizes the theoretically promised invariance, but also exhibits competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representation can be considered as an effective alternative to traditional CNN and invariants.

Index Terms:
Image representation, hierarchical invariance, robustness, discriminability, interpretability.

1 Introduction

Trustworthiness is a rising topic in modern Artificial Intelligence (AI) communities [1]. Over the past decade, deep learning techniques, especially Convolutional Neural Networks (CNN), have led to breakthrough results in numerous AI tasks, e.g., processing human perceptual information [2], playing board games [3], and solving hard science problems [4]. More recently, their applications are expanding into trust-related scenarios, e.g., biometrics [5], medical diagnostics [6], self-driving cars [7], and misinformation detection [8]. In such scenarios, the robustness and interpretability of AI systems are crucial [9]: 1) robustness means that the performance of the system is stable under intra-class variations of the input; 2) interpretability means that the behavior of the system can be understood or predicted by humans.

Integrating invariant structures into image representations is a principled design towards robust and interpretable vision systems [10]. Specifically, representations play a fundamental role in vision systems, where the system is generally built on meaningful representations of digital images (rather than the raw data) [11]. Note that the proper design/learning of such representations in fact relies on priors w.r.t. the task of interest [12]. Here, the concept of symmetry from the Erlangen Program [13, 14] may be the most fruitful prior – informally, a symmetry of a system is a transformation that leaves a certain property of the system invariant. Symmetry priors are ubiquitous in vision tasks, e.g., translation is a symmetry of the object classification system, where the object category is invariant under translation [15].

Next, we begin by providing some historical perspectives on the invariance in the development of image representations. The quest for invariance is as old as the field of computer vision itself, spanning both hand-crafted and learning approaches [16]:

  • In the hand-crafted approach, symmetry priors (e.g., invariance and equivariance) w.r.t. geometric transformations (e.g., translation, rotation, and scaling) have been recognized as main ideas in designing representations. Such ideas cover almost all classical and state-of-the-art methods, from global features (e.g., moment invariants [17]), to local sparse features (e.g., SIFT [18]), and to local dense features (e.g., DAISY [19]). However, these hand-crafted representations are all fixed in design, relying on (under)-complete dictionaries, and therefore fail to provide sufficient discriminability at larger scales, e.g., ImageNet classification task [20].

  • In the learning approach, the CNN achieves over-complete representations of strong discriminative power for larger-scale vision tasks, through a cascade of learnable nonlinear transformations. As a textbook view of deep learning, representations should be learned, not designed [2]. Therefore, classical CNN representations are equipped with very few symmetry priors, typically just translation equivariance [15], and even this property has recently been shown to no longer hold in the deeper layers of CNNs with downsampling structures [21]. In general, these learning representations lack robustness and interpretability guarantees, e.g., the presence and understanding of adversarial perturbations [22], and therefore cannot be well extended to trustworthy tasks [23].

Historically, to a certain extent, efforts at invariance and discriminability have developed independently in hand-crafted and learning approaches. The compatibility between invariance and discriminability has emerged as a tricky problem when moving towards trustworthy AI.

TABLE I: Conceptual Comparison with Related Research Approaches.
Approach | Traditional invariance | Traditional CNN | Scattering networks | Equivariant networks | Hierarchical invariance
Discriminative
Robust
Interpretable
Efficient

1.1 State of the Art and Motivation

In trust-related scenarios at larger scales, recent advances are seeking more advanced invariant designs of image representations, fulfilling discriminability, robustness, and interpretability simultaneously.

  • On the discriminability of hand-crafted representations, researchers introduce successful experiences behind learning representations, especially cascading and over-complete designs. The most representative work is Invariant Scattering Convolution Networks [24], where the classical wavelet transform is expanded into an over-complete representation with deep cascading. Unlike a typical CNN: 1) regarding the architecture, convolutional layers are defined by fixed wavelet filters, with modulus-based nonlinearity, but without subsequent pooling; 2) regarding the representation properties, the architecture yields translation equivariance and certain robustness w.r.t. non-linear deformations. Subsequent theoretical works further explored various geometric invariants [25], more general mathematical formulations [26], and the potential for improving the efficiency, interpretability, and robustness of state-of-the-art CNN techniques [27]. Regarding applications, they provided competitive results in a variety of tasks on audio [28], image [24, 25, 27], and graph [29] data, some of which are even interdisciplinary [30, 31]. With similar design goals and paths, we consider Scattering Networks as a main competitor for our work.

  • On the robustness and interpretability of learning representations, researchers introduce successful experiences behind hand-crafted representations, especially invariance and equivariance designs. The most representative work is Group Equivariant Convolutional Networks [32], where the classical convolution is generalized to a new definition on the symmetry group. Unlike a typical CNN: 1) regarding the architecture, convolutional layers are learned but with a new structure inspired by symmetry priors, e.g., re-parameterizing the filter to control symmetry; 2) regarding the representation properties, the architecture commonly provides joint equivariance for translation and rotation. Subsequent theoretical works further explored the equivariance for rotation [33, 34, 35], flipping [33], scaling [36, 37], and their combination [38] from various mathematical theories, including steerable filters [33], harmonic analysis [34], scale space [37], Lie groups [39], and B-spline interpolation [40]. Regarding applications, they played a key role in low-level vision tasks [41], especially scientific discoveries with symmetry priors [42, 43]. With similar design goals but on the learning path, we consider Equivariant Networks as a secondary competitor for our work.

Motivation. Despite starting from different theories, the above state-of-the-art methods exhibit common problems in the implementation efficiency and representation capability. An over-simplified interpretation on the technical level of such methods is as follows [38]: The input feature map is convolved with symmetry versions of the same filter to obtain multi-channel features, where the distortion on the input (e.g., rotation) corresponds to the cyclic shift between channels, and hence the invariance is achieved by pooling across channels. 1) Regarding the implementation efficiency, this parallel framework leads to an exponential expansion on the computational size (w.r.t. the sampling rate on symmetry), especially for learning representations where the introduced new learnable parameters make the training more challenging. 2) Regarding the representation capability, the discrete sampling on the symmetry of filters raises a tricky trade-off between invariance and discriminability – higher sampling rate implies better invariance, but the resulting computational cost restricts the overall size of representation networks (going deeper or wider), and hence the improvement of discriminability.
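The over-simplified interpretation above can be made concrete with a minimal sketch, restricted here to the four 90-degree rotations so that everything is exact on the pixel grid. The helper names `correlate_valid` and `c4_channels` are hypothetical, and the random filter stands in for a learned or designed one; this is an illustration of the parallel framework, not the implementation used in [32] or [38].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation (hypothetical helper)."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def c4_channels(image, kernel):
    """Responses to the four 90-degree rotated copies of one filter."""
    return np.stack(
        [correlate_valid(image, np.rot90(kernel, c)) for c in range(4)]
    )


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))  # stand-in for a real filter

original = c4_channels(image, kernel)
rotated = c4_channels(np.rot90(image), kernel)

# Rotating the input rotates every feature map and cyclically shifts
# the channel axis by one position.
expected = np.stack([np.rot90(original[(c - 1) % 4]) for c in range(4)])
shift_holds = np.allclose(rotated, expected)

# Pooling across channels, then over space, removes the dependence
# on the input rotation.
pooled_original = original.max(axis=0).max()
pooled_rotated = rotated.max(axis=0).max()
```

Only four orientations are sampled here; a finer sampling of the rotation group would multiply the number of channels accordingly, which is the cost referred to in point 1) above.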

1.2 Contributions

As a potential step towards solving the above open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives.

As summarized in Table I, our approach stands between the two extremes, i.e., traditional invariance [17] and CNN [21], w.r.t. discriminability, robustness, and interpretability. Compared with the recent scattering [24] and equivariant [32] networks, our approach is characterized by a more efficient design. Here, the equivariance is continuous/one-shot and holds across layers, eliminating the need for complex symmetry sampling, parallel frameworks, and cross-channel pooling. Therefore, it exhibits better efficiency and also allows more flexibility in enlarging the network size (going deeper or wider) to increase the representation capacity.

Our main contributions are as follows:

Theory. We propose a new framework for robust and interpretable image representation, named Hierarchical Invariant Representation (HIR), by extending the classical theory of moment invariants to cascade transformations. Starting from an ideal blueprint for hierarchical invariance, we formalize the over-complete moment invariants by an efficient hierarchical structure, with better trade-off between invariance and discriminability than traditional invariants and CNN. Note that the HIR exhibits continuous and one-shot equivariance w.r.t. translations, rotations, and flips at each intermediate layer, which is not available in current state-of-the-art algorithms. We also provide some fast and accurate numerical implementations of HIR, which are generic for arbitrary basis functions.

Practice. We explore the practical flexibility of this theoretical framework, covering a wide range of potential designs to better match a given vision task. The theory is specified into a class of networks, involving details about the topologies, layers, and parameters. In particular, we define a concept of frequency pooling to satisfy the common requirements from interpretability, invariance, and discriminability. For empowering the data adaptability of hand-crafted HIR, we also provide an architecture searching strategy based on the over-completeness.

Application. We validate the effectiveness of HIR in various simulation experiments and real-world applications. Pattern classification experiments are performed on typical sets of texture, digit, and parasite images, exhibiting state-of-the-art accuracy, invariance, and efficiency under diverse task scales and geometric variants. The direct applications to real-world forensics, i.e., detections of adversarial perturbations [44] and Artificial Intelligence Generated Content (AIGC) [45], also demonstrate the competitive discriminability even in the era of deep learning.

2 Foundations

As mentioned earlier, this work develops from the theory of moment invariants. Therefore, we begin with a brief review on the foundations of moment invariants, covering some concepts, notations, and definitions from our previous works.

2.1 Global and Local Representations

In general, classical moments and moment invariants are global representations of images, where the theory is built on the following definition [17]:

$$\left<f, V_{nm}\right> = \iint_{D} V_{nm}^{*}(x,y)\, f(x,y)\, dx\, dy, \tag{1}$$

where $f$ is the image function, $V_{nm}$ is the basis function with order parameter $(n,m) \in \mathbb{Z}^{2}$ on domain $D$, and $*$ denotes the complex conjugate. Note that the domains of $f$ and $V_{nm}$ in (1) have the same/similar location and scale, implying the global nature of the representation information.

With the sparse prior and geometric prior for natural images, two typical constraints, i.e., orthogonality and rotation invariance, are often imposed on the explicit definition of $V_{nm}$, leading to the following polar form:

$$\left<f, V_{nm}\right> = \iint_{D} R_{n}^{*}(r)\, A_{m}^{*}(\theta)\, f(r,\theta)\, r\, dr\, d\theta, \tag{2}$$

where $V_{nm}(\underbrace{r\cos\theta}_{x}, \underbrace{r\sin\theta}_{y}) \equiv V_{nm}(r,\theta)$ is separated as the product of the angular basis function $A_{m}(\theta) = \exp(\bm{j}m\theta)$ ($\bm{j} = \sqrt{-1}$) and the radial basis function $R_{n}$, subject to the weighted orthogonality condition $\int_{0}^{1} R_{n}(r)\, R_{n'}^{*}(r)\, r\, dr = \frac{1}{2\pi}\delta_{nn'}$. Note that the basis function $V_{nm} = R_{n}A_{m}$ in (2) is orthogonal on $D$, and the magnitude of $\left<f, V_{nm}\right>$ is invariant to the rotation on the image $f$ (see [17] for a survey).
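As a numerical illustration of (2) and of the weighted orthogonality condition, the following sketch assumes one particular radial basis, $R_{n}(r) = \exp(\bm{j}2\pi n r^{2})/\sqrt{\pi}$, which satisfies the condition above; the text does not fix a specific family, so this choice, the function names, and the simple pixel-sum quadrature are all assumptions made for illustration only.

```python
import numpy as np


def radial_basis(n, r):
    """Assumed radial basis R_n(r) = exp(2j*pi*n*r^2) / sqrt(pi)."""
    return np.exp(2j * np.pi * n * r**2) / np.sqrt(np.pi)


def radial_inner(n, n_prime, samples=20000):
    """Midpoint-rule estimate of int_0^1 R_n(r) R_{n'}^*(r) r dr."""
    r = (np.arange(samples) + 0.5) / samples
    integrand = radial_basis(n, r) * np.conj(radial_basis(n_prime, r)) * r
    return integrand.sum() / samples


def global_moment(image, n, m):
    """Discrete version of (2) on the unit disk inscribed in a square image."""
    size = image.shape[0]
    coords = (2 * np.arange(size) - size + 1) / size  # pixel centers in (-1, 1)
    x, y = np.meshgrid(coords, coords)  # x along columns, y along rows
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = r <= 1.0
    basis = radial_basis(n, r) * np.exp(1j * m * theta)
    pixel_area = (2.0 / size) ** 2  # stands in for dx dy
    return np.sum(np.conj(basis[inside]) * image[inside]) * pixel_area


same = radial_inner(2, 2)       # close to 1 / (2 * pi)
different = radial_inner(2, 3)  # close to 0

rng = np.random.default_rng(0)
moment = global_moment(rng.random((64, 64)), n=1, m=2)
```

The pixel sum is the crudest possible quadrature; it is used here only to keep the sketch short and is not a substitute for accurate numerical implementations.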

In our recent work, moments and moment invariants are extended to local representations of images, where the theory is built on the following definition [46]:

$$\left<f, V_{nm}^{uvw}\right> = \iint_{D} R_{n}^{*}(r')\, A_{m}^{*}(\theta')\, f(x,y)\, dx\, dy, \tag{3}$$

where the new basis function $V_{nm}^{uvw}$ introduces position parameter $(u,v)$ and scale parameter $w$. It can be interpreted as a translated and scaled version of the global $V_{nm}$ with the following coordinate relationship:

$$\left\{ \begin{array}{l} r' = \frac{1}{w}\sqrt{(x-u)^{2} + (y-v)^{2}} \\ \theta' = \arctan\left(\frac{y-v}{x-u}\right) \end{array} \right., \tag{4}$$

where the domain is a disk centered at $(u,v)$ with radius $w$: $D = \{(x,y): (x-u)^{2} + (y-v)^{2} \le w^{2}\}$. Note that (3) allows the domain of $V_{nm}^{uvw}$ to be built in different positions and scales w.r.t. the domain of $f$, implying the local nature of the representation information. Also, the classical definition (2) is in fact a special case of the new definition (3) with $(u,v) = (0,0)$ and $w = 1$ (see [46] for details).

2.2 Invariance, Equivariance, and Covariance

The terms of invariance, equivariance, and covariance appear in the fields of computer vision, graphics, geometry, and physics, with similar but slightly different definitions. In this paper, we use the following identities to generally denote such terms [47, 48]:

  • invariance – $\mathcal{R}(\mathcal{D}(f)) \equiv \mathcal{R}(f)$,

  • equivariance – $\mathcal{R}(\mathcal{D}(f)) \equiv \mathcal{D}(\mathcal{R}(f))$,

  • covariance – $\mathcal{R}(\mathcal{D}(f)) \equiv \mathcal{D}'(\mathcal{R}(f))$,

where $\mathcal{R}$ is a representation, $\mathcal{D}$ is a degradation, and $\mathcal{D}'$ is a composite function of $\mathcal{D}$. Note that invariance and equivariance are special cases of covariance with $\mathcal{D}' = \mathrm{id}$ and $\mathcal{D}' = \mathcal{D}$, respectively.
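The three identities can be checked on a toy 1-D example. In this sketch the degradation is a circular shift and the three representations (sum, circular moving average, discrete Fourier transform) are chosen purely for illustration; none of them is part of the proposed framework.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(32)
shift = 5
degraded = np.roll(signal, shift)  # D: circular shift by `shift` samples

# Invariance: R(D(f)) == R(f), here with R = sum of the samples.
invariant_ok = np.isclose(degraded.sum(), signal.sum())


# Equivariance: R(D(f)) == D(R(f)), here with R = circular 3-tap average.
def smooth(f):
    return (np.roll(f, 1) + f + np.roll(f, -1)) / 3.0


equivariant_ok = np.allclose(smooth(degraded), np.roll(smooth(signal), shift))

# Covariance: R(D(f)) == D'(R(f)), here with R = DFT and D' = multiplication
# by a phase ramp that is determined by the shift.
k = np.arange(signal.size)
phase_ramp = np.exp(-2j * np.pi * k * shift / signal.size)
covariant_ok = np.allclose(np.fft.fft(degraded), phase_ramp * np.fft.fft(signal))

# Taking the magnitude discards the phase ramp, so covariance becomes invariance.
magnitude_ok = np.allclose(
    np.abs(np.fft.fft(degraded)), np.abs(np.fft.fft(signal))
)
```

The last check mirrors the role that the magnitude plays for moments: a degradation that only changes the phase of a representation leaves its magnitude untouched.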

Starting from the above terms and the local representation (3), one can verify that $\left<f, V_{nm}^{uvw}\right>$ exhibits the following properties w.r.t. translation, rotation, flipping, and scaling on images (see [46] for details).

The image translation leads to

$$\begin{split} &\left<f(x+\Delta x, y+\Delta y), V_{nm}^{uvw}(x,y)\right> \\ &= \left<f(x,y), V_{nm}^{(u+\Delta x)(v+\Delta y)w}(x,y)\right>, \end{split} \tag{5}$$

where $(\Delta x, \Delta y)$ is the translation offset of the image $f$. Note that the same $(\Delta x, \Delta y)$ appears in position parameter $(u,v)$, implying the equivariance w.r.t. the image translation.
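Identity (5) can be verified exactly on a pixel grid for integer offsets. The sketch below is a minimal discrete version of (3); the function name `local_moment`, the placeholder radial profile $r^{n}$, and the specific sizes and offsets are assumptions for illustration, since the property does not depend on the choice of $R_{n}$.

```python
import numpy as np


def local_moment(image, u, v, w, n, m):
    """Discrete version of (3): sum over the disk of radius w centered at (u, v).

    The radial profile r**n is a placeholder chosen for illustration only.
    """
    rows, cols = np.indices(image.shape)
    dx, dy = cols - u, rows - v  # x runs along columns, y along rows
    r = np.hypot(dx, dy) / w
    theta = np.arctan2(dy, dx)
    inside = r <= 1.0
    basis = r**n * np.exp(1j * m * theta)
    return np.sum(np.conj(basis[inside]) * image[inside])


rng = np.random.default_rng(0)
image = rng.random((32, 32))
dx_shift, dy_shift = 3, -2

# g(x, y) = f(x + dx_shift, y + dy_shift). The disks used below stay away
# from the border, so the wrap-around of np.roll never enters the sums.
shifted = np.roll(image, shift=(-dy_shift, -dx_shift), axis=(0, 1))

lhs = local_moment(shifted, u=12, v=14, w=5, n=2, m=3)
rhs = local_moment(image, u=12 + dx_shift, v=14 + dy_shift, w=5, n=2, m=3)
```

Here `lhs` is the moment of the translated image at $(u,v)$ and `rhs` is the moment of the original image at $(u+\Delta x, v+\Delta y)$; the two agree up to floating-point error.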

Since the translation equivariance holds, the following analysis in (6) to (8) will consider only center-aligned geometric transformations, i.e., we can restrict $(u,v) = (0,0)$ without loss of generality.

Figure 1: The blueprint of hierarchical invariance, where the image information passes through each intermediate layer in a geometrically controllable manner and, at the last layer, invariant features are obtained by compact designs while still retaining sufficient information.

The image rotation leads to

$$\begin{split} &\left<f(r,\theta+\phi), V_{nm}^{uvw}(r',\theta')\right> \\ &= \left<f(r,\theta), V_{nm}^{uvw}(r',\theta')\right> A_{m}^{*}(-\phi), \end{split} \tag{6}$$

with $(u,v) = (0,0)$, where $\phi$ is the rotation angle w.r.t. the center of the image $f$. Note that the same $\phi$ appears in the phase of the representation, implying the covariance w.r.t. the center-aligned rotation. It is straightforward that the covariance (6) specializes to invariance when taking the magnitude: $|\left<f(r,\theta+\phi), V_{nm}^{uvw}(r',\theta')\right>| = |\left<f(r,\theta), V_{nm}^{uvw}(r',\theta')\right>|$.

The image flipping leads to

$$\left<f(r,-\theta), V_{nm}^{uvw}(r',\theta')\right> = \left(\left<f(r,\theta), V_{nm}^{uvw}(r',\theta')\right>\right)^{*}, \tag{7}$$

with $(u,v) = (0,0)$, where $f(r,-\theta)$ is a vertically flipped version of the image $f$ w.r.t. the center. Note that center-aligned vertical flipping again only affects the phase of the representation, implying the covariance similar to (6). As for other flipping orientations, the same conclusion can be derived from the composite of rotation and vertical flipping. It is straightforward that the joint invariance to center-aligned rotation and flipping holds when taking the magnitude of the representation.
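Properties (6) and (7) can be checked without interpolation error by using a 90-degree rotation and a vertical flip, both of which are exact pixel permutations. The sketch assumes a real-valued placeholder radial profile $r^{n}$ and a real-valued image (the conjugate relation in (7) is checked here only under these two assumptions), and `centered_moment` is a hypothetical helper name.

```python
import numpy as np


def centered_moment(image, n, m):
    """Discrete moment with (u, v) = (0, 0) on the disk inscribed in the image.

    The real radial profile r**n is a placeholder, not an orthogonal basis.
    """
    size = image.shape[0]
    coords = (2 * np.arange(size) - size + 1) / size  # symmetric about zero
    x, y = np.meshgrid(coords, coords)  # x along columns, y along rows
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = r <= 1.0
    basis = r**n * np.exp(1j * m * theta)
    return np.sum(np.conj(basis[inside]) * image[inside])


rng = np.random.default_rng(0)
image = rng.random((64, 64))
n, m = 2, 3
base = centered_moment(image, n, m)

# With this axis convention, np.rot90 samples f(r, theta + pi/2) exactly.
rotated = centered_moment(np.rot90(image), n, m)
phase_factor = np.exp(1j * m * np.pi / 2)  # A_m^*(-phi) for phi = pi/2

# np.flipud samples f(r, -theta), i.e., the vertical flip in (7).
flipped = centered_moment(np.flipud(image), n, m)
```

In this setup `rotated` equals `base * phase_factor` and `flipped` equals the complex conjugate of `base`, so all three values share the same magnitude.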

The image scaling leads to

$$\left<f(sx,sy), V_{nm}^{uvw}(x,y)\right> = \left<f(x,y), V_{nm}^{uv(ws)}(x,y)\right>, \tag{8}$$

with $(u,v) = (0,0)$, where $s$ is the scaling factor w.r.t. the center of the image $f$. Note that the same $s$ appears in scale parameter $w$, implying the covariance w.r.t. center-aligned scaling.

For the representation properties when $(u,v) \neq (0,0)$, they can be derived from the composite of translation with center-aligned rotation, flipping, and scaling, respectively. Hence, the magnitude of the representation has joint equivariance for any translation, rotation, and flipping on the $(u,v)$ domain, as well as covariance for any scaling on the $w$ domain.

3 Hierarchical Invariance: Theory

This section focuses on the theoretical aspects of the proposed HIR.

As a high-level intuition, we formalize the blueprint of hierarchical invariance, rethinking the typical modules of CNN representations. Starting from the invariant theory, we then define new modules and their compositions to fulfill this blueprint, along with justifications of the representation properties and efficient numerical implementations. To summarize this theoretical section, we discuss the criticisms and developments of the proposed idea versus typical concepts.

3.1 Blueprint of Hierarchical Invariance

Our goal is a CNN-like hierarchical architecture in which the image information passes through each intermediate layer in a geometrically controllable manner, and in which the last layer yields invariant features through compact designs while retaining sufficient information. In this paper, this ideal representation structure is termed hierarchical invariance.

Motivated by the goal, we rethink several typical modules of CNN representation and formalize a blueprint of such modules for hierarchical invariance, as shown in Fig. 1.

Formulation 1.

(Hierarchical invariance blueprint). The set of feature maps (including input images) is denoted as $X(\Omega,{\rm H})\triangleq\{M(i,j;k):\Omega\to{\rm H}\}$ with the discrete domain $(i,j)\in\Omega$ and complex-valued channels ${\mathfrak{C}^{k}}\in{\rm H}$, where $\Omega^{\prime}$ and ${\rm H}^{\prime}$ are the variants of $\Omega$ and ${\rm H}$ respectively under certain operators, e.g., downsampling, and $\mathfrak{G}$ is a group modeling all the symmetries of interest over $\Omega$. We formalize the following modules for hierarchical invariance:

  • The convolutional layer $\mathbb{C}:X(\Omega,{\rm H})\to X(\Omega^{\prime},{\rm H}^{\prime})$ captures local features by convolution operations. The geometric principle of $\mathbb{C}$ is the covariance for the symmetry group $\mathfrak{G}$, i.e., there exists a predictable $\mathfrak{g}^{\prime}$ such that $\mathbb{C}(\mathfrak{g}M)=\mathfrak{g}^{\prime}\mathbb{C}(M)$ for any $\mathfrak{g}\in\mathfrak{G}$ and $M\in X$, where the covariance specializes to the equivariance when $\mathfrak{g}^{\prime}=\mathfrak{g}$.

  • The nonlinear layer $\mathbb{S}:X(\Omega,{\rm H})\to X(\Omega,{\rm H}^{\prime})$ introduces non-linearity in features for discriminative purposes, with an activation function $\sigma$ such that $(\mathbb{S}M)(i,j)=\sigma(M(i,j))$, where the composition of convolutional and nonlinear layers, i.e., $\mathbb{S}\circ\mathbb{C}$, is also covariant for the group $\mathfrak{G}$, due to the element-wise action of $\sigma$.

  • The local pooling layer $\mathbb{P}:X(\Omega,{\rm H})\to X(\Omega^{\prime},{\rm H})$ downsamples the plane dimensions of feature maps to reduce computational complexity, such that $\Omega^{\prime}\subseteq\Omega$. The geometric principle of $\mathbb{P}$ is the (approximate) equivariance for any $\mathfrak{g}^{\prime}$ produced by $\mathbb{C}$, i.e., $\mathbb{P}(\mathfrak{g}^{\prime}M)\simeq\mathfrak{g}^{\prime}\mathbb{P}(M)$, allowing the subsequent convolutional layer $\mathbb{C}$ to continue capturing such covariant features, where the composition $\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}$ remains covariant for the group $\mathfrak{G}$.

  • The invariant layer $\mathbb{I}:X(\Omega,{\rm H})\to Y$ obtains the final vector representation via a certain global pooling over plane dimensions. The geometric principle of $\mathbb{I}$ is the invariance for the symmetry group $\mathfrak{G}$, i.e., $\mathbb{I}(\mathfrak{g}M)=\mathbb{I}(M)$ for any $\mathfrak{g}\in\mathfrak{G}$ and $M\in X$.

With this blueprint, HIR can be designed in a generic way, by the ordered cascading of such modules.

3.2 Definition of Layer and Path

From the invariant theory of Section 2, we will give a class of definitions for layers $\mathbb{C}$, $\mathbb{S}$, $\mathbb{P}$, and $\mathbb{I}$, satisfying the hierarchical invariance blueprint.

Definition 1.

(Convolutional layer). For the input feature map $M(i,j;k)$ with $\Omega=\{1,2,...,{N_{i}}\}\times\{1,2,...,{N_{j}}\}$ and ${\rm H}={\mathfrak{C}^{K}}$, the convolutional layer $\mathbb{C}$ is defined channel-wise as local covariant representations with (3) and (4):

$$\mathbb{C}M\triangleq\left<M,V_{nm}^{uvw}\right>=M(i,j;k)\otimes{(H_{nm}^{w}(i,j))^{T}}, \tag{9}$$

where $\otimes$ is the convolution over $\Omega$, $(\cdot)^{T}$ denotes the matrix transpose, and $H_{nm}^{w}$ is a convolution kernel defined as:

$$H_{nm}^{w}(i,j)=\{h_{nm}^{uvw}(i,j):u,v=w,(i,j)\;\mathrm{s.t.}\;{D_{ij}}\cap D\neq\emptyset\}, \tag{10}$$

where $h_{nm}^{uvw}$ is the integral value of the basis function over a valid pixel region:

$$h_{nm}^{uvw}(i,j)=\iint_{{D_{ij}}\cap D}{(V_{nm}^{uvw}(x,y))}^{*}dxdy, \tag{11}$$

with the $(i,j)$-centered pixel region ${D_{ij}}=\{(x,y)\in[i-\frac{\Delta i}{2},i+\frac{\Delta i}{2}]\times[j-\frac{\Delta j}{2},j+\frac{\Delta j}{2}]\}$.

Remark. In Definition 1, the convolutional layer $\mathbb{C}$ is specified by $w$ and $(n,m)$, which control the representation scale and representation frequency of $\mathbb{C}M$, respectively. Note that the $\mathbb{C}$ defined by (9) directly inherits the representation properties (5) $\sim$ (8); it should therefore be regarded as a linear covariant layer on the group $\mathfrak{G}$ modeling all translation, rotation, flipping, and scaling symmetries over $\Omega$. For convenience, we denote $\mathfrak{G}=\mathfrak{G}_{1}\times\mathfrak{G}_{2}$, where $\mathfrak{G}_{1}$ is the translation/rotation/flipping symmetry group and $\mathfrak{G}_{2}$ is the scaling symmetry group.
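As an illustration of Definition 1, the kernel construction of (10) and (11) and the local inner product of (9) can be sketched in dependency-free Python. This is a minimal sketch under stated assumptions, not the implementation used in this paper: the basis is assumed to factor as $V_{nm}=R_{n}(r)e^{jm\theta}$ with a hypothetical radial profile `radial`, the local basis is assumed to be the global one rescaled by $w$, the pixel integral in (11) is approximated by one midpoint sample, and borders wrap around. The names `radial`, `basis`, `make_kernel`, and `conv_layer` are illustrative.

```python
import math

def radial(n, r):
    # Hypothetical radial profile R_n(r); an assumption, not the paper's definition.
    return math.cos(math.pi * n * r)

def basis(n, m, x, y):
    """Assumed basis V_nm(x, y) = R_n(r) * exp(j*m*theta), zero outside the unit disk D."""
    r = math.hypot(x, y)
    if r > 1.0 or (r == 0.0 and m != 0):
        return 0j
    angular = 1 + 0j if r == 0.0 else (complex(x, y) / r) ** m
    return radial(n, r) * angular

def make_kernel(n, m, w):
    """Kernel H_nm^w in the spirit of (10)-(11), centered at index (w, w).

    The integral over each pixel region is replaced by a single midpoint
    sample times the pixel area 1/w^2 (an approximation for brevity).
    """
    area = 1.0 / (w * w)
    size = 2 * w + 1
    return [[basis(n, m, (i - w) / w, (j - w) / w).conjugate() * area
             for j in range(size)] for i in range(size)]

def conv_layer(M, kernel):
    """Local inner products as in (9) for one channel, with wrap-around borders."""
    Ni, Nj = len(M), len(M[0])
    w = len(kernel) // 2
    out = [[0j] * Nj for _ in range(Ni)]
    for u in range(Ni):
        for v in range(Nj):
            acc = 0j
            for a in range(-w, w + 1):
                for b in range(-w, w + 1):
                    acc += M[(u + a) % Ni][(v + b) % Nj] * kernel[a + w][b + w]
            out[u][v] = acc
    return out

kernel = make_kernel(n=1, m=1, w=2)
flat = [[1.0] * 6 for _ in range(6)]      # constant image
response = conv_layer(flat, kernel)       # a kernel with m != 0 sees no structure here
```

Under these assumptions, a kernel with $m\neq 0$ responds with (numerically) zero on a constant image, since the angular harmonic cancels over the symmetric support; here $w$ sets the kernel size and $(n,m)$ its frequency content.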

Definition 2.

(Nonlinear layer). For the input feature map $M(i,j;k)$ with $\Omega=\{1,2,...,{N_{i}}\}\times\{1,2,...,{N_{j}}\}$ and ${\rm H}={\mathfrak{C}^{K}}$, the nonlinear layer $\mathbb{S}$ is defined channel-wise as a magnitude operation:

$$\mathbb{S}M=\sigma(M(i,j))\triangleq|M(i,j;k)|, \tag{12}$$

where $M(i,j;k)$ is complex-valued, and (12) can be written explicitly as $\sqrt{{({\rm Re}\,M(i,j;k))}^{2}+{({\rm Im}\,M(i,j;k))}^{2}}$.

Remark. With Definition 2 and Section 2.2, the composition of convolutional and nonlinear layers will exhibit the joint equivariance of translation, rotation, and flipping, i.e., $\mathbb{S}\circ\mathbb{C}(\mathfrak{g}_{1}M)=\mathfrak{g}_{1}\mathbb{S}\circ\mathbb{C}(M)$ for any $\mathfrak{g}_{1}\in\mathfrak{G}_{1}$ and $M\in X$. Note that the $\mathbb{S}$ defined by (12) not only introduces the non-linearity in feature maps, but also converts the covariance $\mathfrak{g}_{1}^{\prime}$ (from $\mathbb{C}$ w.r.t. rotation and flipping) into the more manageable equivariance $\mathfrak{g}_{1}$. In addition, the composition $\mathbb{S}\circ\mathbb{C}$ preserves the scaling covariance of $\mathbb{C}$ due to the element-wise action of $\mathbb{S}$, i.e., $\mathbb{S}\circ\mathbb{C}(\mathfrak{g}_{2}M)=\mathfrak{g}_{2}^{\prime}\mathbb{S}\circ\mathbb{C}(M)$ for any $\mathfrak{g}_{2}\in\mathfrak{G}_{2}$ and $M\in X$.
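The phase-removing role of the magnitude in (12) can be illustrated with a toy check. The sketch assumes, for simplicity, that the phase action $\mathfrak{g}_{1}^{\prime}$ is a single unit-modulus factor $e^{j\phi}$ applied to the whole map (as for a center-aligned rotation); `apply_phase` and the sample values are illustrative only.

```python
import cmath

def nonlinear_layer(M):
    """Element-wise magnitude of a complex-valued channel, as in (12)."""
    return [[abs(z) for z in row] for row in M]

def apply_phase(M, phi):
    """Multiply every entry by exp(j*phi): a simplified stand-in for g1'."""
    factor = cmath.exp(1j * phi)
    return [[factor * z for z in row] for row in M]

M = [[1 + 2j, -3j], [0.5 - 0.5j, 4 + 0j]]
before = nonlinear_layer(M)
after = nonlinear_layer(apply_phase(M, 0.7))

# Largest change in magnitude caused by the phase factor (rounding error only).
max_gap = max(abs(a - b) for ra, rb in zip(before, after) for a, b in zip(ra, rb))
```

Because $|e^{j\phi}z|=|z|$, `max_gap` stays at the level of floating-point rounding for any `phi`.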

Definition 3.

(Local pooling layer). For the input feature map $M(i,j;k)$ with $\Omega=\{1,2,...,{N_{i}}\}\times\{1,2,...,{N_{j}}\}$ and ${\rm H}={\mathfrak{C}^{K}}$, the local pooling layer $\mathbb{P}$ is defined as the identity function:

$$\mathbb{P}M=M. \tag{13}$$

Remark. According to related research, downsampling operations (e.g., local max pooling) in CNNs impair (translation) equivariance to varying degrees, leaving only an approximation $\mathbb{P}(\mathfrak{g}M)\simeq\mathfrak{g}\mathbb{P}(M)$, especially for larger pooling scales or deeper network architectures; this implies a trade-off between computational complexity and representation equivariance. Since the proposed representation is one-shot without the large-scale training of typical CNNs, we neglect downsampling operations and simply set $\mathbb{P}={\rm id}$ when the computational cost is acceptable. Alternatively, more elegant pooling designs with better trade-offs between complexity and equivariance can be employed to define $\mathbb{P}$, as detailed in the paper by Zhang [21]. It follows directly from Definition 3 that the composition $\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}$ has the same representation properties as $\mathbb{S}\circ\mathbb{C}$.
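The trade-off mentioned in this remark can be seen on a toy feature map. The following sketch compares non-overlapping $2\times 2$ max pooling with the identity pooling of (13) under circular translations; the helper names and the sample map are illustrative assumptions.

```python
def shift_cols(M, t):
    """Circularly translate a feature map by t pixels along the column axis."""
    n = len(M[0])
    return [[row[(j - t) % n] for j in range(n)] for row in M]

def max_pool(M, k=2):
    """Non-overlapping k-by-k max pooling (a typical CNN downsampling)."""
    return [[max(M[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(0, len(M[0]), k)]
            for i in range(0, len(M), k)]

def identity_pool(M):
    """The local pooling layer of (13), taken as the identity."""
    return [row[:] for row in M]

M = [[0, 1, 5, 0],
     [0, 0, 0, 0],
     [2, 0, 0, 3],
     [0, 0, 0, 0]]

# A 2-pixel input shift corresponds to a 1-pixel shift after 2x2 pooling.
even_ok = max_pool(shift_cols(M, 2)) == shift_cols(max_pool(M), 1)
# A 1-pixel input shift has no exact counterpart on the pooled grid.
odd_candidates = [shift_cols(max_pool(M), t) for t in range(2)]
odd_ok = max_pool(shift_cols(M, 1)) in odd_candidates
# The identity pooling commutes with every shift.
identity_ok = identity_pool(shift_cols(M, 1)) == shift_cols(identity_pool(M), 1)
```

For this map, `even_ok` and `identity_ok` are true while `odd_ok` is false: max pooling is equivariant only to shifts that are multiples of the stride, whereas $\mathbb{P}={\rm id}$ keeps the equivariance exact at the cost of no reduction in resolution.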

Definition 4.

(Invariant layer). For the input feature map $M(i,j;k)$ with $\Omega=\{1,2,...,{N_{i}}\}\times\{1,2,...,{N_{j}}\}$ and ${\rm H}={\mathfrak{C}^{K}}$, the invariant layer $\mathbb{I}$ is defined channel-wise as global invariant representations with (1) and (2):

$$\mathbb{I}M=\mathcal{I}(\{\left<M(i,j;k),{V_{nm}}({x_{i}},{y_{j}})\right>\}), \tag{14}$$

where $\mathcal{I}$ is a special transform mapping image moments to global invariants, w.r.t. the symmetry group of interest $\mathfrak{G}_{0}\subseteq\mathfrak{G}_{1}\times\mathfrak{G}_{2}$ and any $M\in X$.

Remark. In Definition 4, we have not restricted $\mathcal{I}$ to a fixed formula, allowing the generality of the discussion in the following Section 3.3; its specific designs (w.r.t. the applications considered in this paper) will be given in Section 4.1. Note that, with Definitions 1 $\sim$ 3 and Section 2.2, the $M$ from $\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}$ and its cascade will basically preserve the geometric information of $f$, specifically the translation, rotation, and flipping symmetries. Therefore, the idea of $\mathcal{I}$ (w.r.t. the deep feature map $M$) is very similar to the classical theory of moment invariants (w.r.t. the original image $f$), with a wide range of potential designs [17].
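Since Definition 4 leaves $\mathcal{I}$ open, one classical possibility can serve as an illustration: taking the magnitudes of the global moments of a real-valued map, which discards the phase introduced by center-aligned rotation and flipping. This is an assumed example, not the design of Section 4.1; the basis $V_{nm}=R_{n}(r)e^{jm\theta}$ with the radial profile $\cos(\pi nr)$ and all function names are hypothetical.

```python
import math

def basis(n, m, x, y):
    """Assumed basis R_n(r) * exp(j*m*theta) on the unit disk (hypothetical R_n)."""
    r = math.hypot(x, y)
    if r > 1.0 or (r == 0.0 and m != 0):
        return 0j
    angular = 1 + 0j if r == 0.0 else (complex(x, y) / r) ** m
    return math.cos(math.pi * n * r) * angular

def invariant_layer(M, orders):
    """Magnitudes |<M, V_nm>| of global moments of a square, real-valued map M."""
    N = len(M)
    c = (N - 1) / 2.0                      # grid center mapped to the origin
    feats = []
    for n, m in orders:
        moment = sum(M[i][j] * basis(n, m, (i - c) / c, (j - c) / c).conjugate()
                     for i in range(N) for j in range(N))
        feats.append(abs(moment))
    return feats

def rot90(M):
    """Exact 90-degree rotation of a square map about its center."""
    N = len(M)
    return [[M[j][N - 1 - i] for j in range(N)] for i in range(N)]

def flip(M):
    """Center-aligned flip along the column axis."""
    return [row[::-1] for row in M]

M = [[float((2 * i + 3 * j * j + i * j) % 7) for j in range(5)] for i in range(5)]
orders = [(0, 0), (1, 1), (2, 2)]
f_ref = invariant_layer(M, orders)
f_rot = invariant_layer(rot90(M), orders)
f_flip = invariant_layer(flip(M), orders)
```

Up to rounding, `f_ref`, `f_rot`, and `f_flip` coincide. The check uses a 90-degree rotation because it maps the pixel grid onto itself; other angles would add resampling errors.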

Definition 5.

(Path). From Definitions 1 $\sim$ 4, we define a path of HIR as $p=({\lambda_{[1]}},{\lambda_{[2]}},\cdots,{\lambda_{[L]}})$, where ${\lambda_{[z]}}={(n,m,w)_{[z]}}$ specifies the parameters of the convolutional layer sorted by $z$. The HIR along a path $p$, ${\cal R}_{p}$, is defined as the following ordered cascading with corresponding parameters $p=({\lambda_{[1]}},{\lambda_{[2]}},...,{\lambda_{[L]}})$:

$${\cal R}_{p}\triangleq\mathbb{I}\circ\mathbb{P}_{[L]}\circ\mathbb{S}_{[L]}\circ\mathbb{C}_{[L]}\circ\cdots\circ\mathbb{P}_{[1]}\circ\mathbb{S}_{[1]}\circ\mathbb{C}_{[1]}. \tag{15}$$

Remark. In Definition 5, we further unify the global and local representation theories of moment invariants (Section 2) into a hierarchical representation framework. Note that the layers prior to the invariant layer $\mathbb{I}$ provide structure-preserving properties for the representation ${\cal R}_{p}$. Here, the global representation (15) is designed for image-level visual tasks, e.g., classification; as for pixel-level ones, e.g., segmentation, we can preserve the spatial dimensions by removing the last invariant layer. In the next section, the representation properties of Definition 5 will be analyzed explicitly.
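The ordered cascade of (15) can be sketched end to end on a small image. The code below is a minimal illustration under assumptions, not the implementation of this paper: the basis is taken as $R_{n}(r)e^{jm\theta}$ with a hypothetical radial profile, pixel integrals are replaced by point samples, borders wrap around, $\mathbb{P}={\rm id}$, and $\mathcal{I}$ is assumed to be the magnitude of global moments. The names `unit`, `invariant_layer`, and `hir_path` are illustrative.

```python
import math

def basis(n, m, x, y):
    """Assumed basis R_n(r) * exp(j*m*theta) on the unit disk (hypothetical R_n)."""
    r = math.hypot(x, y)
    if r > 1.0 or (r == 0.0 and m != 0):
        return 0j
    angular = 1 + 0j if r == 0.0 else (complex(x, y) / r) ** m
    return math.cos(math.pi * n * r) * angular

def unit(M, n, m, w):
    """U = P o S o C with P = id: wrap-around local moments, then magnitude."""
    N = len(M)
    K = {(a, b): basis(n, m, a / w, b / w).conjugate()
         for a in range(-w, w + 1) for b in range(-w, w + 1)}
    return [[abs(sum(M[(u + a) % N][(v + b) % N] * k for (a, b), k in K.items()))
             for v in range(N)] for u in range(N)]

def invariant_layer(M, orders):
    """Magnitudes of global moments: one simple (assumed) choice of I."""
    N = len(M)
    c = (N - 1) / 2.0
    return [abs(sum(M[i][j] * basis(n, m, (i - c) / c, (j - c) / c).conjugate()
                    for i in range(N) for j in range(N)))
            for n, m in orders]

def hir_path(image, path, orders):
    """Apply the units for lambda_[1], ..., lambda_[L] in order, then I, as in (15)."""
    M = image
    for n, m, w in path:
        M = unit(M, n, m, w)
    return invariant_layer(M, orders)

def rot90(M):
    """Exact 90-degree rotation of a square map about its center."""
    N = len(M)
    return [[M[j][N - 1 - i] for j in range(N)] for i in range(N)]

image = [[float((3 * i * i + 5 * j + i * j) % 11) for j in range(7)] for i in range(7)]
path = [(0, 1, 1), (1, 2, 2)]              # (n, m, w) for layers 1 and 2
orders = [(0, 0), (1, 1), (2, 2)]          # moment orders used by I
features = hir_path(image, path, orders)
features_rot = hir_path(rot90(image), path, orders)
gap = max(abs(a - b) for a, b in zip(features, features_rot))
```

In this setting `gap` stays at rounding level, i.e., the feature vector of the rotated image matches that of the original. For pixel-level tasks, returning `M` before the call to `invariant_layer` keeps the spatial dimensions.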

3.3 Representation Property

In a typical CNN, the relationship between image information and learned representation is highly nonlinear and difficult to understand or predict. As for the HIR, we can explicitly give the following conclusions about the geometric symmetries between image and representation, implying good robustness and interpretability.

Property 1.

(Equivariance for translation, rotation, and flipping). For a representation unit $\mathbb{U}\triangleq\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}$ with arbitrary parameters $\lambda$ (for the convolutional layer), any composition of $\mathbb{U}$ satisfies the joint equivariance for translation, rotation, and flipping (ignoring edge effects and resampling errors), i.e., the following identity holds:

$$\mathbb{U}_{[L]}\circ\cdots\circ\mathbb{U}_{[2]}\circ\mathbb{U}_{[1]}(\mathfrak{g}_{1}M)\equiv\mathfrak{g}_{1}\mathbb{U}_{[L]}\circ\cdots\circ\mathbb{U}_{[2]}\circ\mathbb{U}_{[1]}(M), \tag{16}$$

for any composition length $L\geq 1$, any $\mathfrak{g}_{1}\in\mathfrak{G}_{1}$ and $M\in X$, where $\mathfrak{G}_{1}$ is the translation/rotation/flipping symmetry group.

Proof.

First, let us examine the behavior of a representation unit $\mathbb{U}$ on $\mathfrak{G}_{1}$:

$$\begin{split}\mathbb{U}(\mathfrak{g}_{1}M)&=\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}(\mathfrak{g}_{1}M)\\
&=\mathbb{P}\circ\mathbb{S}\circ\mathfrak{g}_{1}^{\prime}\mathbb{C}(M)\\
&=\mathbb{P}\circ\mathfrak{g}_{1}\mathbb{S}\circ\mathbb{C}(M)\\
&=\mathfrak{g}_{1}\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}(M)\\
&=\mathfrak{g}_{1}\mathbb{U}(M),\end{split} \tag{17}$$

where the first pass comes from the covariance of $\mathbb{C}$ for rotation and flipping, i.e., (6) and (7), and $\mathfrak{g}_{1}^{\prime}$ is a predictable operation acting in the phase domain of $\mathbb{C}(M)$; the second pass comes from the specialization of $\mathbb{S}$ to the covariant $\mathfrak{g}_{1}^{\prime}$: the magnitude operation removes the extra phase variations, leading to a pure equivariance $\mathfrak{g}_{1}$; the third pass comes from the identity function of $\mathbb{P}$, and holds only approximately when a downsampling $\mathbb{P}$ is used. Here, $\mathbb{U}(\mathfrak{g}_{1}M)=\mathfrak{g}_{1}\mathbb{U}(M)$ means that the representation unit $\mathbb{U}$ can be considered as an equivariant layer for any $\mathfrak{g}_{1}\in\mathfrak{G}_{1}$ and $M\in X$; in other words, the single $\mathbb{U}$ and $\mathfrak{g}_{1}$ operations on $M\in X$ are exchangeable.
Furthermore, with the notation $M_{[l]}\triangleq\mathbb{U}_{[l]}\circ\cdots\circ\mathbb{U}_{[1]}(M)=\mathbb{U}_{[l]}M_{[l-1]}$, we have $M_{[l]}\in X$ for any $l\in\{1,2,\cdots,L\}$. Therefore, applying the single-unit result layer by layer, $\mathfrak{g}_{1}$ and any composition of $\mathbb{U}$ are exchangeable, implying the correctness of Property 1. ∎
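Property 1 can also be checked numerically in a setting where edge effects and resampling errors vanish: circular translations, 90-degree rotations, and flips on a square grid with wrap-around borders. The sketch below is illustrative only; it assumes a basis of the form $R_{n}(r)e^{jm\theta}$ with a hypothetical radial profile, point-sampled kernels, and $\mathbb{P}={\rm id}$, and all function names are assumptions.

```python
import math

def basis(n, m, x, y):
    """Assumed basis R_n(r) * exp(j*m*theta) on the unit disk (hypothetical R_n)."""
    r = math.hypot(x, y)
    if r > 1.0 or (r == 0.0 and m != 0):
        return 0j
    angular = 1 + 0j if r == 0.0 else (complex(x, y) / r) ** m
    return math.cos(math.pi * n * r) * angular

def unit(M, n, m, w):
    """U = P o S o C with P = id: wrap-around local moments, then magnitude."""
    N = len(M)
    K = {(a, b): basis(n, m, a / w, b / w).conjugate()
         for a in range(-w, w + 1) for b in range(-w, w + 1)}
    return [[abs(sum(M[(u + a) % N][(v + b) % N] * k for (a, b), k in K.items()))
             for v in range(N)] for u in range(N)]

def cascade(M, path):
    """Composition U_[L] o ... o U_[1] for path = [(n, m, w), ...]."""
    for n, m, w in path:
        M = unit(M, n, m, w)
    return M

def translate(M, ti, tj):
    N = len(M)
    return [[M[(i - ti) % N][(j - tj) % N] for j in range(N)] for i in range(N)]

def rot90(M):
    N = len(M)
    return [[M[j][N - 1 - i] for j in range(N)] for i in range(N)]

def flip(M):
    return [row[::-1] for row in M]

def max_gap(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

image = [[float((7 * i + 3 * j * j + i * j) % 13) for j in range(6)] for i in range(6)]
path = [(1, 1, 1), (0, 2, 2)]
deep = cascade(image, path)

# For each g1, compare U(g1 M) with g1 U(M), i.e., the two sides of (16).
gaps = {name: max_gap(cascade(g(image), path), g(deep))
        for name, g in [("translation", lambda M: translate(M, 2, 1)),
                        ("rotation", rot90),
                        ("flipping", flip)]}
```

All three entries of `gaps` remain at rounding level, in line with (16). For arbitrary rotation angles or non-periodic borders, the identity would hold only up to the edge effects and resampling errors noted in the property.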

Property 2.

(Covariance for scaling). For a representation unit $\mathbb{U}$, where the scale parameter of its convolutional layer is specified as $w$ with the notation $\mathbb{U}^{w}\triangleq\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}^{w}$, any composition of $\mathbb{U}^{w}$ satisfies the covariance for scaling (ignoring edge effects and resampling errors), i.e., the following identity holds:

$$\begin{split}&\mathbb{U}^{w}_{[L]}\circ\cdots\circ\mathbb{U}^{w}_{[2]}\circ\mathbb{U}^{w}_{[1]}(\mathfrak{g}_{2}M)\\
&\equiv\mathfrak{g}_{2}^{\prime}\mathbb{U}^{w}_{[L]}\circ\cdots\circ\mathbb{U}^{w}_{[2]}\circ\mathbb{U}^{w}_{[1]}(M)\\
&=\mathfrak{g}_{2}\mathbb{U}^{ws}_{[L]}\circ\cdots\circ\mathbb{U}^{ws}_{[2]}\circ\mathbb{U}^{ws}_{[1]}(M),\end{split} \tag{18}$$

for any composition length $L\geq 1$, any $\mathfrak{g}_{2}\in\mathfrak{G}_{2}$ and $M\in X$, where $\mathfrak{g}_{2}^{\prime}$ is a predictable operation corresponding to $\mathfrak{g}_{2}$ with the explicit form $\mathfrak{g}_{2}^{\prime}\mathbb{U}^{w}\triangleq\mathfrak{g}_{2}\mathbb{U}^{ws}$, $s$ is the scaling factor w.r.t. $\mathfrak{g}_{2}$, and $\mathfrak{G}_{2}$ is the scaling symmetry group.

Proof.

First, let us examine the behavior of a representation unit $\mathbb{U}^{w}$ on $\mathfrak{G}_{2}$:

$$\begin{split}\mathbb{U}^{w}(\mathfrak{g}_{2}M)&=\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}^{w}(\mathfrak{g}_{2}M)\\ &=\mathbb{P}\circ\mathbb{S}\circ\mathfrak{g}_{2}^{\prime}\mathbb{C}^{w}(M)\\ &=\mathbb{P}\circ\mathbb{S}\circ\mathfrak{g}_{2}\mathbb{C}^{ws}(M)\\ &=\mathbb{P}\circ\mathfrak{g}_{2}\mathbb{S}\circ\mathbb{C}^{ws}(M)\\ &=\mathfrak{g}_{2}\mathbb{P}\circ\mathbb{S}\circ\mathbb{C}^{ws}(M)\\ &=\mathfrak{g}_{2}\mathbb{U}^{ws}(M)\\ &=\mathfrak{g}_{2}^{\prime}\mathbb{U}^{w}(M),\end{split} \tag{19}$$

where the first pass comes from the covariance of $\mathbb{C}$ for scaling, i.e., (8), and $\mathfrak{g}_{2}^{\prime}$ is a predictable operation acting in both the $\Omega$ domain (i.e., the same scaling $\mathfrak{g}_{2}$) and the $w$ domain (i.e., the factor $s$) of $\mathbb{C}^{w}(M)$; the second and third passes come from the element-wise action of $\mathbb{S}$ and the identity function of $\mathbb{P}$, respectively. Here, $\mathbb{U}^{w}(\mathfrak{g}_{2}M)=\mathfrak{g}_{2}^{\prime}\mathbb{U}^{w}(M)$ means that the representation unit $\mathbb{U}^{w}$ can be considered as a covariant layer for any $\mathfrak{g}_{2}\in\mathfrak{G}_{2}$ and $M\in X$; in other words, the single $\mathbb{U}^{w}$ and $\mathfrak{g}_{2}$ operations on $M\in X$ are exchangeable, provided that the scale parameter changes from $w$ to $ws$.
Furthermore, we have ${M_{[l]}}\in X$ for any $l\in\{1,2,\cdots,L\}$. Therefore, $\mathfrak{g}_{2}$ and any composition of $\mathbb{U}^{w}$ are exchangeable while changing the scale parameter to $ws$, implying the correctness of Property 2. ∎

Property 3.

(Hierarchical invariance). For any composition of representation units $\mathbb{U}$, it is practical to design a global invariant map $\mathcal{I}$ w.r.t. the symmetry group of interest $\mathfrak{G}_{0}\subseteq\mathfrak{G}_{1}\times\mathfrak{G}_{2}$, due to the predictable geometric symmetries between the input image and deep feature map (at each intermediate layer) guaranteed by Properties 1 and 2. More specifically, with Definition 4, we assume that there exists an $\mathcal{I}$ such that $\mathbb{I}(\mathfrak{g}_{0}^{\prime}M)=\mathbb{I}(M)$ for any $\mathfrak{g}_{0}\in\mathfrak{G}_{0}$ and $M\in X$, i.e., invariance holds on one layer, where $\mathfrak{g}^{\prime}$ is a predictable operation corresponding to $\mathfrak{g}$ and $\mathbb{U}$. Then the following invariance:

$$\mathbb{I}(\mathfrak{g}_{0}^{\prime}M)_{[L]}\equiv\mathbb{I}M_{[L]}, \tag{20}$$

holds for any composition length $L\geq 1$.

Proof.

We can rewrite (20) as:

$$\begin{split}\mathbb{I}(\mathfrak{g}_{0}M)_{[L]}&=\mathbb{I}\circ\mathbb{U}_{[L]}\circ\cdots\circ\mathbb{U}_{[2]}\circ\mathbb{U}_{[1]}(\mathfrak{g}_{0}M)\\ &=\mathbb{I}(\mathfrak{g}_{0}^{\prime}\mathbb{U}_{[L]}\circ\cdots\circ\mathbb{U}_{[2]}\circ\mathbb{U}_{[1]}(M))\\ &=\mathbb{I}\circ\mathbb{U}_{[L]}\circ\cdots\circ\mathbb{U}_{[2]}\circ\mathbb{U}_{[1]}(M)\\ &=\mathbb{I}M_{[L]},\end{split} \tag{21}$$

where the first pass comes from Properties 1 and 2 (note that $\mathfrak{g}_{0}\in\mathfrak{G}_{0}\subseteq\mathfrak{G}_{1}\times\mathfrak{G}_{2}$, and $\mathfrak{g}_{0}^{\prime}$ is related to $\mathfrak{g}_{1}$ and $\mathfrak{g}_{2}^{\prime}$); the second pass comes from our assumption $\mathbb{I}(\mathfrak{g}_{0}^{\prime}M)=\mathbb{I}(M)$ for any $\mathfrak{g}_{0}\in\mathfrak{G}_{0}$ and $M\in X$, with ${M_{[l]}}\in X$ for any $l\in\{1,2,\cdots,L\}$. ∎
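The argument of (21) can be illustrated with a deliberately simplified sketch. It is a toy analogue under stated assumptions, not the paper's construction: cyclic shifts of a 1-D signal stand in for the 2-D symmetry group, a circular convolution followed by a modulus stands in for the unit $\mathbb{U}$, and a plain sum stands in for the invariant map $\mathbb{I}$. All function names are hypothetical.

```python
# Toy 1-D analogue of Property 3 (assumption: cyclic shifts replace the
# paper's 2-D symmetry group; none of these names come from the paper).

def circ_conv(x, h):
    """Circular convolution, which is equivariant to cyclic shifts of x."""
    n = len(x)
    return [sum(x[(i - k) % n] * h[k] for k in range(len(h))) for i in range(n)]

def unit(x, h):
    """Toy representation unit: convolution followed by an element-wise modulus."""
    return [abs(v) for v in circ_conv(x, h)]

def cascade(x, kernels):
    """Compose the units U_[L] o ... o U_[1], one kernel per layer."""
    for h in kernels:
        x = unit(x, h)
    return x

def invariant(x):
    """Toy global invariant map: a sum over positions ignores cyclic shifts."""
    return sum(x)

def shift(x, t):
    """Group action g: cyclic shift by t samples."""
    return x[-t:] + x[:-t]

signal = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 4.0, -1.0]
kernels = [[1.0, -1.0], [0.5, 0.25, -0.75], [1.0, 2.0]]

# First pass of (21): the transformation can be moved through the cascade.
lhs = cascade(shift(signal, 3), kernels)
rhs = shift(cascade(signal, kernels), 3)
equivariant = all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))

# Second pass of (21): the invariant map removes the remaining transformation.
inv_gap = abs(invariant(lhs) - invariant(cascade(signal, kernels)))
```

In this toy setting the predictable operation $\mathfrak{g}^{\prime}$ coincides with $\mathfrak{g}$ itself; for scaling, Property 2 shows that it additionally changes the scale parameter.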

3.4 Fast and Accurate Implementation

Above, the core components of our representations, i.e., definitions and properties, have been formalized. In this section, we complement them with the numerical implementation of HIR, especially the fast and accurate computations of Definition 1 from our previous work [46]. Note that the discussion here is very general, with no restrictions on the specific definitions of the basis functions.

Definition 6.

(Fast implementation). Let us introduce the convolution theorem as a fast implementation of Definition 1, such that the spatial domain convolution of (9) can be converted to the following frequency domain product form [46]:

$$\mathbb{C}M={\mathcal{F}^{-1}}(\mathcal{F}(M(i,j;k))\odot\mathcal{F}({(H_{nm}^{w}(i,j))^{T}})), \tag{22}$$

where $\mathcal{F}$ is the Fourier transform and $\odot$ is the point-wise multiplication.
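The convolution theorem behind (22) can be checked numerically with a minimal, standard-library sketch. It assumes circular boundary handling, a kernel zero-padded to the feature-map size, and power-of-two side lengths; the transposition and padding conventions of (22) are not reproduced. The helper names (`fft`, `fft2`, `conv2_fft`, `conv2_direct`) are illustrative only.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); unnormalised."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def fft2(a, inverse=False):
    """2-D FFT: transform the rows, then the columns."""
    rows = [fft(r, inverse) for r in a]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    out = [list(r) for r in zip(*cols)]
    if inverse:
        scale = len(a) * len(a[0])
        out = [[v / scale for v in r] for r in out]
    return out

def conv2_fft(m, h):
    """Circular 2-D convolution via F^{-1}(F(m) * F(h)) with a point-wise product."""
    fm, fh = fft2(m), fft2(h)
    prod = [[p * q for p, q in zip(rm, rh)] for rm, rh in zip(fm, fh)]
    return [[v.real for v in r] for r in fft2(prod, inverse=True)]

def conv2_direct(m, h):
    """Reference: circular 2-D convolution computed from the definition."""
    ni, nj = len(m), len(m[0])
    return [[sum(m[(i - a) % ni][(j - b) % nj] * h[a][b]
                 for a in range(ni) for b in range(nj))
             for j in range(nj)] for i in range(ni)]

# An 8 x 8 feature map and a small kernel zero-padded to the same size.
M = [[float((3 * i + 5 * j) % 7) for j in range(8)] for i in range(8)]
H = [[0.0] * 8 for _ in range(8)]
H[0][0], H[0][1], H[1][0], H[1][1] = 1.0, -2.0, 0.5, 3.0

max_err = max(abs(a - b)
              for ra, rb in zip(conv2_fft(M, H), conv2_direct(M, H))
              for a, b in zip(ra, rb))
```

Both routes agree up to floating-point rounding; the frequency-domain route replaces the per-pixel kernel sum with three transforms and one point-wise product.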

Property 4.

(Complexity analysis). In Definition 1, (9) dominates the computational complexity due to the dense convolution. For the input feature map $M(i,j;k)$ with $\Omega=\{1,2,\cdots,{N_{i}}\}\times\{1,2,\cdots,{N_{j}}\}$ and ${\rm H}={\mathfrak{C}^{K}}$, we assume that a set of $\mathbb{C}M$ needs to be computed, where the scale parameter $w\in{S_{w}}$ with a fixed order $(n,m)$ and a fixed channel $k$, and denote the number of feature map samples as ${N_{ij}}={N_{i}}{N_{j}}$ and the number of scale samples as ${N_{w}}=|{S_{w}}|$.
With Definition 6 and the Fast Fourier Transform (FFT), we can compute the set of $\mathbb{C}M$ in ${\mathcal{O}}({N_{w}}{N_{ij}}\log{N_{ij}})$ multiplications, as opposed to the complexity of ${\mathcal{O}}({N_{w}}{N_{ij}}{w_{\max}}^{2})$ by the direct Definition 1, where ${w_{\max}}$ is the maximum scale in ${S_{w}}$. Note the big difference between square and logarithmic growth in the complexity (after removing the common terms): Definition 6 will exhibit better efficiency when ${w_{\max}}$ is sufficiently large such that ${w_{\max}}^{2}>\log{N_{ij}}$.
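A small sketch of the two leading-order counts makes the crossover condition concrete. It assumes a base-2 logarithm and ignores the constants hidden by the $\mathcal{O}$ notation, so the numbers indicate growth rates rather than measured running times; the function names are hypothetical.

```python
import math

def direct_cost(n_w, n_ij, w_max):
    """Leading-order multiplication count of the spatial-domain convolution."""
    return n_w * n_ij * w_max ** 2

def fft_cost(n_w, n_ij):
    """Leading-order multiplication count of the FFT route (base-2 log assumed)."""
    return n_w * n_ij * math.log2(n_ij)

def fft_is_cheaper(n_ij, w_max):
    """Crossover condition after cancelling the common factor N_w * N_ij."""
    return w_max ** 2 > math.log2(n_ij)

n_w, n_ij = 4, 256 * 256              # 4 scales on a 256 x 256 feature map
small = fft_is_cheaper(n_ij, 3)       # 9 > 16 does not hold
large = fft_is_cheaper(n_ij, 8)       # 64 > 16 holds
ratio = direct_cost(n_w, n_ij, 8) / fft_cost(n_w, n_ij)
```

For this hypothetical setting, $\log_2 N_{ij}=16$, so the frequency-domain route wins once $w_{\max}\geq 5$, and at $w_{\max}=8$ the leading-order counts differ by a factor of 4.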

Definition 7.

(Accurate implementation). Let us introduce the higher-order numerical integration as an accurate implementation of Definition 1, such that the two-dimensional continuous integral of (11) can be converted to the following summation form [46]:

$$h_{nm}^{uvw}(i,j)\simeq\sum\limits_{(a,b)\in{S_{ab}}}{{c_{ab}}{{(V_{nm}^{uvw}({x_{a}},{y_{b}}))}^{*}}\frac{{\Delta i\Delta j}}{{{w^{2}}}}}, \tag{23}$$

where the set of numerical integration samples ${S_{ab}}$ encodes the points $({x_{a}},{y_{b}})\in{D_{ij}}$ and the corresponding weights ${c_{ab}}$, which are specified by a certain numerical integration strategy, such as Gaussian quadrature.

Property 5.

(Accurate analysis). In Definition 1, (11) dominates the computational accuracy due to the continuous integration of complicated functions. We assume that $h_{nm}^{uvw}$ with a fixed order $(n,m)$ and position $(u,v)$ needs to be computed, and denote the number of numerical integration samples as ${N_{ab}}=|{S_{ab}}|$. The implementation based on Definition 7 exhibits an approximation error of ${\mathcal{O}}({(\frac{{\Delta i\Delta j}}{{{w^{2}}}})^{{N_{ab}}+1}})$.
Note that when there is more than one sample within each pixel region, i.e., ${N_{ab}}>1$, Definition 7 will exhibit better accuracy than the error of ${\mathcal{O}}({(\frac{{\Delta i\Delta j}}{{{w^{2}}}})^{2}})$ by the direct Definition 1 (zero-order approximation).

3.5 Comparison with Related Work

As a further explanation of Table 1, it is necessary to conclude this section and highlight the theoretical relationships with typical related works:

  • Traditional invariance. Our work generalizes this theory by unifying the global and local invariant representations into a new framework of HIR. More specifically, we formalize layers $\mathbb{C}$, $\mathbb{S}$, and $\mathbb{P}$ based on the theory of local invariants [46] (Definitions 1 $\sim$ 3), arguing the equivariance/covariance can be preserved across layers under a certain cascade (Properties 1 $\sim$ 2). We also formalize layer $\mathbb{I}$ based on the theory of global invariants [17] (Definition 4), arguing the successes of global invariance for image domains can be directly generalized to equivariant/covariant deep feature domains (Property 3). Under our hierarchical invariance, classical global [17] and local [46] invariants can be considered as special cases, i.e., $\mathbb{I}f$ and $\mathbb{I}\circ\mathbb{S}\circ\mathbb{C}f$ (Definition 5).

  • Traditional CNN. Our work has a similar hierarchical architecture but with better properties in geometric symmetry, allowing for robust and interpretable image representations. More specifically, we introduce the discriminative design of CNN in our invariants, i.e., over-complete representation with deep cascading [21]. On the other hand, we criticize typical CNN modules (Formulation 1), allowing fully transparent geometric symmetries across layers of our representation (Properties 1 $\sim$ 3). As a result, the proposed representation serves as an effective alternative to the highly black-box CNN in trustworthy tasks.

  • Scattering networks. Our work is more compact in achieving rotation invariance. As a main competitor, scattering networks are also based on deep cascading of explicit transforms (wavelets) [24], with similar concepts to our work. However, constructing rotation invariants from scattering networks is complicated: it requires parallel convolution and cross-channel pooling of multiple oriented wavelets, and increasing the orientation sampling results in an exponential growth of the complexity. In contrast, our approach benefits from classical invariant theory: rotation invariance is continuous and one-shot (Property 1), providing better efficiency while making it easy to enlarge the network size to improve the representation capacity.

  • Equivariant networks. Our work is non-learning while being more compact in achieving continuous and joint invariance. As a secondary competitor, equivariant networks are also guaranteed by group theory [32], with similar concepts to our work. However, the convolutional layers in equivariant networks are learned, leading to varying degrees of data dependence. In particular, they have a similar parallel structure to scattering networks, leading to exponential complexity and optimization challenges. Although equivariant networks are a very generic design, our approach provides better efficiency for continuous and joint invariance (Properties 1 $\sim$ 3), while making it easy to enlarge the network size to improve the representation capacity.

4 Hierarchical Invariance: Practice

This section focuses on the practical aspects of the proposed HIR.

We first specify the above theory into a class of networks, involving more practical details about the topologies, layers, and parameters. We also discuss the domain adaptation strategies for practical scenarios, with feature/architecture selection or cascading learning module. Note that such efforts only serve as a feasible practice of hierarchical invariance theory, towards the experiments and applications in Section 5.

4.1 Specifying the Architecture

Starting from Properties 1 $\sim$ 3, we propose a practical architecture for the hierarchical invariance w.r.t. the symmetry group of interest $\mathfrak{G}_{0}=\mathfrak{G}_{1}\times\mathfrak{G}_{2}$, with a tree topology and working at multiple scales.

Single-scale Networks. Let us first present the topology on a single scale, i.e., all involved convolutional layers have a common scale parameter $w$, which exhibits invariance for $\mathfrak{G}_{1}$ (Properties 1 and 2). As shown in Fig. 2, we organize the set of paths as a tree-like network: 1) blue nodes denote the representation units $\mathbb{U}$ with different parameters $(n,m)$; 2) black nodes denote the identity function; 3) lines denote cascading relationships between nodes, where all nodes sorted by $l$ along their paths are plotted at the same level $l$ (sorted from top to bottom). Note that the feature map of each node will be fed into $\mathbb{I}$ for forming the invariants under this path, where the network representation is just the set of invariants under all paths. Here, the order parameter $(n,m)$ of the previous unit (blue) is always smaller than that of the subsequent ones (under a specific norm), so that the path exhibits an increasing trend in the order. With this design, the main information can be passed through the early nodes, and hence the subsequent nodes capture rich features. Also, the identity function (black) is introduced as a skip-connection trick, allowing the information to be passed to deeper nodes. In this paper, all units from the same level $l$ are specified separately from the set $\{(n,m):n+m=l,(n,m)\in{\mathbb{N}^{2}}\}$, i.e., their orders are equal under the ${\ell_{1}}$ norm.
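The level-wise order sets and the resulting paths can be enumerated with a short sketch. It reflects one plausible reading of the tree, stated here as assumptions: levels start at $l=1$, $\mathbb{N}$ includes zero, and every level offers all of its orders plus one identity node. The exact branching of Fig. 2 may differ, and the function names are hypothetical.

```python
def orders_at_level(l):
    """All orders (n, m) with n + m = l, i.e. equal l1 norm (0 assumed to be in N)."""
    return [(n, l - n) for n in range(l + 1)]

def enumerate_paths(depth):
    """Paths that pick one unit (an order) or the identity (None) at each level."""
    paths = [[]]
    for l in range(1, depth + 1):
        choices = orders_at_level(l) + [None]    # None marks the identity node
        paths = [p + [c] for p in paths for c in choices]
    return paths

def order_is_increasing(path):
    """Check that the l1 norm of the order grows along the non-identity nodes."""
    norms = [sum(order) for order in path if order is not None]
    return all(a < b for a, b in zip(norms, norms[1:]))

level_2 = orders_at_level(2)                     # [(0, 2), (1, 1), (2, 0)]
paths = enumerate_paths(2)
all_increasing = all(order_is_increasing(p) for p in paths)
```

Under these assumptions a depth-2 network has 3 choices at level 1 and 4 at level 2, hence 12 paths, and the increasing-order condition holds by construction because each level fixes the $\ell_{1}$ norm.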

Multi-scale Networks. Next, let us consider a multi-scale version of the above network with scale separation prior, extending the invariance to $\mathfrak{G}_{0}$ (Property 3). As shown in Fig. 3, a series of single-scale networks are introduced: 1) they have the same tree topology and same order parameter at corresponding nodes; 2) but each network has a different scale parameter, sampled from the set $\{w:w={2^{t}},t\in\mathbb{Z}\}$, where the scaling covariance (w.r.t. $w$) is transformed into a linear translation pattern (w.r.t. $t$) between multi-scale networks. The above network can derive a series of multi-scale representations of the image (with invariance for $\mathfrak{G}_{1}$), which are directly suitable for visual tasks with multi-scale physical structure (e.g., object detection). Further, we can derive scale-invariant representations (with invariance for $\mathfrak{G}_{0}$) under Property 3, by pooling feature maps from a series of corresponding nodes at multiple scales. Note that, in practice, we cannot sample the scale completely and densely, and thus the above scaling invariance is restricted.
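The translation pattern over $t$ and the restricted nature of the pooled invariance can be mimicked with a toy model. The sketch assumes a hypothetical scalar node response that depends only on the scale and peaks at $w=4$, and it models covariance as "scaling the image by $s$ turns the response at $w$ into the response at $ws$"; the spatial action of $\mathfrak{g}_{2}$ is ignored, and all names are illustrative.

```python
import math

def dyadic_scales(t_min, t_max):
    """Scale samples w = 2**t for integer t in [t_min, t_max]."""
    return {t: 2.0 ** t for t in range(t_min, t_max + 1)}

def response(w):
    """Hypothetical scalar response of one node as a function of scale (peak at w = 4)."""
    return math.exp(-(math.log2(w) - 2.0) ** 2)

def responses(image_scale, scales):
    """Covariance assumption: image scaling by s maps the response at w to that at w * s."""
    return {t: response(w * image_scale) for t, w in scales.items()}

scales = dyadic_scales(-2, 6)
base = responses(1.0, scales)
zoomed = responses(2.0, scales)      # s = 2 corresponds to the translation t -> t + 1

# The scaled image reproduces the original responses, translated by one step in t.
shifted = all(abs(zoomed[t] - base[t + 1]) < 1e-12 for t in range(-2, 6))

# Pooling over the sampled scales removes the translation ...
pooled_base = max(base.values())
pooled_zoomed = max(zoomed.values())

# ... unless the peak leaves the finite sampling window (s = 32 moves it to t = -3).
pooled_far = max(responses(32.0, scales).values())
```

The last value is smaller than the other two pooled values, which is a toy version of the remark that finite scale sampling restricts the achievable scaling invariance.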

Figure 2: A single-scale practice of HIR with the invariance for $\mathfrak{G}_{1}$. This tree-like HIR network encodes a set of paths, where blue and black nodes denote representation units (with different parameters) and identity function, respectively; lines denote cascading relationships between nodes.
Figure 3: A multi-scale practice of HIR with the invariance for $\mathfrak{G}_{0}$. This multi-scale HIR network is based on scale separation prior, where the scaling covariance is transformed into a linear translation pattern between multi-scale networks. One can derive scale-invariant representations by pooling feature maps from a series of corresponding nodes at multiple scales.

Radial Basis Functions. In our previous work [49], two generic classes of radial basis functions have been introduced, based on a family of harmonic functions:

$${R_{n}}(\alpha,r)=\sqrt{\frac{{\alpha{r^{\alpha-2}}}}{{2\pi}}}\exp(\bm{j}2n\pi{r^{\alpha}}), \tag{24}$$

and a family of polynomial functions:

$$\begin{split}R_{n}^{\alpha}(p,q,r)=\sqrt{\frac{{\alpha{r^{\alpha q-2}}{{(1-{r^{\alpha}})}^{p-q}}(p+2n)\Gamma(q+n)n!}}{{2\pi\Gamma(p+n)\Gamma(p-q+n+1)}}}\\ \times\sum\limits_{k=0}^{n}{\frac{{{{(-1)}^{k}}\Gamma(p+n+k){r^{\alpha k}}}}{{k!(n-k)!\Gamma(q+k)}}},\end{split} \tag{25}$$

respectively, where the fractional parameter $\alpha\in{\mathbb{R}^{+}}$, and the polynomial parameters $p,q\in{\mathbb{R}}$ must fulfill $p-q>-1$ and $q>0$. Both classes of functions can be used as the radial basis functions in (2), satisfying the orthogonality condition in Section 2.1.
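As a quick worked check of the orthogonality claim for the harmonic family (24), assume that the radial condition takes the common weighted form $\int_{0}^{1}R_{n}R_{m}^{*}\,r\,dr=\frac{1}{2\pi}\delta_{nm}$ for integer orders (an assumption here, since the exact statement belongs to Section 2.1). The substitution $u=r^{\alpha}$, $du=\alpha r^{\alpha-1}dr$ then gives:

$$\begin{split}\int_{0}^{1}R_{n}(\alpha,r)R_{m}^{*}(\alpha,r)\,r\,dr&=\frac{\alpha}{2\pi}\int_{0}^{1}r^{\alpha-1}\exp(\bm{j}2(n-m)\pi r^{\alpha})\,dr\\ &=\frac{1}{2\pi}\int_{0}^{1}\exp(\bm{j}2(n-m)\pi u)\,du\\ &=\frac{1}{2\pi}\delta_{nm}.\end{split}$$

The factor $\frac{1}{2\pi}$ is the share left for the angular part, whose integral over a full turn contributes $2\pi$, and the result holds for every $\alpha\in\mathbb{R}^{+}$.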

For the sake of simplicity, a family of cosine functions is chosen in all experiments and applications, as a special case of (24):

$${R_{n}}(r)=\left\{\begin{array}{ll}\frac{1}{\sqrt{\pi}} & n=0\\ \sqrt{\frac{2}{\pi}}\cos(n\pi{r^{2}}) & n>0\end{array}\right., \tag{26}$$

i.e., forming a hierarchical invariant version of the Polar Cosine Transform (PCT) [50]. Note that we aim to show the superiority of the hierarchical invariant framework itself, even when it relies on the naive choice (26).
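As a quick numerical illustration of (26), the following sketch evaluates the radial basis and checks its orthogonality. It assumes the usual PCT convention of a radial weight $r\,dr$ on $[0, 1]$, under which the integral of $R_n R_m$ equals $1/(2\pi)$ for $n = m$ and $0$ otherwise. The function names and the midpoint-rule integration are illustrative choices, not part of the released code.

```python
import math

def radial_basis(n: int, r: float) -> float:
    """R_n(r) from (26): constant for n = 0, cosine in r^2 for n > 0."""
    if n == 0:
        return 1.0 / math.sqrt(math.pi)
    return math.sqrt(2.0 / math.pi) * math.cos(n * math.pi * r * r)

def radial_inner(n: int, m: int, steps: int = 20000) -> float:
    """Midpoint-rule estimate of the integral of R_n(r) R_m(r) r over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h  # midpoint of the k-th sub-interval
        total += radial_basis(n, r) * radial_basis(m, r) * r
    return total * h

if __name__ == "__main__":
    # Diagonal pairs should be close to 1 / (2 * pi), off-diagonal pairs close to 0.
    for n, m in [(0, 0), (3, 3), (0, 2), (1, 4)]:
        print(n, m, radial_inner(n, m))
```

The factor $1/(2\pi)$ is cancelled by the angular integral of the complex exponential over $[0, 2\pi)$, which is why the full polar basis is orthonormal on the unit disk.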

Figure 4: Illustration for the datasets from the computer vision and pattern recognition experiments: (a) texture image dataset; (b) digit image dataset; (c) parasite image dataset.

Invariant Layer. In the monograph [51] and our previous work [17], a number of strategies for directly constructing global invariants in image domains have been presented. They can be naturally used to define $\mathcal{I}$ in (14), with the equivariant or covariant behavior of deep feature maps (Properties 1 $\sim$ 3). In all experiments and applications of this paper, a class of global invariants is concisely designed based on frequency pooling.

Regarding (14), we first let the Fourier basis be $V_{nm}(x_i, y_j)$. Note that the Fourier Transform (FT) is well understood in the signal processing community and can be considered a good foundation for interpretability. Then, based on the order/frequency sampling of the FT, $(n, m) \in [-K, K]^2$, we define $\mathcal{I}$ as a frequency-band integral in the polar system:

$$\mathcal{I}(\{\langle M, V_{nm} \rangle\}) \triangleq \Big\{ I_i = \sum_{(n,m) \in \mathcal{B}_i} \{ |\langle M, V_{nm} \rangle| \} : i = 1, 2, \cdots, \#_B \Big\}, \tag{27}$$

where $\mathcal{B}_i = \{(n, m) : \sqrt{2} K (i - 1) / \#_B \leq \|(n, m)\|_2 \leq \sqrt{2} K i / \#_B\}$ is the $i$-th frequency band under the $\ell_2$ norm, with the number of bands $\#_B$.

Here, we can state that the above feature vector $\{I_i : i = 1, 2, \ldots, \#_B\}$ directly satisfies the invariance for $\mathfrak{G}_1$, in light of Property 1 and the translation, rotation, and flipping properties of the FT. As for scaling, $\mathcal{I}$ is compatible with both single-scale and multi-scale networks: 1) regarding the single-scale case, a certain degree of robustness is provided for $\mathfrak{G}_2$ (at least up to the bandwidth), in light of Property 2 and the scaling property of the FT; 2) regarding the multi-scale case, the scaling covariance has been eliminated before feeding into $\mathcal{I}$, and thus the output satisfies the joint invariance for $\mathfrak{G}_0 = \mathfrak{G}_1 \times \mathfrak{G}_2$.

Note that the well-known average pooling is in fact a special case of (27), with $K = 0$ and $\#_B = 1$. Our frequency-band integral $\mathcal{I}$ can be regarded as a generic design of global pooling, with comprehensive consideration of interpretability, invariance, and discriminability.
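The frequency-band integral (27) can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the Fourier kernel is taken as the unnormalized DFT kernel, translations are modeled as circular shifts, a small tolerance `eps` guards the band boundaries against floating-point rounding, and the names `dft_coeff` and `band_pooling` are hypothetical. Both band bounds are inclusive, as in the definition of $\mathcal{B}_i$, so a frequency lying exactly on a shared boundary contributes to two bands.

```python
import cmath
import math

def dft_coeff(M, n, m):
    """Inner product <M, V_nm> with an (assumed) unnormalized Fourier kernel."""
    rows, cols = len(M), len(M[0])
    total = 0j
    for x in range(rows):
        for y in range(cols):
            phase = -2.0 * math.pi * (n * x / rows + m * y / cols)
            total += M[x][y] * cmath.exp(1j * phase)
    return total

def band_pooling(M, K, num_bands, eps=1e-9):
    """Sum |<M, V_nm>| over each l2-norm frequency band B_i, following (27)."""
    width = math.sqrt(2.0) * K / num_bands
    feats = [0.0] * num_bands
    for n in range(-K, K + 1):
        for m in range(-K, K + 1):
            mag = abs(dft_coeff(M, n, m))
            radius = math.hypot(n, m)
            for i in range(1, num_bands + 1):
                # Inclusive lower and upper bounds, with a rounding tolerance.
                if width * (i - 1) - eps <= radius <= width * i + eps:
                    feats[i - 1] += mag
    return feats

if __name__ == "__main__":
    # Small deterministic 6 x 6 feature map used only for demonstration.
    M = [[(3 * x + 5 * y * y + x * y) % 7 for y in range(6)] for x in range(6)]
    shifted = [[M[(x - 2) % 6][(y - 1) % 6] for y in range(6)] for x in range(6)]
    print(band_pooling(M, 2, 3))
    print(band_pooling(shifted, 2, 3))  # same values up to rounding
    print(band_pooling(M, 0, 1))        # single DC term, i.e. |sum of M|
```

With $K = 0$ and $\#_B = 1$ only the DC coefficient remains, which for a non-negative map equals the average pooling output up to the constant factor given by the number of pixels.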

4.2 Empowering the Data Adaptability

Due to the hand-crafted nature of HIR, a fixed set of its features is not adaptive to the data distribution. For larger-scale vision tasks, we propose the following data adaptability strategies, allowing our invariants to reach a discriminability level similar to that of learning representations. Note that unlike typical (under-)complete invariants, our invariants exhibit a high level of over-completeness due to the local and hierarchical structure, which is the foundation for the data adaptability strategies.

Feature/Architecture Selection. Discriminative features for a given task can be formed in a selection-based manner, inspired by Neural Architecture Search (NAS) [52]. First, we can construct a large-scale tree-like network (going deeper or wider), covering a wide set of paths and parameters, analogous to the notion of supernet in NAS [53]. Then, with the training set under a given task, we can perform correlation analysis of features and labels for ranking discriminative features (as well as the corresponding paths), analogous to the phase of architecture sampling and evaluation in NAS [53]. With the above analysis, we can greatly simplify the initial supernet such that the paths cover top-ranked features for applications, allowing our representations to be task-discriminative.
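The ranking step of this selection strategy can be illustrated as follows. This is a minimal sketch that assumes the correlation analysis is an absolute Pearson correlation between each feature column and a numeric label vector; the text above does not fix the correlation measure, and for multi-class labels another score may be more appropriate. The helper names `pearson` and `rank_features` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_features(features, labels, top_k):
    """Return the indices of the top_k feature columns by |correlation| with the labels."""
    num_features = len(features[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in features]
        scores.append(abs(pearson(column, labels)))
    order = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k]

if __name__ == "__main__":
    # Toy training set: feature 0 tracks the label, features 1 and 2 do not.
    features = [[0.1, 5.0, 1.0], [0.2, 5.0, 0.0], [0.9, 5.0, 1.0], [1.0, 5.0, 0.0]]
    labels = [0, 0, 1, 1]
    print(rank_features(features, labels, top_k=1))
```

In the supernet analogy, the retained indices would then determine which paths of the tree-like network need to be evaluated at test time.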

Cascading Learning Module. Discriminative features for a given task can also be formed in a learning-based manner, inspired by Hybrid Representation Learning (HRL) [27]. The main idea is to replace shallow layers of learning CNN with fixed HIR, such that discriminative features are formed in a space with geometric symmetries. According to related justifications [27], this strategy is able to achieve a discriminability level rivaling typical CNN on large-scale classification benchmarks, while exhibiting significantly better training compactness. In all experiments and applications of this paper, we still employ the feature/architecture selection strategy to show the superiority of the hierarchical invariant framework itself, taking also into account that the cascading CNN weakens invariance and interpretability to some extent.

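TABLE II: Classification Scores (%) and Runtime (Second) for Different Representations on a Small-scale Texture Benchmark.

| Method | Time (s; † = GPU) | Pre. (Original) | Rec. (Original) | F1 (Original) | Pre. (Orien. & Flip.) | Rec. (Orien. & Flip.) | F1 (Orien. & Flip.) |
|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | |
| Cosine | 5 | 70.74 | 67.50 | 66.85 | 69.65 | 66.25 | 65.30 |
| Wavelet | 6 | 69.43 | 64.38 | 64.68 | 62.34 | 58.13 | 57.82 |
| Kraw. | 5 | 70.67 | 67.50 | 66.30 | 64.41 | 60.00 | 59.55 |
| Learning: | | | | | | | |
| SimpleNet | 52† | 70.33 | 67.50 | 67.09 | 54.63 | 43.13 | 41.31 |
| SimpleNet+ | 52† | 46.93 | 49.38 | 46.06 | 47.18 | 48.13 | 44.93 |
| AlexNet | 42† | 98.82 | 98.75 | 98.75 | 91.69 | 91.25 | 91.28 |
| AlexNet+ | 41† | 87.61 | 84.38 | 84.05 | 88.37 | 85.63 | 85.76 |
| VGGNet | 266† | 99.41 | 99.38 | 99.37 | 92.18 | 91.25 | 91.37 |
| VGGNet+ | 609† | 91.34 | 90.00 | 89.81 | 92.15 | 91.25 | 91.08 |
| Invariant: | | | | | | | |
| ScatterNet | 42 | 98.89 | 98.75 | 98.75 | 84.98 | 83.13 | 83.08 |
| HIR | 27 | 96.98 | 96.88 | 96.87 | 96.32 | 96.25 | 96.23 |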

5 Experiments and Applications

In this section, we comprehensively evaluate the discriminability, robustness, and efficiency of HIR, covering simulation experiments in Section 5.1 and real-world applications in Section 5.2. Here, the main aim is to examine the representation properties promised in previous sections, as well as to position its discriminative power in the era of deep learning.

For Sections 4 and 5, we provide a code repository for HIR at https://github.com/ShurenQi/HIR. All experiments/applications are executed in Matlab R2023a under a Microsoft Windows environment, on a 2.90-GHz CPU, an RTX-3060 GPU, and 16 GB of RAM.

5.1 Computer Vision and Pattern Recognition

We perform classification experiments with HIR on typical sets of texture, digit, and parasite images, benchmarking its representation capabilities. Note that this series of simulation experiments examines the properties promised by our theory under diverse task scales and geometric variants, also with comparisons to a range of hand-crafted and learning representations.

Following the practice of Section 4, our HIR is implemented here as a single-scale network with scale parameter $w = 10$ and composition length $L = 6$; its invariant layer (27) is specialized to the average pooling, with $K = 0$ and $\#_B = 1$, for a fair comparison with the deep representations by average pooling. Note that the adaptability strategies of Section 4.2 are not employed here, for a direct assessment of its discriminative power. All features are fed into a PCA classifier, trained on features of the training set. Unless otherwise stated, the training and testing sets are formed without any crossover by random sampling at 80% and 20% ratios on the original dataset, respectively.
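The disjoint 80%/20% sampling can be sketched as below. This is an assumption-laden illustration: it uses plain uniform sampling over sample indices with a fixed seed, whereas the protocol above does not state whether sampling is stratified per class, and `split_indices` is a hypothetical helper name.

```python
import random

def split_indices(num_samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices once and cut them into disjoint train/test index lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = round(train_ratio * num_samples)
    return indices[:cut], indices[cut:]

if __name__ == "__main__":
    # Example with 810 samples, the size quoted for the texture benchmark in Section 5.1.1.
    train_idx, test_idx = split_indices(810)
    print(len(train_idx), len(test_idx))
    print(set(train_idx).isdisjoint(test_idx))
```

Because every index appears in exactly one of the two lists, no sample is shared between training and testing.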

The competing representations involved here can be summarized as follows:

  • Classical complete representation: discrete cosine transform as a global representation;

  • Classical over-complete representation: discrete wavelet transform [54] and Krawtchouk moments [55] as local representations, with different time-frequency resolutions;

  • Advanced over-complete representations: 1) typical CNNs, i.e., direct-learning CNN (denoted as SimpleNet), transfer-learning AlexNet [56] and VGGNet [57], also with data augmentation (denoted as ‘+’); 2) invariant CNNs, i.e., scattering networks [24] and our HIR. Broadly speaking, they can all be considered as a class of hierarchical invariant representations, but with different levels of invariance, where typical CNNs are invariant only to translations.

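TABLE III: Classification Scores (%) and Runtime (Second) for Different Representations on a Medium-scale Digit Benchmark.

| Method | Time (s; † = GPU) | Pre. (Original) | Rec. (Original) | F1 (Original) | Pre. (Trans. & Rota.) | Rec. (Trans. & Rota.) | F1 (Trans. & Rota.) |
|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | |
| Cosine | 15 | 45.68 | 45.35 | 45.43 | 32.50 | 31.20 | 30.83 |
| Wavelet | 16 | 67.11 | 66.75 | 66.75 | 38.31 | 35.25 | 35.44 |
| Kraw. | 15 | 71.73 | 69.85 | 69.69 | 27.05 | 26.30 | 25.83 |
| Learning: | | | | | | | |
| SimpleNet | 535† | 98.60 | 98.60 | 98.60 | 35.42 | 33.50 | 33.72 |
| SimpleNet+ | 551† | 52.70 | 48.90 | 48.82 | 54.26 | 50.75 | 50.78 |
| AlexNet | 393† | 100 | 100 | 100 | 66.18 | 64.45 | 64.27 |
| AlexNet+ | 392† | 93.07 | 92.00 | 91.80 | 94.23 | 93.10 | 92.93 |
| VGGNet | 3610† | 100 | 100 | 100 | 70.74 | 70.25 | 69.93 |
| VGGNet+ | 7731† | 95.98 | 95.70 | 95.68 | 95.53 | 95.20 | 95.13 |
| Invariant: | | | | | | | |
| ScatterNet | 115 | 98.96 | 98.95 | 98.95 | 57.20 | 56.95 | 56.23 |
| HIR | 57 | 97.48 | 97.45 | 97.45 | 95.05 | 94.95 | 94.98 |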

5.1.1 Texture

As shown in Fig. 4, the experiment is executed on the KTH-TIPS dataset (https://www.csc.kth.se/cvap/databases/kth-tips/index.html), a typical benchmark for texture image classification. This dataset has 10 classes, each containing 81 instances; the total size is $10 \times 81 = 810$, and hence it is considered a small-scale vision problem.

As shown in Table 2, we list performance scores of the competing representations on this benchmark, as well as the elapsed time, i.e., CPU featuring time or GPU training time. Besides this direct protocol on the original dataset, we also consider testing image variants with random orientation (w.r.t. $\{0, 90, 180, 270\}$ degrees) or flipping (w.r.t. the $x$ or $y$ axis).

  • The classical (over-)complete representations fail to achieve a satisfactory level of discriminability, even in the direct protocol of such a small-scale benchmark.

  • The learning CNN family achieves significantly higher scores due to its over-complete and data-adaptive properties, especially the AlexNet and VGGNet with large-scale pre-training and transfer learning. In contrast, the SimpleNet performs relatively poorly, indicating the sensitivity of learning to network size and training strategy. Under the variant protocol, they exhibit a significant performance degradation, suggesting that the learned features lack invariance w.r.t. natural geometric variations of texture. After introducing the augmented training, the CNN scores become more stable, but at the cost of discriminability. A potential reason for this phenomenon is the small amount of training data. Moreover, the computational cost is considerable for this small-scale problem, and a certain training instability is observed.

  • The scattering networks provide a high level of discriminability and robustness without feature training and data augmentation, indicating the success of extending classical wavelets to deep representations.

  • Our work further extends such success: the HIR achieves a level of discriminability similar to that of the learning CNN family, while exhibiting superior robustness in the variant protocol over all competing representations. In particular, such representation success builds on our compact and efficient framework, with lower runtimes than the scattering networks and the learning CNN family.
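The variant protocol used in this texture experiment can be sketched as follows. The sketch assumes that each test image receives exactly one operation drawn uniformly from the four orientations and the two flips; the text does not specify how orientation and flipping are combined, so this choice, like the helper names, is an assumption. Images are represented as plain 2-D lists.

```python
import random

def rot90(img):
    """Rotate a 2-D list-of-lists image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def flip_x(img):
    """Mirror about the horizontal axis (reverse the row order)."""
    return [list(row) for row in img[::-1]]

def flip_y(img):
    """Mirror about the vertical axis (reverse each row)."""
    return [list(row[::-1]) for row in img]

def random_variant(img, rng):
    """Apply one randomly chosen orientation (0/90/180/270 degrees) or flip."""
    ops = [
        lambda a: [list(row) for row in a],   # 0 degrees (copy)
        rot90,                                # 90 degrees
        lambda a: rot90(rot90(a)),            # 180 degrees
        lambda a: rot90(rot90(rot90(a))),     # 270 degrees
        flip_x,
        flip_y,
    ]
    return rng.choice(ops)(img)

if __name__ == "__main__":
    img = [[1, 2], [3, 4]]
    print(rot90(img))
    print(random_variant(img, random.Random(0)))
```

All six operations only permute pixel positions, so no interpolation is involved and the pixel values of the test image are preserved exactly.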

5.1.2 Digit

As shown in Fig. 4, the experiment is executed on a digit dataset (https://ww2.mathworks.cn/help/deeplearning/ug/data-sets-for-deep-learning.html) for classification, similar to the MNIST benchmark. This dataset has 10 classes from ‘0’ to ‘9’, each containing 1000 instances with rich font differences and geometric distortions; the total size is $10 \times 1000 = 10000$, and hence it is considered a medium-scale vision problem.

As shown in Table 3, we list performance scores and elapsed times of the competing representations on this benchmark. Besides this direct protocol, we also consider testing image variants with random translation (w.r.t. $-2 \sim 2$ pixels in axial directions) and random rotation (w.r.t. $0 \sim 360$ degrees around the center).

  • As the problem size increases, the complete Cosine exhibits a significant degradation, while the over-complete Wavelet and Krawtchouk are more stable, revealing the role of over-completeness in the discriminability. Regarding robustness, such representations all fail against translation and rotation variants, implying the challenging nature of this protocol.

  • In general, the learning CNN family continues the performance in Section 5.1.1, further confirming the sensitivity of learning to network size, training strategy, and geometric variants. Its robustness is significantly increased after augmented training. However, theoretically, the resulting robustness is not guaranteed for unseen data distributions (even for similar variants with unseen parameters). One can note the rapid expansion of computational cost: the transfer learning of VGGNet even takes $\sim$2 GPU hours.

  • The hand-crafted scattering networks still provide a good level of discriminability here, further validating their success. However, they exhibit unsatisfactory scores for translation and rotation variants, even lower than the CNN without augmentation, failing to achieve the expected robustness.

  • The HIR significantly outperforms the main competitor, i.e., scattering networks. Despite the increased problem size, it still achieves a level of discriminability similar to that of the learning CNN family, also under the constraints of invariance and compactness. Note that our HIR is the only method exhibiting confusion between classes ‘6’ and ‘9’ in the direct training, while achieving $\sim$100% scores for the rest. This phenomenon is in line with the rotation invariance and discriminative power expected by our theory.

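TABLE IV: Classification Scores (%) and Runtime (Second, for Train./Test. = 8/2) for Different Representations on a Large-scale Parasite Benchmark.

| Method | Time (s; † = GPU) | Pre. (Train./Test. = 8/2) | Rec. (Train./Test. = 8/2) | F1 (Train./Test. = 8/2) | Pre. (Train./Test. = 1/9) | Rec. (Train./Test. = 1/9) | F1 (Train./Test. = 1/9) |
|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | |
| Cosine | 37 | 36.19 | 32.60 | 29.85 | 49.40 | 41.97 | 43.80 |
| Wavelet | 39 | 41.68 | 45.20 | 41.79 | 53.69 | 47.97 | 49.27 |
| Kraw. | 42 | 66.56 | 69.49 | 67.21 | 71.60 | 57.88 | 61.10 |
| Learning: | | | | | | | |
| SimpleNet | 2244† | 90.15 | 89.25 | 89.65 | 84.51 | 76.14 | 78.84 |
| AlexNet | 1796† | 98.87 | 98.40 | 98.63 | 95.92 | 94.69 | 95.27 |
| VGGNet | 9184† | 99.24 | 98.97 | 99.11 | 97.95 | 97.37 | 97.65 |
| Invariant: | | | | | | | |
| ScatterNet | 1277 | 68.41 | 69.71 | 67.55 | 72.52 | 63.30 | 65.70 |
| HIR | 823 | 88.73 | 92.18 | 90.10 | 91.26 | 88.76 | 89.85 |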

5.1.3 Parasite

As shown in Fig. 4, the experiment is executed on a micrographic dataset (https://data.mendeley.com/datasets/38jtn4nzs6/3), a typical benchmark for parasite image classification. This dataset has 6 parasite classes and 2 host classes, with real-world diversity regarding imaging, background, morphology, and geometry; the total size is 34298, and hence it is considered a large-scale vision problem.

As shown in Table 4, we list performance scores and elapsed times of the competing representations on this benchmark. Note that we also consider a protocol with different training-testing ratios to analyze the data dependence and sample efficiency.

  • In this large-scale problem, the scores of the classical representations drop further, implying a limited level of discriminability. On the other hand, their performance is relatively stable when training samples are reduced, and even better in the $1/9$ case, indicating a good efficiency.

  • In the learning CNN family, the direct-learning SimpleNet exhibits a clear data dependence. Specifically, it achieves $\sim$90% scores in the $8/2$ case (similar to HIR), while the scores drop significantly in the $1/9$ case (lower than HIR). In contrast, the AlexNet and VGGNet achieve good discriminability and stability in the $1/9$ case, revealing that the transfer strategy effectively inherits the pre-training prior on ImageNet. On the other hand, the cost of pre-training and transfer learning is still considerable, without guaranteed robustness or adaptability for a given data domain.

  • Despite outperforming the original wavelets, scattering networks fail to provide a competitive discriminability in the era of deep learning. Here, the common failure of such hand-crafted representations on larger-scale discriminability can be regarded as important evidence for our motivation.

  • The HIR achieves a SimpleNet-level discriminability, significantly outperforming its direct competitor, the scattering networks. Also, the HIR is not sensitive to the reduction of training samples, outperforming the learning CNN family in data dependence and sample efficiency. Note that the discriminability of the fixed features from HIR is still lower than that of transfer learning with large-scale pre-training. Therefore, in the next applications, the HIR features will be empowered with the data adaptability strategies of Section 4.2.

Figure 5: Illustration for the datasets from the digital forensic and forgery detection experiments: (a) adversarial perturbation image dataset; (b) AIGC image dataset.

TABLE V: Adversarial Perturbation Forensic Scores (F1, %) for Different Representations w.r.t. Various Types of Perturbations.

| Method | BIM | CW | DAmage | FGSM | PGD | UP | Average | Worst |
|---|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | | |
| Cosine NN | 34.63 | 33.19 | 90.78 | 39.80 | 34.69 | 2.22 | 39.22 | 2.22 |
| Cosine SVM | 79.57 | 83.34 | 97.26 | 78.24 | 79.22 | 96.68 | 85.72 | 78.24 |
| Wavelet NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Wavelet SVM | 72.83 | 82.09 | 97.77 | 78.21 | 71.80 | 95.87 | 83.10 | 71.80 |
| Krawtchouk NN | 66.43 | 66.49 | 90.86 | 66.43 | 66.44 | 0.00 | 59.44 | 0.00 |
| Krawtchouk SVM | 0.00 | 55.87 | 0.00 | 56.44 | 0.00 | 70.37 | 30.45 | 0.00 |
| Learning: | | | | | | | | |
| SimpleNet | 4.24 | 3.24 | 92.13 | 49.89 | 33.13 | 99.86 | 47.08 | 3.24 |
| AlexNet | 90.20 | 72.72 | 96.63 | 94.61 | 90.91 | 98.45 | 90.59 | 72.72 |
| VGGNet | 96.04 | 62.50 | 99.08 | 98.12 | 96.99 | 99.15 | 91.98 | 62.50 |
| GoogLeNet | 90.29 | 80.04 | 97.09 | 95.29 | 89.94 | 98.75 | 91.90 | 80.04 |
| ResNet | 90.22 | 75.59 | 97.35 | 94.66 | 90.17 | 98.40 | 91.07 | 75.59 |
| DenseNet | 98.93 | 90.19 | 99.34 | 99.23 | 98.85 | 99.76 | 97.72 | 90.19 |
| InceptionNet | 98.70 | 85.14 | 97.38 | 97.32 | 98.66 | 99.41 | 96.10 | 85.14 |
| MobileNet | 92.51 | 82.67 | 97.37 | 96.81 | 92.10 | 98.19 | 93.27 | 82.67 |
| Invariant: | | | | | | | | |
| ScatterNet NN | 81.30 | 70.23 | 95.27 | 91.17 | 82.65 | 94.64 | 85.88 | 70.23 |
| ScatterNet SVM | 84.40 | 69.49 | 96.77 | 90.57 | 83.86 | 95.12 | 86.70 | 69.49 |
| HIR NN | 89.66 | 84.92 | 98.89 | 93.26 | 90.08 | 97.78 | 92.43 | 84.92 |
| HIR SVM | 92.30 | 89.10 | 99.30 | 95.96 | 91.60 | 98.93 | 94.53 | 89.10 |

TABLE VI: Adversarial Perturbation Forensic Scores (%) for Different Representations on a Real-world (Hybrid) Benchmark.

| Method | Pre. (Train./Test. = 5/5) | Rec. (Train./Test. = 5/5) | F1 (Train./Test. = 5/5) | Pre. (Train./Test. = 1/9) | Rec. (Train./Test. = 1/9) | F1 (Train./Test. = 1/9) |
|---|---|---|---|---|---|---|
| Classical: | | | | | | |
| Cosine NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cosine SVM | 79.08 | 73.33 | 76.10 | 81.13 | 68.79 | 74.45 |
| Wavelet NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Wavelet SVM | 77.53 | 66.95 | 71.85 | 76.05 | 61.13 | 67.78 |
| Kraw. NN | 50.53 | 15.22 | 23.40 | 50.00 | 15.10 | 23.20 |
| Kraw. SVM | 50.03 | 65.34 | 56.67 | 49.75 | 48.77 | 49.26 |
| Learning: | | | | | | |
| SimpleNet | 47.31 | 48.11 | 47.71 | 50.59 | 63.63 | 56.36 |
| AlexNet | 81.46 | 87.35 | 84.30 | 72.24 | 61.36 | 66.35 |
| VGGNet | 81.41 | 90.04 | 85.51 | 78.83 | 75.35 | 77.05 |
| GoogLeNet | 82.74 | 85.46 | 84.08 | 63.35 | 57.74 | 60.42 |
| ResNet | 80.93 | 84.70 | 82.77 | 68.48 | 66.64 | 67.55 |
| DenseNet | 87.92 | 93.25 | 90.51 | 82.07 | 83.96 | 83.00 |
| InceptionNet | 84.60 | 90.92 | 87.65 | 69.58 | 70.77 | 70.17 |
| MobileNet | 83.07 | 88.07 | 85.50 | 68.73 | 69.50 | 69.11 |
| Invariant: | | | | | | |
| Scatter. NN | 69.85 | 68.94 | 69.39 | 74.93 | 77.31 | 76.10 |
| Scatter. SVM | 75.70 | 72.07 | 73.84 | 76.42 | 78.63 | 77.51 |
| HIR NN | 81.27 | 80.68 | 80.98 | 79.09 | 82.17 | 80.60 |
| HIR SVM | 86.20 | 86.06 | 86.13 | 83.42 | 83.29 | 83.35 |

5.2 Digital Forensic and Forgery Detection

For real-world applications, we employ the HIR for large-scale digital forensics, i.e., detection of adversarial perturbations and AIGC, to directly check its usefulness in robust and interpretable tasks. Note that this plug-and-play strategy will be compared not only to similar representations, but also to a range of current forensic solutions, including well-designed deep forensics.

Following the practice of Section 4, our HIR is implemented here as a single-scale network with scale parameter $w = 10$ and composition length $L = 7$; its invariant layer (27) is specialized with $K = N_{ij}/2$ and $\#_B = 30$, for improving the discriminability of digital artifacts. Note that the feature/architecture selection strategy of Section 4.2 is employed for data adaptability and discriminability, where the top-ranked 500- and 1000-dimensional features are selected for AIGC and adversarial perturbation, respectively. All features are fed into both NN and SVM classifiers, for evaluating the sensitivity w.r.t. the classifier. Unless otherwise stated, the training and testing sets are formed without any crossover by random sampling at 50% and 50% ratios on the original dataset, respectively.

Figure 6: A comparison of adversarial perturbation forensic scores (F1, %) w.r.t. current forensic solutions on the UP benchmark.

The competing methods involved here can be summarized as follows:

  • All the representations in Section 5.1 as direct forensics;

  • More deep representation milestones for direct forensics, i.e., GoogLeNet [58], ResNet [59], DenseNet [60], InceptionNet [61], and MobileNet [62];

  • The forensic methods designed for adversarial perturbation, i.e., arXiv’17 [63], ICLR’18 [64], TDSC’18 [65], IJCV’19 [66], CVPR’19 [67], TDSC’20 [68], and arXiv’23 [69];

  • The forensic methods designed for AIGC, i.e., ECCV’20 [70], CVPR’20a [71], and CVPR’20b [72].

5.2.1 Adversarial Perturbations

As shown in Fig. 5, the dataset ImageNet (https://www.image-net.org/) is perturbed through 6 adversarial methods, i.e., BIM [73], CW [74], Damage [75], FGSM [76], PGD [77], and UP [78], respectively, resulting in 6 benchmarks, each containing 5000 clean images and 5000 perturbed versions. This task exhibits real-world discriminative challenges, in light of the rich variability of the perturbations themselves and the underlying ImageNet.

In Fig. 6, we first provide a comparison with the current solutions of perturbation forensics on the basic and realistic UP benchmark. Despite the fixed perturbation pattern, there are still competing methods failing to achieve good scores. Such methods rely on under-complete representations and are thereby unable to comprehensively capture perturbation patterns. In contrast, the over-complete arXiv’23 and our HIR both achieve $>$ 90% scores, further revealing the fundamental role of representation in forensic tasks. Thus, we will next further compare relevant representation strategies.

In Table 5, we train and test all representations on the 6 benchmarks, presenting the corresponding F1 scores, as well as the average and worst score statistics. This protocol exhibits richer intra-class variability over the fixed perturbation.

  • The frequency difference between natural and perturbed data is a fruitful forensic clue. Therefore, the classical (time)-frequency representations achieve higher scores than generally expected on this large-scale problem. However, such features exhibit significant sensitivity to classifiers. A potential reason is the restricted separability, where one must resort to complex classification strategies in the feature space.

  • In the learning CNN family, all large-scale networks exhibit $>$ 90% average scores, especially DenseNet and InceptionNet. The phenomenon suggests that the transfer learning is good at capturing discriminative features with sufficient training data and aligned testing protocol. As for the attacks, the CW is more challenging and dominates the worst scores, mainly due to its variable and weak patterns.

  • The scattering networks achieve similar scores and much better classifier stability than the original wavelets, suggesting an improvement in the separability. However, their average scores do not reach 90%, failing to provide a level of discriminability similar to that of the learning CNN.

  • Our HIR is very robust to classifier changes, also achieving a MobileNet-level of discriminability, slightly lower than DenseNet and InceptionNet, and significantly better than the direct competitor scattering networks. Therefore, our strategy has a better combined performance in robustness, interpretability, and discriminability. Its efficiency benefit will be highlighted in the next experimental protocol.

In Table 6, we train and test all representations on a hybrid of the 6 perturbation benchmarks, presenting scores at two training-testing ratios. This protocol is more challenging due to very complex intra-class variability, while being more practical for real-world forensic scenarios.

Figure 7: A comparison of AIGC forensic scores (F1, %) w.r.t. current forensic solutions on SD 1.5 benchmark.
TABLE VII: AIGC Forensic Scores (F1, %) for Different Representations w.r.t. Various Types of Generators.

| Method | ADM | BGAN | GLIDE | Midjourney | SD1.4 | SD1.5 | VQDM | Wukong | Average | Worst |
|---|---|---|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | | | | |
| Cosine NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.09 | 0.00 | 0.00 | 8.14 | 0.00 |
| Cosine SVM | 99.19 | 99.95 | 99.57 | 89.02 | 99.10 | 98.80 | 99.46 | 99.11 | 98.03 | 89.02 |
| Wavelet NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.08 | 0.00 | 0.00 | 0.26 | 0.00 |
| Wavelet SVM | 99.98 | 99.70 | 99.87 | 85.55 | 98.63 | 99.04 | 99.97 | 99.38 | 97.76 | 85.55 |
| Krawtchouk NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Krawtchouk SVM | 99.75 | 99.60 | 98.49 | 62.44 | 74.11 | 77.98 | 93.63 | 76.00 | 85.25 | 62.44 |
| Learning: | | | | | | | | | | |
| SimpleNet | 98.25 | 97.87 | 92.98 | 68.00 | 73.52 | 74.37 | 74.88 | 76.32 | 82.02 | 68.00 |
| AlexNet | 94.45 | 98.99 | 98.26 | 81.52 | 87.96 | 88.66 | 84.24 | 88.62 | 90.34 | 81.52 |
| VGGNet | 99.40 | 99.38 | 98.57 | 86.44 | 89.97 | 91.86 | 93.60 | 90.09 | 93.66 | 86.44 |
| GoogLeNet | 80.30 | 99.18 | 98.16 | 75.00 | 82.77 | 82.44 | 86.75 | 82.32 | 85.87 | 75.00 |
| ResNet | 98.78 | 99.14 | 97.78 | 87.41 | 89.88 | 90.85 | 88.53 | 88.80 | 92.65 | 87.41 |
| DenseNet | 99.63 | 99.60 | 98.57 | 93.08 | 93.79 | 94.50 | 95.01 | 92.55 | 95.84 | 92.55 |
| InceptionNet | 97.69 | 99.41 | 98.32 | 90.07 | 89.40 | 92.55 | 92.72 | 88.35 | 93.56 | 88.35 |
| MobileNet | 90.08 | 99.28 | 97.95 | 87.49 | 88.51 | 90.75 | 87.74 | 88.29 | 91.26 | 87.49 |
| Invariant: | | | | | | | | | | |
| ScatterNet NN | 99.10 | 99.63 | 98.44 | 79.47 | 89.26 | 89.95 | 96.71 | 89.07 | 92.70 | 79.47 |
| ScatterNet SVM | 99.18 | 99.67 | 99.05 | 85.21 | 95.85 | 95.58 | 97.02 | 94.60 | 95.77 | 85.21 |
| HIR NN | 99.92 | 99.97 | 99.83 | 92.63 | 98.57 | 98.97 | 99.92 | 98.58 | 98.55 | 92.63 |
| HIR SVM | 99.90 | 99.92 | 99.78 | 92.18 | 99.07 | 99.26 | 99.87 | 99.42 | 98.68 | 92.18 |

  • In line with previous observations, the classical representations still exhibit score fluctuations across the two classifiers. We also note a performance degradation compared to the case of Table 5, due to the discriminative challenges posed by this hybrid protocol. On the other hand, their performance is stable w.r.t. the reduction of training samples, further validating the inherent advantages in sample efficiency.

  • Moving into this hybrid benchmark, the learning CNN family yields consistent and large performance degradation, especially for the $1/9$ case with fewer samples. This phenomenon is direct evidence for the data dependence in learning representations (even with transfer strategy). In fact, real-world forensics often face the situation where the perturbation types are diverse and some of them lack samples. Therefore, such data-dependent forensics typically exhibit time-consuming (re-)training, while failing to guarantee their validity for under-sampled perturbation patterns.

  • The scattering networks basically continue the discriminability level and classifier stability from Table 5. Note that their scores in the $1/9$ case are higher than those of most classical and learning representations, reflecting the superior performance in both discriminability and efficiency.

  • In this challenging protocol, the hand-crafted HIR still achieves a learning-level discriminability and consistently outperforms scattering networks. More importantly, our HIR is significantly less dependent on training samples than learning CNN, meaning it can better cope with under-sampled perturbation patterns in practice. For the next larger-scale forensic task, the comprehensive advantages of HIR over learning CNN will be further highlighted, in robustness, interpretability, discriminability, and efficiency.
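As a reading aid for the Pre., Rec., and F1 columns of Table 6, the following sketch checks a few rows against the standard binary F1 definition, i.e., the harmonic mean of precision and recall. That Table 6 uses this definition is an assumption inferred from the reported numbers, which it reproduces to two decimals; the multi-class tables of Section 5.1 do not satisfy this identity row-wise, which would be consistent with class-wise averaging there. The function name `f1_from_pr` is hypothetical.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0.0:
        return 0.0  # covers degenerate rows where both scores are 0.00
    return 2.0 * precision * recall / (precision + recall)

# (Pre., Rec., reported F1) taken from the Train./Test. = 5/5 columns of Table 6.
rows = {
    "Cosine SVM": (79.08, 73.33, 76.10),
    "DenseNet": (87.92, 93.25, 90.51),
    "HIR SVM": (86.20, 86.06, 86.13),
}

for name, (pre, rec, reported) in rows.items():
    print(name, round(f1_from_pr(pre, rec), 2), reported)
```

Rows with 0.00 in all three columns correspond to classifiers that never predict the positive class correctly, for which the harmonic mean is taken as zero.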

TABLE VIII: AIGC Forensic Robustness Scores (F1, %) for Different Representations w.r.t. Various Types of Generators. All columns: testing with random orientation and flipping.

| Method | ADM | BGAN | GLIDE | Midjourney | SD1.4 | SD1.5 | VQDM | Wukong | Average | Worst |
|---|---|---|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | | | | |
| Cosine SVM | 99.16 | 99.95 | 99.55 | 88.07 | 99.05 | 98.72 | 99.36 | 99.11 | 97.87 | 88.07 |
| Wavelet SVM | 99.95 | 99.80 | 99.85 | 82.83 | 99.08 | 98.71 | 99.95 | 99.16 | 97.42 | 82.83 |
| Krawtchouk SVM | 70.90 | 99.58 | 98.57 | 64.90 | 77.28 | 76.58 | 94.44 | 76.72 | 82.37 | 64.90 |
| Learning: | | | | | | | | | | |
| SimpleNet | 77.72 | 95.04 | 92.99 | 65.25 | 74.52 | 74.72 | 73.62 | 76.91 | 78.85 | 65.25 |
| AlexNet | 81.82 | 99.08 | 97.99 | 77.45 | 85.87 | 87.93 | 83.29 | 86.78 | 87.53 | 77.45 |
| VGGNet | 76.18 | 99.40 | 98.41 | 82.23 | 89.26 | 89.30 | 88.79 | 89.00 | 89.07 | 76.18 |
| GoogLeNet | 80.62 | 99.30 | 98.09 | 73.13 | 81.29 | 81.92 | 85.36 | 82.31 | 85.25 | 73.13 |
| ResNet | 85.62 | 99.28 | 97.60 | 81.82 | 85.60 | 86.84 | 87.94 | 85.21 | 88.74 | 81.82 |
| DenseNet | 84.57 | 99.56 | 98.66 | 88.57 | 91.16 | 91.60 | 94.26 | 89.47 | 92.23 | 84.57 |
| InceptionNet | 91.99 | 99.30 | 98.56 | 87.04 | 85.72 | 88.85 | 92.19 | 85.24 | 91.11 | 85.24 |
| MobileNet | 85.14 | 99.36 | 97.78 | 84.54 | 86.74 | 88.80 | 87.50 | 86.56 | 89.55 | 84.54 |
| Invariant: | | | | | | | | | | |
| ScatterNet SVM | 92.61 | 99.67 | 99.18 | 82.75 | 88.81 | 89.40 | 97.05 | 83.83 | 91.66 | 82.75 |
| HIR SVM | 99.88 | 99.87 | 99.77 | 91.99 | 99.02 | 99.25 | 99.83 | 99.45 | 98.63 | 91.99 |

TABLE IX: AIGC Forensic Scores (%) for Different Representations on a Real-world (Hybrid) Benchmark
Train./Test. ratio: 5/5 (first three score columns), 1/9 (last three score columns)
Method Pre. Rec. F1 Pre. Rec. F1
Classical:
Cosine NN 0.00 0.00 0.00 0.00 0.00 0.00
Cosine SVM 94.95 94.57 94.76 94.36 91.06 92.68
Wavelet NN 48.70 94.17 64.20 48.69 94.13 64.18
Wavelet SVM 94.03 94.57 94.30 83.55 93.48 88.24
Kraw. NN 0.00 0.00 0.00 0.00 0.00 0.00
Kraw. SVM 75.24 74.77 75.00 71.56 68.57 70.03
Learning:
SimpleNet 61.79 40.70 49.08 56.40 60.48 58.37
AlexNet 80.76 77.63 79.16 71.83 72.50 72.17
VGGNet 84.75 86.67 85.70 72.45 72.37 72.41
GoogLeNet 74.15 80.40 77.15 67.84 68.83 68.33
ResNet 85.10 83.03 84.06 76.88 73.67 75.24
DenseNet 86.83 85.23 86.02 76.84 75.37 76.10
InceptionNet 82.69 86.63 84.62 68.62 68.56 68.59
MobileNet 81.54 82.47 82.00 68.52 68.57 68.55
Invariant:
Scatter. NN 83.68 83.73 83.71 79.37 79.70 79.53
Scatter. SVM 90.31 85.17 87.67 85.28 79.70 82.40
HIR NN 96.79 96.47 96.63 95.66 93.04 94.33
HIR SVM 96.92 96.37 96.64 95.21 94.26 94.73

5.2.2 Artificial Intelligence Generated Content

As shown in Fig. 5, fake images with similar content to ImageNet are synthesized through 8 AIGC methods, i.e., ADM [79], BGAN [80], GLIDE [81], Midjourney (https://www.midjourney.com/home), SD 1.4 [82], SD 1.5 [82], VQDM [83], and Wukong (https://xihe.mindspore.cn/modelzoo/wukong), respectively, resulting in 8 benchmarks, each containing 6000 natural images and 6000 synthesized images. This task exhibits a higher level of discriminative challenge, in light of the very rich variability of both natural and synthesized content.

In Fig. 7, we first provide a comparison with the current solutions of AIGC forensics on the basic SD 1.5 benchmark. All these methods based on deep networks and feature enhancements achieve good forensic scores. This indicates that the forensic scenario with sufficient training and aligned testing is not challenging for typical learning representations. Here, the hand-crafted HIR also achieves $\sim$100% scores, meaning a similar discriminability for this forensic scenario. Next, we will compare scores in a comprehensive manner, mainly at the representation level, with some AIGC forensic solutions as references.

In Table 7, we train and test all representations on the 8 benchmarks, presenting the corresponding F1 scores, as well as the average and worst score statistics. Besides this direct protocol, we also consider testing image variants with random orientation or flipping in Table 8, reflecting the basic geometric robustness requirements.
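The robustness protocol above can be illustrated with a small sketch. It assumes images are plain nested lists and that "random orientation" is restricted to multiples of 90 degrees, so no interpolation is needed; the actual experiments may well use arbitrary angles. The function names (`rotate90`, `flip_horizontal`, `random_orientation_or_flip`) are hypothetical and not taken from the paper.

```python
import random

def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2-D image left to right."""
    return [row[::-1] for row in image]

def random_orientation_or_flip(image, rng):
    """Return a copy rotated by a random multiple of 90 degrees,
    followed by a horizontal flip with probability 0.5 (assumed policy)."""
    variant = [list(row) for row in image]
    for _ in range(rng.randrange(4)):
        variant = rotate90(variant)
    if rng.random() < 0.5:
        variant = flip_horizontal(variant)
    return variant

rng = random.Random(0)          # fixed seed for a repeatable test set
image = [[1, 2, 3],
         [4, 5, 6]]
variant = random_orientation_or_flip(image, rng)
```

Such variants only rearrange pixels, so an ideally invariant representation should return the same features for `image` and `variant`; any score gap between the direct and robustness protocols then measures how far a representation is from that ideal.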

  • One can observe that the frequency forensic clue of the AI-generated pipeline is still very effective. The classical representations based on the SVM classifier achieve consistently good accuracy and robustness, suggesting that state-of-the-art generators (even diffusion ones) still exhibit inherent frequency artifacts. On the other hand, such hand-crafted features are sensitive to the choice of classifier, in line with the observations of Section 5.2.1.

  • The learning representations other than SimpleNet and GoogLeNet achieve > 90% average scores, further confirming their good discriminability with sufficient training data and aligned testing protocol. However, for the robustness protocol in Table 8, they exhibit varying degrees of performance degradation, in both average and worst statistics. Clearly, even natural and slight shifts in the data distribution can strongly interfere with the learning forensics. In particular, such interference is highly black-boxed (i.e., unpredictable), where an example is the significantly higher fluctuations on ADM compared to others.

  • The scattering networks exhibit a similar level of discriminability and robustness to the learning representations, while outperforming the original wavelets in classifier stability. Note that scattering networks fail to achieve the expected invariance and thus cannot provide higher robustness scores than learning CNN.

  • Regarding the discriminability, geometric invariance, and classifier stability, our HIR achieves better combined performance versus classical representations, scattering networks, and learning CNN. This is in line with our theoretical expectation that HIR combines the advantages of both hand-crafted and learning representations. Its efficiency will be further highlighted in the next experimental protocol.
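The average and worst statistics used in these comparisons are simple to reproduce. The following is a minimal sketch using the HIR SVM and ScatterNet SVM rows of Table VIII as input; the helper name `summarize` is hypothetical.

```python
GENERATORS = ["ADM", "BGAN", "GLIDE", "Midjourney",
              "SD1.4", "SD1.5", "VQDM", "Wukong"]

# F1 scores (%) copied from two rows of Table VIII (robustness protocol).
f1_scores = {
    "HIR SVM":        [99.88, 99.87, 99.77, 91.99, 99.02, 99.25, 99.83, 99.45],
    "ScatterNet SVM": [92.61, 99.67, 99.18, 82.75, 88.81, 89.40, 97.05, 83.83],
}

def summarize(scores):
    """Return (average, worst, generator with the worst score)."""
    average = sum(scores) / len(scores)
    worst = min(scores)
    hardest = GENERATORS[scores.index(worst)]
    return round(average, 2), worst, hardest

for method, scores in f1_scores.items():
    average, worst, hardest = summarize(scores)
    print(f"{method}: average={average:.2f}, worst={worst:.2f} ({hardest})")
```

The worst-case column is the more informative one for forensics, since a detector is only as trustworthy as its score on the hardest generator; for both rows here that generator is Midjourney.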

In Table 9, we train and test all representations on a hybrid of the 8 AIGC benchmarks, presenting scores at two training-testing ratios. This protocol is more challenging due to very complex intra-class variability, while being more practical for real-world forensic scenarios.
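A hybrid split of this kind might be built as in the sketch below. The paper does not specify how the pooled data are partitioned, so stratifying per generator and per label is an assumption, as are the function name `hybrid_split` and the toy data.

```python
import random

def hybrid_split(benchmarks, train_fraction, seed=0):
    """Pool several benchmarks and split them into train/test sets.

    benchmarks: dict mapping generator name -> list of (sample_id, label).
    The split is stratified per generator and per label (an assumption).
    """
    rng = random.Random(seed)
    train, test = [], []
    for name, samples in benchmarks.items():
        for label in sorted({lab for _, lab in samples}):
            group = [s for s in samples if s[1] == label]
            rng.shuffle(group)
            cut = round(train_fraction * len(group))
            train += [(name, *s) for s in group[:cut]]
            test += [(name, *s) for s in group[cut:]]
    return train, test

# Toy stand-in: 2 generators, each with 10 natural (0) and 10 synthesized (1) samples.
toy = {g: [(i, i % 2) for i in range(20)] for g in ("SD1.5", "ADM")}
train_19, test_19 = hybrid_split(toy, train_fraction=0.1)   # 1/9 ratio
train_55, test_55 = hybrid_split(toy, train_fraction=0.5)   # 5/5 ratio
```

With the 1/9 ratio, each generator contributes only a tenth of its samples to training, which is what makes this setting a test of sample efficiency.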

  • The classical representations exhibit good discriminability for this hybrid benchmark, along with the benefit of sample efficiency. In line with previous observations, they still exhibit score fluctuations on the two classifiers.

  • The forensic scores of the learning CNN family drop significantly, with average scores of only $\sim$70% in the 1/9 case. This further illustrates the weakness of learning forensic algorithms in dealing with real-world scenarios, i.e., the data dependence problem.

  • The scattering networks inherit the discriminability level and classifier stability from Table 7, also with better scores than all learning representations in the 1/9 case. These results all indicate their superiority in combined performance.

  • Here, the HIR achieves the highest scores over classical representations, scattering networks, and learning CNN. Its discriminability allows for hybrid forensics of the 8 AIGC methods in the 1/9 case with fewer samples, implying its usefulness in real-world forensic scenarios. Note that the ECCV’20, CVPR’20a, and CVPR’20b methods specifically designed for AIGC forensics exhibit 79.67%, 73.74%, and 76.89% F1 scores in the 1/9 case, respectively. In summary, our HIR yields consistently good discriminability, robustness, and efficiency in a plug-and-play way, spanning 3 classical vision tasks and 2 forensic tasks. No other representation or forensic method achieves such results in these experiments.
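The F1 scores quoted above are consistent with the usual harmonic mean of precision and recall, as the following sketch illustrates with values copied from the 1/9 columns of Table IX. The helper name `f1_score` is hypothetical, and treating the reported F1 as a plain harmonic mean is an assumption that happens to match the tabulated values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in %)."""
    if precision + recall == 0:
        return 0.0          # e.g. a classifier that detects nothing
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from the Train./Test. = 1/9 columns of Table IX.
rows_1_9 = {
    "Cosine SVM":   (94.36, 91.06),
    "Scatter. SVM": (85.28, 79.70),
    "HIR SVM":      (95.21, 94.26),
}
for name, (p, r) in rows_1_9.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")

# Reported F1 (%) of the eight learning CNNs in the 1/9 case.
cnn_f1 = [58.37, 72.17, 72.41, 68.33, 75.24, 76.10, 68.59, 68.55]
cnn_mean = sum(cnn_f1) / len(cnn_f1)
print(f"Learning CNN family, mean F1 = {cnn_mean:.2f}")
```

The mean over the CNN family comes out just under 70%, in line with the figure quoted in the discussion of the learning representations.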

6 Conclusion

In this paper, we have systematically investigated the topic of hierarchical invariance, as an early attempt to harmonize the divergence between typical CNN and invariants w.r.t. discriminability and robustness. Compared to related research approaches, our hierarchical invariant representation can be characterized as 1) principled and interpretable design, 2) efficient invariant structure, and 3) competitive discriminability in the era of deep learning.

The theory ingredients of our work are as follows.

  • We have formalized a blueprint for hierarchical invariance, rethinking the typical modules of CNN representations.

  • We have defined new modules with their compositions to fulfill the blueprint, providing formal conclusions about the geometric symmetries between image and representation.

  • We have discussed the criticisms and developments of the above new idea versus typical concepts, highlighting our uniqueness in moving towards robust and interpretable representations.

The practice ingredients of our work are as follows.

  • We have provided a specific framework for the theory of hierarchical invariance, covering practical principles about the topologies, layers, and parameters.

  • We have explored the data adaptability potential of the above framework, resorting to feature/architecture selection or a cascading learning module.

The application ingredients of our work are as follows.

  • We have conducted pattern classification experiments on typical sets of texture, digit, and parasite images, respectively, examining the properties promised by our theory under diverse task scales and geometric variants.

  • We have checked the realistic usefulness in large-scale digital forensics of adversarial perturbation and AIGC. Here, the HIR yields consistently good discriminability, robustness, efficiency, and interpretability in a plug-and-play way, exhibiting competitive overall performance compared with current representation and forensic methods.

References

  • [1] J. M. Wing, “Trustworthy AI,” Commun. ACM, vol. 64, no. 10, pp. 64–71, 2021.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
  • [4] H. Wang, T. Fu, Y. Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, P. Van Katwyk, A. Deac et al., “Scientific discovery in the age of artificial intelligence,” Nature, vol. 620, no. 7972, pp. 47–60, 2023.
  • [5] K. Sundararajan and D. L. Woodard, “Deep learning for biometrics: A survey,” ACM Comput. Surv., vol. 51, no. 3, pp. 1–34, 2018.
  • [6] S. Warnat-Herresthal, H. Schultze, K. L. Shastry, S. Manamohan, S. Mukherjee, V. Garg, R. Sarveswara, K. Händler, P. Pickkers, N. A. Aziz et al., “Swarm learning for decentralized and confidential clinical machine learning,” Nature, vol. 594, no. 7862, pp. 265–270, 2021.
  • [7] S. Feng, H. Sun, X. Yan, H. Zhu, Z. Zou, S. Shen, and H. X. Liu, “Dense reinforcement learning for safety validation of autonomous vehicles,” Nature, vol. 615, no. 7953, pp. 620–627, 2023.
  • [8] F. Juefei-Xu, R. Wang, Y. Huang, Q. Guo, L. Ma, and Y. Liu, “Countering malicious deepfakes: Survey, battleground, and horizon,” Int. J. Comput. Vis., vol. 130, no. 7, pp. 1678–1734, 2022.
  • [9] H. Liu, M. Chaudhary, and H. Wang, “Towards trustworthy and aligned machine learning: A data-centric survey with causality perspectives,” arXiv preprint arXiv:2307.16851, 2023.
  • [10] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv preprint arXiv:2104.13478, 2021.
  • [11] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
  • [12] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9446–9454.
  • [13] F. Klein, “A comparative review of recent researches in geometry,” Bull. Am. Math. Soc., vol. 2, no. 10, pp. 215–249, 1893.
  • [14] R. Tobies, Felix Klein—Mathematician, Academic Organizer, Educational Reformer.   Cham: Springer International Publishing, 2019, pp. 5–21.
  • [15] K. Fukushima and S. Miyake, “Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position,” Pattern Recognit., vol. 15, no. 6, pp. 455–469, 1982.
  • [16] V. Balntas, K. Lenc, A. Vedaldi, T. Tuytelaars, J. Matas, and K. Mikolajczyk, “H-Patches: A benchmark and evaluation of handcrafted and learned local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 11, pp. 2825–2841, 2019.
  • [17] S. Qi, Y. Zhang, C. Wang, J. Zhou, and X. Cao, “A survey of orthogonal moments for image representation: theory, implementation, and evaluation,” ACM Comput. Surv., vol. 55, no. 1, pp. 1–35, 2021.
  • [18] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, pp. 91–110, 2004.
  • [19] E. Tola, V. Lepetit, and P. Fua, “Daisy: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, 2009.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, pp. 211–252, 2015.
  • [21] R. Zhang, “Making convolutional networks shift-invariant again,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 7324–7334.
  • [22] C. Buckner, “Understanding adversarial examples requires a theory of artefacts for deep learning,” Nature Mach. Intell., vol. 2, no. 12, pp. 731–736, 2020.
  • [23] M. Taddeo, T. McCutcheon, and L. Floridi, “Trusting artificial intelligence in cybersecurity is a double-edged sword,” Nature Mach. Intell., vol. 1, no. 12, pp. 557–560, 2019.
  • [24] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1872–1886, 2013.
  • [25] L. Sifre and S. Mallat, “Rotation, scaling and deformation invariant scattering for texture discrimination,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1233–1240.
  • [26] T. Wiatowski and H. Bölcskei, “A mathematical theory of deep convolutional neural networks for feature extraction,” IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1845–1866, 2017.
  • [27] E. Oyallon, S. Zagoruyko, G. Huang, N. Komodakis, S. Lacoste-Julien, M. Blaschko, and E. Belilovsky, “Scattering networks for hybrid representation learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2208–2221, 2018.
  • [28] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4114–4128, 2014.
  • [29] X. Chen, X. Cheng, and S. Mallat, “Unsupervised deep Haar scattering on graphs,” Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014.
  • [30] S. Yu, “Evolving scattering networks for engineering disorder,” Nature Comput. Sci., vol. 3, no. 2, pp. 128–138, 2023.
  • [31] S. Cheng, Y.-S. Ting, B. Ménard, and J. Bruna, “A new approach to observational cosmology using the scattering transform,” Mon. Not. R. Astron. Soc., vol. 499, no. 4, pp. 5902–5914, 2020.
  • [32] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 2990–2999.
  • [33] T. S. Cohen and M. Welling, “Steerable CNNs,” in Proc. Int. Conf. Learn. Representations, 2016.
  • [34] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Harmonic networks: Deep translation and rotation equivariance,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5028–5037.
  • [35] M. Weiler, F. A. Hamprecht, and M. Storath, “Learning steerable filters for rotation equivariant CNNs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 849–858.
  • [36] I. Sosnovik, M. Szmaja, and A. Smeulders, “Scale-equivariant steerable networks,” in Proc. Int. Conf. Learn. Representations, 2019.
  • [37] D. Worrall and M. Welling, “Deep scale-spaces: Equivariance over scale,” Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
  • [38] Z. Sun and T. Blu, “Empowering networks with scale and rotation equivariance using a similarity convolution,” in Proc. Int. Conf. Learn. Representations, 2022.
  • [39] M. Finzi, S. Stanton, P. Izmailov, and A. G. Wilson, “Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 3165–3176.
  • [40] E. J. Bekkers, “B-spline CNNs on Lie groups,” in Proc. Int. Conf. Learn. Representations, 2019.
  • [41] Q. Xie, Q. Zhao, Z. Xu, and D. Meng, “Fourier series expansion based filter parametrization for equivariant convolutions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4537–4551, 2022.
  • [42] K. Atz, F. Grisoni, and G. Schneider, “Geometric deep learning on molecular representations,” Nature Mach. Intell., vol. 3, no. 12, pp. 1023–1032, 2021.
  • [43] R. J. Townshend, S. Eismann, A. M. Watkins, R. Rangan, M. Karelina, R. Das, and R. O. Dror, “Geometric deep learning of RNA structure,” Science, vol. 373, no. 6558, pp. 1047–1051, 2021.
  • [44] I. Goodfellow, P. McDaniel, and N. Papernot, “Making machine learning robust against adversarial inputs,” Commun. ACM, vol. 61, no. 7, pp. 56–66, 2018.
  • [45] F. Zhan, Y. Yu, R. Wu, J. Zhang, S. Lu, L. Liu, A. Kortylewski, C. Theobalt, and E. Xing, “Multimodal image synthesis and editing: The generative AI era,” IEEE Trans. Pattern Anal. Mach. Intell., 2023.
  • [46] S. Qi, Y. Zhang, C. Wang, J. Zhou, and X. Cao, “A principled design of image representation: Towards forensic tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 5337–5354, 2022.
  • [47] K. Lenc and A. Vedaldi, “Understanding image representations by measuring their equivariance and equivalence,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 991–999.
  • [48] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia, “Rotation equivariant vector field networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5048–5057.
  • [49] S. Qi, Y. Zhang, C. Wang, T. Xiang, X. Cao, and Y. Xiang, “Representing noisy image without denoising,” arXiv preprint arXiv:2301.07409, 2023.
  • [50] P.-T. Yap, X. Jiang, and A. C. Kot, “Two-dimensional polar harmonic transforms for invariant image representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1259–1270, 2009.
  • [51] J. Flusser, B. Zitova, and T. Suk, Moments and moment invariants in pattern recognition.   John Wiley & Sons, 2009.
  • [52] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, “Understanding and simplifying one-shot architecture search,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 550–559.
  • [53] M. Guo, Y. Yang, R. Xu, Z. Liu, and D. Lin, “When NAS meets robustness: In search of robust architectures against adversarial attacks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 631–640.
  • [54] S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, 1989.
  • [55] P.-T. Yap, R. Paramesran, and S.-H. Ong, “Image analysis by Krawtchouk moments,” IEEE Trans. Image Process., vol. 12, no. 11, pp. 1367–1377, 2003.
  • [56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • [57] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
  • [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
  • [60] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
  • [61] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
  • [62] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenet-v2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
  • [63] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, “Detecting adversarial samples from artifacts,” arXiv preprint arXiv:1703.00410, 2017.
  • [64] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” in Proc. Int. Conf. Learn. Representations, 2018.
  • [65] B. Liang, H. Li, M. Su, X. Li, W. Shi, and X. Wang, “Detecting adversarial image examples in deep neural networks with adaptive noise reduction,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 1, pp. 72–85, 2018.
  • [66] G. Goswami, A. Agarwal, N. Ratha, R. Singh, and M. Vatsa, “Detecting and mitigating adversarial perturbations for robust face recognition,” Int. J. Comput. Vis., vol. 127, pp. 719–742, 2019.
  • [67] J. Liu, W. Zhang, Y. Zhang, D. Hou, Y. Liu, H. Zha, and N. Yu, “Detection based defense against adversarial examples from the steganalysis point of view,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4825–4834.
  • [68] A. Agarwal, R. Singh, M. Vatsa, and N. Ratha, “Image transformation-based defense against adversarial perturbation on deep learning models,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 5, pp. 2106–2121, 2020.
  • [69] C. Wang, S. Qi, Z. Huang, Y. Zhang, R. Lan, and X. Cao, “Towards an accurate and secure detector against adversarial perturbations,” arXiv preprint arXiv:2305.10856, 2023.
  • [70] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in frequency: Face forgery detection by mining frequency-aware clues,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–103.
  • [71] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “CNN-generated images are surprisingly easy to spot… for now,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8695–8704.
  • [72] Z. Liu, X. Qi, and P. H. Torr, “Global texture enhancement for fake face detection in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8060–8069.
  • [73] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Artificial Intelligence Safety and Security.   Chapman and Hall/CRC, 2018, pp. 99–112.
  • [74] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 39–57.
  • [75] S. Chen, Z. He, C. Sun, J. Yang, and X. Huang, “Universal adversarial attack on attention and the resulting dataset DamageNet,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 2188–2197, 2020.
  • [76] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [77] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. Int. Conf. Learn. Representations, 2018.
  • [78] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1765–1773.
  • [79] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 8780–8794, 2021.
  • [80] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
  • [81] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
  • [82] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10684–10695.
  • [83] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10696–10706.
Shuren Qi received the B.A. and M.E. degrees from Liaoning Normal University, Dalian, China, in 2017 and 2020, respectively. He is currently pursuing the Ph.D. degree in computer science at Nanjing University of Aeronautics and Astronautics, Nanjing, China. He has published academic papers in top-tier venues including ACM Computing Surveys and IEEE Transactions on Pattern Analysis and Machine Intelligence. His research involves the general topics of invariance, robustness, and explainability in computer vision, with a focus on invariant representations, for closing today’s trustworthiness gap in artificial intelligence, e.g., forensic and security of visual data.
Yushu Zhang (Senior Member, IEEE) received the Ph.D. degree in computer science from Chongqing University, Chongqing, China, in 2014. He held various research positions with the City University of Hong Kong, Southwest University, University of Macau, and Deakin University. He is currently a Professor with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. His research interests include multimedia processing and security, artificial intelligence, and blockchain. Dr. Zhang is an Associate Editor of Signal Processing and Information Sciences.
Chao Wang received the B.S. and M.S. degrees from Liaoning Normal University, Dalian, China, in 2017 and 2020, respectively. She is currently pursuing the Ph.D. degree in computer science at Nanjing University of Aeronautics and Astronautics, Nanjing, China. Her research interests include trustworthy artificial intelligence, adversarial learning, and media forensics.
Zhihua Xia (Member, IEEE) received the Ph.D. degree in computer science from Hunan University, China, in 2011. He held various research positions with the Nanjing University of Information Science and Technology, New Jersey Institute of Technology, and Sungkyunkwan University. He is currently a Professor with the College of Cyber Security, Jinan University, China. His research interests include AI security, secure computation, and media forensics.
Xiaochun Cao (Senior Member, IEEE) received the B.E. and M.E. degrees in computer science from Beihang University, Beijing, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, FL, USA, in 2006. After graduation, he spent about three years at ObjectVideo Inc., as a Research Scientist. From 2008 to 2012, he was a Professor at Tianjin University, Tianjin, China. Before joining Sun Yat-sen University, Shenzhen, China, he was a Professor at the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. He is a Professor and the Dean with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University. He has published more than 200 journal and conference papers. Dr. Cao’s dissertation was nominated for the University Level Outstanding Dissertation Award. He was a recipient of the Piero Zamperoni Best Student Paper Award at the International Conference on Pattern Recognition, in 2004 and 2010; the Excellent Young Scientists Fund and Distinguished Young Scholars Fund of National Natural Science Foundation of China, in 2014 and 2020. He was on the Editorial Boards of IEEE Transactions on Circuits and Systems for Video Technology and IEEE Transactions on Multimedia. He is on the Editorial Boards of IEEE Transactions on Pattern Analysis and Machine Intelligence and IEEE Transactions on Image Processing.
Jian Weng (Senior Member, IEEE) received the Ph.D. degree in computer science from Shanghai Jiao Tong University, Shanghai, China, in 2008. From 2008 to 2010, he was a Post-doctoral Researcher with the Singapore Management University, Singapore. He is currently a Professor and the Vice President of Jinan University, Guangzhou, China. He has published more than 100 journal and conference papers. Dr. Weng served as the PC co-chair or a PC member for more than 50 international conferences. He was a recipient of the Innovation Award from the Chinese Association for Cryptologic Research in 2015 and the Distinguished Young Scholars Fund of National Natural Science Foundation of China in 2018. He is on the Editorial Board of IEEE Transactions on Vehicular Technology.