Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales
Abstract
Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. In this regard, a promising paradigm considers embedding task-required invariant structures, e.g., geometric invariance, in the fundamental image representation. However, such invariant representations typically exhibit limited discriminability, limiting their applications in larger-scale trustworthy vision tasks. For this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture yet in a fully interpretable manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this theoretical framework into a given task. With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on texture, digit, and parasite classification experiments. Furthermore, at the application level, our representations are explored in real-world forensics tasks on adversarial perturbations and Artificial Intelligence Generated Content (AIGC). Such applications reveal that the proposed strategy not only realizes the theoretically promised invariance, but also exhibits competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representation can be considered as an effective alternative to traditional CNN and invariants.
Index Terms:
Image representation, hierarchical invariance, robustness, discriminability, interpretability.
1 Introduction
Trustworthiness is a rising topic in modern Artificial Intelligence (AI) communities [1]. Over the past decade, deep learning techniques, especially Convolutional Neural Networks (CNN), have led to breakthrough results in numerous AI tasks, e.g., processing human perceptual information [2], playing board games [3], and solving hard science problems [4]. More recently, their applications are expanding into trust-related scenarios, e.g., biometrics [5], medical diagnostics [6], self-driving cars [7], and misinformation detection [8]. In such scenarios, the robustness and interpretability of AI systems are crucial [9]: 1) robustness means the performance of the system is stable under intra-class variations of the input; 2) interpretability means the behavior of the system can be understood or predicted by humans.
Integrating invariant structures into image representations is a principled design towards robust and interpretable vision systems [10]. Specifically, representations play a fundamental role in visual systems, where the system is generally built on meaningful representations of digital images (rather than the raw data) [11]. Note that the proper design/learning of such representations in fact relies on priors w.r.t. the task of interest [12]. Here, the concept of symmetry from the Erlangen Program [13, 14] may be the most fruitful prior – informally, a symmetry of a system is a transformation that leaves a certain property of the system invariant. Symmetry priors are ubiquitous in vision tasks, e.g., translation is a symmetry of the object classification system, where the object category is invariant under translation [15].
Next, we begin by providing some historical perspectives on the invariance in the development of image representations. The quest for invariance is as old as the field of computer vision itself, spanning both hand-crafted and learning approaches [16]:
• In the hand-crafted approach, symmetry priors (e.g., invariance and equivariance) w.r.t. geometric transformations (e.g., translation, rotation, and scaling) have been recognized as main ideas in designing representations. Such ideas cover almost all classical and state-of-the-art methods, from global features (e.g., moment invariants [17]), to local sparse features (e.g., SIFT [18]), and to local dense features (e.g., DAISY [19]). However, these hand-crafted representations are all fixed in design, relying on (under)-complete dictionaries, and therefore fail to provide sufficient discriminability at larger scales, e.g., ImageNet classification task [20].
• In the learning approach, the CNN achieves over-complete representations of strong discriminative power for larger-scale vision tasks, through a cascade of learnable nonlinear transformations. As a textbook view of deep learning, representations should be learned, not designed [2]. Therefore, classical CNN representations are equipped with very few symmetry priors, typically just translation equivariance [15], which has recently been shown to no longer hold in the deeper layers of CNNs with downsampling structures [21]. In general, these learned representations lack robustness and interpretability guarantees, e.g., regarding the presence and understanding of adversarial perturbations [22], and therefore cannot be readily extended to trustworthy tasks [23].
Historically, to a certain extent, efforts at invariance and discriminability have developed independently in hand-crafted and learning approaches. The compatibility between invariance and discriminability has emerged as a tricky problem when moving towards trustworthy AI.
Approach | Traditional invariance | Traditional CNN | Scattering networks | Equivariant networks | Hierarchical invariance |
---|---|---|---|---|---|
Discriminative | ✓ | ✓ | ✓ | ✓ | |
Robust | ✓ | ✓ | ✓ | ✓ | |
Interpretable | ✓ | ✓ | ✓ | ✓ | |
Efficient | ✓ | ✓ |
1.1 State of the Art and Motivation
In trust-related scenarios at larger scales, recent advances are seeking more advanced invariant designs of image representations, fulfilling discriminability, robustness, and interpretability simultaneously.
• On the discriminability of hand-crafted representations, researchers introduce successful experiences behind learning representations, especially cascading and over-complete designs. The most representative work is Invariant Scattering Convolution Networks [24], where the classical wavelet transform is expanded into an over-complete representation with deep cascading. Unlike a typical CNN: 1) regarding the architecture, convolutional layers are defined by fixed wavelet filters, with modulus-based nonlinearity, but without subsequent pooling; 2) regarding the representation properties, the architecture yields the translation equivariance and certain robustness w.r.t. non-linear deformations. Subsequent theoretical works further explored various geometric invariants [25], more general mathematical formulations [26], and the potential for improving the efficiency, interpretability, and robustness of state-of-the-art CNN techniques [27]. Regarding applications, they provided competitive results in a variety of tasks on audio [28], image [24, 25, 27] and graph [29] data, some of which are even interdisciplinary [30, 31]. With similar design goals and paths, we consider Scattering Networks as a main competitor for our work.
• On the robustness and interpretability of learning representations, researchers introduce successful experiences behind hand-crafted representations, especially invariance and equivariance designs. The most representative work is Group Equivariant Convolutional Networks [32], where the classical convolution is generalized to a new definition on the symmetry group. Unlike a typical CNN: 1) regarding the architecture, convolutional layers are learned but with new structure inspired by symmetry priors, e.g., re-parameterizing the filter to control symmetry; 2) regarding the representation properties, the architecture commonly provides the joint equivariance for translation and rotation. Subsequent theoretical works further explored the equivariance for rotation [33, 34, 35], flipping [33], scaling [36, 37], and their combination [38] from various mathematical theories, including steerable filters [33], harmonic analysis [34], scale space [37], Lie groups [39], and B-spline interpolation [40]. Regarding applications, they played a key role in low-level vision tasks [41], especially scientific discoveries with symmetry priors [42, 43]. With similar design goals but on the learning path, we consider Equivariant Networks as a secondary competitor for our work.
Motivation. Despite starting from different theories, the above state-of-the-art methods exhibit common problems in the implementation efficiency and representation capability. An over-simplified interpretation on the technical level of such methods is as follows [38]: The input feature map is convolved with symmetry versions of the same filter to obtain multi-channel features, where the distortion on the input (e.g., rotation) corresponds to the cyclic shift between channels, and hence the invariance is achieved by pooling across channels. 1) Regarding the implementation efficiency, this parallel framework leads to an exponential expansion on the computational size (w.r.t. the sampling rate on symmetry), especially for learning representations where the introduced new learnable parameters make the training more challenging. 2) Regarding the representation capability, the discrete sampling on the symmetry of filters raises a tricky trade-off between invariance and discriminability – higher sampling rate implies better invariance, but the resulting computational cost restricts the overall size of representation networks (going deeper or wider), and hence the improvement of discriminability.
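The over-simplified interpretation above can be illustrated with a minimal NumPy sketch. It assumes only four sampled orientations (multiples of 90 degrees, which are exact on the pixel grid) and a random filter; the helper names `correlate_valid` and `rotated_bank_responses` are hypothetical and are not taken from any of the cited methods.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(image, kernel):
    """Dense 'valid' cross-correlation of a 2-D image with a 2-D kernel."""
    windows = sliding_window_view(image, kernel.shape)  # (H-k+1, W-k+1, k, k)
    return np.einsum("ijab,ab->ij", windows, kernel)


def rotated_bank_responses(image, kernel, n_rot=4):
    """One channel per rotated copy of the same filter (90-degree steps, an assumption)."""
    return np.stack(
        [correlate_valid(image, np.rot90(kernel, t)) for t in range(n_rot)]
    )


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
kernel = rng.standard_normal((5, 5))

resp = rotated_bank_responses(image, kernel)
resp_rot = rotated_bank_responses(np.rot90(image), kernel)

# Rotating the input rotates every map spatially and cyclically shifts the channels.
channels_shift = all(
    np.allclose(resp_rot[t], np.rot90(resp[(t - 1) % 4])) for t in range(4)
)

# Pooling across channels (and then over the plane) removes the dependence on rotation.
invariant = resp.max(axis=0).max()
invariant_rot = resp_rot.max(axis=0).max()
```

In this sketch the cost grows with the number of sampled orientations, since every additional orientation adds one more dense correlation per layer; this is the efficiency concern raised in point 1).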
1.2 Contributions
As a potential step towards solving the above open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives.
As summarized in Table 1, our approach stands nicely between the two extremes, i.e., traditional invariance [17] and CNN [21], w.r.t. discriminability, robustness, and interpretability. Compared with the recent scattering [24] and equivariant [32] networks, our approach is characterized by a more efficient design. Here, the equivariance is continuous/one-shot and holds across layers, eliminating the need for complex symmetry sampling, parallel frameworks, and cross-channel pooling. Therefore, it exhibits better efficiency and allows more flexibility in enlarging the network size (going deeper or wider) to increase the representation capacity.
Our main contributions are as follows:
Theory. We propose a new framework for robust and interpretable image representation, named Hierarchical Invariant Representation (HIR), by extending the classical theory of moment invariants to cascade transformations. Starting from an ideal blueprint for hierarchical invariance, we formalize the over-complete moment invariants by an efficient hierarchical structure, with a better trade-off between invariance and discriminability than traditional invariants and CNN. Note that the HIR exhibits continuous and one-shot equivariance w.r.t. translations, rotations, and flips at each intermediate layer, which is not available in current state-of-the-art algorithms. We also provide fast and accurate numerical implementations of HIR, which are generic for arbitrary basis functions.
Practice. We explore the practical flexibility of this theoretical framework, covering a wide range of potential designs to better match a given vision task. The theory is specified into a class of networks, involving details about the topologies, layers, and parameters. In particular, we define a concept of frequency pooling to satisfy the common requirements from interpretability, invariance, and discriminability. For empowering the data adaptability of hand-crafted HIR, we also provide an architecture searching strategy based on the over-completeness.
Application. We validate the effectiveness of HIR in various simulation experiments and real-world applications. Pattern classification experiments are performed on typical sets of texture, digit, and parasite images, exhibiting state-of-the-art accuracy, invariance, and efficiency under diverse task scales and geometric variants. The direct applications to real-world forensics, i.e., detections of adversarial perturbations [44] and Artificial Intelligence Generated Content (AIGC) [45], also demonstrate the competitive discriminability even in the era of deep learning.
2 Foundations
As mentioned earlier, this work develops from the theory of moment invariants. Therefore, we begin with a brief review on the foundations of moment invariants, covering some concepts, notations, and definitions from our previous works.
2.1 Global and Local Representations
In general, classical moments and moment invariants are global representations of images, where the theory is built on the following definition [17]:
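$$\langle f, V_{nm} \rangle = \iint_{D} V_{nm}^{*}(x,y)\, f(x,y)\, dx\, dy, \tag{1}$$
where $f$ is the image function, $V_{nm}$ is the basis function with order parameter $(n,m)$ on domain $D$, and $*$ is the complex conjugate. Note that the domains of $f$ and $V_{nm}$ in (1) have the same/similar location and scale, implying the global nature of the representation information.
With the sparse prior and geometric prior for natural images, two typical constraints, i.e., orthogonality and rotation invariance, are often imposed on the explicit definition of $V_{nm}$, leading to the following polar form:
$$V_{nm}(r,\theta) = R_{n}(r)\, A_{m}(\theta), \tag{2}$$
where $V_{nm}$ is separated as the product of the angular basis function $A_{m}(\theta) = \exp(jm\theta)$ ($j = \sqrt{-1}$) and the radial basis function $R_{n}(r)$, subject to the weighted orthogonality condition $\int_{0}^{1} R_{n}(r)\, R_{n'}^{*}(r)\, r\, dr = \frac{1}{2\pi}\delta_{nn'}$. Note that the basis function in (2) is orthogonal on the unit disk $D = \{(r,\theta) : r \in [0,1],\ \theta \in [0,2\pi)\}$, and the magnitude of $\langle f, V_{nm} \rangle$ is invariant to the rotation of the image $f$ (see [17] for a survey).
In our recent work, moments and moment invariants are extended to local representations of images, where the theory is built on the following definition [46]:
$$\langle f, V_{nm}^{uvw} \rangle = \iint_{D} \left( V_{nm}^{uvw}(x,y) \right)^{*} f(x,y)\, dx\, dy, \tag{3}$$
where the new basis function $V_{nm}^{uvw}$ introduces position parameter $(u,v)$ and scale parameter $w$. It can be interpreted as a translated and scaled version of the global $V_{nm}$ with the following coordinate relationship:
$$V_{nm}^{uvw}(x,y) = V_{nm}(x',y') = V_{nm}\!\left( \frac{x-u}{w}, \frac{y-v}{w} \right), \tag{4}$$
where the domain $D$ is a disk centered at $(u,v)$ with radius $w$: $D = \{(x,y) : (x-u)^{2} + (y-v)^{2} \le w^{2}\}$. Note that (3) allows the domain of $V_{nm}^{uvw}$ to be built in different positions and scales w.r.t. the domain of $f$, implying the local nature of the representation information. Also, the classical definition (2) is in fact a special case of the new definition (3) with $(u,v) = (0,0)$ and $w = 1$ (see [46] for details).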
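As a rough numerical illustration of the global definition (1) with the polar form (2), the following NumPy sketch computes a zero-order (pixel-center) approximation of one moment and compares its magnitude before and after a 90-degree rotation, which is exact on the pixel grid. The radial basis `cos(pi * n * r^2)` and the function names are assumptions made for illustration only; the text leaves the basis generic.

```python
import numpy as np


def polar_grid(size):
    """Polar coordinates of pixel centers, with the image mapped to [-1, 1] x [-1, 1]."""
    c = (2 * np.arange(size) + 1 - size) / size
    x, y = np.meshgrid(c, c)
    return np.hypot(x, y), np.arctan2(y, x)


def basis(n, m, size):
    """Radial part times angular part exp(j*m*theta), set to zero outside the unit disk.

    The radial part cos(pi * n * r^2) is an assumed choice, not prescribed by the text.
    """
    r, theta = polar_grid(size)
    v = np.cos(np.pi * n * r**2) * np.exp(1j * m * theta)
    return np.where(r <= 1.0, v, 0.0)


def moment(image, n, m):
    """Zero-order approximation of the inner product of the image with the basis."""
    size = image.shape[0]  # square image assumed
    pixel_area = (2.0 / size) ** 2
    return np.sum(image * np.conj(basis(n, m, size))) * pixel_area


rng = np.random.default_rng(0)
image = rng.random((32, 32))
mag = abs(moment(image, 2, 3))
mag_rot = abs(moment(np.rot90(image), 2, 3))  # same magnitude, different phase
```

A local moment in the sense of (3) would apply the same sum to a disk-shaped window placed at the position parameter, with the window radius playing the role of the scale parameter.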
2.2 Invariance, Equivariance, and Covariance
The terms invariance, equivariance, and covariance appear in the fields of computer vision, graphics, geometry, and physics, with similar but slightly different definitions. In this paper, we use the following identities to generally denote such terms [47, 48]:
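• invariance – $\mathcal{R}(\mathcal{D}(f)) \equiv \mathcal{R}(f)$,
• equivariance – $\mathcal{R}(\mathcal{D}(f)) \equiv \mathcal{D}(\mathcal{R}(f))$,
• covariance – $\mathcal{R}(\mathcal{D}(f)) \equiv \mathcal{D}'(\mathcal{R}(f))$,
where $\mathcal{R}$ is a representation, $\mathcal{D}$ is a degradation, and $\mathcal{D}'$ is a composite function of $\mathcal{D}$. Note that invariance and equivariance are special cases of covariance with $\mathcal{D}' = \mathrm{id}$ and $\mathcal{D}' = \mathcal{D}$, respectively.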
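The three identities can be checked numerically on a toy 1-D example. In the sketch below, the degradation is a circular shift, and the three representations (global sum, circular moving average, discrete Fourier transform) are illustrative choices that do not come from the paper.

```python
import numpy as np


def shift(signal, k):
    """Degradation: circular shift by k samples."""
    return np.roll(signal, k)


def smooth(signal):
    """Representation: circular 3-tap moving average (a circular convolution)."""
    return (np.roll(signal, -1) + signal + np.roll(signal, 1)) / 3.0


rng = np.random.default_rng(0)
f = rng.standard_normal(64)
k = 5

# Invariance: the representation of the shifted signal equals that of the original.
inv_ok = np.isclose(np.sum(shift(f, k)), np.sum(f))

# Equivariance: shifting then smoothing equals smoothing then shifting.
equi_ok = np.allclose(smooth(shift(f, k)), shift(smooth(f), k))

# Covariance: the shift appears in the DFT as a predictable phase ramp.
freqs = np.arange(f.size)
phase_ramp = np.exp(-2j * np.pi * freqs * k / f.size)
cov_ok = np.allclose(np.fft.fft(shift(f, k)), phase_ramp * np.fft.fft(f))

# Taking the magnitude specializes this covariance to invariance.
mag_ok = np.allclose(np.abs(np.fft.fft(shift(f, k))), np.abs(np.fft.fft(f)))
```

The last check mirrors the role that the magnitude operation plays for rotation and flipping in the remainder of this section.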
Starting from the above terms and the local representation (3), one can verify that exhibits the following properties w.r.t. translation, rotation, flipping, and scaling on images (see [46] for details).
The image translation leads to
(5) |
where is the translation offset of the image . Note that the same appears in position parameter , implying the equivariance w.r.t. the image translation.
Since the translation equivariance holds, the following analysis (6)–(8) will consider only center-aligned geometric transformations, i.e., we can restrict without loss of generality.
![Refer to caption](x1.png)
The image rotation leads to
(6) |
with , where is the rotation angle w.r.t. the center of the image . Note that the same appears in phase of the representation, implying the covariance w.r.t. the center-aligned rotation. It is straightforward that the covariance (6) will specialize to the invariance when taking the magnitude as .
The image flipping leads to
(7) |
with , where is a vertically flipped version of the image w.r.t. the center. Note that center-aligned vertical flipping again only affects the phase of the representation, implying the covariance similar to (6). As for other flipping orientations, the same conclusion can be derived from the composite of rotation and vertical flipping. It is straightforward that the joint invariance of center-aligned rotation and flipping holds when taking the magnitude of the representation.
The image scaling leads to
(8) |
with , where is the scaling factor w.r.t. the center of the image . Note that the same appears in scale parameter , implying the covariance w.r.t. center-aligned scaling.
For the representation properties when , they can be derived from the composite of translation with center-aligned rotation, flipping, and scaling, respectively. Hence, the magnitude of the representation has joint equivariance for any translation, rotation, and flipping on domain, as well as covariance for any scaling on domain.
3 Hierarchical Invariance: Theory
This section focuses on the theoretical aspects of the proposed HIR.
As a high-level intuition, we formalize the blueprint of hierarchical invariance, rethinking the typical modules of CNN representations. Starting from the invariant theory, we then define new modules with their compositions to fulfill such a blueprint, along with representation property justifications and efficient numerical implementations. To summarize this theoretical section, we discuss the criticisms and developments of the proposed idea versus typical concepts.
3.1 Blueprint of Hierarchical Invariance
Our goal is a CNN-like hierarchical architecture in which the image information passes through each intermediate layer in a geometrically controllable manner and, at the last layer, invariant features can be obtained by compact designs while retaining sufficient information. In this paper, this ideal representation structure is termed hierarchical invariance.
Motivated by the goal, we rethink several typical modules of CNN representation and formalize a blueprint of such modules for hierarchical invariance, as shown in Fig. 1.
Formulation 1.
(Hierarchical invariance blueprint). The set of feature maps (including input images) is denoted as with the discrete domain and complex-valued channels , where and are the variants of and respectively under certain operators, e.g., downsampling, and is a group modeling all the symmetries of interest over the . We formalize the following modules for hierarchical invariance:
• The convolutional layer captures local features by convolution operations. The geometric principle of is the covariance for the symmetry group , i.e., there exists a predictable such that for any and , where the covariance specializes to the equivariance when .
• The nonlinear layer introduces the non-linearity in features for discriminative purposes, with an activation function such that , where the composition of convolutional and nonlinear layers, i.e., , is also covariant for the group , due to the element-wise action of .
• The local pooling layer downsamples the plane dimensions of feature maps to reduce computational complexity, such that . The geometric principle of is the (approximate) equivariance for any produced by , i.e., , allowing the subsequent convolutional layer to continue capturing such covariant features, where the composition remains covariant for the group .
• The invariant layer obtains the final vector representation via a certain global pooling over plane dimensions. The geometric principle of is the invariance for the symmetry group , i.e., for any and .
With this blueprint, HIR can be designed in a generic way, by the ordered cascading of such modules.
3.2 Definition of Layer and Path
From the invariant theory of Section 2, we will give a class of definitions for layers , , , and , satisfying the hierarchical invariance blueprint.
Definition 1.
(Convolutional layer). For the input feature map with and , the convolutional layer is defined channel-wise as local covariant representations with (3) and (4) :
(9) |
where is the convolution over the , denotes the matrix transpose, and is a convolution kernel defined as:
(10) |
where is the integral value of basis function over a valid pixel region:
(11) |
with -centered pixel region .
Remark. In Definition 1, the convolutional layer is specified by and , which control the representation scale and representation frequency of , respectively. Since the defined by (9) directly inherits the representation properties (5)–(8), it should be regarded as a linear covariant layer on the group modeling all translation, rotation, flipping, and scaling symmetries over the . For convenience, we denote , where is the translation/rotation/flipping symmetry group and is the scaling symmetry group.
Definition 2.
(Nonlinear layer). For the input feature map with and , the nonlinear layer is defined channel-wise as a magnitude operation:
(12) |
where is complex-valued, and (12) can be written explicitly as .
Remark. With Definition 2 and Section 2.2, the composition of convolutional and nonlinear layers will exhibit the joint equivariance of translation, rotation, and flipping, i.e., for any and . Note that the defined by (12) not only introduces the non-linearity in feature maps, but also converts the covariance (from w.r.t. rotation and flipping) into the more manageable equivariance . In addition, the composition preserves the scaling covariance of due to the element-wise action of , i.e., for any and .
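The remark can be checked on a pixel grid with a minimal sketch of one convolution-plus-magnitude unit. The sketch makes several assumptions: the kernel is obtained by sampling the basis at pixel centers (rather than by the pixel integration of Definition 1), the radial basis is the real-valued `cos(pi * n * r^2)`, the input is real-valued, and only 90-degree rotations, vertical flips, and integer translations are tested, since these are exact without resampling. All function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_kernel(n, m, w):
    """Basis sampled on a 2w x 2w window of radius w (zero outside the disk).

    The even window size avoids a sample at r = 0, where the angle is undefined.
    The real radial part cos(pi * n * r^2) is an assumed choice.
    """
    c = (2 * np.arange(2 * w) + 1 - 2 * w) / (2 * w)  # pixel centers in (-1, 1)
    x, y = np.meshgrid(c, c)
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    return np.where(r <= 1.0, np.cos(np.pi * n * r**2) * np.exp(1j * m * theta), 0.0)


def conv_layer(feature_map, n, m, w):
    """Dense local moments: inner product of every window with the conjugate kernel."""
    windows = sliding_window_view(feature_map, (2 * w, 2 * w))
    return np.einsum("ijab,ab->ij", windows, np.conj(local_kernel(n, m, w)))


def nonlinear_layer(feature_map):
    """Magnitude nonlinearity."""
    return np.abs(feature_map)


rng = np.random.default_rng(0)
image = rng.random((20, 20))
n, m, w = 1, 1, 3

c_plain = conv_layer(image, n, m, w)
c_rot = conv_layer(np.rot90(image), n, m, w)

# Covariance: the rotation appears as a spatial rotation plus a known phase factor.
phase = np.exp(1j * m * np.pi / 2)
covariant = np.allclose(c_rot, phase * np.rot90(c_plain))

# Equivariance: the magnitude removes the phase factor.
equivariant = np.allclose(nonlinear_layer(c_rot), np.rot90(nonlinear_layer(c_plain)))
```

The same unit can be checked against `np.flipud` and against cropping the input by an integer offset, which in this "valid" windowing simply crops the output by the same offset.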
Definition 3.
(Local pooling layer). For the input feature map with and , the local pooling layer is defined as the identity function:
(13) |
Remark. According to related research, downsampling operations (e.g., local max pooling) of CNN will variously impair (translation) equivariance, i.e., an approximation , especially for larger pooling scales or deeper network architectures, implying a trade-off between computational complexity and representation equivariance. Since the proposed representation is one-shot without the large-scale training of typical CNN, we neglect downsampling operations and simply set when the computational cost is acceptable. Alternatively, more elegant pooling designs with better trade-offs between complexity and equivariance can be employed to define , as detailed in the paper by Zhang [21]. It is straightforward that the composition has the same representation properties as by Definition 3.
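The loss of translation equivariance under downsampling can be seen in a small 1-D example. The sketch below uses non-overlapping max pooling with stride 2 as an illustrative stand-in for the downsampling operations mentioned above; the function name and the test signal are hypothetical.

```python
import numpy as np


def max_pool_1d(signal, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    trimmed = signal[: len(signal) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)


signal = np.array([1.0, 3.0, 2.0, 0.0, 5.0, 4.0, 0.0, 0.0])

pooled = max_pool_1d(signal)
pooled_shift1 = max_pool_1d(np.roll(signal, 1))  # input shifted by one sample
pooled_shift2 = max_pool_1d(np.roll(signal, 2))  # input shifted by the stride

# A shift by the stride is tracked by the pooled output ...
tracks_stride_shift = np.array_equal(pooled_shift2, np.roll(pooled, 1))

# ... but a one-sample shift yields an output that is no shifted copy of `pooled`.
tracks_unit_shift = any(
    np.array_equal(pooled_shift1, np.roll(pooled, s)) for s in range(pooled.size)
)
```

With the identity pooling of Definition 3, no such discrepancy can arise, at the price of keeping the full spatial resolution in every layer.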
Definition 4.
(Invariant layer). For the input feature map with and , the invariant layer is defined channel-wise as global invariant representations with (1) and (2):
(14) |
where is a special transform mapping image moments to global invariants, w.r.t. the symmetry group of interest and any .
Remark. In Definition 4, we have not restricted to a fixed formula, allowing the generality of the discussion in the following Section 3.3; its specific designs (w.r.t. considered applications of this paper) will be given in Section 4.1. Note that, with Definitions 1–3 and Section 2.2, the form with its cascade will basically preserve the geometric information of , specifically the translation, rotation and flipping symmetries. Therefore, the idea of (w.r.t. the deep feature map ) is very similar to the classical theory of moment invariants (w.r.t. the original image ), with a wide range of potential designs [17].
Definition 5.
(Path). From Definitions 1–4, we define a path of HIR as , where specifies the parameters of the convolutional layer sorted by . The HIR along a path , , is defined as the following ordered cascading with corresponding parameters :
(15) |
Remark. In the Definition 5, we further unify the global and local representation theories of moment invariants (Section 2) into a hierarchical representation framework. Note that the layers prior to the invariant layer provide structure-preserving properties for the representation . Here, the global representation (15) is designed for image-level visual tasks, e.g., classification; as for pixel-level ones, e.g., segmentation, we can preserve the spatial dimensions by removing the last invariant layer. In the next section, the representation properties of Definition 5 will be analyzed explicitly.
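A path can be sketched as an ordered list of per-layer parameters followed by a global invariant layer. The sketch below is a simplified illustration under stated assumptions: each path entry is a hypothetical `(n, m, w)` tuple, kernels are sampled at pixel centers, the radial basis is the real-valued `cos(pi * n * r^2)`, local pooling is the identity, and the invariant layer takes magnitudes of a few global moments of the last feature map (one of many possible designs, invariant here to rotation and flipping). It is not the implementation of the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sampled_basis(n, m, size):
    """Radial part times exp(j*m*theta), sampled at pixel centers of a size x size
    grid scaled to the unit disk. The radial part cos(pi*n*r^2) is an assumption."""
    c = (2 * np.arange(size) + 1 - size) / size
    x, y = np.meshgrid(c, c)
    r, theta = np.hypot(x, y), np.arctan2(y, x)
    v = np.cos(np.pi * n * r**2) * np.exp(1j * m * theta)
    # Keep the disk only; the angle is undefined at r = 0 unless m == 0.
    keep = (r <= 1.0) & ((r > 0.0) | (m == 0))
    return np.where(keep, v, 0.0)


def unit(feature_map, n, m, w):
    """Convolutional layer (window radius w), magnitude, and identity pooling."""
    windows = sliding_window_view(feature_map, (2 * w, 2 * w))
    kernel = np.conj(sampled_basis(n, m, 2 * w))
    return np.abs(np.einsum("ijab,ab->ij", windows, kernel))


def invariant_layer(feature_map, orders):
    """Magnitudes of global moments of the last (square) feature map."""
    size = feature_map.shape[0]
    return np.array(
        [abs(np.sum(feature_map * np.conj(sampled_basis(n, m, size)))) for n, m in orders]
    )


def hir_along_path(image, path, orders=((0, 0), (1, 1), (2, 2))):
    """Ordered cascade of units along `path`, closed by the invariant layer."""
    feature_map = np.asarray(image, dtype=float)
    for n, m, w in path:
        feature_map = unit(feature_map, n, m, w)
    return invariant_layer(feature_map, orders)


rng = np.random.default_rng(1)
image = rng.random((32, 32))
path = [(1, 2, 3), (0, 1, 2)]  # two cascaded units with hypothetical parameters

features = hir_along_path(image, path)
features_rot = hir_along_path(np.rot90(image), path)
features_flip = hir_along_path(np.flipud(image), path)
```

For pixel-level tasks, the loop in `hir_along_path` could return `feature_map` directly instead of calling the invariant layer, which matches the remark about removing the last layer.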
3.3 Representation Property
In a typical CNN, the relationship between image information and learned representation is highly nonlinear and difficult to understand or predict. As for the HIR, we can explicitly give the following conclusions about the geometric symmetries between image and representation, implying good robustness and interpretability.
Property 1.
(Equivariance for translation, rotation, and flipping). For a representation unit with arbitrary parameters (for the convolutional layer), any composition of satisfies the joint equivariance for translation, rotation, and flipping (ignoring edge effects and resampling errors), i.e., the following identity holds:
(16) |
for any composition length , any and , where is the translation/rotation/flipping symmetry group.
Proof.
First, let us examine the behavior of a representation unit on :
(17) |
where the first pass comes from the covariance of for rotation and flip**, i.e., (6) and (7), and is a predictable operation acting in the phase domain of ; the second pass comes from the specialization of to the covariant – the magnitude operation removes the extra phase variations, leading to a pure equivariance ; the third pass comes from the identity function of , which becomes approximately equal when the downsampled is used. Here, means that the representation unit can be considered as an equivariant layer for any and – in other words, the single and operations on are exchangeable. Furthermore, with a notation , we have for any . Therefore, and any composition of are exchangeable, implying the correctness of Property 1. ∎
Property 2.
(Covariance for scaling). For a representation unit , where the scale parameter of its convolutional layer is specified as with a notation , any composition of satisfies the covariance for scaling (ignoring edge effects and resampling errors), i.e., the following identity holds:
(18) |
for any composition length , any and , where is a predictable operation corresponding to with explicit form , is the scaling factor w.r.t. , and is the scaling symmetry group.
Proof.
First, let us examine the behavior of a representation unit on :
(19) |
where the first pass comes from the covariance of for scaling, i.e., (8), and is a predictable operation acting in both the domain (i.e., the same scaling ) and the domain (i.e., the factor ) of ; the second and third passes come from the element-wise action of and the identity function of , respectively. Here, means that the representation unit can be considered as a covariant layer for any and – in other words, the single and operations on are exchangeable but with the parameter changing of . Furthermore, we have for any . Therefore, and any composition of are exchangeable while changing the scale parameter to , implying the correctness of Property 2. ∎
Property 3.
(Hierarchical invariance). For any composition of representation unit , it is practical to design a global invariant map w.r.t. the symmetry group of interest , due to the predictable geometric symmetries between the input image and deep feature map (at each intermediate layer) guaranteed by Properties 1 and 2. More specifically, with Definition 4, we assume that there exists a such that for any and , i.e., invariance holds on one layer, where is a predictable operation corresponding to and . Then the following invariance
(20) |
holds for any composition length .
Proof.
We can rewrite (20) as:
(21) |
where the first pass comes from Properties 1 and 2, note that , is related to and ; the second pass comes from our assumption for any and , with for any . ∎
3.4 Fast and Accurate Implementation
Above, the core components of our representations, i.e., definitions and properties, have been formalized. In this section, we will complement the numerical implementation of HIR, especially the fast and accurate computations of Definition 1 from our previous work [46]. Note that the discussion here is very general, with no restrictions on the specific definitions of the basis functions.
Definition 6.
(Fast implementation). Let us introduce the convolution theorem as a fast implementation of Definition 1, such that the spatial domain convolution of (9) can be converted to the following frequency domain product form [46]:
(22) |
where is the Fourier transform and is the point-wise multiplication.
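The convolution theorem behind Definition 6 can be illustrated with a short NumPy sketch that compares a direct spatial-domain convolution with its frequency-domain counterpart. Zero-padding to the full output size is used so that the circular convolution computed by the FFT coincides with the linear one; the function names and the random test data are assumptions for illustration.

```python
import numpy as np


def conv2d_direct(image, kernel):
    """Full linear 2-D convolution by explicit summation (reference implementation)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h + kh - 1, w + kw - 1), dtype=complex)
    for a in range(kh):
        for b in range(kw):
            # Each kernel tap adds a shifted, weighted copy of the image.
            out[a : a + h, b : b + w] += kernel[a, b] * image
    return out


def conv2d_fft(image, kernel):
    """Same convolution as a point-wise product in the frequency domain."""
    shape = (image.shape[0] + kernel.shape[0] - 1, image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.fft2(image, s=shape) * np.fft.fft2(kernel, s=shape)
    return np.fft.ifft2(spectrum)


rng = np.random.default_rng(0)
image = rng.standard_normal((12, 10))
kernel = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))

direct = conv2d_direct(image, kernel)
fast = conv2d_fft(image, kernel)
max_abs_diff = np.max(np.abs(direct - fast))
```

The cost of the direct route grows with the kernel area, whereas the FFT route depends only on the padded map size, which is why the frequency-domain form pays off for large convolution kernels.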
Property 4.
(Complexity analysis). In Definition 1, the convolution (9) dominates the computational complexity due to the dense convolution. For the input feature map with and , we assume that a set of needs to be computed, where scale parameter with a fixed order and a fixed channel , and denote the number of feature map samples as and the number of scale samples as . With Definition 6 and the Fast Fourier Transform (FFT), we can compute the set of in multiplications, as opposed to the complexity of by the direct Definition 1, where is the maximum scale in . Note the big difference between quadratic and logarithmic growth in the complexity (after removing the common terms): Definition 6 will exhibit better efficiency when is sufficiently large such that .
Definition 7.
(Accurate implementation). Let us introduce the higher-order numerical integration as an accurate implementation of Definition 1, such that the two-dimensional continuous integral of (11) can be converted to the following summation form [46]:
(23) |
where the set of numerical integration samples encodes the points and the corresponding weights , which are specified by a certain numerical integration strategy, such as Gaussian quadrature.
Property 5.
(Accuracy analysis). In Definition 1, the integral (11) dominates the computational accuracy due to the continuous integration of complicated functions. We assume that the with a fixed order and position needs to be computed, and denote the number of numerical integration samples as . The implementation based on Definition 7 exhibits an approximation error of . Note that when there is more than one sample within each pixel region, i.e., , Definition 7 will exhibit better accuracy than the error of by the direct Definition 1 (zero-order approximation).
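The difference between the zero-order approximation and a higher-order rule can be illustrated with tensor-product Gauss-Legendre quadrature over a single square pixel. The sketch below uses `numpy.polynomial.legendre.leggauss`; the function name `pixel_integral`, the test integrand, and the pixel geometry are assumptions chosen so that the exact integral is known in closed form.

```python
import numpy as np


def pixel_integral(func, x0, y0, delta, order=3):
    """Integrate func over the square pixel centered at (x0, y0) with side delta,
    using order x order Gauss-Legendre samples (order=1 is the zero-order rule)."""
    nodes, weights = np.polynomial.legendre.leggauss(order)
    xs = x0 + 0.5 * delta * nodes  # map nodes from [-1, 1] to the pixel
    ys = y0 + 0.5 * delta * nodes
    gx, gy = np.meshgrid(xs, ys)
    w2d = np.outer(weights, weights) * (0.5 * delta) ** 2
    return np.sum(w2d * func(gx, gy))


def integrand(x, y):
    """Test integrand with a known integral of 1/9 over [0, 1] x [0, 1]."""
    return x**2 * y**2


exact = 1.0 / 9.0
zero_order = pixel_integral(integrand, 0.5, 0.5, 1.0, order=1)  # center sample only
second_order = pixel_integral(integrand, 0.5, 0.5, 1.0, order=2)

error_zero = abs(zero_order - exact)
error_second = abs(second_order - exact)
```

The same routine accepts complex-valued integrands, so it could be applied to a basis function evaluated at the mapped coordinates of each pixel.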
3.5 Comparison with Related Work
As a further explanation of Table 1, it is necessary to conclude this section and highlight the theoretical relationships with typical related works:
• Traditional invariance. Our work generalizes this theory by unifying the global and local invariant representations into a new framework of HIR. More specifically, we formalize layers , , and based on the theory of local invariants [46] (Definitions 1–3), arguing the equivariance/covariance can be preserved across layers under a certain cascade (Properties 1–2). We also formalize layer based on the theory of global invariants [17] (Definition 4), arguing the successes of global invariance for image domains can be directly generalized to equivariant/covariant deep feature domains (Property 3). Under our hierarchical invariance, classical global [17] and local [46] invariants can be considered as special cases, i.e., and (Definition 5).
• Traditional CNN. Our work has a similar hierarchical architecture but with better properties in geometric symmetry, allowing for robust and interpretable image representations. More specifically, we introduce the discriminative design of CNN in our invariants, i.e., over-complete representation with deep cascading [21]. On the other hand, we criticize typical CNN modules (Formulation 1), allowing fully transparent geometric symmetries across layers of our representation (Properties 1–3). As a result, the proposed representation serves as an effective alternative to the highly black-box CNN in trustworthy tasks.
• Scattering networks. Our work is more compact in achieving rotation invariance. As a main competitor, scattering networks are also based on deep cascading of explicit transforms (wavelets) [24], with concepts similar to our work. However, constructing rotation invariants from scattering networks is complicated: it requires parallel convolution and cross-channel pooling of multiple oriented wavelets, and increasing the orientation sampling results in an exponential growth of the complexity. In contrast, our approach benefits from classical invariant theory: rotation invariance is continuous and one-shot (Property 1), providing better efficiency while making it easy to enlarge the network size to improve the representation capacity.
• Equivariant networks. Our work is non-learning while being more compact in achieving continuous and joint invariance. As a secondary competitor, equivariant networks are also guaranteed by group theory [32], with concepts similar to our work. However, the convolutional layers in equivariant networks are learned, leading to varying degrees of data dependence. In particular, they have a parallel structure similar to scattering networks, leading to exponential complexity and optimization challenges. Although equivariant networks are a very generic design, our approach provides better efficiency for continuous and joint invariance (Properties 1-3), while making it easy to enlarge the network size to improve the representation capacity.
4 Hierarchical Invariance: Practice
This section focuses on the practical aspects of the proposed HIR.
We first specialize the above theory into a class of networks, with more practical details about the topologies, layers, and parameters. We also discuss domain adaptation strategies for practical scenarios, using feature/architecture selection or a cascaded learning module. Note that these efforts serve only as one feasible practice of the hierarchical invariance theory, leading to the experiments and applications in Section 5.
4.1 Specifying the Architecture
Starting from Properties 1-3, we propose a practical architecture for the hierarchical invariance w.r.t. the symmetry group of interest , with a tree topology and working at multiple scales.
Single-scale Networks. Let us first present the topology on a single scale, i.e., all involved convolutional layers have a common scale parameter , which exhibits invariance for (Properties 1 and 2). As shown in Fig. 2, we organize the set of paths as a tree-like network: 1) blue nodes denote the representation units with different parameters ; 2) black nodes denote the identity function; 3) lines denote cascading relationships between nodes, where all nodes sorted by along their paths are plotted at the same level (sorted from top to bottom). Note that the feature map of each node will be fed into for forming the invariants under this path, where the network representation is just the set of invariants under all paths. Here, the order parameter of the previous unit (blue) is always smaller than that of the subsequent ones (under a specific norm), so that the path exhibits an increasing trend in the order. With this design, the main information can be passed through the early nodes, and hence the subsequent nodes capture rich features. Also, the identity function (black) is introduced as a skip-connection trick, allowing the information to be passed to deeper nodes. In this paper, all units from the same level are specified separately from the set , i.e., their orders are equal under the norm.
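The tree topology described above can be sketched as a path enumeration. The sketch below is only an illustration under stated assumptions: orders are modeled as integer pairs `(n, m)`, the norm is taken to be the 1-norm `n + m`, and the helper names `units_at_level` and `enumerate_paths` are hypothetical. Each level contributes either one unit of its order norm or the identity (a skip), so the order norm strictly increases along every path.

```python
from itertools import product

def units_at_level(norm):
    """Hypothetical unit set: integer order pairs (n, m) with n + m == norm."""
    return [(n, norm - n) for n in range(norm + 1)]

def enumerate_paths(norms):
    """Enumerate tree paths; at each level pick one unit or the identity (None)."""
    choices = [units_at_level(k) + [None] for k in norms]
    paths = []
    for combo in product(*choices):
        # The identity acts as a skip connection: the level adds no unit.
        path = tuple(unit for unit in combo if unit is not None)
        if path:  # drop the all-identity path (assumption)
            paths.append(path)
    return paths

paths = enumerate_paths([1, 2])
for p in paths:
    print(p)
```

With two levels of order norm 1 and 2, this yields 11 paths; the network representation would then collect the invariants computed at the end of each path.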
Multi-scale Networks. Next, let us consider a multi-scale version of the above network with scale separation prior, extending the invariance to (Property 3). As shown in Fig. 3, a series of single-scale networks are introduced: 1) they have the same tree topology and same order parameter at corresponding nodes; 2) but each network has a different scale parameter, sampled from the set , where the scaling covariance (w.r.t. ) is transformed into a linear translation pattern (w.r.t. ) between multi-scale networks. The above network can derive a series of multi-scale representations of the image (with invariance for ), which are directly suitable for visual tasks with multi-scale physical structure (e.g., object detection). Further, we can derive scale-invariant representations (with invariance for ) under the Property 3, by pooling feature maps from a series of corresponding nodes at multiple scales. Note that, in practice, we cannot sample the scale completely and densely, and thus the above scaling invariance is restricted.
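The idea that scaling becomes a translation along the scale axis, which pooling then removes, can be illustrated with a toy sketch. The response function, the dyadic scale set, and all names below are assumptions for illustration only, not the paper's definitions.

```python
import math

def response_at_scales(object_size, scales):
    """Toy response that peaks when the filter scale matches the object size."""
    return [1.0 / (1.0 + math.log2(s / object_size) ** 2) for s in scales]

scales = [2.0 ** k for k in range(6)]          # dyadic scale sampling: 1, 2, ..., 32
r_small = response_at_scales(4.0, scales)      # original image
r_large = response_at_scales(8.0, scales)      # same image enlarged by 2x
r_outside = response_at_scales(128.0, scales)  # object beyond the sampled scales

# Scaling by 2x shifts the response by one position along the scale axis.
print(r_large[1:] == r_small[:-1])
# Max pooling over scales removes the shift ...
print(max(r_small), max(r_large))
# ... but only while the peak stays inside the sampled range.
print(max(r_outside))
```

The last line reflects the restriction noted above: with a finite scale set, the pooled value changes once the structure leaves the sampled range.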
![Refer to caption](x2.png)
![Refer to caption](x3.png)
Radial Basis Functions. In our previous work [49], two generic classes of radial basis functions have been introduced, based on a family of harmonic functions:
(24) |
and a family of polynomial functions:
(25) |
respectively, where the fractional parameter , and the polynomial parameters must fulfill and . Both classes of functions can be used to define in Eq. (2), satisfying the orthogonality condition in Section 2.1.
For the sake of simplicity, a family of cosine functions is chosen in all experiments and applications, as a special case of Eq. (24):
(26) |
i.e., forming a hierarchical invariant version of the Polar Cosine Transform (PCT) [50]. Note that we aim to show the superiority of the hierarchical invariant framework itself, even when relying on the naive basis of Eq. (26).
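As a numerical illustration of the orthogonality condition, the sketch below assumes the radial kernel commonly used in the PCT literature, cos(pi n r^2), without normalization constants. Whether this matches Eq. (26) term by term is an assumption, since the equation is not legible in this version; the function names are hypothetical.

```python
import math

def radial_kernel(n, r):
    """Assumed PCT-style radial kernel: cos(pi * n * r^2)."""
    return math.cos(math.pi * n * r * r)

def inner_product(n, m, samples=4000):
    """Midpoint-rule estimate of the integral of R_n(r) R_m(r) r dr over [0, 1]."""
    dr = 1.0 / samples
    total = 0.0
    for k in range(samples):
        r = (k + 0.5) * dr
        total += radial_kernel(n, r) * radial_kernel(m, r) * r
    return total * dr

print(inner_product(1, 2))  # close to 0 for different orders
print(inner_product(2, 2))  # close to 1/4 for equal nonzero orders
print(inner_product(0, 0))  # close to 1/2 for order zero
```

The weight `r` comes from the polar area element; with the substitution u = r^2 the kernels reduce to ordinary cosines on [0, 1], which explains the vanishing cross terms.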
[Figure 4: sample images from the (a) texture, (b) digit, and (c) parasite datasets.]
Invariant Layer. In the monograph [51] and our previous work [17], a number of strategies for directly constructing global invariants in image domains have been presented. They can be naturally used to define in (14), given the equivariant or covariant behavior of deep feature maps (Properties 1-3). In all experiments and applications of this paper, a class of global invariants is concisely designed based on frequency pooling.
Regarding (14), we first let the Fourier basis be . Note that the Fourier Transform (FT) is well understood in the signal processing community and can be considered a good foundation for interpretability. Then, based on the order/frequency sampling of the FT , we define as a frequency-band integral in the polar system:
(27) |
where is the -th frequency band under the norm, with the number of bands .
Here, we can state that the above feature vector directly satisfies the invariance for , in light of Property 1 and the translation, rotation and flipping properties of FT. As for scaling, is compatible with both single-scale and multi-scale networks: 1) regarding the single-scale case, a certain degree of robustness is provided for (at least up to the bandwidth), in light of Property 2 and the scaling property of FT; 2) regarding the multi-scale case, the scaling covariance has been eliminated before feeding into , and thus will satisfy the joint invariance for .
Note that the well-known average pooling is in fact a special case of (27), with and . Our frequency-band integral can be regarded as a generic design of global pooling, with comprehensive consideration of interpretability, invariance, and discriminability.
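A discrete analogue of frequency-band pooling can be sketched as follows. Since Eq. (27) is not legible in this version, the details are assumptions: the sketch pools DFT magnitudes, defines bands by the Euclidean norm of the wrapped frequency, and uses a small synthetic feature map. The names `dft2_magnitude` and `band_pool` are hypothetical.

```python
import cmath
import math

def dft2_magnitude(x):
    """Magnitude of the 2D DFT of a square map (direct O(n^4) evaluation)."""
    n = len(x)
    mag = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0j
            for a in range(n):
                for b in range(n):
                    acc += x[a][b] * cmath.exp(-2j * math.pi * (u * a + v * b) / n)
            mag[u][v] = abs(acc)
    return mag

def band_pool(x, num_bands):
    """Sum DFT magnitudes inside concentric frequency bands."""
    n = len(x)
    mag = dft2_magnitude(x)
    r_max = math.sqrt(2.0) * (n / 2)
    feats = [0.0] * num_bands
    for u in range(n):
        for v in range(n):
            su = u - n if u >= n / 2 else u  # wrapped (signed) frequency
            sv = v - n if v >= n / 2 else v
            r = math.sqrt(su * su + sv * sv)
            k = min(int(num_bands * r / r_max), num_bands - 1)
            feats[k] += mag[u][v]
    return feats

n = 8
image = [[float((3 * a + b * b) % 7) for b in range(n)] for a in range(n)]
shifted = [[image[(a + 2) % n][(b + 5) % n] for b in range(n)] for a in range(n)]
rotated = [[image[b][n - 1 - a] for b in range(n)] for a in range(n)]  # 90 degrees
flipped = [row[::-1] for row in image]

base = band_pool(image, 4)
for variant in (shifted, rotated, flipped):
    print(max(abs(p - q) for p, q in zip(base, band_pool(variant, 4))))

# Average pooling corresponds to the zero-frequency (DC) coefficient alone.
mean_value = sum(map(sum, image)) / (n * n)
print(dft2_magnitude(image)[0][0] / (n * n), mean_value)
```

On this discrete grid, the pooled features agree (up to rounding) for circular shifts, 90-degree rotations, and flips; the continuous-rotation invariance discussed in the text is provided by the earlier layers and is not exercised by this toy example.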
4.2 Empowering the Data Adaptability
Due to the hand-crafted nature of HIR, a fixed set of its features is not adaptive to the data distribution. For larger-scale vision tasks, we propose the following data adaptability strategies, allowing our invariants to reach a discriminability level similar to that of learning representations. Note that, unlike typical (under-)complete invariants, our invariants exhibit a high level of over-completeness due to the local and hierarchical structure, which is the foundation for the data adaptability strategies.
Feature/Architecture Selection. Discriminative features for a given task can be formed in a selection-based manner, inspired by Neural Architecture Search (NAS) [52]. First, we can construct a large-scale tree-like network (going deeper or wider), covering a wide set of paths and parameters, analogous to the notion of supernet in NAS [53]. Then, with the training set under a given task, we can perform correlation analysis of features and labels for ranking discriminative features (as well as the corresponding paths), analogous to the phase of architecture sampling and evaluation in NAS [53]. With the above analysis, we can greatly simplify the initial supernet such that the paths cover top-ranked features for applications, allowing our representations to be task-discriminative.
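The ranking step can be sketched as follows. The text does not specify the correlation measure, so the use of the absolute Pearson correlation with binary labels is an assumption, and the data and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def rank_features(features, labels, top_k):
    """Return the indices of the top_k features by |correlation| with the labels."""
    num_features = len(features[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in features]
        scores.append((abs(pearson(column, labels)), j))
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:top_k]]

# Toy training set: 4 samples, 3 features (one per path, say), binary labels.
features = [
    [0.1, 5.0, 0.00],
    [0.2, 5.0, 0.10],
    [0.9, 5.0, 0.05],
    [1.0, 5.0, 0.15],
]
labels = [0, 0, 1, 1]
selected = rank_features(features, labels, top_k=2)
print(selected)
```

Only the paths that produce the selected features would then need to be retained, which is the sense in which the initial large network is simplified.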
Cascading Learning Module. Discriminative features for a given task can also be formed in a learning-based manner, inspired by Hybrid Representation Learning (HRL) [27]. The main idea is to replace shallow layers of learning CNN with fixed HIR, such that discriminative features are formed in a space with geometric symmetries. According to related justifications [27], this strategy is able to achieve a discriminability level rivaling typical CNN on large-scale classification benchmarks, while exhibiting significantly better training compactness. In all experiments and applications of this paper, we still employ the feature/architecture selection strategy to show the superiority of the hierarchical invariant framework itself, taking also into account that the cascading CNN weakens invariance and interpretability to some extent.
| Method | Time (s; † = GPU) | Original Pre. | Original Rec. | Original F1 | Orien. & Flip. Pre. | Orien. & Flip. Rec. | Orien. & Flip. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classical: | | | | | | | |
| Cosine | 5 | 70.74 | 67.50 | 66.85 | 69.65 | 66.25 | 65.30 |
| Wavelet | 6 | 69.43 | 64.38 | 64.68 | 62.34 | 58.13 | 57.82 |
| Kraw. | 5 | 70.67 | 67.50 | 66.30 | 64.41 | 60.00 | 59.55 |
| Learning: | | | | | | | |
| SimpleNet | 52† | 70.33 | 67.50 | 67.09 | 54.63 | 43.13 | 41.31 |
| SimpleNet+ | 52† | 46.93 | 49.38 | 46.06 | 47.18 | 48.13 | 44.93 |
| AlexNet | 42† | 98.82 | 98.75 | 98.75 | 91.69 | 91.25 | 91.28 |
| AlexNet+ | 41† | 87.61 | 84.38 | 84.05 | 88.37 | 85.63 | 85.76 |
| VGGNet | 266† | 99.41 | 99.38 | 99.37 | 92.18 | 91.25 | 91.37 |
| VGGNet+ | 609† | 91.34 | 90.00 | 89.81 | 92.15 | 91.25 | 91.08 |
| Invariant: | | | | | | | |
| ScatterNet | 42 | 98.89 | 98.75 | 98.75 | 84.98 | 83.13 | 83.08 |
| HIR | 27 | 96.98 | 96.88 | 96.87 | 96.32 | 96.25 | 96.23 |
5 Experiments and Applications
In this section, we comprehensively evaluate the discriminability, robustness, and efficiency of HIR, covering simulation experiments in Section 5.1 and real-world applications in Section 5.2. Here, the main aim is to examine the representation properties promised in previous sections, as well as to position its discriminative power in the era of deep learning.
For Sections 4 and 5, we provide a code repository for HIR at https://github.com/ShurenQi/HIR. All experiments/applications are executed in Matlab R2023a under a Microsoft Windows environment, based on a 2.90-GHz CPU, an RTX-3060 GPU, and 16-GB RAM.
5.1 Computer Vision and Pattern Recognition
We perform classification experiments with HIR on typical sets of texture, digit, and parasite images, benchmarking its representation capabilities. Note that this series of simulation experiments examines the properties promised by our theory under diverse task scales and geometric variants, also with comparisons to a range of hand-crafted and learning representations.
Following the practice of Section 4, our HIR is implemented here as a single-scale network with scale parameter and composition length ; its invariant layer (27) is specialized to average pooling, with and , for a fair comparison with the deep representations using average pooling. Note that the adaptability strategies of Section 4.2 are not employed here, for a direct assessment of its discriminative power. All features are fed into a PCA classifier, trained on features of the training set. Unless otherwise stated, the training and testing sets are formed without any overlap by random sampling at 80% and 20% ratios on the original dataset, respectively.
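For reference, the precision, recall, and F1 scores reported in the tables of this section can be computed as in the sketch below. The text does not state the averaging scheme, so macro-averaging over classes is an assumption here, and the labels are toy data.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(precision, recall, f1)
```

Under macro-averaging, the averaged F1 is generally not the harmonic mean of the averaged precision and recall, which is one possible reading of why the three columns in the tables do not satisfy that identity.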
The competing representations involved here can be summarized as follows:
• Classical complete representation: discrete cosine transform as a global representation;
- •
• Advanced over-complete representations: 1) typical CNNs, i.e., direct-learning CNN (denoted as SimpleNet) and transfer-learning AlexNet [56] and VGGNet [57], also with data augmentation (denoted as ‘+’); 2) invariant CNNs, i.e., scattering networks [24] and our HIR. Broadly speaking, they can all be considered a class of hierarchical invariant representations, but with different levels of invariance, where typical CNNs are invariant only to translations.
| Method | Time (s; † = GPU) | Original Pre. | Original Rec. | Original F1 | Trans. & Rota. Pre. | Trans. & Rota. Rec. | Trans. & Rota. F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classical: | | | | | | | |
| Cosine | 15 | 45.68 | 45.35 | 45.43 | 32.50 | 31.20 | 30.83 |
| Wavelet | 16 | 67.11 | 66.75 | 66.75 | 38.31 | 35.25 | 35.44 |
| Kraw. | 15 | 71.73 | 69.85 | 69.69 | 27.05 | 26.30 | 25.83 |
| Learning: | | | | | | | |
| SimpleNet | 535† | 98.60 | 98.60 | 98.60 | 35.42 | 33.50 | 33.72 |
| SimpleNet+ | 551† | 52.70 | 48.90 | 48.82 | 54.26 | 50.75 | 50.78 |
| AlexNet | 393† | 100 | 100 | 100 | 66.18 | 64.45 | 64.27 |
| AlexNet+ | 392† | 93.07 | 92.00 | 91.80 | 94.23 | 93.10 | 92.93 |
| VGGNet | 3610† | 100 | 100 | 100 | 70.74 | 70.25 | 69.93 |
| VGGNet+ | 7731† | 95.98 | 95.70 | 95.68 | 95.53 | 95.20 | 95.13 |
| Invariant: | | | | | | | |
| ScatterNet | 115 | 98.96 | 98.95 | 98.95 | 57.20 | 56.95 | 56.23 |
| HIR | 57 | 97.48 | 97.45 | 97.45 | 95.05 | 94.95 | 94.98 |
5.1.1 Texture
As shown in Fig. 4, the experiment is executed on the KTH-TIPS dataset (https://www.csc.kth.se/cvap/databases/kth-tips/index.html), a typical benchmark for texture image classification. This dataset has 10 classes, each containing 81 instances, for a total size of 810; hence it is considered a small-scale vision problem.
As shown in Table 2, we list the performance scores of the competing representations on this benchmark, as well as the elapsed time, i.e., CPU featuring time or GPU training time. Besides this direct protocol on the original dataset, we also consider testing image variants with random orientation (w.r.t. degree) or flipping (w.r.t. or axis).
• The classical (over-)complete representations fail to achieve a satisfactory level of discriminability, even in the direct protocol of such a small-scale benchmark.
• The learning CNN family achieves significantly higher scores due to its over-complete and data-adaptive properties, especially AlexNet and VGGNet with large-scale pre-training and transfer learning. In contrast, SimpleNet performs relatively poorly, indicating the sensitivity of learning to network size and training strategy. Under the variant protocol, they exhibit a significant performance degradation, suggesting that the learned features lack invariance w.r.t. natural geometric variations of texture. After introducing augmented training, the CNN scores become more stable, but at the cost of discriminability. A potential reason for this phenomenon is the small amount of training data. Moreover, the computational cost is considerable for this small-scale problem, and a certain training instability is observed.
• The scattering networks provide a high level of discriminability and robustness without feature training and data augmentation, indicating the success of extending classical wavelets to deep representations.
• Our work further extends such success: the HIR achieves a level of discriminability similar to the learning CNN family, while exhibiting better robustness in the variant protocol than all competing representations. In particular, this success builds on our compact and efficient framework, with lower runtimes than scattering networks and the learning CNN family.
5.1.2 Digit
As shown in Fig. 4, the experiment is executed on a digit dataset (https://ww2.mathworks.cn/help/deeplearning/ug/data-sets-for-deep-learning.html) for classification similar to the MNIST benchmark. This dataset has 10 classes from ‘0’ to ‘9’, each containing 1000 instances with rich font differences and geometric distortions, for a total size of 10000; hence it is considered a medium-scale vision problem.
As shown in Table 3, we list performance scores and elapsed times of the competing representations on this benchmark. Besides this direct protocol, we also consider testing image variants with random translation (w.r.t. pixels in axial directions) and random rotation (w.r.t. degrees around the center).
• As the problem size increases, the complete Cosine exhibits a significant degradation, while the over-complete Wavelet and Krawtchouk are more stable, revealing the role of over-completeness in the discriminability. Regarding robustness, such representations all fail against translation and rotation variants, implying the challenging nature of this protocol.
• In general, the learning CNN family continues the performance in Section 5.1.1, further confirming the sensitivity of learning to network size, training strategy, and geometric variants. Its robustness is significantly increased after augmented training. However, theoretically, the resulting robustness is not guaranteed for unseen data distributions (even for similar variants with unseen parameters). One can note the rapid expansion of computational cost: the transfer learning of VGGNet even takes 2 GPU hours.
• The handcrafted scattering networks still provide a good level of discriminability here, further validating their success. However, they exhibit unsatisfactory scores for translation and rotation variants, even lower than the CNN without augmentation, failing to achieve the expected robustness.
• The HIR significantly outperforms the main competitor, i.e., scattering networks. Despite the increased problem size, it still achieves a level of discriminability similar to the learning CNN family, also under the constraints of invariance and compactness. Note that our HIR is the only method exhibiting confusion between classes ‘6’ and ‘9’ in the direct training, while achieving 100% scores for the rest. This phenomenon is in line with the rotation invariance and discriminative power expected by our theory.
| Method | Time (s; † = GPU) | Train./Test. = 8/2 Pre. | Train./Test. = 8/2 Rec. | Train./Test. = 8/2 F1 | Train./Test. = 1/9 Pre. | Train./Test. = 1/9 Rec. | Train./Test. = 1/9 F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classical: | | | | | | | |
| Cosine | 37 | 36.19 | 32.60 | 29.85 | 49.40 | 41.97 | 43.80 |
| Wavelet | 39 | 41.68 | 45.20 | 41.79 | 53.69 | 47.97 | 49.27 |
| Kraw. | 42 | 66.56 | 69.49 | 67.21 | 71.60 | 57.88 | 61.10 |
| Learning: | | | | | | | |
| SimpleNet | 2244† | 90.15 | 89.25 | 89.65 | 84.51 | 76.14 | 78.84 |
| AlexNet | 1796† | 98.87 | 98.40 | 98.63 | 95.92 | 94.69 | 95.27 |
| VGGNet | 9184† | 99.24 | 98.97 | 99.11 | 97.95 | 97.37 | 97.65 |
| Invariant: | | | | | | | |
| ScatterNet | 1277 | 68.41 | 69.71 | 67.55 | 72.52 | 63.30 | 65.70 |
| HIR | 823 | 88.73 | 92.18 | 90.10 | 91.26 | 88.76 | 89.85 |
5.1.3 Parasite
As shown in Fig. 4, the experiment is executed on a micrographic dataset (https://data.mendeley.com/datasets/38jtn4nzs6/3), a typical benchmark for parasite image classification. This dataset has 6 parasite classes and 2 host classes, with real-world diversity regarding imaging, background, morphology, and geometry. Its total size is 34298; hence it is considered a large-scale vision problem.
As shown in Table 4, we list performance scores and elapsed times of the competing representations on this benchmark. Note that we also consider a protocol with different training-testing ratios to analyze the data dependence and sample efficiency.
• In this large-scale problem, the scores of the classical representations drop further, implying a limited level of discriminability. On the other hand, their performance is relatively stable when training samples are reduced, and even better in the 1/9 case, indicating a good efficiency.
• In the learning CNN family, the direct-learning SimpleNet exhibits a clear data dependence. Specifically, it achieves about 90% scores in the 8/2 case (similar to HIR), while the scores drop significantly in the 1/9 case (below HIR). In contrast, AlexNet and VGGNet achieve good discriminability and stability in the 1/9 case, revealing that the transfer strategy effectively inherits the pre-training prior on ImageNet. On the other hand, the cost of pre-training and transfer learning is still considerable, without guaranteed robustness or adaptability for a given data domain.
• Despite outperforming the original wavelets, scattering networks fail to provide a competitive discriminability in the era of deep learning. Here, the common failure of such hand-crafted representations on larger-scale discriminability can be regarded as important evidence for our motivation.
• The HIR achieves a SimpleNet-level discriminability, significantly outperforming our competitor, scattering networks. Also, the HIR is not sensitive to the reduction of training samples, outperforming the learning CNN family in data dependence and sample efficiency. Note that the discriminability of the fixed features from HIR is still lower than that of transfer learning with large-scale pre-training. Therefore, in the next applications, the HIR features will be empowered with the data adaptability strategies of Section 4.2.
[Figure 5: sample images for the digital forensic benchmarks (panels a and b).]
| Method | BIM | CW | DAmage | FGSM | PGD | UP | Average | Worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classical: | | | | | | | | |
| Cosine NN | 34.63 | 33.19 | 90.78 | 39.80 | 34.69 | 2.22 | 39.22 | 2.22 |
| Cosine SVM | 79.57 | 83.34 | 97.26 | 78.24 | 79.22 | 96.68 | 85.72 | 78.24 |
| Wavelet NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Wavelet SVM | 72.83 | 82.09 | 97.77 | 78.21 | 71.80 | 95.87 | 83.10 | 71.80 |
| Krawtchouk NN | 66.43 | 66.49 | 90.86 | 66.43 | 66.44 | 0.00 | 59.44 | 0.00 |
| Krawtchouk SVM | 0.00 | 55.87 | 0.00 | 56.44 | 0.00 | 70.37 | 30.45 | 0.00 |
| Learning: | | | | | | | | |
| SimpleNet | 4.24 | 3.24 | 92.13 | 49.89 | 33.13 | 99.86 | 47.08 | 3.24 |
| AlexNet | 90.20 | 72.72 | 96.63 | 94.61 | 90.91 | 98.45 | 90.59 | 72.72 |
| VGGNet | 96.04 | 62.50 | 99.08 | 98.12 | 96.99 | 99.15 | 91.98 | 62.50 |
| GoogLeNet | 90.29 | 80.04 | 97.09 | 95.29 | 89.94 | 98.75 | 91.90 | 80.04 |
| ResNet | 90.22 | 75.59 | 97.35 | 94.66 | 90.17 | 98.40 | 91.07 | 75.59 |
| DenseNet | 98.93 | 90.19 | 99.34 | 99.23 | 98.85 | 99.76 | 97.72 | 90.19 |
| InceptionNet | 98.70 | 85.14 | 97.38 | 97.32 | 98.66 | 99.41 | 96.10 | 85.14 |
| MobileNet | 92.51 | 82.67 | 97.37 | 96.81 | 92.10 | 98.19 | 93.27 | 82.67 |
| Invariant: | | | | | | | | |
| ScatterNet NN | 81.30 | 70.23 | 95.27 | 91.17 | 82.65 | 94.64 | 85.88 | 70.23 |
| ScatterNet SVM | 84.40 | 69.49 | 96.77 | 90.57 | 83.86 | 95.12 | 86.70 | 69.49 |
| HIR NN | 89.66 | 84.92 | 98.89 | 93.26 | 90.08 | 97.78 | 92.43 | 84.92 |
| HIR SVM | 92.30 | 89.10 | 99.30 | 95.96 | 91.60 | 98.93 | 94.53 | 89.10 |
| Method | Train./Test. = 5/5 Pre. | Train./Test. = 5/5 Rec. | Train./Test. = 5/5 F1 | Train./Test. = 1/9 Pre. | Train./Test. = 1/9 Rec. | Train./Test. = 1/9 F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Classical: | | | | | | |
| Cosine NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cosine SVM | 79.08 | 73.33 | 76.10 | 81.13 | 68.79 | 74.45 |
| Wavelet NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Wavelet SVM | 77.53 | 66.95 | 71.85 | 76.05 | 61.13 | 67.78 |
| Kraw. NN | 50.53 | 15.22 | 23.40 | 50.00 | 15.10 | 23.20 |
| Kraw. SVM | 50.03 | 65.34 | 56.67 | 49.75 | 48.77 | 49.26 |
| Learning: | | | | | | |
| SimpleNet | 47.31 | 48.11 | 47.71 | 50.59 | 63.63 | 56.36 |
| AlexNet | 81.46 | 87.35 | 84.30 | 72.24 | 61.36 | 66.35 |
| VGGNet | 81.41 | 90.04 | 85.51 | 78.83 | 75.35 | 77.05 |
| GoogLeNet | 82.74 | 85.46 | 84.08 | 63.35 | 57.74 | 60.42 |
| ResNet | 80.93 | 84.70 | 82.77 | 68.48 | 66.64 | 67.55 |
| DenseNet | 87.92 | 93.25 | 90.51 | 82.07 | 83.96 | 83.00 |
| InceptionNet | 84.60 | 90.92 | 87.65 | 69.58 | 70.77 | 70.17 |
| MobileNet | 83.07 | 88.07 | 85.50 | 68.73 | 69.50 | 69.11 |
| Invariant: | | | | | | |
| Scatter. NN | 69.85 | 68.94 | 69.39 | 74.93 | 77.31 | 76.10 |
| Scatter. SVM | 75.70 | 72.07 | 73.84 | 76.42 | 78.63 | 77.51 |
| HIR NN | 81.27 | 80.68 | 80.98 | 79.09 | 82.17 | 80.60 |
| HIR SVM | 86.20 | 86.06 | 86.13 | 83.42 | 83.29 | 83.35 |
5.2 Digital Forensic and Forgery Detection
For real-world applications, we employ the HIR for large-scale digital forensics, i.e., the detection of adversarial perturbations and AIGC, to directly check its usefulness in robust and interpretable tasks. Note that this plug-and-play strategy will be compared not only to similar representations, but also to a range of current forensic solutions, including well-designed deep forensics.
Following the practice of Section 4, our HIR is implemented here as a single-scale network with scale parameter and composition length ; its invariant layer (27) is specialized with and , for improving the discriminability of digital artifacts. Note that the feature/architecture selection strategy of Section 4.2 is employed for data adaptability and discriminability, where the top-ranked 500- and 1000-dimensional features are selected for AIGC and adversarial perturbation, respectively. All features are fed into both NN and SVM classifiers, for evaluating the sensitivity w.r.t. the classifier. Unless otherwise stated, the training and testing sets are formed without any overlap by random sampling at 50% and 50% ratios on the original dataset, respectively.
![Refer to caption](x10.png)
The competing methods involved here can be summarized as follows:
• All the representations in Section 5.1 as direct forensics;
- •
- •
- •
5.2.1 Adversarial Perturbations
As shown in Fig. 5, the ImageNet dataset (https://www.image-net.org/) is perturbed through 6 adversarial methods, i.e., BIM [73], CW [74], Damage [75], FGSM [76], PGD [77], and UP [78], respectively, resulting in 6 benchmarks, each containing 5000 clean images and 5000 perturbed versions. This task exhibits real-world discriminative challenges, in light of the rich variability of the perturbations themselves and the underlying ImageNet.
In Fig. 6, we first provide a comparison with current solutions for perturbation forensics on the basic and realistic UP benchmark. Despite the fixed perturbation pattern, some competing methods still fail to achieve good scores. Such methods rely on under-complete representations and are thereby unable to comprehensively capture perturbation patterns. In contrast, the over-complete arXiv’23 and our HIR both achieve 90% scores, further revealing the fundamental role of representation in forensic tasks. Thus, we next compare relevant representation strategies.
In Table 5, we train and test all representations on the 6 benchmarks, presenting the corresponding F1 scores, as well as the average and worst score statistics. This protocol exhibits richer intra-class variability over the fixed perturbation.
• The frequency difference between natural and perturbed data is a fruitful forensic clue. Therefore, the classical (time-)frequency representations achieve higher scores than generally expected on this large-scale problem. However, such features exhibit significant sensitivity to classifiers. A potential reason is the restricted separability, where one must resort to complex classification strategies in the feature space.
• In the learning CNN family, all large-scale networks exhibit average scores above 90%, especially DenseNet and InceptionNet. This phenomenon suggests that transfer learning is good at capturing discriminative features with sufficient training data and an aligned testing protocol. As for the attacks, the CW is more challenging and dominates the worst scores, mainly due to its variable and weak patterns.
• The scattering networks achieve similar scores and much better classifier stability than the original wavelets, suggesting an improvement in the separability. However, their average scores do not reach 90%, failing to provide a level of discriminability similar to the learning CNN.
• Our HIR is very robust to classifier changes, while achieving a MobileNet-level discriminability, slightly lower than DenseNet and InceptionNet, and significantly better than the direct competitor, scattering networks. Therefore, our strategy has a better combined performance in robustness, interpretability, and discriminability. Its efficiency benefit will be highlighted in the next experimental protocol.
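The average and worst-case statistics of Table 5 follow directly from the per-benchmark F1 scores. As a small check, the sketch below recomputes them for the HIR SVM row; the variable names are illustrative only.

```python
# Per-benchmark F1 scores of the "HIR SVM" row in Table 5.
hir_svm_f1 = {
    "BIM": 92.30,
    "CW": 89.10,
    "DAmage": 99.30,
    "FGSM": 95.96,
    "PGD": 91.60,
    "UP": 98.93,
}

average = round(sum(hir_svm_f1.values()) / len(hir_svm_f1), 2)
worst = min(hir_svm_f1.values())
worst_attack = min(hir_svm_f1, key=hir_svm_f1.get)
print(average, worst, worst_attack)
```

The recomputed values match the tabulated average (94.53) and worst score (89.10), with CW as the hardest benchmark for this row.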
In Table 6, we train and test all representations on a hybrid of the 6 perturbation benchmarks, presenting scores at two training-testing ratios. This protocol is more challenging due to very complex intra-class variability, while being more practical for real-world forensic scenarios.
![Refer to caption](x11.png)
| Method | ADM | BGAN | GLIDE | Midjourney | SD1.4 | SD1.5 | VQDM | Wukong | Average | Worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classical: | | | | | | | | | | |
| Cosine NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 65.09 | 0.00 | 0.00 | 8.14 | 0.00 |
| Cosine SVM | 99.19 | 99.95 | 99.57 | 89.02 | 99.10 | 98.80 | 99.46 | 99.11 | 98.03 | 89.02 |
| Wavelet NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.08 | 0.00 | 0.00 | 0.26 | 0.00 |
| Wavelet SVM | 99.98 | 99.70 | 99.87 | 85.55 | 98.63 | 99.04 | 99.97 | 99.38 | 97.76 | 85.55 |
| Krawtchouk NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Krawtchouk SVM | 99.75 | 99.60 | 98.49 | 62.44 | 74.11 | 77.98 | 93.63 | 76.00 | 85.25 | 62.44 |
| Learning: | | | | | | | | | | |
| SimpleNet | 98.25 | 97.87 | 92.98 | 68.00 | 73.52 | 74.37 | 74.88 | 76.32 | 82.02 | 68.00 |
| AlexNet | 94.45 | 98.99 | 98.26 | 81.52 | 87.96 | 88.66 | 84.24 | 88.62 | 90.34 | 81.52 |
| VGGNet | 99.40 | 99.38 | 98.57 | 86.44 | 89.97 | 91.86 | 93.60 | 90.09 | 93.66 | 86.44 |
| GoogLeNet | 80.30 | 99.18 | 98.16 | 75.00 | 82.77 | 82.44 | 86.75 | 82.32 | 85.87 | 75.00 |
| ResNet | 98.78 | 99.14 | 97.78 | 87.41 | 89.88 | 90.85 | 88.53 | 88.80 | 92.65 | 87.41 |
| DenseNet | 99.63 | 99.60 | 98.57 | 93.08 | 93.79 | 94.50 | 95.01 | 92.55 | 95.84 | 92.55 |
| InceptionNet | 97.69 | 99.41 | 98.32 | 90.07 | 89.40 | 92.55 | 92.72 | 88.35 | 93.56 | 88.35 |
| MobileNet | 90.08 | 99.28 | 97.95 | 87.49 | 88.51 | 90.75 | 87.74 | 88.29 | 91.26 | 87.49 |
| Invariant: | | | | | | | | | | |
| ScatterNet NN | 99.10 | 99.63 | 98.44 | 79.47 | 89.26 | 89.95 | 96.71 | 89.07 | 92.70 | 79.47 |
| ScatterNet SVM | 99.18 | 99.67 | 99.05 | 85.21 | 95.85 | 95.58 | 97.02 | 94.60 | 95.77 | 85.21 |
| HIR NN | 99.92 | 99.97 | 99.83 | 92.63 | 98.57 | 98.97 | 99.92 | 98.58 | 98.55 | 92.63 |
| HIR SVM | 99.90 | 99.92 | 99.78 | 92.18 | 99.07 | 99.26 | 99.87 | 99.42 | 98.68 | 92.18 |
- In line with previous observations, the classical representations still exhibit score fluctuations across the two classifiers. We also note a performance degradation compared to Table 5, due to the discriminative challenges posed by this hybrid protocol. On the other hand, their performance is stable w.r.t. the reduction of training samples, further validating their inherent advantage in sample efficiency.
- Moving to this hybrid benchmark, the learning CNN family suffers consistent and large performance degradation, especially in the case with fewer samples. This phenomenon is direct evidence of the data dependence of learning representations (even with a transfer strategy). In fact, real-world forensics often faces situations where the perturbation types are diverse and some of them lack samples. Therefore, such data-dependent forensics typically requires time-consuming (re-)training, while failing to guarantee validity for under-sampled perturbation patterns.
- The scattering networks largely maintain the discriminability level and classifier stability observed in Table 5. Note that their scores in the case with fewer training samples are higher than those of most classical and learning representations, reflecting superior performance in both discriminability and efficiency.
- In this challenging protocol, the hand-crafted HIR still achieves learning-level discriminability and consistently outperforms scattering networks. More importantly, our HIR is significantly less dependent on training samples than learning CNN, meaning it can better cope with under-sampled perturbation patterns in practice. In the next, larger-scale forensic task, the comprehensive advantages of HIR over learning CNN are further highlighted in robustness, interpretability, discriminability, and efficiency.
Testing with random orientation and flipping:

| Method | ADM | BGAN | GLIDE | Midjourney | SD1.4 | SD1.5 | VQDM | Wukong | Average | Worst |
|---|---|---|---|---|---|---|---|---|---|---|
| Classical: | | | | | | | | | | |
| Cosine SVM | 99.16 | 99.95 | 99.55 | 88.07 | 99.05 | 98.72 | 99.36 | 99.11 | 97.87 | 88.07 |
| Wavelet SVM | 99.95 | 99.80 | 99.85 | 82.83 | 99.08 | 98.71 | 99.95 | 99.16 | 97.42 | 82.83 |
| Krawtchouk SVM | 70.90 | 99.58 | 98.57 | 64.90 | 77.28 | 76.58 | 94.44 | 76.72 | 82.37 | 64.90 |
| Learning: | | | | | | | | | | |
| SimpleNet | 77.72 | 95.04 | 92.99 | 65.25 | 74.52 | 74.72 | 73.62 | 76.91 | 78.85 | 65.25 |
| AlexNet | 81.82 | 99.08 | 97.99 | 77.45 | 85.87 | 87.93 | 83.29 | 86.78 | 87.53 | 77.45 |
| VGGNet | 76.18 | 99.40 | 98.41 | 82.23 | 89.26 | 89.30 | 88.79 | 89.00 | 89.07 | 76.18 |
| GoogLeNet | 80.62 | 99.30 | 98.09 | 73.13 | 81.29 | 81.92 | 85.36 | 82.31 | 85.25 | 73.13 |
| ResNet | 85.62 | 99.28 | 97.60 | 81.82 | 85.60 | 86.84 | 87.94 | 85.21 | 88.74 | 81.82 |
| DenseNet | 84.57 | 99.56 | 98.66 | 88.57 | 91.16 | 91.60 | 94.26 | 89.47 | 92.23 | 84.57 |
| InceptionNet | 91.99 | 99.30 | 98.56 | 87.04 | 85.72 | 88.85 | 92.19 | 85.24 | 91.11 | 85.24 |
| MobileNet | 85.14 | 99.36 | 97.78 | 84.54 | 86.74 | 88.80 | 87.50 | 86.56 | 89.55 | 84.54 |
| Invariant: | | | | | | | | | | |
| ScatterNet SVM | 92.61 | 99.67 | 99.18 | 82.75 | 88.81 | 89.40 | 97.05 | 83.83 | 91.66 | 82.75 |
| HIR SVM | 99.88 | 99.87 | 99.77 | 91.99 | 99.02 | 99.25 | 99.83 | 99.45 | 98.63 | 91.99 |
| Method | Pre. (Train./Test. = 5/5) | Rec. (5/5) | F1 (5/5) | Pre. (Train./Test. = 1/9) | Rec. (1/9) | F1 (1/9) |
|---|---|---|---|---|---|---|
| Classical: | | | | | | |
| Cosine NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cosine SVM | 94.95 | 94.57 | 94.76 | 94.36 | 91.06 | 92.68 |
| Wavelet NN | 48.70 | 94.17 | 64.20 | 48.69 | 94.13 | 64.18 |
| Wavelet SVM | 94.03 | 94.57 | 94.30 | 83.55 | 93.48 | 88.24 |
| Kraw. NN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Kraw. SVM | 75.24 | 74.77 | 75.00 | 71.56 | 68.57 | 70.03 |
| Learning: | | | | | | |
| SimpleNet | 61.79 | 40.70 | 49.08 | 56.40 | 60.48 | 58.37 |
| AlexNet | 80.76 | 77.63 | 79.16 | 71.83 | 72.50 | 72.17 |
| VGGNet | 84.75 | 86.67 | 85.70 | 72.45 | 72.37 | 72.41 |
| GoogLeNet | 74.15 | 80.40 | 77.15 | 67.84 | 68.83 | 68.33 |
| ResNet | 85.10 | 83.03 | 84.06 | 76.88 | 73.67 | 75.24 |
| DenseNet | 86.83 | 85.23 | 86.02 | 76.84 | 75.37 | 76.10 |
| InceptionNet | 82.69 | 86.63 | 84.62 | 68.62 | 68.56 | 68.59 |
| MobileNet | 81.54 | 82.47 | 82.00 | 68.52 | 68.57 | 68.55 |
| Invariant: | | | | | | |
| Scatter. NN | 83.68 | 83.73 | 83.71 | 79.37 | 79.70 | 79.53 |
| Scatter. SVM | 90.31 | 85.17 | 87.67 | 85.28 | 79.70 | 82.40 |
| HIR NN | 96.79 | 96.47 | 96.63 | 95.66 | 93.04 | 94.33 |
| HIR SVM | 96.92 | 96.37 | 96.64 | 95.21 | 94.26 | 94.73 |
5.2.2 Artificial Intelligence Generated Content
As shown in Fig. 5, fake images with content similar to ImageNet are synthesized by 8 AIGC methods, i.e., ADM [79], BGAN [80], GLIDE [81], Midjourney (https://www.midjourney.com/home), SD 1.4 [82], SD 1.5 [82], VQDM [83], and Wukong (https://xihe.mindspore.cn/modelzoo/wukong), resulting in 8 benchmarks, each containing 6000 natural images and 6000 synthesized images. This task poses a higher level of discriminative challenge, given the very rich variability of both natural and synthesized content.
In Fig. 7, we first provide a comparison with current AIGC forensic solutions on the basic SD 1.5 benchmark. All these methods, based on deep networks and feature enhancements, achieve good forensic scores. This indicates that the forensic scenario with sufficient training and aligned testing is not challenging for typical learning representations. Here, the hand-crafted HIR also achieves 100% scores, indicating similar discriminability in this forensic scenario. Next, we compare scores in a comprehensive manner, mainly at the representation level, with some AIGC forensic solutions included as references.
In Table 7, we train and test all representations on the 8 benchmarks, presenting the corresponding F1 scores, as well as the average and worst score statistics. Besides this direct protocol, we also consider testing image variants with random orientation or flipping in Table 8, reflecting the basic geometric robustness requirements.
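The average and worst statistics reported next to the per-benchmark F1 scores can be recomputed from any row of the table. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code; `summarize_row` and `BENCHMARKS` are hypothetical names introduced here, and the input values are the HIR NN row copied from the table.

```python
import numpy as np

# Benchmark order as in the table header.
BENCHMARKS = ["ADM", "BGAN", "GLIDE", "Midjourney", "SD1.4", "SD1.5", "VQDM", "Wukong"]

def summarize_row(scores):
    """Return (average, worst score, name of the worst benchmark) for one method."""
    scores = np.asarray(scores, dtype=float)
    worst_idx = int(np.argmin(scores))
    return float(scores.mean()), float(scores[worst_idx]), BENCHMARKS[worst_idx]

# Per-benchmark F1 scores (%) of the HIR NN row, copied from the table.
hir_nn = [99.92, 99.97, 99.83, 92.63, 98.57, 98.97, 99.92, 98.58]
average, worst, worst_name = summarize_row(hir_nn)
print(f"average={average:.2f}, worst={worst:.2f} ({worst_name})")
```

For this row the sketch gives an average of 98.55 and a worst score of 92.63 (on Midjourney), matching the Average and Worst columns.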
- One can observe that the frequency forensic clue of the AI-generated pipeline is still very effective. The classical representations with the SVM classifier achieve consistently good accuracy and robustness, suggesting that state-of-the-art generators (even diffusion ones) still exhibit inherent frequency artifacts. On the other hand, such hand-crafted features are sensitive to the classifier (their NN scores in Table 7 are close to zero), in line with the observations of Section 5.2.1.
- The learning representations other than SimpleNet and GoogLeNet achieve average scores above 90%, further confirming their good discriminability with sufficient training data and an aligned testing protocol. However, under the robustness protocol in Table 8, they exhibit varying degrees of performance degradation in both average and worst statistics. Clearly, even natural and slight shifts in the data distribution can strongly interfere with learning forensics. In particular, such interference is highly black-boxed (i.e., unpredictable); one example is the significantly larger fluctuations on ADM compared to the other benchmarks.
- The scattering networks exhibit a similar level of discriminability and robustness to the learning representations, while outperforming the original wavelets in classifier stability. Note that scattering networks fail to achieve the expected invariance and thus cannot provide higher robustness scores than learning CNN.
- Regarding discriminability, geometric invariance, and classifier stability, our HIR achieves better combined performance than classical representations, scattering networks, and learning CNN. This is in line with our theoretical expectation that HIR combines the advantages of both hand-crafted and learning representations. Its efficiency is further highlighted in the next experimental protocol.
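The robustness protocol with random orientation and flipping can be illustrated with a small sketch. It rests on assumptions that are not taken from the paper: rotations are restricted to multiples of 90 degrees so that no interpolation is involved (the actual protocol may use other angles), and `ring_means` is a toy invariant feature introduced only for illustration, not the HIR. Both function names are hypothetical.

```python
import numpy as np

def random_orientation_flip(image, rng):
    """Rotate by a random multiple of 90 degrees, then flip left-right with probability 0.5."""
    k = int(rng.integers(0, 4))
    out = np.rot90(image, k)
    if rng.random() < 0.5:
        out = np.fliplr(out)
    return out

def ring_means(image, n_rings=4):
    """Toy invariant: mean intensity inside concentric rings around the image center."""
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Integer arithmetic keeps the radius exactly symmetric under rotation and flipping.
    r = np.sqrt((2 * yy - (h - 1)) ** 2 + (2 * xx - (w - 1)) ** 2) / 2.0
    inner_edges = np.linspace(0.0, r.max(), n_rings + 1)[1:-1]
    ring_index = np.digitize(r, inner_edges)  # values in 0 .. n_rings - 1
    return np.array([image[ring_index == i].mean() for i in range(n_rings)])

rng = np.random.default_rng(0)
image = rng.random((16, 16))
variant = random_orientation_flip(image, rng)
gap = float(np.max(np.abs(ring_means(image) - ring_means(variant))))
print(f"max feature gap under rotation/flip: {gap:.2e}")
```

Because every ring is mapped onto itself by these transformations, the feature gap stays at floating-point rounding level. A representation without such built-in symmetry has no comparable guarantee, which is the kind of gap the robustness protocol is designed to expose.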
In Table 9, we train and test all representations on a hybrid of the 8 AIGC benchmarks, presenting scores at two training-testing ratios. This protocol is more challenging due to very complex intra-class variability, while being more practical for real-world forensic scenarios.
- The classical representations exhibit good discriminability for this hybrid benchmark, along with the benefit of sample efficiency. In line with previous observations, they still exhibit score fluctuations on the two classifiers.
- The forensic scores of the learning CNN family drop significantly, with an average F1 score of only about 70% in the Train./Test. = 1/9 case (Table 9). This further illustrates the weakness of learning forensic algorithms in dealing with real-world scenarios, i.e., the data dependence problem.
- The scattering networks inherit the discriminability level and classifier stability from Table 7, and also score better than all learning representations in the Train./Test. = 1/9 case. Together, these results indicate their superiority in combined performance.
- Here, the HIR achieves the highest scores among classical representations, scattering networks, and learning CNN. Its discriminability allows for hybrid forensics of the 8 AIGC methods in the case with fewer samples, implying usefulness in real-world forensic scenarios. Note that ECCV’20, CVPR’20a, and CVPR’20b, which are specifically designed for AIGC forensics, exhibit 79.67%, 73.74%, and 76.89% F1 scores in this case, respectively. In summary, our HIR yields consistently good discriminability, robustness, and efficiency in a plug-and-play way, spanning 3 classical vision tasks and 2 forensic tasks. No other representation or forensic method achieves such results in these experiments.
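The hybrid protocol and its metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the split is assumed to be stratified by class, the synthesized class is assumed to be the positive class, zero denominators are mapped to a score of 0, and `split_indices` and `precision_recall_f1` are hypothetical helper names. As a consistency check on the F1 definition used here, a precision of 94.95 and a recall of 94.57 give an F1 of about 94.76, which agrees with the Cosine SVM entry at Train./Test. = 5/5.

```python
import numpy as np

def split_indices(labels, train_fraction, rng):
    """Stratified split: each class contributes the same fraction to the training set."""
    train_parts = []
    for c in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(train_fraction * members.size))
        train_parts.append(members[:n_train])
    train_idx = np.sort(np.concatenate(train_parts))
    test_mask = np.ones(labels.size, dtype=bool)
    test_mask[train_idx] = False
    return train_idx, np.flatnonzero(test_mask)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1; a zero denominator yields 0.0 (assumed convention)."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

# Toy pooled benchmark: 0 = natural, 1 = synthesized.
rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)
train_idx, test_idx = split_indices(labels, train_fraction=0.1, rng=rng)
print(train_idx.size, test_idx.size)
```

With `train_fraction=0.1` the toy pool of 100 samples is split into 10 training and 90 testing samples, mirroring the Train./Test. = 1/9 setting. A classifier trained on `train_idx` would then be scored on `test_idx` with `precision_recall_f1`.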
6 Conclusion
In this paper, we have systematically investigated the topic of hierarchical invariance, as an early attempt to harmonize the divergence between typical CNN and invariants w.r.t. discriminability and robustness. Compared to related research approaches, our hierarchical invariant representation can be characterized as 1) principled and interpretable design, 2) efficient invariant structure, and 3) competitive discriminability in the era of deep learning.
The theory ingredients of our work are as follows.
- We have formalized a blueprint for hierarchical invariance, rethinking the typical modules of CNN representations.
- We have defined new modules with their compositions to fulfill the blueprint, providing formal conclusions about the geometric symmetries between image and representation.
- We have discussed the criticisms and developments of the above new idea versus typical concepts, highlighting our uniqueness in moving towards robust and interpretable representations.
The practice ingredients of our work are as follows.
- We have provided a specific framework for the theory of hierarchical invariance, covering practical principles about the topologies, layers, and parameters.
- We have explored the data adaptability potential of the above framework, resorting to feature/architecture selection or a cascaded learning module.
The application ingredients of our work are as follows.
- We have conducted pattern classification experiments on typical sets of texture, digit, and parasite images, respectively, examining the properties promised by our theory under diverse task scales and geometric variants.
- We have checked the realistic usefulness in large-scale digital forensics of adversarial perturbation and AIGC. Here, the HIR yields consistently good discriminability, robustness, efficiency, and interpretability in a plug-and-play way, exhibiting competitive overall performance compared with current representation and forensic methods.
References
- [1] J. M. Wing, “Trustworthy AI,” Commun. ACM, vol. 64, no. 10, pp. 64–71, 2021.
- [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
- [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
- [4] H. Wang, T. Fu, Y. Du, W. Gao, K. Huang, Z. Liu, P. Chandak, S. Liu, P. Van Katwyk, A. Deac et al., “Scientific discovery in the age of artificial intelligence,” Nature, vol. 620, no. 7972, pp. 47–60, 2023.
- [5] K. Sundararajan and D. L. Woodard, “Deep learning for biometrics: A survey,” ACM Comput. Surv., vol. 51, no. 3, pp. 1–34, 2018.
- [6] S. Warnat-Herresthal, H. Schultze, K. L. Shastry, S. Manamohan, S. Mukherjee, V. Garg, R. Sarveswara, K. Händler, P. Pickkers, N. A. Aziz et al., “Swarm learning for decentralized and confidential clinical machine learning,” Nature, vol. 594, no. 7862, pp. 265–270, 2021.
- [7] S. Feng, H. Sun, X. Yan, H. Zhu, Z. Zou, S. Shen, and H. X. Liu, “Dense reinforcement learning for safety validation of autonomous vehicles,” Nature, vol. 615, no. 7953, pp. 620–627, 2023.
- [8] F. Juefei-Xu, R. Wang, Y. Huang, Q. Guo, L. Ma, and Y. Liu, “Countering malicious deepfakes: Survey, battleground, and horizon,” Int. J. Comput. Vis., vol. 130, no. 7, pp. 1678–1734, 2022.
- [9] H. Liu, M. Chaudhary, and H. Wang, “Towards trustworthy and aligned machine learning: A data-centric survey with causality perspectives,” arXiv preprint arXiv:2307.16851, 2023.
- [10] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geometric deep learning: Grids, groups, graphs, geodesics, and gauges,” arXiv preprint arXiv:2104.13478, 2021.
- [11] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
- [12] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9446–9454.
- [13] F. Klein, “A comparative review of recent researches in geometry,” Bull. Am. Math. Soc., vol. 2, no. 10, pp. 215–249, 1893.
- [14] R. Tobies, Felix Klein—Mathematician, Academic Organizer, Educational Reformer. Cham: Springer International Publishing, 2019, pp. 5–21.
- [15] K. Fukushima and S. Miyake, “Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position,” Pattern Recognit., vol. 15, no. 6, pp. 455–469, 1982.
- [16] V. Balntas, K. Lenc, A. Vedaldi, T. Tuytelaars, J. Matas, and K. Mikolajczyk, “H-Patches: A benchmark and evaluation of handcrafted and learned local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 11, pp. 2825–2841, 2019.
- [17] S. Qi, Y. Zhang, C. Wang, J. Zhou, and X. Cao, “A survey of orthogonal moments for image representation: theory, implementation, and evaluation,” ACM Comput. Surv., vol. 55, no. 1, pp. 1–35, 2021.
- [18] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, pp. 91–110, 2004.
- [19] E. Tola, V. Lepetit, and P. Fua, “Daisy: An efficient dense descriptor applied to wide-baseline stereo,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 815–830, 2009.
- [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, pp. 211–252, 2015.
- [21] R. Zhang, “Making convolutional networks shift-invariant again,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 7324–7334.
- [22] C. Buckner, “Understanding adversarial examples requires a theory of artefacts for deep learning,” Nature Mach. Intell., vol. 2, no. 12, pp. 731–736, 2020.
- [23] M. Taddeo, T. McCutcheon, and L. Floridi, “Trusting artificial intelligence in cybersecurity is a double-edged sword,” Nature Mach. Intell., vol. 1, no. 12, pp. 557–560, 2019.
- [24] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1872–1886, 2013.
- [25] L. Sifre and S. Mallat, “Rotation, scaling and deformation invariant scattering for texture discrimination,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1233–1240.
- [26] T. Wiatowski and H. Bölcskei, “A mathematical theory of deep convolutional neural networks for feature extraction,” IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1845–1866, 2017.
- [27] E. Oyallon, S. Zagoruyko, G. Huang, N. Komodakis, S. Lacoste-Julien, M. Blaschko, and E. Belilovsky, “Scattering networks for hybrid representation learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2208–2221, 2018.
- [28] J. Andén and S. Mallat, “Deep scattering spectrum,” IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4114–4128, 2014.
- [29] X. Chen, X. Cheng, and S. Mallat, “Unsupervised deep Haar scattering on graphs,” Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014.
- [30] S. Yu, “Evolving scattering networks for engineering disorder,” Nature Comput. Sci., vol. 3, no. 2, pp. 128–138, 2023.
- [31] S. Cheng, Y.-S. Ting, B. Ménard, and J. Bruna, “A new approach to observational cosmology using the scattering transform,” Mon. Not. R. Astron. Soc., vol. 499, no. 4, pp. 5902–5914, 2020.
- [32] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 2990–2999.
- [33] T. S. Cohen and M. Welling, “Steerable CNNs,” in Proc. Int. Conf. Learn. Representations, 2016.
- [34] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Harmonic networks: Deep translation and rotation equivariance,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5028–5037.
- [35] M. Weiler, F. A. Hamprecht, and M. Storath, “Learning steerable filters for rotation equivariant CNNs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 849–858.
- [36] I. Sosnovik, M. Szmaja, and A. Smeulders, “Scale-equivariant steerable networks,” in Proc. Int. Conf. Learn. Representations, 2019.
- [37] D. Worrall and M. Welling, “Deep scale-spaces: Equivariance over scale,” Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
- [38] Z. Sun and T. Blu, “Empowering networks with scale and rotation equivariance using a similarity convolution,” in Proc. Int. Conf. Learn. Representations, 2022.
- [39] M. Finzi, S. Stanton, P. Izmailov, and A. G. Wilson, “Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 3165–3176.
- [40] E. J. Bekkers, “B-spline CNNs on Lie groups,” in Proc. Int. Conf. Learn. Representations, 2019.
- [41] Q. Xie, Q. Zhao, Z. Xu, and D. Meng, “Fourier series expansion based filter parametrization for equivariant convolutions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4537–4551, 2022.
- [42] K. Atz, F. Grisoni, and G. Schneider, “Geometric deep learning on molecular representations,” Nature Mach. Intell., vol. 3, no. 12, pp. 1023–1032, 2021.
- [43] R. J. Townshend, S. Eismann, A. M. Watkins, R. Rangan, M. Karelina, R. Das, and R. O. Dror, “Geometric deep learning of RNA structure,” Science, vol. 373, no. 6558, pp. 1047–1051, 2021.
- [44] I. Goodfellow, P. McDaniel, and N. Papernot, “Making machine learning robust against adversarial inputs,” Commun. ACM, vol. 61, no. 7, pp. 56–66, 2018.
- [45] F. Zhan, Y. Yu, R. Wu, J. Zhang, S. Lu, L. Liu, A. Kortylewski, C. Theobalt, and E. Xing, “Multimodal image synthesis and editing: The generative AI era,” IEEE Trans. Pattern Anal. Mach. Intell., 2023.
- [46] S. Qi, Y. Zhang, C. Wang, J. Zhou, and X. Cao, “A principled design of image representation: Towards forensic tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 5337–5354, 2022.
- [47] K. Lenc and A. Vedaldi, “Understanding image representations by measuring their equivariance and equivalence,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 991–999.
- [48] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia, “Rotation equivariant vector field networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5048–5057.
- [49] S. Qi, Y. Zhang, C. Wang, T. Xiang, X. Cao, and Y. Xiang, “Representing noisy image without denoising,” arXiv preprint arXiv:2301.07409, 2023.
- [50] P.-T. Yap, X. Jiang, and A. C. Kot, “Two-dimensional polar harmonic transforms for invariant image representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1259–1270, 2009.
- [51] J. Flusser, B. Zitova, and T. Suk, Moments and moment invariants in pattern recognition. John Wiley & Sons, 2009.
- [52] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, “Understanding and simplifying one-shot architecture search,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 550–559.
- [53] M. Guo, Y. Yang, R. Xu, Z. Liu, and D. Lin, “When NAS meets robustness: In search of robust architectures against adversarial attacks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 631–640.
- [54] S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, 1989.
- [55] P.-T. Yap, R. Paramesran, and S.-H. Ong, “Image analysis by Krawtchouk moments,” IEEE Trans. Image Process., vol. 12, no. 11, pp. 1367–1377, 2003.
- [56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017.
- [57] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
- [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
- [60] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
- [61] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
- [62] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenet-v2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
- [63] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, “Detecting adversarial samples from artifacts,” arXiv preprint arXiv:1703.00410, 2017.
- [64] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” in Proc. Int. Conf. Learn. Representations, 2018.
- [65] B. Liang, H. Li, M. Su, X. Li, W. Shi, and X. Wang, “Detecting adversarial image examples in deep neural networks with adaptive noise reduction,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 1, pp. 72–85, 2018.
- [66] G. Goswami, A. Agarwal, N. Ratha, R. Singh, and M. Vatsa, “Detecting and mitigating adversarial perturbations for robust face recognition,” Int. J. Comput. Vis., vol. 127, pp. 719–742, 2019.
- [67] J. Liu, W. Zhang, Y. Zhang, D. Hou, Y. Liu, H. Zha, and N. Yu, “Detection based defense against adversarial examples from the steganalysis point of view,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4825–4834.
- [68] A. Agarwal, R. Singh, M. Vatsa, and N. Ratha, “Image transformation-based defense against adversarial perturbation on deep learning models,” IEEE Trans. Dependable Secure Comput., vol. 18, no. 5, pp. 2106–2121, 2020.
- [69] C. Wang, S. Qi, Z. Huang, Y. Zhang, R. Lan, and X. Cao, “Towards an accurate and secure detector against adversarial perturbations,” arXiv preprint arXiv:2305.10856, 2023.
- [70] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in frequency: Face forgery detection by mining frequency-aware clues,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–103.
- [71] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “CNN-generated images are surprisingly easy to spot… for now,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8695–8704.
- [72] Z. Liu, X. Qi, and P. H. Torr, “Global texture enhancement for fake face detection in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8060–8069.
- [73] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” in Artificial Intelligence Safety and Security. Chapman and Hall/CRC, 2018, pp. 99–112.
- [74] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in Proc. IEEE Symp. Secur. Privacy, 2017, pp. 39–57.
- [75] S. Chen, Z. He, C. Sun, J. Yang, and X. Huang, “Universal adversarial attack on attention and the resulting dataset DamageNet,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 2188–2197, 2020.
- [76] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
- [77] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in Proc. Int. Conf. Learn. Representations, 2018.
- [78] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1765–1773.
- [79] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 8780–8794, 2021.
- [80] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
- [81] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
- [82] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10684–10695.
- [83] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, “Vector quantized diffusion model for text-to-image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10696–10706.
Shuren Qi received the B.A. and M.E. degrees from Liaoning Normal University, Dalian, China, in 2017 and 2020 respectively. He is currently pursuing the Ph.D. degree in computer science at Nanjing University of Aeronautics and Astronautics, Nanjing, China. He has published academic papers in top-tier venues including ACM Computing Surveys and IEEE Transactions on Pattern Analysis and Machine Intelligence. His research involves the general topics of invariance, robustness, and explainability in computer vision, with a focus on invariant representations, for closing today’s trustworthiness gap in artificial intelligence, e.g., forensic and security of visual data.
Yushu Zhang (Senior Member, IEEE) received the Ph.D. degree in computer science from Chongqing University, Chongqing, China, in 2014. He held various research positions with the City University of Hong Kong, Southwest University, University of Macau, and Deakin University. He is currently a Professor with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. His research interests include multimedia processing and security, artificial intelligence, and blockchain. Dr. Zhang is an Associate Editor of Signal Processing and Information Sciences.
Chao Wang received the B.S. and M.S. degrees from Liaoning Normal University, Dalian, China, in 2017 and 2020 respectively. She is currently pursuing the Ph.D. degree in computer science at Nanjing University of Aeronautics and Astronautics, Nanjing, China. Her research interests include trustworthy artificial intelligence, adversarial learning, and media forensics.
Zhihua Xia (Member, IEEE) received the Ph.D. degree in computer science from Hunan University, China, in 2011. He held various research positions with the Nanjing University of Information Science and Technology, New Jersey Institute of Technology, and Sungkyunkwan University. He is currently a Professor with the College of Cyber Security, Jinan University, China. His research interests include AI security, secure computation, and media forensics.
Xiaochun Cao (Senior Member, IEEE) received the B.E. and M.E. degrees in computer science from Beihang University, Beijing, China, in 1999 and 2002, respectively, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, FL, USA, in 2006. After graduation, he spent about three years at ObjectVideo Inc., as a Research Scientist. From 2008 to 2012, he was a Professor at Tianjin University, Tianjin, China. Before joining Sun Yat-sen University, Shenzhen, China, he was a Professor at the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. He is a Professor and the Dean with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University. He has published more than 200 journal and conference papers. Dr. Cao’s dissertation was nominated for the University Level Outstanding Dissertation Award. He was a recipient of the Piero Zamperoni Best Student Paper Award at the International Conference on Pattern Recognition, in 2004 and 2010; the Excellent Young Scientists Fund and Distinguished Young Scholars Fund of National Natural Science Foundation of China, in 2014 and 2020. He was on the Editorial Boards of IEEE Transactions on Circuits and Systems for Video Technology and IEEE Transactions on Multimedia. He is on the Editorial Boards of IEEE Transactions on Pattern Analysis and Machine Intelligence and IEEE Transactions on Image Processing.
Jian Weng (Senior Member, IEEE) received the Ph.D. degree in computer science from Shanghai Jiao Tong University, Shanghai, China, in 2008. From 2008 to 2010, he was a Post-doctoral Researcher with the Singapore Management University, Singapore. He is currently a Professor and the Vice President of Jinan University, Guangzhou, China. He has published more than 100 journal and conference papers. Dr. Weng served as the PC co-chair or a PC member for more than 50 international conferences. He was a recipient of the Innovation Award from the Chinese Association for Cryptologic Research in 2015 and the Distinguished Young Scholars Fund of National Natural Science Foundation of China in 2018. He is on the Editorial Board of IEEE Transactions on Vehicular Technology.