CILF-CIAE: CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation

Yuntao Shou, Tao Meng, Wei Ai, Nan Yin, Fuchen Zhang, Keqin Li
College of Computer and Information Engineering, Central South University of Forestry and Technology, Hunan, China; Department of Computer Science, State University of New York, New Paltz, New York 12561, USA; Mohamed bin Zayed University of Artificial Intelligence, UAE
Abstract

The age estimation task aims to predict the age of an individual by analyzing facial features in an image. Progress in age estimation can improve the efficiency and accuracy of applications such as age verification and secure access control. In recent years, contrastive language-image pre-training (CLIP) has been widely used in multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods incur high memory usage (quadratic complexity) when modeling images globally, and they lack an error feedback mechanism that informs the model about the quality of its age predictions. To tackle these issues, we propose a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first use the CLIP model to extract image features and text semantic information, and map them into a semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has log-linear complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Extensive experiments on multiple data sets show that CILF-CIAE achieves better age prediction results.

keywords:
Age estimation, CLIP, Error correction, Transformer, Fourier Transform.
journal: Information Fusion

1 Introduction

1.1 Motivation

The task of age estimation aims to determine a person's age from the facial features in an image. In recent years, owing to the massive increase in image data sets and the widespread application of deep learning (DL), age estimation methods have made important progress and attracted widespread research attention [1], [2], [3]. Furthermore, age estimation is widely used in many scenarios. For example, in finance and insurance, age estimation can help detect fraud in which age is falsely stated to obtain improper benefits [4], [5], [6].

Figure 1: Comparison of existing image processing paradigms with the paradigm proposed in this paper. As shown in Fig. 1(a), most image processing methods perform supervised learning by taking images as input and using manually annotated labels as supervision signals. As shown in Fig. 1(b), since manual annotation requires a large amount of resources, existing methods have begun to build self-supervised learning models by contrasting input images. As shown in Fig. 1(c), we perform text-image contrastive learning with the CLIP pre-trained model and transfer the learned knowledge to the age estimation task. As shown in Fig. 1(d) and (e), existing methods mainly rely on CNN architectures and attention-based Transformer architectures to extract image features. As shown in Fig. 1(f), we replace the attention module in the Transformer architecture with a Fourier prior module.

The current mainstream age estimation methods fall into three categories: CNNs [9], [10], attention networks [11], [12], and GCNs [13]. CNN-based age estimation algorithms are used to extract global and multi-scale information from images. For example, Rothe et al. [4] estimated an individual’s true age and apparent age from a single face image with a CNN-based method. Unlike many traditional machine learning methods [14], this method does not require facial landmark annotations and takes only the face image as input. However, CNN-based methods cannot capture the semantic features in images that are most relevant to age. To give higher weight to these features, attention networks began to be applied. For instance, Shen et al. [15] introduced an attention mechanism so that the model automatically focuses on the image regions that are relevant for age estimation, which improves the model’s perception of important age-related features. However, attention network-based methods cannot flexibly model irregular objects. To overcome this problem, Shou et al. [13] proposed a contrastive multi-view GCN for age estimation (CMGCN). CMGCN improves the feature representation of images by extending the image representation into a topological semantic space. However, as illustrated in Fig. 1(a) and (b), these age estimation algorithms mainly focus on supervised or self-supervised algorithm design [7], [8] and ignore the contrastive language-image pre-training (CLIP) paradigm. CLIP learns the correlation between images and text from a large number of image-text pairs through contrastive learning, so it can acquire prior information about faces and provide better generalization for downstream tasks. Furthermore, existing algorithms directly predict age and lack an error feedback mechanism, which may lead to a large error between the model’s predicted age and the true label. Therefore, it is necessary to take the CLIP multimodal learning paradigm and error-controllable generation as the starting point for model design.

To tackle the above problems, we propose a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE) to perform age estimation. CILF-CIAE mainly includes four modules: a CLIP-based visual and language feature encoder, FourierFormer-based feature fusion, age prediction, and an error-controllable generation module. Firstly, we use the image encoder and text encoder in CLIP to encode image and text features respectively and obtain the corresponding feature representations. We then jointly map the image and text representations into an $N$-dimensional feature space for contrastive learning to obtain aligned text and image semantic vectors, and use the resulting image semantic vectors to perform age estimation. Secondly, as shown in Fig. 1(d) and (e), previous architectures have clear limitations: CNN-based methods can only extract local information from the image and cannot easily use contextual prompt modules to enhance age estimation, while attention-based methods require large memory (quadratic complexity). We therefore introduce a Transformer architecture based on the Fourier transform to realize the spatial interaction and channel evolution of image features, and to fuse text and image feature information to improve age estimation performance. Specifically, we replace the attention module in the Transformer with a Fourier transform and feed image features into FourierFormer to achieve spatial interaction and channel evolution. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between modalities. Thirdly, we construct an age prediction loss and a text-image matching loss to optimize the parameters of the model. Finally, we build an error-correcting reversible age estimation module to ensure, in an end-to-end learning manner, that the predicted age lies within a high-confidence interval.

1.2 Our Contributions

Therefore, CLIP multimodal learning, spatial interaction of images, and channel evolution should be the core of age estimation algorithm design. Inspired by the above analysis, we propose a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE) to perform age estimation. The main contributions of this paper are summarized as follows:

  1. A novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation architecture, named CILF-CIAE, is presented. CILF-CIAE is able to learn information about age from input images.

  2. A new Transformer structure, FourierFormer, is designed. FourierFormer replaces the attention mechanism with the Fourier transform to realize the channel evolution and spatial interaction of image features.

  3. An efficient contrastive multimodal learning module is utilized to supervise the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between modalities.

  4. An end-to-end error feedback mechanism is proposed to ensure that the confidence of age estimation is within a credible range.

  5. Extensive experiments are conducted on four real data sets to verify the effectiveness of the proposed CILF-CIAE. Experimental results show that CILF-CIAE can achieve optimal age prediction.

2 Related work

2.1 Age Estimation

Traditional age estimation methods usually rely on hand-designed feature extraction and machine learning algorithms, and their performance is limited by feature selection [14], [16], [17]. With the popularity of the Internet and social media (e.g., Meta, Twitter, and YouTube), large-scale face image datasets have grown rapidly. This growth provides rich training data for deep learning (DL) and makes DL models more powerful. Age estimation has potential applications in social media analysis, ad targeting, security monitoring, medical image analysis, etc. For example, in security and legal applications, image age estimation can assist police in identifying possible underage criminal suspects.

Existing age estimation algorithms are mainly divided into two categories: algorithms based on machine learning and algorithms based on deep learning. Machine learning-based age estimation algorithms mainly rely on hand-designed rules to extract age-related features from images. Deep learning-based age estimation algorithms mainly use deep learning models (e.g., CNN, Transformer, and GCN) with powerful adaptive learning capabilities, together with massive data sets, to estimate age in an end-to-end manner.

Machine learning methods: Among the age estimation algorithms based on traditional machine learning, Shin et al. [18] proposed an ordinal regression algorithm based on moving window regression (MWR), which first ranks the input and reference labels and then designs global and local regressors to predict the global and local rankings. MWR achieves fine-grained age estimation by iteratively refining the ranking order. However, the computational complexity of MWR is relatively high. Cao et al. [19] proposed the consistent rank logits (CORAL) algorithm to solve the inconsistency problem of multiple binary ordinal regression algorithms. CORAL ensures ranking consistency by introducing confidence scores. Cao et al. [14] proposed the Ranking SVM (RSVM) algorithm for image age estimation. This algorithm estimates age by first grouping ages and then sorting them. RSVM can reduce the hypothesis space of model learning. Zhang et al. [20] achieved age estimation by learning the probability distribution of label information. This algorithm predicts age by calculating the posterior probability of the image. Other typical traditional machine learning algorithms include [21], [22].

Deep learning methods: Among the age estimation algorithms based on deep learning, CNNs [23], attention networks [11], and hybrid neural network systems [24] are currently common. For example, Levi et al. [23] proposed an age estimation algorithm based on a deep CNN to address the insufficient performance of traditional machine learning algorithms. DeepCNN can achieve better prediction results even on small data sets. Duan et al. [10] proposed the CNN2ELM algorithm to combine the advantages of CNNs and regression algorithms. CNN2ELM constructs three feature extraction networks for age, gender, and race, uses a fusion mechanism to combine the complementary information of the three networks, and uses an ELM for regression prediction of age. Wang et al. [11] proposed the Attention-based Dynamic Patch Fusion (ADPF) algorithm to address the problem that CNNs cannot extract the semantic information in the image that is most beneficial for age estimation. ADPF introduces an attention network and a fusion network to dynamically extract image patches with rich semantic features and adaptively fuse the extracted feature information. Zhang et al. [12] proposed a fine-grained attention LSTM algorithm to address the problem that existing methods focus only on the global information of the image and ignore its fine-grained features. This method first uses a residual network to extract the global information of the image, and then uses the attention LSTM to capture sensitive regions and obtain locally important semantic features. Xie et al. [24] integrated CNN’s feature extraction capabilities, domain generalization capabilities, and local information discrimination capabilities based on dictionary algorithms. 
This method first uses a pre-trained CNN to extract the feature representation of the image, and then builds a dictionary representation to extract the local feature information and Fisher vector representation of the image.

Figure 2: The overall framework for age prediction using CILF-CIAE. Specifically, we first use CLIP to extract image features and C-type text features, and then calculate the pixel-text similarity score. The similarity scores of the pixel-text pairs are fed into the age estimation module, and the age label is used as a supervision signal. To better utilize the prior knowledge of images, we introduce Fourierformer to extract contextual information in images to prompt the language model. Finally, we perform error optimization on the predicted age.

2.2 Contrastive Image-Language Pre-training

Owing to the powerful representation ability of the pre-trained visual-language model CLIP [25] in feature learning, it has been widely used in CV tasks. CLIP is trained with contrastive learning by maximizing the similarity between the embedding vectors of matching texts and images, which enables the model to find the most relevant text-image pairs in the embedding space and to achieve multimodal understanding of natural language and images. In the field of age estimation, we need a CLIP-based backbone network to directly perform inductive reasoning.

3 METHODOLOGY

3.1 The Design of the CILF-CIAE Structure

The CILF-CIAE architecture proposed in this paper is shown in Fig. 2; it contains an age prediction stage and an age error optimization stage. Specifically, we first use a CLIP-based age estimation model with a Fourier prior module to predict the age of images. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between modalities. Furthermore, if the error between the predicted and actual values exceeds a given threshold, the optimization branch is activated. The age errors are then used to train an ensemble error correction model that updates the predicted age $x^{\ast}$. This training process continues until $e(x^{\ast})\leq\epsilon$. The details of the CILF-CIAE architecture are described in the following subsections.
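The control flow of the error optimization stage can be illustrated with a minimal sketch. The function names `correct_until_tolerance`, `error_fn` and `correct_fn`, the toy halving corrector, and the iteration cap are all hypothetical placeholders; the ensemble error correction model itself is not specified at this point in the text, so the sketch only shows the "repeat until $e(x^{\ast})\leq\epsilon$" loop.

```python
def correct_until_tolerance(x0, error_fn, correct_fn, epsilon, max_iters=100):
    """Refine a predicted age until its error is within epsilon.

    error_fn and correct_fn are hypothetical stand-ins for the error
    measure e(x) and the (unspecified) error correction model.
    """
    x = x0
    for step in range(max_iters):
        if error_fn(x) <= epsilon:      # stopping rule e(x*) <= epsilon
            return x, step
        x = correct_fn(x)               # optimization branch is activated
    return x, max_iters


# Toy usage: the true age is 30 and the corrector halves the remaining error.
true_age = 30.0
refined, steps = correct_until_tolerance(
    x0=40.0,
    error_fn=lambda x: abs(x - true_age),
    correct_fn=lambda x: x - 0.5 * (x - true_age),
    epsilon=0.5,
)
```

The iteration cap is a safety assumption added for the sketch so that the loop terminates even if the corrector does not converge.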

3.1.1 Language-guided Visual Age Prediction

As shown in Fig. 2, we briefly introduce the CLIP-based visual language pre-training model for age estimation. CLIP consists of an image encoder and a text encoder [26]. The image encoder aims to extract the underlying features of an image and map them into a low-dimensional embedding space; its architecture usually uses ViT [27], which offers superior performance. The text encoder usually uses a Transformer [28] to generate text representations with rich semantic information. Given a text prompt, such as “A photo of a 12 year old person,” the text encoder first converts the text into a lowercase byte-pair encoded representation, which uniquely identifies each token. The beginning and end of each text sequence are marked by [SOS] and [EOS]. Afterwards, the text representation is mapped into a 512-dimensional feature space, and the text Transformer is used for sequence modeling. Then, given an image feature obtained by the image encoder, the cosine similarity function is used to calculate the similarity between the image and the text prompt. The similarity formula is defined as follows:

$$\mathbf{S}=\frac{\exp(sim(T_{i},I_{i})/\tau)}{\sum_{j=1}^{N}\exp(sim(T_{j},I_{i})/\tau)} \tag{1}$$

where $\mathbf{S}$ is the similarity matrix, $T_{i}$ is the feature vector of the $i$-th text sequence obtained by the text encoder, $I_{i}$ is the feature vector of the $i$-th image obtained by the image encoder, $N$ represents the total number of training samples, $sim(\cdot)$ represents cosine similarity, and $\tau$ represents the temperature attenuation coefficient.
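As an illustration of Eq. (1), the following NumPy sketch computes, for each image, a temperature-scaled softmax over the cosine similarities to all text features. The function name `text_probabilities` and the default temperature value 0.07 are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def text_probabilities(text_feats, image_feats, tau=0.07):
    """Softmax over texts of cosine similarity / tau, for each image.

    text_feats: [N_text, D], image_feats: [N_image, D].
    Returns P with P[j, i] = exp(sim(T_j, I_i)/tau) / sum_k exp(sim(T_k, I_i)/tau).
    tau=0.07 is an assumed value, not one stated in the text.
    """
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = (t @ v.T) / tau                      # cosine similarity / tau
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)       # normalize over texts
```

Each column of the returned matrix sums to one, so it can be read as a distribution over text prompts (for example, age prompts) for one image.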

To further narrow the semantic gap between image and text features, we design an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive image-text matching loss, thereby improving the interaction between modalities. The optimization goal for the image-text matching loss is defined as follows:

$$\begin{aligned} \mathcal{L}_{\text{text-image}} &= -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(sim(T_{i},I_{i})/\tau)}{\sum_{j=1}^{N}\exp(sim(T_{i},I_{j})/\tau)} \\ &\quad + \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j\neq i}^{N}\log\frac{\exp(S(T_{i},I_{j})/\tau)}{\sum_{k=1}^{N}\exp(S(T_{i},I_{k})/\tau)} \end{aligned} \tag{2}$$

where $N$ is the number of training samples.
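A minimal NumPy sketch of Eq. (2) is given below. It assumes that $S(\cdot,\cdot)$ in the second term denotes the same cosine similarity as $sim(\cdot,\cdot)$, which the text does not state explicitly; the function name `text_image_loss` and the default temperature are likewise assumptions for illustration.

```python
import numpy as np


def text_image_loss(text_feats, image_feats, tau=0.07):
    """Two-term text-image matching loss following Eq. (2).

    Assumes S(.,.) equals the cosine similarity sim(.,.) and N >= 2.
    Returns (total, matched_term, mismatched_term).
    """
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = (t @ v.T) / tau                       # logits[i, j] = sim(T_i, I_j)/tau
    logits = logits - logits.max(axis=1, keepdims=True)
    # log-softmax over images j for every text i
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = log_prob.shape[0]
    diag_sum = np.trace(log_prob)
    matched = -diag_sum / n                        # first term: matched pairs
    mismatched = (log_prob.sum() - diag_sum) / (n * (n - 1))  # second term: j != i
    return matched + mismatched, matched, mismatched
```

Minimizing the first term raises the probability of matched pairs, while minimizing the second term lowers the log-probability assigned to mismatched pairs.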

3.1.2 Context-Aware Prompting

Previous work has demonstrated that feature alignment of visual and language modalities can significantly improve the performance of CLIP models on downstream tasks [29], [30]. Therefore, we consider whether we can design a customized context-aware prompting method to improve text features.

Vision-to-language prompting. The textual features that fuse visual global context information can make age estimation predictions more accurate. For example, “a photo of a 68-year-old man with gray hair” is a more accurate prediction than “a photo of a 68-year-old man.” Therefore, we design a customized Fourier prior module to utilize visual global context information to improve text features in fine granularity. Specifically, we use the Fourierformer decoder to realize image spatial information interaction and channel evolution, and model the interaction between vision and language.

Figure 3: The overall framework of the proposed FourierFormer. FourierFormer includes a spatial interaction module, a channel evolution module, a discrete Fourier transform (DFT) module and an inverse discrete Fourier transform (IDFT) module, which can effectively extract information from the global context of an image.

3.1.3 Fourier Prior Embedded Block

The Fourier transform is used for frequency domain filtering, compression and feature extraction in image processing [31]. By converting the image to the frequency domain, patterns and structures in the image can be more easily identified. For a given image $x\in\mathbb{R}^{H\times W\times C}$, the Fourier transform is applied to each image channel separately and transforms it into the frequency domain as complex components $\mathcal{F}(x)$. The formula is defined as:

$$\mathcal{F}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v})=\frac{1}{\sqrt{H\times W}}\sum_{\boldsymbol{h}=0}^{H-1}\sum_{\boldsymbol{w}=0}^{W-1}\boldsymbol{x}(\boldsymbol{h},\boldsymbol{w})\,e^{-j2\pi\left(\frac{\boldsymbol{h}}{H}\boldsymbol{u}+\frac{\boldsymbol{w}}{W}\boldsymbol{v}\right)} \tag{3}$$

where $u$ and $v$ represent the horizontal and vertical coordinates of the Fourier domain. The phase component $\mathcal{P}(x)(u,v)$ and the amplitude component $\mathcal{A}(x)(u,v)$ are obtained as follows:

$$\begin{aligned} \mathcal{A}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v}) &= \sqrt{\mathcal{R}^{2}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v})+\mathcal{I}^{2}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v})}, \\ \mathcal{P}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v}) &= \arctan\left[\frac{\mathcal{I}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v})}{\mathcal{R}(\boldsymbol{x})(\boldsymbol{u},\boldsymbol{v})}\right], \end{aligned} \tag{4}$$

where $\mathcal{I}(x)(u,v)$ and $\mathcal{R}(x)(u,v)$ represent the imaginary and real parts of $\mathcal{F}(x)(u,v)$, respectively.
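Eqs. (3) and (4) can be reproduced with NumPy's FFT as a short sketch. The function name `amplitude_and_phase` is an assumption for this example; `norm="ortho"` is used because it applies the $1/\sqrt{H\times W}$ scaling of Eq. (3), and `np.arctan2` is used as a quadrant-aware form of the arctangent in Eq. (4).

```python
import numpy as np


def amplitude_and_phase(x):
    """Per-channel 2D DFT of an [H, W, C] image, split into amplitude and phase."""
    spec = np.fft.fft2(x, axes=(0, 1), norm="ortho")   # Eq. (3), 1/sqrt(H*W) scaling
    amplitude = np.sqrt(spec.real ** 2 + spec.imag ** 2)  # Eq. (4), first line
    phase = np.arctan2(spec.imag, spec.real)              # Eq. (4), second line
    return amplitude, phase
```

The original image can be recovered from the two components by applying the inverse transform to `amplitude * np.exp(1j * phase)` with the same `norm="ortho"` setting.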

Structure Flow. The main goal of the Fourier prior module is to provide an effective and efficient paradigm for modeling global image context and to improve the representation ability of text features, as shown in Fig. 3. For a given image $x\in\mathbb{R}^{H\times W\times C_{in}}$, we first use the CLIP-based image encoder to extract the shallow features of the image $X_{0}\in\mathbb{R}^{H\times W\times C}$. The shallow features are then encoded by $N$ stacked image encoders. The FourierFormer module designed in this paper consists of a stack of spatial interaction modules, channel evolution modules, residual and layer normalization modules, and Fourier prior modules. Similarly, for the image decoder, we use a stack of the proposed core modules for image feature decoding.

As shown in Fig. 4, the core module of FourierFormer consists of two parts, spatial interaction and channel evolution, which are implemented by depth-wise convolution and $1\times 1$ convolution, respectively, combined with DFT and IDFT.

Figure 4: Details of the Fourier Prior Embedding module (FPE). FPE follows the global context information modeling idea of spatial interaction and channel evolution.

Fourier Spatial Interaction. Fourier spatial interaction first takes the image feature maps obtained by the image encoder as the input of FourierFormer, and then applies the DFT to convert them into a frequency-domain feature representation. Assuming that the features are expressed as $X\in\mathbb{R}^{H\times W\times C}$, the corresponding DFT formula is defined as:

$$\mathbf{X}_{\mathcal{I}}^{(\boldsymbol{c})},\mathbf{X}_{\mathcal{R}}^{(\boldsymbol{c})}=\mathcal{F}(\mathbf{X}^{(\boldsymbol{c})}) \tag{5}$$

where $c=1,\dots,C$, and $\mathbf{X}_{\mathcal{I}}$ and $\mathbf{X}_{\mathcal{R}}$ represent the imaginary and real parts in the Fourier space. We then perform Fourier spatial interaction to filter and compress the frequency domain signal of the image through a depth-wise convolution (DWconv) operation with a LeakyReLU activation function. The spatial interaction process of images can be defined as:

$$\begin{aligned} \mathbf{S}_{\mathcal{I}}^{(\boldsymbol{b})} &= LeakyReLU\left(DWconv^{(\boldsymbol{b})}(\mathbf{X}_{\mathcal{I}}^{(\boldsymbol{b})})\right) \\ \mathbf{S}_{\mathcal{R}}^{(\boldsymbol{b})} &= LeakyReLU\left(DWconv^{(\boldsymbol{b})}(\mathbf{X}_{\mathcal{R}}^{(\boldsymbol{b})})\right) \end{aligned} \tag{6}$$

Then we apply the inverse DFT to the learned $\mathbf{S}_{\mathcal{I}}$ and $\mathbf{S}_{\mathcal{R}}$ with low-frequency signals to transform them back into the spatial domain. The conversion of $\mathbf{S}_{\mathcal{I}}$ and $\mathbf{S}_{\mathcal{R}}$ from the frequency domain back to the spatial domain is defined as follows:

$$\mathbf{X_{S}^{b}}=\mathcal{F}^{-1}\left(\mathbf{S}_{\mathcal{I}}^{(\boldsymbol{b})},\mathbf{S}_{\mathcal{R}}^{(\boldsymbol{b})}\right) \tag{7}$$
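As a rough illustration of Eqs. 6 and 7, the following NumPy sketch filters the real and imaginary parts of a 2-D spectrum with separate depthwise kernels and maps the result back to the spatial domain. The 3×3 kernel size, the zero padding, the LeakyReLU slope, the function names, and keeping only the real part of the inverse transform are assumptions made for this example; the text above does not specify them.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """Elementwise LeakyReLU (the slope value is an assumption)."""
    return np.where(x > 0, x, slope * x)


def depthwise_conv3x3(x, kernels):
    """Zero-padded depthwise filtering (cross-correlation, as in deep learning).

    x: (C, H, W) real array; kernels: (C, 3, 3), one kernel per channel.
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernels[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out


def fourier_spatial_interaction(x, k_real, k_imag, slope=0.01):
    """Hypothetical sketch of Eqs. 6-7 for a feature map x of shape (C, H, W)."""
    spec = np.fft.fft2(x, axes=(-2, -1))                              # to frequency domain
    s_real = leaky_relu(depthwise_conv3x3(spec.real, k_real), slope)  # S_R in Eq. 6
    s_imag = leaky_relu(depthwise_conv3x3(spec.imag, k_imag), slope)  # S_I in Eq. 6
    # Eq. 7: recombine both parts and return to the spatial domain.
    # Keeping only the real part is an assumption of this sketch.
    return np.fft.ifft2(s_real + 1j * s_imag, axes=(-2, -1)).real
```

With identity kernels (a single 1 at the kernel center) and a slope of 1, the block reduces to a forward and inverse DFT and returns its input, which is a convenient sanity check.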

The convolution theorem in Fourier theory states that the convolution of two signals in the spatial (time) domain is equivalent to the pointwise product of their spectra in the frequency domain, and vice versa. Because every frequency component depends on all spatial positions, an operation applied to the spectrum acts on the overall frequency composition of the image. The theorem also provides an efficient way to process signals, because pointwise operations in the frequency domain are generally cheaper than the equivalent large-kernel convolutions in the spatial domain. Therefore, we concatenate the $\mathbf{X_{S}^{b}}$ obtained from the inverse Fourier transform in Eq. 7 and normalize the result to obtain the output $S_{X}$ of the Fourier spatial interaction.
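The convolution theorem referred to above can be checked numerically. The sketch below compares a direct circular convolution, which costs O(n²), with the FFT route, which costs O(n log n); the function names are hypothetical and the example is independent of the model itself.

```python
import numpy as np


def circular_convolve_direct(a, b):
    """Circular convolution computed from its definition (O(n^2))."""
    n = len(a)
    return np.array(
        [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]
    )


def circular_convolve_fft(a, b):
    """Same result via the convolution theorem (O(n log n))."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real


a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([0.0, 1.0, 0.5, 0.0])
direct = circular_convolve_direct(a, b)
via_fft = circular_convolve_fft(a, b)
```

Both routes agree up to floating-point error, which is the property that lets a frequency-domain block mix information across all spatial positions at log-linear cost.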

Fourier Channel Evolution. Fourier channel evolution performs channel-wise evolution: the output $S_{X}$ of the Fourier spatial interaction is decomposed into its real and imaginary parts, $\mathbf{C}_{\mathcal{R}}$ and $\mathbf{C}_{\mathcal{I}}$, and a $1\times 1$ convolution operator is applied to each part. The Fourier channel evolution formula can be defined as:

$$\begin{aligned} &\mathbf{C}\mathbf{X}_{\mathcal{I}}=LeakyReLU\left(\mathbf{conv}\left(cat[\mathbf{C}_{\mathcal{I}}^{1},\ldots,\mathbf{C}_{\mathcal{I}}^{c}]\right)\right)\\ &\mathbf{C}\mathbf{X}_{\mathcal{R}}=LeakyReLU\left(\mathbf{conv}\left(cat[\mathbf{C}_{\mathcal{R}}^{1},\ldots,\mathbf{C}_{\mathcal{R}}^{c}]\right)\right) \end{aligned} \tag{8}$$

where $cat(\cdot)$ is the concatenation operation. Then we perform the IDFT to convert $\mathbf{C}\mathbf{X}_{\mathcal{R}}$ and $\mathbf{C}\mathbf{X}_{\mathcal{I}}$ back to the time domain as follows:

$$\mathbf{C_{S}^{b}}=\mathcal{F}^{-1}\left(\mathbf{C}\mathbf{X}_{\mathcal{I}}^{(\boldsymbol{b})},\mathbf{C}\mathbf{X}_{\mathcal{R}}^{(\boldsymbol{b})}\right) \tag{9}$$
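Eqs. 8 and 9 can be sketched in the same style. In this minimal NumPy example the 1×1 convolution is written as a channel-mixing matrix product; applying a DFT first to obtain the real and imaginary parts, the LeakyReLU slope, the function names, and keeping only the real part of the inverse transform are assumptions of the sketch rather than details given in the text.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """Elementwise LeakyReLU (the slope value is an assumption)."""
    return np.where(x > 0, x, slope * x)


def conv1x1(x, weight):
    """Pointwise (1x1) convolution: mixes channels at every position.

    x: (C_in, H, W); weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, x)


def fourier_channel_evolution(s_x, w_real, w_imag, slope=0.01):
    """Hypothetical sketch of Eqs. 8-9 for an input s_x of shape (C, H, W)."""
    spec = np.fft.fft2(s_x, axes=(-2, -1))                      # real/imaginary parts
    cx_real = leaky_relu(conv1x1(spec.real, w_real), slope)     # CX_R in Eq. 8
    cx_imag = leaky_relu(conv1x1(spec.imag, w_imag), slope)     # CX_I in Eq. 8
    # Eq. 9: inverse DFT back to the spatial (time) domain.
    return np.fft.ifft2(cx_real + 1j * cx_imag, axes=(-2, -1)).real
```

Because the 1×1 operator only mixes channels, each frequency bin is processed independently, which complements the spatial mixing performed by the depthwise step.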
Figure 5: The flowchart of the correcting inverse age estimation. Existing age estimation models give a first age estimate, which is assessed by the evaluations $E_{P}$. If the estimate fails this assessment, the optimization branch is activated. The age estimation error produced by the ensemble error model is used during training to update the predicted age $x^{\ast}$. The process terminates once $e(x^{\ast})\leq\epsilon$.

3.1.4 Two-stage Error Selection

As shown in Fig. 5, we first use a CLIP-based learning model to predict age. If the error exceeds the threshold, an optimization branch is used to optimize the error and give a predicted age with high confidence.

For a given observation $y$, we use multiple models and metrics to evaluate the predicted age, resulting in an $h$-dimensional error vector, expressed as:

$$\mathbf{e}(\mathbf{x},\mathbf{y})=\left[E_{1}(\mathbf{x},\mathbf{y}),E_{2}(\mathbf{x},\mathbf{y}),\ldots,E_{h}(\mathbf{x},\mathbf{y})\right] \tag{10}$$

where $E_{i}(\cdot,\cdot)$ represents the error estimate calculated by the $i$-th model, and $\mathbf{x}$ is the input image.

Each associated age estimate obtained from an observation $y$ follows the i.i.d. criterion, so $y$ is treated as a constant. Therefore, we can simplify Eq. 10 and obtain optimal model parameters by minimizing the error $e(x)$:

$$\min_{\mathbf{x}\in\mathcal{X}}e(\mathbf{x})=\sum_{i=1}^{h}w_{i}E_{i}(\mathbf{x}) \tag{11}$$

where $w_{i}$ is a learnable weight determined using a voting mechanism.

Because ensemble learning [32] enables a more robust representation of the hypothesis space, we integrate multiple neural networks to estimate implicit errors. Each neural network implements a mapping function $\phi(x,w):\mathcal{R}^{D}\times\mathcal{R}^{|\mathbf{w}|}\to\mathcal{R}^{k}$ that predicts the error. We train $L$ regressors with the same network architecture and use a voting algorithm to obtain the final prediction. Therefore, for a given input state $x$, the implicit error $\hat{\mathbf{e}}$ is estimated by the ensemble network as follows:

$$\hat{\mathbf{e}}\left(\mathbf{x},\{\mathbf{w}_{i}\}_{i=1}^{L}\right)=\frac{1}{L}\sum_{i=1}^{L}\boldsymbol{\phi}\left(\mathbf{x},\mathbf{w}_{i}\right) \tag{12}$$

where $\mathbf{w}_{i}$ denotes the learnable parameters of the $i$-th network.

According to Eq. 12, we can obtain the cumulative age estimation error as follows:

$$\begin{aligned} \hat{e}\left(\mathbf{x},\{\mathbf{w}_{i}\}_{i=1}^{L}\right)&=\underbrace{\sum_{j=1}^{k}w_{j}\left(\frac{1}{L}\sum_{i=1}^{L}\boldsymbol{\phi}_{j}\left(\mathbf{x},\mathbf{w}_{i}\right)\right)}_{\text{approximated implicit error}}\\ &\quad+\underbrace{\sum_{j=k+1}^{h}w_{j}E_{j}(\mathbf{x})}_{\text{true explicit error}} \end{aligned} \tag{13}$$

We divide the error in Eq. 13 into two parts: the estimated implicit error and the true explicit error. The estimated implicit error is produced by the ensemble regressor, which learns from the feature representation of the image encoder, and the true explicit error is produced by our CLIP-based age estimation model. At the same time, we optimize the network parameters of the ensemble regressor by minimizing the distance between the estimated implicit error and the true explicit error. The optimization goal is defined as follows:
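To make the two terms of Eq. 13 concrete, the following Python sketch combines the ensemble-averaged implicit errors with the explicit errors using one weight vector. The function name, the array layout, and the numbers in the usage example are hypothetical and chosen only for illustration.

```python
import numpy as np


def cumulative_error(member_predictions, explicit_errors, weights):
    """Weighted sum of implicit and explicit errors, following Eq. 13.

    member_predictions: (L, k) outputs of the L ensemble regressors.
    explicit_errors:    (h - k,) directly computed error terms.
    weights:            (h,) weights w_j; the first k apply to the implicit part.
    """
    member_predictions = np.asarray(member_predictions, dtype=float)
    explicit_errors = np.asarray(explicit_errors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    k = member_predictions.shape[1]
    if weights.shape[0] != k + explicit_errors.shape[0]:
        raise ValueError("weights must have one entry per error term")
    implicit = member_predictions.mean(axis=0)  # ensemble average, Eq. 12
    return float(weights[:k] @ implicit + weights[k:] @ explicit_errors)


# Two regressors (L = 2), two implicit terms (k = 2), one explicit term.
total = cumulative_error(
    member_predictions=[[1.0, 2.0], [3.0, 4.0]],
    explicit_errors=[0.5],
    weights=[0.5, 0.25, 0.25],
)
```

In this toy case the implicit part averages to (2.0, 3.0), so the total is 0.5·2.0 + 0.25·3.0 + 0.25·0.5 = 1.875.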

$$\min_{\mathbf{w}_{i}}\mathbb{E}_{(\mathbf{x},\mathbf{e})\sim D}\left[\mathrm{dist}\left(\boldsymbol{\phi}\left(\mathbf{x},\mathbf{w}_{i}\right),\mathbf{e}_{1:k}\right)\right] \tag{14}$$

where $\mathrm{dist}\left(\boldsymbol{\phi}\left(\mathbf{x},\mathbf{w}_{i}\right),\mathbf{e}_{1:k}\right)=\left\|\boldsymbol{\phi}\left(\mathbf{x},\mathbf{w}_{i}\right)-\mathbf{e}_{1:k}\right\|_{2}^{2}$ is the squared Euclidean distance.
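The objective in Eq. 14 and the averaging in Eq. 12 can be illustrated with a deliberately simple stand-in. The sketch below replaces each neural regressor with a linear least-squares model and trains the members on bootstrap resamples; both choices, as well as all function names, are assumptions made to keep the example short and are not the networks used in the paper.

```python
import numpy as np


def fit_regressor(features, targets):
    """Fit one member by minimising the empirical squared L2 distance (Eq. 14).

    A linear model stands in for the neural network phi(x, w).
    """
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w  # shape (D, k)


def fit_ensemble(features, targets, n_members=5, seed=0):
    """Train L members; bootstrap resampling is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)
        members.append(fit_regressor(features[idx], targets[idx]))
    return members


def ensemble_error(x, members):
    """Eq. 12: average the member predictions of the implicit error."""
    return np.mean([x @ w for w in members], axis=0)
```

On noiseless linear data every member recovers the same mapping, so the ensemble average reproduces the target errors; with noisy data the average reduces the variance of individual members.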

To achieve controllable generation of predicted states, we use the feature representation decoded by the image encoder as the input of the ensemble regressor to learn and sample candidate predicted ages. Therefore, the update target of network parameters is defined as follows:

$$\boldsymbol{\theta}^{(t)}=\arg\min_{\boldsymbol{\theta}\in\mathcal{R}^{d}}\mathbb{E}_{\mathbf{z}}\left[\hat{e}\left((\mathbf{z},\boldsymbol{\theta}),\left\{\mathbf{w}_{i}^{(t-1)}\right\}_{i=1}^{L}\right)\right] \tag{15}$$

where $\mathbf{z}$ denotes the latent vectors. Finally, among the candidate age estimation states generated by the ensemble regressor with the trained network parameters $\boldsymbol{\theta}^{(t)}$, we select the final prediction result as follows:

$$\mathbf{x}_{\Pi}^{(t)}=\arg\min_{\mathbf{x}\sim p\left(\mathbf{x}|\boldsymbol{\theta}^{(t)}\right)}\hat{e}\left(\mathbf{x},\left\{\mathbf{w}_{i}^{(t-1)}\right\}_{i=1}^{L}\right) \tag{16}$$

The age estimation error is calculated via Eq. 10. If the calculated error is less than the feasibility threshold, i.e., $\hat{e}(t)\leq\epsilon$, the selected age estimation state is considered acceptable and the predicted value is returned. Otherwise, the error is used to optimize the ensemble regressor model in the next iteration of parameter updates.
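The accept-or-refine loop described in this subsection can be sketched as follows. This is a simplified stand-in, not the authors' procedure: the learned candidate generator $p(\mathbf{x}|\boldsymbol{\theta}^{(t)})$ is replaced by Gaussian sampling around the current estimate, and `estimate_error`, the step size, the candidate count, and the iteration cap are hypothetical.

```python
import numpy as np


def correct_age(initial_age, estimate_error, epsilon=1.0, n_candidates=32,
                step=2.0, max_iters=50, seed=0):
    """Refine an age estimate until its estimated error is at most epsilon.

    estimate_error: callable mapping a candidate age to a scalar error
    (a placeholder for the ensemble error model).
    """
    rng = np.random.default_rng(seed)
    age = float(initial_age)
    err = float(estimate_error(age))
    for _ in range(max_iters):
        if err <= epsilon:  # feasibility check: accept the current estimate
            break
        # Sample candidate ages (stand-in for sampling from p(x | theta)).
        candidates = age + step * rng.standard_normal(n_candidates)
        errors = np.array([estimate_error(c) for c in candidates])
        best = int(np.argmin(errors))  # selection step, as in Eq. 16
        if errors[best] < err:         # keep only improving candidates
            age, err = float(candidates[best]), float(errors[best])
    return age, err


# Toy error model whose minimum lies at age 30 (purely illustrative).
age, err = correct_age(45.0, lambda a: abs(a - 30.0))
```

The loop never accepts a worse candidate, so the estimated error is non-increasing across iterations, and the iteration cap bounds the cost when the threshold cannot be reached.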

3.2 Model Training

Mean Absolute Error (MAE) is a commonly used performance evaluation metric in regression problems, which measures the mean absolute difference between model predictions and actual observations. The Loss is defined as follows:

$$L^{k}(\theta)=|y^{k}-\hat{y}^{k}| \tag{17}$$

where $\theta$ denotes the learnable parameters of the network, and $k$ indexes the $k$-th training sample.

The optimization goals of the model are as follows:

$$\min_{\theta}\sum_{k=1}^{N}L^{k}(\theta) \tag{18}$$

where $N$ represents the total number of samples.

4 EXPERIMENTS

4.1 Benchmark Dataset Used

In this paper, we use six benchmark datasets, MORPH-II (http://www.faceaginggroup.com/morph/), FG-Net (http://yanweifu.github.io/FG_NET_data/FGNET.zip), CACD (http://bcsiriuschen.github.io/CARC/), Adience (http://www.openu.ac.il/home/hassner/Adience/data.html), FACES (http://faces.mpib-berlin.mpg.de), and SC-FACE (https://www.scface.org/), to conduct our age estimation experiments and verify the effectiveness of our CILF-CIAE method.

MORPH-II. The MORPH-II dataset is widely used in facial image research (e.g., age estimation and facial recognition). The MORPH-II dataset contains 55,000 facial photos of 13,000 volunteers over a period of time. The MORPH-II dataset covers facial images of volunteers from different ethnicities, different genders, and different geographical regions from 1 to 80 years old.

Existing methods employ three different experimental settings on the MORPH-II dataset. The first setting (S1) selects 5,492 images of white subjects from the original dataset (80% of the images for training and 20% for testing) and performs 5-fold cross-validation to reduce cross-race effects [4], [33]. The second setting (S2) randomly splits all images into training and test sets (80%/20%) and performs 5-fold cross-validation [34]. The third setting (S3) randomly selects 21,000 images from MORPH-II and restricts the black-to-white race ratio to 1:1 and the female-to-male ratio to 1:3 [5].
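The 80/20 protocol with 5-fold cross-validation used in settings S1 and S2 can be sketched as below. The function name, the random seed, and the use of plain index arrays are assumptions for the example; the cited works may build their folds differently (for instance, by splitting on subject identity).

```python
import numpy as np


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 20%."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != i]
        )
        yield train_idx, test_idx


# S1 uses 5,492 images, so each test fold contains 1,098 or 1,099 images.
splits = list(five_fold_splits(5492))
```

Every image appears in exactly one test fold, so the MAE averaged over the five folds uses each image once for evaluation.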

FGNET. The FGNET dataset is composed of facial photos provided by volunteers from the age range of 0 to 69 years old. The FGNET dataset contains facial images of volunteers from different genders, different races, and different geographical areas. The FGNET dataset is mainly used to evaluate and improve the performance of facial age estimation algorithms.

CACD. CACD is also a dataset for facial age estimation, which mainly contains publicly available facial images of famous celebrities from social media (e.g., movies, TV, music). The CACD dataset contains more than 163,000 facial images of people from teenagers to older adults. The CACD dataset includes images of celebrities from different countries and different professions.

Adience. The Adience benchmark is an unconstrained dataset, i.e., there are no restrictions on gestures and photo poses. The face images in the Adience dataset are captured by mobile phone devices. Because these images are not subject to artificial data preprocessing and noisy image filtering, they can greatly reflect real-world challenges. The Adience dataset consists of 19,487 images, in which the numbers of males and females are 8,192 and 11,295 respectively.

FACES. The FACES face image dataset is a dataset used in psychology and neuroscience research, especially in studying age. This dataset was created by Ebner et al. in 2010 to provide a high-quality, diverse set of face images. The FACES dataset contains face photos of men and women ranging in age from 20 to 80 years old. The images show different emotional expressions such as happy, sad, angry and neutral expressions.

SC-FACE. SC-FACE (Surveillance Cameras Face Database) is a face image data set specially used for facial recognition research, especially facial recognition in surveillance environments. The dataset includes hundreds of images of subjects with facial expressions under different lighting conditions and backgrounds.

4.2 Evaluation Metrics

1) Mean Absolute Error (MAE): The MAE value reflects the absolute error between the true value of the sample and the predicted value of the model. In age estimation, MAE is more suitable as a model evaluation metric than MSE. The formula of MAE is defined as follows:

$$MAE=\frac{1}{N}\sum_{i=1}^{N}|\hat{y}_{i}-y_{i}| \tag{19}$$

where $\hat{y}_{i}$ represents the predicted value of the model, $y_{i}$ represents the true value, and $N$ represents the number of samples.

2) Cumulative Score (CS): CS measures the percentage of face images whose absolute prediction error does not exceed $L$ years. The formula for CS is defined as follows:

$$CS(L)=(e_{\ell\leq L}/N)\times 100\% \tag{20}$$

where $e_{\ell\leq L}$ represents the number of samples for which the absolute error $\ell$ of the model does not exceed $L$.
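Both metrics are straightforward to compute. The following Python sketch implements Eqs. 19 and 20 directly; the function names and the three-sample example are hypothetical.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """MAE in years, as in Eq. 19."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


def cumulative_score(y_true, y_pred, max_error):
    """CS(L): percentage of samples with absolute error <= L, as in Eq. 20."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_pred - y_true)
    return float(np.mean(abs_err <= max_error) * 100.0)


y_true = [20, 30, 40]
y_pred = [22, 30, 35]
mae = mean_absolute_error(y_true, y_pred)      # (2 + 0 + 5) / 3
cs_2 = cumulative_score(y_true, y_pred, 2)     # two of three samples within 2 years
cs_5 = cumulative_score(y_true, y_pred, 5)     # all samples within 5 years
```

Lower MAE is better, whereas higher CS is better, so the two metrics should be read in opposite directions when comparing methods.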

4.3 Baseline Models

PML [8]: Deng et al. proposed a progressive margin loss (PML) method to adaptively learn the distribution pattern of age labels. The PML method fully considers the inter-class and intra-class age distribution differences, and can effectively alleviate the long-tail distribution problem of data.

Ranking-CNN [35]: Chen et al. designed a novel Ranking-CNN architecture for age estimation. Ranking-CNN uses CNN to rank age labels and then perform high-level feature extraction. Ranking-CNN theoretically proves that the error comes from the maximum error in the ranked labels.

DLDL [34]: The deep label distribution learning (DLDL) method proposed by Gao et al. can adaptively learn the characteristics of label ambiguity. DLDL discretizes the age labels and uses CNN to minimize the KL divergence between the predicted distribution and the true distribution to optimize the model parameters.

Figure 6: We tested the performance of our proposed method CILF-CIAE and some comparative methods on two evaluation metrics (i.e., MAE and CS) on six data sets and obtained corresponding experimental results.
Figure 7: To explore the sensitivity of different models to parameters, we tested the impact of different feature embedding dimensions on MAE on six data sets.

CSOHR [36]: Chang et al. proposed a method that combines a hyperplane ranking algorithm with a cost-sensitive loss for age estimation. CSOHR performs feature extraction on images with relative order information and introduces cost-sensitive losses to improve prediction accuracy.

DEX [4]: The DEX proposed by Rothe et al. uses the VGG-16 architecture pre-trained on ImageNet for age estimation. DEX uses a deep CNN to align faces and age expectations to optimize model parameters.

CNN+ELM [10]: Duan et al. proposed a CNN and extreme learning machine (ELM) algorithm CNN2ELM for age estimation. CNN2ELM built three CNN networks to extract features and perform information fusion for Age, Gender and Race respectively, and then used ELM for the final age regression prediction.

DRF [1]: Shen et al. designed the deep regression forest (DRF) for age estimation, which is continuously differentiable. DRF adaptively learns non-uniform age distribution data by jointly learning CNNs and random forests.

VDAL [2]: Liu et al. proposed a similarity-aware deep adversarial learning (SADAL) method for age estimation. SADAL enhances the model’s ability to learn facial age features through adversarial learning of positive and negative samples. In addition, SADAL designed a similarity-aware function to measure the distance between positive and negative samples to guide the optimization direction of the model.

DHR [37]: Tan et al. proposed a deep hybrid alignment architecture for age estimation, which captures image age features with complementary semantic information through joint learning of global and local branches. Furthermore, in each branch network, a fusion mechanism is used to explore the correlation between sub-networks.

DCT [7]: Bao et al. designed a divergence-driven consistency training mechanism to improve the quasi-efficiency of age estimation. DCT introduces an efficient sample selection strategy to select valid samples from unlabeled samples. Furthermore, DCT also introduces an identity consistency criterion to optimize the dependence between image features and age.

Figure 8: To explore the sensitivity of different models to parameters, we tested the impact of different feature embedding dimensions on CS on six data sets.

4.4 Implementation Details

We adopt CLIP’s pretrained image encoder as the backbone and directly integrate our designed Fourierformer as the decoder. In terms of language domain prompts, we choose a context length of 9. The Transformer decoder used to extract visual context consists of 6 layers. To reduce computational cost, we project image embeddings and text embeddings to 512 dimensions before the Transformer module. In terms of model fine-tuning, we observe that fine-tuning directly using the CLIP model does not produce satisfactory results. Therefore, we made a key modification: using AdamW as the optimizer for model training instead of the default SGD, which helps to improve the effectiveness of the training process and improve the final prediction performance.

5 RESULTS AND DISCUSSION

In this section, we discuss the experimental results of our method CILF-CIAE and other comparative methods on six data sets.

5.1 Comparison with Baseline Methods

To verify the performance of our proposed CILF-CIAE, we evaluated it on six real-world data sets and compared it with the baseline methods. The experimental results are shown in Fig. 6. CILF-CIAE achieves better MAE and CS values than the other methods on all six data sets. Specifically, under the three MORPH-II settings (MORPH-S1, MORPH-S2, and MORPH-S3), the MAE values of CILF-CIAE are 1.74, 1.68, and 1.81, and the CS values are 95.1%, 95.7%, and 94.3%, respectively; all comparison algorithms obtain worse MAE and CS values. CILF-CIAE also outperforms the comparison algorithms on the other data sets, which indicates the robustness of the method.

Overall, the feature learning ability of CILF-CIAE is better than that of the comparison algorithms on all evaluated data sets. The performance improvement can be attributed to the high-quality text-image alignment provided by the large CLIP model: image representations guided by language prompts can greatly improve the ability to represent image features. At the same time, we introduce a context awareness module (i.e., FourierFormer) that acts back on the language prompts to improve the expression of text semantic information. Unlike the traditional Vision Transformer architecture, FourierFormer models the global information of the image by introducing Fourier transform operations to achieve spatial interaction and channel evolution of image features. In addition, we introduce an error correction mechanism. When the age predicted by the CLIP-based age estimation model differs greatly from the actual age, the model activates the optimization branch and optimizes the error until $e(x)\leq\epsilon$ is reached.

Table 1: We perform ablation experiments to explore the impact of the three modules of spatial interaction, channel evolution, and error correction on age estimation performance respectively. We use six datasets to compare experimental results, and the MAE value is chosen as our evaluation metric.
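| Spatial interaction | Channel evolution | Error correction | MORPH-S1 | MORPH-S2 | MORPH-S3 | FGNET | CACD | Adience | FACES | SC-FACE |
|---|---|---|---|---|---|---|---|---|---|---|
|  |  |  | 2.71 | 2.46 | 2.84 | 2.69 | 3.31 | 0.52 | 3.01 | 3.43 |
|  |  |  | 2.63 | 2.31 | 2.69 | 2.62 | 3.24 | 0.47 | 2.86 | 3.35 |
|  |  |  | 2.65 | 2.34 | 2.69 | 2.67 | 3.26 | 0.48 | 2.83 | 3.36 |
|  |  |  | 2.44 | 2.17 | 2.48 | 2.41 | 3.13 | 0.44 | 2.67 | 3.14 |
|  |  |  | 2.05 | 1.93 | 2.26 | 2.19 | 3.05 | 0.41 | 2.43 | 2.53 |
|  |  |  | 1.91 | 1.85 | 2.14 | 2.06 | 2.94 | 0.39 | 2.38 | 2.40 |
|  |  |  | 2.37 | 2.08 | 2.34 | 2.29 | 3.08 | 0.47 | 2.55 | 2.82 |
|  |  |  | 1.74 | 1.68 | 1.81 | 1.78 | 2.83 | 0.39 | 2.13 | 2.27 |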

5.2 Effectiveness of Low-Dimensional Representation

To explore the impact of the number of model parameters and the latent feature representation of the image on model performance, we use different image feature dimensions (i.e., [512, 256, 128, 64, 32, 16]) to examine the effectiveness of low-dimensional representations. As shown in Fig. 7, we tested CILF-CIAE and the other comparative methods on six data sets at different dimensions and report the MAE values. Specifically, the MAE value of CILF-CIAE increases only slightly as the feature embedding dimension decreases on the six data sets, while the performance of the other comparison methods drops sharply. These results demonstrate the robustness of our method. The stable performance of CILF-CIAE may be attributed to the fact that the CLIP-based estimation algorithm contains rich image prior knowledge, which can improve the induction ability of the model. In addition, the Transformer architecture that implements contextual prompts is built on the Fourier transform module, which is a parameter-free function and is therefore insensitive to parameter changes.

As shown in Fig. 8, we tested CILF-CIAE and the other comparative methods on six data sets at different dimensions and report the CS values. On the MORPH-S1 and MORPH-S2 data sets, the CS value of CILF-CIAE decreases slightly as the image feature embedding dimension decreases. On the other data sets, the CS value decreases rapidly as the image feature embedding dimension decreases. However, the CS of CILF-CIAE remains higher than that of the other comparison algorithms. The superior performance may be attributed to the optimization branch's ability to keep the prediction results at a relatively high confidence level.

Figure 9: An example of age estimation results of our CILF-CIAE on the MORPH-II face dataset. The true labels are on the left and the estimated results are on the right. Poor estimation results are shown as red numbers.

5.3 Ablation Study

As shown in Table 1, we perform ablation experiments on all test data. We separately explore the effectiveness of the three modules proposed in this paper, i.e., the spatial interaction module, the channel evolution module, and the error correction module. If none of the three modules is used, the CLIP model is used directly to estimate the age from the image, and this configuration yields the worst results on all data sets. If one module is used, age estimation with the error correction module is the best, age estimation with the spatial interaction module is second, and age estimation with the channel evolution module is the worst. When two modules are used, the combination of the spatial interaction module and the error correction module is the best, and the combination of the spatial interaction module and the channel evolution module is the worst. When all three modules are used, the age estimation results are the best in all cases. The ablation experiments demonstrate the effectiveness of each module proposed in this paper.

Figure 10: We visualize the learned features using t-SNE on the MORPH-II (S1) training and test sets. We visualize the distribution of the six age categories in the two-dimensional feature space.

5.4 Qualitative Results Analysis

To demonstrate the effectiveness of CILF-CIAE more intuitively, we conducted extensive experiments on the MORPH-II benchmark dataset. Fig. 9 shows the predictions of CILF-CIAE alongside the true labels. We observe that CILF-CIAE predicts the age accurately for most face images. The inaccurate predictions on a few images may be attributed to the fact that the images are synthetic and the pose variations are large.

We further use t-SNE to visualize the distribution of the features learned in the training and testing phases on the MORPH-II (S1) dataset. As can be seen from Fig. 10, the class boundaries of the features learned in both phases are relatively clear, and each age category has a relatively compact feature distribution.

6 CONCLUSION AND FUTURE WORK

This paper proposes a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE) method for age estimation. Firstly, we use the Image Encoder and Text Encoder in CLIP to obtain the corresponding feature representations and perform age estimation. Secondly, we introduce a Transformer architecture based on the Fourier transform to achieve spatial interaction and channel evolution of image features. Specifically, we replace the attention module in the Transformer with a Fourier transform and feed the image features into FourierFormer to achieve spatial interaction and channel evolution. Finally, we build an error-correcting reversible age estimation module that keeps the predicted age within a high-confidence interval in an end-to-end learning manner. The proposed CILF-CIAE achieves the best age estimation performance on multiple age estimation datasets. In future work, we will investigate cross-dataset age estimation, which can improve the generalization ability of the model.
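To make the idea of replacing attention with a Fourier transform concrete, the following is a generic NumPy sketch of FFT-based spatial mixing: the feature map is transformed to the frequency domain, multiplied element-wise by a filter, and transformed back. It is an illustration under stated assumptions and not the exact FourierFormer block. The function name, the tensor layout (height, width, channels), and the fixed filters are hypothetical, in a trained model the filter would be learned, and channel evolution is not shown.

```python
import numpy as np


def fourier_spatial_mixing(features: np.ndarray, spectral_weight: np.ndarray) -> np.ndarray:
    """Mix spatial positions globally through a frequency-domain filter.

    features:        real array of shape (H, W, C)
    spectral_weight: complex array of shape (H, W // 2 + 1, C)
    """
    height, width, _ = features.shape
    # 2D real FFT over the two spatial axes, O(HW log(HW)) per channel.
    spectrum = np.fft.rfft2(features, axes=(0, 1))
    # Element-wise filtering: each coefficient depends on every spatial position.
    filtered = spectrum * spectral_weight
    # Back to the spatial domain with the original resolution.
    return np.fft.irfft2(filtered, s=(height, width), axes=(0, 1))


rng = np.random.default_rng(0)
features = rng.standard_normal((14, 14, 8))

# An all-ones filter leaves the input unchanged (sanity check).
identity_weight = np.ones((14, 14 // 2 + 1, 8), dtype=complex)
unchanged = fourier_spatial_mixing(features, identity_weight)

# Keeping only the zero-frequency term replaces every position by the channel mean.
dc_weight = np.zeros((14, 14 // 2 + 1, 8), dtype=complex)
dc_weight[0, 0, :] = 1.0
averaged = fourier_spatial_mixing(features, dc_weight)
```

The second example shows why a single frequency-domain multiplication acts globally: one retained coefficient already aggregates information from all spatial positions, without forming the pairwise interaction matrix that self-attention requires.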

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 69189338), the Excellent Young Scholars of Hunan Province of China (Grant No. 20B625, No. 18B196), the Changsha Natural Science Foundation (Grant No. kq2202294), and the program of Research on Local Community Structure Detection Algorithms in Complex Networks (Grant No. 2020YJ009).

References

  • [1] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, A. Yuille, Deep differentiable random forests for age estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2) (2019) 404–419.
  • [2] H. Liu, P. Sun, J. Zhang, S. Wu, Z. Yu, X. Sun, Similarity-aware and variational deep adversarial learning for robust facial age estimation, IEEE Transactions on Multimedia 22 (7) (2020) 1808–1822.
  • [3] N. Yin, L. Shen, M. Wang, X. Luo, Z. Luo, D. Tao, Omg: Towards effective graph classification against label noise, IEEE Transactions on Knowledge and Data Engineering 35 (12) (2023) 12873–12886. doi:10.1109/TKDE.2023.3271677.
  • [4] R. Rothe, R. Timofte, L. Van Gool, Deep expectation of real and apparent age from a single image without facial landmarks, International Journal of Computer Vision 126 (2-4) (2018) 144–157.
  • [5] Z. Bao, Y. Luo, Z. Tan, J. Wan, X. Ma, Z. Lei, Deep domain-invariant learning for facial age estimation, Neurocomputing 534 (2023) 86–93.
  • [6] N. Yin, L. Shen, H. Xiong, B. Gu, C. Chen, X. Hua, S. Liu, X. Luo, Messages are never propagated alone: Collaborative hypergraph neural network for time-series forecasting, IEEE Transactions on Pattern Analysis and Machine Intelligence (01) (5555) 1–15. doi:10.1109/TPAMI.2023.3331389.
  • [7] Z. Bao, Z. Tan, J. Wan, X. Ma, G. Guo, Z. Lei, Divergence-driven consistency training for semi-supervised facial age estimation, IEEE Transactions on Information Forensics and Security 18 (2022) 221–232.
  • [8] Z. Deng, H. Liu, Y. Wang, C. Wang, Z. Yu, X. Sun, Pml: Progressive margin loss for long-tailed age classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10503–10512.
  • [9] Z. Niu, M. Zhou, L. Wang, X. Gao, G. Hua, Ordinal regression with multiple output cnn for age estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4920–4928.
  • [10] M. Duan, K. Li, K. Li, An ensemble cnn2elm for age estimation, IEEE Transactions on Information Forensics and Security 13 (3) (2017) 758–772.
  • [11] H. Wang, V. Sanchez, C.-T. Li, Improving face-based age estimation with attention-based dynamic patch fusion, IEEE Transactions on Image Processing 31 (2022) 1084–1096.
  • [12] K. Zhang, N. Liu, X. Yuan, X. Guo, C. Gao, Z. Zhao, Z. Ma, Fine-grained age estimation in the wild with attention lstm networks, IEEE Transactions on Circuits and Systems for Video Technology 30 (9) (2019) 3140–3152.
  • [13] Y. Shou, X. Cao, D. Meng, Masked contrastive graph representation learning for age estimation, arXiv preprint arXiv:2306.17798 (2023).
  • [14] D. Cao, Z. Lei, Z. Zhang, J. Feng, S. Z. Li, Human age estimation using ranking svm, in: Biometric Recognition: 7th Chinese Conference, CCBR 2012, Guangzhou, China, December 4-5, 2012. Proceedings 7, Springer, 2012, pp. 324–331.
  • [15] L. Shen, J. Zheng, E. H. Lee, K. Shpanskaya, E. S. McKenna, M. G. Atluri, D. Plasto, C. Mitchell, L. M. Lai, C. V. Guimaraes, et al., Attention-guided deep learning for gestational age prediction using fetal brain mri, Scientific reports 12 (1) (2022) 1408.
  • [16] N. Yin, L. Shen, M. Wang, L. Lan, Z. Ma, C. Chen, X.-S. Hua, X. Luo, Coco: A coupled contrastive framework for unsupervised domain adaptive graph classification, arXiv preprint arXiv:2306.04979 (2023).
  • [17] N. Yin, L. Shen, B. Li, M. Wang, X. Luo, C. Chen, Z. Luo, X.-S. Hua, Deal: An unsupervised domain adaptive framework for graph-level classification, in: Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 3470–3479. doi:10.1145/3503161.3548012.
  • [18] N.-H. Shin, S.-H. Lee, C.-S. Kim, Moving window regression: A novel approach to ordinal regression, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18760–18769.
  • [19] W. Cao, V. Mirjalili, S. Raschka, Rank consistent ordinal regression for neural networks with application to age estimation, Pattern Recognition Letters 140 (2020) 325–331.
  • [20] Y. Zhang, L. Liu, C. Li, et al., Quantifying facial age by posterior of age comparisons, arXiv preprint arXiv:1708.09687 (2017).
  • [21] W. Li, J. Lu, J. Feng, C. Xu, J. Zhou, Q. Tian, Bridgenet: A continuity-aware probabilistic network for age estimation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1145–1154.
  • [22] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, A. L. Yuille, Deep regression forests for age estimation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2304–2313.
  • [23] G. Levi, T. Hassner, Age and gender classification using convolutional neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 34–42.
  • [24] G.-S. Xie, X.-Y. Zhang, S. Yan, C.-L. Liu, Hybrid cnn and dictionary-based models for scene recognition and domain adaptation, IEEE Transactions on Circuits and Systems for Video Technology 27 (6) (2015) 1263–1274.
  • [25] J. Lee, J. Kim, H. Shon, B. Kim, S. H. Kim, H. Lee, J. Kim, Uniclip: Unified framework for contrastive language-image pre-training, Advances in Neural Information Processing Systems 35 (2022) 1008–1019.
  • [26] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, H. Li, Pointclip: Point cloud understanding by clip, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8552–8562.
  • [27] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., A survey on vision transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (1) (2022) 87–110.
  • [28] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah, Transformers in vision: A survey, ACM computing surveys (CSUR) 54 (10s) (2022) 1–41.
  • [29] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, H. Li, Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training, Advances in neural information processing systems 35 (2022) 27061–27074.
  • [30] K. Zhou, J. Yang, C. C. Loy, Z. Liu, Conditional prompt learning for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825.
  • [31] M. Zhou, J. Huang, C.-L. Guo, C. Li, Fourmer: an efficient global modeling paradigm for image restoration, in: International Conference on Machine Learning, PMLR, 2023, pp. 42589–42601.
  • [32] R. Kang, T. Mu, P. Liatsis, D. C. Kyritsis, Physics-driven ml-based modelling for correcting inverse estimation, in: 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.
  • [33] E. Agustsson, R. Timofte, L. Van Gool, Anchored regression networks applied to age estimation and super resolution, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1643–1652.
  • [34] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, X. Geng, Deep label distribution learning with label ambiguity, IEEE Transactions on Image Processing 26 (6) (2017) 2825–2838.
  • [35] S. Chen, C. Zhang, M. Dong, Deep age estimation: From classification to ranking, IEEE Transactions on Multimedia 20 (8) (2017) 2209–2222.
  • [36] K.-Y. Chang, C.-S. Chen, A learning framework for age rank estimation based on face images with scattering transform, IEEE Transactions on Image Processing 24 (3) (2015) 785–798.
  • [37] Z. Tan, Y. Yang, J. Wan, G. Guo, S. Z. Li, Deeply-learned hybrid representations for facial age estimation, in: IJCAI, 2019, pp. 3548–3554.