CILF-CIAE: CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation
Abstract
The age estimation task aims to predict the age of an individual by analyzing facial features in an image. Advances in age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control). In recent years, contrastive language-image pre-training (CLIP) has been widely used in various multimodal tasks and has made some progress in the field of age estimation. However, existing CLIP-based age estimation methods require high memory usage (quadratic complexity) when globally modeling images, and lack an error feedback mechanism to inform the model about the quality of its age predictions. To tackle these issues, we propose a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first use the CLIP model to extract image features and text semantic information, and map them into a semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has log-linear complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between different modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Extensive experiments on multiple data sets show that CILF-CIAE achieves better age prediction results.
keywords:
Age estimation, CLIP, Error correction, Transformer, Fourier Transform.
Information Fusion. 2021, Published by Elsevier Ltd.
1 Introduction
1.1 Motivation
The task of age estimation aims to determine the age of a person based on the facial features in an image. In recent years, owing to the massive increase in image data sets and the widespread application of deep learning (DL), age estimation methods have made important progress and attracted widespread research attention [1], [2], [3]. Furthermore, age estimation is widely used in many scenarios. For example, age estimation in finance and insurance can help detect fraud where age is falsely stated to obtain improper benefits [4], [5], [6].
![Refer to caption](x1.png)
The current mainstream age estimation methods are divided into three categories: CNN [9], [10], attention network [11], [12], and GCN [13]. To extract global information and multi-scale information in images, a CNN-based age estimation algorithm is applied. For example, Rothe et al. [4] estimated an individual’s true age and apparent age from a single face image based on a CNN method. Unlike many traditional machine learning methods [14], this method does not require the use of facial feature point markers and only requires the input of face images for age estimation. However, CNN-based methods cannot capture the semantic features in images that are most relevant to age features. To give higher weight to the semantic features in the image that are most relevant to the age feature, attention networks began to be applied. For instance, Shen et al. [15] introduced an attention mechanism so that the model can automatically focus on regions in the image that are relevant for age estimation, which helps improve the model’s perception of important features related to age. However, attention network-based methods cannot flexibly model irregular objects. To overcome the above problems, Shou et al. [13] proposed a contrastive multi-view GCN for age estimation (CMGCN). CMGCN improves the feature representation capabilities of images by extending image representation into topological semantic space. However, the methods mentioned above are all supervised learning methods and ignore the CLIP-based multimodal learning paradigm. Taking Fig 1(a) and (b) as an example, existing age estimation algorithms mainly focus on supervised, or self-supervised algorithm design [7], [8], ignoring the contrastive image-language pre-training (CLIP) paradigm. CLIP can learn the prior information of faces from a large number of text-image pairs and provide better generalization for downstream tasks. 
Specifically, CLIP learns the correlation between images and text from a large number of image-text pairs through contrastive learning. Furthermore, existing algorithms directly predict age and lack an error information feedback mechanism, which may lead to a large error between the model’s predicted age and the true label. Therefore, it is necessary to take CLIP multimodal learning paradigm and error-controllable generation as the starting point for model design.
To tackle the above problems, we propose a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE) to perform age estimation. CILF-CIAE mainly includes four modules: a CLIP-based visual and language feature encoder, FourierFormer-based feature fusion, age prediction, and an error-controllable generation module. Firstly, we use the image encoder and text encoder in CLIP to encode image and text features respectively and obtain the corresponding feature representations. We then jointly map the image and text feature representations into a shared high-dimensional feature space for contrastive learning to obtain aligned text and image semantic vectors, and use the resulting image semantic vectors to perform age estimation. Secondly, as shown in Fig. 1(d) and (e), previous CNN-based and attention-based architectures each have a limitation: CNN-based methods can only extract local information from the image and are difficult to combine with contextual prompt modules to enhance age estimation, while attention-based methods require large memory usage (quadratic complexity). We therefore introduce a Transformer architecture based on the Fourier transform to realize the spatial interaction and channel evolution of image features, and to fuse text and image feature information to improve age estimation performance. Specifically, we replace the attention module in the Transformer with the Fourier transform and input image features into FourierFormer to achieve spatial interaction and channel evolution. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between different modalities. Thirdly, we construct an age prediction loss and a text-image matching loss to optimize the model parameters.
Finally, we build an error-correcting reversible age estimation module to ensure that the predicted age is within a high-confidence interval in an end-to-end learning manner.
1.2 Our Contributions
Therefore, CLIP multimodal learning, spatial interaction of images, and channel evolution should be the core of age estimation algorithm design. Inspired by the above analysis, we propose a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE) to perform age estimation. The main contributions of this paper are summarized as follows:
1. A novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation architecture is presented and named CILF-CIAE. CILF-CIAE is able to learn information about age from input images.
2. A new Transformer structure is designed, i.e., FourierFormer. FourierFormer replaces the attention mechanism with the Fourier transform to realize the channel evolution and spatial interaction of image features.
3. An efficient contrastive multimodal learning module is utilized to supervise the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between different modalities.
4. An end-to-end error feedback mechanism is proposed to ensure that the confidence of age estimation is within a credible range.
5. Extensive experiments are conducted on six real data sets to verify the effectiveness of the proposed CILF-CIAE method. Experimental results show that CILF-CIAE achieves better age prediction than the compared baselines.
2 Related work
2.1 Age Estimation
Traditional age estimation methods usually rely on hand-designed feature extraction and machine learning algorithms, which are limited by feature selection and age estimation performance [14], [16], [17]. With the popularity of the Internet and social media (e.g., Meta, Twitter, and YouTube), large-scale face image datasets have grown rapidly. This growth provides rich training data for deep learning (DL), making DL’s learning capabilities more powerful. Age estimation has potential applications in social media analysis, ad targeting, security monitoring, medical image analysis, etc. For example, in security and legal applications, image age estimation can assist police in identifying possible underage criminal suspects.
Existing age estimation algorithms are mainly divided into two categories, i.e., algorithms based on machine learning and algorithms based on deep learning. Machine learning-based age estimation algorithms mainly rely on hand-designed rules to extract age-related features of images. Age estimation algorithms based on deep learning mainly use deep learning models (e.g., CNN, Transformer, and GCN) with powerful adaptive learning capabilities and massive data sets to estimate age in an end-to-end manner.
Machine learning methods: Among age estimation algorithms based on traditional machine learning, Shin et al. [18] proposed an ordinal regression algorithm based on moving window regression (MWR), which first ranks the input and reference labels and designs global and local regressors to predict the global and local rankings. MWR achieves fine-grained age estimation by iteratively refining the ranking order. However, the computational complexity of MWR is relatively high. Cao et al. [19] proposed the consistent rank logits (CORAL) algorithm to solve the inconsistency problem of multiple binary ordinal regression algorithms. CORAL ensures ranking consistency by introducing confidence scores. Cao et al. [14] proposed the Ranking SVM (RSVM) algorithm for image age estimation. This algorithm estimates age by first grouping ages and then ranking them. RSVM can reduce the hypothesis space of model learning. Zhang et al. [20] achieved age estimation by learning the probability distribution of label information. This algorithm achieves age prediction by calculating the posterior probability of the image. There are other typical traditional machine learning algorithms [21], [22].
Deep learning methods: In the age estimation algorithms based on deep learning algorithms, CNN [23], attention network [11], and hybrid neural network systems [24] are currently common age estimation algorithms. For example, Levi et al. [23] proposed an age estimation algorithm based on deep CNN to solve the problem of insufficient performance of traditional machine learning algorithms. DeepCNN can achieve better prediction results even on a small amount of data sets. Duan et al. [10] proposed the CNN2ELM algorithm to combine the advantages of CNN and regression algorithms. CNN2ELM constructed three feature extraction networks of age, gender, and race, and used a fusion mechanism to fuse the complementary information of the three networks, and used ELM for regression prediction of age. Wang et al. [11] proposed the Attention-based Dynamic Patch Fusion algorithm to solve the problem that CNN cannot extract the most beneficial semantic information in the image for the age estimation task. ADPF introduces attention network and fusion network to dynamically extract image patches with rich semantic features and adaptively fuse the extracted feature information. Zhang et al. [12] proposed a fine-grained attention LSTM algorithm to solve the problem that existing methods only focus on the global information of the image and ignore the fine-grained features of the image. This method first uses the residual network to extract the global information of the image, and then uses the attention LSTM to capture the sensitive area information of the image to obtain local important semantic features in the image. Xie et al. [24] integrated CNN’s feature extraction capabilities, domain generalization capabilities, and local information discrimination capabilities based on dictionary algorithms. 
This method first uses a pre-trained CNN to extract the feature representation of the image, and then builds a dictionary representation to extract the local feature information and Fisher vector representation of the image.
![Refer to caption](x2.png)
2.2 Contrastive Image-Language Pre-training
With the powerful representation ability of the pre-trained visual-language model CLIP [25] in feature extraction learning, it has been widely used in CV tasks. CLIP uses a contrastive learning method to train by maximizing the similarity between the embedding vectors of related texts and images, which enables the model to find the most relevant text-image pairs in the embedding space and achieve natural language and image multimodal understanding. In the field of age estimation, we need a CLIP-based backbone network to directly perform inductive reasoning.
3 METHODOLOGY
3.1 The Design of the CILF-CIAE Structure
The CILF-CIAE architecture proposed in this paper is shown in Fig. 2; it consists of an age prediction stage and an age error optimization stage. Specifically, we first use a CLIP-based age estimation model with a Fourier prior module to predict the age of images. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive loss for image-text matching, thereby improving the interaction between different modalities. Furthermore, if the error between the predicted and actual values exceeds a given threshold, the optimization branch is activated. The age errors are then used to train an ensemble error correction model that updates the predicted age. This training process continues until the error falls below the threshold. The details of the CILF-CIAE architecture are described below.
3.1.1 Language-guided Visual Age Prediction
As shown in Fig. 2, we briefly introduce the CLIP-based visual language pre-training model for age estimation. CLIP consists of an image encoder and a text encoder [26]. The image encoder aims to extract the underlying features of an image and map them into a low-dimensional embedding space. Its architecture usually uses ViT [27], which offers superior performance. The text encoder often uses a Transformer [28] to generate text representations with rich semantic information. Given a text prompt, such as “A photo of a 12 year old person,” the text encoder first converts the lowercased text into a byte-pair encoded (BPE) token sequence, in which each token has a unique identifier. The beginning and end of each text sequence are marked by [SOS] and [EOS]. Afterwards, the text representation is mapped into a 512-dimensional feature space, and the text Transformer is then used for sequence modeling. Then, given an image feature obtained by the image encoder, the cosine similarity function is used to calculate the similarity between the image and the text prompt. The similarity formula is defined as follows:
$$S_{i,j} = \frac{\exp\left(\cos(t_i, v_j)/\tau\right)}{\sum_{k=1}^{N}\exp\left(\cos(t_i, v_k)/\tau\right)} \tag{1}$$
where $S$ is the similarity matrix, $t_i$ is the feature vector of the $i$-th text sequence obtained by the text encoder, $v_j$ is the feature vector of the $j$-th image obtained by the image encoder, $N$ represents the total number of training samples, $\cos(\cdot,\cdot)$ represents cosine similarity, and $\tau$ represents the temperature coefficient.
To further narrow the semantic gap between image and text features, we design an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through contrastive loss for image-text matching loss, thereby improving the interaction effect between different modalities. The optimization goal for image-text matching loss is defined as follows:
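$$\mathcal{L}_{itm} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\cos(t_i, v_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\cos(t_i, v_j)/\tau\right)} \tag{2}$$
where $N$ is the number of training samples, $t_i$ and $v_i$ are the text and image feature vectors of the $i$-th pair, and $\tau$ is the temperature coefficient.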
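As a rough illustration of the similarity computation and the matching loss described above, the following minimal sketch builds a temperature-scaled, softmax-normalized cosine similarity matrix and averages the negative log-probability of the matched pairs. The toy feature vectors, the temperature value 0.07, and all function names are assumptions made for illustration and are not taken from the paper; the loss is one common text-to-image contrastive form and may differ from the exact objective used by CILF-CIAE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(text_feats, image_feats, tau=0.07):
    """Row-wise softmax over temperature-scaled cosine similarities.

    Entry [i][j] can be read as the probability that text i matches image j.
    tau=0.07 is an assumed temperature, not a value from the paper.
    """
    rows = []
    for t in text_feats:
        logits = [cosine(t, v) / tau for v in image_feats]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def matching_loss(sim):
    """Mean negative log-probability of the matched (diagonal) pairs."""
    n = len(sim)
    return -sum(math.log(sim[i][i]) for i in range(n)) / n


# Hypothetical toy features: text i is roughly aligned with image i.
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feats = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]

S = similarity_matrix(text_feats, image_feats)
loss = matching_loss(S)
```

Because each row is normalized over all images, lowering the loss pushes the matched pair up and every mismatched pair down at the same time, which is the behavior the contrastive supervision relies on.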
3.1.2 Context-Aware Prompting
Previous work has demonstrated that feature alignment of visual and language modalities can significantly improve the performance of CLIP models on downstream tasks [29], [30]. Therefore, we consider whether we can design a customized context-aware prompting method to improve text features.
Vision-to-language prompting. The textual features that fuse visual global context information can make age estimation predictions more accurate. For example, “a photo of a 68-year-old man with gray hair” is a more accurate prediction than “a photo of a 68-year-old man.” Therefore, we design a customized Fourier prior module to utilize visual global context information to improve text features in fine granularity. Specifically, we use the Fourierformer decoder to realize image spatial information interaction and channel evolution, and model the interaction between vision and language.
![Refer to caption](x3.png)
3.1.3 Fourier Prior Embedded Block
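The Fourier transform is used for frequency domain filtering, compression and feature extraction in image processing [31]. By converting the image to the frequency domain, patterns and structures in the image can be more easily identified. For a given image $x \in \mathbb{R}^{H \times W \times C}$, the Fourier transform $\mathcal{F}$ is applied to each image channel separately and transforms it into the frequency domain as a complex component $\mathcal{F}(x)$. The formula is defined as:
$$\mathcal{F}(x)(u,v) = \frac{1}{\sqrt{HW}}\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} x(h,w)\, e^{-j2\pi\left(\frac{h}{H}u+\frac{w}{W}v\right)} \tag{3}$$
where $u$ and $v$ represent the horizontal and vertical coordinates of the Fourier domain. The amplitude component $\mathcal{A}(x)(u,v)$ and the phase component $\mathcal{P}(x)(u,v)$ are obtained as follows:
$$\mathcal{A}(x)(u,v) = \sqrt{R^2(x)(u,v)+I^2(x)(u,v)}, \qquad \mathcal{P}(x)(u,v) = \arctan\left[\frac{I(x)(u,v)}{R(x)(u,v)}\right] \tag{4}$$
where $I(x)$ and $R(x)$ represent the imaginary and real parts of $\mathcal{F}(x)$, respectively.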
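To make the amplitude and phase decomposition concrete, here is a small sketch that applies a naive 2-D DFT to a single image channel and then extracts both components. It assumes the unitary $1/\sqrt{HW}$ normalization and uses `atan2` (the quadrant-safe form of the arctangent); the function names and the 2×2 input are hypothetical, and a practical implementation would rely on an FFT routine rather than this quadruple loop.

```python
import cmath
import math


def dft2(channel):
    """Naive 2-D DFT of one image channel given as a list of rows.

    Uses the unitary 1/sqrt(H*W) scaling (an assumption of this sketch).
    Cost is O((H*W)^2), so it is only suitable for tiny inputs.
    """
    H, W = len(channel), len(channel[0])
    scale = 1.0 / math.sqrt(H * W)
    out = [[0j] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            acc = 0j
            for h in range(H):
                for w in range(W):
                    angle = -2.0 * math.pi * (h * u / H + w * v / W)
                    acc += channel[h][w] * cmath.exp(1j * angle)
            out[u][v] = scale * acc
    return out


def amplitude_phase(spectrum):
    """Split a complex spectrum into amplitude and phase maps."""
    amp = [[abs(c) for c in row] for row in spectrum]
    pha = [[math.atan2(c.imag, c.real) for c in row] for row in spectrum]
    return amp, pha


channel = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical 2x2 single-channel image
spec = dft2(channel)
amp, pha = amplitude_phase(spec)
```

With the unitary scaling, the sum of squared amplitudes equals the sum of squared pixel values, which is a convenient sanity check for any DFT implementation.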
Structure Flow. The main goal of designing the Fourier prior module in this paper is to achieve an effective and efficient paradigm for modeling global image context and to improve the representation ability of text features, as shown in Fig. 3. For a given image, we first use the CLIP-based image encoder to extract its shallow features. The shallow features are then encoded by stacked image encoders. The FourierFormer module designed in this paper consists of a stack of spatial interaction modules, channel evolution modules, residual and layer normalization modules, and Fourier prior modules. Similarly, for the image decoder, we use a stack of the proposed core modules for image feature decoding.
As shown in Fig. 4, the core module of FourierFormer consists of two parts: spatial interaction and channel evolution, which are implemented by depth-wise convolution and convolution, respectively, each combined with the DFT and IDFT.
![Refer to caption](x4.png)
Fourier Spatial Interaction. Fourier spatial interaction first takes the image feature maps obtained by the image encoder as the input of FourierFormer, and then applies the DFT to convert them into a frequency-domain representation. Assuming that the features are expressed as $X$, the corresponding DFT formula is defined as:
$$X_R, X_I = \mathcal{F}(X) \tag{5}$$
where $X_R$ and $X_I$ represent the real and imaginary parts in the Fourier space. We then perform Fourier spatial interaction to filter and compress the frequency domain signal of the image through a depth-wise convolution (DWConv) operation with the LeakyReLU activation function $\sigma$. The spatial interaction process of images can be defined as:
$$\hat{X}_R = \sigma\left(\mathrm{DWConv}(X_R)\right), \qquad \hat{X}_I = \sigma\left(\mathrm{DWConv}(X_I)\right) \tag{6}$$
Then we apply the inverse DFT to the learned $\hat{X}_R$ and $\hat{X}_I$ with low-frequency signals to transform them back into the spatial domain. The formula for converting $\hat{X}_R$ and $\hat{X}_I$ back from the frequency domain is defined as follows:
$$X_S = \mathcal{F}^{-1}\left(\hat{X}_R, \hat{X}_I\right) \tag{7}$$
The convolution theorem in Fourier theory states that the convolution of two signals in the spatial (time) domain is equivalent to the element-wise product of their spectra in the frequency domain. The theorem therefore provides an efficient way to model global context: an element-wise operation on the frequency components affects every spatial position, which is generally cheaper than an equivalent large-kernel convolution in the spatial domain. Therefore, we concatenate the features obtained through the Fourier transform and normalize them to obtain the output of the Fourier spatial interaction.
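The standard form of the convolution theorem, in which an element-wise product in the frequency domain corresponds to a circular convolution in the spatial domain, can be checked numerically on a short 1-D signal. The sketch below is only a demonstration of that identity under assumed toy values; the signal, the kernel, and the helper names are hypothetical, and it is not the FourierFormer layer itself.

```python
import cmath
import math


def dft(x):
    """Forward 1-D DFT (unnormalized)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * k * m / n) for k in range(n))
            for m in range(n)]


def idft(X):
    """Inverse 1-D DFT, carrying the 1/n factor."""
    n = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * k * m / n) for m in range(n)) / n
            for k in range(n)]


def circular_conv(x, h):
    """Circular convolution computed directly in the spatial domain."""
    n = len(x)
    return [sum(x[j] * h[(k - j) % n] for j in range(n)) for k in range(n)]


signal = [1.0, 2.0, 3.0, 4.0]     # hypothetical 1-D feature row
kernel = [0.5, 0.25, 0.0, 0.25]   # hypothetical filter with global support

# Route 1: global convolution in the spatial domain.
spatial = circular_conv(signal, kernel)

# Route 2: element-wise product of the two spectra, then inverse DFT.
product = [a * b for a, b in zip(dft(signal), dft(kernel))]
spectral = [c.real for c in idft(product)]
```

Both routes give the same result, which is why a cheap per-frequency operation can stand in for a filter whose receptive field covers the whole feature map.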
Fourier Channel Evolution. Fourier channel evolution performs channel-by-channel evolution by applying a convolution operator to decompose the output of the Fourier space interaction into real and imaginary parts and . The Fourier channel evolution formula can be defined as:
(8) |
where is the concatenation operation. Then we perform IDFT to convert and to time domain space as follows:
(9) |
![Refer to caption](x5.png)
3.1.4 Two-stage Error Selection
As shown in Fig. 5, we first use a CLIP-based learning model to predict age. If the error exceeds the threshold, an optimization branch is used to optimize the error and give a predicted age with high confidence.
For a given observation , we use multiple models and metrics to evaluate the predicted age, resulting in an -dimensional error vectors, expressed as:
(10) |
where represents the error estimate calculated by the -th model, is the input image.
Each associated age estimate obtained from an observation follows the i.i.d. criterion, so is treated as a constant. Therefore, we can simplify Eq. 10 and obtain optimal model parameters by minimizing the error :
(11) |
where is determined using a voting mechanism, which is learnable.
Because ensemble learning [32] enables a more robust representation of the hypothesis space, we integrate multiple neural networks to estimate implicit errors. Each neural network uses a mapping function to estimate the error. We train the regressors with the same network architecture and use a voting algorithm to obtain the final prediction. Therefore, for a given input state, the implicit error is estimated by the ensemble network as follows:
(12) |
where is the learnable network parameters.
According to Eq. 12, we can obtain the cumulative age estimation error as follows:
(13) |
We divide the error of Eq. 13 into two parts, one is the estimated implicit error, and the other is the true explicit error. The estimated implicit error is obtained by learning the feature representation of the image encoder by the ensemble regressor we built, and the real explicit error is obtained by the age estimation model based on CLIP we built. At the same time, we optimize the network parameters of the ensemble regressor by minimizing the distance between the estimated implicit error and the true explicit error. The optimization goal is defined as follows:
(14) |
where
To achieve controllable generation of predicted states, we use the feature representation decoded by the image encoder as the input of the ensemble regressor to learn and sample candidate predicted ages. Therefore, the update target of network parameters is defined as follows:
(15) |
where is the the latent vectors. Finally, among the candidate age estimation states generated by the ensemble regressor with the trained network parameters , we select the final prediction result as follows:
(16) |
The age estimation error is calculated via Eq. 10. If the calculated error is less than the feasibility threshold, the selected age estimation state is considered acceptable and the predicted value is returned. Otherwise, the error is used to optimize the ensemble regressor model in the next iteration of parameter updates.
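The control flow of the two-stage selection can be sketched as follows. This is a simplified, single-pass illustration under stated assumptions: the regressors are plain Python callables rather than trained networks, voting is replaced by a simple average, and candidates are drawn by uniform perturbation around the initial prediction as a stand-in for the generative sampling described above. All names, the threshold, and the toy error functions are hypothetical.

```python
import random


def ensemble_error(candidate_age, regressors):
    """Average the error estimates of all regressors (a simple form of voting)."""
    return sum(r(candidate_age) for r in regressors) / len(regressors)


def select_age(predicted_age, explicit_error, regressors, threshold,
               n_candidates=32, spread=5.0, seed=0):
    """Return (age, used_optimization_branch).

    Stage 1: keep the initial prediction if its explicit error is below
    the feasibility threshold.
    Stage 2: otherwise sample candidate ages and keep the one with the
    lowest ensemble-estimated (implicit) error.
    """
    if explicit_error(predicted_age) < threshold:
        return predicted_age, False
    rng = random.Random(seed)
    candidates = [predicted_age + rng.uniform(-spread, spread)
                  for _ in range(n_candidates)]
    best = min(candidates, key=lambda age: ensemble_error(age, regressors))
    return best, True


TRUE_AGE = 30.0  # hypothetical ground-truth label, available during training


def explicit_error(age):
    """Explicit error against the known label."""
    return abs(age - TRUE_AGE)


# Hypothetical regressors: slightly biased estimates of the explicit error.
regressors = [
    lambda age: abs(age - 29.5),
    lambda age: abs(age - 30.0),
    lambda age: abs(age - 30.5),
]

accepted, branch_a = select_age(31.0, explicit_error, regressors, threshold=2.0)
corrected, branch_b = select_age(36.0, explicit_error, regressors, threshold=2.0)
```

In the full method the remaining error would feed back into the next round of regressor updates; the sketch stops after one selection step to keep the two stages easy to follow.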
3.2 Model Training
Mean Absolute Error (MAE) is a commonly used performance evaluation metric in regression problems, which measures the mean absolute difference between model predictions and actual observations. The Loss is defined as follows:
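$$\mathcal{L}(\theta; x_i) = \left| f_{\theta}(x_i) - y_i \right| \tag{17}$$
where $\theta$ is the set of learnable network parameters, $x_i$ represents the $i$-th training sample, $f_{\theta}(x_i)$ is its predicted age, and $y_i$ is its true age.
The optimization goal of the model is as follows:
$$\theta^{*} = \arg\min_{\theta} \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}(\theta; x_i) \tag{18}$$
where $N$ represents the total number of samples.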
4 EXPERIMENTS
4.1 Benchmark Dataset Used
In this paper, we use six benchmark datasets, MORPH-II (http://www.faceaginggroup.com/morph/), FG-Net (http://yanweifu.github.io/FG_NET_data/FGNET.zip), CACD (http://bcsiriuschen.github.io/CARC/), Adience (http://www.openu.ac.il/home/hassner/Adience/data.html), FACES (http://faces.mpib-berlin.mpg.de), and SC-FACE (https://www.scface.org/), to conduct our age estimation experiments and verify the effectiveness of our CILF-CIAE method.
MORPH-II. The MORPH-II dataset is widely used in facial image research (e.g., age estimation and facial recognition). The MORPH-II dataset contains 55,000 facial photos of 13,000 volunteers over a period of time. The MORPH-II dataset covers facial images of volunteers from different ethnicities, different genders, and different geographical regions from 1 to 80 years old.
Existing methods employ three different experimental settings on the MORPH-II dataset. The first setting (S1) selects 5,492 white images from the original dataset (80% images for training, 20% images for testing) and performs 5-fold cross-validation to reduce cross-race effects [4], [33]. The second setting (S2) randomly splits all images into training/test sets (80/20%) and performs 5-fold cross-validation [34]. The third setting (S3) randomly selects 21,000 images from MORPH and restricts the black-white race ratio to 1:1 and the female to male ratio to 1:3 [5].
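For the settings that use an 80/20 split with 5-fold cross-validation, the index bookkeeping could look like the sketch below. It uses the 5,492-image count from setting S1; the function name, the random seed, and the shuffling strategy are assumptions for illustration, since the cited protocols do not specify them here.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold.

    With five folds, every test fold holds about 20% of the samples and
    the remaining four folds (about 80%) form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


splits = list(five_fold_splits(5492))
```

Each image appears in exactly one test fold, so averaging the per-fold MAE gives an estimate that uses every image for evaluation once.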
FGNET. The FGNET dataset is composed of facial photos provided by volunteers from the age range of 0 to 69 years old. The FGNET dataset contains facial images of volunteers from different genders, different races, and different geographical areas. The FGNET dataset is mainly used to evaluate and improve the performance of facial age estimation algorithms.
CACD. CACD is also a dataset for facial age estimation, which mainly contains publicly available facial images of famous celebrities from social media (e.g., movies, TV, music). The CACD dataset contains more than 163,000 facial images of people from teenagers to older adults. The CACD dataset includes images of celebrities from different countries and different professions.
Adience. The Adience benchmark is an unconstrained dataset, i.e., there are no restrictions on gestures and photo poses. The face images in the Adience dataset are captured by mobile phone devices. Because these images are not subject to artificial data preprocessing and noisy image filtering, they can greatly reflect real-world challenges. The Adience dataset consists of 19,487 images, in which the numbers of males and females are 8,192 and 11,295 respectively.
FACES. The FACES face image dataset is a dataset used in psychology and neuroscience research, especially in studying age. This dataset was created by Ebner et al. in 2010 to provide a high-quality, diverse set of face images. The FACES dataset contains face photos of men and women ranging in age from 20 to 80 years old. The images show different emotional expressions such as happy, sad, angry and neutral expressions.
SC-FACE. SC-FACE (Surveillance Cameras Face Database) is a face image data set specially used for facial recognition research, especially facial recognition in surveillance environments. The dataset includes hundreds of images of subjects with facial expressions under different lighting conditions and backgrounds.
4.2 Evaluation Metrics
1) Mean Absolute Error (MAE): The MAE value reflects the absolute error between the true value of the sample and the predicted value of the model. In age estimation, MAE is more suitable as a model evaluation metric than MSE. The formula of MAE is defined as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \tag{19}$$
where $\hat{y}_i$ represents the predicted value of the model, $y_i$ represents the true value, and $N$ represents the number of samples.
2) Cumulative Score (CS): CS measures the proportion of face images for which the prediction error of the model does not exceed $l$ years. The formula for CS is defined as follows:
$$\mathrm{CS}(l) = \frac{N_l}{N}\times 100\% \tag{20}$$
where $N_l$ represents the number of samples for which the absolute error of the model does not exceed $l$, and $N$ is the total number of samples.
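Both metrics are straightforward to compute; a minimal sketch is given below. The function names and the four toy ages are hypothetical, and the tolerance of 5 years is only an example value for the CS threshold.

```python
def mean_absolute_error(y_pred, y_true):
    """Average absolute difference between predicted and true ages."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def cumulative_score(y_pred, y_true, max_error):
    """Percentage of samples whose absolute error is at most max_error years."""
    hits = sum(1 for p, t in zip(y_pred, y_true) if abs(p - t) <= max_error)
    return 100.0 * hits / len(y_true)


y_true = [25, 40, 33, 61]  # hypothetical ground-truth ages
y_pred = [27, 38, 39, 60]  # hypothetical predictions

mae = mean_absolute_error(y_pred, y_true)            # (2 + 2 + 6 + 1) / 4 = 2.75
cs5 = cumulative_score(y_pred, y_true, max_error=5)  # 3 of 4 within 5 years = 75.0
```

A lower MAE and a higher CS both indicate better predictions; CS additionally shows how the errors are distributed, because a single large miss lowers it by only one sample.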
4.3 Baseline Models
PML [8]: Deng et al. proposed a progressive margin loss (PML) method to adaptively learn the distribution pattern of age labels. The PML method fully considers the inter-class and intra-class age distribution differences, and can effectively alleviate the long-tail distribution problem of data.
Ranking-CNN [35]: Chen et al. designed a novel Ranking-CNN architecture for age estimation. Ranking-CNN uses CNN to rank age labels and then perform high-level feature extraction. Ranking-CNN theoretically proves that the error comes from the maximum error in the ranked labels.
DLDL [34]: The deep label distribution learning (DLDL) method proposed by Gao et al. can adaptively learn the characteristics of label ambiguity. DLDL discretizes the age labels and uses CNN to minimize the KL divergence between the predicted distribution and the true distribution to optimize the model parameters.
![Refer to caption](x6.png)
![Refer to caption](x7.png)
CSOHR [36]: Chang et al. proposed a method combining a hyperplane ranking algorithm and a cost-sensitive loss for age estimation. CSOHR performs feature extraction on images with relative order information and introduces cost-sensitive losses to improve prediction accuracy.
DEX [4]: The DEX proposed by Rothe et al. uses the VGG-16 architecture pre-trained on ImageNet for age estimation. DEX uses a deep CNN to align faces and age expectations to optimize model parameters.
CNN+ELM [10]: Duan et al. proposed a CNN and extreme learning machine (ELM) algorithm CNN2ELM for age estimation. CNN2ELM built three CNN networks to extract features and perform information fusion for Age, Gender and Race respectively, and then used ELM for the final age regression prediction.
DRF [1]: Shen et al. designed deep regression forests (DRF) for age estimation, which are continuously differentiable. DRF adaptively learns non-uniform age distribution data through the joint learning of a CNN and a random forest.
VDAL [2]: Liu et al. proposed a similarity-aware deep adversarial learning (SADAL) method for age estimation. SADAL enhances the model’s ability to learn facial age features through adversarial learning of positive and negative samples. In addition, SADAL designed a similarity-aware function to measure the distance between positive and negative samples to guide the optimization direction of the model.
DHR [37]: Tan et al. proposed a deep hybrid alignment architecture for age estimation, which captures image age features with complementary semantic information through joint learning of global and local branches. Furthermore, in each branch network, a fusion mechanism is used to explore the correlation between sub-networks.
DCT [7]: Bao et al. designed a divergence-driven consistency training mechanism to improve the quasi-efficiency of age estimation. DCT introduces an efficient sample selection strategy to select valid samples from unlabeled samples. Furthermore, DCT also introduces an identity consistency criterion to optimize the dependence between image features and age.
![Refer to caption](x8.png)
4.4 Implementation Details
We adopt CLIP’s pretrained image encoder as the backbone and directly integrate our designed Fourierformer as the decoder. In terms of language domain prompts, we choose a context length of 9. The Transformer decoder used to extract visual context consists of 6 layers. To reduce computational cost, we project image embeddings and text embeddings to 512 dimensions before the Transformer module. In terms of model fine-tuning, we observe that fine-tuning directly using the CLIP model does not produce satisfactory results. Therefore, we made a key modification: using AdamW as the optimizer for model training instead of the default SGD, which helps to improve the effectiveness of the training process and improve the final prediction performance.
5 RESULTS AND DISCUSSION
In this section, we discuss the experimental results of our method CILF-CIAE and other comparative methods on six data sets.
5.1 Comparison with Baseline Methods
To verify the performance of CILF-CIAE, we evaluated it on six real data sets and compared it with the baseline methods. The experimental results are shown in Fig. 6. CILF-CIAE achieves better MAE and CS values than the other methods on all six data sets. Specifically, under the three MORPH evaluation settings (MORPH-S1, MORPH-S2 and MORPH-S3), the MAE values of CILF-CIAE are 1.74, 1.68 and 1.81, and the CS values are 95.1%, 95.7% and 94.3%, respectively. All comparison algorithms obtain worse MAE and CS values in these settings. CILF-CIAE likewise outperforms the comparison algorithms on the other data sets, which indicates the robustness of the method.
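For readers less familiar with the two reported metrics, the sketch below computes MAE and CS on a handful of made-up predictions. It assumes the usual definitions: MAE is the mean absolute difference between predicted and true ages, and CS is the percentage of samples whose absolute error does not exceed a tolerance `theta`. The default `theta = 5` years and the toy ages are illustrative assumptions, not values taken from the experiments.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute difference between predicted and true ages (years)."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def cumulative_score(pred: Sequence[float], true: Sequence[float],
                     theta: float = 5.0) -> float:
    """Percentage of samples whose absolute error is at most `theta` years."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(pred, true) if abs(p - t) <= theta)
    return 100.0 * hits / len(pred)


# Hypothetical predictions and labels, for illustration only.
predicted = [23.0, 31.5, 47.0, 60.0]
actual = [25.0, 30.0, 40.0, 59.0]

mae = mean_absolute_error(predicted, actual)  # (2 + 1.5 + 7 + 1) / 4
cs5 = cumulative_score(predicted, actual)     # 3 of 4 samples within 5 years
print(f"MAE = {mae:.3f} years, CS(5) = {cs5:.1f}%")
```

Lower MAE and higher CS are better, which is the sense in which the results above are compared.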
Overall, CILF-CIAE learns better features than the comparison algorithms across all evaluated settings. The performance improvement can be attributed to the high-quality text-image alignment provided by the large CLIP model: image representations guided by language prompts greatly improve the ability to represent image features. At the same time, we introduce a context awareness module (i.e., FourierFormer) that acts back on the language prompts to improve the expression of text semantic information. Unlike the traditional Vision Transformer architecture, FourierFormer models the global information of the image with Fourier transform operations to achieve spatial interaction and channel evolution of image features. In addition, we introduce an error correction mechanism: when the age predicted by the CLIP-based age estimation model differs greatly from the actual age, the model starts the optimization branch and keeps optimizing the error until its stopping condition is reached.
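The idea of modeling global information with a Fourier transform instead of pairwise attention can be illustrated on a one-dimensional toy signal. The sketch below is not the FourierFormer architecture: it only demonstrates the underlying convolution theorem, namely that an element-wise product in the frequency domain mixes every position with every other position at O(n log n) cost, and that it matches the O(n^2) direct computation. The function names, the feature row, and the filter are hypothetical.

```python
import cmath
from typing import List


def fft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT (unnormalised); length must be a power of two."""
    n = len(x)
    if n == 0 or (n & (n - 1)):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fourier_mix(tokens: List[float], kernel: List[float]) -> List[float]:
    """Global mixing: multiply the two spectra, then transform back."""
    n = len(tokens)
    spectrum = [a * b for a, b in zip(fft(tokens), fft(kernel))]
    # The inverse transform above is unnormalised, so divide by n here.
    return [value.real / n for value in fft(spectrum, inverse=True)]


def circular_conv(tokens: List[float], kernel: List[float]) -> List[float]:
    """Reference O(n^2) mixing: each output reads every input position."""
    n = len(tokens)
    return [sum(tokens[j] * kernel[(i - j) % n] for j in range(n))
            for i in range(n)]


# Hypothetical 1-D feature row and mixing filter, for illustration only.
tokens = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

fast = fourier_mix(tokens, kernel)
slow = circular_conv(tokens, kernel)
print(max(abs(a - b) for a, b in zip(fast, slow)))  # difference is at rounding level
```

In a real model the filter would typically be learned and applied to two-dimensional feature maps. The sketch only shows why frequency-domain mixing gives a global receptive field without the quadratic cost of comparing all pairs of positions.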
Spatial interaction | Channel evolution | Error correction | MORPH-S1 | MORPH-S2 | MORPH-S3 | FGNET | CACD | Adience | FACES | SC-FACES |
---|---|---|---|---|---|---|---|---|---|---|
✗ | ✗ | ✗ | 2.71 | 2.46 | 2.84 | 2.69 | 3.31 | 0.52 | 3.01 | 3.43 |
✔ | ✗ | ✗ | 2.63 | 2.31 | 2.69 | 2.62 | 3.24 | 0.47 | 2.86 | 3.35 |
✗ | ✔ | ✗ | 2.65 | 2.34 | 2.69 | 2.67 | 3.26 | 0.48 | 2.83 | 3.36 |
✗ | ✗ | ✔ | 2.44 | 2.17 | 2.48 | 2.41 | 3.13 | 0.44 | 2.67 | 3.14 |
✗ | ✔ | ✔ | 2.05 | 1.93 | 2.26 | 2.19 | 3.05 | 0.41 | 2.43 | 2.53 |
✔ | ✗ | ✔ | 1.91 | 1.85 | 2.14 | 2.06 | 2.94 | 0.39 | 2.38 | 2.40 |
✔ | ✔ | ✗ | 2.37 | 2.08 | 2.34 | 2.29 | 3.08 | 0.47 | 2.55 | 2.82 |
✔ | ✔ | ✔ | 1.74 | 1.68 | 1.81 | 1.78 | 2.83 | 0.39 | 2.13 | 2.27 |
5.2 Effectiveness of Low-Dimensional Representation
To explore the impact of the number of model parameters and of the latent image feature representation on model performance, we vary the image feature dimension (i.e., [512, 256, 128, 64, 32, 16]) and examine the effectiveness of low-dimensional representations. Fig. 7 reports the MAE values of CILF-CIAE and the comparison methods on the six data sets for each dimension. On all six data sets, the MAE of CILF-CIAE increases only slightly as the feature embedding dimension decreases, while the performance of the comparison methods drops sharply. These results demonstrate the robustness of our method. The stable performance of CILF-CIAE may be attributed to the fact that the CLIP-based estimation algorithm contains rich image prior knowledge, which improves the inductive capability of the model. In addition, the Transformer architecture that implements contextual prompts through the Fourier transform module acts as a parameter-free estimation function and is insensitive to parameter changes.
Fig. 8 reports the CS values of CILF-CIAE and the comparison methods on the six data sets for the same feature dimensions. On the MORPH-S1 and MORPH-S2 data sets, the CS value of CILF-CIAE decreases only slightly as the image feature embedding dimension decreases. On the other data sets, the CS value decreases rapidly as the embedding dimension decreases. However, the performance of CILF-CIAE is always higher than that of the comparison algorithms. This may be attributed to the optimization branch’s ability to keep the prediction results at a relatively high confidence level.
![Refer to caption](x9.png)
5.3 Ablation Study
As shown in Table 1, we perform ablation experiments on all test data. We separately examine the effectiveness of the three modules proposed in this paper, i.e., the spatial interaction module, the channel evolution module and the error correction module. When none of the three modules is used, the CLIP model is applied directly to estimate the age from the image, and this configuration gives the worst results on the six data sets. When a single module is used, the error correction module gives the best age estimation results, the spatial interaction module is second, and the channel evolution module is the worst. When two modules are used, the combination of the spatial interaction module and the error correction module is the best, and the combination of the spatial interaction module and the channel evolution module is the worst. When all three modules are used, the age estimation results are the best in all cases. The ablation experiments demonstrate the effectiveness of each module proposed in this paper.
![Refer to caption](x10.png)
5.4 Qualitative Results Analysis
To more intuitively demonstrate the effectiveness of CILF-CIAE, we conducted extensive experiments on the Morph-II benchmark dataset. Fig. 9 shows the prediction results and true labels of CILF-CIAE. We observe that CILF-CIAE performs well in age prediction for most images and can accurately predict the age of faces. The inaccurate prediction on a few images may be attributed to the fact that the images are synthetic and the pose variations are large.
We further visualize the distribution of features learned in the training and testing phases on the Morph-II (S1) dataset using t-SNE. As can be seen from Fig. 10, the feature class boundaries learned in the training and testing phases are relatively clear, and different age categories have more compact feature distributions.
6 CONCLUSION AND FUTURE WORK
This paper proposes a novel CLIP-driven Image–Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE) method for age estimation. First, we use the image encoder and text encoder in CLIP to obtain the corresponding feature representations and perform age estimation. Second, we introduce a Transformer architecture based on the Fourier transform: we replace the attention module in the Transformer with a Fourier transform and feed the image features into FourierFormer to achieve spatial interaction and channel evolution. Finally, we build an error-correcting reversible age estimation module that, in an end-to-end learning manner, keeps the predicted age within a high-confidence interval. CILF-CIAE achieves the best age estimation results among the compared methods on multiple age estimation data sets. In future work, we will investigate estimation across data sets, which can improve the generalization ability of the model.
Acknowledgments
This work is supported by National Natural Science Foundation of China (Grant No. 69189338), Excellent Young Scholars of Hunan Province of China (Grant No. 20B625, No. 18B196), Changsha Natural Science Foundation (Grant No. kq2202294), and program of Research on Local Community Structure Detection Algorithms in Complex Networks (Grant No. 2020YJ009).
References
- [1] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, A. Yuille, Deep differentiable random forests for age estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (2) (2019) 404–419.
- [2] H. Liu, P. Sun, J. Zhang, S. Wu, Z. Yu, X. Sun, Similarity-aware and variational deep adversarial learning for robust facial age estimation, IEEE Transactions on Multimedia 22 (7) (2020) 1808–1822.
- [3] N. Yin, L. Shen, M. Wang, X. Luo, Z. Luo, D. Tao, Omg: Towards effective graph classification against label noise, IEEE Transactions on Knowledge and Data Engineering 35 (12) (2023) 12873–12886. doi:10.1109/TKDE.2023.3271677.
- [4] R. Rothe, R. Timofte, L. Van Gool, Deep expectation of real and apparent age from a single image without facial landmarks, International Journal of Computer Vision 126 (2-4) (2018) 144–157.
- [5] Z. Bao, Y. Luo, Z. Tan, J. Wan, X. Ma, Z. Lei, Deep domain-invariant learning for facial age estimation, Neurocomputing 534 (2023) 86–93.
- [6] N. Yin, L. Shen, H. Xiong, B. Gu, C. Chen, X. Hua, S. Liu, X. Luo, Messages are never propagated alone: Collaborative hypergraph neural network for time-series forecasting, IEEE Transactions on Pattern Analysis and Machine Intelligence (01) (5555) 1–15. doi:10.1109/TPAMI.2023.3331389.
- [7] Z. Bao, Z. Tan, J. Wan, X. Ma, G. Guo, Z. Lei, Divergence-driven consistency training for semi-supervised facial age estimation, IEEE Transactions on Information Forensics and Security 18 (2022) 221–232.
- [8] Z. Deng, H. Liu, Y. Wang, C. Wang, Z. Yu, X. Sun, Pml: Progressive margin loss for long-tailed age classification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10503–10512.
- [9] Z. Niu, M. Zhou, L. Wang, X. Gao, G. Hua, Ordinal regression with multiple output cnn for age estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4920–4928.
- [10] M. Duan, K. Li, K. Li, An ensemble cnn2elm for age estimation, IEEE Transactions on Information Forensics and Security 13 (3) (2017) 758–772.
- [11] H. Wang, V. Sanchez, C.-T. Li, Improving face-based age estimation with attention-based dynamic patch fusion, IEEE Transactions on Image Processing 31 (2022) 1084–1096.
- [12] K. Zhang, N. Liu, X. Yuan, X. Guo, C. Gao, Z. Zhao, Z. Ma, Fine-grained age estimation in the wild with attention lstm networks, IEEE Transactions on Circuits and Systems for Video Technology 30 (9) (2019) 3140–3152.
- [13] Y. Shou, X. Cao, D. Meng, Masked contrastive graph representation learning for age estimation, arXiv preprint arXiv:2306.17798 (2023).
- [14] D. Cao, Z. Lei, Z. Zhang, J. Feng, S. Z. Li, Human age estimation using ranking svm, in: Biometric Recognition: 7th Chinese Conference, CCBR 2012, Guangzhou, China, December 4-5, 2012. Proceedings 7, Springer, 2012, pp. 324–331.
- [15] L. Shen, J. Zheng, E. H. Lee, K. Shpanskaya, E. S. McKenna, M. G. Atluri, D. Plasto, C. Mitchell, L. M. Lai, C. V. Guimaraes, et al., Attention-guided deep learning for gestational age prediction using fetal brain mri, Scientific reports 12 (1) (2022) 1408.
- [16] N. Yin, L. Shen, M. Wang, L. Lan, Z. Ma, C. Chen, X.-S. Hua, X. Luo, Coco: A coupled contrastive framework for unsupervised domain adaptive graph classification, arXiv preprint arXiv:2306.04979 (2023).
- [17] N. Yin, L. Shen, B. Li, M. Wang, X. Luo, C. Chen, Z. Luo, X.-S. Hua, Deal: An unsupervised domain adaptive framework for graph-level classification, in: Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 3470–3479. doi:10.1145/3503161.3548012.
- [18] N.-H. Shin, S.-H. Lee, C.-S. Kim, Moving window regression: A novel approach to ordinal regression, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18760–18769.
- [19] W. Cao, V. Mirjalili, S. Raschka, Rank consistent ordinal regression for neural networks with application to age estimation, Pattern Recognition Letters 140 (2020) 325–331.
- [20] Y. Zhang, L. Liu, C. Li, et al., Quantifying facial age by posterior of age comparisons, arXiv preprint arXiv:1708.09687 (2017).
- [21] W. Li, J. Lu, J. Feng, C. Xu, J. Zhou, Q. Tian, Bridgenet: A continuity-aware probabilistic network for age estimation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1145–1154.
- [22] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, A. L. Yuille, Deep regression forests for age estimation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2304–2313.
- [23] G. Levi, T. Hassner, Age and gender classification using convolutional neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 34–42.
- [24] G.-S. Xie, X.-Y. Zhang, S. Yan, C.-L. Liu, Hybrid cnn and dictionary-based models for scene recognition and domain adaptation, IEEE Transactions on Circuits and Systems for Video Technology 27 (6) (2015) 1263–1274.
- [25] J. Lee, J. Kim, H. Shon, B. Kim, S. H. Kim, H. Lee, J. Kim, Uniclip: Unified framework for contrastive language-image pre-training, Advances in Neural Information Processing Systems 35 (2022) 1008–1019.
- [26] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, H. Li, Pointclip: Point cloud understanding by clip, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8552–8562.
- [27] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., A survey on vision transformer, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (1) (2022) 87–110.
- [28] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah, Transformers in vision: A survey, ACM computing surveys (CSUR) 54 (10s) (2022) 1–41.
- [29] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. Wang, Y. Qiao, H. Li, Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training, Advances in neural information processing systems 35 (2022) 27061–27074.
- [30] K. Zhou, J. Yang, C. C. Loy, Z. Liu, Conditional prompt learning for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825.
- [31] M. Zhou, J. Huang, C.-L. Guo, C. Li, Fourmer: an efficient global modeling paradigm for image restoration, in: International Conference on Machine Learning, PMLR, 2023, pp. 42589–42601.
- [32] R. Kang, T. Mu, P. Liatsis, D. C. Kyritsis, Physics-driven ml-based modelling for correcting inverse estimation, in: 37th Conference on Neural Information Processing Systems (NeurIPS), 2023.
- [33] E. Agustsson, R. Timofte, L. Van Gool, Anchored regression networks applied to age estimation and super resolution, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1643–1652.
- [34] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, X. Geng, Deep label distribution learning with label ambiguity, IEEE Transactions on Image Processing 26 (6) (2017) 2825–2838.
- [35] S. Chen, C. Zhang, M. Dong, Deep age estimation: From classification to ranking, IEEE Transactions on Multimedia 20 (8) (2017) 2209–2222.
- [36] K.-Y. Chang, C.-S. Chen, A learning framework for age rank estimation based on face images with scattering transform, IEEE Transactions on Image Processing 24 (3) (2015) 785–798.
- [37] Z. Tan, Y. Yang, J. Wan, G. Guo, S. Z. Li, Deeply-learned hybrid representations for facial age estimation., in: IJCAI, 2019, pp. 3548–3554.