Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

Zijia Song Zelin Zang Yelin Wang ShanghaiTech University Guozheng Yang National University of Defense Technology
Jiangbin Zheng Westlake University Kaicheng Yu Westlake University Wanyu Chen National University of Defense Technology Stan Z. Li

Abstract

Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive results. However, in various specialized fields, it is difficult to obtain sufficient alignment data for training, which seriously limits the use of previously elegant models. Semi-supervised learning therefore attempts to achieve multimodal alignment with fewer matched pairs, but traditional methods such as pseudo-labeling are difficult to apply in domains with no label information. To address these problems, we transform semi-supervised multimodal alignment into a manifold matching problem and propose a new method based on CLIP, named Gentle-CLIP. Specifically, we design a novel semantic density distribution loss that explores implicit semantic alignment information in unpaired multimodal data by constraining the latent representation distribution at fine granularity, thus eliminating the need for numerous strictly matched pairs. Meanwhile, we introduce multi-kernel maximum mean discrepancy as well as a self-supervised contrastive loss to pull the separate modality distributions closer and enhance the stability of the representation distribution. In addition, the contrastive loss used in CLIP is applied to the supervised matched data to prevent negative optimization. Extensive experiments on a range of tasks in various fields, including protein representation, remote sensing, and the general vision-language field, demonstrate the effectiveness of the proposed Gentle-CLIP.

1 Introduction

As a pivotal foundation for numerous tasks[11, 10], multimodal learning has become the focus of much research[8, 9]. By integrating information from diverse modalities such as text and images, multimodal models can acquire more comprehensive information and improve the generalization of the learned representations[36]. Meanwhile, multimodal fusion enables networks to emulate human-like multi-perceptual capabilities and to address inherent challenges such as data scarcity, noise and ambiguity in domains ranging from computer vision to healthcare[14, 3].

To better harness latent alignment information, most previous research has focused on frameworks and pretraining objectives that guide models to understand the relationships between different modalities. Owing to the maturity of the image and text fields, many studies have made great progress in the vision-language field[6, 51, 31, 33]. Among them, CLIP employs a contrastive pretraining task [4] on large-scale datasets and obtains robust multimodal representations. Thanks to its simple framework and general pretraining task, it is portable and scalable and performs competitively with supervised methods, so it has become the baseline for various vision-language works[53, 34]. Beyond this traditional field, CLIP has also enabled breakthroughs in other cross-disciplinary domains. EchoCLIP[13] improves the performance of cardiac imaging models by learning the association between cardiac ultrasound images and expert texts, while ProtST[1] captures more protein function information by aligning protein sequences with textual property descriptions. Moreover, in zero-shot video recognition, Open-VCLIP[22] shows excellent performance by leveraging a similar paradigm, which further demonstrates the power of CLIP.

Nonetheless, there are still many specialized fields where it is difficult to obtain sufficient alignment data[15], while traditional multimodal models such as CLIP can only learn from matched pairs, which greatly limits the performance of previously elaborate models. To resolve this dilemma, several studies have turned to these specialized fields and attempted to reduce the number of matched pairs needed for pretraining[16, 17, 18, 19]. The main idea is to modify the loss function of CLIP and apply semi-supervised learning methods[37, 38] to explore the latent alignment information in the unlabeled data. Recently, S-CLIP[28] improved on the original CLIP by introducing two novel pseudo-labeling methods for unlabeled images, achieving state-of-the-art results in various specialized semi-supervised vision-language fields.

However, pseudo-labeling methods may be limited to fields with class information and have difficulty scaling to other specialized multimodal domains where pseudo-labels are hard to obtain. Meanwhile, the knowledge used to generate pseudo-labels relies only on the insufficient labeled data, which narrows the model's view and may lose much potential alignment information. In addition, the quality of the pseudo-labels has a great impact on the final performance, so the learning process is unstable and can even be harmful[39]. To solve these problems, it is necessary to design new semi-supervised methods for multimodal data that can capture latent alignment information in unpaired data and extend well to various multimodal domains.

Therefore, we propose a novel semi-supervised learning method for multimodal alignment based on CLIP, named Gentle-CLIP. We posit that the final representation is composed of modality, structure and semantic components, and that the key to multimodal alignment is to capture the shared semantic representation while ignoring the other two. On the premise that the data from the two modalities share the same semantic distribution, we design a new pretraining task based on manifold matching and a novel loss called semantic density distribution (SDD) to better concentrate on the implicit alignment information in vast unpaired multimodal data. Moreover, we introduce multi-kernel maximum mean discrepancy (MK-MMD) to eliminate the difference between modality representations, while a self-supervised contrastive loss is used to prevent mode collapse and enhance the robustness of the semantic distribution. At the same time, we apply the contrastive loss from CLIP to the matched multimodal pairs to keep learning in the correct direction. Gentle-CLIP explores the alignment relationship in latent space, and because it is task-agnostic it can be extended to various multimodal domains. Through end-to-end learning, the mutual constraints between the losses prevent negative optimization and implicitly expand the knowledge range. Our approach attempts to achieve multimodal alignment at its essence, and it can be transferred to different multimodal frameworks[49, 50, 52]. The comparison between Gentle-CLIP and other learning strategies is shown in Figure 1.

In short, our contributions are summarized as follows: (1) We contribute a new perspective on the semi-supervised multimodal alignment problem by transforming it into a manifold matching problem, which opens a new pathway to exploit the implicit alignment information in rich yet largely unmatched multimodal data. (2) We design a novel semantic density distribution loss with low computational cost that can be applied in various specialized fields as well as different multimodal frameworks. We introduce further objectives based on a theoretical analysis of the components of the representation and propose Gentle-CLIP to realize multimodal alignment with fewer supervised pairs. Moreover, our method can be applied to other domains that use two-stream networks [35], such as knowledge distillation[47, 46], self-supervised learning[48] and domain adaptation. (3) We conduct extensive experiments in various fields and demonstrate the effectiveness of our method. Gentle-CLIP outperforms existing semi-supervised methods in specialized fields because of its larger knowledge range. We also explain the effects of the key modules and provide a feasible usage paradigm for specialized fields without enough supervised pairs.

2 Related Works

Multimodal alignment. Multimodality enhances understanding and decision-making by integrating information from multiple sensory modalities[81, 82]. Among existing methods, ALBEF [77] aligns visual and language representations, using momentum distillation to improve multimodal embeddings. FLAVA [79] enhances multitask and cross-modal learning by jointly pretraining on text and images. ALIGN [80] jointly trains language and image encoders, significantly enhancing performance across various vision and text benchmarks. In recent years, research around CLIP has further optimized computational efficiency and model representation capabilities. For instance, FLIP[53] reduces computation and training time by randomly removing a large number of image patches during training, while SoftCLIP[7] applies fine-grained intra-modal self-similarity as a softened target for cross-modal learning to alleviate the strict mutual exclusion problem. Moreover, latent diffusion models[11] generate reliable text embeddings as conditions using the pretrained text encoder of CLIP, and CLIPSelf [12] enhances region-level representation through self-distillation from CLIP’s image encoder, which further demonstrates the power of CLIP.

Semi-supervised learning. Semi-supervised learning [61, 69, 70] uses both labeled and unlabeled data to improve model training. It encompasses strategies such as pseudo-labeling [62, 73, 74], where models label their own training data, and self-supervised learning [63, 64, 76, 75], which generates its own supervisory signal, thereby broadening its application in various fields. vONTSS [65] utilizes the von Mises-Fisher distribution and optimal transport for semi-supervised neural topic modeling to improve topic extraction in text datasets. SSGD [66] proposes a semi-supervised domain generalization method that enhances model robustness under domain shifts through stochastic modeling and style augmentation. SS-ORL [68] employs a semi-supervised offline reinforcement learning approach, improving learning outcomes by utilizing unlabeled trajectories and limited complete action data. Semi-supervised learning maintains performance with fewer samples[71, 72], while CLIP focuses on the complementarity and integration of different modalities. However, within specialized domains, how to achieve this complementarity and integration across modalities remains an open issue.

3 Method

3.1 Problem Description and Distribution Assumption

Unlike the general vision-language field, specific associated modalities may offer only a limited number of matched pairs, while it is relatively easy to obtain a large amount of data with a similar semantic distribution in each modality. Therefore, we propose Gentle-CLIP, which uses massive unmatched data together with limited matched pairs to realize more generalized alignment through semi-supervised learning. Formally, for any two modalities $\bf A$ and $\bf B$ , we employ a small number of matched pairs $\left\{a_{i},b_{i}\right\}_{i=1}^{N}$ and a large number of unmatched data $\left\{a_{j}\in\bf A\right\}_{j=1}^{M_{1}}$ as well as $\left\{b_{j}\in\bf B\right\}_{j=1}^{M_{2}}$ to train our model. By sampling separately from the two unpaired sets, we acquire $\left\{a_{j}\in\bf A\right\}_{j=1}^{M}$ and $\left\{b_{j}\in\bf B\right\}_{j=1}^{M}$ as unsupervised training data, which are only assumed to have a similar semantic distribution rather than a strict one-to-one matching. Under the following natural assumption, models can be fully trained on adequate unmatched multimodal data to obtain robust representations.

Assumption 1 (Semantic Distribution Similarity Assumption, SDSA). We suppose that the latent embedding is a combined representation of modality, structure and semantics; a more detailed analysis is given in Appendix A. The goal of multimodal alignment is to find the shared semantic representation and remove the interference of the other two. If the overall semantic distributions of $\left\{a_{j}\in\bf A\right\}_{j=1}^{M}$ and $\left\{b_{j}\in\bf B\right\}_{j=1}^{M}$ are similar, we can find an embedding space $\bf{S}\subseteq\bf{R}^{K}$ in which $u_{j}$ and $v_{j}$ are the embeddings of samples from $\bf{A}$ and $\bf{B}$ , respectively. When the density distributions of $\left\{u_{j}\in\bf S\right\}_{j=1}^{M}$ , denoted $\bf U$ , and $\left\{v_{j}\in\bf S\right\}_{j=1}^{M}$ , denoted $\bf V$ , are similar, this space $\bf S$ is the semantic embedding space of $\bf A$ and $\bf B$ . Consequently, when datasets from two modalities have similar semantic distributions and are large enough, we can find the aligned semantic space by narrowing the gap between the density distributions of the two modalities rather than relying on strict matching relationships or pseudo-labeling[45]. Under this assumption, semi-supervised multimodal alignment can be transformed into a manifold matching problem.

3.2 The Framework of Gentle-CLIP

Figure 2 gives a conceptual overview of Gentle-CLIP. Because of the convenience and efficiency of CLIP, our method follows it in adopting a two-stream network[35]. Each stream includes an encoder network $F_{i}(\cdot),i\in\{\bf A,\bf B\}$ and a projection head network $H_{i}(\cdot),i\in\{\bf A,\bf B\}$ , which map the data from the original space into the embedding space. The networks in the two streams adopt different backbones with unshared weights. We introduce a multi-kernel maximum mean discrepancy loss (MK-MMD) and a self-supervised contrastive loss (SSL), and design a novel semantic density distribution loss (SDD), to learn the potential alignment relationship in large amounts of unpaired data. Through the contrastive matrix, we apply the contrastive loss (CL) to the limited supervised pairs to guarantee proper optimization of the model. Moreover, this framework can be extended to various modalities, and the objectives in Gentle-CLIP can also be migrated to other multimodal alignment works[31, 32]. The loss functions are described in detail below.

3.3 Objective loss

Supervised Alignment Guidance: As stated in the problem description, there are limited matched pairs $\left\{a_{i},b_{i}\right\}_{i=1}^{N}$ and a large number of unmatched data $\left\{a_{j},b_{j}\right\}_{j=1}^{M}$ . Because matched data are scarce, we learn generalized representations from the unsupervised data and use the explicit alignment relationship as ground truth to achieve precise alignment. We apply the contrastive loss $\mathcal{L}_{\mathrm{CL}}$ from CLIP, which maximizes the representation similarity between matched pairs while minimizing the similarity between negative pairs. The loss is defined as follows:

$$\mathcal{L}_{\mathrm{CL}}=-\frac{1}{2n}\sum_{i=1}^{n}\left(\log\frac{\exp\left(\mathcal{S}(u_{i},v_{i})/\tau\right)}{\sum_{j=1}^{n}\exp\left(\mathcal{S}(u_{i},v_{j})/\tau\right)}+\log\frac{\exp\left(\mathcal{S}(v_{i},u_{i})/\tau\right)}{\sum_{j=1}^{n}\exp\left(\mathcal{S}(v_{i},u_{j})/\tau\right)}\right) \tag{1}$$

where $n=\lfloor\frac{N}{M+N}B\rfloor$ is the number of matched pairs in a batch and $B$ is the batch size. $u_{i}$ and $v_{i}$ are the latent representations from the two modalities, $\tau$ is a learnable temperature parameter, and $\mathcal{S}(\cdot,\cdot)$ denotes cosine similarity. We use supervised and unsupervised data together in every batch to jointly train the model, because the mass of unsupervised data brings richer alignment information while the matched pairs lead to more accurate learning.
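As an illustration of Eq. (1), the following is a minimal pure-Python sketch, not the authors' implementation. It assumes embeddings are given as lists of floats and fixes the temperature at an arbitrary value of 0.07 instead of learning it; the helper names (`cosine`, `log_sum_exp`, `clip_contrastive_loss`, `matched_per_batch`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity S(u, v) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def clip_contrastive_loss(U, V, tau=0.07):
    """Symmetric contrastive loss of Eq. (1) over n matched pairs.

    tau is fixed here (an assumption); the paper learns it.
    """
    n = len(U)
    sim = [[cosine(u, v) / tau for v in V] for u in U]
    total = 0.0
    for i in range(n):
        row = sim[i]                          # u_i against every v_j
        col = [sim[j][i] for j in range(n)]   # v_i against every u_j
        total += row[i] - log_sum_exp(row)
        total += col[i] - log_sum_exp(col)
    return -total / (2 * n)

def matched_per_batch(N, M, B):
    """n = floor(N / (M + N) * B), computed with integer arithmetic."""
    return (N * B) // (M + N)
```

With perfectly aligned one-hot embeddings the loss is close to zero, whereas a batch in which all pairwise similarities are equal yields $\log n$.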

Coarse-Grained Modality Adaptation: MK-MMD[44] measures the gap between two probability distributions $P$ and $Q$ . Its core idea is that samples $\{p_{1},p_{2},\ldots,p_{m}\}$ and $\{q_{1},q_{2},\ldots,q_{n}\}$ drawn from $P$ and $Q$ should have similar statistical properties if the two distributions are the same. Specifically, MK-MMD maps the data from the original space to a Reproducing Kernel Hilbert Space (RKHS) through kernel functions[43], and the difference between the distributions is compared in this space. Through a linear combination of multiple kernel functions, we obtain a more robust mapping function to the RKHS, in which two distributions can be easily distinguished even when they are similar in the original space. The formula is as follows:

$$\mathcal{L}_{\mathrm{MK-MMD}}=\left\|\frac{1}{B}\sum_{i=1}^{B}\phi\left(u_{i}\right)-\frac{1}{B}\sum_{j=1}^{B}\phi\left(v_{j}\right)\right\|_{\mathcal{H}_{k}}^{2} \tag{2}$$

where $B$ is the batch size, and $u_{i}$ and $v_{j}$ are latent representations from the two modalities. $\mathcal{H}_{k}$ is the RKHS induced by the kernel function $k$ , and $\phi(\cdot)$ is the implicit function that maps data from the original space to $\mathcal{H}_{k}$ . In the multi-kernel case, the kernel function $k$ is a linear combination of $d$ basic kernel functions $\{k_{1},k_{2},\ldots,k_{d}\}$ , namely $k=\sum_{i=1}^{d}\beta_{i}k_{i}$ . The learnable kernel weights $\beta_{i}$ are obtained through optimization so that $k$ effectively represents the differences between distributions. In our method, $d$ equals $2$ , and we choose the Gaussian kernel and the polynomial kernel as the basic kernel functions.
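The squared RKHS distance in Eq. (2) can be evaluated without the implicit map $\phi$ by expanding it into averages of kernel values, $\frac{1}{B^{2}}\sum_{i,j}k(u_{i},u_{j})+\frac{1}{B^{2}}\sum_{i,j}k(v_{i},v_{j})-\frac{2}{B^{2}}\sum_{i,j}k(u_{i},v_{j})$ . The sketch below illustrates this under stated assumptions: the kernel weights $\beta_{i}$ are fixed at 0.5 each (the paper learns them), and the Gaussian width, polynomial degree and offset are arbitrary placeholders that do not come from the paper. All function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is a placeholder value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, degree=2, c=1.0):
    """k(x, y) = (<x, y> + c)^degree; degree and c are placeholder values."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

def mixed_kernel(x, y, betas=(0.5, 0.5)):
    """Linear combination of the two base kernels with fixed weights."""
    return betas[0] * gaussian_kernel(x, y) + betas[1] * polynomial_kernel(x, y)

def mean_kernel(X, Y, kernel):
    """Average kernel value over all pairs (x, y) with x in X and y in Y."""
    return sum(kernel(x, y) for x in X for y in Y) / (len(X) * len(Y))

def mk_mmd(U, V, kernel=mixed_kernel):
    """Biased estimate of the squared MMD between two batches of embeddings."""
    return (mean_kernel(U, U, kernel)
            + mean_kernel(V, V, kernel)
            - 2.0 * mean_kernel(U, V, kernel))
```

Two identical batches give a value of zero, and the value grows as the batches drift apart, which is the coarse, distribution-level signal this objective provides.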

Fine-grained Semantic Distribution Alignment: Because MK-MMD focuses on the whole distribution rather than on individual samples, it is imprecise and can only achieve macro-level alignment, which is not enough for representation alignment. Consequently, we propose a new objective, the semantic density distribution loss (SDD), to explore more fine-grained information in unpaired data and realize more refined alignment. SDD is inspired by probability density estimation[29], so it attends to each sample representation while also taking the alignment of the whole semantic distribution into consideration. The formula is as follows:

$$\mathcal{L}_{\mathrm{SDD}}=\frac{1}{2}\left[\Gamma(U,V)+\Gamma(V,U)\right] \tag{3}$$

where $\mathcal{L}_{\mathrm{SDD}}$ operates on the embedding space to measure the difference between two representation distributions more accurately and symmetrically, and the models are trained to minimize it to realize latent semantic alignment. $U$ and $V$ denote the embedding distributions, and $\Gamma(\cdot,\cdot)$ is defined as follows:

$$\Gamma(T,R)=\sum_{i=1}^{B}\left\{\frac{\kappa(t_{i},T)}{\sum_{j=1}^{B}\kappa(t_{j},T)}\log\frac{\kappa(t_{i},T)/\sum_{j=1}^{B}\kappa(t_{j},T)}{\kappa(t_{i},R)/\sum_{j=1}^{B}\kappa(t_{j},R)}\right\} \tag{4}$$

Here, for generality and convenience, we define three intermediate variables: $T=\{t_{i}\}_{i=1}^{B}$ and $R=\{r_{j}\}_{j=1}^{B}$ are sets of latent representations, and $x$ denotes the latent representation of a sample. $B$ is the size of the batch, which combines matched pairs and unmatched data, and the Kullback-Leibler divergence is used to measure the dissimilarity between the density values of a specific sample under the two distributions. $\kappa(\cdot,\cdot)$ is given by the following formula.

$$\kappa(x,T)=\frac{\sum_{i=1}^{B}\exp\left(-\frac{\|x-t_{i}\|^{2}}{b^{2}\sigma(T)}\right)}{2Bb^{2}\pi} \tag{5}$$

Here we use an exponential function as the probability density function, and $b$ denotes the bandwidth that controls the smoothness. $\sigma(\cdot)$ denotes the variance of the distribution and is defined as follows.

$$\sigma(T)=\frac{1}{B-1}\sum_{i=1}^{B}\left\|t_{i}-\frac{\sum_{j=1}^{B}t_{j}}{B}\right\|^{2} \tag{6}$$

where $t_{i}$ is a sample from the set $T$ and we use the sample variance with Bessel’s correction. Scaling by $\sigma(\cdot)$ leads the model to focus on narrowing the gap between the semantic distributions of different modalities while avoiding collapse into a tight cluster. Through training, $\mathcal{L}_{\mathrm{SDD}}$ makes semantically aligned data from different modalities keep similar probability density distributions in latent space. Meanwhile, the time complexity of $\mathcal{L}_{\mathrm{SDD}}$ is $\mathcal{O}(B^{2})$ , the same as that of $\mathcal{L}_{\mathrm{CL}}$ . In addition, the Sinkhorn algorithm can also achieve fine-grained alignment[84], but it incurs a large computational cost and may not suit low-quality multimodal data because of its overly strict alignment. More details are given in Appendix B.
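Equations (3) to (6) can be read together as follows: each sample's density is estimated under its own batch and under the other modality's batch, both sets of densities are normalized over the batch, and a KL divergence compares them. The pure-Python sketch below follows the formulas as written and is only an illustration, not the authors' code; the bandwidth default `b=1.0` is an arbitrary assumption and the function names are hypothetical. It adds no guard against densities that underflow to zero when the two batches are very far apart.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def variance(T):
    """Eq. (6): sample variance of a batch with Bessel's correction."""
    B = len(T)
    dim = len(T[0])
    mean = [sum(t[d] for t in T) / B for d in range(dim)]
    return sum(sq_dist(t, mean) for t in T) / (B - 1)

def kappa(x, T, b=1.0):
    """Eq. (5): kernel density value of x under the batch T."""
    B = len(T)
    scale = b * b * variance(T)
    kernel_sum = sum(math.exp(-sq_dist(x, t) / scale) for t in T)
    return kernel_sum / (2 * B * b * b * math.pi)

def gamma(T, R, b=1.0):
    """Eq. (4): KL divergence between the normalized densities of the
    samples of T measured under T and under R."""
    p = [kappa(t, T, b) for t in T]
    q = [kappa(t, R, b) for t in T]
    p_sum, q_sum = sum(p), sum(q)
    return sum((pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
               for pi, qi in zip(p, q))

def sdd_loss(U, V, b=1.0):
    """Eq. (3): symmetric semantic density distribution loss."""
    return 0.5 * (gamma(U, V, b) + gamma(V, U, b))
```

Because the densities are normalized over the batch, the constant $2Bb^{2}\pi$ in Eq. (5) cancels inside $\Gamma$ ; it is kept here only to mirror the formula. Each call to `kappa` costs $\mathcal{O}(B)$ and is made $\mathcal{O}(B)$ times, which matches the $\mathcal{O}(B^{2})$ complexity stated above.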

Self-supervised Distribution Stability: By relying on a self-supervised contrastive loss (SSL)[41, 40], we can adequately extract implicit information from a single modality and obtain robust feature representations. In multimodal alignment with limited matched pairs, we find it essential to apply this objective, because it pushes apart the representations of different samples in latent space when the alignment guidance is incomplete. In other words, if SSL is not employed, the data without an alignment constraint may gather into a tight cluster. Specifically, we apply augmentation to generate positive pairs, and $\mathcal{L}_{\mathrm{SSL}}$ is defined as follows:

$$\mathcal{L}_{\mathrm{SSL}}=-\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp(\mathcal{S}(z_{i},z_{i}^{+})/\tau)}{\sum_{j=1}^{B}\exp(\mathcal{S}(z_{i},z_{j})/\tau)} \tag{7}$$

where $z$ is a latent embedding and $z_{i}^{+}$ denotes the representation of the corresponding positive sample. In our method, this loss is applied to each modality, and $\mathcal{S}(\cdot,\cdot)$ is the cosine similarity. We denote by $u_{i}$ and $v_{i}$ the latent representations from the two modalities, and the corresponding objectives are named $\mathcal{L}_{\mathrm{SSL-U}}$ and $\mathcal{L}_{\mathrm{SSL-V}}$ . Combining the above constraints, we propose a new loss $\mathcal{L}_{\mathrm{GC}}$ , defined as follows:

$$\mathcal{L}_{\mathrm{GC}}=\mathcal{L}_{\mathrm{CL}}+\mu\mathcal{L}_{\mathrm{SSL-U}}+\mu\mathcal{L}_{\mathrm{SSL-V}} \tag{8}$$

where $\mu$ is a hyperparameter. $\mathcal{L}_{\mathrm{GC}}$ guides the training process with accurate supervised alignment information rather than semantic distribution similarity, which is necessary for avoiding negative optimization. Meanwhile, if data from different modalities could be augmented according to their common semantics rather than the patterns of a single modality, the performance of related methods might improve further[83].
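A minimal sketch of the self-supervised term in Eq. (7) is given below. It rests on an interpretive assumption: Eq. (7) as printed writes $z_{j}$ in the denominator, whereas this sketch follows the common SimCSE-style reading in which the denominator runs over the augmented views $z_{j}^{+}$ of all samples in the batch. The temperature value of 0.07 is also an arbitrary placeholder and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity S(u, v) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ssl_loss(Z, Z_pos, tau=0.07):
    """Self-supervised contrastive loss within one modality.

    Z[i] and Z_pos[i] are embeddings of two augmented views of sample i.
    Assumption: negatives for z_i are the augmented views of other samples.
    """
    B = len(Z)
    total = 0.0
    for i in range(B):
        logits = [cosine(Z[i], zp) / tau for zp in Z_pos]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[i] - log_denom
    return -total / B
```

Under this reading, a fully collapsed batch (all embeddings identical) cannot score better than $\log B$ , while well-separated samples with consistent augmentations approach zero; this is the sense in which the term discourages the tight clustering described above.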

The Overall Pretraining Objective: Our method uses matched pairs and unsupervised data in the same batch. In this way, during pretraining we can exploit the comprehensive unsupervised data together with the alignment constraint from the matched pairs to obtain a robust and stable optimization process. Moreover, through semantic distribution alignment, the knowledge learned from unsupervised data and from matched pairs can interact, which helps to enlarge the range of knowledge. For the overall pretraining objective, we minimize the loss functions of all pretraining tasks simultaneously:

$$\min_{\theta}\;\alpha\mathcal{L}_{\mathrm{GC}}+\delta\mathcal{L}_{\mathrm{MK-MMD}}+\eta\mathcal{L}_{\mathrm{SDD}} \tag{9}$$

where $\theta$ denotes all learnable parameters in the encoder networks and projection head networks. $\alpha$ , $\delta$ and $\eta$ are hyperparameters that control the influence of the different pretraining tasks.
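To make the nesting of Eq. (8) inside Eq. (9) explicit, the following small sketch combines already-computed loss values. The function names are hypothetical, and the numeric values in the usage lines are placeholders chosen only for illustration; the paper does not state hyperparameter values in this section.

```python
def gc_loss(l_cl, l_ssl_u, l_ssl_v, mu):
    """Eq. (8): supervised contrastive term plus both per-modality SSL terms."""
    return l_cl + mu * l_ssl_u + mu * l_ssl_v

def overall_objective(l_gc, l_mk_mmd, l_sdd, alpha, delta, eta):
    """Eq. (9): weighted sum minimized over all encoder and head parameters."""
    return alpha * l_gc + delta * l_mk_mmd + eta * l_sdd

# Placeholder loss values and weights, for illustration only.
l_gc = gc_loss(l_cl=1.0, l_ssl_u=2.0, l_ssl_v=3.0, mu=0.5)
total = overall_objective(l_gc, l_mk_mmd=0.2, l_sdd=0.4,
                          alpha=1.0, delta=0.5, eta=2.0)
```

Note that $\mathcal{L}_{\mathrm{CL}}$ is computed on the $n$ matched pairs of a batch only, while the remaining terms use the full batch of size $B$ .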

4 Experiments

In order to evaluate the effectiveness of our proposed method, we conduct extensive experiments in various specialized fields, including protein representation, remote sensing as well as general vision-language field. In addition, we analyze the roles of key modules by ablation experiments.

4.1 Quantitative Analysis About Sampling Size

As mentioned above, implicit alignment information exists between different modalities with similar semantic distributions even if there are no definite matched pairs. Therefore, if random sampling in each modality yields unimodal batches that reflect the real distribution of the original data, it follows that the batches from different modalities also keep similar semantic distributions and can be used for the subsequent training process. Clearly, the sampling size strongly affects whether batches are representative of the original distributions. Hence we quantitatively analyze how well different sampling sizes represent the original distribution, which guides the choice of a proper size. By applying the soft Parzen-window method, which is used for non-parametric density estimation, we can calculate the representing confidence of a sample batch of a given size. Experimental verification shows that sample batches represent the original complicated distribution effectively when the sampling size is over $64$ . Furthermore, if data from different modalities come from the same semantic distribution, batches with a sampling size over $64$ will also keep similar semantic distributions. The detailed procedure and the relevant analysis are given in Appendix E.

Table 1: Benchmark results on protein representation field. Bold denotes the best results while underline represents the second best value. Two-stream networks outperform in most tasks and Gentle-CLIP can effectively explore latent alignment from unmatched data.

Input	Method	GO-BP	GO-MF	GO-CC	EC	Average
1D	ProtBert	0.279	0.456	0.408	0.838	0.495
	ESM-1b	0.452	0.659	0.477	0.869	0.614
	ESM-2	0.472	0.662	0.472	0.874	0.620
(3+1)D	GVP	0.326	0.426	0.420	0.489	0.415
	GearNet	0.356	0.503	0.414	0.730	0.501
	CDConv	0.453	0.654	0.479	0.820	0.602
	CLIP(1/2 CATH)	0.456	0.661	0.485	0.881	0.621
	Gentle-CLIP(Ours)	0.459	0.667	0.491	0.884	0.625
	CLIP(CATH)	0.463	0.665	0.493	0.885	0.627

4.2 Evaluation On Single Protein Function Prediction

Overview of tasks and training setup: To examine the efficacy of Gentle-CLIP in non-vision-language multimodal domains with insufficient alignment data, we conduct experiments in the protein representation field. Proteins can be described by a multi-level structure, and most previous works feed aligned sequence and structure into a single-stream network to capture invariant features[26]. Because aligned data are limited, these intricately designed models run into difficulty. Following [30], we consider sequence and structure as two modalities and apply Gentle-CLIP to realize multimodal fusion by pulling the semantic distributions closer using extensive unsupervised data. The structure encoder is designed based on CDConv[23], while ESM-2[91] is selected as the sequence encoder. We adopt the CATH $4.2$ dataset for pretraining, which lasts $100$ epochs. Following the settings in [23], we evaluate the proposed method on the following four tasks: protein fold classification[26], enzyme reaction classification[26], gene ontology (GO) term prediction[27] and enzyme commission (EC) number prediction[27]. More details are given in Appendices F and D.

Results: The performance on downstream tasks is shown in Table 1, where the results of previous baselines are taken from [23] and [1]. CLIP(1/2 CATH) is a two-stream network pretrained on 50% of the CATH data, while CLIP(CATH) is pretrained on the whole CATH dataset. Gentle-CLIP(Ours) uses 50% of the CATH data as supervised pairs and treats the rest as unlabeled data. Moreover, we add an Average column to evaluate the overall performance. We first verify the effect of the two-stream network compared with single-stream models; the results are displayed in the last row of Table 1. CLIP achieves better results on most downstream tasks and shows particular superiority in EC number prediction. Further, Gentle-CLIP narrows the overall gap to CLIP pretrained on the whole CATH dataset (average 0.625, versus 0.621 for CLIP(1/2 CATH) and 0.627 for CLIP(CATH)), and it is even better on GO-MF (0.667 versus 0.665). This may be because Gentle-CLIP can explore implicit alignment through the fine-grained semantic distribution constraint of SDD, while CLIP focuses only on local representation matching and may lose some global distribution information.
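The Average column of Table 1 appears to be the unweighted mean of the four task scores; this is an inference from the reported numbers rather than something stated in the text. Under that assumption, the short calculation below recomputes the averages of the three two-stream rows and the share of the gap between CLIP(1/2 CATH) and CLIP(CATH) that Gentle-CLIP recovers. The variable names are hypothetical.

```python
# Scores copied from Table 1, in the order BP, MF, CC, EC.
scores = {
    "CLIP(1/2 CATH)": [0.456, 0.661, 0.485, 0.881],
    "Gentle-CLIP": [0.459, 0.667, 0.491, 0.884],
    "CLIP(CATH)": [0.463, 0.665, 0.493, 0.885],
}

# Assumption: "Average" is the plain arithmetic mean of the four columns.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Fraction of the half-data to full-data gap recovered by Gentle-CLIP.
gap_closed = ((averages["Gentle-CLIP"] - averages["CLIP(1/2 CATH)"])
              / (averages["CLIP(CATH)"] - averages["CLIP(1/2 CATH)"]))
```

With the unrounded means, `gap_closed` comes out at roughly 0.78; using the three-decimal averages printed in the table instead gives about two thirds, so the figure is sensitive to rounding and should be read as indicative only.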

4.3 Evaluation On Remote Sensing Datasets

Overview of tasks and training setup: Models in the remote sensing field can acquire comprehensive knowledge by jointly learning from satellite images and their corresponding captions. However, the training datasets are usually composed of incomplete web-crawled data, and annotating captions requires expert knowledge, which can be expensive and time-consuming. It is therefore essential to evaluate the performance of Gentle-CLIP with limited matched pairs, a setting that is hard for traditional methods to handle. Following [28], Gentle-CLIP is pretrained on the union of RSICD[55], UCM[56] and Sydney[57], with zero-shot classification and image-text retrieval as downstream tasks. ResNet[2] and a transformer[5] are chosen as the single-modality encoders, and Gentle-CLIP is pretrained for $25$ epochs. We subsample $10\%$ of the image-text pairs for supervised learning, while the remaining data are treated as unlabeled but conform to the same semantic distribution. Similarly, Top-1 classification accuracy is used to evaluate zero-shot classification, while recall is used for the image-text retrieval tasks. Detailed descriptions are presented in Appendices C and D.

Table 2: Benchmark results on remote sensing field. Bold is the best average results and Gentle-CLIP improves the performance at most datasets through learning from unsupervised data.

Method	RSICD-CLS	UCM-CLS	WHU-RS19	RSSCN7	AID
CLIP(original)	45.3	50.5	65.5	58.9	47.8
CLIP(fine-tune)	58.3 $\pm$ 0.3	63.5 $\pm$ 3.4	76.5 $\pm$ 3.2	61.9 $\pm$ 1.2	63.1 $\pm$ 1.3
Hard-PL	56.6 $\pm$ 3.5	61.6 $\pm$ 2.2	78.1 $\pm$ 2.5	63.9 $\pm$ 2.1	63.2 $\pm$ 2.6
Soft-PL	62.5 $\pm$ 0.8	65.7 $\pm$ 2.7	83.7 $\pm$ 2.7	65.7 $\pm$ 0.6	68.0 $\pm$ 0.7
S-CLIP	66.9 $\pm$ 1.7	66.7 $\pm$ 1.6	86.9 $\pm$ 2.0	66.2 $\pm$ 1.1	73.0 $\pm$ 0.3
Gentle-CLIP(ours)	69.2 $\pm$ 0.8	67.5 $\pm$ 1.1	89.0 $\pm$ 1.6	66.2 $\pm$ 0.9	76.2 $\pm$ 0.9

Results: Table 2 displays the results of zero-shot classification; the first five rows are from [28]. In this experiment, Gentle-CLIP is built on S-CLIP and trained to narrow the embedding distribution gap between paired texts and unlabeled images under the guidance of SDD. The overall distributions of batches from different modalities may remain similar even though the explicit matching relationship between specific samples is unknown. For zero-shot classification, our method achieves the best average accuracy on every dataset except RSSCN7, where it ties S-CLIP (66.2). Existing methods also bring much smaller gains on RSSCN7 than on the other datasets, so we believe that RSSCN7 may have a significant gap from the training set, which makes inference harder. As shown in Appendix G, Gentle-CLIP consistently improves the results in image-text retrieval, which suggests that a smaller distribution gap between texts and unlabeled images yields more robust pseudo-labels and a more stable distribution structure.

4.4 Evaluation On General Vision-Language Retrieval

Overview of tasks and training setup: Besides the above specialized domains, we also evaluate the performance of Gentle-CLIP in the general vision-language field. Experiments are carried out on Flickr-8k[20] and Mini COCO, with image-text retrieval as the downstream task. The vision encoder is ResNet-50 or ViT-32[24], and BERT[25] acts as the text encoder. We choose the first description of each image as its caption, and the dropout ratio is set to $0.3$ for augmentation following [40]. Pretraining lasts $50$ epochs with a batch size of $64$ and a learning rate of $0.001$ . Furthermore, $1/3$ of the multimodal data is treated as matched while the rest is unlabeled. Detailed descriptions of the datasets are given in Appendix C.

Table 3: Benchmark results in the general vision-language field. CLIP(1/3) is the baseline, and the values highlighted in green indicate improvements over it. Gentle-CLIP enlarges the knowledge range and is especially beneficial on the Mini COCO dataset with ViT as the image encoder.

Model	Method	Flickr-8k				Mini COCO
		Image $\rightarrow$ Text		Text $\rightarrow$ Image		Image $\rightarrow$ Text		Text $\rightarrow$ Image
		top1	top5	top1	top5	top1	top5	top1	top5
ResNet	CLIP(1/3)	11.6	35.5	10.9	34.2	7.9	30.6	8.2	30.3
	CLIP(1)	+9.5	+17.4	+8.9	+18.1	+6.5	+16.3	+6.0	+16.2
	Gentle-CLIP	+1.7	+5.4	+1.2	+5.2	+3.8	+8.5	+3.1	+8.5
VIT	CLIP(1/3)	14.7	42.7	14.5	41.7	10.5	37.1	9.7	36.5
	CLIP(1)	+10.9	+14.7	+10.0	+14.5	+6.9	+14.4	+6.6	+12.1
	Gentle-CLIP	+2.1	+2.9	+1.9	+3.6	+4.3	+12.3	+5.1	+12.6

Results: We evaluate performance with different models and datasets; the image-text retrieval results are shown in Table 3. CLIP(1/3) is trained on 1/3 of the dataset, while CLIP(1) learns alignment on the whole dataset. Gentle-CLIP uses the 1/3 matched data for supervised learning and enlarges its knowledge range with the remaining unpaired data. To compare the strategies more easily, we report the improvement over the CLIP(1/3) baseline. Gentle-CLIP consistently brings gains regardless of the setting, and ViT performs better overall than ResNet on the given datasets. From these results, we may conclude that multimodal fusion could be simple when the models of different modalities have similar distance measurements.
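The top-1 and top-5 retrieval scores in Table 3 can be illustrated with a minimal sketch. It assumes that row $i$ of the image embeddings matches row $i$ of the text embeddings and that similarity is cosine similarity; the function name `recall_at_k` and the toy data are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose true match (same row index) is in the top-k.

    Assumption: query i is paired with gallery item i, so both arrays
    have the same number of rows.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T  # shape: (n_queries, n_gallery)
    # Rank of the true match = number of gallery items scored strictly higher.
    true_scores = np.diag(sim)[:, None]
    ranks = (sim > true_scores).sum(axis=1)
    return float((ranks < k).mean())

# Toy usage: text embeddings are noisy copies of the image embeddings.
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 16))
texts = images + 0.5 * rng.normal(size=(100, 16))
print("I->T top1:", recall_at_k(images, texts, 1))
print("I->T top5:", recall_at_k(images, texts, 5))
print("T->I top1:", recall_at_k(texts, images, 1))
```

Since Table 3 lists improvements rather than absolute scores, an absolute value is recovered by adding the entry to the CLIP(1/3) row; for example, the $+1.7$ of Gentle-CLIP with ResNet on Flickr-8k Image $\rightarrow$ Text top1 corresponds to $11.6+1.7=13.3$ .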

4.5 Ablation

Gentle-CLIP applies several kinds of objectives to explore implicit semantic alignment in low-quality multimodal data and performs well across the experiments above. We now analyze its internal mechanism by answering the following questions.

Q1: How do SDD and SSL influence the final effectiveness? To answer this question, we conduct experiments in the protein representation field and evaluate performance on protein fold classification and enzyme commission (EC) number prediction. We focus on the effects of these two objectives; the results are shown in Table 4.5. Gentle-CLIP trained with both objectives performs best on every task except superfamily classification, where the model trained with SDD alone is best; that model, however, falls into negative optimization on EC number prediction. SSL alone brings only a slight improvement over the baseline, which indicates that it also acquires some additional alignment information during pretraining. From these results, we believe that combining the two objectives brings more than the simple sum of their individual effects; in other words, the two losses interact and depend on each other. Specifically, SDD can exploit alignment information in unsupervised data but may cause mode collapse and negative learning, as shown in the third row. Adding SSL further constrains the optimization direction and increases the stability of the overall distribution. However, SSL also affects the latent distribution of similar samples, so the two must be balanced to achieve better performance[83].

Q2: How does performance change if we modify some modules of SDD? With its fine-grained distribution similarity measurement, SDD plays a critical role in semi-supervised fusion, so it is essential to deconstruct SDD and analyze which settings lead to better results. We run retrieval experiments on Flickr-8k with Top-3 recall, and Table 4.5 displays the results. We focus on how to calculate the distance between samples and the difference between distributions. Compared with the absolute distance, the relative distance (RD) $\|x-t_{i}\|^{2}/\sigma(T)$ in eq. 5 can eliminate the indistinguishability within a tight latent cluster, while the Kullback-Leibler divergence (KL) in eq. 4 may be better suited than MSE to contrasting distributions. The models clearly achieve the greatest improvement when RD and KL are added together. Moreover, if the improved Softmax function in [47] is employed to reweight the distance $\|x-t_{i}\|^{2}$ with different attention, the granularity of distribution alignment can be controlled through different temperatures. Other ablation results are shown in Appendix I.
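The building blocks discussed in Q2 can be sketched as follows. This is an illustration under stated assumptions, not the paper's SDD loss: $\sigma(T)$ is taken to be the standard deviation over all entries of $T$ , the temperature-scaled Softmax is applied to negative distances, and all function names are hypothetical.

```python
import numpy as np

def squared_distances(x, anchors):
    """Absolute distance ||x - t_i||^2 from x to every anchor t_i."""
    return ((anchors - x) ** 2).sum(axis=1)

def relative_distances(x, anchors):
    """Relative distance ||x - t_i||^2 / sigma(T).

    Assumption: sigma(T) is the standard deviation over all entries of T.
    """
    return squared_distances(x, anchors) / anchors.std()

def soft_assignment(distances, temperature=1.0):
    """Temperature-scaled softmax over negative distances."""
    logits = -distances / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mse(p, q):
    """Mean squared error between two discrete distributions."""
    return float(np.mean((p - q) ** 2))

# Toy usage: a tight cluster of anchors, where absolute distances are tiny.
rng = np.random.default_rng(0)
anchors = 0.01 * rng.normal(size=(5, 3))
x = anchors[0]
p_abs = soft_assignment(squared_distances(x, anchors))
p_rel = soft_assignment(relative_distances(x, anchors))
print("absolute:", np.round(p_abs, 4))
print("relative:", np.round(p_rel, 4))
print("KL:", kl_divergence(p_rel, p_abs), "MSE:", mse(p_rel, p_abs))
```

In a tight cluster $\sigma(T)<1$ , so dividing by it enlarges the differences between distances and the resulting assignment is less uniform than with absolute distances. Lowering `temperature` sharpens the assignment in the same way, which is one way to read the remark about controlling alignment granularity.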

Table: Ablation results in the protein representation field, analyzing the roles of SSL and SDD. Super denotes the Superfamily task and EC denotes EC number prediction. The values in bold are the best results.

SSL	SDD	Fold Classification			EC
		Fold	Super	Family	
✗	✗	57.7	78.6	99.6	0.881
✓	✗	57.9	78.7	99.6	0.881
✗	✓	58.5	80.1	99.6	0.878
✓	✓	59.1	79.7	99.6	0.884

Table: Ablation results on Flickr-8k with ResNet as the image encoder, analyzing the effects of the key modules in SDD. Bold indicates the best result. RD denotes relative distance, while $50$ and $100$ are the retrieval sizes.

RD	KL	I $\rightarrow$ T R@3		T $\rightarrow$ I R@3	
		50	100	50	100
✗	✗	28.4	18.4	29.6	18.5
✗	✓	29.1	18.7	30.6	18.6
✓	✗	28.7	18.8	29.9	18.9
✓	✓	29.8	20.2	31.1	20.3

5 Conclusion

We propose a semi-supervised multimodal alignment method named Gentle-CLIP and design a novel objective, SDD, which measures the fine-grained difference between latent distributions. In addition, guided by an analysis of the latent embedding, we introduce other loss functions to achieve robust generalization. We demonstrate the superiority of our method across various fields and provide a possible route to multimodal fusion in specialized domains with insufficient aligned data by exploring latent alignment information from unimodal data.

References

[1] Minghao Xu, Xinyu Yuan, Santiago Miret, and Jian Tang. Protst: Multi-modality learning of protein sequences and biomedical texts. arXiv preprint arXiv:2301.12040, 2023.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[3] Zelin Zang, Shenghui Cheng, Hanchen Xia, Liangyu Li, Yaoting Sun, Yongjie Xu, Lei Shang, Baigui Sun, and Stan Z Li. Dmt-ev: An explainable deep network for dimension reduction. IEEE Transactions on Visualization and Computer Graphics, 30(3):1710–1727, 2024.
[4] Zelin Zang, Siyuan Li, Di Wu, Ge Wang, Kai Wang, Lei Shang, Baigui Sun, Hao Li, and Stan Z Li. Dlme: Deep local-flatness manifold embedding. In European Conference on Computer Vision, pages 576–592. Springer, 2022.
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[6] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[7] Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, and Xing Sun. Softclip: Softer cross-modal alignment makes clip stronger. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1860–1868, 2024.
[8] Ali Rasekh, Sepehr Kazemi Ranjbar, Milad Heidari, and Wolfgang Nejdl. Ecor: Explainable clip for object recognition. arXiv preprint arXiv:2404.12839, 2024.
[9] Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P Calmon, and Himabindu Lakkaraju. Interpreting clip with sparse linear concept embeddings (splice). arXiv preprint arXiv:2402.10376, 2024.
[10] Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection. arXiv preprint arXiv:2310.18961, 2023.
[11] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[12] Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403, 2023.
[13] Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang. Vision–language foundation model for echocardiogram interpretation. Nature Medicine, pages 1–8, 2024.
[14] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022.
[15] Jiaxin Li, Danfeng Hong, Lianru Gao, Jing Yao, Ke Zheng, Bing Zhang, and Jocelyn Chanussot. Deep learning in multimodal remote sensing data fusion: A comprehensive review. International Journal of Applied Earth Observation and Geoinformation, 112:102926, 2022.
[16] Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, and Kai-Wei Chang. Unsupervised vision-and-language pre-training without parallel images and captions. arXiv preprint arXiv:2010.12831, 2020.
[17] Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, and Ning Zhang. Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16485–16494, 2022.
[18] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
[19] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In European conference on computer vision, pages 529–544. Springer, 2022.
[20] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
[21] Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S Davis. Automatic spatially-aware fashion concept discovery. In Proceedings of the IEEE international conference on computer vision, pages 1463–1471, 2017.
[22] Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and Yu-Gang Jiang. Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization. In International Conference on Machine Learning, pages 36978–36989. PMLR, 2023.
[23] Hehe Fan, Zhangyang Wang, Yi Yang, and Mohan Kankanhalli. Continuous-discrete convolution for geometry-sequence modeling in proteins. In The Eleventh International Conference on Learning Representations, 2022.
[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[26] Pedro Hermosilla, Marco Schäfer, Matěj Lang, Gloria Fackelmann, Pere Pau Vázquez, Barbora Kozlíková, Michael Krone, Tobias Ritschel, and Timo Ropinski. Intrinsic-extrinsic convolution and pooling for learning on 3d protein structures. arXiv preprint arXiv:2007.06252, 2020.
[27] Vladimir Gligorijević, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):3168, 2021.
[28] Sangwoo Mo, Minkyu Kim, Kyungmin Lee, and Jinwoo Shin. S-clip: Semi-supervised vision-language pre-training using few specialist captions. arXiv preprint arXiv:2305.14095, 2023.
[29] Zhun-ga Liu, Yi-min Fu, Quan Pan, and Zuo-wei Zhang. Orientational distribution learning with hierarchical spatial attention for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[30] Jiangbin Zheng, Ge Wang, Yufei Huang, Bozhen Hu, Siyuan Li, Cheng Tan, Xinwen Fan, and Stan Z Li. Lightweight contrastive protein structure-sequence transformation. arXiv preprint arXiv:2303.11783, 2023.
[31] Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR, 2021.
[32] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
[33] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
[34] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[35] Congqi Cao, Yue Lu, and Yanning Zhang. Context recovery and knowledge retrieval: A novel two-stream framework for video anomaly detection. IEEE Transactions on Image Processing, 2024.
[36] Jing Gao, Peng Li, Zhikui Chen, and Jianing Zhang. A survey on deep learning for multimodal data fusion. Neural Computation, 32(5):829–864, 2020.
[37] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine learning, 109(2):373–440, 2020.
[38] Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. A survey on deep semi-supervised learning. IEEE Transactions on Knowledge and Data Engineering, 35(9):8934–8954, 2022.
[39] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2020.
[40] Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
[41] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[42] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[43] Volker Roth and Volker Steinhage. Nonlinear discriminant analysis using kernel functions. Advances in neural information processing systems, 12, 1999.
[44] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015.
[45] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
[46] Minh Pham, Minsu Cho, Ameya Joshi, and Chinmay Hegde. Revisiting self-distillation. arXiv preprint arXiv:2206.08491, 2022.
[47] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[48] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[49] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[50] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
[51] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
[52] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[53] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23390–23400, 2023.
[54] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[55] Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017.
[56] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, pages 270–279, 2010.
[57] Fan Zhang, Bo Du, and Liangpei Zhang. Saliency-guided unsupervised feature learning for scene classification. IEEE transactions on Geoscience and Remote Sensing, 53(4):2175–2184, 2014.
[58] Gui-Song Xia, Wen Yang, Julie Delon, Yann Gousseau, Hong Sun, and Henri Maître. Structural high-resolution satellite image indexing. In ISPRS TC VII Symposium-100 Years ISPRS, volume 38, pages 298–303, 2010.
[59] Qin Zou, Lihao Ni, Tong Zhang, and Qian Wang. Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and remote sensing letters, 12(11):2321–2325, 2015.
[60] Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017.
[61] Semi-Supervised Learning. Semi-supervised learning. CSZ2006. html, 5, 2006.
[62] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
[63] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
[64] Rayan Krishnan, Pranav Rajpurkar, and Eric J Topol. Self-supervised learning in medicine and healthcare. Nature Biomedical Engineering, 6(12):1346–1352, 2022.
[65] Weijie Xu, Xiaoyu Jiang, Srinivasan H Sengamedu, Francis Iannacci, and Jinjin Zhao. vontss: vmf based semi-supervised neural topic modeling with optimal transport. arXiv preprint arXiv:2307.01226, 2023.
[66] Kaiyang Zhou, Chen Change Loy, and Ziwei Liu. Semi-supervised domain generalization with stochastic stylematch. International Journal of Computer Vision, 131(9):2377–2387, 2023.
[67] Xianchao Wu. Duplex diffusion models improve speech-to-speech translation. arXiv preprint arXiv:2305.12628, 2023.
[68] Qinqing Zheng, Mikael Henaff, Brandon Amos, and Aditya Grover. Semi-supervised offline reinforcement learning with action-free trajectories. In International conference on machine learning, pages 42339–42362. PMLR, 2023.
[69] Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine learning, 109(2):373–440, 2020.
[70] Yassine Ouali, Céline Hudelot, and Myriam Tami. An overview of deep semi-supervised learning. arXiv preprint arXiv:2006.05278, 2020.
[71] Xiangli Yang, Zixing Song, Irwin King, and Zenglin Xu. A survey on deep semi-supervised learning. IEEE Transactions on Knowledge and Data Engineering, 35(9):8934–8954, 2022.
[72] Nitin Namdeo Pise and Parag Kulkarni. A survey of semi-supervised learning methods. In 2008 International conference on computational intelligence and security, volume 2, pages 30–34. IEEE, 2008.
[73] Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 6912–6920, 2021.
[74] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International joint conference on neural networks (IJCNN), pages 1–8. IEEE, 2020.
[75] Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. arXiv preprint arXiv:2101.06329, 2021.
[76] Yixin Liu, Ming Jin, Shirui Pan, Chuan Zhou, Yu Zheng, Feng Xia, and S Yu Philip. Graph self-supervised learning: A survey. IEEE transactions on knowledge and data engineering, 35(6):5879–5900, 2022.
[77] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
[78] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[79] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.
[80] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
[81] Daniel Martin, Sandra Malpica, Diego Gutierrez, Belen Masia, and Ana Serrano. Multimodality in vr: A survey. ACM Computing Surveys (CSUR), 54(10s):1–36, 2022.
[82] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
[83] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024.
[84] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020.
[85] John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graph-based protein design. Advances in neural information processing systems, 32, 2019.
[86] Alexey G Murzin, Steven E Brenner, Tim Hubbard, and Cyrus Chothia. Scop: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology, 247(4):536–540, 1995.
[87] Pedro Hermosilla, Marco Schäfer, Matěj Lang, Gloria Fackelmann, Pere Pau Vázquez, Barbora Kozlíková, Michael Krone, Tobias Ritschel, and Timo Ropinski. Intrinsic-extrinsic convolution and pooling for learning on 3d protein structures. arXiv preprint arXiv:2007.06252, 2020.
[88] Vladimir Gligorijević, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):3168, 2021.
[89] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[90] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
[91] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv, 2022:500902, 2022.
[92] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[93] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[94] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. Spanbert: Improving pre-training by representing and predicting spans. Transactions of the association for computational linguistics, 8:64–77, 2020.
[95] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
[96] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

Appendix A The analysis of latent embedding

We hold the opinion that the ultimate embedding is a combined representation of modality, structure, and semantics. If data come from different modalities, their modality representations may differ, leading to a gap between the ultimate embeddings. Meanwhile, if the data come from a single modality, the representations may be separated by a gap in semantics rather than in modality. In addition, even when the data share similar semantics and come from the same modality, the embeddings may still differ because of an internal structural gap. The relationship between these three representations is visualized in Figure 3.

Appendix B Difference between SDD and Sinkhorn Algorithm

We analyze the role of the Sinkhorn algorithm on the basis of SuperGlue[84]. After the attention-based graph neural network, each pair of keypoints (one from image A and one from image B) gets a combined feature representation. SuperGlue constructs a score matrix $S$ by calculating the inner product of these feature representations, where each element $S_{i,j}$ indicates the matching score between the $i$ -th keypoint in image A and the $j$ -th keypoint in image B. To handle unmatched keypoints elegantly, it adds a "dustbin" to the score matrix, so that keypoints with no match in the other image can be assigned to this dustbin.

The score matrix $S$ is transformed using the exponential function to obtain $K=e^{S}$ . The Sinkhorn algorithm iteratively adjusts $K$ through alternating row and column normalizations:

• Row normalization: $K\leftarrow K\oslash(K\mathbf{1}_{N})$ , where $\mathbf{1}_{N}$ is a length- $N$ vector of ones, and $\oslash$ denotes element-wise division.

• Column normalization: $K\leftarrow K\oslash(\mathbf{1}_{M}K)$ , where $\mathbf{1}_{M}$ is a length- $M$ vector of ones.

This procedure ensures that the matrix $K$ approaches a doubly stochastic matrix, in which each row and each column sums approximately to $1$ . As for the time complexity, each row or column normalization traverses the entire matrix of $M\times N$ elements (where $M$ and $N$ are the numbers of keypoints in images A and B respectively), so a single normalization costs $\mathcal{O}(MN)$ . The Sinkhorn algorithm requires multiple iterations to converge; if the number of iterations is $T$ , the total time complexity is $\mathcal{O}(T\times MN)$ . In addition, it requires strict alignment of data points from different modalities, which may not be applicable to low-quality aligned multimodal data and may introduce wrong information into the optimization process; hence it needs strong supervision from labeled data. In contrast, we align data points implicitly by measuring the gap between the two semantic distributions in a gentle and comprehensive way. SDD takes the features of the overall distribution into account, which reflects the essence of the problem, and its time complexity is $\mathcal{O}(MN)$ .
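The alternating normalization described above can be sketched in a few lines. This is a simplified illustration rather than SuperGlue's implementation: it omits the dustbin, uses a square score matrix so that both marginals can equal $1$ , and the name `sinkhorn` and the parameter `n_iters` are hypothetical.

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Alternate row and column normalisation of K = exp(S).

    Each normalisation touches all M x N entries, so one iteration costs
    O(MN) and n_iters iterations cost O(n_iters * MN).
    """
    # Shifting by the maximum only rescales K by a constant, which the
    # first normalisation removes; it avoids overflow in exp.
    K = np.exp(scores - scores.max())
    for _ in range(n_iters):
        K = K / K.sum(axis=1, keepdims=True)  # row normalisation
        K = K / K.sum(axis=0, keepdims=True)  # column normalisation
    return K

# Toy usage on a random 4 x 4 score matrix.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 4))
P = sinkhorn(S, n_iters=100)
print("row sums:   ", P.sum(axis=1))
print("column sums:", P.sum(axis=0))
```

The loop makes the $\mathcal{O}(T\times MN)$ cost visible: the work of a single $\mathcal{O}(MN)$ pass is repeated once per iteration, whereas a measure computed in one pass over the pairwise distances stays at $\mathcal{O}(MN)$ .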

Appendix C Datasets

C.1 The datasets in protein representation fields

CATH 4.2. This dataset[85], part of the CATH (Class, Architecture, Topology, Homologous superfamily) protein structure classification system, is divided into training (18,204 proteins), validation (608 proteins), and testing (1,120 proteins) sets. In downstream protein design tasks, the training set fine-tunes pretrained models, and the test set evaluates them. It includes updates and classifications detailing protein classes, architectures, topologies, and homologous superfamilies. Although smaller in number, the CASP14 dataset better reflects real-world blind testing and competition environments.

Protein Fold Classification. We utilized the SCOPe 1.75 dataset[86] for protein folding classification across training, validation, and testing phases. Comprising 16,712 proteins, the dataset is classified into 1,195 folding categories, with 3D protein coordinates derived from the SCOPe 1.75 database. It employs three distinct evaluation schemes: the fold scheme, which excludes proteins from the same superfamily during training; the superfamily scheme, which withholds proteins from the same family in training; and the family scheme, where proteins from the same family are accessible during training. Average accuracy serves as the evaluation metric.

Enzyme Reaction Classification. The dataset described by Hermosilla et al.[87] includes 37,428 proteins categorized into 384 enzyme-catalyzed reaction classes based on Enzyme Commission (EC) numbers. It is structured into training, validation, and testing sets with proteins clustered by 50% sequence similarity to ensure that similar sequences are contained within the same data partition. This dataset is used primarily for developing and evaluating machine learning models that predict enzyme functions from protein sequences.

Gene Ontology Term Prediction. This dataset, based on the research by Gligorijević et al. [88], is used for predicting the Gene Ontology (GO) terms associated with protein functions. It categorizes proteins into three GO categories: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). It includes 1,943 biological process categories, 489 molecular function categories, and 320 cellular component categories. Additionally, the dataset comprises 29,898 proteins for training, 3,322 for validation, and 3,415 for testing, with the maximum F score ( $F_{\text{max}}$ ) as the evaluation metric.

Enzyme Commission Number Prediction. Unlike enzyme reaction classification, this task aims to predict 538 third-level and fourth-level EC numbers. The dataset is divided into training, validation, and test sets as described by Gligorijević et al. [88], comprising 15,550 proteins for training, 1,729 proteins for validation, and 1,919 proteins for testing. EC number prediction is also a multi-label classification task, with the maximum F-score ( $F_{\text{max}}$ ) used as the evaluation metric. For GO term and EC number prediction, we follow the multi-cutoff splits in Gligorijević et al. to ensure that the test set only contains PDB chains with sequence identity no more than 95% to the training set.
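
The text above names $F_{\text{max}}$ without spelling out its computation. The following is a minimal sketch, assuming the protein-centric definition commonly used for GO and EC benchmarks: precision is averaged over proteins with at least one prediction above the threshold, recall is averaged over proteins with at least one true label, and the F-score is maximized over thresholds. The function name `f_max` and the threshold grid are illustrative choices, not taken from the paper.

```python
def f_max(scores, labels, thresholds=None):
    """Protein-centric maximum F-score over decision thresholds.

    scores: per-protein lists of predicted probabilities (one per term)
    labels: per-protein lists of 0/1 ground-truth indicators
    """
    if thresholds is None:
        # Assumed grid of thresholds: 0.01, 0.02, ..., 0.99
        thresholds = [t / 100 for t in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for s, y in zip(scores, labels):
            pred = [p >= t for p in s]
            tp = sum(1 for p, g in zip(pred, y) if p and g)
            n_pred = sum(pred)
            n_true = sum(y)
            if n_pred > 0:   # precision only counts proteins with predictions
                precisions.append(tp / n_pred)
            if n_true > 0:   # recall only counts proteins with true labels
                recalls.append(tp / n_true)
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

For instance, `f_max([[0.9, 0.1], [0.2, 0.8]], [[1, 0], [0, 1]])` evaluates to 1.0, because a threshold of 0.5 separates the true terms from the false ones for both proteins.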

C.2 The datasets in image-text retrieval task

Flickr-8K. This dataset is a well-known resource in the field of computer vision, particularly used for tasks like image captioning and cross-modal retrieval. It consists of 8,000 images collected from Flickr, each accompanied by five human-annotated captions that describe the images in detail.

MiniCOCO. This dataset is a smaller subset derived from the large COCO2014 dataset, designed for more efficient training and testing. It comprises 25,000 images with approximately 184,000 annotations across 80 object categories. In addition, this dataset is carefully sampled to maintain a balanced representation of object instances, size distributions, and class-specific object ratios.

RSICD. This dataset is a specialized resource in remote sensing, primarily used for tasks like image captioning and cross-modal retrieval. It contains approximately 10,000 images sourced from various global locations, each paired with five expert-generated captions detailing the geographical and environmental features visible in the images. This dataset is designed to advance the automatic description of complex geographical scenes from aerial or satellite perspectives.

UCM. This dataset is a collection from the University of California, Merced, consisting of 2,100 high-resolution satellite images across 21 land use categories including airports, golf courses, and residential areas. Each category contains 100 images, all meticulously labeled for precise land use classification tasks. This dataset provides a detailed visualization of diverse land use scenarios, facilitating advanced studies in geographic information systems and machine learning.

Sydney. This dataset comprises 643 high-resolution aerial images categorized into seven distinct urban elements such as parks, residential areas, and so forth. This dataset is used for urban planning and management, offering a rich visual repository to support research in urban feature analysis and geographic information systems applications, ensuring a comprehensive study of urban landscapes.

C.3 The datasets in zero-shot classification

RSICD-CLS. Derived from the broader RSICD, this dataset is tailored for image classification with a focus on geographical features. While the exact number of images is not specified, it is understood to include several thousand images, typically a significant subset of RSICD. This classification-focused dataset encompasses varied landscapes such as urban environments, water bodies, and agricultural regions, all categorized for targeted machine learning applications. The dataset aims to improve algorithms’ accuracy in discerning complex geographical patterns from satellite imagery.

UCM-CLS. This dataset adapts the original 2,100 high-resolution satellite images from the UCM dataset into a format specifically intended for classification tasks. Each image, measuring $256\times 256$ pixels, represents one of 21 distinct land use types including airports, orchards, and industrial zones. This granular categorization supports precise model training in land use recognition, providing a critical tool for geographic analysis and urban planning research.

WHU-RS19. Consisting of approximately 1,000 high-resolution images, this dataset captures 19 urban classifications such as transportation networks and green spaces. This diversity offers a robust platform for testing urban classification algorithms against a backdrop of global cityscapes, thereby aiding in the development of more adaptive and effective remote sensing technologies.

RSSCN7. With 2,800 images distributed across 7 categories, each containing around 400 images, this dataset challenges scene classification systems with its variety of land covers. These categories include industrial sites, commercial regions, and parks, aiming to mirror the complexity of real-world environments. The dataset tests the ability of models to identify and classify scene characteristics accurately, enhancing their usability in practical remote sensing applications.

AID. This dataset extends over 10,000 images across 30 diverse scene types such as beaches, railways, and residential areas. The extensive collection of images is particularly designed to challenge and refine the scene recognition capabilities of aerial image analysis algorithms. Each category in AID is chosen to represent a wide array of aerial perspectives, promoting comprehensive training and evaluation of models on a scale that mimics real-world aerial surveillance tasks.

Appendix D Model Architecture

ESM2. ESM2 is an advanced protein language model designed to extract meaningful representations from protein sequences. It is based on a 24-layer transformer encoder architecture, employing rotary position embeddings to better handle dependencies in long sequences. ESM2 was pre-trained on 65 million unique UniRef50 sequences using a masked language model, enabling it to learn sequence patterns and their structural implications from a large amount of data. As the model scales from 8 million to 15 billion parameters, ESM2 shows significant improvements in structural knowledge, especially in low-resolution and atomic-level structure predictions. Its variant, ESMFold, can directly predict high-precision three-dimensional structures from single sequences without the need for multiple sequence alignment steps, greatly increasing prediction speed. These features make ESM2 not only excel in structural prediction but also applicable to functional prediction and mutation effect analysis in various biological research areas.

CDConv. CDConv is specifically designed to handle the complexity of protein sequences and geometries. The architecture includes two main components: discrete processing for sequences and continuous encoding for geometric coordinates. Through this approach, CDConv independently adjusts processing strategies for different spaces (sequence and geometry) and optimizes their respective weights. Specifically, the architecture employs discrete weights to capture the regularity in sequences while directly encoding continuous irregular displacements in geometries, thereby effectively reducing the impact of geometric irregularity on sequence modeling. Additionally, this architecture implements a hierarchical deep convolutional neural network, which progressively refines and abstracts protein features through multiple layers to support various biological function prediction tasks. This unique modeling approach has demonstrated excellent performance in multiple protein modeling tasks, proving its effectiveness in capturing the complex relationships between protein structure and function.

ResNet. ResNet (Residual Network), proposed by Microsoft Research in 2015, addresses the vanishing gradient problem in deep neural networks by introducing residual blocks with skip connections. These connections allow gradients to flow through multiple layers, helping the network learn identity functions and ensuring that deeper networks perform at least as well as shallower ones. ResNet architectures, such as ResNet-34, ResNet-50, ResNet-101, and ResNet-152, range from 34 to 152 layers, effectively balancing performance and computational efficiency for various image recognition tasks. ResNet-34 handles medium-scale tasks, while ResNet-50, known for its balance of performance and efficiency, is widely used in image processing. ResNet-101 captures finer details for advanced visual recognition, and ResNet-152 excels in complex image processing tasks. Variants like Pre-activated ResNet and Wide ResNet further optimize performance and efficiency, making ResNet widely applicable in computer vision.

ViT. ViT (Vision Transformer), proposed by Google Research in 2020, adapts the Transformer model from NLP to computer vision by dividing images into patches and processing them with self-attention mechanisms. Unlike traditional CNNs that focus on local features, ViT captures global relationships between patches, enhancing its understanding of complex spatial hierarchies in images. This approach allows ViT to match or even surpass state-of-the-art methods in various image processing tasks. ViT’s scalability allows it to adapt to different computational and performance needs by adjusting the number of Transformer layers, the model size, or the input patch resolution. Variants like ViT-32 and ViT-64 demonstrate this flexibility, with ViT-32 using 32x32 patches for larger images and ViT-64 using 64x64 patches for coarser segmentation. This versatility and efficiency make ViT a powerful tool in computer vision.
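
The patch-splitting step described above can be illustrated with a short sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches flattened in row-major order; the helper name `split_into_patches` is hypothetical and is not part of any ViT library.

```python
def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom);
    each patch is a flat list of patch * patch pixel values.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches
```

In an actual ViT, each flattened patch would then be linearly projected to a token embedding and combined with a position embedding before entering the self-attention layers; those steps are omitted here.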

BERT. BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model developed by Google AI in 2018 for natural language processing (NLP) tasks. Its bidirectional Transformer architecture allows it to consider both forward and backward context, enhancing performance in tasks like question answering, language inference, and sentiment analysis. BERT’s pre-training involves "Masked Language Model" (MLM) and "Next Sentence Prediction" (NSP), followed by fine-tuning on specific tasks. Improved models like RoBERTa[92], ALBERT[93], SpanBERT[94], DeBERTa[95], and ELECTRA[96] have built upon BERT’s foundation. RoBERTa optimizes training by removing NSP and using larger datasets. ALBERT reduces memory consumption with cross-layer parameter sharing. SpanBERT enhances span predictions for tasks like question answering. DeBERTa introduces disentangled attention for better context-aware representations, and ELECTRA uses a discriminator model for more efficient pre-training. These advancements have significantly advanced NLP capabilities.

Appendix E Detailed quantitative analysis for sampling size.

The sampling size can strongly influence the experiment. On the one hand, if the sampling size is too small, a sampled batch may not represent the original distribution, which leads to different semantic distributions between batches from different modalities. On the other hand, a larger sampling size incurs greater cost even though it measures the true distribution better. Hence, it is important to select an appropriate sampling size, and we design a quantitative analysis for this purpose. Specifically, we draw multiple batches of the same sampling size from the original distribution and compute the difference between every pair of batch distributions according to the following formula:

D=\frac{1}{B}\sum_{i=1}^{B}\left(\kappa(t_{i},T)-\kappa(t_{i},R)\right)^{2}+\frac{1}{B}\sum_{i=1}^{B}\left(\kappa(r_{i},R)-\kappa(r_{i},T)\right)^{2}

(10)

where $T=\{t_{i}\}_{i=1}^{B}$ and $R=\{r_{i}\}_{i=1}^{B}$ are two sampling batches from the same distribution and $B$ is sampling size. $\kappa(x,T)$ denotes the density estimate of sample $x$ in $T$ and the formula is as follows:

\kappa(x,T)=\frac{\sum_{i=1}^{B}\exp\left(-\frac{\|x-t_{i}\|^{2}}{\sigma(T)}\right)}{B\pi}

(11)

where the form of $\sigma(\cdot)$ is given in Eq. 6, and we adopt relative distance so that tiny absolute distances do not make the density estimates of different samples indistinguishable. The smaller the average gap between every two sampling batches, the more similar the sampled distributions are, from which we infer that a batch of that size better represents the original distribution. We choose a random distribution as the real distribution because it can be seen as the most cluttered distribution. If a batch of a given sampling size represents this distribution well, then, in theory, batches of the same size can represent any other distribution. We select different sampling sizes and analyse how well the corresponding batches represent the original distribution. The results are shown in Figure 4.
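
As a rough illustration of Eqs. 10 and 11, the sketch below computes the gap $D$ between two batches and averages it over all pairs of batches drawn from a uniform random distribution. Two points are assumptions rather than the paper's definitions: `sigma` is a stand-in for Eq. 6 (which lies outside this section) and uses the mean squared pairwise distance within the batch, and the "random distribution" is taken to be uniform on the unit hypercube. All function names are illustrative.

```python
import itertools
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sigma(batch):
    """ASSUMPTION: stand-in for Eq. 6, the mean squared pairwise distance (needs B >= 2)."""
    n = len(batch)
    total = sum(sq_dist(batch[i], batch[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def kappa(x, batch, s):
    """Density estimate of sample x in batch (Eq. 11), with s = sigma(batch)."""
    return sum(math.exp(-sq_dist(x, t) / s) for t in batch) / (len(batch) * math.pi)

def batch_gap(T, R):
    """Gap D between two equally sized batches (Eq. 10)."""
    B = len(T)
    s_t, s_r = sigma(T), sigma(R)
    term_t = sum((kappa(t, T, s_t) - kappa(t, R, s_r)) ** 2 for t in T) / B
    term_r = sum((kappa(r, R, s_r) - kappa(r, T, s_t)) ** 2 for r in R) / B
    return term_t + term_r

def average_gap(B, dim, num_batches=4, seed=0):
    """Average D over every pair of batches sampled from a uniform distribution."""
    rng = random.Random(seed)
    batches = [[[rng.random() for _ in range(dim)] for _ in range(B)]
               for _ in range(num_batches)]
    gaps = [batch_gap(T, R) for T, R in itertools.combinations(batches, 2)]
    return sum(gaps) / len(gaps)
```

Calling `average_gap(B, dim)` for increasing `B` gives a curve of the kind discussed for Figure 4a; the exact values depend on the assumed `sigma` and will not reproduce the paper's numbers.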

In Fig. 4a, the abscissa denotes the sampling size and the ordinate represents the average gap between sampling batches from the same distribution. Different colors denote different sample feature dimensions. As the sampling size increases, the average gap decreases sharply. If we take the dissimilarity between batches to be $100\%$ when the sampling size is $2$ , the dissimilarity drops to $0.9\%$ when the size reaches $64$ , so the sampling batches can represent the original distribution with a $99\%$ confidence coefficient. In Fig. 4b, we draw two sampling batches from the random distribution and visualize each of them in a different color after dimensionality reduction. As the sampling size increases, the regions covered by the two batches increasingly coincide.

Appendix F More details in protein representation field

In the field of protein representation, there is no widely accepted two-stream network, and most previous works feed the sequence and structure into a single-stream network as two properties of the protein. Here, following [30], we treat sequence and structure as two modalities with separate encoders and apply Gentle-CLIP to realize multimodal fusion. Owing to the good performance of Continuous-Discrete Convolution (CDConv)[23], the structure encoder is built on CDConv, and we adopt ESM-2 as the sequence encoder. The definition of neighborhood is left unchanged because it is critical for the node representation. Because CDConv serves as the structure encoder, we modify its input module to represent structure information as follows:

\mathrm{S}[i]=\begin{cases}\mathrm{Concat}([0,0,0],\ \mathcal{P}[i+1]-\mathcal{P}[i])&i=0\\ \mathrm{Concat}(\mathcal{P}[i-1]-\mathcal{P}[i],\ \mathcal{P}[i+1]-\mathcal{P}[i])&1\leq i<N\\ \mathrm{Concat}(\mathcal{P}[i-1]-\mathcal{P}[i],\ [0,0,0])&i=N\end{cases}

(12)

where $\mathcal{P}[i]$ denotes the coordinates of node $i$ and $N$ represents the total number of nodes. Through the above formula, we obtain the structural information between the current node and its neighbors, encompassing positional relationships and angles. We then apply an embedding layer to adapt these features to the model input, and we replace the original sequence information with random noise, which is fed together with the structure information as input. Although CDConv does not directly receive sequence information, it learns sequence knowledge transmitted by ESM-2 through the fusion process, so the model gradually learns a comprehensive representation. In the pretraining process, we train Gentle-CLIP and CLIP separately on CATH $4.2$ for comparison, with the batch size set to $64$ . The pretraining process lasts $100$ epochs and the learning rate is $0.001$ . Following the settings in [23], we evaluate the effect of pretraining on the downstream tasks, and the results are shown in Table 4.
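
The input construction in Eq. 12 can be sketched as follows. The sketch assumes zero-based indexing in which the last node has index `len(P) - 1` and each coordinate is a 3D point; the function name `structure_features` is hypothetical and not taken from the CDConv code base.

```python
def structure_features(P):
    """Build the per-node feature S[i] of Eq. 12 from a list of 3D coordinates P.

    Each feature concatenates the displacement to the previous node and the
    displacement to the next node; missing neighbors at the two chain ends
    are replaced by a zero vector.
    """
    n = len(P)
    zero = [0.0, 0.0, 0.0]

    def diff(a, b):
        return [x - y for x, y in zip(a, b)]

    S = []
    for i in range(n):
        prev_vec = diff(P[i - 1], P[i]) if i > 0 else zero
        next_vec = diff(P[i + 1], P[i]) if i < n - 1 else zero
        S.append(prev_vec + next_vec)  # list concatenation -> 6 values per node
    return S
```

For the three-node chain `[[0, 0, 0], [1, 0, 0], [1, 2, 0]]`, the middle node receives `[-1, 0, 0, 0, 2, 0]`: the first three entries point back to its predecessor and the last three point forward to its successor, which is how the feature encodes both relative position and the angle between consecutive bonds.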

Table 4: Using supervised models, as well as our Gentle-CLIP model with half supervised and half unsupervised learning, we compared the performance on Fold Classification and Enzyme Reaction tasks. The best performer in each column is highlighted in bold.

Input	Method	Fold Classification: Fold	Fold Classification: Superfamily	Fold Classification: Family	Enzyme Reaction	Average
1D	CNN	11.3	13.4	53.4	51.7	32.5
	ResNet	10.1	7.21	23.5	24.1	16.2
	LSTM	6.41	4.33	18.1	11.0	9.9
	Transformer	9.22	8.81	40.4	26.6	21.3
3D	GCN	16.8	21.3	82.8	67.3	47.1
	GAT	12.4	16.5	72.7	55.6	39.3
	3D CNN	31.6	45.4	92.5	72.2	60.4
3D+Topo	IEConv(atom level)	45.0	69.7	98.9	87.2	75.2
(3+1)D	GraphQA	23.7	32.5	84.4	60.8	50.4
	GVP	16.0	22.5	83.8	65.5	46.9
	IEConv(residue level)	47.6	70.2	99.2	87.2	76.1
	GearNet	28.4	42.6	95.3	79.4	61.4
	GearNet-IEConv	42.3	64.1	99.1	83.7	72.3
	GearNet-Edge	44.0	66.7	99.1	86.6	74.1
	GearNet-Edge-IEConv	48.3	70.3	99.5	85.3	75.9
	CDConv	56.7	77.7	99.6	88.5	80.6
	CLIP(1/2 CATH)	57.7	78.6	99.6	88.6	81.1
	Gentle-CLIP(Ours)	59.1	79.7	99.6	88.8	81.8
	CLIP(CATH)	58.5	81.3	99.7	88.8	82.1

Appendix G More detailed experiment and analysis in remote sensing

Table 5 compares several CLIP-based methods on cross-modal retrieval in the image-to-text and text-to-image directions, using the RSICD, UCM, and Sydney datasets. The methods include the original CLIP model, the fine-tuned CLIP model, Hard-PL, Soft-PL, S-CLIP, and our Gentle-CLIP. The performance metric is R@5 (recall at top 5). Gentle-CLIP achieves the best R@5 in five of the six columns; the exception is text-to-image retrieval on Sydney, where fine-tuned CLIP scores higher (56.1 vs. 55.2). This table provides quantitative evidence of the potential advantages of Gentle-CLIP in practical applications.

Table 5: In the field of remote sensing retrieval, the results produced by the best performer in each column are boldfaced.

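Method	Image $\rightarrow$ text R@5: RSICD	Image $\rightarrow$ text R@5: UCM	Image $\rightarrow$ text R@5: Sydney	Text $\rightarrow$ image R@5: RSICD	Text $\rightarrow$ image R@5: UCM	Text $\rightarrow$ image R@5: Sydney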
CLIP(original)	9.4	34.3	36.2	10.1	24.8	51.7
CLIP(fine-tune)	15.4 $\pm$ 1.7	41.3 $\pm$ 1.8	47.1 $\pm$ 6.5	15.1 $\pm$ 1.0	40.9 $\pm$ 1.6	56.1 $\pm$ 2.4
Hard-PL	16.1 $\pm$ 0.2	40.8 $\pm$ 2.9	43.1 $\pm$ 3.0	15.7 $\pm$ 0.7	40.5 $\pm$ 3.0	47.7 $\pm$ 5.3
Soft-PL	17.0 $\pm$ 0.9	43.2 $\pm$ 3.9	42.0 $\pm$ 4.3	16.5 $\pm$ 0.1	42.9 $\pm$ 3.3	50.2 $\pm$ 4.9
S-CLIP	18.4 $\pm$ 0.6	45.7 $\pm$ 1.4	50.0 $\pm$ 3.0	16.8 $\pm$ 1.2	43.5 $\pm$ 1.5	55.1 $\pm$ 2.0
Gentle-CLIP(Ours)	19.6 $\pm$ 0.7	46.3 $\pm$ 1.1	51.1 $\pm$ 1.7	17.4 $\pm$ 0.9	44.1 $\pm$ 1.2	55.2 $\pm$ 2.3
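
For reference, R@5 can be computed from a query-by-candidate similarity matrix as sketched below. This is a minimal illustration, assuming that a query counts as a hit when at least one of its ground-truth matches appears among the top-$k$ candidates (relevant when an image has several captions); the function name `recall_at_k` and the data layout are assumptions, not the paper's evaluation code.

```python
def recall_at_k(similarity, positives, k=5):
    """Fraction of queries whose top-k retrieved items contain at least one positive.

    similarity: list of rows; similarity[q][c] is the score of candidate c for query q
    positives:  list of sets; positives[q] holds the indices of the correct candidates
    """
    hits = 0
    for scores, pos in zip(similarity, positives):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if pos & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)
```

The returned value is a fraction in $[0, 1]$; multiplying by 100 gives percentages on the scale used in Table 5.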

Appendix H Visualization experiment in general vision-language field

Figure 5 shows the benchmark results on the image-text matching task. In these figures, circles denote image representations in the latent space, while triangles denote the latent embeddings of text. Points with the same color share similar semantics, so the goal of this task is to bring circles and triangles of the same color closer together while pushing the other points away. Gentle-CLIP obtains more discriminative representations because it applies the self-supervised contrastive loss.

Appendix I More ablation experiment

Q3: Can we simplify the form of SDD while guaranteeing competitive performance?
This question asks whether such a delicate, fine-grained objective for semantic distributions is necessary. In other words, it is important to analyse whether SDD acquires more alignment information than coarse-grained methods. The Centroid method is chosen for comparison; it uses the distance between centroids to represent the difference between distributions. Experiments are carried out in the remote sensing field and the results are shown in Table 6.

S-CLIP uses a pseudo-labeling algorithm for alignment, while the Centroid method calculates the centroid of the distribution and aligns the centroids. Since our model performs matching at a finer granularity, it achieves the best results.
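
To make the contrast with a coarse-grained objective concrete, the sketch below implements a centroid distance between two batches of embeddings. The Euclidean norm and the names `centroid` and `centroid_distance` are assumptions; the paper only states that the Centroid method aligns the centroids of the distributions.

```python
import math

def centroid(batch):
    """Mean vector of a batch of equally sized embeddings."""
    n = len(batch)
    return [sum(v[d] for v in batch) / n for d in range(len(batch[0]))]

def centroid_distance(A, B):
    """Euclidean distance between the centroids of two batches (assumed norm)."""
    ca, cb = centroid(A), centroid(B)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

# Two batches with very different spread but the same mean:
tight = [[-0.1, 0.0], [0.1, 0.0]]
spread = [[-5.0, 0.0], [5.0, 0.0]]
gap = centroid_distance(tight, spread)  # 0.0: the centroids coincide
```

The `tight` and `spread` batches have a centroid distance of zero even though their shapes differ substantially, which illustrates the kind of distributional information a centroid-only objective cannot see and a density-based objective can.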

Table 6: Ablation experiments on the alignment method in the field of remote sensing retrieval. The results produced by the best performer in each column are boldfaced.

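Method	Image $\rightarrow$ text R@5: RSICD	Image $\rightarrow$ text R@5: UCM	Image $\rightarrow$ text R@5: Sydney	Text $\rightarrow$ image R@5: RSICD	Text $\rightarrow$ image R@5: UCM	Text $\rightarrow$ image R@5: Sydney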
S-CLIP	18.4 $\pm$ 0.6	45.7 $\pm$ 1.4	50.0 $\pm$ 3.0	16.8 $\pm$ 1.2	43.5 $\pm$ 1.5	55.1 $\pm$ 2.0
Centroid method	18.6 $\pm$ 0.8	45.9 $\pm$ 1.1	50.7 $\pm$ 2.1	17.0 $\pm$ 1.1	43.9 $\pm$ 1.3	55.1 $\pm$ 1.7
Gentle-CLIP	19.6 $\pm$ 0.7	46.3 $\pm$ 1.1	51.1 $\pm$ 1.7	17.4 $\pm$ 0.9	44.1 $\pm$ 1.2	55.2 $\pm$ 2.3

Appendix J Experimental Resource Allocation

Experimental Resource Allocation. In the protein domain, the related experiments were conducted using 8 A100 GPUs, each with 80GB of memory, and the training lasted for 3 days. In the image-text domain, the experiments were performed using 4 A100 GPUs, each with 80GB of memory, and the training lasted for 1 day. For the remote sensing domain, the experiments utilized 4 A100 GPUs, each with 80GB of memory, and the training also lasted for 1 day.