Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding

Run Shao, Zhaoyang Zhang, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li
Abstract

The paradigm shift introduced by multimodal large language models, which are based on the transformer architecture and the pretext task of “next-token prediction,” has revolutionized the field of remote sensing image understanding. However, the tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks. One key factor of the great comprehension power of the large language model is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision, which cannot serve as effectively as words or subwords in language. Starting from the essence of the tokenizer, we define semantically independent regions (SIRs) for vision. Based on this definition, we propose two properties that an ideal visual tokenizer should possess: (1) homogeneity, where SIRs serve as the basic elements of vision, and (2) adaptivity, which allows for a flexible number of tokens to accommodate images of any size and tasks of any granularity. Based on this, we designed a simple HOmogeneous visual tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity, the OPM splits the image into 4×4 pixel seeds and then utilizes the attention mechanism to perceive SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptivity, the OVM defines a variable number of learnable vectors as cross-attention queries, allowing for the adjustment of token quantity. We conducted experiments on the NWPU-RESISC45 and WHU-RS19 classification datasets and the GID5 segmentation dataset for sparse and dense tasks, respectively. The results show that the visual tokens obtained by HOOK correspond to individual objects, which demonstrates homogeneity. HOOK outperformed Patch Embed by 6% and 10% in the two tasks and achieved state-of-the-art performance compared to the baselines used for comparison. Compared to Patch Embed, which requires more than one hundred tokens for one image, HOOK requires only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in efficiency improvements of 1.5 to 2.8 times. The homogeneity and adaptivity we propose provide new perspectives on the study of visual tokenizers, and guided by these properties, the HOOK we designed shows potential to replace Patch Embed. The code is available at https://github.com/GeoX-Lab/Hook.

Keywords: Remote Sensing Image Understanding, Visual Tokenizer, Homogeneous, Semantic Independent Region, Visual Transformer Model
journal: ISPRS Journal of Photogrammetry and Remote Sensing
Affiliation [csu]: School of Geosciences and Info-Physics, Central South University, No. 932 South Lushan Road, Changsha, 410083, Hunan, China

Affiliation [XJL]: Xiangjiang Laboratory, No. 569, YueLu Avenue, Changsha, 410083, Hunan, China

Figure 1: (a) Natural language tokenizers use words or subwords as the basic elements of language. Similarly, for visual tokenizers, this work attempts to answer a fundamental question: What are the basic elements of vision? (b) The current mainstream visual tokenizers are patch-based methods, represented by Patch Embed, that can be categorized into three types based on hierarchy: patch level, subpatch level, and superpatch level. Their commonality lies in the use of patches as the basic elements of vision. (c) There exists a confusion matrix for tokens and objects. “Same Object Multiple Tokens” leads to incomplete object features, while “Same Token Multiple Objects” leads to unclear relationships between objects. “Multiple Tokens Multiple Objects” inherits the drawbacks of the above two cases. Patch-based methods inherently struggle to achieve the ideal scenario: “Same Object Same Token”. (d) We define the concept of the Semantic Independent Region (SIR) and propose two properties of an ideal visual tokenizer: homogeneity and adaptivity. We design a simple HOmogeneous visual tOKenizer, HOOK, where SIRs serve as the basic elements of vision.

1 Introduction

Large language models, such as the GPT series[1, 2, 3, 4], Llama[5, 6], and PaLM[7], which are built on the transformer architecture and trained with the “next-token prediction” objective, have demonstrated powerful natural language understanding, reasoning, and generalization capabilities. Subsequently, the development of multimodal large language models[8, 9, 10, 11, 12] has thoroughly transformed the research landscape in various fields, including remote sensing image understanding, leading to a paradigmatic revolution.

The tokenizer is a foundational and essential component of multimodal large language models that is aimed at identifying the basic elements of the data and using them to tokenize the data into a sequence of tokens. The natural language tokenizer (NL tokenizer) is fundamental for machine understanding of language and has been widely investigated and applied in the field of natural language processing[13, 14, 15, 16]. As shown in Figure 1-a, one key factor contributing to the great comprehension power of the large language model is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. However, visual tokenizers, which play a similar role in visual foundation models and multimodal large models, have long been overlooked, underestimated, and even misunderstood.

The visual tokenizer is aimed at identifying the basic elements of an image and tokenizing the image into a sequence of tokens based on the basic elements. What are the basic elements of vision? As shown in Figure 1-b, patch-based methods, represented by Patch Embed[17], are the most popular implementations of visual tokenizers, where the fundamental characteristic of these methods is the use of rectangular patches as the basic elements of vision. The advantage of patch-based methods lies in their simplicity and efficiency, but they have two prominent issues for remote sensing images:

(1) Rectangular patches fail to match irregular and complex objects, which leads to a lack of semantic homogeneity within tokens. As shown in Figure 1-c, there exists a confusion matrix for tokens and objects: if multiple objects are aggregated within a single token, the model may struggle to learn the relationships between the objects; if a single object is dispersed across multiple tokens, the model may struggle to learn the complete features of the individual object. We summarize these two scenarios as follows: “Same Token Multiple Objects” and “Same Object Multiple Tokens”. Additionally, if multiple objects correspond to multiple tokens, i.e., “Multiple Tokens Multiple Objects”, it inherits the drawbacks of the previous scenarios. Ideally, the relationship between tokens and objects should be “Same Object Same Token”, but due to the limitations of rectangular patches, patch-based methods fundamentally cannot achieve this ideal scenario.

(2) The efficiency issues arising from the fixed and redundant number of tokens limit the application of transformer-based models in remote sensing images. For instance, when the patch size is 16 × 16, a 224 × 224 remote sensing image corresponds to 196 tokens, and the number of tokens grows quadratically with the image side length. Furthermore, in transformer models, the computational complexity of the attention mechanism is $O(N^2)$, which indicates that the computational cost of the attention mechanism also scales quadratically with the number of tokens $N$. This makes it challenging for transformer-based remote sensing models to handle high-resolution remote sensing images.
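The arithmetic in point (2) can be checked with a few lines of Python. This is only a minimal sketch: the helper names `patch_token_count` and `attention_pairs` are illustrative and are not taken from the HOOK code base, and the image sides are assumed to be divisible by the patch size.

```python
def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens (sides assumed divisible by patch)."""
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one full self-attention layer, i.e. N^2."""
    return num_tokens * num_tokens


# Token count and attention cost as the image side length doubles.
for side in (224, 448, 896):
    n = patch_token_count(side, side)
    print(f"{side}x{side}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Doubling the side length from 224 to 448 raises the token count from 196 to 784 (4 times) and the number of attention pairs by a factor of 16.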

Therefore, meaningless rectangular patches are not suitable as the basic elements of images. Analogous to meaningful words or subwords that serve as the basic elements of natural language, an intuitive idea is to identify certain semantically meaningful regions in the image as the basic elements of vision. As shown in Figure 1-d, we expand this idea and define two fundamental properties that an ideal visual tokenizer should possess:

(1) Homogeneity: Inspired by the NL tokenizer, which uses meaningful words or subwords as the basic elements of natural language, we define semantically independent regions (SIRs) for images. SIRs refer to regions in the image that are semantically independent of other regions outside themselves. Once SIRs are defined, they are internally homogeneous in semantics compared to inter-regions; thus, we refer to the visual tokenizer with SIRs as the basic elements of vision as homogeneous.

(2) Adaptivity: In the definition of SIRs, “independence” is a dynamic concept. For example, the entire aircraft is an independent region compared to the ground, the wing is an independent region compared to the aircraft, and the flap is an independent region compared to the wing. Therefore, when addressing a specific remote sensing image, the image size and the granularity of the task determine the ideal granularity of independence, while the number of tokens determines the actual granularity of independence. Additionally, a visual tokenizer should be image-agnostic and task-agnostic. Therefore, the number of tokens should be adjusted to accommodate images of any size and tasks of any granularity. We refer to this property as the adaptivity of the visual tokenizer.

To develop an ideal visual tokenizer that meets the above properties, we define the confusion matrix for tokens and objects under strict definitions and constraints. We found that the transformations among “Same Object Same Token”, “Same Object Multiple Tokens”, “Same Token Multiple Objects”, and “Multiple Tokens Multiple Objects” actually involve only two meta-operations: “Split” and “Merge”. The process of obtaining homogeneous visual tokens can be abstracted as a routing selection problem, where two distinct general routings are “Splitting and Merging” and “Merging and Splitting”.

Based on the “Splitting and Merging” routing, we designed a simple HOmogeneous visual tOKenizer: HOOK. HOOK mainly consists of two modules: the object perception module (OPM) and the object vectorization module (OVM).

To achieve homogeneity, HOOK adopts a strategy of first splitting the image into the finest granularity possible and then gradually merging to obtain SIRs. First, the OPM leverages convolutional blocks to obtain 4 × 4 pixel seeds. Second, self-attention layers are used to associate the seeds belonging to the same SIR. Last, the OVM employs cross-attention to merge the associated seeds to obtain homogeneous visual tokens.

To achieve adaptivity, the OVM defines $N$ learnable vectors as queries for the cross-attention module, with the seeds output by the OPM serving as the key and value. Here, $N$ is treated as a variable hyperparameter to allow for the arbitrary adjustment of token quantity.
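The role of $N$ can be illustrated with a bare single-head cross-attention written in plain Python. This is a hedged sketch, not the HOOK implementation: it omits the learned query/key/value projections, uses random vectors as stand-ins for both the learnable queries and the OPM seeds, and the dimension `d = 8` and the function names are arbitrary choices for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, seeds):
    """Single-head cross-attention without projections.

    queries: N vectors of length d (stand-ins for the learnable vectors).
    seeds:   M vectors of length d (used here as both key and value).
    Returns N vectors of length d, one token per query.
    """
    d = len(queries[0])
    tokens = []
    for q in queries:
        # Scaled dot-product score of this query against every seed.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seeds]
        weights = softmax(scores)
        # Each token is a weighted average of the seeds.
        tokens.append([sum(w * v[j] for w, v in zip(weights, seeds)) for j in range(d)])
    return tokens


rng = random.Random(0)
d = 8
# 3136 seeds: a 224 x 224 image split into 4 x 4 pixel seeds.
seeds = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3136)]
for n in (6, 8):  # token counts reported for the sparse and dense tasks
    queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    tokens = cross_attention(queries, seeds)
    print(n, len(tokens), len(tokens[0]))
```

The number of output tokens always equals the number of queries, independent of how many seeds the image produces, which is why changing $N$ is enough to change the token quantity.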

We tested the performance of HOOK on the sparse task using the NWPU-RESISC45[18] and WHU-RS19[19, 20] classification datasets and on the dense task using the GID5[21] semantic segmentation dataset. The experimental results show that the visual tokens obtained by HOOK correspond to individual objects, which demonstrates homogeneity. Compared to Patch Embed, HOOK achieved improvements of more than 6% and 10% on the sparse task and dense task, respectively. Additionally, HOOK outperformed the baseline methods that we used for comparison, reaching state-of-the-art performance. In terms of efficiency, HOOK requires only 6 tokens for the sparse task and 8 tokens for the dense task, whereas Patch Embed requires hundreds of tokens for a single image. Overall, this amounts to an efficiency improvement of 1.5 to 2.8 times.

Our contributions are summarized as follows:

(1) Our analyses and experiments show that the importance of visual tokenization has been greatly underestimated. Starting from the essence of tokenizers, we propose two fundamental properties that an ideal visual tokenizer should possess: homogeneity and adaptivity.

(2) To develop a visual tokenizer that satisfies homogeneity and adaptivity, we define the confusion matrix for tokens and objects under strict definitions and constraints. We argue that the construction of homogeneous tokens can be abstracted as a routing selection problem, leading to two general routings: “Splitting and Merging” and “Merging and Splitting”. This analysis provides new perspectives and ideas for research on visual tokenizers.

(3) Based on the “Splitting and Merging” routing, we designed a simple homogeneous visual tokenizer, HOOK, which offers an object-oriented, plug-and-play implementation for visual tokenizers.

(4) The experimental results demonstrate that HOOK satisfies homogeneity and adaptivity. HOOK outperforms baselines, including Patch Embed, in terms of both performance and efficiency. This finding shows the potential for HOOK to replace patch-based methods and become a new foundation visual tokenizer.

We organize this paper as follows: In Section 2, we review the advancements in visual tokenizers, which can be classified into two types, patch-based and object-oriented, based on the different basic elements of vision. In Section 3, we conduct a theoretical analysis of visual tokens. We start by emphasizing the importance and necessity of homogeneity from the essence of visual tokens. We then rigorously define and analyze the confusion matrix for tokens and objects to identify two general routings for constructing homogeneous visual tokens. Sections 4 and 5 introduce HOOK, a homogeneous visual tokenizer designed based on the “Splitting and Merging” routing, and discuss its performance via experiments. The results indicate that HOOK satisfies both homogeneity and adaptivity. Section 6 discusses the significance of visual tokenizers, rethinks our HOOK, and outlines future research directions.

2 Related Work

2.1 Patch-based Visual Tokenizer

Patch-based methods use rectangular patches as the basic elements of images. Patch Embed, which is widely applied in patch-based methods, splits an image into nonoverlapping patches of a fixed size, typically 16×16 pixels. The advantage of Patch Embed lies in its simplicity and efficiency. However, the redundant token quantity and fixed patch size have been identified in various works[22, 23, 24, 25] as factors contributing to training instability in ViT models.

As shown in Figure 1-b, we categorize the improvements in patch-based methods into three types based on hierarchy: patch level, subpatch level, and superpatch level.

A. Patch-level Methods

Patch-level methods argue that the visual tokens obtained through Patch Embed are highly redundant, and in reality, the model needs to select only a small portion of these tokens to recognize the image.

H. Yin et al. proposed A-ViT[26], where they defined the first dimension of tokens as the probability of discarding the token itself, which enables the model to autonomously select tokens that aid in image recognition. Y. Tang et al. calculated the impact of each token at each layer on the final model output, identifying and removing redundant tokens[27]. Y. Rao et al. introduced DynamicViT[28], and B. Pan et al. proposed IA-RED2[29], which inserted trainable prediction modules in the middle layers of the model for dynamic token selection. D. Marin et al. utilized pooling strategies to discard redundant tokens[30]. Y. Liu et al. introduced PatchDropout[31], a simple method that randomly drops a certain percentage of patches during model training, leading to efficiency improvements without compromising test accuracy. Y. Liang et al. proposed a token reorganization module to identify and merge tokens with minimal relevance to the CLS token, reducing the number of tokens[32]. Similarly, T. Wang et al. proposed PnP-DETR[33], which initially extracts features from the original image using ResNet as tokens and then identifies and merges background tokens, effectively enhancing model efficiency and object-level perception capabilities.

B. Subpatch-level methods

Subpatch-level methods argue that the implicit assumption in Patch Embed, where all patches in an image are treated equally, is unreasonable. These methods propose that the patches corresponding to foreground regions in the image should be refined, meaning that more patches should be used to describe important areas of the image.

W. Chen et al. utilized class activation maps (CAMs) to identify salient regions in images and proposed using smaller patches for these significant areas to increase the overall number of patches[34]. Similarly, T. Ronen et al. employed GradCAM to locate salient regions and guided the generation of fine-grained patches using quadtree structures[35]. M. Chen et al. introduced CF-ViT[36] and used the confidence levels from a coarse-grained model to determine whether patches should undergo further processing. Y. Wang et al. also leveraged results from a coarse-grained model, continually densifying patches across the image until high-confidence recognition outcomes were obtained[37]. X. Yue et al. proposed PS-ViT[38], which employs progressive sampling to shift token positions toward the main regions of interest in the image, enabling a more detailed description of foreground elements through additional tokens.

C. Superpatch-level methods

Superpatch-level methods argue that the use of fixed-size patches restricts the model to capturing information at a single scale and propose the construction of multiscale patches to enhance the model’s robustness across different scales.

C.-F. Chen et al. introduced CrossViT[39], which uses two patches of different sizes to split an image, enabling the extraction of multiscale features from the same image. L. Beyer et al. proposed FlexiViT[40], which enhances the model’s robustness to different patch sizes by randomly interpolating parameters in the Patch Embed layer. S. H. Lee et al. presented shifted patch tokenization, which applies multiangle shifting and aggregation of patches to address the lack of local inductive bias in Patch Embed[41].

2.2 Object-oriented Visual Tokenizer

While the above methods have made significant progress in improving Patch Embed, patch-based approaches still cannot fundamentally overcome the limitations of rectangular patches. Currently, some research has begun to explore moving away from patches and instead reconstructing visual tokenizers based on the objects in the images.

B. Wu et al. proposed VT[42], which employs a simple convolutional operation to assign each pixel in an image to one of several semantic groups, mapping them to visual tokens. However, VT lacks global semantic awareness of the image, potentially leading to an overreliance on local pixel characteristics such as color and texture. T. Yang et al. introduced visual concept tokenization (VCT)[43], which utilizes concept token reconstruction in a cleverly designed pretraining task to enable individual control over specific features by each visual token on simulated images. However, VCT requires pretraining and features a complex model structure. S. Qian et al. developed MoTo[23], which perceives the semantic consistency association between tokens but does not represent them as homogeneous visual tokens. J. Mei et al. proposed SPFormer[44], which divides images into irregular, semantically homogeneous regions. However, the method relies on superpixel algorithms and lacks plug-and-play functionality.

To fundamentally address the limitations of the aforementioned approaches, we revisit tokenization in natural language processing and propose two fundamental properties that a visual tokenizer should possess: homogeneity and adaptivity. Building upon this premise, we introduce the HOmogeneous visual tOKenizer (HOOK), which demonstrates distinct advantages in both performance and efficiency over baselines, including Patch Embed. HOOK has the potential to replace patch-based methods as a new foundational visual tokenizer.

3 Theoretical Analysis

In this section, we conduct a theoretical analysis of visual tokens. In Section 3.1, we emphasize the importance and necessity of homogeneity in visual tokenizers by examining the essence of tokenization. In Section 3.2, we define the confusion matrix for tokens and objects under strict definitions and constraints. In Section 3.3, we propose two general routings for constructing homogeneous visual tokens based on this confusion matrix.

3.1 Why do we need a homogeneous visual tokenizer?

The concept of the tokenizer originates from the field of natural language processing (NLP). Machines are unable to understand unstructured data, and the first step in enabling machines to comprehend language is to structure the unstructured language documents. The NL tokenizer is one of the most popular methods for achieving this, breaking down language into smaller basic elements known as tokens.

A. NL tokenizer

In the NL tokenizer, the definition of a token is not fixed. Ordered from larger tokens to smaller tokens, NL tokenizers can be classified into sentence tokenizers, word tokenizers, subword tokenizers, and character tokenizers, among others. The size of the token involves a balance between efficiency and generalization:

(1) Larger tokens are more efficient but have poorer generalizability. For example, in a sentence tokenizer, each token represents a complete sentence, which allows for the efficient representation of a document with a small number of tokens. However, the corresponding vocabulary would need to contain all possible sentences, which is not feasible. As a result, sentence tokenizers struggle to handle out-of-vocabulary (OOV) issues and cannot generalize to unseen sentences.

(2) Smaller tokens have stronger generalizability but are less efficient. For instance, in a character tokenizer for English, the vocabulary consists of only the 26 letters plus the necessary symbols. Any new word can be composed of these characters. In theory, a character tokenizer can generalize to any text. However, because individual letters lack semantic meaning, machines struggle to learn words, sentences, and higher-level semantic information. Moreover, several tokens may be required to represent just one word.

Therefore, NL tokenizers need to define suitable basic elements of language to balance the conflicting goals of efficiency and generalization. The most popular NL tokenizers, such as Word2Vec[13], BPE[45], WordPiece[46], Unigram[16], and others[14, 15], belong to the category of word tokenizers or subword tokenizers, which use words or subwords as the basic elements of language. Words and subwords represent the smallest semantically independent elements in language, effectively striking a balance between efficiency and generalization.
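The trade-off between token size, sequence length, and vocabulary size can be made concrete with two toy tokenizers. The sketch below is purely illustrative: `char_tokenize` and `word_tokenize` are hypothetical helpers (simple character and whitespace splitting), not the subword algorithms cited above, and the example sentence is invented.

```python
def char_tokenize(text):
    """Character tokenizer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenize(text):
    """Word tokenizer: one token per whitespace-separated word."""
    return text.split()


text = "the aircraft wing and the aircraft flap"
for name, tokenize in (("character", char_tokenize), ("word", word_tokenize)):
    tokens = tokenize(text)
    print(f"{name}: {len(tokens)} tokens, vocabulary of {len(set(tokens))}")
```

On this sentence the character tokenizer needs 33 tokens against 7 for the word tokenizer. In exchange, its vocabulary is bounded by the alphabet, whereas the word vocabulary grows with every previously unseen word.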

B. Visual tokenizer

Research on visual tokenization has been driven by transformer-based vision models. A visual tokenizer is aimed at identifying the basic elements of an image and splitting the image into a token sequence to meet the input requirements of transformer models. Like NL tokenizers, visual tokenizers also face the challenge of balancing efficiency and generalizability. Consider two extreme scenarios:

(1) Tokens that represent individual pixels. This approach offers the best generalizability because it can generalize to any image within the same color space. However, this method has the lowest efficiency: it requires 50,176 tokens to represent just one 224 × 224 image.

(2) Tokens that represent the entire image. Conversely, this approach maximizes efficiency but compromises generalizability.

In balancing these aspects, visual tokenizers are aimed at finding a suitable granularity of tokens that balances efficiency and generalization for effectively processing images within transformer-based models.

An intuitive approach is to find an intermediate value between the two extreme scenarios mentioned above, similar to words and subwords in NL tokenizers. Notably, the most popular patch-based methods align with this approach. Patch Embed, for example, uses a patch size of 16 × 16 as an empirical middle ground to balance efficiency and generalizability. This is a fundamental reason why patch-based methods are prevalent in visual tokenization.

However, patch-based methods do not consider the differences between images and language. In language, words are inherently semantically independent basic elements, and subwords constructed through various optimization methods often consist of meaningful morphemes. In images, by contrast, a fixed-size rectangular patch typically does not possess independent semantic meaning.

C. Semantically Independent Region

As shown in Figure 1-d, based on the observations and theoretical analysis mentioned above, we have defined the semantically independent region (SIR) as a replacement for the fixed-size rectangular patch as the basic element of an image. SIRs refer to regions in the image that are semantically independent of other regions outside themselves. Once semantically independent regions are defined, they are internally homogeneous in semantics compared to inter-regions. This is why we believe that an ideal visual tokenizer should possess homogeneity.

In addition, in the definition of SIRs, “independence” is a dynamic concept. For example, an entire aircraft is an independent region in relation to the ground, wings are independent regions in relation to the aircraft, and flaps are independent regions in relation to the wings. Therefore, when addressing a specific remote sensing image, the image size and the granularity of the task determine the ideal granularity of independence, while the number of tokens determines the actual granularity of independence. Additionally, a visual tokenizer should be image-agnostic and task-agnostic. Therefore, the number of tokens should be adjusted to accommodate images of any size and tasks of any granularity. We refer to this property as the adaptivity of the visual tokenizer.

In conclusion, an ideal visual tokenizer should possess homogeneity (SIRs as the basic elements of vision) and adaptivity (allowing for arbitrary adjustment of the token quantity).

3.2 Confusion matrix for tokens and objects

Figure 2: Confusion matrix of tokens and objects. In a simplified scenario and under strict definitions, the confusion matrix of tokens and objects reveals two general routings for constructing homogeneous visual tokens: (1) splitting and merging and (2) merging and splitting.

Figure 1-c illustrates the confusion matrix between tokens and objects. In this section, we define this confusion matrix under more rigorous definitions and constraints.

First, we simplify the scenario: as shown in the left image of each set of images in Figure 2, we assume that a remote sensing image contains only two objects, and we are not concerned with the background. The tokens corresponding to the background regions are referred to as “background tokens”.

Second, we define the following concept:

Definition 1

“Region of the token” refers to the pixel area on the original image that contains the semantic information encapsulated by a specific token.

For ease of articulation and accuracy, we further define two types of relationships between objects and regions of the tokens:

Definition 2

“Cover” refers to the complete inclusion of an object in the region of the token on the image.

Definition 3

“Overlap” refers to the presence of an intersection between the region of the token and an object.

From the above definitions, it is evident that if the region of the token covers an object, they must overlap, but the converse does not hold.

Based on the above definitions, we can provide precise definitions for the four scenarios mentioned in Figure 1-c and Figure 2:

Definition 4

“Same Object Same Token” means that, excluding background tokens, each region of the token covers one and only one object, and each region of the token is distinct from other regions.

Definition 5

“Same Object Multiple Tokens” means that no region of the token covers an object, and each region overlaps at most one object.

Definition 6

“Same Token Multiple Objects” means that, excluding background tokens, each region of the token covers at least two objects.

Definition 7

“Multiple Tokens Multiple Objects” means that there is a region of the token that overlaps with two objects but covers at most one object.

As shown in Figure 2, the above definitions completely describe the relationship between tokens and objects.
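Definitions 2 to 7 translate naturally into set operations if regions and objects are modeled as sets of pixel indices. The following sketch is one possible reading of the definitions, not code from the paper: treating a background token as a region that overlaps no object, reading “covers one and only one object” as “covers exactly one object and overlaps no other”, and the toy 12-pixel one-dimensional image are all assumptions made for illustration.

```python
def covers(region, obj):
    """Definition 2: the object lies entirely inside the token's region."""
    return obj <= region


def overlaps(region, obj):
    """Definition 3: the token's region and the object share at least one pixel."""
    return not region.isdisjoint(obj)


def classify(regions, objects):
    """Map a tokenization to one of the four scenarios (Definitions 4 to 7)."""
    # Background tokens overlap no object and are ignored.
    fg = [r for r in regions if any(overlaps(r, o) for o in objects)]
    if not fg:
        return "unclassified"
    n_cov = [sum(covers(r, o) for o in objects) for r in fg]
    n_ovl = [sum(overlaps(r, o) for o in objects) for r in fg]
    distinct = len({frozenset(r) for r in fg}) == len(fg)
    if distinct and all(c == 1 and v == 1 for c, v in zip(n_cov, n_ovl)):
        return "Same Object Same Token"
    if all(c == 0 and v <= 1 for c, v in zip(n_cov, n_ovl)):
        return "Same Object Multiple Tokens"
    if all(c >= 2 for c in n_cov):
        return "Same Token Multiple Objects"
    if any(v >= 2 and c <= 1 for c, v in zip(n_cov, n_ovl)):
        return "Multiple Tokens Multiple Objects"
    return "unclassified"


# Toy 1-D "image" with pixels 0..11 and two objects A and B.
A, B = set(range(0, 4)), set(range(6, 10))
cases = {
    "object-aligned": [A, B, {4, 5, 10, 11}],
    "over-split": [{0, 1}, {2, 3}, {6, 7}, {8, 9}, {4, 5, 10, 11}],
    "one big token": [set(range(12))],
    "fixed patches": [{0, 1, 2}, {3, 4, 5, 6}, {7, 8, 9}, {10, 11}],
}
for name, regions in cases.items():
    print(name, "->", classify(regions, [A, B]))
```

The “fixed patches” case mimics Patch Embed: the patch boundaries ignore the objects, so one patch straddles both A and B without covering either.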

3.3 General routing to homogeneous visual tokens

In Section 1, we argued that “Same Object Multiple Tokens” makes it difficult for the model to learn complete object features, that “Same Token Multiple Objects” makes it difficult for the model to learn the relationships between objects, and that “Multiple Tokens Multiple Objects” inherits the drawbacks of the above two cases. The ideal scenario between tokens and objects should be “Same Object Same Token”, and we refer to such a token as a homogeneous visual token. In other words, in the four squares of Figure 2, only square 1 is ideal. Because the default visual tokenizer, Patch Embed, belongs to square 4, constructing the ideal visual token can be abstracted as a routing selection problem from square 4 to square 1.

According to the confusion matrix in Figure 2, without considering diagonal routings, each step has only two choices: “move left” and “move up.” “Move left” refers to keeping the same number of objects but reducing the number of tokens from multiple to one. For example, going from ② to ① signifies that multiple tokens corresponding to one object are reduced to one token corresponding to one object, with the basic operation for this step being to “merge” tokens. “Move up” signifies keeping the same number of tokens but reducing the number of objects from multiple to one. For example, going from ③ to ① signifies that one token corresponding to multiple objects is reduced to one token corresponding to one object, with the basic operation for this step being to “split” tokens.

Thus, we have obtained two general routings for constructing homogeneous visual tokens:

(1) Splitting and Merging, which corresponds to the routing ④ → ② → ① in Figure 2: first, the tokens are split until no region of the token overlaps with two or more objects, achieving “Same Object Multiple Tokens”; then, the tokens are merged until no object corresponds to multiple tokens, achieving “Same Object Same Token”.

(2) Merging and Splitting, which corresponds to the routing ④ → ③ → ① in Figure 2: first, the tokens are merged until no object corresponds to multiple tokens, achieving “Same Token Multiple Objects”; second, the tokens are split until no token corresponds to multiple objects, achieving “Same Object Same Token”.
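The “Splitting and Merging” routing can be traced on a toy example. The sketch below is illustrative only: it uses a ground-truth pixel-to-object map (`label_of`, a hypothetical name) to decide where to split and what to merge, which a real tokenizer does not have access to; the point is merely to show the two meta-operations moving a patch-like tokenization to object-aligned tokens.

```python
def split(tokens, label_of):
    """Split every token so that each piece contains pixels of a single label."""
    pieces = []
    for tok in tokens:
        groups = {}
        for px in sorted(tok):
            # Pixels missing from label_of are background (label None).
            groups.setdefault(label_of.get(px), set()).add(px)
        pieces.extend(groups.values())
    return pieces


def merge(tokens, label_of):
    """Merge tokens that carry the same label (assumes single-label tokens)."""
    merged = {}
    for tok in tokens:
        label = label_of.get(next(iter(tok)))
        merged.setdefault(label, set()).update(tok)
    return list(merged.values())


# Toy 1-D image: object A on pixels 0..3, object B on pixels 6..9.
label_of = {px: "A" for px in range(0, 4)}
label_of.update({px: "B" for px in range(6, 10)})

patches = [{0, 1, 2}, {3, 4, 5, 6}, {7, 8, 9}, {10, 11}]  # patch-like start
after_split = split(patches, label_of)        # no token overlaps two objects
after_merge = merge(after_split, label_of)    # one token per object

print(sorted(sorted(t) for t in after_split))
print(sorted(sorted(t) for t in after_merge))
```

After the split step every object is spread over several single-object tokens; after the merge step each object, and the background, is held by exactly one token.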

4 Method

Section 3.3 outlined two general routings for constructing homogeneous visual tokens. Based on the “Splitting and Merging” routing, we designed a simple HOmogeneous visual tOKenizer: HOOK. In Section 4.1, we provide an overview of the structure of HOOK. Section 4.2 and Section 4.3 detail two important modules in HOOK: the object perception module and the object vectorization module. In Section 4.4, we discuss how HOOK, as a visual tokenizer, adapts to images of any size and tasks of any granularity.

4.1 Overview

Figure 3: The architecture of HOOK, which mainly consists of two modules: the object perception module (OPM), which is responsible for perceiving semantically independent regions, and the object vectorization module (OVM), which is responsible for vectorizing semantically independent regions into tokens.

The overall architecture of HOOK is shown in Figure 3. HOOK can be viewed as a function $T$ that maps the image $I$ to $N$ $D$-dimensional vectors, with each vector representing a token, as follows:

$$t = T(I) \tag{1}$$

where $t \in R^{N \times D}$ and $I \in R^{H \times W \times C}$.

HOOK primarily consists of two modules: the object perception module (OPM) and the object vectorization module (OVM).

The OPM is aimed at perceiving semantically independent regions within the image. Specifically, the OPM initially splits the image into several $4 \times 4$ pixel-sized seeds using convolutional blocks, followed by stacked local and global self-attention layers to expand the seeds into semantically independent regions (as detailed in Section 4.2). The process above can be formalized as follows:

$$f = P(I) \tag{2}$$

where $P$ represents the object perception module and $f \in R^{\frac{HW}{16} \times d}$ represents the seeds after passing through the self-attention layers, where $d$ denotes the dimension of the seeds.

The OVM is aimed at vectorizing semantically independent regions into tokens while achieving arbitrary adjustment of the token quantity. Specifically, we define $N$ learnable vectors $q$ as queries and utilize the cross-attention mechanism to merge seeds belonging to the same semantically independent region into homogeneous tokens. Here, $N$ acts as a variable hyperparameter to enable arbitrary adjustment of the token quantity (as detailed in Section 4.3). The process above can be formalized as follows:

$$t = V(f, q) \tag{3}$$

where $q \in R^{N \times D}$ and $V$ represents the object vectorization module.

In conclusion, the entire tokenizer is formalized as:

$$t = T(I) = V(P(I), q) \tag{4}$$

4.2 Object Perception Module

HOOK treats semantically independent regions as the basic elements of vision. How do we find SIRs in an image? We take inspiration from a similar problem in NL tokenization, namely how to find semantically meaningful subwords, and examine the elegant solution provided by BPE[45]: first, the language is split into the smallest elements, such as individual letters. Second, the most frequently occurring letter combinations in the corpus are counted and new tokens are constructed from those combinations, iterating this process. This approach corresponds to the “Splitting and Merging” routing mentioned in Section 3.3. Our designed OPM consists of three main steps: 1. Splitting the image into several fine-grained seeds; 2. Merging the seeds into SIRs; 3. Stopping and merging.
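For readers unfamiliar with BPE, the split-then-merge loop described above can be sketched in a few lines. This is a simplified, character-level illustration on a toy corpus (no end-of-word markers or byte-level handling), and the function names are hypothetical; it is not the implementation from [45].

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none.

    words maps a tuple of symbols to its frequency in the corpus.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def bpe(corpus, num_merges):
    # Step 1 (splitting): every word becomes a sequence of single letters.
    words = dict(Counter(tuple(w) for w in corpus))
    merges = []
    # Step 2 (merging): repeatedly fuse the most frequent adjacent pair.
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(words)
```

After two merges the frequent substring "low" has become a single symbol, which mirrors how the OPM is meant to grow fine-grained seeds into larger semantically independent regions.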

Step 1: Splitting the image into several fine-grained seeds

One of the most intuitive approaches inspired by BPE is to treat each pixel in the image as a seed and then to gradually merge the seeds to form SIRs. However, using pixels as initial seeds is too expensive, as a $224 \times 224$ image has as many as 50,176 pixels. To address this issue, we introduce a hypothesis specifically for remote sensing images:

Assumption 1

In remote sensing images, there are no complete objects within a $4 \times 4$ pixel window.

Under Assumption 1, the size of any SIR is larger than $4 \times 4$ pixels, so dividing the image into several $4 \times 4$ pixel-sized seeds actually converts “Multiple Tokens Multiple Objects” into “Same Object Multiple Tokens”, as shown in Figure 2 from ④ to ②.

In the implementation, we use convolutional modules to obtain the seeds described above. Specifically, we stack two convolutional layers with a kernel size of 2 and a stride of 2 to extract a local feature vector from each $4 \times 4$ pixel window, and this vector becomes the seed. Additionally, to enhance the representation capability of the seeds and support the subsequent expansion into SIRs, we follow the traditional CNN architecture and intersperse convolutional layers (kernel size 3, stride 1, padding 1), batch normalization layers, and rectified linear unit (ReLU) activation layers within the convolutional block. At the end of the convolutional block, we add a dimension projection layer with a kernel size of 1 and a stride of 1 to project the image features to a 512-dimensional space.
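The spatial bookkeeping of this convolutional block can be checked with the standard convolution output-size formula. The sketch below is only arithmetic, not the actual network: the exact ordering of the layers in `ASSUMED_LAYERS` is an assumption, since the text states only which layer types are interspersed.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) per layer. ASSUMED ordering: a 3x3 layer before
# each 2x2/stride-2 layer, then the final 1x1 projection.
ASSUMED_LAYERS = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0), (1, 1, 0)]


def seed_grid(height, width, layers=ASSUMED_LAYERS):
    """Return the (rows, cols) of the seed grid produced from an image."""
    for kernel, stride, padding in layers:
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
    return height, width


for side in (224, 512):
    rows, cols = seed_grid(side, side)
    print(f"{side}x{side} image -> {rows}x{cols} grid = {rows * cols} seeds")
```

Only the two stride-2 layers change the resolution, so each seed corresponds to one $4 \times 4$ cell of the input grid and a $224 \times 224$ image yields $50{,}176 / 16 = 3{,}136$ seeds.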

Step 2: Merging the seeds into SIRs

How can seeds be expanded into SIRs? In natural language processing, the self-attention mechanism essentially relates and aggregates tokens, making tokens with semantic relationships more similar. This phenomenon has also been observed in various visual data[42, 47, 48, 49]. Inspired by this finding, we attempt to introduce a self-attention layer to make the seeds belonging to the same SIR more similar to achieve expansion from seeds to SIRs.

In the implementation, we follow SAM[50, 51] and do not use the standard self-attention module. Instead, we introduce local attention and global attention: first, local attention establishes relationships among neighboring seeds, and then global attention extends these relationships to the whole image. The ablation experiments in Section 5.5 demonstrate that local attention, global attention, and their stacking play important roles in the overall performance of HOOK.
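The difference between the two attention types can be illustrated by listing which seeds a given seed is allowed to attend to. The sketch below assumes non-overlapping square windows for local attention, in the style of windowed ViT attention; the window size of 14 is a hypothetical value chosen for illustration and is not taken from the paper.

```python
def attendable(seed, grid_h, grid_w, win=None):
    """Return the seeds that `seed` may attend to.

    seed: (row, col) position on the seed grid
    win:  side of the non-overlapping local window, or None for global attention
    """
    r0, c0 = seed
    return [
        (r, c)
        for r in range(grid_h)
        for c in range(grid_w)
        if win is None or (r // win, c // win) == (r0 // win, c0 // win)
    ]


GRID = 56  # a 224 x 224 image split into 4 x 4 seeds gives a 56 x 56 grid
local_set = attendable((0, 0), GRID, GRID, win=14)   # hypothetical window size
global_set = attendable((0, 0), GRID, GRID)
print(len(local_set), "seeds visible under local attention")
print(len(global_set), "seeds visible under global attention")
```

Under these assumptions, local attention first relates each seed to the other seeds in its window, and the subsequent global layer lets relations span window boundaries.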

Step 3: Stopping and merging

According to the analysis in Section 3.1, the ideal granularity of semantically independent regions depends on the specific image and task. Therefore, in the stop strategy for the continuous merging of seeds into SIRs, we did not introduce additional constraints or guidance. The expansion process of seeds naturally stops after passing through one layer of local attention and one layer of global attention, and it is jointly optimized with the entire model under the supervision of the specific task loss function.

Figure 4: HOOK is capable of adapting to tasks with different granularities. For sparse tasks represented by classification, the model can average the tokens output by the model and then pass them through a linear classification layer to output the classification results. For dense tasks represented by segmentation, the model can utilize the intermediate variable, the attention map from the OVM, to restore the number of tokens and then pass them to the segmentation head to output the segmentation results.

4.3 Object Vectorization Module

The OPM merges seeds into SIRs, where seeds belonging to the same SIR will be more similar. The OVM is aimed at the vectorization of SIRs into visual tokens, and the number of visual tokens can be arbitrarily adjusted to meet adaptability requirements.

A. Jaegle et al. proposed a perception model framework called the Perceiver[52], which consists of only cross-attention and self-attention layers. This model can perceive multiple modalities without changing its structure. The core idea is that the modality data serve as the key and value for cross-attention, while the model’s predefined latent variables serve as queries. The latent variables are repeatedly exposed to the original modality data through stacked cross-attention, continuously refining the model’s perceptual results. Moreover, because only the latent variables pass through self-attention, the model’s efficiency is greatly improved.

The success of the Perceiver has inspired us in two ways: (1) The predefined query in cross-attention has the ability to perceive semantic information in the key and value. (2) The number of tokens obtained through cross-attention is determined only by the query. Furthermore, from the calculation principle of cross-attention, as indicated by Equation 5, it can be observed that the operation of calculating similarity between the query and the key matches the way the OPM finds SIRs (i.e., making seeds belonging to the same SIR more similar).

$$v' = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{5}$$

Therefore, we implement the object vectorization module based on cross-attention. Specifically, we define $N$ learnable vectors $q$ as the query for cross-attention, while the seeds outputted by the object perception module serve as key and value. During the forward process, each vector in $q$ queries the seeds to retrieve the most similar seed set and aggregates them with weighting to form visual tokens.
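The aggregation in Equation 5 can be sketched in plain Python as follows. This is a simplified single-head version that omits the learned query, key, and value projections as well as the MLP, so it should be read as an illustration of the mechanism rather than as the OVM itself; the toy seeds and queries are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (Equation 5, no learned projections).

    queries: N x d, keys: M x d, values: M x D
    Returns N tokens (N x D) and the N x M attention map.
    """
    d = len(keys[0])
    tokens, attn = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # one distribution over the M seeds
        attn.append(weights)
        tokens.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return tokens, attn


# Four toy seeds: the first two resemble each other, as do the last two.
seeds = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]  # N = 2 queries stand in for learned vectors
tokens, attn = cross_attention(queries, seeds, seeds)
print(len(tokens), "tokens from", len(seeds), "seeds")
```

The number of output tokens equals the number of queries regardless of how many seeds there are, which is the property the OVM relies on for adaptability.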

In the specific implementation, we empirically found that when the dimension of cross-attention is defined as half of the object perception module dimension, i.e., 256, HOOK performs the best. Additionally, to align the dimensions of the visual tokens with the subsequent backbone network dimensions, we added an MLP after the cross-attention layer. The dimensions of the MLP and the learnable vector q are set to the dimensions of the backbone network. For example, if the backbone network is a standard ViT, the dimension is 768.

4.4 HOOK is image-agnostic and task-agnostic

HOOK does not depend on specific images or tasks; that is, it is image-agnostic and task-agnostic.

First, HOOK can accept images of any size. HOOK uses a convolutional block to extract local features from the image (as detailed in Section 4.2). Since convolutional operations are not sensitive to image size, HOOK can handle images of any size. Considering that a large-scene image may yield too many seeds when a $4 \times 4$ pixel window is used, which would increase the computational cost of the two self-attention layers in the OPM, we implemented adjustable seed sizes. In the experiments in Section 5.2, we report the changes in efficiency and accuracy due to variations in the convolution window size and show that even for large-scene remote sensing images, HOOK maintains high efficiency and accuracy.

Furthermore, HOOK demonstrates excellent performance on both dense tasks and sparse tasks. In sparse prediction tasks such as classification, the model only needs to output sparse discriminative information for the original image. As shown in Figure 4, after the tokens outputted by HOOK go through the backbone network, they can directly pass through a linear classification layer to produce classification results. In dense prediction tasks such as semantic segmentation, the model needs to output not only discriminative information but also pixel-level spatial information. We observed that the cross-attention in the OVM can output not only visual tokens but also intermediate variables: attention maps, which indicate the region represented by each visual token (i.e., region of the token mentioned in Section 3.2). Therefore, we use the attention map to reconstruct the original number of visual tokens, thus restoring spatial information. The experiments in Section 5.2 demonstrate that HOOK requires only 8 tokens to perform the semantic segmentation task.
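One plausible way to restore spatial information from the attention map is sketched below. The text does not give the exact formula, so both steps are assumptions made for illustration: each seed position is assigned to the token that attends to it most strongly (its "region of the token"), and a dense feature is formed as the attention-weighted mix of the tokens. The function name `restore_dense` is hypothetical.

```python
def restore_dense(tokens, attn):
    """Expand N tokens back to M seed positions using an N x M attention map.

    tokens: N x D token vectors
    attn:   N x M attention map with strictly positive entries
            (row n holds the weights of token n over the M seeds)
    Returns (dense, regions): M x D features and, per seed, the index of the
    token with the largest attention weight.
    """
    n_tokens, n_seeds = len(attn), len(attn[0])
    dim = len(tokens[0])
    dense, regions = [], []
    for m in range(n_seeds):
        column = [attn[n][m] for n in range(n_tokens)]
        total = sum(column)
        weights = [c / total for c in column]  # renormalise over tokens
        regions.append(max(range(n_tokens), key=lambda n: column[n]))
        dense.append([
            sum(weights[n] * tokens[n][c] for n in range(n_tokens))
            for c in range(dim)
        ])
    return dense, regions


toy_tokens = [[1.0, 0.0], [0.0, 1.0]]
toy_attn = [
    [0.45, 0.45, 0.05, 0.05],  # token 0 mostly covers seeds 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # token 1 mostly covers seeds 2 and 3
]
dense, regions = restore_dense(toy_tokens, toy_attn)
print(regions)
```

Because the attention map is already computed inside the cross-attention, a reconstruction of this kind adds no trainable parameters.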

5 Experiment

5.1 Setup

A complete transformer-based visual model should consist of three main components: a visual tokenizer, a transformer backbone, and a task head. The visual tokenizer is responsible for converting raw images into token sequences, with the most popular visual tokenizer currently being Patch Embed. The transformer backbone serves as the core of the model, extracting and understanding the semantic information within the token sequences. The task head aligns the output from the backbone with specific downstream tasks. In this paper, we propose a new visual tokenizer called HOOK, which is aimed at replacing Patch Embed to enhance the model’s ability to understand visual data.

To fairly and effectively test the performance of HOOK, unless specified otherwise, we used the same transformer backbone (12-layer transformer encoder) and task head (linear classifier for classification tasks & SegFormer[53] head for segmentation tasks) in all experiments. Notably, many improvements to visual tokenizers not only change the structure of the visual tokenizer but also introduce new modules or mechanisms into the backbone or task head[26, 28, 36, 40, 54]. To accurately assess the impact of the visual tokenizer on the model’s performance, our baseline for comparison excludes these types of methods and only includes work that focuses solely on improving the visual tokenizer.

For the credibility of our experimental results, our first principle in selecting datasets is that they are publicly available and widely researched. Additionally, we aimed to validate that HOOK is image-agnostic and task-agnostic while testing its performance. Therefore, we tested the performance of HOOK on both sparse tasks, represented by classification, and dense tasks, represented by segmentation. Specifically, for the classification tasks, we chose the NWPU-RESISC45[18] and WHU-RS19 datasets[19, 20], with both datasets having fixed image sizes of $224 \times 224$. For the segmentation task, we selected the GID5 dataset[21], which has a fixed image size of $512 \times 512$.

The experiments demonstrated that HOOK possesses two fundamental properties (homogeneity and adaptability) and exhibits excellent performance and efficiency. Section 5.2 presents the test results of HOOK on the aforementioned two types of tasks and quantitatively compares the efficiency of HOOK with Patch Embed. The visualization results of the homogeneity of visual tokens are shown in Section 5.3. Additionally, we conducted more in-depth analysis and validation of the phenomena observed in the experiments. Section 5.4 experimentally verifies the simple intuition that under a stronger visual tokenizer, deeper backbone networks are redundant. The ablation results and analysis of the two types of attention mechanisms within the OPM are presented in Section 5.5, which shows that both attention mechanisms play important roles in improving the homogeneity of visual tokens, thereby affecting the final accuracy of the model. In Section 5.6, we discuss some limitations of HOOK to provide a comprehensive view of the HOOK method.

5.2 Main Results

We tested the performance of the model with HOOK as the visual tokenizer on two categories of tasks: sparse tasks and dense tasks. The essential difference between the two types of tasks is that sparse tasks only require sparse discriminative information, such as categories, while dense tasks require not only discriminative information but also dense spatial information. Representative sparse tasks include classification, re-identification, sentiment analysis, etc., while representative dense tasks include segmentation, detection, depth estimation, etc.

Table 1: Sparse and Dense Task with HOOK

| Tokenizer | Sparse Task (Classification): Num of Tokens | NWPU 2:8 (Acc1) | WHU-RS19 (Acc1) | Dense Task (Segmentation): Num of Tokens | GID5 (mIoU) |
|---|---|---|---|---|---|
| Patch Embed | 196 | 70.30 | 78.10 | 1024 | 67.79 |
| Quadtree | 100 | 70.99 | 81.05 | 400 | 49.18 |
| VT | 8 | 71.95 | 80.07 | 8 | 58.12 |
| PnP-DETR | 54 | 73.53 | 81.37 | 158 | 62.12 |
| Conv-VGG19 | 49 | 74.46 | 83.66 | 256 | 53.73 |
| Conv-ResNet50 | 49 | 75.17 | 81.37 | 256 | 68.81 |
| HOOK (ours) | 6 | 77.38 | 87.58 | 8 | 78.81 |

Table 1 displays the experimental results. To accurately evaluate the impact of different visual tokenizers on the overall performance of the model, the baseline for comparison includes only works that optimize the visual tokenizer while keeping the transformer backbone and task head consistent. In our selected baseline, Quadtree[35] and PnP-DETR[33] are patch-based methods. The former assumes that all patches in the image are unreasonable and uses GradCAM to identify significant regions, implementing the generation of patches with different granularities using quadtrees. The latter reveals that the number of patches is redundant, using a scoring mechanism to identify background and foreground regions and merging features of background areas. VT[42] is an object-oriented method that uses a simple convolution operation to assign each pixel in the image to one of several semantic groups, which are eventually mapped to a visual token. Additionally, some studies[25, 33] have suggested that the combination of convolution and a transformer leads to better performance for transformer-based models, so we also tested the performance of two classic convolutional networks (VGG[55] and ResNet[56]) as visual tokenizers.

A. Sparse Task

For sparse prediction, we chose the classification tasks on the NWPU-RESISC45 (20% as the training set) and WHU-RS19 datasets. The image size is $224 \times 224$, and the metric used is the top-1 accuracy. The experimental results in Table 1 indicate that HOOK requires only 6 tokens to outperform the standard Patch Embed by 7.08 and 9.48 percentage points, respectively, on the two datasets. Compared to the baselines, HOOK achieves state-of-the-art performance in terms of both token numbers and classification accuracy.

B. Dense Task

For dense prediction, we selected the semantic segmentation task on the GID5 dataset. The image size is $512 \times 512$, and the metric used is the mIoU. The dense adapter introduced in Section 4.4, which restores spatial information from the OVM attention map, addresses the challenge of completing dense tasks with a small number of tokens without additional training parameters. The experimental results in Table 1 demonstrate that HOOK requires only 8 tokens to outperform the standard Patch Embed by 11.02 mIoU. Compared to the baselines, HOOK achieves state-of-the-art performance in terms of both token numbers and segmentation accuracy.

C. Efficiency

In addition to accuracy, the efficiency of the visual tokenizer is also crucial. In our experiments, as shown in Table 2, we compared the changes in overall model efficiency and accuracy when Patch Embed and HOOK were used as visual tokenizers.

Although the structure of HOOK is more complex than that of Patch Embed, HOOK significantly reduces the number of tokens, which greatly improves the efficiency of the backbone network. The results in Table 2 indicate that with a pixel window size of $4 \times 4$, the efficiency of HOOK is lower than that of Patch Embed. As the pixel window size increases, the model’s efficiency improves, while the accuracy gradually decreases. When the pixel window size is $8 \times 8$, HOOK demonstrates a distinct advantage in efficiency over Patch Embed in terms of the MACs metric (10.66 G vs. 16.86 G), with only a slight decrease in accuracy compared to the $4 \times 4$ version of HOOK (0.17), while still outperforming Patch Embed by 6.91. This finding suggests that HOOK has distinct advantages over Patch Embed in terms of both accuracy and efficiency.
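The trade-off can be made tangible with simple counting. The sketch below does not reproduce the MACs in Table 2; it only counts how many seeds enter the OPM for each window size on a $224 \times 224$ image and, assuming a dense global self-attention that compares every pair of elements, how many pairwise interactions that implies.

```python
def num_seeds(image_side, window):
    """Number of seeds when a square image is cut into window x window cells."""
    return (image_side // window) ** 2


def attention_pairs(n):
    """Pairwise interactions in one dense self-attention layer over n elements."""
    return n * n


IMAGE_SIDE = 224
for window in (4, 8, 16, 32):
    seeds = num_seeds(IMAGE_SIDE, window)
    print(f"window {window:>2} x {window:<2}: {seeds:>4} seeds, "
          f"{attention_pairs(seeds):>8} pairs in a global OPM layer")

# Token counts from Table 1 for the classification backbone.
print("backbone pairs per layer, Patch Embed (196 tokens):", attention_pairs(196))
print("backbone pairs per layer, HOOK (6 tokens):", attention_pairs(6))
```

Under this assumption, the OPM attention cost falls steeply as the window grows, while the backbone cost is small in every HOOK configuration because only a handful of tokens reach it.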

Table 2: Efficiency of HOOK

| Tokenizer | Base token size (pixels) | MACs (G) | Acc. (%) |
|---|---|---|---|
| Patch Embed | $16 \times 16$ | 16.86 | 70.30 |
| HOOK (ours) | $32 \times 32$ | 5.87 | 71.73 |
| | $16 \times 16$ | 6.80 | 74.13 |
| | $8 \times 8$ | 10.66 | 77.21 |
| | $4 \times 4$ | 26.10 | 77.38 |

5.3 Visualization of homogeneous tokens

Figure 5: Visualization results of homogeneous visual tokens. The left image shows the original image, while the right image displays the region of the token, with each color representing one region of the token. The images in (a) are obtained from the NWPU-RESISC45 classification dataset, while the images in (b) are obtained from the GID5 semantic segmentation dataset.

Table 3: Classification with fewer backbone layers

| Tokenizer | Backbone Layers | Params (M) | NWPU 2:8 (Acc1) | WHU-RS19 (Acc1) |
|---|---|---|---|---|
| Patch Embed | 12 | 85.68 | 70.30 | 78.10 |
| | 1 | 7.71 | 66.45 | 74.84 |
| | 3 | 21.89 | 69.72 | 78.76 |
| HOOK (ours) | 1 | $\underline{20.90}$ | $\underline{76.88}$ | $\underline{86.93}$ |
| | 12 | 98.86 | 78.81 | 87.58 |

In Section 4.3, we mentioned that HOOK introduces a cross-attention mechanism that utilizes $N$ learnable query vectors to aggregate image features into $N$ tokens. We saved the intermediate variable, the attention map, which records the correspondence between the $N$ tokens output by HOOK and the original image features. By visualizing the attention map, we can observe the region of the token.

Figure 5 displays the visualization results, where each color represents one region of the token. The results indicate that each token roughly corresponds to a specific object, achieving the goal of SIRs as basic elements of vision. For more visualization results, please refer to Appendix A.

5.4 Stacking up to 12 layers is redundant

The standard ViT model consists of two main modules, Patch Embed and 12 layers of self-attention modules, with the primary computational cost concentrated in the latter. A simple intuition is that if a stronger visual tokenizer replaces Patch Embed, the 12 layers of self-attention modules may become redundant.

To validate this intuition, we conducted tests on classification tasks, and Table 3 displays the experimental results. In Table 3, boldface indicates the best performance, and underlining indicates the second-best performance. The experimental results show that when the number of backbone layers is reduced from 12 layers to 1 layer, HOOK incurs less accuracy loss than models using Patch Embed as the visual tokenizer. Specifically, on the NWPU dataset, Patch Embed incurs a loss of 3.85, while HOOK only incurs a loss of 1.93. On the WHU-RS19 dataset, Patch Embed incurred a loss of 3.26, while HOOK incurred only a loss of 0.65. These experimental results preliminarily confirm our intuition.

HOOK has more parameters than Patch Embed. To further illustrate that the effectiveness of HOOK stems from its homogeneity and adaptability rather than solely from the increase in parameters, we also compared “Patch Embed + 3 backbone layers” and “HOOK + 1 backbone layer” in Table 3. The former model has an overall parameter count of 21.89 M, while the latter model has 20.90 M, making their overall number of parameters similar. The results in Table 3 indicate that HOOK outperforms Patch Embed by 7.16 on the NWPU dataset (76.88 vs. 69.72) and by 8.17 on the WHU-RS19 dataset (86.93 vs. 78.76), demonstrating that the effectiveness of HOOK does not solely depend on the increase in the number of parameters.

Interestingly, when Patch Embed is deployed as the visual tokenizer and the number of backbone layers is reduced from 12 layers to 3 layers, the model’s accuracy only shows a slight fluctuation, and in some cases, such as in the WHU-RS19 dataset, it even improves. This phenomenon is even more pronounced in the results of the segmentation task in Table 4. Both Patch Embed and HOOK show an increase in accuracy, with Patch Embed showing a more significant improvement, further highlighting the redundancy of the 12-layer backbone network. This conclusion may serve as inspiration for the design of future visual backbone models.

Table 4: Segmentation with fewer backbone layers

| Tokenizer | Backbone Layers | Params (M) | GID5 (mIoU) |
|---|---|---|---|
| Patch Embed | 4 | 64.77 | 69.28 |
| | 12 | 121.47 | 67.79 |
| HOOK (ours) | 4 | $\underline{84.74}$ | 78.91 |
| | 12 | 141.44 | $\underline{78.81}$ |

5.5 Ablation

Figure 6: Visualization of the ablation of local and global attention. The homogeneity of visual tokens is optimal when both attention modules are simultaneously utilized.
Table 5: Local and Global Attention
Tokenizer Object Perception Module Params(M) Acc.(%)
Local Attention Global Attention
HOOK(ours) 92.56 70.15
95.71 75.13
95.71 73.74
98.86 77.38

When designing the OPM, we referred to the implementation of the SAM model and incorporated its local attention module. In Table 5, we present ablation experiments conducted on the local attention and global attention modules. In Figure 6, we visualize the homogeneous tokens under four different conditions: “No Attention module”, “Local Attention only”, “Global Attention only”, and “Local & Global Attention”. When either local attention or global attention is lacking, the homogeneity of visual tokens significantly decreases, leading to a decrease in model accuracy. However, when local attention and global attention are stacked, HOOK is better able to identify SIRs, and the model’s accuracy reaches its highest point. This finding demonstrates that local and global attention mechanisms play important roles in improving the homogeneity of visual tokens, thereby influencing the performance of HOOK.

5.6 Limitations of HOOK

Previous experiments have shown that HOOK has the two properties mentioned above. However, it is important to acknowledge the following limitations of HOOK:

(1) Homogeneity: Ideally, a homogeneous visual tokenizer should rely on semantic information in the image when identifying SIRs. However, from a visualization perspective, HOOK does not completely eliminate the influence of local textures and colors and still considers color and texture similarity as one of the criteria for determining SIRs. This feature results in poor visualization performance of HOOK in some complex-textured remote sensing images. Additionally, the limitation imposed by local attention also contributes to the decrease in homogeneity. Figure 7 illustrates three typical scenarios with relatively poor homogeneity: (a) constrained by local attention, (b) overly complex colors, and (c) overly complex textures.

In addition, we did not add any additional guidance or constraints to HOOK. HOOK is only supervised by the downstream task loss function when searching for SIRs. Therefore, we are currently not clear on how adding additional prior assumptions (such as connectivity assumptions) will affect the SIRs and the accuracy of the final model. This is one of the directions we will delve into in the future.

Table 6: Number of tokens in HOOK

| Tokenizer | Num of Tokens | Acc. (%) |
|---|---|---|
| HOOK (ours) | 6 | 77.38 |
| | 8 | 77.03 |
| | 16 | 77.15 |
| | 32 | 76.60 |
| | 64 | 76.25 |

Figure 7: Three typical cases of poor homogeneity: (a) shows that HOOK is limited by the attention range of the local attention module, resulting in poor homogeneity areas appearing in the lower right corner; (b) shows that HOOK is influenced by complex color information and fails to find semantic information in the image; and (c) shows that HOOK is influenced by complex texture information.

(2) Adaptability: HOOK can adjust the number of visual tokens as needed. Experiments have shown that it exhibits good adaptability in tasks involving different granularities and images of different sizes. In theory, the larger the number of tokens, the finer the original image information that is reflected. However, the empirical experiments in Table 6 show that when we increase the number of tokens in classification tasks, the accuracy of HOOK decreases, although not monotonically. This phenomenon indicates that the number of visual tokens may have a more complex nonlinear relationship with the original image and the specific task. This is an issue that we will delve into in the future.

6 Discussion

6.1 The visual tokenizer is the pupil of the machine

We believe that visual tokenizers play an extremely important role in transformer-based visual models and even in multimodal models. If the visual model is the eye of the machine, then the visual tokenizer is the machine’s pupil. Its importance is mainly manifested in the following two aspects:

(1) The visual tokenizer acts as a bridge between the original image and the model. On the one hand, the visual tokenizer needs to directly perceive the high-dimensional original image and compress it into a low-dimensional space to improve model efficiency. On the other hand, the completeness of the perception of the original image information by the visual tokenizer directly determines the upper limit of the model’s understanding of the image. In essence, the visual tokenizer compresses the original image into a lossless or low-loss format that conforms to the model’s input format (i.e., a token sequence) while striking a balance between efficiency and information completeness.

(2) The visual tokenizer directly affects the granularity and efficiency of the model in understanding images. From the perspective of the original image, visual tokens represent the basic elements of the image. From the model’s perspective, a unimodal visual model uses the meaning of tokens and the relationships among tokens to understand the image, while multimodal models establish relationships among modalities at the token level. Therefore, the quality of the visual tokenizer directly affects the granularity and efficiency of image understanding in visual models and multimodal models.

6.2 Rethinking HOOK

HOOK belongs to the first routing in Section 3.3: “Splitting and Merging”. First, under the assumption that “there are no complete objects in a $4 \times 4$ pixel window,” HOOK undergoes extreme splitting of the image. When split to a fine enough level, visual tokens will inevitably satisfy “Same Object Multiple Tokens”. Subsequently, we use a self-attention mechanism to associate seeds belonging to the same SIR and use cross-attention to merge tokens within the same SIR, ultimately achieving “Same Object Same Token”.

From this perspective, the main drawback of HOOK is the lack of precision in its splitting operation. In theory, only tokens that overlap with multiple objects need to be split. However, HOOK’s rough splitting operation may split tokens that already satisfy homogeneity, making it more difficult to judge whether tokens belong to the same object during subsequent merging operations. This disadvantage reduces the efficiency of the model. The experiment in Table 2 validates this point: in the case where the pixel window size is $4 \times 4$, the efficiency of HOOK is lower than that of Patch Embed.

6.3 Future work

In addition to the “Splitting and Merging” method represented by HOOK, the “Merging and Splitting” method is equally interesting. One possible implementation is to obtain tokens that can represent all objects in the image and then gradually peel off tokens based on the independence between objects. During this process, we may encounter the entanglement issue between semantically similar objects. Resolving this entanglement is a key challenge that needs to be addressed in this approach.

Furthermore, comparing the two alternative routings is another worthwhile direction. For example, both routings will encounter the issue of semantic granularity in the second step. The “Splitting and Merging” method needs to consider whether tokens that represent the fuselage and wings of an aircraft should be merged into one token during the merging process. On the other hand, the “Merging and Splitting” method considers whether the token that represents the aircraft should be split into the fuselage and wings. The opposite implementations of the same problem in these two methods may cause fundamental differences in certain application scenarios.

Finally, although we only considered the case of “Same Object Same Token” as the ideal visual token in the previous analysis, are the scenarios of “Same Object Multiple Tokens” and “Same Token Multiple Objects” truly meaningless? For tasks such as human-object interaction detection[57, 58], which are aimed at gaining a deeper understanding of interaction relationships, could using more tokens to describe the edges of objects in the case of “Same Object Multiple Tokens” be more suitable? We will further explore these interesting questions in future research.

7 Conclusion

Multimodal large language models have brought a revolutionary paradigm shift to the field of remote sensing image understanding. The visual tokenizer, an important and fundamental component, has long been overlooked or even misunderstood. Starting from the essence of the tokenizer, we propose two fundamental properties that an ideal visual tokenizer should possess: (1) Homogeneity: semantically independent regions (SIRs) serve as the basic elements of vision; (2) Adaptability: the number of tokens can be arbitrarily adjusted to support images of any size and tasks of any granularity. To construct a visual tokenizer that satisfies these two properties, we rigorously define and analyze the binary relationship between tokens and objects and derive two general paths to obtain homogeneous tokens: “Splitting and Merging” and “Merging and Splitting”. Based on the former, we propose a simple HOmogeneous visual tOKenizer, HOOK. The experimental results show that HOOK achieves “Same Object Same Token” and outperforms Patch Embed and other baseline methods in both sparse prediction tasks, represented by classification, and dense prediction tasks, represented by segmentation.

This paper emphasizes the importance of visual tokenizers in visual foundation models and multimodal large language models and provides a preliminary theoretical basis for the construction of visual tokenizers. However, the current theoretical analysis and experimental research are still insufficient. For example, the homogeneity that we propose essentially describes the properties of individual tokens, but we have not discussed in detail the emergent behavior exhibited by combinations of multiple tokens. We hope that this initial work can inspire researchers in the field to pay attention to and participate in an in-depth investigation of visual tokens.

References

  • Radford et al. [2018] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training (2018).
  • Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
  • Radford et al. [2019] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
  • Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
  • Touvron et al. [2023a] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023a).
  • Touvron et al. [2023b] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023b).
  • Chowdhery et al. [2023] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, Journal of Machine Learning Research 24 (2023) 1–113.
  • Liu et al. [2024] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, Advances in neural information processing systems 36 (2024).
  • Bai et al. [2023] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, J. Zhou, Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond (2023).
  • Zhan et al. [2024] Y. Zhan, Z. Xiong, Y. Yuan, Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model, arXiv preprint arXiv:2401.09712 (2024).
  • Guo et al. [2024] H. Guo, X. Su, C. Wu, B. Du, L. Zhang, D. Li, Remote sensing chatgpt: Solving remote sensing tasks with chatgpt and visual models, arXiv preprint arXiv:2401.09083 (2024).
  • Zhang et al. [2024] W. Zhang, M. Cai, T. Zhang, Y. Zhuang, X. Mao, Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain, arXiv preprint arXiv:2401.16822 (2024).
  • Mikolov et al. [2013] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
  • Pennington et al. [2014] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  • Kudo and Richardson [2018] T. Kudo, J. Richardson, Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
  • Kudo [2018] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, arXiv preprint arXiv:1804.10959 (2018).
  • Dosovitskiy et al. [2020] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
  • Cheng et al. [2017] G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Benchmark and state of the art, Proceedings of the IEEE 105 (2017) 1865–1883.
  • Xia et al. [2010] G.-S. Xia, W. Yang, J. Delon, Y. Gousseau, H. Sun, H. Maître, Structural high-resolution satellite image indexing, Vienna, Austria, 2010.
  • Dai and Yang [2011] D. Dai, W. Yang, Satellite image classification via two-layer sparse coding with biased image representation, IEEE Transactions on Geoscience and Remote Sensing 8 (2011) 173–176.
  • Tong et al. [2020] X.-Y. Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, L. Zhang, Land-cover classification with high-resolution remote sensing images using transferable deep models, Remote Sensing of Environment 237 (2020) 111322.
  • Chen et al. [2021] X. Chen, S. Xie, K. He, An empirical study of training self-supervised vision transformers, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9640–9649.
  • Qian et al. [2022] S. Qian, Y. Zhu, W. Li, M. Li, J. Jia, What makes for good tokenizers in vision transformer?, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Ryoo et al. [2021] M. S. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, A. Angelova, Tokenlearner: What can 8 learned tokens do for images and videos?, arXiv preprint arXiv:2106.11297 (2021).
  • Xiao et al. [2021] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, R. Girshick, Early convolutions help transformers see better, Advances in neural information processing systems 34 (2021) 30392–30400.
  • Yin et al. [2021] H. Yin, A. Vahdat, J. Alvarez, A. Mallya, J. Kautz, P. Molchanov, Adavit: Adaptive tokens for efficient vision transformer, arXiv preprint arXiv:2112.07658 (2021).
  • Tang et al. [2022] Y. Tang, K. Han, Y. Wang, C. Xu, J. Guo, C. Xu, D. Tao, Patch slimming for efficient vision transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12165–12174.
  • Rao et al. [2021] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, C.-J. Hsieh, Dynamicvit: Efficient vision transformers with dynamic token sparsification, Advances in neural information processing systems 34 (2021) 13937–13949.
  • Pan et al. [2021] B. Pan, R. Panda, Y. Jiang, Z. Wang, R. Feris, A. Oliva, Ia-red 2: Interpretability-aware redundancy reduction for vision transformers, Advances in Neural Information Processing Systems 34 (2021) 24898–24911.
  • Marin et al. [2023] D. Marin, J.-H. R. Chang, A. Ranjan, A. Prabhu, M. Rastegari, O. Tuzel, Token pooling in vision transformers for image classification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 12–21.
  • Liu et al. [2023] Y. Liu, C. Matsoukas, F. Strand, H. Azizpour, K. Smith, Patchdropout: Economizing vision transformers using patch dropout, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 3953–3962.
  • Liang et al. [2022] Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, P. Xie, Not all patches are what you need: Expediting vision transformers via token reorganizations, arXiv preprint arXiv:2202.07800 (2022).
  • Wang et al. [2021] T. Wang, L. Yuan, Y. Chen, J. Feng, S. Yan, Pnp-detr: Towards efficient visual analysis with transformers, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 4661–4670.
  • Chen et al. [2022] W. Chen, X. Huang, X. Liu, H. Wu, F. Qi, Authenticity identification of qi baishi’s shrimp painting with dynamic token enhanced visual transformer, in: Computer Graphics International Conference, Springer, 2022, pp. 554–565.
  • Ronen et al. [2023] T. Ronen, O. Levy, A. Golbert, Vision transformers with mixed-resolution tokenization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4612–4621.
  • Chen et al. [2023] M. Chen, M. Lin, K. Li, Y. Shen, Y. Wu, F. Chao, R. Ji, Cf-vit: A general coarse-to-fine method for vision transformer, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023, pp. 7042–7052.
  • Wang et al. [2021] Y. Wang, R. Huang, S. Song, Z. Huang, G. Huang, Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition, Advances in neural information processing systems 34 (2021) 11960–11973.
  • Yue et al. [2021] X. Yue, S. Sun, Z. Kuang, M. Wei, P. H. Torr, W. Zhang, D. Lin, Vision transformer with progressive sampling, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 387–396.
  • Chen et al. [2021] C.-F. R. Chen, Q. Fan, R. Panda, Crossvit: Cross-attention multi-scale vision transformer for image classification, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 357–366.
  • Beyer et al. [2023] L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, F. Pavetic, Flexivit: One model for all patch sizes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14496–14506.
  • Lee et al. [2021] S. H. Lee, S. Lee, B. C. Song, Vision transformer for small-size datasets, arXiv preprint arXiv:2112.13492 (2021).
  • Wu et al. [2020] B. Wu, C. Xu, X. Dai, A. Wan, P. Zhang, Z. Yan, M. Tomizuka, J. Gonzalez, K. Keutzer, P. Vajda, Visual transformers: Token-based image representation and processing for computer vision, arXiv preprint arXiv:2006.03677 (2020).
  • Yang et al. [2022] T. Yang, Y. Wang, Y. Lu, N. Zheng, Visual concepts tokenization, Advances in Neural Information Processing Systems 35 (2022) 31571–31582.
  • Mei et al. [2024] J. Mei, L.-C. Chen, A. Yuille, C. Xie, Spformer: Enhancing vision transformer with superpixel representation, arXiv preprint arXiv:2401.02931 (2024).
  • Sennrich et al. [2015] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015).
  • Schuster and Nakajima [2012] M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2012, pp. 5149–5152.
  • Ru et al. [2023] L. Ru, H. Zheng, Y. Zhan, B. Du, Token contrast for weakly-supervised semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3093–3102.
  • Zhou et al. [2022] D. Zhou, Q. Hou, L. Yang, X. Jin, J. Feng, Token selection is a simple booster for vision transformers, IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
  • Gong et al. [2021] C. Gong, D. Wang, M. Li, V. Chandra, Q. Liu, Vision transformers with patch diversification, arXiv preprint arXiv:2104.12753 (2021).
  • Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
  • Wightman [2019] R. Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019. doi:10.5281/zenodo.4414861.
  • Jaegle et al. [2021] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, J. Carreira, Perceiver: General perception with iterative attention, in: International conference on machine learning, PMLR, 2021, pp. 4651–4664.
  • Xie et al. [2021] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, Segformer: Simple and efficient design for semantic segmentation with transformers, Advances in neural information processing systems 34 (2021) 12077–12090.
  • Yuan et al. [2021] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, S. Yan, Tokens-to-token vit: Training vision transformers from scratch on imagenet, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 558–567.
  • Simonyan and Zisserman [2014] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
  • He et al. [2016] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • Kim et al. [2021] B. Kim, J. Lee, J. Kang, E.-S. Kim, H. J. Kim, Hotr: End-to-end human-object interaction detection with transformers, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 74–83.
  • Bergstrom and Shi [2020] T. Bergstrom, H. Shi, Human-object interaction detection: A quick survey and examination of methods, in: Proceedings of the 1st International Workshop on Human-centric Multimedia Analysis, 2020, pp. 63–71.

Appendix A Supplementary visualization results

Figure 8: Supplementary visualization results of homogeneous visual tokens