¹ Automotive Software Innovation Center, Chongqing 401331, China
² Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
³ University of Science and Technology of China
⁴ Institute of Intelligent Software, Guangzhou, China
⁵ Saarland University, Germany

A Survey on Visual Mamba

Hanwei Zhang¹,⁴,⁵, Ying Zhu², Dan Wang², Lijun Zhang¹, Tianxiang Chen³, Zi Ye⁴

Abstract

State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Since the self-attention mechanism in transformers has quadratic complexity in image size, with correspondingly growing computational demands, researchers are now exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts contributing to Mamba’s success, including the state space model framework, selection mechanisms, and hardware-aware design. Next, we review vision Mamba models, categorizing them into foundational models and models enhanced with techniques such as convolution, recurrence, and attention. We further delve into the widespread applications of Mamba in vision tasks, including its use as a backbone at various levels of vision processing. This encompasses general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. We introduce general visual tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision.

Keywords:

Mamba, Computer vision, State space model, Application.

1 Introduction

Deep Neural Networks (DNNs) have demonstrated remarkable performance across various artificial intelligence (AI) tasks, with the fundamental architecture playing a crucial role in determining the model’s capabilities. Traditional neural networks typically comprise Multi-Layer Perceptron (MLP) or Fully Connected (FC) layers [1, 2]. Convolutional neural networks (CNNs) [3, 4] introduce convolutional and pooling layers, which are particularly effective for processing shift-invariant data like images. Recurrent neural networks (RNNs) [5, 6] utilize recurrent cells to handle sequential or time series data. To address the issue of CNN, RNN, and GNN models only capturing local relationships, the Transformer [7, 8, 9], introduced in 2017, excels at learning long-distance feature representations. Transformers primarily rely on attention mechanisms, e.g., self-attention and cross-attention, to extract intrinsic features and improve their representation capability. Pre-trained massive transformer-based models, such as GPT-3 [10], deliver robust performance across various NLP datasets, excelling in natural language understanding and generation tasks. The remarkable performance of transformer-based models has propelled their widespread adoption in vision applications. The core strength of transformer models is their ability to capture long-range dependencies and make full use of large datasets. The feature extraction module is the main component of vision transformer architectures. It processes data using a sequence of self-attention blocks, significantly improving the model’s capacity to analyze images.

However, a primary obstacle for Transformers is the substantial computational cost of the self-attention mechanism, which increases quadratically with image resolution. The Softmax operation within the attention blocks further intensifies this cost, presenting significant challenges for implementing these models on edge and low-resource devices. Additionally, real-time computer vision systems utilizing transformer-based models must adhere to stringent low-latency standards to maintain a high-quality user experience. This scenario highlights the continuous evolution of new architectures to enhance performance, although this often comes with the trade-off of higher computational demands. Many new models based on sparse attention mechanisms or innovative neural network paradigms have been proposed to further reduce computational costs while capturing long-range dependencies and maintaining high performance. State space models (SSMs) have emerged as a central focus among these developments. As shown in Fig. 1(a), the number of publications related to SSMs demonstrates an explosive growth trend. Initially devised to simulate dynamic systems in areas such as control theory and computational neuroscience using state variables, SSMs predominantly describe linear time-invariant (LTI) systems when adapted for deep learning.

Figure 1: The number of SSM and Mamba papers released to date (from 2021 to March 2024).

As SSMs have evolved, a new class of selective state space models, termed Mamba [11], has emerged. It advances the modeling of discrete data, such as text, with SSMs through two key improvements. First, it features an input-dependent mechanism that adjusts the SSM parameters dynamically, enhancing information filtering. Second, Mamba uses a hardware-aware algorithm whose cost scales linearly with sequence length, boosting computational speed on modern hardware. Several studies have explored its integration with Mixture-of-Experts (MoE) techniques, as evidenced by works like Jamba [12], MoE-Mamba [13], and BlackMamba [14], which are reported to outperform the state-of-the-art Transformer-MoE architecture with fewer training steps. Inspired by Mamba’s achievements in language modeling, several initiatives are now aiming to adapt this success to the field of vision. As illustrated in Fig. 1(b), since the release of Mamba in December 2023, the number of research papers focusing on Mamba in the vision domain has rapidly increased, reaching a peak in March 2024. This trend suggests that Mamba is emerging as a prominent research area in vision, potentially offering a viable alternative to Transformers. Therefore, a review of current related works is necessary and timely to provide a detailed overview of this new methodology in this evolving field.

Consequently, we present a comprehensive overview of how Mamba models are used in the vision domain. This paper aims to serve as a guide for researchers looking to delve deeper into this area. The critical contributions of our work include:

• This survey paper is the first to provide a comprehensive review of the Mamba technique in the vision domain, explicitly focusing on analyzing the proposed strategies.

• Expanding upon the naive Mamba visual framework, we investigate how Mamba’s capabilities can be enhanced and combined with other architectures to achieve superior performance.

• We offer an in-depth exploration by organizing the literature based on various application tasks. We establish a taxonomy, identify advancements specific to each task, and offer insights on overcoming challenges.

The remainder of the survey is structured as follows: Section 2 examines the general and mathematical concepts underlying Mamba strategies. Section 3 discusses the naive Mamba visual models and how they integrate with other technologies to enhance performance, as proposed in recent years. Section 4 explores the application of Mamba technologies in addressing various computer vision tasks. Finally, Section 5 concludes the survey.

2 Formulation of Mamba

Mamba [11] was initially introduced in the domain of natural language processing. As shown in Fig. 2, the original Mamba block integrates a gated MLP into the State Space Model (SSM) architecture of H3 [15], utilizing an SSM sandwiched between two gated connections alongside a standard local convolution. For the activation $\sigma$, the SiLU [16], also known as the Swish activation function [17], is used. The Mamba architecture consists of repeated Mamba blocks interleaved with standard normalization and residual connections. An optional normalization layer (LayerNorm in the original Mamba) is used in a similar location as in RetNet [18].

2.1 State Space Models (SSMs)

Consider a structured state space model (SSM) that maps a one-dimensional sequence $x(t)\in\mathbb{R}^{L}$ to $y(t)\in\mathbb{R}^{L}$ through a hidden state $h(t)\in\mathbb{R}^{N}$. With the evolution parameter $\mathbf{A}\in\mathbb{R}^{N\times N}$ and the projection parameters $\mathbf{B}\in\mathbb{R}^{N\times 1}$, $\mathbf{C}\in\mathbb{R}^{1\times N}$, such a model is formulated as the linear ordinary differential equations

$$\begin{aligned}
h^{\prime}(t) &= \mathbf{A}h(t)+\mathbf{B}x(t),\\
y(t) &= \mathbf{C}h(t).
\end{aligned}\tag{1}$$

Discretization.

To adapt them for deep learning, State Space Models (SSMs), as continuous-time models, are discretized under a Zero-Order Hold (ZOH) assumption. Thus, the continuous-time parameters $\mathbf{A},\mathbf{B}$ are transformed to their discretized counterparts $\overline{\mathbf{A}},\overline{\mathbf{B}}$ with a timescale parameter $\Delta$ according to

$$\begin{aligned}
\overline{\mathbf{A}} &= \exp(\Delta\mathbf{A}),\\
\overline{\mathbf{B}} &= (\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A})-\mathbf{I})\cdot\Delta\mathbf{B}.
\end{aligned}\tag{2}$$

Thus, (1) can be rewritten as

$$\begin{aligned}
h_{t} &= \overline{\mathbf{A}}h_{t-1}+\overline{\mathbf{B}}x_{t},\\
y_{t} &= \mathbf{C}h_{t}.
\end{aligned}\tag{3}$$

To enhance computational efficiency and scalability, the iterative process in (3) can be synthesized through a global convolution

$$\begin{aligned}
\overline{\mathbf{K}} &= (\mathbf{C}\overline{\mathbf{B}},\mathbf{C}\overline{\mathbf{A}}\overline{\mathbf{B}},\cdots,\mathbf{C}\overline{\mathbf{A}}^{L-1}\overline{\mathbf{B}}),\\
\mathbf{y} &= \mathbf{x}*\overline{\mathbf{K}},
\end{aligned}\tag{4}$$

where $L$ is the length of the input sequence $\mathbf{x}$ , $\overline{\mathbf{K}}\in\mathbb{R}^{L}$ serves as the kernel of the SSMs and $*$ represents the convolution operation.
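The relationship between (2), (3), and (4) can be checked numerically. The following is a minimal NumPy sketch, assuming a diagonal $\mathbf{A}$ (so that the matrix exponential is elementwise) and a zero initial state; the function names `discretize_zoh`, `ssm_recurrent`, and `ssm_convolutional` are illustrative and not taken from any Mamba implementation.

```python
import numpy as np

def discretize_zoh(a_diag, b, delta):
    """Zero-order-hold discretization, Eq. (2), assuming A is diagonal."""
    a_bar = np.exp(delta * a_diag)                          # exp(delta*A), elementwise for diagonal A
    b_bar = (a_bar - 1.0) / (delta * a_diag) * (delta * b)  # (delta*A)^{-1} (exp(delta*A) - I) delta*B
    return a_bar, b_bar

def ssm_recurrent(a_bar, b_bar, c, x):
    """Eq. (3): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t, starting from h = 0."""
    h = np.zeros_like(a_bar)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a_bar * h + b_bar * x_t
        y[t] = c @ h
    return y

def ssm_convolutional(a_bar, b_bar, c, x):
    """Eq. (4): y = x * K with K_k = C A_bar^k B_bar for k = 0, ..., L-1."""
    length = len(x)
    powers = a_bar[None, :] ** np.arange(length)[:, None]   # (L, N): row k holds diag(A_bar^k)
    kernel = powers @ (c * b_bar)                           # (L,)
    return np.convolve(x, kernel)[:length]                  # keep only the causal part

rng = np.random.default_rng(0)
N, L = 4, 16
a_diag = -np.arange(1.0, N + 1)   # negative diagonal entries give a stable system
b = rng.standard_normal(N)
c = rng.standard_normal(N)
x = rng.standard_normal(L)

a_bar, b_bar = discretize_zoh(a_diag, b, delta=0.1)
y_rec = ssm_recurrent(a_bar, b_bar, c, x)
y_conv = ssm_convolutional(a_bar, b_bar, c, x)
print(np.allclose(y_rec, y_conv))
```

The recurrent form costs one state update per step, while the convolutional form precomputes the kernel once and is only valid because $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\mathbf{C}$ do not change over time.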

Architectures.

SSMs often serve as independent sequence transformations that can be integrated into end-to-end neural network architectures. Here we introduce several fundamental architectures. Linear attention [19] approximates self-attention with a recurrence mechanism and can be seen as a simplified form of linear SSM. H3 [15], as illustrated in Fig. 2, places an SSM between two gated connections and inserts a standard local convolution before it. Following H3, Hyena [20] replaces the SSM layer with an MLP-parameterized global convolution [21]. RetNet [22] introduces an extra gate and uses a simpler SSM. RetNet enables an alternative parallelizable computation path and employs a variant of multi-head attention (MHA) instead of convolutions. Inspired by the attention-free Transformer [23], the recent RNN design RWKV [24] can be interpreted as the ratio of two SSMs, because its primary "WKV" mechanism involves linear time-invariant (LTI) recurrences.

Selective State Space Models.

Traditional SSMs offer linear time complexity, but their ability to represent sequence context is inherently limited by their time-invariant parameterization. To overcome this constraint, Selective State Space Models introduce a selective scan for interactions among sequential states with

$$\begin{aligned}
\mathbf{B} &= S_{\mathbf{B}}(\mathbf{x}),\\
\mathbf{C} &= S_{\mathbf{C}}(\mathbf{x}),\\
\Delta &= \tau_{\Delta}(\Delta+S_{\Delta}(\mathbf{x})),
\end{aligned}\tag{5}$$

before applying (2) and (3), so that the parameters $\mathbf{B}\in\mathbb{R}^{B\times L\times N}$, $\mathbf{C}\in\mathbb{R}^{B\times L\times N}$ and $\Delta\in\mathbb{R}^{B\times L\times D}$ depend on the input sequence $\mathbf{x}\in\mathbb{R}^{B\times L\times D}$, where $B$ represents the batch size and $D$ the number of channels. Normally, $S_{\mathbf{B}}$ and $S_{\mathbf{C}}$ are linear parameterized projections to dimension $N$, i.e. $\mathrm{Linear}_{N}(\cdot)$, while $S_{\Delta}(\mathbf{x})=\mathrm{Broadcast}_{D}(\mathrm{Linear}_{1}(\mathbf{x}))$ and $\tau_{\Delta}=\mathrm{softplus}$. The choice of $S_{\Delta}$ and $\tau_{\Delta}$ is due to a connection to RNN gating mechanisms, explained in Section 2.2.
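To make the shapes in (5) concrete, here is a minimal, unoptimized NumPy sketch of a selective SSM for a single sequence (the batch dimension $B$ is dropped). It assumes a diagonal $\mathbf{A}$ stored per channel as an array of shape $(D, N)$ and plain weight matrices for the linear projections; the names `selective_ssm`, `w_b`, `w_c`, `w_delta`, and `delta_bias` are hypothetical, and the sequential Python loop stands in for the fused scan used in practice.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def selective_ssm(x, a, w_b, w_c, w_delta, delta_bias):
    """Sequential selective SSM for one sequence.

    x: (L, D) input; a: (D, N) diagonal entries of A per channel (negative);
    w_b, w_c: (D, N) weights for S_B, S_C; w_delta: (D, 1) weights for S_Delta;
    delta_bias: (D,) learned offset added before softplus.
    """
    length, channels = x.shape
    state_dim = a.shape[1]
    B = x @ w_b                                  # (L, N): input-dependent B
    C = x @ w_c                                  # (L, N): input-dependent C
    delta = softplus(delta_bias + x @ w_delta)   # (L, 1) broadcast to (L, D), always > 0
    h = np.zeros((channels, state_dim))
    y = np.empty((length, channels))
    for t in range(length):
        a_bar = np.exp(delta[t][:, None] * a)           # Eq. (2), first line
        b_bar = (a_bar - 1.0) / a * B[t][None, :]       # Eq. (2), second line (delta cancels)
        h = a_bar * h + b_bar * x[t][:, None]           # Eq. (3), state update
        y[t] = h @ C[t]                                 # Eq. (3), readout per channel
    return y

rng = np.random.default_rng(0)
L, D, N = 6, 3, 4
x = rng.standard_normal((L, D))
a = -np.tile(np.arange(1.0, N + 1), (D, 1))
params = dict(
    w_b=rng.standard_normal((D, N)),
    w_c=rng.standard_normal((D, N)),
    w_delta=rng.standard_normal((D, 1)),
    delta_bias=np.zeros(D),
)
y = selective_ssm(x, a, **params)
print(y.shape)
```

Because $\overline{\mathbf{A}}$ and $\overline{\mathbf{B}}$ now differ at every step, the fixed kernel of (4) no longer exists, which is why the recurrence is evaluated directly here.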

2.2 Other Key Concepts in Mamba

Selection Mechanism.

The connection between RNN gating and the discretization of continuous-time systems is well-established [25]. The classical gating mechanism of RNNs is an instance of the selection mechanism for SSMs. When $N=1$, $\mathbf{A}=-1$, $\mathbf{B}=1$, $S_{\Delta}=\mathrm{Linear}(\mathbf{x})$ and $\tau_{\Delta}=\mathrm{softplus}$, the selective SSM recurrence takes the form

$$\begin{aligned}
g_{t} &= \sigma(\mathrm{Linear}(x_{t})),\\
h_{t} &= (1-g_{t})h_{t-1}+g_{t}x_{t}.
\end{aligned}\tag{6}$$
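This special case can be verified numerically from the ZOH formulas in (2). The sketch below is only a sanity check under the stated setting ($N=1$, $\mathbf{A}=-1$, $\mathbf{B}=1$); the variable `z` stands for the pre-activation $\mathrm{Linear}(x_t)$ and is an illustrative name.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)      # stand-in values for Linear(x_t)
delta = softplus(z)                 # tau_Delta = softplus, so delta > 0
a, b = -1.0, 1.0                    # scalar A and B from the special case

a_bar = np.exp(delta * a)                           # Eq. (2): exp(delta*A)
b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)   # Eq. (2): (delta*A)^{-1}(exp(delta*A) - 1) delta*B

g = sigmoid(z)
print(np.allclose(a_bar, 1.0 - g), np.allclose(b_bar, g))
```

The check relies on the identity $\exp(-\mathrm{softplus}(z)) = 1/(1+e^{z}) = 1-\sigma(z)$, so $\overline{\mathbf{A}}$ plays the role of the forget gate and $\overline{\mathbf{B}}$ the role of the input gate.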

Scan.

The selection mechanism is devised to address the constraints of linear time-invariant (LTI) models. However, because the parameters now vary with the input, the global convolution in (4) no longer applies, which reintroduces the computational issue associated with SSMs. To enhance GPU utilization and efficiently materialize the state $h$ within the memory hierarchy, hardware-aware state expansion is enabled by selective scan. By incorporating kernel fusion and recomputation with parallel scan, the fused selective scan layer effectively reduces the amount of memory I/O operations, leading to a significant acceleration compared to conventional implementations.
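The parallel scan mentioned above works because the time-varying recurrence $h_t = a_t h_{t-1} + b_t$ is a composition of affine maps, and composing affine maps is associative. The following NumPy sketch illustrates the idea with a Hillis-Steele style scan that needs about $\log_2 L$ vectorized passes; it is an illustration of the principle only, not the fused CUDA kernel used by Mamba, and the function names are assumptions.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h = np.zeros_like(b[0])
    out = np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def parallel_scan(a, b):
    """Inclusive scan over affine maps (a_t, b_t), i.e. h -> a_t * h + b_t.

    Combining an earlier map (a1, b1) with a later map (a2, b2) gives
    (a2 * a1, a2 * b1 + b2). Each pass doubles the window each entry covers.
    """
    a = a.copy()
    b = b.copy()
    shift = 1
    while shift < len(a):
        # Right-hand sides are evaluated into temporaries before assignment.
        b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return b  # with a zero initial state, the accumulated offset equals h_t

rng = np.random.default_rng(0)
a_seq = rng.uniform(0.5, 1.0, size=13)   # per-step decay, e.g. entries of A_bar
b_seq = rng.standard_normal(13)          # per-step input, e.g. B_bar * x_t
h_seq = sequential_scan(a_seq, b_seq)
h_par = parallel_scan(a_seq, b_seq)
print(np.allclose(h_seq, h_par))
```

On a GPU, each pass of the `while` loop would be executed in parallel across positions, which is what removes the strictly sequential dependency of the recurrence.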

Discussion.

Compared to RNNs and LSTMs, which struggle with vanishing gradients and long-range dependencies, Mamba offers efficient computation and memory utilization. While transformers excel in batch processing and handling long-range dependencies through attention mechanisms, they incur high computational costs, especially during inference. Mamba introduces a selective state space model, incorporating input-dependent matrices to enhance adaptability while maintaining the computational advantages of traditional SSMs. Mamba bridges the gap between traditional SSMs and modern neural network architectures by offering a selective dependency mechanism, optimal GPU memory utilization, and linear scalability with context length, thus providing a promising solution for various sequential data processing tasks.

3 Mamba for Vision

The original Mamba block is designed for one-dimensional sequences, yet vision-related tasks require processing multi-dimensional inputs like images, videos, and 3D representations. Consequently, to adapt Mamba for these tasks, enhancements to the scanning mechanism and architecture of the Mamba block are crucial to effectively handle multi-dimensional inputs.

In this section, we present efforts aimed at enabling Mamba to tackle vision-related tasks while enhancing its efficiency and performance. Initially, we delve into two foundational works: Vision Mamba [26] and VMamba [27]. These works introduced the ViM block and the VSS block, respectively, serving as the foundation for subsequent research endeavors. Subsequently, we explore additional works focused on refining the Mamba architecture as a backbone for vision-related tasks. Lastly, we discuss works that integrate Mamba with other architectures such as convolution, recurrence, and attention.

3.1 Visual Mamba Block.

Drawing inspiration from the visual transformer architecture, it seems natural to preserve the framework of the transformer model and substitute the attention block with a Mamba block, keeping the rest of the process intact. The crux of the matter lies in adapting the Mamba block to vision-related tasks. Nearly simultaneously, Vision Mamba and VMamba presented their respective solutions: the ViM block and the VSS block.

ViM.

The ViM block [26], also referred to as the Bidirectional Mamba block, annotates image sequences with position embeddings and condenses visual representations using bidirectional state space models. It processes the input both forward and backward, employing a one-dimensional convolution for each direction, as shown in Fig. 4(a). The Softplus function ensures a non-negative $\Delta$. The forward and backward outputs $\mathbf{y}$ are computed via the state space model described in equations (2) and (3), and then combined through SiLU gating to produce the output token sequence, as shown in Fig. 3(a).
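The bidirectional idea can be sketched as follows. This is a deliberately simplified illustration: a scalar LTI recurrence (`causal_scan`, with assumed constants `a_bar` and `b_bar`) stands in for the selective SSM, and the per-direction convolutions and linear projections of the actual ViM block are omitted. All names are hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def causal_scan(x, a_bar=0.9, b_bar=0.1):
    """Stand-in for the SSM of Eqs. (2)-(3): position t only sees tokens 0..t."""
    h = np.zeros(x.shape[1:])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a_bar * h + b_bar * x[t]
        out[t] = h
    return out

def bidirectional_block(x, z):
    """Scan the token sequence in both directions and gate the sum with SiLU(z)."""
    y_forward = causal_scan(x)
    y_backward = causal_scan(x[::-1])[::-1]   # scan the flipped sequence, then flip back
    gate = silu(z)
    return y_forward * gate + y_backward * gate

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 2))   # 8 patch tokens with 2 channels
gate_input = np.ones_like(tokens)
y_bi = bidirectional_block(tokens, gate_input)
print(y_bi.shape)
```

With a single causal scan, the first token never sees the rest of the image; adding the backward pass gives every token access to context from both ends of the flattened sequence.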

VSS.

The Visual State Space (VSS) block [27] incorporates the pivotal state space model operation. It begins by directing the input through a depth-wise convolution layer, followed by a SiLU activation function, and then through the state space model outlined in equations (2) and (3), employing an approximate $\overline{\mathbf{B}}$. Afterwards, the output of the state space model is subjected to layer normalization before being merged with the output of the other information streams, as shown in Fig. 3(b). To address the direction-sensitivity issue, the authors introduce the Cross-Scan Module (CSM), which traverses the spatial domain and converts any non-causal visual image into ordered patch sequences, as shown in Fig. 4(b). They approximate $\overline{\mathbf{B}}$ using the first-order Taylor series $\overline{\mathbf{B}}=(\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A})-\mathbf{I})\cdot\Delta\mathbf{B}\approx(\Delta\mathbf{A})^{-1}(\Delta\mathbf{A})\Delta\mathbf{B}=\Delta\mathbf{B}$.
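A minimal sketch of the cross-scan idea, assuming the four traversal orders are row-major, column-major, and their two reversals, and that the four outputs are merged by summing them back at their original positions. The helper names `cross_scan` and `cross_merge` are illustrative, and the per-sequence SSM is replaced by the identity so that the round trip can be checked.

```python
import numpy as np

def cross_scan(feat):
    """Unfold an (H, W, C) feature map into four token sequences of shape (H*W, C)."""
    height, width, channels = feat.shape
    row_major = feat.reshape(height * width, channels)
    col_major = feat.transpose(1, 0, 2).reshape(height * width, channels)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])

def cross_merge(seqs, height, width):
    """Undo each traversal order and sum the four sequences into an (H, W, C) map."""
    channels = seqs.shape[-1]
    rows = seqs[0] + seqs[2][::-1]                 # both row-wise scans, in row-major order
    cols = seqs[1] + seqs[3][::-1]                 # both column-wise scans, in column-major order
    cols_as_rows = cols.reshape(width, height, channels).transpose(1, 0, 2)
    return rows.reshape(height, width, channels) + cols_as_rows

feat = np.arange(24.0).reshape(2, 3, 4)   # H=2, W=3, C=4
seqs = cross_scan(feat)
merged = cross_merge(seqs, 2, 3)          # identity "SSM": every position is counted four times
print(seqs.shape, np.array_equal(merged, 4 * feat))
```

In a real block, each of the four sequences would pass through its own selective scan before merging, so every patch aggregates context arriving from four different directions.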

3.2 Pure Mamba

ViM-based.

Inspired by the vision transformer architecture, Vision Mamba [26] replaces the transformer encoder with a vision Mamba encoder based on ViM blocks while retaining the remainder of the process. This involves converting the two-dimensional image into flattened patches, followed by a linear projection of these patches into vectors and the addition of position embeddings. A class token represents the entire patch sequence, and subsequent steps involve normalization layers and an MLP layer to derive the final predictions.

LocalMamba [28] builds on the ViM block and introduces a novel scanning methodology that includes localized scanning within distinct windows to capture detailed local information in conjunction with global context. In addition, LocalMamba searches over scanning directions across different network layers to identify and apply the most effective scanning combinations. The authors propose two variants, with plain and hierarchical structures. Their LocalVim block combines four scanning directions (cf. Fig. 4(d)), namely the ViM scan and a scan that partitions tokens into distinct windows, each together with its flipped counterpart to allow scanning from tail to head, with a state space module and a spatial and channel attention module (SCAttn).
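The windowed traversal can be illustrated by computing the token order directly. The sketch below assumes square, non-overlapping windows visited in row-major order, with row-major order inside each window; `local_scan_order` is a hypothetical helper, and the flipped counterpart is simply the reversed order.

```python
import numpy as np

def local_scan_order(height, width, window):
    """Return flat token indices so that each window is scanned completely before the next."""
    assert height % window == 0 and width % window == 0, "grid must tile evenly"
    idx = np.arange(height * width).reshape(height // window, window, width // window, window)
    # Axes are (window row, row in window, window col, col in window);
    # move both window axes to the front, then flatten.
    return idx.transpose(0, 2, 1, 3).reshape(-1)

order = local_scan_order(4, 4, window=2)
flipped = order[::-1]        # tail-to-head counterpart
print(order.tolist())
```

Indexing a flattened token sequence with `order` places tokens of the same window next to each other, so that nearby patches are also nearby in the sequence seen by the state space model.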

VSS-based.

Like Vision Mamba, VMamba [27] first partitions the input image into patches; it then processes them in four stages. In Stage 1, VMamba stacks several VSS blocks on the feature map with resolution $\frac{H}{4}\times\frac{W}{4}$. In Stage 2, before further VSS blocks are applied, the feature map from Stage 1 goes through a patch merging operation for down-sampling in order to build hierarchical representations, resulting in an output resolution of $\frac{H}{8}\times\frac{W}{8}$. Stages 3 and 4 repeat this procedure, with resolutions of $\frac{H}{16}\times\frac{W}{16}$ and $\frac{H}{32}\times\frac{W}{32}$, respectively.
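As a small worked example of the stage resolutions above, the helper below (a hypothetical function, assuming an initial patch size of 4 and a factor-2 down-sampling before each later stage) lists the feature-map size per stage.

```python
def stage_resolutions(height, width, patch=4, num_stages=4):
    """Feature-map size per stage: H/4 x W/4, H/8 x W/8, H/16 x W/16, H/32 x W/32."""
    return [(height // (patch * 2 ** i), width // (patch * 2 ** i)) for i in range(num_stages)]

resolutions = stage_resolutions(224, 224)
print(resolutions)
```

For a 224 x 224 input, the token count drops from 56 x 56 = 3136 in Stage 1 to 7 x 7 = 49 in Stage 4, which is what makes the hierarchical design cheaper in the deeper stages.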

Based on the VSS block, the PlainMamba block [29] enhances its capability to learn features from two-dimensional images through two main mechanisms: (i) employing a continuous 2D scanning process to improve spatial continuity, ensuring that tokens in the scanning sequence are adjacent, as illustrated in Fig. 4(c), and (ii) incorporating direction-aware updating to enable the model to discern spatial relations among tokens by encoding directional information. PlainMamba mitigates the spatial discontinuity that arises in the 2D scanning mechanisms of ViM and VMamba when moving to a new row or column: instead of jumping back to the start, it continues scanning in the reversed direction until it reaches the final vision token of the image. Furthermore, PlainMamba eliminates the need for special tokens.
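The continuous scan in (i) amounts to a serpentine traversal. A minimal sketch, assuming a row-wise scan in which every odd row is traversed right to left (`continuous_scan_order` is a hypothetical helper name):

```python
import numpy as np

def continuous_scan_order(height, width):
    """Row-wise serpentine order: consecutive tokens are always spatial neighbours."""
    idx = np.arange(height * width).reshape(height, width)
    idx[1::2] = idx[1::2, ::-1].copy()   # reverse every odd row
    return idx.reshape(-1)

snake = continuous_scan_order(3, 4)
rows, cols = np.divmod(snake, 4)
step_lengths = np.abs(np.diff(rows)) + np.abs(np.diff(cols))   # Manhattan distance between steps
print(snake.tolist(), step_lengths.max())
```

In a plain raster scan, the step from the end of one row to the start of the next spans the full image width; in the serpentine order every step has length one.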

Among lightweight model designs, EfficientVMamba [30] improves the capabilities of VMamba with an atrous-based selective scan approach, i.e. Efficient 2D Scanning (ES2D). Instead of scanning all patches from various directions, which increases the total number of patches, ES2D scans forward vertically and horizontally while skipping patches, so that the number of patches remains unchanged, as shown in Fig. 4(e). Their Efficient Visual State Space (EVSS) block comprises a convolutional branch for local features and an ES2D-based SSM branch for global features, with both branches followed by a squeeze-excitation block. They employ EVSS blocks in Stage 1 and Stage 2, while opting for Inverted Residual blocks in Stage 3 and Stage 4 to enhance the capture of global representations.
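One way to picture the skipping strategy is as a partition of the token grid into strided sub-grids, each of which is scanned once. The sketch below assumes a stride of 2 and row-major order inside each group; it only demonstrates that the total number of scanned tokens stays at $H \times W$, and it does not reproduce the scan directions or other details of the EfficientVMamba implementation. The names `es2d_groups` and `es2d_merge` are hypothetical.

```python
import numpy as np

def es2d_groups(height, width, stride=2):
    """Split flat token indices into stride*stride atrous (strided) groups."""
    assert height % stride == 0 and width % stride == 0, "grid must tile evenly"
    idx = np.arange(height * width).reshape(height, width)
    return [idx[m::stride, n::stride].reshape(-1)
            for m in range(stride) for n in range(stride)]

def es2d_merge(group_outputs, height, width, stride=2):
    """Scatter per-group outputs of shape (H*W / stride^2, C) back onto the (H, W, C) grid."""
    channels = group_outputs[0].shape[-1]
    merged = np.empty((height, width, channels))
    k = 0
    for m in range(stride):
        for n in range(stride):
            merged[m::stride, n::stride] = group_outputs[k].reshape(
                height // stride, width // stride, channels)
            k += 1
    return merged

rng = np.random.default_rng(0)
grid = rng.standard_normal((4, 4, 3))
tokens = grid.reshape(16, 3)
groups = es2d_groups(4, 4)
outputs = [tokens[g] for g in groups]     # identity stand-in for the per-group scan
restored = es2d_merge(outputs, 4, 4)
print([g.tolist() for g in groups])
```

Four scans over a quarter of the tokens each touch every token exactly once, whereas four full-direction scans process four times as many tokens.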

Visual data as multi-dimensional data.

Since visual data is a form of multi-dimensional data, existing models for multi-dimensional data also apply to vision-related tasks, but they often cannot facilitate inter- and intra-dimension communication or are data-independent. The MambaMixer block [31] introduces a dual selection mechanism spanning tokens and channels. It then links the sequential selective mixers through a weighted averaging mechanism, enabling layers to directly access the inputs and outputs of other layers. Mamba-ND [32] extends the SSM to higher dimensions by alternating the sequence ordering across layers. Utilizing a scanning strategy similar to VMamba in the 2D scenario, it extends this approach to 3D. Additionally, the authors advocate the use of a multi-head SSM as an analogue to multi-head attention. In response to the inefficiencies and performance challenges encountered by traditional transformers in image and time series processing, the Simplified Mamba-based Architecture (SiMBA) [33] incorporates the Mamba block for sequence modeling and Einstein FFT (EinFFT) for channel modeling, with the goal of enhancing the stability and efficiency of the model in image and time series tasks. The Mamba block proves effective at processing long sequences, while EinFFT is a novel channel modeling technique. The reported experimental results show SiMBA surpassing existing State Space Models and transformers across multiple benchmarks.

Summary of 2D scanning mechanisms.

Scanning is a key component of Mamba, and the choice of scanning mechanism matters for multi-dimensional inputs. We summarize the existing 2D scanning mechanisms in Fig. 4. In particular, the direction-aware updating of PlainMamba [29] employs a set of learnable parameters $\{\Theta_{k}\}$ to represent the four cardinal directions plus a special begin direction for the initial token, and reformulates (3) as

$$\begin{aligned}
h^{\prime}_{k}(t) &= \overline{\mathbf{A}}_{t}h_{k}(t)+(\overline{\mathbf{B}}_{t}+\overline{\Theta}_{k,t})x(t),\\
y^{\prime}(t) &= \sum_{k=1}^{4}\mathbf{C}_{t}h^{\prime}_{k}(t),\\
y(t) &= y^{\prime}(t)\odot z(t).
\end{aligned}\tag{7}$$

Expanding on the fundamental structure of Mamba and (7), we can devise the additional scanning mechanisms depicted in Fig. 4.

As an important element of Mamba, scanning mechanisms not only contribute to efficiency but also provide contextual information in vision-related tasks. We summarize the usage of different scanning mechanisms in existing works in Table 1. Cross-Scan [27] and BiDirectional Scan [26] stand out as the most widely adopted scanning mechanisms. However, various other scanning mechanisms serve specific purposes. For example, 3D BiDirectional Scan [36] and Spatiotemporal Selective Scan [38] are tailored for video inputs. Local Scan [28] focuses on gathering local information, while ES2D [30] prioritizes efficiency.

Table 1: Summary of the Scanning mechanisms used in visual Mamba.

Scanning Mechanism	Methods
BiDirectional Scan [26]	Vision Mamba [26], Motion Mamba [37], HARMamba [40], MMA [41], VL-Mamba [42], Video Mamba Suite [43], Point Mamba [44], LMa-UNet [45], Motion-Guided Dual-Camera Tracker [46]
Cross-Scan [27]	VMamba [27], VL-Mamba [42], VMRNN [47], RES-VMAMBA [48], Sigma [49], ReMamber [50], Mamba-UNet [51], Semi-Mamba-UNet [52], VMambaMorph [53], ChangeMamba [54], H-vmunet [55], MambaMIR [56], MambaIR [57], Serpent [58], Mamba-HUNet [59], TM-UNet [60], Swin-UMamba [61], UltraLight VM-UNet [62], VM-UNet [63], VM-UNET-V2 [64], MedMamba [65], MIM-ISTD [66], RS3Mamba [67]
Continuous 2D Scanning [29]	PlainMamba [29]
Local Scan [28]	LocalMamba [28], FreqMamba [68]
Efficient 2D Scanning (ES2D) [30]	EfficientVMamba [30]
Zigzag Scan [34]	ZigMa [34]
Omnidirectional Selective Scan [35]	VmambaIR [35], RS-Mamba [69]
3D BiDirectional Scan [36]	VideoMamba [36]
Hierarchical Scan [37]	Motion Mamba [37]
Spatiotemporal Selective Scan [38]	Vivim [38]
Multi-Path Scan [39]	RSMamba [39]

3.3 Mamba with Other Architectures

Mamba, being a novel component compared to convolution, recurrence, and attention, offers opportunities for synergistic combinations with other architectures that are still relatively underexplored. In this section, we examine existing exploratory findings on such combinations.

Mamba with Convolution.

Combining Mamba with convolution gives the model access to local information, which is essential for tasks involving medical images or segmentation. RES-VMAMBA [48] pioneers the incorporation of a residual learning framework within the VMamba model to simultaneously leverage the global and local state features inherent in the original VMamba design. The architecture begins with a stem module that processes the input image, followed by a series of VSS blocks organized sequentially across four stages. Unlike the original VMamba framework, RES-VMAMBA adopts the VMamba structure as its backbone and directly integrates the raw data into the feature map. The authors refer to this integration as the global-residual mechanism to distinguish it from the residual structure inside the VSS block. It aims to share global image features alongside the information processed through the VSS blocks, harnessing both the localized details captured by individual VSS blocks and the global features of the unprocessed input, thereby enhancing the model’s representational capacity and improving its performance on tasks that require a comprehensive understanding of visual data.

Mamba with Recurrence.

To harness the long-sequence modeling capabilities of Mamba blocks and the spatiotemporal representation prowess of LSTM, the VMRNN [47] Cell eliminates all weights and biases in ConvLSTM [70] and employs VSS blocks to learn spatial dependencies vertically. In the VMRNN Cell, long-term and short-term temporal dependencies are captured by updating the cell states and hidden states from a horizontal perspective. Building upon the VMRNN Cell, two variants are proposed: VMRNN-B and VMRNN-D. VMRNN-B primarily relies on stacking VMRNN layers, while VMRNN-D incorporates more VMRNN Cells and introduces Patch Merging and Patch Expanding layers. The Patch Merging layer performs downsampling, reducing the spatial dimensions of the data, which decreases computational complexity and helps capture more abstract, global features. Conversely, the Patch Expanding layer performs upsampling, increasing the spatial dimensions to restore detail and enable precise localization of features in the reconstruction phase. Finally, the reconstruction layer takes the hidden state from the VMRNN layer and scales it back to the input size, generating the predicted frame for the next time step. Integrating downsampling and upsampling offers significant advantages in this predictive architecture: downsampling simplifies the input representation, enabling the model to process higher-level features with reduced computational overhead, which is particularly useful for capturing complex patterns and relationships in the data at a more abstract level.

Mamba with Attention.

The SSM-ViT block [71] is introduced for effective event-based information processing. It comprises three main components: a self-attention block (Block-SA), a dilated attention block (Grid-SA), and an SSM block. Block-SA focuses on immediate spatial relations and provides a detailed representation of nearby features. Grid-SA offers a global perspective, capturing comprehensive spatial relations and overall input structure. The SSM block ensures temporal consistency and smooth information transition between consecutive time steps. By integrating SSM with self-attention, the SSM-ViT block enables faster training and parameter timescale adjustment for temporal aggregation.

The Meet More Areas (MMA) block introduced in [41] adopts a MetaFormer-style architecture, comprising two Layer Normalization layers, a token mixer (consisting of a channel attention mechanism and a ViM block in parallel), and an MLP block for deep feature extraction. There are two main reasons for this choice. First, models adopting MetaFormer-style architectures have shown promising results, indicating the potential for achieving favorable outcomes. Second, to fully leverage the global information extracted by the ViM block, the channel attention mechanism is incorporated to activate more pixels, as global details play a role in determining the channel attention weights. Additionally, it is reasonable to suggest that employing a convolution-based module can enhance the visual representation obtained by the ViM block and streamline the training process, similar to the benefits observed with transformers. For restoration, the Residual State Space Block (RSSB) [57] places a VSS block in front of the channel attention block, which lets the VSS block focus on learning diverse channel representations, after which the critical channels are selected by the subsequent channel attention, thus avoiding channel redundancy.

4 Visual Mamba in Application Fields

Mamba-based modules elevate the efficiency of processing sequential data, adeptly capturing long-range dependencies and seamlessly integrating into existing systems. In medical visual tasks and remote sensing images, where inputs often entail high-resolution data, Mamba emerges as a pivotal tool in augmenting various visual tasks, particularly those pertinent to medical applications.

In this section, we begin by highlighting the contributions of Mamba-based modules in enhancing general visual-related tasks. Subsequently, we delve into their specific impact on medical visual tasks and remote-sensing images.


Figure 4 panels: (a) BiDirectional Scan [26]; (b) Cross-Scan [27]; (c) Continuous 2D Scanning [29]; (d) Local Scan [28]; (e) Efficient 2D Scanning (ES2D) [30]; (f) Zigzag Scan [34]; (g) Omnidirectional Selective Scan [35]; (h) 3D BiDirectional Scan [36]; (i) Hierarchical Scan [37]; (j) Spatiotemporal Selective Scan [38]; (k) Multi-Path Scan [39].
