
1 Hubei University of Arts & Science,   2 Neolix Autonomous Vehicle
3 Kuaishou Technology,   4 Sun Yat-Sen University
† Equal contributions   🖂 Corresponding author [email protected]

CountFormer: Multi-View Crowd Counting Transformer

1Hong Mo🖂    2Xiong Zhang    3Jianchao Tan    2Cheng Yang
   1Qiong Gu    1Bo Hang    4Wenqi Ren
Abstract

Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly in situations characterized by heavy occlusion and severe perspective distortions.

However, hand-crafted heuristic features and identical camera layout requirements in conventional MVC methods limit their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called CountFormer to elevate multi-view image-level features to a scene-level volume representation and estimate the 3D density map based on the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into the volume query and image-level features, enabling it to handle various camera layouts with significant differences. Furthermore, we introduce a feature lifting module that capitalizes on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. Subsequently, the multi-view volume aggregation module attentively aggregates the multi-view volumes to create a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against the state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment compared to conventional MVC frameworks.

This research is supported by the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (No. VRLAB2024C05).

1 Introduction

Single-view counting (SVC) has yielded remarkable results on well-established datasets [93, 81, 1, 28, 5, 41, 42, 52, 69, 62, 31, 53, 6, 57, 74]. However, the inherent limitations of existing SVC approaches hinder their effective deployment in real-world scenarios.

Recently, there has been a growing trend towards addressing the multi-view counting (MVC) problem [86, 90, 87, 88, 91, 89, 82]. Specifically, existing MVC approaches typically rely on flat ground-plane assumptions to transform the image-level features onto the ground plane. They then fuse the multi-view (MV) ground features to recover scene-level features and estimate the scene-level density using predictors based on the fused ground features.

However, the flat ground assumptions [86, 90, 87, 88, 91, 89, 95] are not always guaranteed to hold, which can result in misalignment between the scene-level features and the real-world environment, leading to less accurate counting performance. Additionally, in the MV feature fusion module, the attention weights are determined solely based on the distance from the ground plane to each camera without considering the features themselves [86, 90, 91, 89, 95], which may limit the effectiveness of the fusion strategy. Moreover, most current approaches [86, 90, 87, 88, 89, 95] can only handle statically fixed camera layouts, i.e., the camera configurations are identical during the training and inference stages, and thus cannot perform the MVC task with images from arbitrary dynamic camera settings.

Simultaneously, noteworthy advancements have been accomplished in the domain of multi-view perception (MVP). Specifically, research works [55, 15, 29, 17, 30, 35, 26] focus on lifting image-level features of MV images to a cohesive scene-level space, followed by the execution of specific perception tasks using the derived scene-level features. Despite the concise architecture design and the promising performance on the MVP task, it is infeasible to naively adopt existing MVP methods without considering the specific challenges of the MVC setting, because contemporary MVP approaches assume stable and fixed camera layouts, which may not hold in MVC scenarios. Moreover, the inherent philosophies of employing multi-camera layouts in MVC and MVP are different. MVP tasks require multi-view cameras with limited overlap in their fields of view (FOV) to provide a 360° FOV of the scene, while MVC tasks rely on multi-view settings in which the FOV of each camera overlaps significantly with the others to address occlusion and scale variation challenges.

In this work, we incorporate recent advanced ideas in MVP and propose an innovative MV learning framework called CountFormer to extend the applicability and scalability of existing MVC approaches. CountFormer is founded upon one primary consideration: it should be able to process images captured with arbitrary dynamic camera layouts and be robust enough to alleviate the performance drop caused by extrinsic parameter perturbation. To accomplish this objective, a feature lifting module is first proposed to lift the image-level features into 3D space for each view. Specifically, the 3D scene is voxelized into individual voxels, and features corresponding to each voxel and camera view are obtained using a deformable attention mechanism [98]. Since the feature lifting module does not require the flat ground assumption, CountFormer is better suited to challenging situations such as congested crowds or uneven terrain. In addition, the integration of the attention mechanism enhances its robustness against fluctuations in camera extrinsic parameters, which is particularly beneficial in practical settings where such perturbations are inevitable [79].

Subsequently, an MV volume aggregation module is introduced to attentively fuse the MV volume features to generate the comprehensive scene-level volume representation. The blending weights are estimated by implicitly and jointly considering the voxel features and the geometric properties of the cameras, ensuring that both the visual features and the geometric information are considered during the fusion process, leading to more accurate and robust results than previous methods [86, 90, 89, 95, 91]. By design, the aggregation module is capable of efficiently handling a dynamic number of volume features, allowing CountFormer to adapt to scenarios where the number of cameras may vary. Afterwards, CountFormer employs 3D convolution operators to estimate the 3D scene-level density map from the aggregated volume representation.

Figure 1: Framework of the CountFormer. The Image Encoder extracts multi-view and multi-level features (MVML) from the multi-view images of the scene. Image-Level Camera Embedding Module fuses camera intrinsic and extrinsic with the MVML features. The elaborate Cross-View Attention Module, a sophisticated attention component, transforms the image-level features into scene-level volume representations. Besides main components, a 2D Density Predictor is used to estimate the image space density, 3D Density Predictors are employed to regress for the 3D scene-level density, and a simple feature pyramid network fuses the multi-scale voxel features.

To enhance the robustness and representation capability of CountFormer in handling arbitrary dynamic camera layouts, the camera encoding strategy is incorporated, which involves implicitly encoding the extrinsic and intrinsic camera parameters. Specifically, the camera information is implicitly encoded into the volume query that is used to lift the corresponding image-level features in 3D space. Additionally, the camera information is also encoded into the corresponding image-level features. By incorporating the camera information into both the volume query and image-level features, CountFormer effectively integrates the camera-specific characteristics into its processing pipeline, leading to improved performance in various camera configurations and scene understanding tasks.

In summary, CountFormer is highly versatile and scalable, making it well-suited and efficient for usage in real-world scenarios, and our main contributions are as follows:

  1. We creatively design a revolutionary multi-view counting (MVC) framework, called CountFormer, which is the first attempt to solve the 3D MVC problem to fit a real-world environment.

  2. A feature lifting module and an MV volume aggregation module are conceived to transform the MV image-level features w.r.t arbitrary dynamic camera layouts into a unified scene-level volume representation.

  3. We present an effective strategy to embed the camera parameters into the image-level features and the volume query, facilitating accurate and adaptable representation among diverse camera setups.

2 Related Work

Due to the extensive scope of related works, we only discuss works that are strongly related to our framework.

Single-View Counting (SVC). Since the work of [84], density estimation has been the main paradigm for crowd counting. Subsequent works, such as [48, 85, 50, 60, 40, 93, 3, 1, 61, 39, 59, 21, 13, 25, 37], continue improving performance by designing more powerful model structures that are capable of learning multi-scale representations and perspective-free feature detection. Simultaneously, another line of works [16, 52, 28, 8, 74, 5, 70, 63, 83, 44] adopts the insight from semantic image segmentation and exploits the encoder-decoder architecture to facilitate multi-scale learning. Recently, by adopting the transformer architecture [68, 7], works such as [11, 66, 64, 33, 73, 32, 53, 78, 9, 54] substantially advance performance because of their remarkable representation and generalization capability. In addition to these intricate structures, [49, 43, 80, 12, 19, 97, 57] advance the field by optimizing the training process or improving the cross-scene generalization ability. Despite the promising performance, SVC methods still face unresolved challenges, including dealing with scale variation, mitigating occlusion, and handling congested crowds.

Multi-View Perception (MVP). The pioneering research [55] introduces a novel approach that involves the transformation of multi-view multi-level features into a cohesive scene-level space, followed by the execution of specific perception tasks using the derived scene-level features. Subsequently, certain studies [15, 34, 47, 27] enhance the framework by incorporating sophisticated fundamental components. For instance, these studies utilize the transformer structure as the foundational framework [77], employ a robust perception head [76, 35], integrate multiple modal sensors to enhance performance [47, 34], and leverage the attention mechanism to effectively encode image-level features to the scene-level space [29, 96]. Furthermore, researchers tackle the task of 3D semantic occupancy prediction by employing a voxelization technique to convert the scene into discrete voxels and subsequently conduct 3D semantic segmentation to retrieve semantic features for each voxel [75, 94, 30, 18, 71, 67]. Despite the solid theoretical foundation and the extensive experimental validation of their effectiveness, naively transferring existing MVP approaches to solve the MVC task proves to be difficult, where the biggest challenge remains that MVC necessitates the capability of handling arbitrary dynamic camera layouts, which is infeasible for existing methods.

Multi-View Counting (MVC). Pioneering works [86, 90, 91] propose to transform each camera view’s image-level features to the scene’s ground plane and then fuse these features to estimate the scene-level density map. To further improve the performance, [87, 88] consider individuals’ variable height in the 3D environment by introducing multi-height ground planes along the z-axis. Although these methods demonstrate promising performance, their strong assumptions limit their applicability and scalability. Specifically, the feature transform module [86, 90, 89, 91, 87, 88] requires a flat ground assumption, which is not guaranteed in real-world situations. Furthermore, the fusion module’s blending weights are exclusively determined by the geometric position of the cameras [86, 90, 91, 89], ignoring the critical semantic features. This seems counterintuitive, as occlusions and other crucial information related to the scene are primarily encoded in the semantic features. Alternatively, [87, 88] attempt to achieve reasonable fusion results by jointly considering geometric and semantic features, but these approaches come at the expense of limiting the flexibility and adaptability of the methods to handle dynamic camera layout settings.

3 Methodology

3.1 Image Encoder

The image encoder aims to extract multi-view and multi-level (MVML) features $\left\{\mathcal{F}_{\mathrm{n}}^{\mathrm{l}}\right\}_{\mathrm{n}\leq\mathrm{N},\,\mathrm{l}\leq\mathrm{L}}$ from multiple synchronized captured images $\left\{\mathcal{I}_{\mathrm{n}}\right\}_{\mathrm{n}\leq\mathrm{N}}$, where the varying $\mathrm{N}$ is the number of views, and $\mathrm{L}$ indicates the maximum feature pyramid level. Each image $\mathcal{I}_{\mathrm{n}}$ is first fed into a vision transformer [46] to extract multi-level features. Subsequently, considering the significance of multi-scale feature fusion, the feature pyramid network [36] is adopted to aggregate and dispatch the multi-scale features to enrich the multi-scale representation and contextual information encoding capability, which are also employed by extensive research works [14, 92, 27, 17, 24].

3.2 Image-Level Camera Embedding

The camera parameters play a key role in transforming image-level features into scene-level representation. Existing approaches tend to employ channel-wise camera encoding to differentiate features across different views [17, 29, 45, 26, 51, 38]. However, the existing strategy proves inefficient when dealing with the challenging CVCS dataset [91], which includes tens of thousands of diverse camera configurations. In this study, we devise both channel-wise and spatial-wise camera encoding to facilitate learning from MV images with significantly varied camera layouts. Formally, given the extrinsic parameters $\mathrm{M}\in\mathbb{R}^{3\times 4}$ representing the transformation from scene space to camera space, the intrinsic parameters $\mathrm{K}\in\mathbb{R}^{3\times 3}$, and the image-level augmentation matrix $\mathrm{A_{c}}\in\mathbb{R}^{2\times 3}$, we flatten, concatenate, and extend $\{\mathrm{M,K,A_{c}}\}$ to build the spatially aligned camera-parameter vector $\xi\in\mathbb{R}^{27\times\mathrm{H}\times\mathrm{W}}$.
Then we generate the positional encoding $\mathcal{P}_{\mathrm{c}}\in\mathbb{R}^{2\times\mathrm{H}\times\mathrm{W}}$ defined with $\mathcal{P}_{\mathrm{c}}(\cdot,\mathrm{u,v})=[\mathrm{u}/\mathrm{W},\ \mathrm{v}/\mathrm{H}]^{\mathrm{T}}$. Suppose the feature map is $\mathcal{F}\in\mathbb{R}^{\mathrm{C}\times\mathrm{H}\times\mathrm{W}}$, then the image-level camera embedding is performed with

$$\mathrm{E}_{\mathrm{c}}\Big(\mathcal{F},\{\mathrm{M,K,A_{c}}\}\Big)=\mathcal{F}\otimes\phi_{\mathrm{c}}\Big([\xi,\mathcal{P}_{\mathrm{c}}]\Big), \tag{1}$$

where $[\cdot]$ denotes tensor concatenation, $\phi_{\mathrm{c}}:\mathbb{R}^{29}\rightarrow\mathbb{R}^{\mathrm{C}}$ represents a Multi-Layer Perceptron (MLP) that aggregates the positional encoding and the camera parameters, and $\otimes$ denotes a binary operator, such as the Hadamard product [23] or the widely used element-wise addition.
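As an illustration of Equation 1, the following NumPy sketch builds $\xi$ and $\mathcal{P}_{\mathrm{c}}$ and applies the embedding with the Hadamard product. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the two-layer ReLU form of $\phi_{\mathrm{c}}$, and the storage convention `F[c, v, u]` (row index `v`, column index `u`) are all hypothetical choices made for the example.

```python
import numpy as np

def camera_vector(M, K, A, spatial_shape):
    """Flatten and concatenate {M, K, A}, then broadcast over the spatial grid."""
    flat = np.concatenate([M.ravel(), K.ravel(), A.ravel()])  # 12 + 9 + 6 = 27
    expanded = flat.reshape(-1, *([1] * len(spatial_shape)))
    return np.broadcast_to(expanded, (flat.size, *spatial_shape))

def image_positional_encoding(H, W):
    """P_c(., u, v) = [u / W, v / H]^T, stored as an array of shape (2, H, W)."""
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    return np.stack([u / W, v / H]).astype(np.float64)

def mlp(x, W1, b1, W2, b2):
    """Hypothetical phi_c: a two-layer perceptron applied to the last axis."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def image_camera_embedding(F, M, K, A_c, params):
    """E_c(F, {M, K, A_c}) = F (Hadamard) phi_c([xi, P_c])."""
    C, H, W = F.shape
    xi = camera_vector(M, K, A_c, (H, W))      # (27, H, W)
    P_c = image_positional_encoding(H, W)      # (2, H, W)
    x = np.concatenate([xi, P_c], axis=0)      # (29, H, W)
    emb = mlp(np.moveaxis(x, 0, -1), *params)  # (H, W, C)
    return F * np.moveaxis(emb, -1, 0)         # (C, H, W)
```

Replacing the final `*` with `+` gives the element-wise addition variant that the text also mentions for $\otimes$.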

3.3 Multi Scale Transformers

The multi-scale transformers aim to generate the multi-scale volume representations $\{\mathcal{V}_{1},\mathcal{V}_{2},\cdots,\mathcal{V}_{\mathrm{L}}\}$ from the MVML features $\left\{\mathcal{F}_{\mathrm{n}}^{\mathrm{l}}\right\}_{\mathrm{n}\leq\mathrm{N},\,\mathrm{l}\leq\mathrm{L}}$, where $\mathrm{N}$ is the number of employed views and $\mathrm{L}$ refers to the maximum feature pyramid level. To accomplish this objective, for each level $\mathrm{l}$, a CountFormer is dedicated to lifting the MV image-level features $\{\mathcal{F}^{\mathrm{l}}_{\mathrm{n}}\}_{\mathrm{n}\leq\mathrm{N}}$ to the scene-level volume representation $\mathcal{V}_{\mathrm{l}}$.

Specifically, the CountFormer consists of multiple encoder layers, each following the standard design principles of transformers [68], with three specialized components, i.e., the learnable volume query, the cross-view attention, and a computationally efficient 3D convolution that replaces the self-attention mechanism. To elaborate further, the volume query is used as the initial query for each camera view, which is then combined with camera extrinsic and intrinsic parameters to create the view-dependent query. The primary purpose of the cross-view attention module is to lift the image-level features to volume features for each camera view and then aggregate the MV volumes to produce a comprehensive scene-level volume representation. In detail, it comprises three indispensable components, i.e., volume-level query embedding, the feature lifting module, and the MV volume aggregation module. Note that, in the following discussion, the subscripts $\mathrm{l}$ and $\mathrm{n}$ may be omitted for simplicity.

Volume Query Representation. The scene is discretized into voxels with shape $\mathrm{Z\times Y\times X}$ and assigned a group of volume-shaped learnable parameters $\mathcal{Q}\in\mathbb{R}^{\mathrm{C\times Z\times Y\times X}}$ as the queries, where $\mathrm{X,Y,Z}$ are the spatial shape of the volume and $\mathrm{C}$ governs the hidden dimension. Specifically, the query $\mathcal{Q}^{\mathrm{p}}\in\mathbb{R}^{\mathrm{C}}$ located at $\mathrm{p}=\mathrm{(d,h,w)}$ is responsible for the corresponding voxel in the volume. Each voxel corresponds to a real-world size of $\mathrm{s}$ meters, and the center of the volume is aligned with the origin of the 3D scene coordinate system. In comparison to existing approaches [86, 95, 87] that utilize the Inverse Perspective Mapping (IPM) [2] strategy or the Spatial Transformer Networks (STN) [20] module for deriving the volume representation of the scene, the query-based attention paradigm provides greater scalability, does not rely on the flat ground assumption, and requires less hyperparameter tuning, making it a more practical approach.

Volume-Level Camera Embedding. Existing MVP approaches [67, 18, 77, 29, 96, 35] consider $\mathcal{Q}$ as the universal query and do not take into account the camera’s intrinsic and extrinsic parameters, which is understandable since the MVP task involves a fixed and stable camera layout [4, 65], e.g., the standard MVP benchmark nuScenes [4] comprises 6 stable surrounding cameras to provide a 360° FOV. Nevertheless, the view-agnostic global query $\mathcal{Q}$ is inadequate for challenging surveillance environments, where significantly diverse dynamic camera layouts are inevitable. For instance, the CVCS benchmark [91] consists of images from tens of thousands of camera views. To tackle this problem, we develop a volume-level camera embedding module that follows a similar strategy to the image-level camera embedding. It encodes the camera parameters $\{\mathrm{M,K}\}$ and the volume-level augmentation matrix $\mathrm{A_{v}}\in\mathbb{R}^{2\times 3}$ into the versatile $\mathcal{Q}$ to create the view-dependent volume query with

$$\mathrm{E}_{\mathrm{v}}\Big(\mathcal{Q},\{\mathrm{M,K,A_{v}}\}\Big)=\mathcal{Q}\otimes\phi_{\mathrm{v}}\Big([\xi,\mathcal{P}_{\mathrm{v}}]\Big), \tag{2}$$

where $\phi_{\mathrm{v}}:\mathbb{R}^{30}\rightarrow\mathbb{R}^{\mathrm{C}}$ fuses the positional encoding with the camera parameters, $\mathcal{P}_{\mathrm{v}}\in\mathbb{R}^{3\times\mathrm{Z\times Y\times X}}$ stands for the positional encoding with $\mathcal{P}_{\mathrm{v}}(\cdot,\mathrm{d,h,w})=[\mathrm{d}/\mathrm{Z},\ \mathrm{h}/\mathrm{Y},\ \mathrm{w}/\mathrm{X}]^{\mathrm{T}}$, and $\xi$ is defined similarly to Equation 1.
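To make the channel bookkeeping of Equation 2 concrete (27 camera values plus 3 positional channels give the 30 inputs of $\phi_{\mathrm{v}}$), here is a small NumPy sketch of a view-dependent query. It is only illustrative: $\phi_{\mathrm{v}}$ is stood in for by a single hypothetical linear layer (`W_v`, `b_v`), and $\otimes$ is taken to be element-wise addition, one of the two options named for Equation 1.

```python
import numpy as np

def volume_positional_encoding(Z, Y, X):
    """P_v(., d, h, w) = [d / Z, h / Y, w / X]^T, shape (3, Z, Y, X)."""
    d, h, w = np.meshgrid(np.arange(Z), np.arange(Y), np.arange(X), indexing="ij")
    return np.stack([d / Z, h / Y, w / X])

def view_dependent_query(Q, M, K, A_v, W_v, b_v):
    """E_v(Q, {M, K, A_v}) with an additive combination and a linear phi_v."""
    C, Z, Y, X = Q.shape
    cam = np.concatenate([M.ravel(), K.ravel(), A_v.ravel()])       # 27 values
    xi = np.broadcast_to(cam[:, None, None, None], (cam.size, Z, Y, X))
    x = np.concatenate([xi, volume_positional_encoding(Z, Y, X)])   # (30, Z, Y, X)
    # Hypothetical phi_v: one linear map from 30 channels to C channels per voxel.
    emb = np.einsum("izyx,ic->czyx", x, W_v) + b_v[:, None, None, None]
    return Q + emb
```

The same shared query `Q` yields a different volume query for each camera, which is the property the text relies on for handling varied camera layouts.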

Feature Lifting Module. Considering the need to support dynamic camera layouts and the complex deployment environment in the MVC task, employing deformable attention [98, 29] is more suitable than relying on the IPM paradigm for constructing the feature lifting module. Deformable attention has been widely used to lift image features to 3D scene space in multi-view tasks [29, 22, 72, 30]. In this work, CountFormer also adopts this method but replaces the global versatile query with a view-dependent query to address the complexities of surveillance environments. Mathematically, given the camera-dependent query $\mathcal{Q}$ encoded with Equation 2 and the corresponding image-level feature $\mathcal{F}$, for each voxel query $\mathcal{Q}^{\mathrm{p}}$ located at $\mathrm{p=(d,h,w)}$, one may first recover the corresponding 3D location $\mathrm{(x^{\prime},y^{\prime},z^{\prime})}$ w.r.t. the scene with

$$\mathrm{x^{\prime}=s\cdot(w-X/2);\quad y^{\prime}=s\cdot(h-Y/2);\quad z^{\prime}=s\cdot(d-Z/2)}, \tag{3}$$

then obtain the homogeneous reference point $\mathrm{p=(u,v)}$ by projecting the 3D scene-level point $\mathrm{(x^{\prime},y^{\prime},z^{\prime})}$ according to the intrinsic $\mathrm{K}$ and extrinsic $\mathrm{M}$ as

$$\mathrm{[u,v,1]^{T}=K\cdot M\cdot[x^{\prime},y^{\prime},z^{\prime},1]^{T}}, \tag{4}$$

and finally, perform the deformable attention mechanism at the projected location $\mathrm{(u,v)}$.
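Equations 3 and 4 can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's code: the function names are hypothetical, and the division by the third homogeneous component (with a validity mask for points behind the camera) is the usual pinhole-camera convention, which Equation 4 leaves implicit.

```python
import numpy as np

def voxel_centers(Z, Y, X, s):
    """Equation 3: map voxel indices (d, h, w) to scene coordinates (x', y', z')."""
    d, h, w = np.meshgrid(np.arange(Z), np.arange(Y), np.arange(X), indexing="ij")
    x = s * (w - X / 2)
    y = s * (h - Y / 2)
    z = s * (d - Z / 2)
    return np.stack([x, y, z], axis=-1)          # (Z, Y, X, 3)

def project(points, K, M, eps=1e-6):
    """Equation 4: project scene points with intrinsics K (3x3) and extrinsics M (3x4)."""
    ones = np.ones((*points.shape[:-1], 1))
    pts_h = np.concatenate([points, ones], axis=-1)   # homogeneous scene points
    cam = pts_h @ (K @ M).T                           # rows: [u * depth, v * depth, depth]
    depth = cam[..., 2]
    valid = depth > eps                               # assumed: keep points in front of the camera
    uv = cam[..., :2] / np.where(valid, depth, 1.0)[..., None]
    return uv, valid
```

In a full pipeline, `uv` would serve as the reference points around which deformable attention samples the image feature map, and `valid` would mark which views a voxel actually falls into.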

Multi-View Volume Aggregation. The MV volume aggregation module aggregates the MV volume representations to produce a comprehensive volume of the scene. Existing fusion approaches can only deal with stable and fixed camera layouts [29, 86, 90, 88, 87], or are compromised in conducting the fusion process by only considering the geometric position of the cameras and ignoring the critical semantic features [90, 86, 91], or assume that different views contribute equally to the scene representation [29, 17, 27]. In summary, all existing approaches lack the necessary scalability to effectively handle dynamic camera layouts. In this work, a compact yet effective attention mechanism is devised to aggregate the MV volume representations {𝒱n}nNsubscriptsubscript𝒱nnN\{\mathcal{V}_{\rm n}\}_{\rm n\leq{\mathrm{N}}}{ caligraphic_V start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT } start_POSTSUBSCRIPT roman_n ≤ roman_N end_POSTSUBSCRIPT with

𝒱=n𝒲n𝒱n,𝒱subscriptndirect-productsubscript𝒲nsubscript𝒱n\displaystyle\small\mathcal{V}=\sum\nolimits_{\rm n}\mathcal{W}_{\rm n}\odot% \mathcal{V}_{\rm n},caligraphic_V = ∑ start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT caligraphic_W start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT ⊙ caligraphic_V start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT , (5)

where 𝒲n1×Z×Y×Xsubscript𝒲nsuperscript1ZYX\mathcal{W}_{\rm n}\in\mathbb{R}^{1\times\mathrm{Z\times Y\times X}}caligraphic_W start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 × roman_Z × roman_Y × roman_X end_POSTSUPERSCRIPT denotes the attention weight, direct-product\odot refers the element-wise product operation, and 𝒱nC×Z×Y×Xsubscript𝒱nsuperscriptCZYX\mathcal{V}_{\rm n}\in\mathbb{R}^{\mathrm{C\times Z\times Y\times X}}caligraphic_V start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT roman_C × roman_Z × roman_Y × roman_X end_POSTSUPERSCRIPT is the volume feature w.r.t. the nthsubscriptnth\rm n_{\mathrm{th}}roman_n start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT view. Moreover, instead of employing the computationally heavily vanilla self-attention [68], we tend to estimate the weight of the pthsubscriptpth{\rm p}_{\mathrm{th}}roman_p start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT voxel in the nthsubscriptnth\rm n_{\mathrm{th}}roman_n start_POSTSUBSCRIPT roman_th end_POSTSUBSCRIPT view 𝒲npsuperscriptsubscript𝒲np\rm\mathcal{W}_{n}^{p}caligraphic_W start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_p end_POSTSUPERSCRIPT much cheaper,

𝒲np={exp{ϕ(𝒬np)}/k𝐒exp{ϕ(𝒬kp)},ifn𝐒0,otherssuperscriptsubscript𝒲npcasesitalic-ϕsuperscriptsubscript𝒬npsubscriptk𝐒italic-ϕsuperscriptsubscript𝒬kpifn𝐒0others\displaystyle\mathcal{W}_{\rm n}^{\rm p}=\begin{cases}\nicefrac{{\exp\{\phi(% \mathcal{Q}_{\rm n}^{\rm p})\}}}{{\sum_{\rm k\in\mathbf{S}}\exp\{\phi(\mathcal% {Q}_{\rm k}^{\rm p})\}}},&\mathrm{if}~{}{\rm n}\in\mathbf{S}\\ 0,&\mathrm{others}\end{cases}caligraphic_W start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_p end_POSTSUPERSCRIPT = { start_ROW start_CELL / start_ARG roman_exp { italic_ϕ ( caligraphic_Q start_POSTSUBSCRIPT roman_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_p end_POSTSUPERSCRIPT ) } end_ARG start_ARG ∑ start_POSTSUBSCRIPT roman_k ∈ bold_S end_POSTSUBSCRIPT roman_exp { italic_ϕ ( caligraphic_Q start_POSTSUBSCRIPT roman_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_p end_POSTSUPERSCRIPT ) } end_ARG , end_CELL start_CELL roman_if roman_n ∈ bold_S end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL roman_others end_CELL end_ROW (6)

where $\mathbf{S}$ represents the set of all camera views that voxel $\mathrm{p}$ hits, $\phi:\mathbb{R}^{\mathrm{C}}\rightarrow\mathbb{R}_{+}$ measures the importance of a voxel, and $\mathcal{Q}_{\mathrm{n}}$ refers to the volume query with camera encoding w.r.t. the $\mathrm{n}$-th view. It is worth noting that this aggregation strategy has significant advantages over previous methods. For instance, it is independent of the permutation of the camera views and is thus capable of handling dynamic camera layouts. Besides, the blending weights $\mathcal{W}_{\mathrm{n}}^{\mathrm{p}}$ inherently encode both semantic context and geometric position, making the strategy more practical than previous fusion strategies [91, 29, 17], especially when dealing with occlusion. It should be pointed out that MVP tasks require multiple cameras to provide a comprehensive 360° FOV perception of the scene, where only marginal overlap exists between different cameras [4, 65], so the cross-view fusion strategy does not require an elaborate design [29, 17, 75]. In contrast, MVC tasks rely heavily on the overlap between different camera views to address the ambiguity caused by occlusion and the scale variation resulting from perspective projection [86, 91]. As a result, the multi-view fusion strategy needs to be carefully designed to effectively integrate information from multiple views.
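
To make the masked softmax of Eq. (6) concrete, the following NumPy sketch blends per-view volumes into a single scene-level volume. It is an illustration only, not the released implementation: the function name `aggregate_volumes`, the precomputed `scores` array standing in for $\phi(\mathcal{Q}_{\mathrm{n}}^{\mathrm{p}})$, and the boolean `visible` mask encoding $\mathbf{S}$ are all assumptions, and the fusion is assumed to be the weighted sum over views suggested by the element-wise product.

```python
import numpy as np

def aggregate_volumes(volumes, scores, visible):
    """Blend per-view volumes into one scene-level volume (sketch of Eq. 6).

    volumes: (N, C, Z, Y, X) per-view volume features V_n
    scores:  (N, Z, Y, X)    per-voxel importance phi(Q_n^p) (hypothetical input)
    visible: (N, Z, Y, X)    True where voxel p is hit by view n, i.e. n in S
    """
    # Views that do not see a voxel get -inf, so their softmax weight is exactly 0.
    masked = np.where(visible, scores, -np.inf)
    # Subtract the per-voxel maximum for numerical stability. Voxels seen by no
    # view have a maximum of -inf, which is replaced by 0 to avoid inf - inf.
    peak = masked.max(axis=0, keepdims=True)
    peak = np.where(np.isfinite(peak), peak, 0.0)
    expo = np.exp(masked - peak)
    denom = expo.sum(axis=0, keepdims=True)
    # Voxels seen by no view keep an all-zero weight vector.
    weights = np.divide(expo, denom, out=np.zeros_like(expo), where=denom > 0)
    # Broadcast the weights over the channel axis and sum over the view axis.
    fused = (weights[:, None] * volumes).sum(axis=0)
    return fused, weights
```

Because the softmax and the sum both run over the view axis, reordering the views or dropping one (by clearing its mask) leaves the rest of the computation unchanged, which is the permutation independence noted above.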

3.4 Density Predictor

As 3D density estimation requires more low-level features to enable the network to learn fine-grained density, we incorporate a 3D FPN [36] to perform multi-scale volume feature fusion. Formally, given multi-scale 3D volumes $\{\mathcal{V}^{\mathrm{l}}\}_{\mathrm{l}\leq\mathrm{L}}$, we upsample the $(\mathrm{l}-1)$-th level 3D density features $\mathcal{X}_{\mathrm{l}-1}$ with a 3D deconvolution layer and fuse them with the $\mathrm{l}$-th level volume representation $\mathcal{V}_{\mathrm{l}}$ as

$$\mathcal{X}_{\mathrm{l}}=\mathcal{V}_{\mathrm{l}}+\mathrm{DeConv}(\mathcal{X}_{\mathrm{l}-1}), \tag{7}$$

and apply a 3D convolution layer on $\mathcal{X}_{\mathrm{l}}$ to estimate the 3D density map $\mathcal{G}_{\mathrm{l}}$, supervising the training procedure with the $\mathrm{L}_{2}$ norm. Considering that the high-resolution prediction is more important, a loss weight $\alpha_{\mathrm{l}}$ is employed to balance the training losses among the various resolutions. Besides, as shown in Figure 1, we add a 2D density estimation task as intermediate supervision to accelerate training. This design is reasonable because the complicated CountFormer architecture makes the gradient path from the 3D density supervision long, whereas the intermediate 2D density estimation task provides a shorter gradient path. To this end, the training objective of CountFormer is

$$\mathcal{L}=\lambda\left|\mathcal{H}-\bar{\mathcal{H}}\right|_{2}+\sum\nolimits_{\mathrm{l}}\alpha_{\mathrm{l}}\left|\mathcal{G}_{\mathrm{l}}-\bar{\mathcal{G}}_{\mathrm{l}}\right|_{2}, \tag{8}$$

where $\left|\cdot\right|_{2}$ is the $\mathrm{L}_{2}$ norm, $\lambda$ trades off the 2D density supervision against the 3D density estimation, $\mathcal{H}$ and $\mathcal{G}_{\mathrm{l}}$ denote the ground-truth (GT) 2D and 3D densities, and $\bar{\cdot}$ denotes the corresponding prediction. Compared to the complicated loss designs adopted in [86, 90, 87, 88, 89], the training objective of CountFormer is simpler and more straightforward to tune.
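
As a minimal sketch of how the objective in Eq. (8) could be evaluated, the snippet below reads $\left|\cdot\right|_{2}$ literally as the Euclidean norm of the difference between two density maps; an actual implementation may well use a mean squared error instead. The function names and the example values of $\lambda$ and $\alpha_{\mathrm{l}}$ are hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_norm(a, b):
    """Euclidean (L2) norm of the difference between two arrays."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def countformer_objective(h_gt, h_pred, g_gt, g_pred, alphas, lam):
    """Sketch of Eq. (8): lam * |H - H_bar|_2 + sum_l alpha_l * |G_l - G_bar_l|_2.

    h_gt, h_pred: ground-truth and predicted 2D density maps
    g_gt, g_pred: lists of ground-truth and predicted 3D density maps, one per level
    alphas:       per-level loss weights alpha_l (hypothetical values)
    lam:          weight of the 2D intermediate supervision (hypothetical value)
    """
    if not (len(g_gt) == len(g_pred) == len(alphas)):
        raise ValueError("one ground truth, prediction, and weight per level is required")
    loss_2d = l2_norm(h_gt, h_pred)
    loss_3d = sum(a * l2_norm(g, p) for a, g, p in zip(alphas, g_gt, g_pred))
    return lam * loss_2d + loss_3d
```

With this form, tuning reduces to choosing $\lambda$ and one scalar $\alpha_{\mathrm{l}}$ per resolution level.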

4 Experiments

4.1 Experiment Settings

We leverage all existing MVC datasets to evaluate the effectiveness of CountFormer, including CityStreet [86], PETS2009 [10], DukeMTMC [58], and CVCS [91]. To make a fair comparison, following conventional works [86, 90, 89, 87, 88], we employ the mean absolute error ($\mathrm{MAE}\downarrow$) and normalized mean absolute error ($\mathrm{NAE}\downarrow$) as the evaluation criteria to quantify the counting performance at both the image level and the scene level. Our code and models are available at https://github.com/MandyMo/ECCV_Countformer for research purposes.
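
For reference, the two metrics can be computed as in the sketch below. It assumes the definitions commonly used in the counting literature, where MAE averages the absolute count error over samples and NAE averages the absolute error divided by the ground-truth count; the function names are illustrative.

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def nae(pred_counts, gt_counts):
    """Mean absolute error normalized per sample by the ground-truth count."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    if np.any(gt <= 0):
        raise ValueError("NAE is undefined for samples with a non-positive count")
    return float(np.mean(np.abs(pred - gt) / gt))
```

For example, predictions of 8 and 22 against ground-truth counts of 10 and 20 give an MAE of 2.0 and an NAE of 0.15.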

4.2 Qualitative Experiments

To better demonstrate the robustness of CountFormer in challenging scenarios, i.e., in the presence of occlusion and congested crowds, we draw some representative samples from the CityStreet and PETS2009 testing sets, as illustrated in Figure 2. The results demonstrate the capability of CountFormer in dealing with occlusion. Specifically, the CVCS approach [91] encounters difficulties in accurately reconstructing the density in spatial regions obstructed by objects such as buses or trees. By employing the MV volume aggregation module, CountFormer alleviates this issue and makes a reasonable estimation, because the aggregation module is capable of dynamically blending voxel features from all views. Moreover, both CVCS [91] and 3D Counting [87] demonstrate limited efficacy in handling densely populated crowds. As expected, CountFormer effectively tackles these challenges by utilizing the feature lifting module to transform image-level features into a 3D volume for each view and adopting the MV volume aggregation module to attentively fuse suitable features for each voxel.

Figure 2: Qualitative Results. The figure exhibits several typical scenarios on the CityStreet (with 3 views) and PETS2009 (with 3 views) datasets, including occlusion and congested crowds. For each sample, the multi-view images, the ground-truth scene-level density, and the densities estimated by the CVCS method [91], the 3D Counting approach [87], and CountFormer are presented, with the densities shown in the bird’s eye view.
Figure 3: Qualitative Results. The figure visualizes 3 challenging scenarios on the CVCS benchmark. Regarding each sample, the visualization includes the multi-view images (with 5 views), ground truth density, density obtained with the MV volume aggregation module, and density estimated without this module.

To gain a better understanding of the effectiveness of the feature aggregation module in CountFormer, we select a 3D reference point for each sample and report the fusion weight of each point across all views (the projected reference points are marked as $\bigstar$, $\square$, $\vartriangle$, and $\lozenge$ in Figure 2). As the attention weights in Table 1 show, CountFormer assigns smaller weights to occluded views because the MV volume aggregation module jointly considers the geometric information and semantic features of each view, while the CVCS approach focuses on the geometric camera position only (see $\bigstar$, $\square$, and $\vartriangle$). Additionally, CountFormer automatically assigns appropriate weights to views with congested crowds, enhancing the counting accuracy (see $\lozenge$).

| Sample | Method | C1 | C3 | C4 |
|---|---|---|---|---|
| $\bigstar$ | CVCS | 0.15 | 0.23 | 0.62 |
| $\bigstar$ | CountFormer | 0.40 | 0.51 | 0.09 |
| $\square$ | CVCS | 0.67 | 0.22 | 0.11 |
| $\square$ | CountFormer | 0.18 | 0.33 | 0.49 |

(a) Attention Weights on CityStreet.

| Sample | Method | C1 | C2 | C3 |
|---|---|---|---|---|
| $\vartriangle$ | CVCS | 0.38 | 0.41 | 0.21 |
| $\vartriangle$ | CountFormer | 0.19 | 0.45 | 0.36 |
| $\lozenge$ | CVCS | 0.35 | 0.43 | 0.28 |
| $\lozenge$ | CountFormer | 0.45 | 0.47 | 0.08 |

(b) Attention Weights on PETS2009.

Table 1: Quantitative Analysis of Attention Weights. The table presents the attention weights assigned to the MV features on CityStreet and PETS2009, where the projections of the 3D reference points $\bigstar$, $\square$, $\vartriangle$, and $\lozenge$ onto each view are marked in Figure 2.

Figure 3 depicts the density maps with and without the volume aggregation module on the challenging CVCS benchmark. One may observe that, without the aggregation module, the model tends to undercount in spatial regions that are occluded in some views (2nd sample), fails to count accurately in the presence of extreme camera views (1st sample), and struggles to predict sharp density maps under large scale variation (3rd sample). With the aggregation module, however, CountFormer is capable of dealing with these challenges, highlighting this module’s ability to effectively fuse MV volume features, especially in situations with significant scale variations or occlusions.

(a) Histogram Comparison

| Dataset | SOTA | Ours |
|---|---|---|
| CityStreet [86] | 6.98 [90] | 4.72 |
| CVCS [91] | 7.22 [91] | 4.79 |
| PETS2009 [10] | 3.08 [95] | 0.74 |
| DukeMTMC [58] | 0.87 [95] | 0.88 |

(b) Quantitative Comparison

Figure 4: Comparisons with state-of-the-art (SOTA) methods. The figure presents the comparisons between [86, 90, 87, 88, 91, 95, 56, 89] and our CountFormer, where the mean absolute error (MAE $\downarrow$) is used to evaluate the performance on the CityStreet dataset [86], CVCS dataset [91], PETS2009 dataset [10], and DukeMTMC dataset [58]. For better visualization, we plot the best performance among [86, 90, 87, 88, 91, 95, 56, 89] to compare with ours on each dataset.

4.3 Quantitative Experiments

As summarized in Figure 4, which compares CountFormer against the latest state-of-the-art (SOTA) results, CountFormer outperforms the SOTA by a large margin on the PETS2009 [10], CityStreet [86], and CVCS [91] datasets, and achieves comparable performance on DukeMTMC [58].

Specifically, on the DukeMTMC dataset [58], our method achieves competitive results among recent approaches [86, 90, 87, 88, 91, 95, 89, 56]. Although the improvement is not pronounced, this is reasonable considering the saturated performance [90, 88, 91, 95], the inaccurate annotations [86], and the non-overlapping camera views. In contrast, the PETS2009 dataset [10] contains a congested crowd distribution and overlapping views. As a result, CountFormer significantly outperforms all existing approaches [86, 90, 87, 88, 91, 95, 56, 89], reducing the scene-level $\mathrm{MAE}$ by $76.0\%$. A similar trend can be seen on the CityStreet benchmark [86], which contains a larger crowd distribution, severe dynamic occlusions from the environment, and diverse scale variations caused by perspective projection: CountFormer significantly outperforms all current approaches [86, 90, 87, 88, 91, 95, 56, 89] in terms of $\mathrm{MAE/NAE}$ at both the scene level and the single-view level, reducing the scene-level MAE/NAE by nearly half. On the large-scale and most challenging MVC benchmark, CVCS [91], CountFormer sets a new state of the art in $\mathrm{MAE/NAE}$, outperforming the latest 3D counting approach [88] by $178\%$.
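
The 76.0% figure for PETS2009 can be reproduced from the MAE values listed in Figure 4 (3.08 for the best prior method, 0.74 for CountFormer). The helper below is only a worked-arithmetic sketch; its name is illustrative.

```python
def relative_mae_reduction(sota_mae, ours_mae):
    """Percentage by which ours_mae lowers the error relative to sota_mae."""
    if sota_mae <= 0:
        raise ValueError("the reference MAE must be positive")
    return 100.0 * (sota_mae - ours_mae) / sota_mae

# PETS2009 values taken from Figure 4: (3.08 - 0.74) / 3.08
pets2009_reduction = relative_mae_reduction(3.08, 0.74)
print(f"{pets2009_reduction:.1f}%")  # prints 76.0%
```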

This is not surprising given the architecture of CountFormer. Specifically, (1) the Feature Lifting Module employs the attention mechanism to retrieve semantic features from image space, which proves more robust than the IPM [2] strategy adopted by previous works [90, 88], because the IPM assumes that the world lies on a flat plane, an assumption that is difficult to guarantee. Moreover, it remains complicated for the IPM strategy to account for the height of people in the crowd, so the IPM tends to produce misaligned feature transformations; (2) the MV Volume Aggregation Module is capable of handling occlusion and scale variation by picking features from the appropriate view, whereas contemporary work [91] fuses multi-view features based on the distance from the IPM plane to the camera while ignoring the semantic features themselves. While the distance prior is capable of addressing scale variation, it is powerless when dealing with occlusion, since resolving occlusion requires semantic features for validation; (3) the camera encoding strategy embeds the camera parameters into the volume query and the image features, implicitly allowing CountFormer to model the camera extrinsics and intrinsics and enabling it to deal with arbitrary camera layouts.
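
The misalignment attributed to the flat-world assumption in point (1) can be illustrated with a toy pinhole camera. The sketch below is not part of CountFormer: the intrinsics, the 5 m camera height, the straight-down viewing direction, and the 1.7 m person height are all hypothetical values chosen to keep the arithmetic easy to follow.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of a 3D world point X to pixel coordinates."""
    x_cam = R @ X + t
    uvw = K @ x_cam
    return uvw[:2] / uvw[2]

def ipm_to_ground(K, R, t, uv):
    """Back-project pixel uv onto the plane z = 0, as the flat-world IPM assumes."""
    center = -R.T @ t  # camera centre in the world frame
    ray = R.T @ np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    scale = -center[2] / ray[2]  # ray parameter at which the ray reaches z = 0
    return center + scale * ray

# Hypothetical camera 5 m above the ground, looking straight down.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])
t = -R @ np.array([0.0, 0.0, 5.0])

foot = np.array([2.0, 1.0, 0.0])
head = np.array([2.0, 1.0, 1.7])  # same person, assumed to be 1.7 m tall

foot_on_ground = ipm_to_ground(K, R, t, project(K, R, t, foot))
head_on_ground = ipm_to_ground(K, R, t, project(K, R, t, head))
offset = np.linalg.norm(head_on_ground[:2] - foot[:2])
```

In this setup the foot point maps back to its true ground position, whereas the head point lands about 1.15 m away from it, so head features warped by the IPM end up at the wrong ground location. Lifting features into a volume with a height axis avoids committing to a single plane.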

| $\mathrm{L}$ | $\mathrm{V}$ | $\mathrm{A}$ | $\mathrm{I}$ | PETS2009 $\mathrm{MAE}$ | PETS2009 $\mathrm{NAE}$ | CityStreet $\mathrm{MAE}$ | CityStreet $\mathrm{NAE}$ | CVCS $\mathrm{MAE}$ | CVCS $\mathrm{NAE}$ |
|---|---|---|---|---|---|---|---|---|---|
|  |  |  |  | 2.76 | 0.114 | 7.12 | 0.084 | 11.3 | 0.088 |
|  |  |  |  | 1.31 | 0.054 | 5.65 | 0.071 | 9.51 | 0.074 |
|  |  |  |  | 1.17 | 0.047 | 5.15 | 0.062 | 6.22 | 0.048 |
|  |  |  |  | 0.80 | 0.033 | 4.81 | 0.057 | 4.96 | 0.041 |
|  |  |  |  | 0.74 | 0.030 | 4.72 | 0.058 | 4.79 | 0.039 |

Table 2: Ablation Study. The table presents the ablation results w.r.t. various combinations of the critical components, where $\mathrm{A}$ indicates whether the MV Volume Aggregation Module is adopted or the MV volumes are simply averaged for each voxel, $\mathrm{I}$ indicates whether camera encoding is injected into the image-level features, $\mathrm{V}$ indicates whether volume-level camera encoding is adopted, and $\mathrm{L}$ indicates whether the Feature Lifting Module is used or the IPM is naively employed to transform image features to 3D space.

4.4 Ablation Study

In this section, various ablation experiments are conducted to thoroughly analyze the merits of CountFormer.

Firstly, we evaluate the impact of the critical designs that constitute CountFormer. As the ablation results in Table 2 show, adopting the Feature Lifting Module substantially improves the performance on the PETS2009 and CityStreet datasets. Because these two datasets use fixed camera layouts, the Feature Lifting Module suffices to lift image features to the volume representation and allows the volume queries to encode the camera layouts implicitly, which accounts for most of the performance gain.

| Method | $\mathrm{L}$ | $\mathrm{V}$ | $\mathrm{A}$ | PETS2009 $\mathrm{MAE}$ | PETS2009 $\mathrm{NAE}$ | CityStreet $\mathrm{MAE}$ | CityStreet $\mathrm{NAE}$ | CVCS $\mathrm{MAE}$ | CVCS $\mathrm{NAE}$ |
|---|---|---|---|---|---|---|---|---|---|
| 3D Counting [87] |  |  |  | 3.25 | 0.136 | 7.63 | 0.102 | 12.8 | 0.116 |
| 3D Counting [87] |  |  |  | 2.53 | 0.102 | 6.76 | 0.091 | 12.9 | 0.114 |
| 3D Counting [87] |  |  |  | 2.18 | 0.084 | 6.09 | 0.082 | 8.52 | 0.069 |
| 3D Counting [87] |  |  |  | 1.29 | 0.051 | 5.54 | 0.074 | 6.21 | 0.046 |
| CVCS [91] |  |  |  | 3.81 | 0.142 | 7.43 | 0.101 | 7.27 | 0.061 |
| CVCS [91] |  |  |  | 2.79 | 0.108 | 7.14 | 0.098 | 7.41 | 0.063 |
| CVCS [91] |  |  |  | 2.54 | 0.097 | 6.86 | 0.094 | 7.01 | 0.059 |
| CVCS [91] |  |  |  | 1.72 | 0.063 | 6.17 | 0.080 | 6.76 | 0.054 |
| BEVFormer [29] |  |  |  | 3.04 | 0.119 | 7.17 | 0.096 | 9.56 | 0.087 |
| BEVFormer [29] |  |  |  | 2.63 | 0.105 | 6.63 | 0.085 | 7.32 | 0.061 |
| BEVFormer [29] |  |  |  | 1.45 | 0.058 | 5.81 | 0.078 | 6.33 | 0.048 |

Table 3: Ablation Study. The table shows the effects of integrating the critical components into existing approaches [88, 91, 29], where $\mathrm{L}$, $\mathrm{V}$, and $\mathrm{A}$ share the definitions given in Table 2.

Furthermore, it is noteworthy that integrating the Volume-Level Camera Embedding together with the MV Volume Aggregation Module yields a substantial performance gain on the CVCS dataset, which is currently one of the most challenging benchmarks for MVC. We attribute this gain to the Volume-Level Camera Embedding’s explicit incorporation of camera extrinsic and intrinsic parameters, as well as to the MV Volume Aggregation Module’s ability to select suitable features for each voxel. In addition, injecting camera encoding into the image features yields a marginal performance improvement, which is reasonable as the camera encoding complements the attention mechanism within the Feature Lifting Module.

| #Train-View | #Test-View = 3 | #Test-View = 5 | #Test-View = 7 | #Test-View = 9 |
|---|---|---|---|---|
| $\star$ | 23.7/0.178 | 11.0/0.081 | 6.24/0.047 | 3.88/0.029 |
| 5 | 26.8/0.197 | 11.0/0.081 | 8.06/0.059 | 4.52/0.034 |
| $\mathrm{U}(2,11)$ | 24.0/0.179 | 11.9/0.088 | 6.72/0.052 | 4.02/0.030 |

Table 4: Ablation Study. The table investigates the robustness of CountFormer against various numbers of camera views on the CVCS dataset [91], where the number of image views used for training is not necessarily equal to that in the testing stage and the ground-truth count covers all the people in the scene. Specifically, $\star$ indicates that #Train-View equals #Test-View, $5$ means that CountFormer is trained with #Train-View=5 and evaluated with various #Test-View, and $\mathrm{U}(\mathrm{a},\mathrm{b})$ denotes that #Train-View is uniformly sampled from $\mathrm{a}$ to $\mathrm{b}$ in each iteration.

Table 3 presents the effectiveness of the components when integrated with established MVC methods [87, 91] and an MVP approach [29]. We observe that the volume-level camera embedding, the MV volume aggregation module, and the feature lifting module also substantially improve the performance of existing MVC methods [91, 87]. Additionally, experimental results on the challenging CVCS dataset demonstrate that the MV volume aggregation and the volume-level camera embedding are indispensable for dealing with arbitrary dynamic camera layouts. Moreover, it is not sufficient to naively adopt the established MVP architecture BEVFormer [29] without considering the specific challenges of the MVC setting. Equipped with the proposed components, e.g., the volume-level camera embedding and the MV volume aggregation module, BEVFormer achieves promising performance.

| #Train-View | #Test-View | PETS2009 $\mathrm{MAE}$ | PETS2009 $\mathrm{NAE}$ | CityStreet $\mathrm{MAE}$ | CityStreet $\mathrm{NAE}$ |
|---|---|---|---|---|---|
| $\star$ | 1 | 5.43 | 0.217 | 7.75 | 0.091 |
| $\star$ | 2 | 2.07 | 0.082 | 6.03 | 0.071 |
| $\star$ | 3 | 0.74 | 0.030 | 4.72 | 0.058 |
| 3 | 1 | 7.05 | 0.267 | 8.42 | 0.100 |
| 3 | 2 | 3.46 | 0.139 | 6.71 | 0.080 |
| 3 | 3 | 0.74 | 0.030 | 4.72 | 0.058 |
| $\mathrm{U}(1,3)$ | 1 | 6.01 | 0.257 | 7.91 | 0.098 |
| $\mathrm{U}(1,3)$ | 2 | 2.56 | 0.991 | 6.29 | 0.074 |
| $\mathrm{U}(1,3)$ | 3 | 0.89 | 0.035 | 5.16 | 0.063 |

Table 5: Ablation Study. The table demonstrates the robustness of CountFormer with various numbers of cameras on the PETS2009 [10] and CityStreet [86] datasets, where $\star$ and $\mathrm{U}(\cdot,\cdot)$ are defined as in Table 4.

Moreover, in practical surveillance environments, it is inevitable that some cameras go offline, which necessitates the capability of dealing with dynamic camera layouts, e.g., varying numbers of camera views and cameras with significantly different viewpoints. As Table 4 and Table 5 show, CountFormer achieves encouraging performance under dynamic, arbitrary camera layouts, even when some cameras are deactivated during the inference stage. By incorporating dynamic view selection during the training stage, CountFormer achieves performance comparable to models trained with a predefined number of views. This robustness in handling dynamic camera layouts contributes to the effectiveness of CountFormer in challenging surveillance scenarios.
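
One way the dynamic view selection used in the $\mathrm{U}(\mathrm{a},\mathrm{b})$ rows of Tables 4 and 5 might be realized is to draw the number of views afresh in every training iteration and then pick that many cameras at random. The sketch below is an assumption about the sampling procedure, not the authors' code; in particular, it treats both endpoints of $\mathrm{U}(\mathrm{a},\mathrm{b})$ as inclusive, and the function name is hypothetical.

```python
import random

def sample_training_views(camera_ids, min_views, max_views, rng=random):
    """Pick a random camera subset whose size is uniform in [min_views, max_views]."""
    camera_ids = list(camera_ids)
    if not 1 <= min_views <= max_views <= len(camera_ids):
        raise ValueError("need 1 <= min_views <= max_views <= number of cameras")
    num_views = rng.randint(min_views, max_views)  # inclusive on both ends
    # Sample without replacement; sorting only makes the result easier to read.
    return sorted(rng.sample(camera_ids, num_views))
```

Calling `sample_training_views(["C1", "C2", "C3"], 1, 3)` once per iteration would mimic the $\mathrm{U}(1,3)$ setting; a permutation-independent aggregation is what allows the same network to consume subsets of different sizes.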

5 Conclusion and Discussion

3D multi-view counting (MVC) is a challenging research area with many potential real-world applications. Nevertheless, there is currently no method that can solve the 3D MVC problem with arbitrary dynamic camera layouts. To address this gap, we developed a concise multi-view learning framework, CountFormer. Experimental results demonstrate that CountFormer is capable of handling challenging scenarios that single-view counting approaches struggle with. Quantitatively, CountFormer substantially outperforms existing MVC approaches and achieves state-of-the-art performance on most MVC benchmarks. We believe that CountFormer can provide valuable insights for further research on MVC in real-world scenarios. A limitation is that training CountFormer requires labor-intensive 3D annotations of head points, which poses challenges for deployment in real-world scenarios. In contrast, labeling 2D head points in image space is more efficient. Therefore, it is worth exploring methods that leverage 2D annotations to train CountFormer in the future. Furthermore, resource consumption is a significant problem that requires careful consideration in real-world scenarios. Employing more efficient attention mechanisms, channel pruning, and quantization techniques to accelerate inference is a promising direction.

References

  • [1] Deepak Babu Sam, Shiv Surya, and R Venkatesh Babu. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
  • [2] Massimo Bertozzi, Alberto Broggi, and Alessandra Fascioli. Stereo inverse perspective mapping: theory and applications. Image and Vision Computing (IVC), 16(8):585–590, 1998.
  • [3] Lokesh Boominathan, Srinivas SS Kruthiventi, and R Venkatesh Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the International Conference on Multimedia (MM), pages 640–644. ACM, 2016.
  • [4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631. IEEE, 2020.
  • [5] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2018.
  • [6] Zhi-Qi Cheng, Qi Dai, Hong Li, Jingkuan Song, Xiao Wu, and Alexander G Hauptmann. Rethinking spatial invariance of convolutional networks for object counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19638–19648. IEEE, 2022.
  • [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2020.
  • [8] Zhipeng Du, Miaojing Shi, Jiankang Deng, and Stefanos Zafeiriou. Redesigning multi-scale neural network for crowd counting. IEEE Transactions on Image Processing (TIP), 2023.
  • [9] Yanyan Fang, Shenghua Gao, Jing Li, Weixin Luo, Linfang He, and Bo Hu. Multi-level feature fusion based locality-constrained spatial transformer network for video crowd counting. Neurocomputing, 392:98–107, 2020.
  • [10] James Ferryman and Ali Shahrokni. Pets2009: Dataset and challenge. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pages 1–6. IEEE, 2009.
  • [11] Junyu Gao, Maoguo Gong, and Xuelong Li. Congested crowd instance localization with dilated convolutional swin transformer. Neurocomputing, 513:94–103, 2022.
  • [12] Jiaqi Gao, Jingqi Li, Hongming Shan, Yanyun Qu, James Z Wang, Fei-Yue Wang, and Junping Zhang. Forget less, count better: a domain-incremental self-distillation learning benchmark for lifelong crowd counting. Frontiers of Information Technology & Electronic Engineering, 24(2):187–202, 2023.
  • [13] Junyu Gao, Qi Wang, and Xuelong Li. Pcc net: Perspective crowd counting via spatial convolutional network. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 30(10):3486–3498, 2019.
  • [14] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7036–7045. IEEE, 2019.
  • [15] Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, and Alex Kendall. Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15273–15282. IEEE, 2021.
  • [16] Yutao Hu, Xiaolong Jiang, Xuhui Liu, Baochang Zhang, Jungong Han, Xianbin Cao, and David Doermann. Nas-count: Counting-by-density with neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 747–766. Springer, 2020.
  • [17] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  • [18] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9223–9232. IEEE, 2023.
  • [19] Zhi-Kai Huang, Wei-Ting Chen, Yuan-Chun Chiang, Sy-Yen Kuo, and Ming-Hsuan Yang. Counting crowds in bad weather. arXiv preprint arXiv:2306.01209, 2023.
  • [20] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015.
  • [21] Xiaoheng Jiang, Li Zhang, Mingliang Xu, Tianzhu Zhang, Pei Lv, Bing Zhou, Xin Yang, and Yanwei Pang. Attention scaling for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4706–4715. IEEE, 2020.
  • [22] Yanqin Jiang, Li Zhang, Zhenwei Miao, Xiatian Zhu, Jin Gao, Weiming Hu, and Yu-Gang Jiang. Polarformer: Multi-camera 3d object detection with polar transformer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), number 1, pages 1042–1050, 2023.
  • [23] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. arXiv preprint arXiv:1610.04325, 2016.
  • [24] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6399–6408. IEEE, 2019.
  • [25] Yinjie Lei, Yan Liu, Pingping Zhang, and Lingqiao Liu. Towards using count-level weak supervision for crowd counting. Pattern Recognition (PR), 109:107616, 2021.
  • [26] Tianyu Li, Peijin Jia, Bangjun Wang, Li Chen, Kun Jiang, Junchi Yan, and Hongyang Li. Lanesegnet: Map learning with lane segment perception for autonomous driving. arXiv preprint arXiv:2312.16108, 2023.
  • [27] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 37, pages 1477–1485, 2023.
  • [28] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
  • [29] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1–18. Springer, 2022.
  • [30] Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, and Jose M Alvarez. Fb-bev: Bev representation from forward-backward view transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6919–6928. IEEE, 2023.
  • [31] Dongze Lian, Xianing Chen, Jing Li, Weixin Luo, and Shenghua Gao. Locating and counting heads in crowds with a depth prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(12):9056–9072, 2021.
  • [32] Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, and Xiang Bai. Transcrowd: weakly-supervised crowd counting with transformers. Science China Information Sciences, 65(6):160104, 2022.
  • [33] Dingkang Liang, Wei Xu, and Xiang Bai. An end-to-end transformer model for crowd localization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 38–54. Springer, 2022.
  • [34] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems (NeurIPS), 35:10421–10434, 2022.
  • [35] Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. In International Conference on Learning Representations (ICLR), 2022.
  • [36] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125. IEEE, 2017.
  • [37] Chengxin Liu, Hao Lu, Zhiguo Cao, and Tongliang Liu. Point-query quadtree for crowd counting, localization, and more. arXiv preprint arXiv:2308.13814, 2023.
  • [38] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 18580–18590, 2023.
  • [39] Jiang Liu, Chenqiang Gao, Deyu Meng, and Alexander G Hauptmann. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5197–5206. IEEE, 2018.
  • [40] Liang Liu, Hao Lu, Hongwei Zou, Haipeng Xiong, Zhiguo Cao, and Chunhua Shen. Weighing counts: Sequential crowd counting by reinforcement learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 164–181. Springer, 2020.
  • [41] Lingbo Liu, Zhilin Qiu, Guanbin Li, Shufan Liu, Wanli Ouyang, and Liang Lin. Crowd counting with deep structured scale integration network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019.
  • [42] Ning Liu, Yongchao Long, Changqing Zou, Qun Niu, Li Pan, and Hefeng Wu. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019.
  • [43] Weizhe Liu, Nikita Durasov, and Pascal Fua. Leveraging self-supervision for cross-domain crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5341–5352. IEEE, 2022.
  • [44] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5099–5108. IEEE, 2019.
  • [45] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 531–548. Springer, 2022.
  • [46] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021.
  • [47] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2774–2781. IEEE, 2023.
  • [48] Yiming Ma, Victor Sanchez, and Tanaya Guha. Fusioncount: efficient crowd counting via multiscale feature fusion. In International Conference on Image Processing (ICIP), pages 3256–3260. IEEE, 2022.
  • [49] Zhiheng Ma, Xiaopeng Hong, Xing Wei, Yunfeng Qiu, and Yihong Gong. Towards a universal model for cross-dataset crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3205–3214. IEEE, 2021.
  • [50] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6142–6151. IEEE, 2019.
  • [51] Yunze Man, Liang-Yan Gui, and Yu-Xiong Wang. Bev-guided multi-modality fusion for driving perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21960–21969, 2023.
  • [52] Hong Mo, Wenqi Ren, Yuan Xiong, Xiaoqi Pan, Zhong Zhou, Xiaochun Cao, and Wei Wu. Background noise filtering and distribution dividing for crowd counting. IEEE Transactions on Image Processing (TIP), 29:8199–8212, 2020.
  • [53] Hong Mo, Wenqi Ren, Xiong Zhang, Feihu Yan, Zhong Zhou, Xiaochun Cao, and Wei Wu. Attention-guided collaborative counting. IEEE Transactions on Image Processing (TIP), 31:6306–6319, 2022.
  • [54] Xiaoqi Pan, Hong Mo, Zhong Zhou, and Wei Wu. Attention guided region division for crowd counting. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2568–2572. IEEE, 2020.
  • [55] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the European Conference on Computer Vision (ECCV), pages 194–210. Springer, 2020.
  • [56] Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. Cross view fusion for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4342–4351. IEEE, 2019.
  • [57] Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. Diffuse-denoise-count: Accurate crowd-counting with diffusion models. arXiv preprint arXiv:2303.12790, 2023.
  • [58] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 17–35. Springer, 2016.
  • [59] Miaojing Shi, Zhaohui Yang, Chao Xu, and Qijun Chen. Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7279–7288. IEEE, 2019.
  • [60] Xiaowen Shi, Xin Li, Caili Wu, Shuchen Kong, Jing Yang, and Liang He. A real-time deep network for crowd counting. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2328–2332. IEEE, 2020.
  • [61] Vishwanath A Sindagi and Vishal M Patel. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of the International Conference on Advanced Video and Signal based Surveillance (AVSS), pages 1–6. IEEE, 2017.
  • [62] Qingyu Song, Changan Wang, Zhengkai Jiang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yang Wu. Rethinking counting and localization in crowds: A purely point-based framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3365–3374. IEEE, 2021.
  • [63] Qingyu Song, Changan Wang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Jian Wu, and Jiayi Ma. To choose or to fuse? scale selection for crowd counting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 35, pages 2576–2583, 2021.
  • [64] Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, and Luc Van Gool. Boosting crowd counting with transformers. arXiv preprint arXiv:2105.10926, 2021.
  • [65] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2446–2454. IEEE, 2020.
  • [66] Ye Tian, Xiangxiang Chu, and Hongpeng Wang. Cctrans: Simplifying and improving crowd counting with transformer. arXiv preprint arXiv:2109.14483, 2021.
  • [67] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8406–8415, 2023.
  • [68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
  • [69] Boyu Wang, Huidong Liu, Dimitris Samaras, and Minh Hoai Nguyen. Distribution matching for crowd counting. Advances in Neural Information Processing Systems (NeurIPS), pages 1595–1607, 2020.
  • [70] Peng Wang, Chenyu Gao, Yang Wang, Hui Li, and Ye Gao. Mobilecount: An efficient encoder-decoder framework for real-time crowd counting. Neurocomputing, 407:292–299, 2020.
  • [71] Xiaofeng Wang, Zheng Zhu, Wenbo Xu, Yunpeng Zhang, Yi Wei, Xu Chi, Yun Ye, Dalong Du, Jiwen Lu, and Xingang Wang. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991, 2023.
  • [72] Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Frustumformer: Adaptive instance-aware resampling for multi-view 3d detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5096–5105, 2023.
  • [73] Xing Wei, Yuanrui Kang, Jihao Yang, Yunfeng Qiu, Dahu Shi, Wenming Tan, and Yihong Gong. Scene-adaptive attention network for crowd counting. arXiv preprint arXiv:2112.15509, 2021.
  • [74] Xing Wei, Yunfeng Qiu, Zhiheng Ma, Xiaopeng Hong, and Yihong Gong. Semi-supervised crowd counting via multiple representation learning. IEEE Transactions on Image Processing (TIP), 2023.
  • [75] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21729–21740. IEEE, 2023.
  • [76] Runsheng Xu, Zhengzhong Tu, Hao Xiang, Wei Shao, Bolei Zhou, and Jiaqi Ma. Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. arXiv preprint arXiv:2207.02202, 2022.
  • [77] Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17830–17839. IEEE, 2023.
  • [78] Shaopeng Yang, Weiyu Guo, and Yuheng Ren. Crowdformer: An overlap patching vision transformer for top-down crowd counting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 23–29, 2022.
  • [79] Kaiwen Yuan, Zhenyu Guo, and Z Jane Wang. Rggnet: Tolerance aware lidar-camera online calibration with geometric deep learning and generative model. IEEE Robotics and Automation Letters (RA-L), 5(4):6956–6963, 2020.
  • [80] Maoxun Yuan, Yinyan Wang, and Xingxing Wei. Translation, scale and rotation: Cross-modal alignment meets rgb-infrared vehicle detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 509–525. Springer, 2022.
  • [81] Lingke Zeng, Xiangmin Xu, Bolun Cai, Suo Qiu, and Tong Zhang. Multi-scale convolutional neural networks for crowd counting. In Proceedings of the IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
  • [82] Qiang Zhai, Fan Yang, Xin Li, Guo-Sen Xie, Hong Cheng, and Zicheng Liu. Co-communication graph convolutional network for multi-view crowd counting. IEEE Transactions on Multimedia (TMM), 25:5813–5825, 2022.
  • [83] Anran Zhang, Lei Yue, Jiayi Shen, Fan Zhu, Xiantong Zhen, Xianbin Cao, and Ling Shao. Attentional neural fields for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5714–5723. IEEE, 2019.
  • [84] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–841. IEEE, 2015.
  • [85] Lu Zhang, Miaojing Shi, and Qiaobo Chen. Crowd counting via scale-adaptive convolutional neural network. In Winter Conference on Applications of Computer Vision (WACV), pages 1113–1121. IEEE, 2018.
  • [86] Qi Zhang and Antoni B Chan. Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8297–8306. IEEE, 2019.
  • [87] Qi Zhang and Antoni B Chan. 3d crowd counting via multi-view fusion with 3d gaussian kernels. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, pages 12837–12844, 2020.
  • [88] Qi Zhang and Antoni B Chan. 3d crowd counting via geometric attention-guided multi-view fusion. International Journal of Computer Vision (IJCV), 130(12):3123–3139, 2022.
  • [89] Qi Zhang and Antoni B Chan. Calibration-free multi-view crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 227–244. Springer, 2022.
  • [90] Qi Zhang and Antoni B Chan. Wide-area crowd counting: Multi-view fusion networks for counting in large scenes. International Journal of Computer Vision (IJCV), 130(8):1938–1960, 2022.
  • [91] Qi Zhang, Wei Lin, and Antoni B Chan. Cross-view cross-scene multi-view crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 557–567. IEEE, 2021.
  • [92] Xiong Zhang, Hongmin Xu, Hong Mo, Jianchao Tan, Cheng Yang, Lei Wang, and Wenqi Ren. Dcnas: Densely connected neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13956–13967. IEEE, 2021.
  • [93] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 589–597. IEEE, 2016.
  • [94] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2304.05316, 2023.
  • [95] Liangfeng Zheng, Yongzhi Li, and Yadong Mu. Learning factorized cross-view fusion for multi-view crowd counting. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021.
  • [96] Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13760–13769. IEEE, 2022.
  • [97] Huilin Zhu, Jingling Yuan, Xian Zhong, Zhengwei Yang, Zheng Wang, and Shengfeng He. Daot: Domain-agnostically aligned optimal transport for domain-adaptive crowd counting. arXiv preprint arXiv:2308.05311, 2023.
  • [98] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR), 2020.