Search | arXiv e-print repository

MARS: Paying more attention to visual attributes for text-based person search

Authors: Alex Ergasti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

Abstract: Text-based person search (TBPS) is a problem that has gained significant interest within the research community. The task is that of retrieving one or more images of a specific individual based on a textual description. The multi-modal nature of the task requires learning representations that bridge text and image data within a shared latent space. Existing TBPS systems face two major challenges. One is inter-identity noise, which is due to the inherent vagueness and imprecision of text descriptions and indicates how descriptions of visual attributes can generally be associated with different people; the other is intra-identity variation, i.e. all those nuisances (e.g. pose, illumination) that can alter the visual appearance of the same textual attributes for a given subject. To address these issues, this paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive), which enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. The former employs a Masked AutoEncoder trained to reconstruct randomly masked image patches with the aid of the textual description. In doing so, the model is encouraged to learn more expressive representations and textual-visual relations in the latent space. The Attribute Loss, instead, balances the contribution of different types of attributes, defined as adjective-noun chunks of text. This loss ensures that every attribute is taken into consideration in the person retrieval process.
Extensive experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements, with significant gains in the mean Average Precision (mAP) metric w.r.t. the current state of the art.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2404.18924 [pdf, other]

Swin2-MoSE: A New Single Image Super-Resolution Model for Remote Sensing

Authors: Leonardo Rossi, Vittorio Bernuzzi, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati

Abstract: Due to the limitations of current optical and sensor technologies and the high cost of updating them, the spectral and spatial resolution of satellites may not always meet desired requirements. For these reasons, Remote-Sensing Single-Image Super-Resolution (RS-SISR) techniques have gained significant interest. In this paper, we propose the Swin2-MoSE model, an enhanced version of Swin2SR. Our model introduces MoE-SM, an enhanced Mixture-of-Experts (MoE) to replace the Feed-Forward inside all Transformer blocks. MoE-SM is designed with Smart-Merger, a new layer for merging the output of individual experts, and with a new way to split the work between experts, defining a new per-example strategy instead of the commonly used per-token one. Furthermore, we analyze how positional encodings interact with each other, demonstrating that per-channel bias and per-head bias can positively cooperate. Finally, we propose to use a combination of Normalized-Cross-Correlation (NCC) and Structural Similarity Index Measure (SSIM) losses, to avoid typical MSE loss limitations. Experimental results demonstrate that Swin2-MoSE outperforms SOTA by up to 0.377 ~ 0.958 dB (PSNR) on the tasks of 2x, 3x and 4x resolution-upscaling (Sen2Venus and OLI2MSI datasets). We show the efficacy of Swin2-MoSE by applying it to a semantic segmentation task (SeasoNet dataset). Code and pretrained models are available at https://github.com/IMPLabUniPr/swin2-mose/tree/official_code

Submitted 29 April, 2024; originally announced April 2024.
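The Swin2-MoSE abstract mentions a Normalized-Cross-Correlation (NCC) loss. As a rough illustration only, the sketch below computes the standard zero-mean NCC between two flattened images and turns it into a loss as 1 - NCC. The function names `ncc` and `ncc_loss`, the `eps` stabilizer, and the 1 - NCC form are assumptions made for this example; the abstract does not say how the authors define the loss or how they weight it against SSIM.

```python
import math


def ncc(x, y, eps=1e-8):
    """Zero-mean normalized cross-correlation of two equal-length sequences.

    Returns a value in [-1, 1]; `eps` (an assumption of this sketch) avoids
    division by zero when one input is constant.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance-like numerator and product of standard-deviation-like terms.
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return num / (math.sqrt(var_x * var_y) + eps)


def ncc_loss(pred, target):
    """Hypothetical loss form: 0 for perfectly correlated inputs, 2 for anti-correlated."""
    return 1.0 - ncc(pred, target)


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0, 4.0]
    print(ncc_loss(target, target))                  # close to 0
    print(ncc_loss([4.0, 3.0, 2.0, 1.0], target))    # close to 2
```

Because NCC subtracts the mean and divides by the spread, it ignores global brightness offsets and contrast scaling, which is one plausible reason to pair it with a structural term such as SSIM rather than rely on MSE alone.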

arXiv:2403.12743 [pdf, other]

Towards Controllable Face Generation with Semantic Latent Diffusion Models

Authors: Alex Ergasti, Claudio Ferrari, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati

Abstract: Semantic Image Synthesis (SIS) is among the most popular and effective techniques in the field of face generation and editing, thanks to its good generation quality and the versatility it brings along. Recent works attempted to go beyond the standard GAN-based framework, and started to explore Diffusion Models (DMs) for this task as these stand out with respect to GANs in terms of both quality and diversity. On the other hand, DMs lack fine-grained controllability and reproducibility. To address that, in this paper we propose a SIS framework based on a novel Latent Diffusion Model architecture for human face generation and editing that is able both to reproduce and manipulate a real reference image and to generate diversity-driven results. The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for precise control over each of the semantic parts of the human face. This was not possible with previous methods in the state of the art. Finally, we performed an extensive set of experiments to prove that our model surpasses the current state of the art, both qualitatively and quantitatively.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2312.17561 [pdf, other]

Informative Rays Selection for Few-Shot Neural Radiance Fields

Authors: Marco Orsingher, Anthony Dell'Eva, Paolo Zani, Paolo Medici, Massimo Bertozzi

Abstract: Neural Radiance Fields (NeRF) have recently emerged as a powerful method for image-based 3D reconstruction, but the lengthy per-scene optimization limits their practical usage, especially in resource-constrained settings. Existing approaches solve this issue by reducing the number of input views and regularizing the learned volumetric representation with either complex losses or additional inputs from other modalities. In this paper, we present KeyNeRF, a simple yet effective method for training NeRF in few-shot scenarios by focusing on key informative rays. Such rays are first selected at camera level by a view selection algorithm that promotes baseline diversity while guaranteeing scene coverage, then at pixel level by sampling from a probability distribution based on local image entropy. Our approach performs favorably against state-of-the-art methods, while requiring minimal changes to existing NeRF codebases.

Submitted 29 December, 2023; originally announced December 2023.

Comments: To appear at VISAPP 2024
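The KeyNeRF abstract describes pixel-level ray selection by sampling from a probability distribution based on local image entropy. The sketch below illustrates that general idea on a tiny grayscale image stored as nested lists. It is not the authors' implementation: the window `radius`, the use of raw Shannon entropy as the sampling weight, sampling with replacement, and the names `local_entropy` and `sample_pixels` are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def local_entropy(image, radius=1):
    """Shannon entropy (in bits) of the intensity histogram in a square
    window of the given radius around each pixel (clipped at the borders)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [
                image[r][c]
                for r in range(max(0, i - radius), min(h, i + radius + 1))
                for c in range(max(0, j - radius), min(w, j + radius + 1))
            ]
            n = len(window)
            counts = Counter(window)
            out[i][j] = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return out


def sample_pixels(image, num_rays, radius=1, seed=0):
    """Draw pixel coordinates (row, col) with probability proportional to
    local entropy. Sampling is with replacement (an assumption of this sketch)."""
    entropy = local_entropy(image, radius)
    coords = [(i, j) for i in range(len(image)) for j in range(len(image[0]))]
    weights = [entropy[i][j] for i, j in coords]
    if sum(weights) == 0:
        # Completely flat image: fall back to uniform sampling.
        weights = [1.0] * len(coords)
    rng = random.Random(seed)
    return rng.choices(coords, weights=weights, k=num_rays)


if __name__ == "__main__":
    # Left part of the image is flat, right part is textured.
    img = [[0 if c < 3 else (r + c) % 2 + 1 for c in range(6)] for r in range(4)]
    print(sample_pixels(img, num_rays=8))
```

In this toy image the flat columns have zero entropy and are never sampled, so the ray budget concentrates on textured regions, which is the intuition the abstract points to.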

arXiv:2309.16009 [pdf, ps, other]

Floer potentials, cluster algebras and quiver representations

Authors: Peter Albers, Maria Bertozzi, Markus Reineke

Abstract: We use cluster algebras to interpret Floer potentials of monotone Lagrangian tori in toric del Pezzo surfaces as cluster characters of quiver representations.

Submitted 27 September, 2023; originally announced September 2023.

Comments: 14 pages

MSC Class: 53D12 (Primary) 13F60; 16G20 (Secondary)

arXiv:2308.16071 [pdf, other]

Semantic Image Synthesis via Class-Adaptive Cross-Attention

Authors: Tomaso Fontanini, Claudio Ferrari, Giuseppe Lisanti, Massimo Bertozzi, Andrea Prati

Abstract: In semantic image synthesis the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE for learning shape-style correlations and so conditioning the image generation process. Our model inherits the versatility of SPADE, at the same time obtaining state-of-the-art generation quality, as well as improved global and local style transfer. Code and models available at https://github.com/TFonta/CA2SIS.

Submitted 20 February, 2024; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: Code and models available at https://github.com/TFonta/CA2SIS

arXiv:2307.05317 [pdf, other]

Automatic Generation of Semantic Parts for Face Image Synthesis

Authors: Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

Abstract: Semantic image synthesis (SIS) refers to the problem of generating realistic imagery given a semantic segmentation mask that defines the spatial layout of object classes. Most of the approaches in the literature, other than the quality of the generated images, put effort into finding solutions to increase the generation diversity in terms of style, i.e. texture. However, they all neglect a different feature, which is the possibility of manipulating the layout provided by the mask. Currently, the only way to do so is manually by means of graphical user interfaces. In this paper, we describe a network architecture to address the problem of automatically manipulating or generating the shape of object classes in semantic segmentation masks, with specific focus on human faces. Our proposed model allows embedding the mask class-wise into a latent space where each class embedding can be independently edited. Then, a bi-directional LSTM block and a convolutional decoder output a new, locally manipulated mask. We report quantitative and qualitative results on the CelebMask-HQ dataset, which show our model can both faithfully reconstruct and modify a segmentation mask at the class level. Also, we show our model can be put before a SIS generator, opening the way to a fully automatic generation control of both shape and texture. Code available at https://github.com/TFonta/Semantic-VAE.

Submitted 11 July, 2023; originally announced July 2023.

Comments: Preprint, accepted for publication at ICIAP 2023

arXiv:2302.10719 [pdf, other]

Memory-augmented Online Video Anomaly Detection

Authors: Leonardo Rossi, Vittorio Bernuzzi, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati

Abstract: The ability to understand the surrounding scene is of paramount importance for Autonomous Vehicles (AVs). This paper presents a system capable of working in an online fashion, giving an immediate response to the onset of anomalies surrounding the AV, exploiting only the videos captured by a dash-mounted camera. Our architecture, called MOVAD, relies on two main modules: a Short-Term Memory Module to extract information related to the ongoing action, implemented by a Video Swin Transformer (VST), and a Long-Term Memory Module injected inside the classifier that also considers remote past information and action context thanks to the use of a Long Short-Term Memory (LSTM) network. The strengths of MOVAD are not only linked to its excellent performance, but also to its straightforward and modular architecture, trained in an end-to-end fashion with only RGB frames and with as few assumptions as possible, which makes it easy to implement and play with. We evaluated the performance of our method on the Detection of Traffic Anomaly (DoTA) dataset, a challenging collection of dash-mounted camera videos of accidents. After an extensive ablation study, MOVAD is able to reach an AUC score of 82.17%, surpassing the current state-of-the-art by +2.87 AUC. Our code will be available on https://github.com/IMPLabUniPr/movad/tree/movad_vad

Submitted 27 September, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

MSC Class: 68-02; 68-04; 68-06; 68T07; 68T10; 68T45

ACM Class: F.1.1

arXiv:2210.13041 [pdf, other]

Learning Neural Radiance Fields from Multi-View Geometry

Authors: Marco Orsingher, Paolo Zani, Paolo Medici, Massimo Bertozzi

Abstract: We present a framework, called MVG-NeRF, that combines classical Multi-View Geometry algorithms and Neural Radiance Fields (NeRF) for image-based 3D reconstruction. NeRF has revolutionized the field of implicit 3D representations, mainly due to a differentiable volumetric rendering formulation that enables high-quality and geometry-aware novel view synthesis. However, the underlying geometry of the scene is not explicitly constrained during training, thus leading to noisy and incorrect results when extracting a mesh with marching cubes. To address this, we propose to leverage pixelwise depths and normals from a classical 3D reconstruction pipeline as geometric priors to guide NeRF optimization. Such priors are used as pseudo-ground truth during training in order to improve the quality of the estimated underlying surface. Moreover, each pixel is weighted by a confidence value based on the forward-backward reprojection error for additional robustness. Experimental results on real-world data demonstrate the effectiveness of this approach in obtaining clean 3D meshes from images, while maintaining competitive performance in novel view synthesis.

Submitted 24 October, 2022; originally announced October 2022.

Comments: ECCV 2022 Workshop on "Learning to Generate 3D Shapes and Scenes"

arXiv:2208.05274 [pdf, other]

Arbitrary Point Cloud Upsampling with Spherical Mixture of Gaussians

Authors: Anthony Dell'Eva, Marco Orsingher, Massimo Bertozzi

Abstract: Generating dense point clouds from sparse raw data benefits downstream 3D understanding tasks, but existing models are limited to a fixed upsampling ratio or to a short range of integer values. In this paper, we present APU-SMOG, a Transformer-based model for Arbitrary Point cloud Upsampling (APU). The sparse input is first mapped to a Spherical Mixture of Gaussians (SMOG) distribution, from which an arbitrary number of points can be sampled. Then, these samples are fed as queries to the Transformer decoder, which maps them back to the target surface. Extensive qualitative and quantitative evaluations show that APU-SMOG outperforms state-of-the-art fixed-ratio methods, while effectively enabling upsampling with any scaling factor, including non-integer values, with a single trained model. The code is available at https://github.com/apusmog/apusmog/

Submitted 10 January, 2023; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: Accepted to 3DV 2022 (Oral)

arXiv:2207.08439 [pdf, other]

Revisiting PatchMatch Multi-View Stereo for Urban 3D Reconstruction

Authors: Marco Orsingher, Paolo Zani, Paolo Medici, Massimo Bertozzi

Abstract: In this paper, a complete pipeline for image-based 3D reconstruction of urban scenarios is proposed, based on PatchMatch Multi-View Stereo (MVS). Input images are first fed into an off-the-shelf visual SLAM system to extract camera poses and sparse keypoints, which are used to initialize PatchMatch optimization. Then, pixelwise depths and normals are iteratively computed in a multi-scale framework with a novel depth-normal consistency loss term and a global refinement algorithm to balance the inherently local nature of PatchMatch. Finally, a large-scale point cloud is generated by back-projecting multi-view consistent estimates in 3D. The proposed approach is carefully evaluated against both classical MVS algorithms and monocular depth networks on the KITTI dataset, showing state-of-the-art performance.

Submitted 18 July, 2022; originally announced July 2022.

Comments: Poster presentation at IEEE Intelligent Vehicles Symposium (IV 2022, https://iv2022.com/)

arXiv:2207.08434 [pdf, other]

doi 10.1007/978-3-031-06430-2_10

Efficient View Clustering and Selection for City-Scale 3D Reconstruction

Authors: Marco Orsingher, Paolo Zani, Paolo Medici, Massimo Bertozzi

Abstract: Image datasets have been steadily growing in size, harming the feasibility and efficiency of large-scale 3D reconstruction methods. In this paper, a novel approach for scaling Multi-View Stereo (MVS) algorithms up to arbitrarily large collections of images is proposed. Specifically, the problem of reconstructing the 3D model of an entire city is targeted, starting from a set of videos acquired by a moving vehicle equipped with several high-resolution cameras. Initially, the presented method exploits an approximately uniform distribution of poses and geometry and builds a set of overlapping clusters. Then, an Integer Linear Programming (ILP) problem is formulated for each cluster to select an optimal subset of views that guarantees both visibility and matchability. Finally, local point clouds for each cluster are separately computed and merged. Since clustering is independent of pairwise visibility information, the proposed algorithm runs faster than existing approaches in the literature and allows for massive parallelization. Extensive testing on urban data is discussed to show the effectiveness and the scalability of this approach.

Submitted 18 July, 2022; originally announced July 2022.

Comments: Oral presentation at ICIAP 2021 (https://www.iciap2021.org/)

arXiv:2109.04468 [pdf, other]

Leveraging Local Domains for Image-to-Image Translation

Authors: Anthony Dell'Eva, Fabio Pizzati, Massimo Bertozzi, Raoul de Charette

Abstract: Image-to-image (i2i) networks struggle to capture local changes because they do not affect the global scene structure. For example, translating from highway scenes to offroad, i2i networks easily focus on global color features but ignore traits that are obvious to humans, like the absence of lane markings. In this paper, we leverage human knowledge about spatial domain characteristics, which we refer to as 'local domains', and demonstrate its benefit for image-to-image translation. Relying on a simple geometrical guidance, we train a patch-based GAN on few source data and hallucinate a new unseen domain which subsequently eases transfer learning to target. We experiment on three tasks ranging from unstructured environments to adverse weather. Our comprehensive evaluation setting shows we are able to generate realistic translations, with minimal priors, and training only on a few images. Furthermore, when trained on our translated images we show that all tested proxy tasks are significantly improved, without ever seeing the target domain at training.

Submitted 14 February, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

Comments: VISAPP 2022 Best Paper Award

arXiv:2010.08567 [pdf, other]

Infinite staircases for Hirzebruch surfaces

Authors: Maria Bertozzi, Tara S. Holm, Emily Maw, Dusa McDuff, Grace T. Mwakyoma, Ana Rita Pires, Morgan Weiler

Abstract: We consider the embedding capacity functions $c_{H_b}(z)$ for symplectic embeddings of ellipsoids of eccentricity $z$ into the family of nontrivial rational Hirzebruch surfaces $H_b$ with symplectic form parametrized by $b\in [0,1)$. This function was known to have an infinite staircase in the monotone cases ($b= 0$ and $ b= 1/3$). It is also known that for each $b$ there is at most one value of $z$ that can be the accumulation point of such a staircase. In this manuscript, we identify three sequences of open, disjoint, blocked $b$-intervals, consisting of $b$-parameters where the embedding capacity function for $H_b$ does not contain an infinite staircase. There is one sequence in each of the intervals $(0,1/5)$, $(1/5,1/3)$, and $(1/3,1)$. We then establish six sequences of associated infinite staircases, one occurring at each endpoint of the blocked $b$-intervals. The staircase numerics are variants of those in the Fibonacci staircase for the projective plane (the case $b=0$). We also show that there is no staircase at the point $b=1/5$, even though this value is not blocked. The focus of this paper is to develop techniques, both graphical and numeric, that allow identification of potential staircases, and then to understand the obstructions well enough to prove that the purported staircases really do have the required properties. A subsequent paper will explore in more depth the set of $b$ that admit infinite staircases.

Submitted 19 April, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

Comments: 90 pages, 12 figures. Version 2 has several typos fixed and numbering changed to match style in to-be-published version

MSC Class: Primary: 53D05. Secondary: 53D35; 11A55; 53D42; 53-04

arXiv:2001.08071 [pdf, ps, other]

Momentum map images of representation spaces of quivers

Authors: Maria Bertozzi, Markus Reineke

Abstract: We consider the base change action on real or complex representation spaces of quivers and the associated momentum map for a maximal compact subgroup of the base change group, as introduced by A. King. We give an explicit description of the momentum map image in terms of recursively defined inequalities on eigenvalues of Hermitian operators. Moreover, we characterize when the momentum map image is maximal possible, respectively of positive volume.

Submitted 14 May, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

Comments: 14 pages; corrected definition of height function, corrected examples, clarified relation to work of Baldoni-Vergne-Walter

Showing 1–15 of 15 results for author: Bertozzi, M