Search | arXiv e-print repository

On Spectral Data for $(2,2)$ Berry Connections, Difference Equations & Equivariant Quantum Cohomology

Authors: Andrea E. V. Ferrari, Daniel Zhang

Abstract: We study supersymmetric Berry connections of 2d $\mathcal{N}=(2,2)$ gauged linear sigma models (GLSMs) quantized on a circle, which are periodic monopoles, with the aim to provide a fruitful physical arena for recent mathematical constructions related to the latter. These are difference modules encoding monopole solutions via a Hitchin-Kobayashi correspondence established by Mochizuki. We demonstrate how the difference modules arise naturally by studying the ground states as the cohomology of a one-parameter family of supercharges. In particular, we show how they are related to one kind of monopole spectral data, a quantization of the Cherkis-Kapustin spectral curve, and relate them to the physics of the GLSM. By considering states generated by D-branes and leveraging the difference modules, we derive novel difference equations for brane amplitudes. We then show that in the conformal limit, these degenerate into novel difference equations for hemisphere partition functions, which are exactly calculable. When the GLSM flows to a nonlinear sigma model with Kähler target $X$, we show that the difference modules are related to the equivariant quantum cohomology of $X$.

Submitted 3 June, 2024; originally announced June 2024.

Comments: Contribution to the proceedings of GLSM@30

arXiv:2404.05465 [pdf, other]

HAMMR: HierArchical MultiModal React agents for generic VQA

Authors: Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, Jasper Uijlings

Abstract: Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model by 5.0%.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2312.00878 [pdf, other]

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne

Abstract: Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. It shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.

Submitted 14 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Code available at https://github.com/WalBouss/GEM

arXiv:2312.00357 [pdf]

A Generalizable Deep Learning System for Cardiac MRI

Authors: Rohan Shad, Cyril Zakka, Dhamanpreet Kaur, Robyn Fong, Ross Warren Filice, John Mongan, Kimberly Kalianos, Nishith Khandwala, David Eng, Matthew Leipzig, Walter Witschey, Alejandro de Feria, Victor Ferrari, Euan Ashley, Michael A. Acker, Curtis Langlotz, William Hiesinger

Abstract: Cardiac MRI allows for a comprehensive assessment of myocardial structure, function, and tissue characteristics. Here we describe a foundational vision system for cardiac MRI, capable of representing the breadth of human cardiovascular disease and health. Our deep learning model is trained via self-supervised contrastive learning, by which visual concepts in cine-sequence cardiac MRI scans are learned from the raw text of the accompanying radiology reports. We train and evaluate our model on data from four large academic clinical institutions in the United States. We additionally showcase the performance of our models on the UK BioBank, and two additional publicly available external datasets. We explore emergent zero-shot capabilities of our system, and demonstrate remarkable performance across a range of tasks, including the problem of left ventricular ejection fraction regression, and the diagnosis of 35 different conditions such as cardiac amyloidosis and hypertrophic cardiomyopathy. We show that our deep learning system is capable of not only understanding the staggering complexity of human cardiovascular disease, but can be directed towards clinical problems of interest yielding impressive, clinical grade diagnostic accuracy with a fraction of the training data typically required for such tasks.

Submitted 1 December, 2023; originally announced December 2023.

Comments: 21 page main manuscript, 4 figures. Supplementary Appendix and code will be made available on publication

ACM Class: I.2.10

arXiv:2311.08454 [pdf, other]

Berry Connections for 2d $(2,2)$ Theories, Monopole Spectral Data & (Generalised) Cohomology Theories

Authors: Andrea E. V. Ferrari, Daniel Zhang

Abstract: We study Berry connections for supersymmetric ground states of 2d $\mathcal{N}=(2,2)$ GLSMs quantised on a circle, which are generalised periodic monopoles. Periodic monopole solutions may be encoded into difference modules, as shown by Mochizuki, or into an alternative algebraic construction given in terms of vector bundles endowed with filtrations. By studying the ground states in terms of a one-parameter family of supercharges, we relate these two different kinds of spectral data to the physics of the GLSMs. From the difference modules we derive novel difference equations for brane amplitudes, which in the conformal limit yield novel difference equations for hemisphere or vortex partition functions. When the GLSM flows to a nonlinear sigma model with Kähler target $X$, we show that the two kinds of spectral data are related to different (generalised) cohomology theories: the difference modules are related to the equivariant quantum cohomology of $X$, whereas the vector bundles with filtrations are related to its equivariant K-theory.

Submitted 3 June, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: 54 pages + appendix. Some clarifications added, typos corrected, abstract streamlined

arXiv:2311.05087 [pdf, other]

Boundary vertex algebras for 3d $\mathcal{N}=4$ rank-0 SCFTs

Authors: Andrea E. V. Ferrari, Niklas Garner, Heeyeon Kim

Abstract: We initiate the study of boundary Vertex Operator Algebras (VOAs) of topologically twisted 3d $\mathcal{N}=4$ rank-0 SCFTs. This is a recently introduced class of $\mathcal{N}=4$ SCFTs that by definition have zero-dimensional Higgs and Coulomb branches. We briefly explain why it is reasonable to obtain rational VOAs at the boundary of their topological twists. When a rank-0 SCFT is realized as the IR fixed point of a $\mathcal{N}=2$ Lagrangian theory, we propose a technique for the explicit construction of its topological twists and boundary VOAs based on deformations of the holomorphic-topological twist of the $\mathcal{N}=2$ microscopic description. We apply this technique to the $B$ twist of a newly discovered family of 3d $\mathcal{N}=4$ rank-0 SCFTs ${\mathcal T}_r$ and argue that they admit the simple affine VOAs $L_r(\mathfrak{osp}(1|2))$ at their boundary. In the simplest case, this leads to a novel level-rank duality between $L_1(\mathfrak{osp}(1|2))$ and the minimal model $M(2,5)$. As an aside, we present a TQFT obtained by twisting a 3d $\mathcal{N}=2$ QFT that admits the $M(3,4)$ minimal model as a boundary VOA and briefly comment on the classical freeness of VOAs at the boundary of 3d TQFTs.

Submitted 27 June, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: minor revision

arXiv:2311.04587 [pdf, other]

Log Statements Generation via Deep Learning: Widening the Support Provided to Developers

Authors: Antonio Mastropaolo, Valentina Ferrari, Luca Pascarella, Gabriele Bavota

Abstract: Logging assists in monitoring events that transpire during the execution of software. Previous research has highlighted the challenges confronted by developers when it comes to logging, including dilemmas such as where to log, what data to record, and which log level to employ (e.g., info, fatal). In this context, we introduced LANCE, an approach rooted in deep learning (DL) that has demonstrated the ability to correctly inject a log statement into Java methods in ~15% of cases. Nevertheless, LANCE grapples with two primary constraints: (i) it presumes that a method necessitates the inclusion of logging statements; and (ii) it allows the injection of only a single (new) log statement, even in situations where the injection of multiple log statements might be essential. To address these limitations, we present LEONID, a DL-based technique that can distinguish between methods that do and do not require the inclusion of log statements. Furthermore, LEONID supports the injection of multiple log statements within a given method when necessary, and it also enhances LANCE's proficiency in generating meaningful log messages through the combination of DL and Information Retrieval (IR).

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2308.16139 [pdf, other]

MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision

Authors: Jianning Li, Zongwei Zhou, Jiancheng Yang, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Chongyu Qu, Tiezheng Zhang, Xiaoxi Chen, Wenxuan Li, Marek Wodzinski, Paul Friedrich, Kangxian Xie, Yuan **, Narmada Ambigapathy, Enrico Nasca, Naida Solak, Gian Marco Melito, Viet Duc Vu, Afaque R. Memon, Christopher Schlachta, Sandrine De Ribaupierre, Rajnikant Patel, Roy Eagleson, Xiaojun Chen, et al. (132 additional authors not shown)

Abstract: Prior to the deep learning era, shape was commonly used to describe the objects. Nowadays, state-of-the-art (SOTA) algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from numerous shape-related publications in premier vision conferences as well as the growing popularity of ShapeNet (about 51,300 models) and Princeton ModelNet (127,915 models). For the medical domain, we present a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instruments, called MedShapeNet, created to facilitate the translation of data-driven vision algorithms to medical applications and to adapt SOTA vision algorithms to medical problems. As a unique feature, we directly model the majority of shapes on the imaging data of real patients. As of today, MedShapeNet includes 23 datasets with more than 100,000 shapes that are paired with annotations (ground truth). Our data is freely accessible via a web interface and a Python application programming interface (API) and can be used for discriminative, reconstructive, and variational benchmarks as well as various applications in virtual, augmented, or mixed reality, and 3D printing. As examples, we present use cases in the fields of classification of brain tumors, facial and skull reconstructions, multi-class anatomy completion, education, and 3D printing. In the future, we will extend the data and improve the interfaces.
The project pages are: https://medshapenet.ikim.nrw/ and https://github.com/Jianningli/medshapenet-feedback

Submitted 12 December, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

Comments: 16 pages

MSC Class: 68T01

arXiv:2308.11606 [pdf, other]

StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

Authors: Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender

Abstract: Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption used often in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions. Finally, we establish guidelines for human evaluation of video stories, and reaffirm the need for better automatic metrics for video generation. StoryBench aims at encouraging future research efforts in this exciting new area.

Submitted 12 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: NeurIPS D&B 2023

arXiv:2306.09224 [pdf, other]

Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

Authors: Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari

Abstract: We propose Encyclopedic-VQA, a large-scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa .

Submitted 24 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: ICCV'23

arXiv:2306.09109 [pdf, other]

NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations

Authors: Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, André Araujo, Ricardo Martin-Brualla, Kaushal Patel, Daniel Vlasic, Vittorio Ferrari, Ameesh Makadia, Ce Liu, Yuanzhen Li, Howard Zhou

Abstract: Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where Structure-from-Motion (SfM) techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search results with varying backgrounds and illuminations. To enable systematic research progress on 3D reconstruction from casual image captures, we propose NAVI: a new dataset of category-agnostic image collections of objects with high-quality 3D scans along with per-image 2D-3D alignments providing near-perfect GT camera parameters. These 2D-3D alignments allow us to extract accurate derivative annotations such as dense pixel correspondences, depth and segmentation maps. We demonstrate the use of NAVI image collections on different problem settings and show that NAVI enables more thorough evaluations that were not possible with existing datasets. We believe NAVI is beneficial for systematic research progress on 3D reconstruction and correspondence estimation. Project page: https://navidataset.github.io

Submitted 13 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023 camera ready. Project page: https://navidataset.github.io

arXiv:2306.09077 [pdf, other]

Estimating Generic 3D Room Structures from 2D Annotations

Authors: Denys Rozumnyi, Stefan Popov, Kevis-Kokitsi Maninis, Matthias Nießner, Vittorio Ferrari

Abstract: Indoor rooms are among the most common use cases in 3D scene understanding. Current state-of-the-art methods for this task are driven by large annotated datasets. Room layouts are especially important, consisting of structural elements in 3D, such as wall, floor, and ceiling. However, they are difficult to annotate, especially on pure RGB video. We propose a novel method to produce generic 3D room layouts just from 2D segmentation masks, which are easy to annotate for humans. Based on these 2D annotations, we automatically reconstruct 3D plane equations for the structural elements and their spatial extent in the scene, and connect adjacent elements at the appropriate contact edges. We annotate and publicly release 2246 3D room layouts on the RealEstate10k dataset, containing YouTube videos. We demonstrate the high quality of these 3D layout annotations with extensive experiments.

Submitted 21 December, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: https://github.com/google-research/cad-estate Accepted at 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

arXiv:2306.09011 [pdf, other]

CAD-Estate: Large-scale CAD Model Annotation in RGB Videos

Authors: Kevis-Kokitsi Maninis, Stefan Popov, Matthias Nießner, Vittorio Ferrari

Abstract: We propose a method for annotating videos of complex multi-object scenes with a globally-consistent 3D representation of the objects. We annotate each object with a CAD model from a database, and place it in the 3D coordinate frame of the scene with a 9-DoF pose transformation. Our method is semi-automatic and works on commonly-available RGB videos, without requiring a depth sensor. Many steps are performed automatically, and the tasks performed by humans are simple, well-specified, and require only limited reasoning in 3D. This makes them feasible for crowd-sourcing and has allowed us to construct a large-scale dataset by annotating real-estate videos from YouTube. Our dataset CAD-Estate offers 101k instances of 12k unique CAD models placed in the 3D representations of 20k videos. In comparison to Scan2CAD, the largest existing dataset with CAD model annotations on real scenes, CAD-Estate has 7x more instances and 4x more unique CAD models. We showcase the benefits of pre-training a Mask2CAD model on CAD-Estate for the task of automatic 3D object reconstruction and pose estimation, demonstrating that it leads to performance improvements on the popular Scan2CAD benchmark. The dataset is available at https://github.com/google-research/cad-estate.

Submitted 14 August, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: Project page: https://github.com/google-research/cad-estate

arXiv:2304.11055 [pdf, other]

Free field realisation of boundary vertex algebras for Abelian gauge theories in three dimensions

Authors: Christopher Beem, Andrea E. V. Ferrari

Abstract: We study the boundary vertex algebras of $A$-twisted $\mathcal{N}=4$ Abelian gauge theories in three dimensions. These are identified with the BRST quotient (semi-infinite cohomology) of collections of symplectic bosons and free fermions that reflect the matter content of the corresponding gauge theory. We develop various free field realisations for these vertex algebras which we propose to interpret in terms of their localisation on their associated varieties. We derive the free field realisations by bosonising the elementary symplectic bosons and free fermions and then calculating the relevant semi-infinite cohomology, which can be done systematically. An interesting feature of our construction is that for certain preferred free field realisations, the outer automorphism symmetries of the vertex algebras in question (which are identified with the symmetries of the Coulomb branch in the infrared) are made manifest.

Submitted 21 April, 2023; originally announced April 2023.

Comments: 54 pages + appendices

arXiv:2304.06419 [pdf, other]

Tracking by 3D Model Estimation of Unknown Objects in Videos

Authors: Denys Rozumnyi, Jiri Matas, Marc Pollefeys, Vittorio Ferrari, Martin R. Oswald

Abstract: Most model-free visual object tracking methods formulate the tracking task as object location estimation given by a 2D segmentation or a bounding box in each video frame. We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation, namely the textured 3D shape and 6DoF pose in each video frame. Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames, including frames where some points are invisible. To achieve that, the estimation is driven by re-rendering the input video frames as well as possible through differentiable rendering, which has not been used for tracking before. The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose. We improve the state-of-the-art in 2D segmentation tracking on three different datasets with mostly rigid objects.

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2303.04739 [pdf, other]

Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions

Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo

Abstract: Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to compute convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA) - a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) - a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP) - an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.0x - 3.9x on Intel x86 and 3.6x - 7.2x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 9% - 25% for Intel x86 and 10% - 42% for IBM POWER10 architectures. The total convolution speedup for model inference is 12% - 27% on Intel x86 and 26% - 46% on IBM POWER10.
SConv also outperforms BLAS GEMM, when computing pointwise convolutions, in more than 83% of the 219 tested instances.

Submitted 8 March, 2023; originally announced March 2023.

Comments: 15 pages, 11 figures

arXiv:2302.12948 [pdf, other]

Agile Modeling: From Concept to Classifier in Minutes

Authors: Otilia Stretcu, Edward Vendrow, Kenji Hata, Krishnamurthy Viswanathan, Vittorio Ferrari, Sasan Tavakkol, Wenlei Zhou, Aditya Avinash, Enming Luo, Neil Gordon Alldrin, MohammadHossein Bateni, Gabriel Berger, Andrew Bunner, Chun-Ta Lu, Javier A Rey, Giulia DeSalvo, Ranjay Krishna, Ariel Fuxman

Abstract: The application of computer vision to nuanced subjective use cases is growing. While crowdsourcing has served the vision community well for most objective tasks (such as labeling a "zebra"), it now falters on tasks where there is substantial subjectivity in the concept (such as identifying "gourmet tuna"). However, empowering any user to develop a classifier for their concept is technically difficult: users are neither machine learning experts, nor have the patience to label thousands of examples. In reaction, we introduce the problem of Agile Modeling: the process of turning any subjective visual concept into a computer vision model through real-time user-in-the-loop interactions. We instantiate an Agile Modeling prototype for image classification and show through a user study (N=14) that users can create classifiers with minimal effort under 30 minutes. We compare this user driven process with the traditional crowdsourcing paradigm and find that the crowd's notion often differs from that of the user, especially as the concepts become more subjective. Finally, we scale our experiments with simulations of users training classifiers for ImageNet21k categories to further demonstrate the efficacy.

Submitted 12 May, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.11217 [pdf, other]

Connecting Vision and Language with Video Localized Narratives

Authors: Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, Vittorio Ferrari

Abstract: We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question answering tasks, and provide reference results from strong baseline models. Our annotations are available at https://google.github.io/video-localized-narratives/.

Submitted 15 March, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

Comments: Accepted at CVPR 2023

arXiv:2302.03207 [pdf]

Tunable Structural Transmissive Color in Fano-Resonant Optical Coatings Employing Phase-Change Materials

Authors: Yi-Siou Huang, Chih-Yu Lee, Medha Rath, Victoria Ferrari, Heshan Yu, Taylor J. Woehl, Jimmy Ni, Ichiro Takeuchi, Carlos Ríos

Abstract: Reversible, nonvolatile, and pronounced refractive index modulation is an unprecedented combination of properties enabled by chalcogenide phase-change materials (PCMs). This combination of properties makes PCMs a fast-growing platform for active, low-energy nanophotonics, including tunability to otherwise passive thin-film optical coatings. Here, we integrate the PCM Sb2Se3 into a novel four-layer thin-film optical coating that exploits photonic Fano resonances to achieve tunable structural colors in both reflection and transmission. We show, contrary to traditional coatings, that Fano-resonant optical coatings (FROCs) allow for achieving transmissive and reflective structures with narrowband peaks at the same resonant wavelength. Moreover, we demonstrate asymmetric optical response in reflection, where Fano resonance and narrow-band filtering are observed depending upon the light incidence side. Finally, we use a multi-objective inverse design via machine learning (ML) to provide a wide range of solution sets with optimized structures while providing information on the performance limitations of the PCM-based FROCs. Adding tunability to the newly introduced Fano-resonant optical coatings opens various applications in spectral and beam splitting, and simultaneous reflective and transmissive displays, diffractive objects, and holograms.

Submitted 6 February, 2023; originally announced February 2023.

Comments: 16 pages, 12 figures

arXiv:2301.08252 [pdf]

doi 10.1016/j.chemolab.2023.104751

Evaluation of the potential of Near Infrared Hyperspectral Imaging for monitoring the invasive brown marmorated stink bug

Authors: Veronica Ferrari, Rosalba Calvini, Bas Boom, Camilla Menozzi, Aravind Krishnaswamy Rangarajan, Lara Maistrello, Peter Offermans, Alessandro Ulrici

Abstract: The brown marmorated stink bug (BMSB), Halyomorpha halys, is an invasive insect pest of global importance that damages several crops, compromising agri-food production. Field monitoring procedures are fundamental to perform risk assessment operations, in order to promptly face crop infestations and avoid economic losses. To improve pest management, spectral cameras mounted on Unmanned Aerial Vehicles (UAVs) and other Internet of Things (IoT) devices, such as smart traps or unmanned ground vehicles, could be used as an innovative technology allowing fast, efficient and real-time monitoring of insect infestations. The present study consists of a preliminary evaluation at the laboratory level of Near Infrared Hyperspectral Imaging (NIR-HSI) as a possible technology to detect BMSB specimens on different vegetal backgrounds, overcoming the problem of BMSB mimicry. Hyperspectral images of BMSB were acquired in the 980-1660 nm range, considering different vegetal backgrounds selected to mimic a real field application scene. Classification models were obtained following two different chemometric approaches. The first approach was focused on modelling spectral information and selecting relevant spectral regions for discrimination by means of sparse-based variable selection coupled with Soft Partial Least Squares Discriminant Analysis (s-Soft PLS-DA) classification algorithm. The second approach was based on modelling spatial and spectral features contained in the hyperspectral images using Convolutional Neural Networks (CNN). Finally, to further improve BMSB detection ability, the two strategies were merged, considering only the spectral regions selected by s-Soft PLS-DA for CNN modelling.

Submitted 19 January, 2023; originally announced January 2023.

Comments: Accepted manuscript

Journal ref: Chemometrics and Intelligent Laboratory Systems, 2023, 234, 104751

arXiv:2301.02249 [pdf, other]

doi 10.21468/SciPostPhys.16.3.080

Generalized Symmetries and Anomalies of 3d N=4 SCFTs

Authors: Lakshya Bhardwaj, Mathew Bullimore, Andrea E. V. Ferrari, Sakura Schafer-Nameki

Abstract: We study generalized global symmetries and their 't Hooft anomalies in 3d N=4 superconformal field theories (SCFTs). Following some general considerations, we focus on good quiver gauge theories, comprised of balanced unitary nodes and unbalanced unitary and special unitary nodes. While the global form of the Higgs branch symmetry group may be determined from the UV Lagrangian, the global form of Coulomb branch symmetry groups and associated mixed 't Hooft anomalies are more subtle due to potential symmetry enhancement in the IR. We describe how Coulomb branch symmetry groups and their mixed 't Hooft anomalies can be deduced from the UV Lagrangian by studying center charges of various types of monopole operators, providing a concrete and unambiguous way to implement 't Hooft anomaly matching. The final expression for the symmetry group and 't Hooft anomalies has a concise form that can be easily read off from the quiver data, specifically from the positions of the unbalanced and flavor nodes with respect to the positions of the balanced nodes. We provide consistency checks by applying our method to compute symmetry groups of 3d N=4 theories corresponding to magnetic quivers of 4d Class S theories and 5d SCFTs. We are able to match these results against the flavor symmetry groups of the 4d and 5d theories computed using independent methods. Another strong consistency check is provided by comparing symmetry groups and anomalies of two theories related by 3d mirror symmetry.

Submitted 24 January, 2024; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: 79 pages, v2: Corrected an important typo reported by M. Sperling

Journal ref: SciPost Phys. 16, 080 (2024)

arXiv:2212.11920 [pdf, other]

Beyond SOT: Tracking Multiple Generic Objects at Once

Authors: Christoph Mayer, Martin Danelljan, Ming-Hsuan Yang, Vittorio Ferrari, Luc Van Gool, Alina Kuznetsova

Abstract: Generic Object Tracking (GOT) is the problem of tracking target objects, specified by bounding boxes in the first frame of a video. While the task has received much attention in the last decades, researchers have almost exclusively focused on the single object setting. Multi-object GOT benefits from a wider applicability, rendering it more attractive in real-world applications. We attribute the lack of research interest into this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows users to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation through joint tracking of multiple objects simultaneously. In addition, we propose a transformer-based GOT tracker baseline capable of joint processing of multiple objects through shared computation. Our approach achieves a 4x faster run-time in case of 10 concurrent objects compared to tracking each object independently and outperforms existing single object trackers on our new benchmark. In addition, our approach achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%. Our benchmark, code, and trained models will be made publicly available.

Submitted 25 February, 2024; v1 submitted 22 December, 2022; originally announced December 2022.

Comments: accepted by WACV'24

arXiv:2212.07393 [pdf, other]

Non-invertible Symmetries and Higher Representation Theory II

Authors: Thomas Bartsch, Mathew Bullimore, Andrea E. V. Ferrari, Jamie Pearson

Abstract: In this paper we continue our investigation of the global categorical symmetries that arise when gauging finite higher groups and their higher subgroups with discrete torsion. The motivation is to provide a common perspective on the construction of non-invertible global symmetries in higher dimensions and a precise description of the associated symmetry categories. We propose that the symmetry categories obtained by gauging higher subgroups may be defined as higher group-theoretical fusion categories, which are built from the projective higher representations of higher groups. As concrete applications we provide a unified description of the symmetry categories of gauge theories in three and four dimensions based on the Lie algebra $\mathfrak{so}(N)$, and a fully categorical description of non-invertible symmetries obtained by gauging a 1-form symmetry with a mixed 't Hooft anomaly. We also discuss the effect of discrete torsion on symmetry categories, based on a series of obstructions determined by spectral sequence arguments.

Submitted 14 July, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: 56 pages + appendix, v2: clarifications and citations added

arXiv:2210.14142 [pdf, other]

From colouring-in to pointillism: revisiting semantic segmentation supervision

Authors: Rodrigo Benenson, Vittorio Ferrari

Abstract: The prevailing paradigm for producing semantic segmentation training data relies on densely labelling each pixel of each image in the training set, akin to colouring-in books. This approach becomes a bottleneck when scaling up in the number of images, classes, and annotators. Here we propose instead a pointillist approach for semantic segmentation annotation, where only point-wise yes/no questions are answered. We explore design alternatives for such an active learning approach, measure the speed and consistency of human annotators on this task, show that this strategy enables training good segmentation models, and that it is suitable for evaluating models at test time. As concrete proof of the scalability of our method, we collected and released 22.6M point labels over 4,171 classes on the Open Images dataset. Our results make it possible to rethink the semantic segmentation pipeline of annotation, training, and evaluation from a pointillism point of view.

Submitted 17 November, 2022; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: Open Images V7 available at https://g.co/dataset/open-images

arXiv:2210.07670 [pdf, other]

Multi-View Photometric Stereo Revisited

Authors: Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, Luc Van Gool

Abstract: Multi-view photometric stereo (MVPS) is a preferred method for detailed and precise 3D acquisition of an object from images. Although popular methods for MVPS can provide outstanding results, they are often complex to execute and limited to isotropic material objects. To address such limitations, we present a simple, practical approach to MVPS, which works well for isotropic as well as other object material types such as anisotropic and glossy. The proposed approach in this paper exploits the benefit of uncertainty modeling in a deep neural network for a reliable fusion of photometric stereo (PS) and multi-view stereo (MVS) network predictions. Yet, contrary to the recently proposed state-of-the-art, we introduce neural volume rendering methodology for a trustworthy fusion of MVS and PS measurements. The advantage of introducing neural volume rendering is that it helps in the reliable modeling of objects with diverse material types, where existing MVS methods, PS methods, or both may fail. Furthermore, it allows us to work on neural 3D shape representation, which has recently shown outstanding results for many geometric processing tasks. Our suggested new loss function aims to fit the zero level set of the implicit neural function using the most certain MVS and PS network predictions coupled with weighted neural volume rendering cost. The proposed approach shows state-of-the-art results when tested extensively on several benchmark datasets.

Submitted 14 October, 2022; originally announced October 2022.

Comments: Accepted for publication at IEEE/CVF WACV 2023. Draft info: 10 pages, 5 figures, and 3 tables

arXiv:2208.05993 [pdf, other]

Non-invertible Symmetries and Higher Representation Theory I

Authors: Thomas Bartsch, Mathew Bullimore, Andrea E. V. Ferrari, Jamie Pearson

Abstract: The purpose of this paper is to investigate the global categorical symmetries that arise when gauging finite higher groups in three or more dimensions. The motivation is to provide a common perspective on constructions of non-invertible global symmetries in higher dimensions and a precise description of the associated symmetry categories. This paper focusses on gauging finite groups and split 2-groups in three dimensions. In addition to topological Wilson lines, we show that this generates a rich spectrum of topological surface defects labelled by 2-representations and explain their connection to condensation defects for Wilson lines. We derive various properties of the topological defects and show that the associated symmetry category is the fusion 2-category of 2-representations. This allows us to determine the full symmetry categories of certain gauge theories with disconnected gauge groups. A subsequent paper will examine gauging more general higher groups in higher dimensions.

Submitted 5 May, 2023; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: 55 pages + Appendices. v2: references updated

arXiv:2206.04453 [pdf, other]

The Missing Link: Finding label relations across datasets

Authors: Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Abstract: Computer vision is driven by the many datasets available for training or evaluating novel methods. However, each dataset has a different set of class labels, visual definition of classes, images following a specific distribution, annotation protocols, etc. In this paper we explore the automatic discovery of visual-semantic relations between labels across datasets. We aim to understand how instances of a certain class in a dataset relate to the instances of another class in another dataset. Are they in an identity, parent/child, overlap relation? Or is there no link between them at all? To find relations between labels across datasets, we propose methods based on language, on vision, and on their combination. We show that we can effectively discover label relations across datasets, as well as their type. We apply our method to four applications: understand label relations, identify missing aspects, increase label specificity, and predict transfer learning gains. We conclude that label relations cannot be established by looking at the names of classes alone, as they depend strongly on how each of the datasets was constructed.

Submitted 9 August, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: ECCV 2022

arXiv:2205.15330 [pdf, other]

doi 10.21468/SciPostPhys.16.3.087

Anomalies of Generalized Symmetries from Solitonic Defects

Authors: Lakshya Bhardwaj, Mathew Bullimore, Andrea E. V. Ferrari, Sakura Schafer-Nameki

Abstract: We propose the general idea that 't Hooft anomalies of generalized global symmetries can be understood in terms of the properties of solitonic defects, which generically are non-topological defects. The defining property of such defects is that they act as sources for background fields of generalized symmetries. 't Hooft anomalies arise when solitonic defects are charged under these generalized symmetries. We illustrate this idea for several kinds of anomalies in various spacetime dimensions. A systematic exploration is performed in 3d for 0-form, 1-form, and 2-group symmetries, whose 't Hooft anomalies are related to two special types of solitonic defects, namely vortex line defects and monopole operators. This analysis is supplemented with detailed computations of such anomalies in a large class of 3d gauge theories. Central to this computation is the determination of the gauge and 0-form charges of a variety of monopole operators: these involve standard gauge monopole operators, but also fractional gauge monopole operators, as well as monopole operators for 0-form symmetries. The charges of these monopole operators mainly receive contributions from Chern-Simons terms and fermions in the matter content. Along the way, we interpret the vanishing of the global gauge and ABJ anomalies, which are anomalies not captured by local anomaly polynomials, as the requirement that gauge monopole operators and mixed monopole operators for 0-form and gauge symmetries have non-fractional integer charges.

Submitted 26 January, 2024; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: 85 pages

Journal ref: SciPost Phys. 16, 087 (2024)

arXiv:2205.06216 [pdf, ps, other]

doi 10.21468/SciPostPhys.14.4.063

Supersymmetric ground states of 3d $\mathcal{N}=4$ SUSY gauge theories and Heisenberg Algebras

Authors: Andrea E. V. Ferrari

Abstract: We consider 3d $\mathcal{N} = 4$ theories on the geometry $\Sigma\times\mathbb{R}$, where $\Sigma$ is a closed and connected Riemann surface, from the point of view of a quantum mechanics on $\mathbb{R}$. Focussing on the elementary mirror pair in the presence of real deformation parameters, namely SQED with one hypermultiplet (SQED[1]) and the free hypermultiplet, we study the algebras of local operators in the respective quantum mechanics as well as their action on the vector space of supersymmetric ground states. We demonstrate that the algebras can be described in terms of Heisenberg algebras, and that they act in a way reminiscent of Segal-Bargmann (B-twist of the free hypermultiplet) and Nakajima (A-twist of SQED[1]) operators.

Submitted 21 April, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: SciPost version. Minor notational change in the appendix, typos corrected

Journal ref: SciPost Phys. 14, 063 (2023)

arXiv:2204.01403 [pdf, other]

How stable are Transferability Metrics evaluations?

Authors: Andrea Agostinelli, Michal Pándy, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Abstract: Transferability metrics is a maturing field with increasing interest, which aims at providing heuristics for selecting the most suitable source models to transfer to a given target dataset, without fine-tuning them all. However, existing works rely on custom experimental setups which differ across papers, leading to inconsistent conclusions about which transferability metrics work best. In this paper we conduct a large-scale study by systematically constructing a broad range of 715k experimental setup variations. We discover that even small variations to an experimental setup lead to different conclusions about the superiority of a transferability metric over another. Then we propose better evaluations by aggregating across many experiments, making it possible to reach more stable conclusions. As a result, we reveal the superiority of LogME at selecting good source datasets to transfer from in a semantic segmentation scenario, NLEEP at selecting good source architectures in an image classification scenario, and GBC at determining which target task benefits most from a given source model. Yet, no single transferability metric works best in all scenarios.

Submitted 20 October, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

Comments: ECCV 2022

arXiv:2203.13296 [pdf, other]

RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers

Authors: Michał J. Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari

Abstract: We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo with RGB-D CAD alignment. We plan to release our source code.

Submitted 26 August, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: ECCV 2022 camera ready

arXiv:2202.13071 [pdf, other]

Uncertainty-Aware Deep Multi-View Photometric Stereo

Authors: Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, Luc Van Gool

Abstract: This paper presents a simple and effective solution to the longstanding classical multi-view photometric stereo (MVPS) problem. It is well-known that photometric stereo (PS) is excellent at recovering high-frequency surface details, whereas multi-view stereo (MVS) can help remove the low-frequency distortion due to PS and retain the global geometry of the shape. This paper proposes an approach that can effectively utilize such complementary strengths of PS and MVS. Our key idea is to combine them suitably while considering the per-pixel uncertainty of their estimates. To this end, we estimate per-pixel surface normals and depth using an uncertainty-aware deep-PS network and deep-MVS network, respectively. Uncertainty modeling helps select reliable surface normal and depth estimates at each pixel which then act as a true representative of the dense surface geometry. At each pixel, our approach either selects or discards deep-PS and deep-MVS network prediction depending on the prediction uncertainty measure. For dense, detailed, and precise inference of the object's surface profile, we propose to learn the implicit neural shape representation via a multilayer perceptron (MLP). Our approach encourages the MLP to converge to a natural zero-level set surface using the confident prediction from deep-PS and deep-MVS networks, providing superior dense surface reconstruction. Extensive experiments on the DiLiGenT-MV benchmark dataset show that our method provides high-quality shape recovery with a much lower memory footprint while outperforming almost all of the existing approaches.

Submitted 28 March, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

Comments: Accepted for publication in IEEE/CVF CVPR 2022. (11 Pages, 6 Figures, 3 Tables)

arXiv:2111.14643 [pdf, other]

Urban Radiance Fields

Authors: Konstantinos Rematas, Andrew Liu, Pratul P. Srinivasan, Jonathan T. Barron, Andrea Tagliasacchi, Thomas Funkhouser, Vittorio Ferrari

Abstract: The goal of this work is to perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence of posed RGB images and lidar sweeps acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise densities on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher quality novel views in comparison to both traditional methods (e.g. COLMAP) and recent neural representations (e.g. Mip-NeRF).

Submitted 29 November, 2021; originally announced November 2021.

Comments: Project: https://urban-radiance-fields.github.io/

arXiv:2111.14465 [pdf, other]

Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos

Authors: Denys Rozumnyi, Martin R. Oswald, Vittorio Ferrari, Marc Pollefeys

Abstract: We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video. To this end, we model the blurred appearance of a fast moving object in a generative fashion by parametrizing its 3D position, rotation, velocity, acceleration, bounces, shape, and texture over the duration of a predefined time window spanning multiple frames. Using differentiable rendering, we are able to estimate all parameters by minimizing the pixel-wise reprojection error to the input video via backpropagating through a rendering pipeline that accounts for motion blur by averaging the graphics output over short time intervals. For that purpose, we also estimate the camera exposure gap time within the same optimization. To account for abrupt motion changes like bounces, we model the motion trajectory as a piece-wise polynomial, and we are able to estimate the specific time of the bounce at sub-frame accuracy. Experiments on established benchmark datasets demonstrate that our method outperforms previous methods for fast moving object deblurring and 3D reconstruction.

Submitted 7 April, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: CVPR 2022 camera-ready

Journal ref: 2022 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

arXiv:2111.13011 [pdf, other]

Transferability Metrics for Selecting Source Model Ensembles

Authors: Andrea Agostinelli, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Abstract: We address the problem of ensemble selection in transfer learning: Given a large pool of source models we want to select an ensemble of models which, after fine-tuning on the target training set, yields the best performance on the target test set. Since fine-tuning all possible ensembles is computationally prohibitive, we aim at predicting performance on the target dataset using a computationally efficient transferability metric. We propose several new transferability metrics designed for this task and evaluate them in a challenging and realistic transfer learning setup for semantic segmentation: we create a large and diverse pool of source models by considering 17 source datasets covering a wide variety of image domains, two different architectures, and two pre-training schemes. Given this pool, we then automatically select a subset to form an ensemble performing well on a given target dataset. We compare the ensemble selected by our method to two baselines which select a single source model, either (1) from the same pool as our method; or (2) from a pool containing large source models, each with similar capacity as an ensemble. Averaged over 17 target datasets, we outperform these baselines by 6.0% and 2.5% relative mean IoU, respectively.

Submitted 31 March, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

arXiv:2111.12780 [pdf, other]

Transferability Estimation using Bhattacharyya Class Separability

Authors: Michal Pándy, Andrea Agostinelli, Jasper Uijlings, Vittorio Ferrari, Thomas Mensink

Abstract: Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can be easily adapted. In this work, we propose Gaussian Bhattacharyya Coefficient (GBC), a novel method for quantifying transferability between a source model and a target dataset. In a first step we embed all target images in the feature space defined by the source model, and represent them with per-class Gaussians. Then, we estimate their pairwise class separability using the Bhattacharyya coefficient, yielding a simple and effective measure of how well the source model transfers to the target task. We evaluate GBC on image classification tasks in the context of dataset and architecture selection. Further, we also perform experiments on the more complex semantic segmentation transferability estimation task. We demonstrate that GBC outperforms state-of-the-art transferability metrics on most evaluation criteria in the semantic segmentation settings, matches the performance of top methods for dataset transferability in image classification, and performs best on architecture selection problems for image classification.

Submitted 11 April, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Accepted for CVPR 2022

arXiv:2110.05621 [pdf, other]

Neural Architecture Search for Efficient Uncalibrated Deep Photometric Stereo

Authors: Francesco Sarno, Suryansh Kumar, Berk Kaya, Zhiwu Huang, Vittorio Ferrari, Luc Van Gool

Abstract: We present an automated machine learning approach for uncalibrated photometric stereo (PS). Our work aims at discovering lightweight and computationally efficient PS neural networks with excellent surface normal accuracy. Unlike previous uncalibrated deep PS networks, which are handcrafted and carefully tuned, we leverage a differentiable neural architecture search (NAS) strategy to find an uncalibrated PS architecture automatically. We begin by defining a discrete search space for a light calibration network and a normal estimation network, respectively. We then perform a continuous relaxation of this search space and present a gradient-based optimization strategy to find an efficient light calibration and normal estimation network. Directly applying the NAS methodology to uncalibrated PS is not straightforward as certain task-specific constraints must be satisfied, which we impose explicitly. Moreover, we search for and train the two networks separately to account for the Generalized Bas-Relief (GBR) ambiguity. Extensive experiments on the DiLiGenT dataset show that the performance of the automatically searched neural architectures compares favorably with the state-of-the-art uncalibrated PS methods while having a lower memory footprint.

Submitted 11 October, 2021; originally announced October 2021.

Comments: Accepted for publication at IEEE/CVF, WACV 2022. (11 pages)

arXiv:2110.05594 [pdf, other]

Neural Radiance Fields Approach to Deep Multi-View Photometric Stereo

Authors: Berk Kaya, Suryansh Kumar, Francesco Sarno, Vittorio Ferrari, Luc Van Gool

Abstract: We present a modern solution to the multi-view photometric stereo problem (MVPS). Our work suitably exploits the image formation model in an MVPS experimental setup to recover the dense 3D reconstruction of an object from images. We procure the surface orientation using a photometric stereo (PS) image formation model and blend it with a multi-view neural radiance field representation to recover the object's surface geometry. Contrary to the previous multi-staged framework to MVPS, where the position, iso-depth contours, or orientation measurements are estimated independently and then fused later, our method is simple to implement and realize. Our method performs neural rendering of multi-view images while utilizing surface normals estimated by a deep photometric stereo network. We render the MVPS images by considering the object's surface normals for each 3D sample point along the viewing direction rather than explicitly using the density gradient in the volume space via 3D occupancy information. We optimize the proposed neural radiance field representation for the MVPS setup efficiently using a fully connected deep network to recover the 3D geometry of an object. Extensive evaluation on the DiLiGenT-MV benchmark dataset shows that our method performs better than the approaches that perform only PS or only multi-view stereo (MVS) and provides comparable results against the state-of-the-art multi-stage fusion methods.

Submitted 11 October, 2021; originally announced October 2021.

Comments: Accepted for publication at IEEE/CVF WACV 2022. 18 pages

arXiv:2106.08762 [pdf, other]

Shape from Blur: Recovering Textured 3D Shape and Motion of Fast Moving Objects

Authors: Denys Rozumnyi, Martin R. Oswald, Vittorio Ferrari, Marc Pollefeys

Abstract: We address the novel task of jointly reconstructing the 3D shape, texture, and motion of an object from a single motion-blurred image. While previous approaches address the deblurring problem only in the 2D image domain, our proposed rigorous modeling of all object properties in the 3D domain enables the correct description of arbitrary object motion. This leads to significantly better image decomposition and sharper deblurring results. We model the observed appearance of a motion-blurred object as a combination of the background and a 3D object with constant translation and rotation. Our method minimizes a loss on reconstructing the input image via differentiable rendering with suitable regularizers. This enables estimating the textured 3D mesh of the blurred object with high fidelity. Our method substantially outperforms competing approaches on several benchmarks for fast moving object deblurring. Qualitative results show that the reconstructed 3D mesh generates high-quality temporal super-resolution and novel views of the deblurred object.

Submitted 26 October, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: Accepted to 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2105.08783 [pdf, other]

doi 10.21468/SciPostPhys.12.2.072

Supersymmetric Ground States of 3d $\mathcal{N}=4$ Gauge Theories on a Riemann Surface

Authors: Mathew Bullimore, Andrea E. V. Ferrari, Heeyeon Kim

Abstract: This paper studies supersymmetric ground states of 3d $\mathcal{N}=4$ supersymmetric gauge theories on a Riemann surface of genus $g$. There are two distinct spaces of supersymmetric ground states arising from the $A$ and $B$ type twists on the Riemann surface, which lead to effective supersymmetric quantum mechanics with four supercharges and supermultiplets of type $\mathcal{N}=(2,2)$ and $\mathcal{N}=(0,4)$ respectively. We compute the space of supersymmetric ground states in each case, graded by flavour and R-symmetries and in different chambers for real mass and FI parameters, for a large class of supersymmetric gauge theories. The results are formulated geometrically in terms of the Higgs branch geometry. We perform extensive checks of compatibility with the twisted index and mirror symmetry.

Submitted 23 November, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: 50 pages, SciPost version

Journal ref: SciPost Phys. 12, 072 (2022)

arXiv:2105.02317 [pdf, other]

doi 10.1145/3461702.3462594

A Step Toward More Inclusive People Annotations for Fairness

Authors: Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, Caroline Pantofaru

Abstract: The Open Images Dataset contains approximately 9 million images and is a widely accepted dataset for computer vision research. As is common practice for large datasets, the annotations are not exhaustive, with bounding boxes and attribute labels for only a subset of the classes in each image. In this paper, we present a new set of annotations on a subset of the Open Images dataset called the MIAP (More Inclusive Annotations for People) subset, containing bounding boxes and attributes for all of the people visible in those images. The attributes and labeling methodology for the MIAP subset were designed to enable research into model fairness. In addition, we analyze the original annotation methodology for the person class and its subclasses, discussing the resulting patterns in order to inform future annotation efforts. By considering both the original and exhaustive annotation sets, researchers can also now study how systematic patterns in training annotations affect modeling.

Submitted 5 May, 2021; originally announced May 2021.

Journal ref: AIES (2021)

arXiv:2103.13318 [pdf, other]

Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types

Authors: Thomas Mensink, Jasper Uijlings, Alina Kuznetsova, Michael Gygli, Vittorio Ferrari

Abstract: Transfer learning enables the re-use of knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should include the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types.

Submitted 20 November, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: Accepted for future publication in TPAMI

arXiv:2102.08860 [pdf, other]

ShaRF: Shape-conditioned Radiance Fields from a Single View

Authors: Konstantinos Rematas, Ricardo Martin-Brualla, Vittorio Ferrari

Abstract: We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views in a geometrically consistent manner and they represent faithfully the input object. Additionally, our method is able to generalize to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate in several experiments the effectiveness of our approach in both synthetic and real images.

Submitted 23 June, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

Comments: Project page: http://www.krematas.com/sharf/index.html

arXiv:2102.04980 [pdf, other]

Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval

Authors: Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut

Abstract: Most existing image retrieval systems use text queries as a way for the user to express what they are looking for. However, fine-grained image retrieval often requires the ability to also express where in the image the content they are looking for is. The text modality can only cumbersomely express such localization preferences, whereas pointing is a more natural fit. In this paper, we propose an image retrieval setup with a new form of multimodal queries, where the user simultaneously uses both spoken natural language (the what) and mouse traces over an empty canvas (the where) to express the characteristics of the desired target image. We then describe simple modifications to an existing image retrieval model, enabling it to operate in this setup. Qualitative and quantitative experiments show that our model effectively takes this spatial guidance into account, and provides significantly more accurate retrieval results compared to text-only equivalent systems.

Submitted 24 August, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: IEEE/CVF International Conference on Computer Vision (ICCV 2021)

arXiv:2012.12554 [pdf, other]

Efficient video annotation with visual interpolation and frame selection guidance

Authors: A. Kuznetsova, A. Talati, Y. Luo, K. Simmons, V. Ferrari

Abstract: We introduce a unified framework for generic video annotation with bounding boxes. Video annotation is a longstanding problem, as it is a tedious and time-consuming process. We tackle two important challenges of video annotation: (1) automatic temporal interpolation and extrapolation of bounding boxes provided by a human annotator on a subset of all frames, and (2) automatic selection of frames to annotate manually. Our contribution is two-fold: first, we propose a model that has both interpolating and extrapolating capabilities; second, we propose a guiding mechanism that sequentially generates suggestions for what frame to annotate next, based on the annotations made previously. We extensively evaluate our approach on several challenging datasets in simulation and demonstrate a reduction in terms of the number of manual bounding boxes drawn by 60% over linear interpolation and by 35% over an off-the-shelf tracker. Moreover, we also show 10% annotation time improvement over a state-of-the-art method for video annotation with bounding boxes [25]. Finally, we run human annotation experiments and provide extensive analysis of the results, showing that our approach reduces actual measured annotation time by 50% compared to commonly used linear interpolation.

Submitted 23 December, 2020; originally announced December 2020.

Comments: accepted to WACV 2021

arXiv:2012.11575 [pdf, other]

From Points to Multi-Object 3D Reconstruction

Authors: Francis Engelmann, Konstantinos Rematas, Bastian Leibe, Vittorio Ferrari

Abstract: We propose a method to detect and reconstruct multiple 3D objects from a single RGB image. The key idea is to optimize for detection, alignment and shape jointly over all objects in the RGB image, while focusing on realistic and physically plausible reconstructions. To this end, we propose a keypoint detector that localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes -- all in a single forward pass. The proposed method formulates 3D shape reconstruction as a shape selection problem, i.e. it selects among exemplar shapes from a given database. This makes it agnostic to shape representations, which enables a lightweight reconstruction of realistic and visually-pleasing shapes based on CAD-models, while the training objective is formulated around point clouds and voxel representations. A collision-loss promotes non-intersecting objects, further increasing the reconstruction realism. Given the RGB image, the presented approach performs lightweight reconstruction in a single stage; it is real-time capable, fully differentiable and end-to-end trainable. Our experiments compare multiple approaches for 9-DoF bounding box estimation, evaluate the novel shape-selection mechanism and compare to recent methods in terms of 3D bounding box estimation and 3D shape reconstruction quality.

Submitted 21 June, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

Comments: CVPR2021 - Project Page: https://francisengelmann.github.io/points2objects/

arXiv:2012.06777 [pdf, other]

Uncalibrated Neural Inverse Rendering for Photometric Stereo of General Surfaces

Authors: Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, Luc Van Gool

Abstract: This paper presents an uncalibrated deep neural network framework for the photometric stereo problem. For training models to solve the problem, existing neural network-based methods either require exact light directions or ground-truth surface normals of the object or both. However, in practice, it is challenging to procure both pieces of information precisely, which restricts the broader adoption of photometric stereo algorithms for vision applications. To bypass this difficulty, we propose an uncalibrated neural inverse rendering approach to this problem. Our method first estimates the light directions from the input images and then optimizes an image reconstruction loss to calculate the surface normals, bidirectional reflectance distribution function value, and depth. Additionally, our formulation explicitly models the concave and convex parts of a complex surface to consider the effects of interreflections in the image formation process. Extensive evaluation of the proposed method on the challenging subjects generally shows comparable or better results than the supervised and classical approaches.

Submitted 17 April, 2021; v1 submitted 12 December, 2020; originally announced December 2020.

Comments: Accepted for publication at CVPR 2021. Document info: 18 pages, 21 Figures, 5 tables. (Minor typo corrected)

arXiv:2012.04641 [pdf, other]

Vid2CAD: CAD Model Alignment using Multi-View Constraints from Videos

Authors: Kevis-Kokitsi Maninis, Stefan Popov, Matthias Nießner, Vittorio Ferrari

Abstract: We address the task of aligning CAD models to a video sequence of a complex scene containing multiple objects. Our method can process arbitrary videos and fully automatically recover the 9 DoF pose for each object appearing in it, thus aligning them in a common 3D coordinate frame. The core idea of our method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation. This integration process resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters. By leveraging multi-view constraints, our method also resolves occlusions and handles objects that are out of view in individual frames, thus reconstructing all objects into a single globally consistent CAD representation of the scene. In comparison to the state-of-the-art single-frame method Mask2CAD that we build on, we achieve substantial improvements on the Scan2CAD dataset (from 11.6% to 30.7% class average accuracy).

Submitted 25 January, 2022; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: T-PAMI 2022 | Video: https://www.youtube.com/watch?v=R1cXg0vpwe4 | Project page: https://www.kmaninis.com/vid2cad/

arXiv:2012.00595 [pdf, other]

doi 10.1109/CVPR46437.2021.00346

DeFMO: Deblurring and Shape Recovery of Fast Moving Objects

Authors: Denys Rozumnyi, Martin R. Oswald, Vittorio Ferrari, Jiri Matas, Marc Pollefeys

Abstract: Objects moving at high speed appear significantly blurred when captured with cameras. The blurry appearance is especially ambiguous when the object has complex shape or texture. In such cases, classical methods, or even humans, are unable to recover the object's appearance and motion. We propose a method that, given a single image with its estimated background, outputs the object's appearance and position in a series of sub-frames as if captured by a high-speed camera (i.e. temporal super-resolution). The proposed generative model embeds an image of the blurred object into a latent space representation, disentangles the background, and renders the sharp appearance. Inspired by the image formation model, we design novel self-supervised loss function terms that boost performance and show good generalization capabilities. The proposed DeFMO method is trained on a complex synthetic dataset, yet it performs well on real-world data from several datasets. DeFMO outperforms the state of the art and generates high-quality temporal super-resolution frames.

Submitted 30 March, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

Comments: CVPR 2021 camera-ready

Journal ref: 2021 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

arXiv:2007.11603 [pdf, ps, other]

The Twisted Index and Topological Saddles

Authors: Mathew Bullimore, Andrea E. V. Ferrari, Heeyeon Kim, Guangyu Xu

Abstract: The twisted index of 3d $\mathcal{N}=2$ gauge theories on $S^1 \times \Sigma$ has an algebro-geometric interpretation as the Witten index of an effective supersymmetric quantum mechanics. In this paper, we consider the contributions to the supersymmetric quantum mechanics from topological saddle points in supersymmetric localisation of abelian gauge theories. Topological saddles are configurations where the matter fields vanish and the gauge symmetry is unbroken, which exist for non-vanishing effective Chern-Simons levels. We compute the contributions to the twisted index from both topological and vortex-like saddle points and show that their combination recovers the Jeffrey-Kirwan residue prescription for the twisted index and its wall-crossing.

Submitted 22 June, 2023; v1 submitted 22 July, 2020; originally announced July 2020.

Comments: 28 pages, typos corrected to align with published 2022 version

Showing 1–50 of 203 results for author: Ferrari, V