
Is CLIP the main roadblock for fine-grained open-world perception?

Authors: Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Fabrizio Falchi

Abstract: Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This necessity is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which require the ability to respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time - a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies highlighted limitations on the fine-grained recognition capabilities in open-vocabulary settings - i.e., for distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. Therefore, we try to understand whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time due, for example, to the unsuitability of the cosine similarity matching function, which may discard important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards the development of backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
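
The cosine similarity matching mentioned in this abstract, and the idea of re-projecting embeddings before matching, can be illustrated with a minimal sketch. This is not the authors' code: it uses toy NumPy vectors in place of real CLIP embeddings, and the function names and the projection matrices `W_img` and `W_txt` are hypothetical placeholders for whatever learned re-projection is used.

```python
import numpy as np

def cosine_match(image_emb, text_embs):
    """Return the index of the best-matching text and all cosine scores."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = txt @ img  # one cosine similarity per candidate text
    return int(np.argmax(scores)), scores

def reprojected_match(image_emb, text_embs, W_img, W_txt):
    """Apply (hypothetical) linear re-projections, then match by cosine."""
    return cosine_match(W_img @ image_emb, text_embs @ W_txt.T)

# Toy example: one image embedding, two candidate text embeddings.
image_emb = np.array([1.0, 0.0])
text_embs = np.array([[0.0, 1.0], [2.0, 0.1]])
best, scores = cosine_match(image_emb, text_embs)
```

With identity matrices for `W_img` and `W_txt`, `reprojected_match` reduces to plain cosine matching; any gain in separating fine-grained concepts would have to come from how the projections are learned, which this sketch does not cover.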

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2311.17518 [pdf, other]

The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding

Authors: Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, Fabrizio Falchi

Abstract: Recent advancements in large vision-language models enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty that probes different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD/.

Submitted 5 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: Accepted as Highlight at CVPR2024

arXiv:2304.14942 [pdf, other]

The Emotions of the Crowd: Learning Image Sentiment from Tweets via Cross-modal Distillation

Authors: Alessio Serra, Fabio Carrara, Maurizio Tesconi, Fabrizio Falchi

Abstract: Trends and opinion mining in social media increasingly focus on novel interactions involving visual media, like images and short videos, in addition to text. In this work, we tackle the problem of visual sentiment analysis of social media images -- specifically, the prediction of image sentiment polarity. While previous work relied on manually labeled training sets, we propose an automated approach for building sentiment polarity classifiers based on a cross-modal distillation paradigm; starting from scraped multimodal (text + images) data, we train a student model on the visual modality based on the outputs of a textual teacher model that analyses the sentiment of the corresponding textual modality. We applied our method to randomly collected images crawled from Twitter over three months and produced, after automatic cleaning, a weakly-labeled dataset of $\sim$1.5 million images. Despite exploiting noisy labeled samples, our training pipeline produces classifiers showing strong generalization capabilities and outperforming the current state of the art on five manually labeled benchmarks for image sentiment polarity prediction.

Submitted 28 April, 2023; originally announced April 2023.

arXiv:2211.10351 [pdf]

Deep learning for structural health monitoring: An application to heritage structures

Authors: Fabio Carrara, Fabrizio Falchi, Maria Girardi, Nicola Messina, Cristina Padovani, Daniele Pellegrini

Abstract: Thanks to recent advancements in numerical methods, computer power, and monitoring technology, seismic ambient noise provides precious information about the structural behavior of old buildings. The measurement of the vibrations produced by anthropic and environmental sources and their use for dynamic identification and structural health monitoring of buildings initiated an emerging, cross-disciplinary field engaging seismologists, engineers, mathematicians, and computer scientists. In this work, we employ recent deep learning techniques for time-series forecasting to inspect and detect anomalies in the large dataset recorded during a long-term monitoring campaign conducted on the San Frediano bell tower in Lucca. We frame the problem as an unsupervised anomaly detection task and train a Temporal Fusion Transformer to learn the normal dynamics of the structure. We then detect the anomalies by looking at the differences between the predicted and observed frequencies.
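
The final step of this abstract, flagging anomalies from the gap between predicted and observed frequencies, can be sketched as a simple residual threshold. This is an illustrative assumption rather than the paper's procedure: the forecasts are taken as given (no Temporal Fusion Transformer is trained here), and the mean-plus-`k`-standard-deviations rule, the function name, and the sample values are all hypothetical.

```python
import numpy as np

def detect_anomalies(observed, predicted, calib_residuals, k=3.0):
    """Flag samples whose absolute forecast error exceeds a threshold.

    The threshold is mean + k * std of residuals collected on data
    assumed to be normal (a hypothetical calibration rule).
    """
    residuals = np.abs(np.asarray(observed) - np.asarray(predicted))
    threshold = np.mean(calib_residuals) + k * np.std(calib_residuals)
    return residuals > threshold, threshold

# Toy frequencies in Hz (made-up numbers, not from the monitoring dataset).
calib = np.array([0.01, 0.02, 0.01, 0.02])
flags, thr = detect_anomalies([1.10, 1.11, 1.30], [1.10, 1.10, 1.10], calib)
```

Here only the third sample is flagged, because its residual of about 0.20 Hz is far above the calibrated threshold of about 0.03 Hz.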

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2111.14576 [pdf, other]

Recurrent Vision Transformer for Solving Visual Reasoning Problems

Authors: Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, Fabrizio Falchi

Abstract: Although convolutional neural networks (CNNs) showed remarkable results in many vision tasks, they are still strained by simple yet challenging visual reasoning problems. Inspired by the recent success of the Transformer network in computer vision, in this paper, we introduce the Recurrent Vision Transformer (RViT) model. Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT dataset. The weight-sharing both in spatial and depth dimensions regularizes the model, allowing it to learn using far fewer free parameters, using only 28k training samples. A comprehensive ablation study confirms the importance of a hybrid CNN + Transformer architecture and the role of the feedback connections, which iteratively refine the internal representation until a stable prediction is obtained. In the end, this study can lay the basis for a deeper understanding of the role of attention and recurrent connections for solving visual abstract reasoning tasks.

Submitted 29 November, 2021; originally announced November 2021.

arXiv:2106.02842 [pdf, other]

Multi-Camera Vehicle Counting Using Edge-AI

Authors: Luca Ciampi, Claudio Gennaro, Fabio Carrara, Fabrizio Falchi, Claudio Vairo, Giuseppe Amato

Abstract: This paper presents a novel solution to automatically count vehicles in a parking lot using images captured by smart cameras. Unlike most of the literature on this task, which focuses on the analysis of single images, this paper proposes the use of multiple visual sources to monitor a wider parking area from different perspectives. The proposed multi-camera system is capable of automatically estimating the number of cars present in the entire parking lot directly on board the edge devices. It comprises an on-device deep learning-based detector that locates and counts the vehicles from the captured images and a decentralized geometric-based approach that can analyze the inter-camera shared areas and merge the data acquired by all the devices. We conduct the experimental evaluation on an extended version of the CNRPark-EXT dataset, a collection of images taken from the parking lot on the campus of the National Research Council (CNR) in Pisa, Italy. We show that our system is robust and takes advantage of the redundant information deriving from the different cameras, improving the overall performance without requiring any extra geometrical information of the monitored scene.

Submitted 5 June, 2021; originally announced June 2021.

Comments: Submitted to Expert Systems With Applications

arXiv:2101.09129 [pdf, other]

doi 10.1016/j.patrec.2020.12.019

Solving the Same-Different Task with Convolutional Neural Networks

Authors: Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, Fabrizio Falchi

Abstract: Deep learning methods have demonstrated major abilities in solving many different real-world problems in the computer vision literature. However, they are still strained by simple reasoning tasks that humans consider easy to solve. In this work, we probe current state-of-the-art convolutional neural networks on a difficult set of tasks known as the same-different problems. All the problems require the same prerequisite to be solved correctly: understanding if two random shapes inside the same image are the same or not. With the experiments carried out in this work, we demonstrate that residual connections, and more generally the skip connections, seem to have only a marginal impact on the learning of the proposed problems. In particular, we experiment with DenseNets, and we examine the contribution of residual and recurrent connections in already tested architectures, ResNet-18, and CorNet-S respectively. Our experiments show that older feed-forward networks, AlexNet and VGG, are almost unable to learn the proposed problems, except in some specific scenarios. We show that recently introduced architectures can converge even in the cases where the important parts of their architecture are removed. We finally carry out some zero-shot generalization tests, and we discover that in these scenarios residual and recurrent connections can have a stronger impact on the overall test accuracy. On four difficult problems from the SVRT dataset, we can reach state-of-the-art results with respect to the previous approaches, obtaining super-human performances on three of the four problems.

Submitted 22 January, 2021; originally announced January 2021.

Comments: Preprint of the paper published in Pattern Recognition Letters (Elsevier)

Journal ref: Pattern Recognition Letters, Volume 143, March 2021, Pages 75-80

arXiv:2011.08102 [pdf, other]

Combining GANs and AutoEncoders for Efficient Anomaly Detection

Authors: Fabio Carrara, Giuseppe Amato, Luca Brombin, Fabrizio Falchi, Claudio Gennaro

Abstract: In this work, we propose CBiGAN -- a novel method for anomaly detection in images, where a consistency constraint is introduced as a regularization term in both the encoder and decoder of a BiGAN. Our model exhibits fairly good modeling power and reconstruction consistency capability. We evaluate the proposed method on MVTec AD -- a real-world benchmark for unsupervised anomaly detection on high-resolution images -- and compare against standard baselines and state-of-the-art approaches. Experiments show that the proposed method improves the performance of BiGAN formulations by a large margin and performs comparably to expensive state-of-the-art iterative methods while reducing the computational cost. We also observe that our model is particularly effective in texture-type anomaly detection, as it sets a new state of the art in this category. Our code is available at https://github.com/fabiocarrara/cbigan-ad/.

Submitted 26 November, 2020; v1 submitted 16 November, 2020; originally announced November 2020.

Comments: 8 pages, 5 figures, 3 tables, pre-print, to be published in the proceedings of the 25th International Conference on Pattern Recognition (ICPR2020)

arXiv:2008.02749 [pdf, other]

The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval

Authors: Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Franca Debole, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo, Claudio Vairo

Abstract: In this paper, we describe in detail VISIONE, a video search system that allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined to express complex queries and satisfy user needs. The peculiarity of our approach is that we encode all the information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) have to be merged. In addition, we report an extensive analysis of the system retrieval performance, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies among the ones that we tested.

Submitted 18 March, 2021; v1 submitted 6 August, 2020; originally announced August 2020.

Comments: 22 pages, 12 figures

arXiv:2007.06475 [pdf, other]

Automatic Pass Annotation from Soccer VideoStreams Based on Object Detection and LSTM

Authors: Danilo Sorano, Fabio Carrara, Paolo Cintia, Fabrizio Falchi, Luca Pappalardo

Abstract: Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of data that describe all the spatio-temporal events that occur in each match. These events (e.g., passes, shots, fouls) are collected by human operators manually, constituting a considerable cost for data providers in terms of time and economic resources. In this paper, we describe PassNet, a method to recognize the most frequent events in soccer, i.e., passes, from video streams. Our model combines a set of artificial neural networks that perform feature extraction from video streams, object detection to identify the positions of the ball and the players, and classification of frame sequences as passes or not passes. We test PassNet on different scenarios, depending on the similarity of conditions to the match used for training. Our results show good classification results and significant improvement in the accuracy of pass detection with respect to baseline classifiers, even when the match's video conditions of the test and training sets are considerably different. PassNet is the first step towards an automated event annotation system that may cut the time and cost of event annotation, enabling data collections for minor and non-professional divisions, youth leagues and, in general, competitions whose matches are not currently annotated by data providers.

Submitted 13 July, 2020; originally announced July 2020.

arXiv:1912.02918 [pdf, other]

doi 10.1016/j.cviu.2020.103103

Detection of Face Recognition Adversarial Attacks

Authors: Fabio Valerio Massoli, Fabio Carrara, Giuseppe Amato, Fabrizio Falchi

Abstract: Deep Learning methods have become state-of-the-art for solving tasks such as Face Recognition (FR). Unfortunately, despite their success, it has been pointed out that these learning models are exposed to adversarial inputs - images to which an amount of noise imperceptible to humans is added to maliciously fool a neural network - thus limiting their adoption in real-world applications. While it is true that an enormous effort has been spent in order to train robust models against this type of threat, adversarial detection techniques have recently started to draw attention within the scientific community. A detection approach has the advantage that it does not require re-training any model, thus it can be added on top of any system. In this context, we present our work on adversarial samples detection in forensics mainly focused on detecting attacks against FR systems in which the learning model is typically used only as a features extractor. Thus, in these cases, training a more robust classifier might not be enough to defend an FR system. In this frame, the contribution of our work is four-fold: i) we tested our recently proposed adversarial detection approach against classifier attacks, i.e. adversarial samples crafted to fool an FR neural network acting as a classifier; ii) using a k-Nearest Neighbor (kNN) algorithm as a guidance, we generated deep features attacks against an FR system based on a DL model acting as features extractor, followed by a kNN which gives back the query identity based on features similarity; iii) we used the deep features attacks to fool an FR system on the 1:1 Face Verification task and we showed their superior effectiveness with respect to classifier attacks in fooling such type of system; iv) we used the detectors trained on classifier attacks to detect deep features attacks, thus showing that such approach is generalizable to different types of offensives.

Submitted 5 December, 2019; originally announced December 2019.

MSC Class: I.2.0; I.2.6 ACM Class: I.2.0; I.2.6

Journal ref: Computer Vision and Image Understanding Volume 202, January 2021, 103103

arXiv:1905.07774 [pdf, other]

doi 10.1073/pnas.1816621116

Bacteria push the limits of chemotactic precision to navigate dynamic chemical gradients

Authors: Douglas R. Brumley, Francesco Carrara, Andrew M. Hein, Yutaka Yawata, Simon A. Levin, Roman Stocker

Abstract: Ephemeral aggregations of bacteria are ubiquitous in the environment, where they serve as hotbeds of metabolic activity, nutrient cycling, and horizontal gene transfer. In many cases, these regions of high bacterial concentration are thought to form when motile cells use chemotaxis to navigate to chemical hotspots. However, what governs the dynamics of bacterial aggregations is unclear. Here, we use a novel experimental platform to create realistic sub-millimeter scale nutrient pulses with controlled nutrient concentrations. By combining experiments, mathematical theory and agent-based simulations, we show that individual Vibrio ordalii bacteria begin chemotaxis toward hotspots of dissolved organic matter (DOM) when the magnitude of the chemical gradient rises sufficiently far above the sensory noise that is generated by stochastic encounters with chemoattractant molecules. Each DOM hotspot is surrounded by a dynamic ring of chemotaxing cells, which congregate in regions of high DOM concentration before dispersing as DOM diffuses and gradients become too noisy for cells to respond to. We demonstrate that V. ordalii operates close to the theoretical limits on chemotactic precision. Numerical simulations of chemotactic bacteria, in which molecule counting noise is explicitly taken into account, point at a tradeoff between nutrient acquisition and the cost of chemotactic precision. More generally, our results illustrate how limits on sensory precision can be used to understand the location, spatial extent, and lifespan of bacterial behavioral responses in ecologically relevant environments.

Submitted 19 May, 2019; originally announced May 2019.

Comments: 6 pages, 5 figures. PNAS first published May 16, 2019 https://doi.org/10.1073/pnas.1816621116

arXiv:1704.06178 [pdf, other]

Exploring epoch-dependent stochastic residual networks

Authors: Fabio Carrara, Andrea Esuli, Fabrizio Falchi, Alejandro Moreo Fernández

Abstract: The recently proposed stochastic residual networks selectively activate or bypass the layers during training, based on independent stochastic choices, each following a probability distribution that is fixed in advance. In this paper we present a first exploration on the use of an epoch-dependent distribution, starting with a higher probability of bypassing deeper layers and then activating them more frequently as training progresses. Preliminary results are mixed, yet they show some potential of adding an epoch-dependent management of distributions, worthy of further investigation.
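
The idea of an epoch-dependent bypass distribution can be made concrete with a small sketch. The abstract does not state the schedule used, so the linear form below is purely an assumption for illustration: deeper layers start with a lower probability of being active, and every layer's probability rises to 1 by the last epoch. The function name and the `p_min` parameter are hypothetical.

```python
import random

def survival_probability(layer, num_layers, epoch, num_epochs, p_min=0.5):
    """Probability that a residual layer is active (not bypassed).

    Assumed linear schedule: the drop probability grows with depth
    (layer in 1..num_layers) and shrinks as training progresses.
    """
    depth_frac = layer / num_layers
    progress = epoch / max(num_epochs - 1, 1)  # 0.0 at first epoch, 1.0 at last
    drop = (1.0 - p_min) * depth_frac * (1.0 - progress)
    return 1.0 - drop

def layer_is_active(layer, num_layers, epoch, num_epochs, rng=random):
    """Independent stochastic choice made per layer at each training step."""
    return rng.random() < survival_probability(layer, num_layers, epoch, num_epochs)

p_first = survival_probability(10, 10, epoch=0, num_epochs=11)
p_last = survival_probability(10, 10, epoch=10, num_epochs=11)
```

Under this assumed schedule the deepest of ten layers is active half the time at the first epoch and always active at the last one, while shallower layers are bypassed less often throughout.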

Submitted 20 April, 2017; originally announced April 2017.

Comments: Preliminary report

arXiv:1606.07287 [pdf, other]

doi 10.1007/s10791-017-9318-6

Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

Authors: Fabio Carrara, Andrea Esuli, Tiziano Fagni, Fabrizio Falchi, Alejandro Moreo Fernández

Abstract: In this paper we tackle the problem of image search when the query is a short textual description of the image the user is looking for. We choose to implement the actual search process as a similarity search in a visual feature space, by learning to translate a textual query into a visual representation. Searching in the visual feature space has the advantage that any update to the translation model does not require reprocessing the typically huge image collection on which the search is performed. We propose Text2Vis, a neural network that generates a visual representation, in the visual feature space of the fc6-fc7 layers of ImageNet, from a short descriptive text. Text2Vis optimizes two loss functions, using a stochastic loss-selection method. A visual-focused loss is aimed at learning the actual text-to-visual feature mapping, while a text-focused loss is aimed at modeling the higher-level semantic concepts expressed in language and countering the overfit on non-relevant visual components of the visual loss. We report preliminary results on the MS-COCO dataset.

Submitted 23 June, 2016; originally announced June 2016.

Comments: Neu-IR '16 SIGIR Workshop on Neural Information Retrieval, July 21, 2016, Pisa, Italy

arXiv:1512.04217 [pdf, other]

doi 10.1098/rsif.2015.0844

Physical Limits on Bacterial Navigation in Dynamic Environments

Authors: Andrew M. Hein, Douglas R. Brumley, Francesco Carrara, Roman Stocker, Simon A. Levin

Abstract: Many chemotactic bacteria inhabit environments in which chemicals appear as localized pulses and evolve by processes such as diffusion and mixing. We show that, in such environments, physical limits on the accuracy of temporal gradient sensing govern when and where bacteria can accurately measure the cues they use to navigate. Chemical pulses are surrounded by a predictable dynamic region, outside which bacterial cells cannot resolve gradients above noise. The outer boundary of this region initially expands in proportion to $\sqrt{t}$, before rapidly contracting. Our analysis also reveals how chemokinesis - the increase in swimming speed many bacteria exhibit when absolute chemical concentration exceeds a threshold - may serve to enhance chemotactic accuracy and sensitivity when the chemical landscape is dynamic. More generally, our framework provides a rigorous method for partitioning bacteria into populations that are "near" and "far" from chemical hotspots in complex, rapidly evolving environments such as those that dominate aquatic ecosystems.

Submitted 14 December, 2015; originally announced December 2015.

Comments: 19 pages, 5 figures (including Supplementary Text). Journal of The Royal Society Interface, in press

Showing 1–15 of 15 results for author: Carrara, F