Automatic benchmarking of large multimodal models via iterative experiment programming

Authors: Alessandro Conti, Enrico Fini, Paolo Rota, Yiming Wang, Massimiliano Mancini, Elisa Ricci

Abstract: Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 31 pages, 6 figures, code is available at https://github.com/altndrr/apex

arXiv:2406.04345 [pdf, other]

Stereo-Depth Fusion through Virtual Pattern Projection

Authors: Luca Bartolomei, Matteo Poggi, Fabio Tosi, Andrea Conti, Stefano Mattoccia

Abstract: This paper presents a novel general-purpose stereo and depth data fusion paradigm that mimics the active stereo principle by replacing the unreliable physical pattern projector with a depth sensor. It works by projecting virtual patterns consistent with the scene geometry onto the left and right images acquired by a conventional stereo camera, using the sparse hints obtained from a depth sensor, to facilitate the visual correspondence. Purposely, any depth sensing device can be seamlessly plugged into our framework, enabling the deployment of a virtual active stereo setup in any possible environment and overcoming the severe limitations of physical pattern projection, such as the limited working range and environmental conditions. Exhaustive experiments on indoor and outdoor datasets featuring both long and close range, including those providing raw, unfiltered depth hints from off-the-shelf depth sensors, highlight the effectiveness of our approach in notably boosting the robustness and accuracy of algorithms and deep stereo without any code modification and even without re-training. Additionally, we assess the performance of our strategy on active stereo evaluation datasets with conventional pattern projection. Indeed, in all these scenarios, our virtual pattern projection paradigm achieves state-of-the-art performance. The source code is available at: https://github.com/bartn8/vppstereo.

Submitted 6 June, 2024; originally announced June 2024.

Comments: extended version of ICCV 2023: "Active Stereo Without Pattern Projector"

arXiv:2404.10864 [pdf, other]

Vocabulary-free Image Classification and Semantic Segmentation

Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract: Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Under review, 22 pages, 10 figures, code is available at https://github.com/altndrr/vicss. arXiv admin note: text overlap with arXiv:2306.00917

arXiv:2404.07560 [pdf, other]

Socially Pertinent Robots in Gerontological Healthcare

Authors: Xavier Alameda-Pineda, Angus Addlesee, Daniel Hernández García, Chris Reinke, Soraya Arias, Federica Arrigoni, Alex Auternaud, Lauriane Blavette, Cigdem Beyan, Luis Gomez Camara, Ohad Cohen, Alessandro Conti, Sébastien Dacunha, Christian Dondrup, Yoav Ellinson, Francesco Ferro, Sharon Gannot, Florian Gras, Nancie Gunson, Radu Horaud, Moreno D'Incà, Imad Kimouche, Séverin Lemaignan, Oliver Lemon, Cyril Liotard, et al. (19 additional authors not shown)

Abstract: Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.05426 [pdf, other]

Test-Time Zero-Shot Temporal Action Localization

Authors: Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract: Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.

Submitted 11 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2401.14401 [pdf, other]

Range-Agnostic Multi-View Depth Estimation With Keyframe Selection

Authors: Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia

Abstract: Methods for 3D reconstruction from posed frames require prior knowledge about the scene metric range, usually to recover matching cues along the epipolar lines and narrow the search range. However, such prior might not be directly available or estimated inaccurately in real scenarios -- e.g., outdoor 3D reconstruction from video sequences -- therefore heavily hampering performance. In this paper, we focus on multi-view depth estimation without requiring prior knowledge about the metric range of the scene by proposing RAMDepth, an efficient and purely 2D framework that reverses the depth estimation and matching steps order. Moreover, we demonstrate the capability of our framework to provide rich insights about the quality of the views used for prediction. Additional material can be found on our project page https://andreaconti.github.io/projects/range_agnostic_multi_view_depth.

Submitted 25 January, 2024; originally announced January 2024.

Comments: 3DV 2024 Project Page https://andreaconti.github.io/projects/range_agnostic_multi_view_depth GitHub Page https://github.com/andreaconti/ramdepth.git

arXiv:2312.09254 [pdf, other]

Revisiting Depth Completion from a Stereo Matching Perspective for Cross-domain Generalization

Authors: Luca Bartolomei, Matteo Poggi, Andrea Conti, Fabio Tosi, Stefano Mattoccia

Abstract: This paper proposes a new framework for depth completion robust against domain-shifting issues. It exploits the generalization capability of modern stereo networks to face depth completion, by processing fictitious stereo pairs obtained through a virtual pattern projection paradigm. Any stereo network or traditional stereo matcher can be seamlessly plugged into our framework, allowing for the deployment of a virtual stereo setup that is future-proof against advancement in the stereo field. Exhaustive experiments on cross-domain generalization support our claims. Hence, we argue that our framework can help depth completion to reach new deployment scenarios.

Submitted 14 December, 2023; originally announced December 2023.

Comments: 3DV 2024. Code: https://github.com/bartn8/vppdc - Project page: https://vppdc.github.io/

arXiv:2309.12315 [pdf, other]

Active Stereo Without Pattern Projector

Authors: Luca Bartolomei, Matteo Poggi, Fabio Tosi, Andrea Conti, Stefano Mattoccia

Abstract: This paper proposes a novel framework integrating the principles of active stereo in standard passive camera systems without a physical pattern projector. We virtually project a pattern over the left and right images according to the sparse measurements obtained from a depth sensor. Any such devices can be seamlessly plugged into our framework, allowing for the deployment of a virtual active stereo setup in any possible environment, overcoming the limitation of pattern projectors, such as limited working range or environmental conditions. Experiments on indoor/outdoor datasets, featuring both long and close-range, support the seamless effectiveness of our approach, boosting the accuracy of both stereo algorithms and deep networks.

Submitted 21 September, 2023; originally announced September 2023.

Comments: ICCV 2023. Code: https://github.com/bartn8/vppstereo - Project page: https://vppstereo.github.io

arXiv:2308.09139 [pdf, other]

The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

Authors: Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci

Abstract: Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite the simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.

Submitted 22 August, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV2023, 14 pages, 7 figures, code is available at https://github.com/giaczara/dallv

arXiv:2306.00917 [pdf, other]

Vocabulary-free Image Classification

Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract: Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed as Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction.

Submitted 12 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted at NeurIPS2023, 19 pages, 8 figures, code is available at https://github.com/altndrr/vic

arXiv:2212.00790 [pdf, other]

Sparsity Agnostic Depth Completion

Authors: Andrea Conti, Matteo Poggi, Stefano Mattoccia

Abstract: We present a novel depth completion approach agnostic to the sparsity of depth points, that is very likely to vary in many practical applications. State-of-the-art approaches yield accurate results only when processing a specific density and distribution of input points, i.e. the one observed during training, narrowing their deployment in real use cases. On the contrary, our solution is robust to uneven distributions and extremely low densities never witnessed during training. Experimental results on standard indoor and outdoor benchmarks highlight the robustness of our framework, achieving accuracy comparable to state-of-the-art methods when tested with density and distribution equal to the training one while being much more accurate in the other cases. Our pretrained models and further material are available in our project page.

Submitted 1 December, 2022; originally announced December 2022.

Comments: This paper has been accepted for publication at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2023

arXiv:2210.11467 [pdf, other]

Multi-View Guided Multi-View Stereo

Authors: Matteo Poggi, Andrea Conti, Stefano Mattoccia

Abstract: This paper introduces a novel deep framework for dense 3D reconstruction from multiple image frames, leveraging a sparse set of depth measurements gathered jointly with image acquisition. Given a deep multi-view stereo network, our framework uses sparse depth hints to guide the neural network by modulating the plane-sweep cost volume built during the forward step, enabling us to infer constantly much more accurate depth maps. Moreover, since multiple viewpoints can provide additional depth measurements, we propose a multi-view guidance strategy that increases the density of the sparse points used to guide the network, thus leading to even more accurate results. We evaluate our Multi-View Guided framework within a variety of state-of-the-art deep multi-view stereo networks, demonstrating its effectiveness at improving the results achieved by each of them on BlendedMVG and DTU datasets.

Submitted 20 October, 2022; originally announced October 2022.

Comments: IROS 2022. First two authors contributed equally. Project page: https://github.com/andreaconti/multi-view-guided-multi-view-stereo

arXiv:2210.05246 [pdf, other]

Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition

Authors: Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

Abstract: Automatically understanding emotions from visual data is a fundamental task for human behaviour understanding. While models devised for Facial Expression Recognition (FER) have demonstrated excellent performances on many datasets, they often suffer from severe performance degradation when trained and tested on different datasets due to domain shift. In addition, as face images are considered highly sensitive data, the accessibility to large-scale datasets for model training is often denied. In this work, we tackle the above-mentioned problems by proposing the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for FER. Our method exploits self-supervised pretraining to learn good feature representations from the target data and proposes a novel and robust cluster-level pseudo-labelling strategy that accounts for in-cluster statistics. We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER, and is on par with methods addressing FER in the UDA setting.

Submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted at BMVC2022, 13 pages, 4 figures, code is available at https://github.com/altndrr/clup

arXiv:2210.03118 [pdf, other]

Unsupervised confidence for LiDAR depth maps and applications

Authors: Andrea Conti, Matteo Poggi, Filippo Aleotti, Stefano Mattoccia

Abstract: Depth perception is pivotal in many fields, such as robotics and autonomous driving, to name a few. Consequently, depth sensors such as LiDARs rapidly spread in many applications. The 3D point clouds generated by these sensors must often be coupled with an RGB camera to understand the framed scene semantically. Usually, the former is projected over the camera image plane, leading to a sparse depth map. Unfortunately, this process, coupled with the intrinsic issues affecting all the depth sensors, yields noise and gross outliers in the final output. Purposely, in this paper, we propose an effective unsupervised framework aimed at explicitly addressing this issue by learning to estimate the confidence of the LiDAR sparse depth map and thus allowing for filtering out the outliers. Experimental results on the KITTI dataset highlight that our framework excels for this purpose. Moreover, we demonstrate how this achievement can improve a wide range of tasks.

Submitted 6 October, 2022; originally announced October 2022.

Comments: IROS 2022. Code available at https://github.com/andreaconti/lidar-confidence

arXiv:2208.08860 [pdf]

An intertwined neural network model for EEG classification in brain-computer interfaces

Authors: Andrea Duggento, Mario De Lorenzo, Stefano Bargione, Allegra Conti, Vincenzo Catrambone, Gaetano Valenza, Nicola Toschi

Abstract: The brain computer interface (BCI) is a nonstimulatory direct and occasionally bidirectional communication link between the brain and a computer or an external device. Classically, EEG-based BCI algorithms have relied on models such as support vector machines and linear discriminant analysis or multiclass common spatial patterns. During the last decade, however, more sophisticated machine learning architectures, such as convolutional neural networks, recurrent neural networks, long short-term memory networks and gated recurrent unit networks, have been extensively used to enhance discriminability in multiclass BCI tasks. Additionally, preprocessing and denoising of EEG signals has always been key in the successful decoding of brain activity, and the determination of an optimal and standardized EEG preprocessing activity is an active area of research. In this paper, we present a deep neural network architecture specifically engineered to a) provide state-of-the-art performance in multiclass motor imagery classification and b) remain robust to preprocessing to enable real-time processing of raw data as it streams from EEG and BCI equipment. It is based on the intertwined use of time-distributed fully connected (tdFC) and space-distributed 1D temporal convolutional layers (sdConv) and explicitly addresses the possibility that interaction of spatial and temporal features of the EEG signal occurs at all levels of complexity. Numerical experiments demonstrate that our architecture provides superior performance compared to baselines based on a combination of 3D convolutions and recurrent neural networks in a six-class motor imagery network, with a subjectwise accuracy that reaches 99%. Importantly, these results remain unchanged when minimal or extensive preprocessing is applied, possibly paving the way for a more transversal and real-time use of deep learning architectures in EEG classification.

Submitted 4 August, 2022; originally announced August 2022.

arXiv:2207.11482 [pdf, other]

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Authors: Riccardo Franceschini, Enrico Fini, Cigdem Beyan, Alessandro Conti, Federica Arrigoni, Elisa Ricci

Abstract: Emotion recognition is involved in several real-world applications. With an increase in available modalities, automatic understanding of emotions is being performed more accurately. The success in Multimodal Emotion Recognition (MER) primarily relies on the supervised learning paradigm. However, data annotation is expensive, time-consuming, and as emotion expression and perception depend on several factors (e.g., age, gender, culture) obtaining labels with a high reliability is hard. Motivated by these, we focus on unsupervised feature learning for MER. We consider discrete emotions, and as modalities text, audio and vision are used. Our method, as being based on contrastive loss between pairwise modalities, is the first attempt in MER literature. Our end-to-end feature learning approach has several differences (and advantages) compared to existing MER methods: i) it is unsupervised, so the learning incurs no data labelling cost; ii) it does not require data spatial augmentation, modality alignment, large number of batch size or epochs; iii) it applies data fusion only at inference; and iv) it does not require backbones pre-trained on emotion recognition task. The experiments on benchmark datasets show that our method outperforms several baseline approaches and unsupervised learning methods applied in MER. Particularly, it even surpasses a few supervised MER state-of-the-art.

Submitted 23 July, 2022; originally announced July 2022.

Comments: Accepted to 26th International Conference on Pattern Recognition (ICPR) 2022

arXiv:2204.01693 [pdf, other]

Monitoring social distancing with single image depth estimation

Authors: Alessio Mingozzi, Andrea Conti, Filippo Aleotti, Matteo Poggi, Stefano Mattoccia

Abstract: The recent pandemic emergency raised many challenges regarding the countermeasures aimed at containing the virus spread, and constraining the minimum distance between people resulted in one of the most effective strategies. Thus, the implementation of autonomous systems capable of monitoring the so-called social distance gained much interest. In this paper, we aim to address this task leveraging a single RGB frame without additional depth sensors. In contrast to existing single-image alternatives failing when ground localization is not available, we rely on single image depth estimation to perceive the 3D structure of the observed scene and estimate the distance between people. During the setup phase, a straightforward calibration procedure, leveraging a scale-aware SLAM algorithm available even on consumer smartphones, allows us to address the scale ambiguity affecting single image depth estimation. We validate our approach through indoor and outdoor images employing a calibrated LiDAR + RGB camera asset. Experimental results highlight that our proposal enables sufficiently reliable estimation of the inter-personal distance to monitor social distancing effectively. This fact confirms that despite its intrinsic ambiguity, if appropriately driven single image depth estimation can be a viable alternative to other depth perception techniques, more expensive and not always feasible in practical applications. Our evaluation also highlights that our framework can run reasonably fast and comparably to competitors, even on pure CPU systems. Moreover, its practical deployment on low-power systems is around the corner.

Submitted 29 April, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

Comments: Accepted for publication in IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI)

arXiv:2001.11494 [pdf, other]

Peregrine: Network Localization and Navigation with Scalable Inference and Efficient Operation

Authors: Bryan Teague, Zhenyu Liu, Florian Meyer, Andrea Conti, Moe Z. Win

Abstract: Location-aware networks will enable new services and applications in fields such as autonomous driving, smart cities, and the Internet-of-Things. One promising solution for ubiquitous localization is network localization and navigation (NLN), where devices form a network that cooperatively localizes itself, reducing the infrastructure needed for accurate localization. This paper introduces a real-time NLN system named Peregrine, which combines distributed NLN algorithms with commercially available ultra-wideband (UWB) sensing and communication technology. The Peregrine software application, for the first time, integrates three NLN algorithms to jointly perform the tasks of localization and network operation in a technology agnostic manner, leveraging both spatial and temporal cooperation. Peregrine hardware is composed of low-cost, compact devices that comprise a microprocessor and a commercial UWB radio. This paper presents the design of the Peregrine system and characterizes the performance impact of each algorithmic component. Indoor experiments validate that our approach to realizing NLN is both reliable and scalable, and maintains sub-meter-level accuracy even in challenging indoor scenarios.

Submitted 30 January, 2020; originally announced January 2020.

Comments: 16 pages, 8 figures

arXiv:0710.1383 [pdf, ps, other]

doi 10.1109/TIT.2009.2018273

Log-concavity property of the error probability with application to local bounds for wireless communications

Authors: Andrea Conti, Dmitry Panchenko, Sergiy Sidenko, Velio Tralli

Abstract: A clear understanding of the behavior of the error probability (EP) as a function of signal-to-noise ratio (SNR) and other system parameters is fundamental for assessing the design of digital wireless communication systems. We propose an analytical framework based on the log-concavity property of the EP, which we prove for a wide family of multidimensional modulation formats in the presence of Gaussian disturbances and fading. Based on this property, we construct a class of local bounds for the EP that improve known generic bounds in a given region of the SNR and are invertible, as well as easily tractable for further analysis. This concept is motivated by the fact that communication systems often operate with performance in a certain region of interest (ROI) and, thus, it may be advantageous to have tighter bounds within this region instead of generic bounds valid for all SNRs. We present a possible application of these local bounds, but their relevance is beyond the example made in this paper.

Submitted 20 February, 2009; v1 submitted 6 October, 2007; originally announced October 2007.

Journal ref: IEEE Trans. Inform. Theory, 2009, vol. 55, no. 6, 2766-2775.

arXiv:0704.0282 [pdf, other]

On Punctured Pragmatic Space-Time Codes in Block Fading Channel

Authors: Samuele Bandi, Luca Stabellini, Andrea Conti, Velio Tralli

Abstract: This paper considers the use of punctured convolutional codes to obtain pragmatic space-time trellis codes over block-fading channels. We show that good performance can be achieved even when puncturing is adopted and that we can still employ the same Viterbi decoder of the convolutional mother code by using approximated metrics without increasing the complexity of the decoding operations.

Submitted 2 April, 2007; originally announced April 2007.

arXiv:cs/0703142 [pdf, ps, other]

Pragmatic Space-Time Trellis Codes for Block Fading Channels

Authors: Marco Chiani, Andrea Conti, Velio Tralli

Abstract: A pragmatic approach for the construction of space-time codes over block fading channels is investigated. The approach consists in using common convolutional encoders and Viterbi decoders with suitable generators and rates, thus greatly simplifying the implementation of space-time codes. For the design of pragmatic space-time codes, a methodology is proposed and applied, based on the extension of the concept of generalized transfer function for convolutional codes over block fading channels. Our search algorithm produces the convolutional encoder generators of pragmatic space-time codes for various numbers of states, numbers of antennas, and fading rates. Finally, it is shown that, for the investigated cases, the performance of pragmatic space-time codes is better than that of previously known space-time codes, confirming that they are a valuable choice in terms of both implementation complexity and performance.

Submitted 28 March, 2007; originally announced March 2007.
