Showing 1–50 of 108 results for author: Cucchiara, R

  1. arXiv:2407.01397  [pdf, other]

    cs.CV cs.AI

    Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

    Authors: Matteo Mosconi, Andriy Sorokin, Aniello Panariello, Angelo Porrello, Jacopo Bonato, Marco Cotogni, Luigi Sabetta, Simone Calderara, Rita Cucchiara

    Abstract: The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARO…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted at ICPR 2024

  2. arXiv:2405.20743  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes

    Authors: Riccardo Benaglia, Angelo Porrello, Pietro Buzzega, Simone Calderara, Rita Cucchiara

    Abstract: Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: 15 pages, 3 figures, 5 tables

  3. arXiv:2405.20008  [pdf, other]

    cs.CV

    Sharing Key Semantics in Transformer Makes Efficient Image Restoration

    Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe

    Abstract: Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the Vision Transformers (ViTs) emergence has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objec…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 9 pages

  4. arXiv:2405.16350  [pdf, ps, other]

    cs.AI cs.LG

    A Second-Order perspective on Compositionality and Incremental Learning

    Authors: Angelo Porrello, Lorenzo Bonicelli, Pietro Buzzega, Monica Millunzi, Simone Calderara, Rita Cucchiara

    Abstract: The fine-tuning of deep pre-trained models has recently revealed compositional properties. This enables the arbitrary composition of multiple specialized modules into a single, multi-task model. However, identifying the conditions that promote compositionality remains an open issue, with recent efforts concentrating mainly on linearized networks. We conduct a theoretical study that attempts to dem…

    Submitted 25 May, 2024; originally announced May 2024.

  5. arXiv:2405.13127  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Towards Retrieval-Augmented Architectures for Image Captioning

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

    Abstract: The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This wor…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  6. arXiv:2404.17243  [pdf, other]

    cs.CV

    Binarizing Documents by Leveraging both Space and Frequency

    Authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

    Abstract: Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: Accepted at ICDAR2024

  7. arXiv:2404.15406  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

    Authors: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at int…

    Submitted 22 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models

  8. arXiv:2404.10054  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    AIGeN: An Adversarial Approach for Instruction Generation in VLN

    Authors: Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, Rita Cucchiara

    Abstract: In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in literature focuses on different ways to augment the available datasets of instructions for improving navigation perform…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted to 7th Multimodal Learning and Applications Workshop (MULA 2024) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  9. arXiv:2404.06542  [pdf, other]

    cs.CV

    Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

    Authors: Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets in…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: CVPR 2024. Project page: https://aimagelab.github.io/freeda/

  10. arXiv:2403.14828  [pdf, other]

    cs.CV

    Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

    Authors: Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try…

    Submitted 25 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  11. arXiv:2403.08933  [pdf, other]

    cs.CV cs.AI

    Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

    Authors: Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Rita Cucchiara

    Abstract: Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively wo…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted to IEEE Signal Processing Letters 2024

  12. arXiv:2403.07076  [pdf, other]

    cs.RO cs.AI cs.CV

    Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

    Authors: Roberto Bigazzi, Lorenzo Baraldi, Shreyas Kousik, Rita Cucchiara, Marco Pavone

    Abstract: Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph generation; less effort has been focused on the task of purely identifying and mapping large semantic regions. The present work proposes a method for semantic region map…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2024)

  13. arXiv:2402.18673  [pdf, other]

    cs.CV cs.AI

    Trends, Applications, and Challenges in Human Attention Modelling

    Authors: Giuseppe Cartella, Marcella Cornia, Vittorio Cuculo, Alessandro D'Amelio, Dario Zanca, Giuseppe Boccignone, Rita Cucchiara

    Abstract: Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned…

    Submitted 22 April, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: Accepted at IJCAI 2024 Survey Track

  14. arXiv:2402.12451  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    The Revolution of Multimodal Large Language Models: A Survey

    Authors: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-foll…

    Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (Findings)

  15. arXiv:2402.10798  [pdf, other]

    cs.CV cs.AI

    VATr++: Choose Your Words Wisely for Handwritten Text Generation

    Authors: Bram Vanherle, Vittorio Pippi, Silvia Cascianelli, Nick Michiels, Frank Van Reeth, Rita Cucchiara

    Abstract: Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect - the impact of the input, both visual and textual, on the HTG model training and its subsequent influenc…

    Submitted 16 February, 2024; originally announced February 2024.

  16. arXiv:2402.02634  [pdf, other]

    cs.CV cs.LG eess.IV

    Key-Graph Transformer for Image Restoration

    Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

    Abstract: While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: 9 pages, 6 figures

  17. arXiv:2401.03191  [pdf, other]

    cs.CV

    DistFormer: Enhancing Local and Global Features for Monocular Per-Object Distance Estimation

    Authors: Aniello Panariello, Gianluca Mancusi, Fedy Haj Ali, Angelo Porrello, Simone Calderara, Rita Cucchiara

    Abstract: Accurate per-object distance estimation is crucial in safety-critical applications such as autonomous driving, surveillance, and robotics. Existing approaches rely on two scales: local information (i.e., the bounding box proportions) or global information, which encodes the semantics of the scene as well as the spatial relations with neighboring objects. However, these approaches may struggle with…

    Submitted 6 January, 2024; originally announced January 2024.

  18. arXiv:2311.16254  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

    Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of…

    Submitted 12 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  19. arXiv:2310.20316  [pdf, other]

    cs.CV cs.DL

    HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

    Authors: Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Rita Cucchiara

    Abstract: Styled Handwritten Text Generation (Styled HTG) is an important task in document analysis, aiming to generate text images with the handwriting of given reference images. In recent years, there has been significant progress in the development of deep learning models for tackling this task. Being able to measure the performance of HTG models via a meaningful and representative criterion is key for f…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted at BMVC2023

  20. arXiv:2309.05551  [pdf, other]

    cs.CV

    OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

    Authors: Giuseppe Cartella, Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In th…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: International Conference on Image Analysis and Processing (ICIAP) 2023

  21. arXiv:2308.12383  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

    Authors: Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information whi…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  22. arXiv:2308.11513  [pdf, other]

    cs.CV cs.AI cs.LG

    TrackFlow: Multi-Object Tracking with Normalizing Flows

    Authors: Gianluca Mancusi, Aniello Panariello, Angelo Porrello, Matteo Fabbri, Simone Calderara, Rita Cucchiara

    Abstract: The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous…

    Submitted 22 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  23. arXiv:2308.05070  [pdf, other]

    cs.CV cs.DL

    Volumetric Fast Fourier Convolution for Detecting Ink on the Carbonized Herculaneum Papyri

    Authors: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

    Abstract: Recent advancements in Digital Document Restoration (DDR) have led to significant breakthroughs in analyzing highly damaged written artifacts. Among those, there has been an increasing interest in applying Artificial Intelligence techniques for virtually unwrapping and automatically detecting ink on the Herculaneum papyri collection. This collection consists of carbonized scrolls and fragments of…

    Submitted 9 August, 2023; originally announced August 2023.

    Comments: Accepted at the 4th ICCV Workshop on e-Heritage (in conjunction with ICCV 2023)

  24. arXiv:2307.12718  [pdf, other]

    cs.CV

    CarPatch: A Synthetic Benchmark for Radiance Field Evaluation on Vehicle Components

    Authors: Davide Di Nucci, Alessandro Simoni, Matteo Tomei, Luca Ciuffreda, Roberto Vezzani, Rita Cucchiara

    Abstract: Neural Radiance Fields (NeRFs) have gained widespread recognition as a highly effective technique for representing 3D reconstructions of objects and scenes derived from sets of images. Despite their efficiency, NeRF models can pose challenges in certain scenarios such as vehicle inspection, where the lack of sufficient data or the presence of challenging elements (e.g. reflections) strongly impact…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: Accepted at ICIAP2023

  25. arXiv:2307.09416  [pdf, other]

    cs.CV cs.CL

    Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

    Authors: Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

    Abstract: Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content an…

    Submitted 19 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

    Comments: Accepted as oral at ACM MultiMedia 2023 (Brave New Ideas track)

  26. arXiv:2306.07346  [pdf, other]

    cs.CV cs.AI cs.MM

    Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

    Authors: Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

    Abstract: The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the i…

    Submitted 12 June, 2023; originally announced June 2023.

  27. arXiv:2305.13501  [pdf, other]

    cs.CV cs.AI cs.MM

    LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

    Authors: Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a give…

    Submitted 3 August, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: ACM Multimedia 2023

  28. arXiv:2305.02593  [pdf, other]

    cs.CV cs.DL

    How to Choose Pretrained Handwriting Recognition Models for Single Writer Fine-Tuning

    Authors: Vittorio Pippi, Silvia Cascianelli, Christopher Kermorvant, Rita Cucchiara

    Abstract: Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on both modern and historical manuscripts in large benchmark datasets. Nonetheless, those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting. This issue is ver…

    Submitted 4 May, 2023; originally announced May 2023.

    Comments: Accepted at ICDAR2023

  29. arXiv:2304.02051  [pdf, other]

    cs.CV cs.AI cs.MM

    Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

    Authors: Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

    Abstract: Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-co…

    Submitted 23 August, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: ICCV 2023

  30. arXiv:2304.02049  [pdf, other]

    cs.CV cs.AI cs.LG

    Multi-Class Unlearning for Image Classification via Weight Filtering

    Authors: Samuele Poppi, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any clas…

    Submitted 8 June, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: IEEE Intelligent Systems (2024)

  31. arXiv:2304.01842  [pdf, other]

    cs.CV

    Evaluating Synthetic Pre-Training for Handwriting Processing Tasks

    Authors: Vittorio Pippi, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks. To this end, we build a large synthetic dataset of word images rendered in several handwriting fonts, which offers a complete supervision signal. We use it to train a simple convolutional neural network (ConvNet) with a fully supervised objec…

    Submitted 4 April, 2023; originally announced April 2023.

  32. arXiv:2304.00500  [pdf, other]

    cs.CV cs.AI cs.MM

    Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

    Authors: Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara

    Abstract: Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and cast new pressures on fake image detection. In this work, we pioneer a systematic study on deepfake detection generated by s…

    Submitted 21 May, 2024; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  33. arXiv:2303.15269  [pdf, other]

    cs.CV

    Handwritten Text Generation from Visual Archetypes

    Authors: Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

    Abstract: Generating synthetic images of handwritten text in a writer-specific style is a challenging task, especially in the case of unseen styles and new words, and even more when these latter contain characters that are rarely encountered during training. While emulating a writer's style has been recently addressed by generative models, the generalization towards rare characters has been disregarded. In…

    Submitted 27 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR2023

  34. arXiv:2303.12112  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

    Authors: Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contr…

    Submitted 20 July, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: CVPR 2023 (highlight paper)

  35. arXiv:2302.06375  [pdf, other]

    cs.LG cs.AI

    One Transformer for All Time Series: Representing and Training with Time-Dependent Heterogeneous Tabular Data

    Authors: Simone Luetto, Fabrizio Garuti, Enver Sangineto, Lorenzo Forni, Rita Cucchiara

    Abstract: There is a recent growing interest in applying Deep Learning techniques to tabular data, in order to replicate the success of other Artificial Intelligence areas in this structured domain. Specifically interesting is the case in which tabular data have a time dependence, such as, for instance financial transactions. However, the heterogeneity of the tabular values, in which categorical elements ar…

    Submitted 12 July, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

    Comments: 9 pages, 2 figures, 7 tables

  36. arXiv:2301.11706  [pdf, other]

    cs.LG cs.AI cs.CV

    Input Perturbation Reduces Exposure Bias in Diffusion Models

    Authors: Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, Rita Cucchiara

    Abstract: Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between trai…

    Submitted 18 June, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Comments: accepted by ICML 2023

  37. arXiv:2301.07150  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    Embodied Agents for Efficient Exploration and Smart Scene Description

    Authors: Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment w…

    Submitted 17 January, 2023; originally announced January 2023.

    Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2023)

  38. arXiv:2208.08109  [pdf, other]

    cs.CV

    Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions

    Authors: Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typi…

    Submitted 17 August, 2022; originally announced August 2022.

    Journal ref: International Journal on Document Analysis and Recognition (IJDAR), 2022, 1-11

  39. arXiv:2208.07682  [pdf, other]

    cs.CV cs.DL

    The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

    Authors: Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, Rita Cucchiara

    Abstract: Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting -- even of the same author over a wide time-span -- and the scarcity of data from ancient, poorly represented languages. With…

    Submitted 16 August, 2022; originally announced August 2022.

    Comments: Accepted at ICPR 2022

  40. arXiv:2208.05251  [pdf, other]

    cs.CV cs.AI cs.MM

    Consistency-based Self-supervised Learning for Temporal Anomaly Localization

    Authors: Aniello Panariello, Angelo Porrello, Simone Calderara, Rita Cucchiara

    Abstract: This work tackles Weakly Supervised Anomaly detection, in which a predictor is allowed to learn not only from normal examples but also from a few labeled anomalies made available during training. In particular, we deal with the localization of anomalous activities within the video stream: this is a very challenging scenario, as training examples come only with video-level annotations (and not fram…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Accepted to the WCPA Workshop at ECCV 2022 (1st International Workshop and Challenge on People Analysis: From Face, Body and Fashion to 3D Virtual Avatars). 13 pages, 2 figures

  41. arXiv:2207.14757  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

    Authors: Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara

    Abstract: Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-ver…

    Submitted 29 July, 2022; originally announced July 2022.

    Comments: CBMI 2022

  42. arXiv:2207.13162  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Retrieval-Augmented Transformer for Image Captioning

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, wi…

    Submitted 22 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: CBMI 2022

  43. arXiv:2206.08704  [pdf, other]

    cs.LG cs.CV

    Maximum Class Separation as Inductive Bias in One Matrix

    Authors: Tejaswi Kasarla, Gertjan J. Burghouts, Max van Spengler, Elise van der Pol, Rita Cucchiara, Pascal Mettes

    Abstract: Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. Current approaches tend to optimize classification and separation jointly: aligning inputs…

    Submitted 22 October, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

  44. arXiv:2205.12551  [pdf, other]

    cs.CV cs.CR

    Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

    Authors: Bin Ren, Yahui Liu, Yue Song, Wei Bi, Rita Cucchiara, Nicu Sebe, Wei Wang

    Abstract: Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed. This caveat naturally raises a series of interesting questions about the impact of PEs on the accuracy, priva…

    Submitted 26 May, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: Accepted to CVPR 2023

  45. arXiv:2204.11561  [pdf, other]

    cs.CV cs.AI cs.LG

    Goal-driven Self-Attentive Recurrent Networks for Trajectory Prediction

    Authors: Luigi Filippo Chiara, Pasquale Coscia, Sourav Das, Simone Calderara, Rita Cucchiara, Lamberto Ballan

    Abstract: Human trajectory forecasting is a key component of autonomous vehicles, social-aware robots and advanced video-surveillance applications. This challenging task typically requires knowledge about past motion, the environment and likely destination areas. In this context, multi-modality is a fundamental aspect and its effective modeling can be beneficial to any architecture. Inferring accurate traje…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: Accepted to CVPR 2022 Precognition Workshop

  46. Embodied Navigation at the Art Gallery

    Authors: Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes like offices, flats, and houses. In this paper, we build and release a new 3D space with unique characteristics: the one of a complete art museum. We name this environ…

    Submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted by 21st International Conference on Image Analysis and Processing (ICIAP 2021)

  47. arXiv:2204.08532  [pdf, other]

    cs.CV cs.AI cs.GR cs.MM

    Dress Code: High-Resolution Multi-Category Virtual Try-On

    Authors: Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, Rita Cucchiara

    Abstract: Image-based virtual try-on strives to transfer the appearance of a clothing item onto the image of a target person. Prior work focuses mainly on upper-body clothes (e.g. t-shirts, shirts, and tops) and neglects full-body or lower-body items. This shortcoming arises from a main factor: current publicly available datasets for image-based virtual try-on do not account for this variety, thus limiting…

    Submitted 13 July, 2022; v1 submitted 18 April, 2022; originally announced April 2022.

    Comments: ECCV 2022 - Video Demo: https://www.youtube.com/watch?v=qr6TW3uTHG4

  48. Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

    Authors: Federico Landi, Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field demand the agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases that instead require executing multiple tasks in the same environment. Even if the environment changes over time, the…

    Submitted 18 April, 2022; originally announced April 2022.

    Comments: Accepted by 26th International Conference on Pattern Recognition (ICPR 2022)

  49. arXiv:2203.04781  [pdf, other]

    cs.CV cs.AI

    How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting

    Authors: Alessio Monti, Angelo Porrello, Simone Calderara, Pasquale Coscia, Lamberto Ballan, Rita Cucchiara

    Abstract: Accurate prediction of future human positions is an essential task for modern video-surveillance systems. Current state-of-the-art models usually rely on a "history" of past tracked locations (e.g., 3 to 5 seconds) to predict a plausible sequence of future locations (e.g., up to the next 5 seconds). We feel that this common schema neglects critical traits of realistic applications: as the collecti…

    Submitted 9 March, 2022; originally announced March 2022.

    Comments: Accepted by CVPR 2022

  50. arXiv:2202.10492  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    CaMEL: Mean Teacher Learning for Image Captioning

    Authors: Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay betw…

    Submitted 21 February, 2022; originally announced February 2022.