Search | arXiv e-print repository

CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding

Authors: Emanuele Vivoli, Marco Bertini, Dimosthenis Karatzas

Abstract: The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as object detection or text recognition, CoMix addresses a broader range of tasks including object detection, speaker identification, character re-identification, reading order, and multi-modal reasoning tasks like character naming and dialogue generation. Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation. To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books, thereby enriching the diversity of comic styles. CoMix is designed to assess pre-trained models in zero-shot and limited fine-tuning settings, probing their transfer capabilities across different comic styles and tasks. The validation split of the benchmark is publicly available for research purposes, and an evaluation server for the held-out test split is also provided. Comparative results between human performance and state-of-the-art models reveal a significant performance gap, highlighting substantial opportunities for advancements in comic understanding. The dataset, baseline models, and code are accessible at the repository link. This initiative sets a new standard for comprehensive comic analysis, providing the community with a common benchmark for evaluation on a large and varied set.

Submitted 3 July, 2024; originally announced July 2024.

Comments: Under review. Repository link: https://github.com/emanuelevivoli/CoMix-dataset

arXiv:2407.03540 [pdf, other]

Comics Datasets Framework: Mix of Comics datasets for detection benchmarking

Authors: Emanuele Vivoli, Irene Campaioli, Mariateresa Nardoni, Niccolò Biondi, Marco Bertini, Dimosthenis Karatzas

Abstract: Comics, as a medium, uniquely combine text and images in styles often distinct from real-world visuals. For the past three decades, computational research on comics has evolved from basic object detection to more sophisticated tasks. However, the field faces persistent challenges such as small datasets, inconsistent annotations, inaccessible model weights, and results that cannot be directly compared due to varying train/test splits and metrics. To address these issues, we aim to standardize annotations across datasets, introduce a variety of comic styles into the datasets, and establish benchmark results with clear, replicable settings. Our proposed Comics Datasets Framework standardizes dataset annotations into a common format and addresses the overrepresentation of manga by introducing Comics100, a curated collection of 100 books from the Digital Comics Museum, annotated for detection in our uniform format. We have benchmarked a variety of detection architectures using the Comics Datasets Framework. All related code, model weights, and detailed evaluation processes are available at https://github.com/emanuelevivoli/cdf, ensuring transparency and facilitating replication. This initiative is a significant advancement towards improving object detection in comics, laying the groundwork for more complex computational tasks dependent on precise object recognition.

Submitted 3 July, 2024; originally announced July 2024.

Comments: Accepted at MANPU - COMICS workshop at ICDAR

arXiv:2407.03056 [pdf, other]

Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

Authors: Marco Mistretta, Alberto Baldrati, Marco Bertini, Andrew D. Bagdanov

Abstract: Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge. The code is publicly available at https://github.com/miccunifi/KDPL.

Submitted 3 July, 2024; originally announced July 2024.

Comments: Accepted for publication at ECCV24

arXiv:2406.14607 [pdf, other]

Quantum Extreme Learning of molecular potential energy surfaces and force fields

Authors: Gabriele Lo Monaco, Marco Bertini, Salvatore Lorenzo, G. Massimo Palma

Abstract: Quantum machine learning algorithms are expected to play a pivotal role in quantum chemistry simulations in the immediate future. One such key application is the training of a quantum neural network to learn the potential energy surface and force field of molecular systems. We address this task by using the quantum extreme learning machine paradigm. This particular supervised learning routine allows for resource-efficient training, consisting of a simple linear regression performed on a classical computer. We have tested a setup that can be used to study molecules of any dimension and is optimized for immediate use on NISQ devices with a limited number of native gates. We have applied this setup to three case studies: lithium hydride, water, and formamide, carrying out both noiseless simulations and actual implementation on IBM quantum hardware. Compared to other supervised learning routines, the proposed setup requires minimal quantum resources, making it feasible for direct implementation on quantum platforms, while still achieving a high level of predictive accuracy compared to simulations. Our encouraging results pave the way towards future application to more complex molecules, as the proposed setup is scalable.
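The classical training step mentioned in the abstract above (a linear regression on measured outputs) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `fit_linear_readout` and `predict`, the small ridge term, and the toy feature rows, which merely stand in for whatever observables a quantum reservoir would provide. It is not the authors' implementation.

```python
def fit_linear_readout(features, targets, ridge=1e-8):
    """Least-squares fit of readout weights w (hypothetical helper).

    Solves (X^T X + ridge * I) w = X^T y by Gaussian elimination.
    `features` stands in for measured observables; it is placeholder data.
    """
    n = len(features[0])
    # Normal equations: A = X^T X + ridge * I, b = X^T y
    A = [[sum(row[i] * row[j] for row in features) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(features, targets)) for i in range(n)]
    # Forward elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / A[i][i]
    return w


def predict(weights, feature_row):
    """Linear readout: weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, feature_row))


# Toy data: a constant feature plus one varying feature, target = 2 + 3x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 5.0, 8.0, 11.0]
weights = fit_linear_readout(X, y)
```

Only the readout weights are trained in this sketch; the feature map is treated as fixed, which is what makes the fit a single linear solve.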

Submitted 20 June, 2024; originally announced June 2024.

Comments: 14 pages, 7 figures. Accepted on Machine Learning: Science and Technology

arXiv:2405.02951 [pdf, other]

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

Authors: Lorenzo Agnolucci, Alberto Baldrati, Marco Bertini, Alberto Del Bimbo

Abstract: Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at https://github.com/miccunifi/SEARLE.

Submitted 5 May, 2024; originally announced May 2024.

Comments: Extended version of the ICCV2023 paper arXiv:2303.15247

arXiv:2403.14828 [pdf, other]

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Authors: Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.

Submitted 25 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.11176 [pdf, other]

Quality-Aware Image-Text Alignment for Real-World Image Quality Assessment

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini

Abstract: No-Reference Image Quality Assessment (NR-IQA) focuses on designing methods to measure image quality in alignment with human perception when a high-quality reference image is unavailable. The reliance on annotated Mean Opinion Scores (MOS) in the majority of state-of-the-art NR-IQA approaches limits their scalability and broader applicability to real-world scenarios. To overcome this limitation, we propose QualiCLIP (Quality-aware CLIP), a CLIP-based self-supervised opinion-unaware method that does not require labeled MOS. In particular, we introduce a quality-aware image-text alignment strategy to make CLIP generate representations that correlate with the inherent quality of the images. Starting from pristine images, we synthetically degrade them with increasing levels of intensity. Then, we train CLIP to rank these degraded images based on their similarity to quality-related antonym text prompts, while guaranteeing consistent representations for images with comparable quality. Our method achieves state-of-the-art performance on several datasets with authentic distortions. Moreover, despite not requiring MOS, QualiCLIP outperforms supervised methods when their training dataset differs from the testing one, thus proving to be more suitable for real-world scenarios. Furthermore, our approach demonstrates greater robustness and improved explainability than competing methods. The code and the model are publicly available at https://github.com/miccunifi/QualiCLIP.
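One common way to turn similarities to a pair of antonym prompts into a single quality score is a two-way softmax over cosine similarities. The sketch below assumes that form; the abstract above does not state the exact scoring rule, and the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical stand-ins for real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def quality_score(image_emb, good_emb, bad_emb, temperature=0.01):
    """Probability mass on the positive prompt (assumed scoring rule).

    good_emb / bad_emb stand in for the embeddings of an antonym prompt
    pair such as "Good photo" / "Bad photo" (assumed wording).
    """
    s_good = cosine(image_emb, good_emb) / temperature
    s_bad = cosine(image_emb, bad_emb) / temperature
    m = max(s_good, s_bad)  # subtract the max for numerical stability
    e_good = math.exp(s_good - m)
    e_bad = math.exp(s_bad - m)
    return e_good / (e_good + e_bad)


# Toy embeddings: the first axis plays the role of "good", the second "bad".
good, bad = [1.0, 0.0], [0.0, 1.0]
scores = [quality_score(img, good, bad)
          for img in ([1.0, 0.1], [1.0, 1.0], [0.1, 1.0])]
```

With these toy vectors the score decreases as the image embedding drifts from the positive prompt towards the negative one, which is the ordering a ranking objective over increasingly degraded images would encourage.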

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2311.04263 [pdf, other]

Perceptual Quality Improvement in Videoconferencing using Keyframes-based GAN

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo

Abstract: In recent years, videoconferencing has taken a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifacts reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after the video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates. Code and pre-trained networks are publicly available at https://github.com/LorenzoAgnolucci/Keyframes-GAN.

Submitted 7 November, 2023; originally announced November 2023.

Comments: IEEE Transactions on Multimedia 2023 (IEEE TMM 2023)

arXiv:2311.04261 [pdf, other]

Restoration of Analog Videos Using Swin-UNet

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo

Abstract: In this paper, we present a system to restore analog videos of historical archives. These videos often contain severe visual degradation due to the deterioration of their tape supports that require costly and slow manual interventions to recover the original content. The proposed system uses a multi-frame approach and is able to deal with severe tape mistracking, which results in completely scrambled frames. Tests on real-world videos from a major historical video archive show the effectiveness of our demo system. The code and the pre-trained model are publicly available at https://github.com/miccunifi/analog-video-restoration.

Submitted 7 November, 2023; originally announced November 2023.

Comments: ACM MM 2022 (Demo)

arXiv:2310.14926 [pdf, other]

Reference-based Restoration of Digitized Analog Videotapes

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo

Abstract: Analog magnetic tapes have been the main video data storage device for several decades. Videos stored on analog videotapes exhibit unique degradation patterns caused by tape aging and reader device malfunctioning that are different from those observed in film and digital video restoration tasks. In this work, we present a reference-based approach for the resToration of digitized Analog videotaPEs (TAPE). We leverage CLIP for zero-shot artifact detection to identify the cleanest frames of each video through textual prompts describing different artifacts. Then, we select the clean frames most similar to the input ones and employ them as references. We design a transformer-based Swin-UNet network that exploits both neighboring and reference frames via our Multi-Reference Spatial Feature Fusion (MRSFF) blocks. MRSFF blocks rely on cross-attention and attention pooling to take advantage of the most useful parts of each reference frame. To address the absence of ground truth in real-world videos, we create a synthetic dataset of videos exhibiting artifacts that closely resemble those commonly found in analog videotapes. Both quantitative and qualitative experiments show the effectiveness of our approach compared to other state-of-the-art methods. The code, the model, and the synthetic dataset are publicly available at https://github.com/miccunifi/TAPE.

Submitted 3 November, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: WACV2024

arXiv:2310.14918 [pdf, other]

ARNIQA: Learning Distortion Manifold for Image Quality Assessment

Authors: Lorenzo Agnolucci, Leonardo Galteri, Marco Bertini, Alberto Del Bimbo

Abstract: No-Reference Image Quality Assessment (NR-IQA) aims to develop methods to measure image quality in alignment with human perception without the need for a high-quality reference image. In this work, we propose a self-supervised approach named ARNIQA (leArning distoRtion maNifold for Image Quality Assessment) for modeling the image distortion manifold to obtain quality representations in an intrinsic manner. First, we introduce an image degradation model that randomly composes ordered sequences of consecutively applied distortions. In this way, we can synthetically degrade images with a large variety of degradation patterns. Second, we propose to train our model by maximizing the similarity between the representations of patches of different images distorted equally, despite varying content. Therefore, images degraded in the same manner correspond to neighboring positions within the distortion manifold. Finally, we map the image representations to the quality scores with a simple linear regressor, thus without fine-tuning the encoder weights. The experiments show that our approach achieves state-of-the-art performance on several datasets. In addition, ARNIQA demonstrates improved data efficiency, generalization capabilities, and robustness compared to competing methods. The code and the model are publicly available at https://github.com/miccunifi/ARNIQA.
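The idea of randomly composing an ordered sequence of distortions, then applying the same sequence to different images, can be sketched as follows. This is a toy illustration under stated assumptions: images are flat lists of intensities in [0, 1], the three distortions (`blur`, `darken`, `quantize`) and the maximum sequence length are invented for the example, and none of it reproduces the actual ARNIQA degradation model.

```python
import random


def blur(pixels):
    """Moving average over a 3-sample window (toy stand-in for blur)."""
    out = []
    for i in range(len(pixels)):
        window = pixels[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def darken(pixels):
    """Halve every intensity (toy brightness change)."""
    return [p * 0.5 for p in pixels]


def quantize(pixels):
    """Snap intensities to 5 levels (toy stand-in for compression)."""
    return [round(p * 4) / 4 for p in pixels]


# Hypothetical distortion bank; the real model uses its own set.
DISTORTIONS = {"blur": blur, "darken": darken, "quantize": quantize}


def sample_degradation(rng, max_len=3):
    """Pick a random ordered sequence of distinct distortion names."""
    length = rng.randint(1, max_len)
    return rng.sample(sorted(DISTORTIONS), k=length)


def apply_degradation(pixels, names):
    """Apply the named distortions consecutively, in the given order."""
    for name in names:
        pixels = DISTORTIONS[name](pixels)
    return pixels


rng = random.Random(0)
sequence = sample_degradation(rng)
# The same sequence applied to two different images: in a contrastive
# setup these two outputs would be treated as a positive pair.
image_a = apply_degradation([0.0, 0.5, 1.0, 0.5], sequence)
image_b = apply_degradation([0.9, 0.1, 0.3, 0.7], sequence)
```

Because the order matters (blurring then quantizing differs from quantizing then blurring), even a small distortion bank yields many distinct degradation patterns.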

Submitted 4 November, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: WACV2024

arXiv:2310.08368 [pdf, other]

Mapping Memes to Words for Multimodal Hateful Meme Classification

Authors: Giovanni Burbi, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo

Abstract: Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets. The code and the pre-trained models are publicly available at https://github.com/miccunifi/ISSUES.

Submitted 12 October, 2023; originally announced October 2023.

Comments: ICCV2023 CLVL Workshop

arXiv:2309.12110 [pdf, other]

doi 10.1007/978-3-031-20302-2_11

Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

Authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo

Abstract: Given the recent advances in multimodal image pretraining, where visual models trained with semantically dense textual supervision tend to have better generalization capabilities than those trained using categorical attributes or through unsupervised techniques, in this work we investigate how the recent CLIP model can be applied to several tasks in the artwork domain. We perform exhaustive experiments on the NoisyArt dataset, which is a dataset of artwork images crawled from public resources on the web. On this dataset CLIP achieves impressive results on (zero-shot) classification and promising results in both artwork-to-artwork and description-to-artwork domain.

Submitted 21 September, 2023; originally announced September 2023.

Comments: Proc. of Florence Heri-Tech 2022: The Future of Heritage Science and Technologies: ICT and Digital Heritage, 2022

arXiv:2309.05551 [pdf, other]

OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Authors: Giuseppe Cartella, Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

Abstract: The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined supervised learning approaches with low generalizability or more reusable CLIP-based techniques that were, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains, and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.

Submitted 11 September, 2023; originally announced September 2023.

Comments: International Conference on Image Analysis and Processing (ICIAP) 2023

arXiv:2308.11485 [pdf, other]

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo

Abstract: Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one that integrate the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
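The element-wise sum used as the query representation in the first stage, followed by similarity-based retrieval, can be sketched in a few lines. This is a minimal illustration only: the three-dimensional vectors stand in for real CLIP features, the helper names and the gallery contents are hypothetical, and the learned Combiner network of the second stage is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def combine_sum(image_feat, text_feat):
    """Query feature: normalized element-wise sum of the two modalities."""
    return l2_normalize([a + b for a, b in zip(image_feat, text_feat)])


def retrieve(query, gallery, top_k=2):
    """Rank gallery items by cosine similarity to the query."""
    scored = []
    for name, feat in gallery.items():
        similarity = sum(q * g for q, g in zip(query, l2_normalize(feat)))
        scored.append((similarity, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy features: axis 0 = content of the reference image,
# axis 1 = the change requested by the relative caption.
reference_image = [1.0, 0.0, 0.0]
relative_caption = [0.0, 1.0, 0.0]
gallery = {
    "target": [1.0, 1.0, 0.0],     # reference content plus requested change
    "ref_like": [1.0, 0.0, 0.0],   # looks like the reference, unchanged
    "unrelated": [0.0, 0.0, 1.0],
}
query = combine_sum(reference_image, relative_caption)
ranking = retrieve(query, gallery)
```

In this toy setting the summed query ranks the item that matches both the reference content and the caption above the item that matches the reference alone.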

Submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted in ACM Transactions on Multimedia Computing Communications and Applications (TOMM)

arXiv:2307.14063 [pdf, other]

ECO: Ensembling Context Optimization for Vision-Language Models

Authors: Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo

Abstract: Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves the results considerably and consistently compared with relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.

Submitted 26 July, 2023; originally announced July 2023.

arXiv:2306.01081 [pdf, other]

4DSR-GCN: 4D Video Point Cloud Upsampling using Graph Convolutional Networks

Authors: Lorenzo Berlincioni, Stefano Berretti, Marco Bertini, Alberto Del Bimbo

Abstract: Time-varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (e.g., LiDAR in autonomous or assisted driving). In many cases, such a volume of data is transmitted, thus requiring that proper compression tools are applied to either reduce the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. In consideration of the recent growing relevance of 3D applications, we focused on a model allowing user-side upscaling and artifact removal for 3D video point clouds. Our model consists of a specifically designed Graph Convolutional Network (GCN) that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. Taking inspiration from PointNet++, we present a different way to sample dense point clouds with the intent to make these modules work in synergy to provide each node enough features about its neighbourhood in order to later on generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model is capable of obtaining comparable results in terms of quality of the reconstruction, while using a substantially lower number of parameters (about 300KB), making our solution deployable in edge computing devices such as LiDAR.

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.13501 [pdf, other]

LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

Authors: Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models are publicly available at: https://github.com/miccunifi/ladi-vton.

Submitted 3 August, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: ACM Multimedia 2023

arXiv:2304.02051 [pdf, other]

Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Authors: Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia, Marco Bertini, Rita Cucchiara

Abstract: Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer.

Submitted 23 August, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

Comments: ICCV 2023

arXiv:2303.15335 [pdf, other]

Error assessment of microwave holography inversion for shallow buried objects

Authors: Emanuele Vivoli, Luca Bossi, Marco Bertini, Pierluigi Falorni, Lorenzo Capineri

Abstract: Holographic imaging is a technique that uses microwave energy to create a three-dimensional image of an object or scene. This technology has potential applications in land mine detection, as the long-wavelength microwave energy can penetrate the ground and create an image of hidden objects without the need for direct physical contact. However, the inversion algorithms commonly used to digitally reconstruct 3D images from holographic images, such as Convolution, Angular Spectrum, and Fresnel, are known to have limitations and can introduce errors in the reconstructed image. Despite these challenges, the use of holographic radar at around 2 GHz in combination with holographic imaging techniques for land mine detection makes it possible to recover the size and shape of buried objects. In this paper, we estimate the reconstruction error for the convolution algorithm based on hologram imaging simulation and assess these errors, recommending an increase in the scanner area in light of the system's limitations and the expected error reduction.

Submitted 27 March, 2023; originally announced March 2023.

Comments: accepted at IWA-GPR

arXiv:2303.15247 [pdf, other]

Zero-Shot Composed Image Retrieval with Textual Inversion

Authors: Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo

Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.

Submitted 19 August, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

Comments: ICCV2023

arXiv:2106.13603 [pdf, other]

Partially fake it till you make it: mixing real and fake thermal images for improved object detection

Authors: Francesco Bongini, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo

Abstract: In this paper we propose a novel data augmentation approach for visual content domains that have scarce training datasets, compositing synthetic 3D objects within real scenes. We show the performance of the proposed system in the context of object detection in thermal videos, a domain where 1) training datasets are very limited compared to visible spectrum datasets and 2) creating full realistic synthetic scenes is extremely cumbersome and expensive due to the difficulty in modeling the thermal properties of the materials of the scene. We compare different augmentation strategies, including state-of-the-art approaches obtained through RL techniques, the injection of simulated data and the employment of a generative model, and study how to best combine our proposed augmentation with these other techniques. Experimental results demonstrate the effectiveness of our approach, and our single-modality detector achieves state-of-the-art results on the FLIR ADAS dataset.

Submitted 25 June, 2021; originally announced June 2021.

arXiv:2102.02005 [pdf, other]

Robust pedestrian detection in thermal imagery using synthesized images

Authors: My Kieu, Lorenzo Berlincioni, Leonardo Galteri, Marco Bertini, Andrew D. Bagdanov, Alberto Del Bimbo

Abstract: In this paper we propose a method for improving pedestrian detection in the thermal domain using two stages: first, a generative data augmentation approach is used, then a domain adaptation method using generated data adapts an RGB pedestrian detector. Our model, based on the Least-Squares Generative Adversarial Network, is trained to synthesize realistic thermal versions of input RGB images which are then used to augment the limited amount of labeled thermal pedestrian images available for training. We apply our generative data augmentation strategy in order to adapt a pretrained YOLOv3 pedestrian detector to detection in the thermal-only domain. Experimental results demonstrate the effectiveness of our approach: using less than 50% of available real thermal training data, and relying on synthesized data generated by our model in the domain adaptation phase, our detector achieves state-of-the-art results on the KAIST Multispectral Pedestrian Detection Benchmark; even if more real thermal data is available, adding GAN-generated images to the training data results in improved performance, thus showing that these images act as an effective form of data augmentation. To the best of our knowledge, our detector achieves the best single-modality detection results on KAIST with respect to the state-of-the-art.

Submitted 3 February, 2021; originally announced February 2021.

Comments: Accepted at ICPR2020

arXiv:2008.12046 [pdf, other]

Inner Eye Canthus Localization for Human Body Temperature Screening

Authors: Claudio Ferrari, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo

Abstract: In this paper, we propose an automatic approach for localizing the inner eye canthus in thermal face images. We first coarsely detect 5 facial keypoints corresponding to the center of the eyes, the nose tip and the ears. Then we compute a sparse 2D-3D point correspondence using a 3D Morphable Face Model (3DMM). This correspondence is used to project the entire 3D face onto the image, and subsequently locate the inner eye canthus. Detecting this location makes it possible to obtain the most precise body temperature measurement for a person using a thermal camera. We evaluated the approach on a thermal face dataset provided with manually annotated landmarks. However, such manual annotations are normally conceived to identify facial parts such as eyes, nose and mouth, and are not specifically tailored for localizing the eye canthus region. As an additional contribution, we enrich the original dataset by using the annotated landmarks to deform and project the 3DMM onto the images. Then, by manually selecting a small region corresponding to the eye canthus, we enrich the dataset with additional annotations. By using the manual landmarks, we ensure the correctness of the 3DMM projection, which can be used as ground-truth for future evaluations. Moreover, we supply the dataset with the 3D head poses and per-point visibility masks for detecting self-occlusions. The data will be publicly released.

Submitted 27 August, 2020; originally announced August 2020.

arXiv:2004.09695 [pdf, other]

Image Retrieval using Multi-scale CNN Features Pooling

Authors: Federico Vaccaro, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo

Abstract: In this paper, we address the problem of image retrieval by learning an image representation based on the activations of a Convolutional Neural Network. We present an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on NetVLAD and a triplet mining procedure based on sample difficulty to obtain an effective image representation. Extensive experiments show that our approach is able to reach state-of-the-art results on three standard datasets.

Submitted 24 April, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

Comments: Accepted at ICMR 2020

arXiv:1812.04551 [pdf, ps, other]

Uniqueness of Segal quantization for oscillating systems

Authors: Massimo Bertini, Sergio Cacciatori, Manuel Falchi Perna

Abstract: We show that the Segal quantization of an arbitrary system of decoupled harmonic oscillators is unique in the sense that the one particle Hilbert space is completely determined by the requirements of being a naturally complex symplectic space carrying a unitary realization of the dynamical evolution of the considered system.

Submitted 11 December, 2018; originally announced December 2018.

Comments: 13 pages

arXiv:1708.04573 [pdf, ps, other]

Volume preserving flow by powers of symmetric polynomials in the principal curvatures

Authors: Maria Chiara Bertini, Carlo Sinestrari

Abstract: We study a volume preserving curvature flow of convex hypersurfaces, driven by a power of the $k$-th elementary symmetric polynomial in the principal curvatures. Unlike most of the previous works on related problems, we do not require assumptions on the curvature pinching of the initial datum. We prove that the solution exists for all times and that the speed remains bounded and converges to a constant in an integral norm. In the case of the volume preserving scalar curvature flow, we can prove that the hypersurfaces converge smoothly and exponentially fast to a round sphere.

Submitted 8 February, 2018; v1 submitted 15 August, 2017; originally announced August 2017.

Comments: 17 pages

arXiv:1704.02518 [pdf, other]

Deep Generative Adversarial Compression Artifact Removal

Authors: Leonardo Galteri, Lorenzo Seidenari, Marco Bertini, Alberto Del Bimbo

Abstract: Compression artifacts arise in images whenever a lossy compression algorithm is applied. These artifacts eliminate details present in the original image, or add noise and small structures; because of these effects they make images less pleasant for the human eye, and may also lead to decreased performance of computer vision algorithms such as object detectors. To eliminate such artifacts, when decompressing an image, it is required to recover the original image from a disturbed version. To this end, we present a feed-forward fully convolutional residual network model trained using a generative adversarial framework. To provide a baseline, we show that our model can also be trained optimizing the Structural Similarity (SSIM), which is a better loss with respect to the simpler Mean Squared Error (MSE). Our GAN is able to produce images with more photorealistic details than MSE or SSIM based networks. Moreover, we show that our approach can be used as a pre-processing step for object detection in case images are degraded by compression to a point that state-of-the-art detectors fail. In this task, our GAN method obtains better performance than MSE or SSIM trained networks.

Submitted 6 December, 2017; v1 submitted 8 April, 2017; originally announced April 2017.

Comments: ICCV 2017 Camera Ready + Acknowledgements

arXiv:1612.05288 [pdf, ps, other]

Volume preserving non homogeneous mean curvature flow in hyperbolic space

Authors: Maria Chiara Bertini, Giuseppe Pipoli

Abstract: We study a volume/area preserving curvature flow of hypersurfaces that are convex by horospheres in the hyperbolic space, with velocity given by a generic positive, increasing function of the mean curvature, not necessarily homogeneous. For this class of speeds we prove the exponential convergence to a geodesic sphere. The proof is inspired by [10] and is based on the preservation of convexity by horospheres, which allows us to bound the inner and outer radii and to give uniform bounds on the curvature by maximum principle arguments. In order to deduce the exponential trend, we study the behaviour of a suitable ratio associated to the hypersurface that converges exponentially in time to the value associated to a geodesic sphere.

Submitted 23 January, 2017; v1 submitted 15 December, 2016; originally announced December 2016.

Comments: 15 pages

MSC Class: 53C44; 35B40

arXiv:1610.07436 [pdf, ps, other]

Volume preserving non homogeneous mean curvature flow of convex hypersurfaces

Authors: Maria Chiara Bertini, Carlo Sinestrari

Abstract: We consider a convex Euclidean hypersurface that evolves by a volume or area preserving flow with speed given by a general nonhomogeneous function of the mean curvature. For a broad class of possible speed functions, we show that any closed convex hypersurface converges to a round sphere. The proof is based on the monotonicity of the isoperimetric ratio, which allows us to control the inner and outer radius of the hypersurface and to deduce uniform bounds on the curvature by maximum principle arguments.

Submitted 24 October, 2016; originally announced October 2016.

arXiv:1605.02892 [pdf, other]

Compact Hash Codes for Efficient Visual Descriptors Retrieval in Large Scale Databases

Authors: Simone Ercoli, Marco Bertini, Alberto Del Bimbo

Abstract: In this paper we present an efficient method for visual descriptors retrieval based on compact hash codes computed using a multiple k-means assignment. The method has been applied to the problem of approximate nearest neighbor (ANN) search of local and global visual content descriptors, and it has been tested on different datasets: three large scale public datasets of up to one billion descriptors (BIGANN) and, supported by recent progress in convolutional neural networks (CNNs), also on the CIFAR-10 and MNIST datasets. Experimental results show that, despite its simplicity, the proposed method obtains a very high performance that makes it superior to more complex state-of-the-art methods.

Submitted 10 May, 2016; originally announced May 2016.

arXiv:1605.00957 [pdf, other]

Bloom Filters and Compact Hash Codes for Efficient and Distributed Image Retrieval

Authors: Andrea Salvi, Simone Ercoli, Marco Bertini, Alberto Del Bimbo

Abstract: This paper presents a novel method for efficient image retrieval, based on a simple and effective hashing of CNN features and the use of an indexing structure based on Bloom filters. These filters are used as gatekeepers for the database of image features, making it possible to skip a query when the query features are not stored in the database and thus speeding up the query process without affecting retrieval performance. Thanks to the limited memory requirements, the system is suitable for mobile applications and distributed databases, associating each filter to a distributed portion of the database. Experimental validation has been performed on three standard image retrieval datasets, outperforming state-of-the-art hashing methods in terms of precision, while the proposed indexing method obtains a $2\times$ speedup.

Submitted 3 May, 2016; originally announced May 2016.

arXiv:1503.08248 [pdf, other]

doi 10.1145/2906152

Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval

Authors: Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G. M. Snoek, Alberto Del Bimbo

Abstract: Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems, i.e., image tag assignment, refinement, and tag-based image retrieval is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, i.e. estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this paper introduces a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations. For a head-to-head comparison between the state-of-the-art, a new experimental protocol is presented, with training sets containing 10k, 100k and 1m images and an evaluation on three test sets, contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future.

Submitted 23 March, 2016; v1 submitted 27 March, 2015; originally announced March 2015.

Comments: to appear in ACM Computing Surveys

ACM Class: H.3.1; H.3.3

Journal ref: ACM Computing Surveys, Volume 49 Issue 1, 14:1-14:39, June 2016

arXiv:1407.0623 [pdf, other]

doi 10.1016/j.cviu.2015.05.009

A Data-Driven Approach for Tag Refinement and Localization in Web Videos

Authors: Lamberto Ballan, Marco Bertini, Giuseppe Serra, Alberto Del Bimbo

Abstract: Tagging of visual content is becoming more and more widespread as web-based services and social networks have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of media collections, e.g. using tag clouds, or to retrieve multimedia content. However, not all media are equally tagged by users. With current systems it is easy to tag a single photo, and even tagging a part of a photo, like a face, has become common in sites like Flickr and Facebook. On the other hand, tagging a video sequence is more complicated and time consuming, so that users just tag the overall content of a video. In this paper we present a method for automatic video annotation that increases the number of tags originally provided by users, and localizes them temporally, associating tags to keyframes. Our approach exploits collective knowledge embedded in user-generated tags and web sources, and visual similarity of keyframes and images uploaded to social sites like YouTube and Flickr, as well as web sources like Google and Bing. Given a keyframe, our method is able to select on the fly from these visual sources the training exemplars that should be the most relevant for this test sample, and proceeds to transfer labels across similar images. Compared to existing video tagging approaches that require training classifiers for each tag, our system has few parameters, is easy to implement and can deal with an open vocabulary scenario. We demonstrate the approach on tag refinement and localization on DUT-WEBV, a large dataset of web videos, and show state-of-the-art results.

Submitted 28 May, 2015; v1 submitted 2 July, 2014; originally announced July 2014.

Comments: Preprint submitted to Computer Vision and Image Understanding (CVIU)

arXiv:math-ph/0609017 [pdf, ps, other]

doi 10.1088/0305-4470/39/49/007

Dynamics and Lax-Phillips scattering for generalized Lamb models

Authors: M. Bertini, D. Noja, A. Posilicano

Abstract: This paper treats the dynamics and scattering of a model of coupled oscillating systems, a finite dimensional one and a wave field on the half line. The coupling is realized producing the family of self-adjoint extensions of the suitably restricted self-adjoint operator describing the uncoupled dynamics. The spectral theory of the family is studied and the associated quadratic forms constructed. The dynamics turns out to be Hamiltonian and the Hamiltonian is described, including the case in which the finite dimensional system comprises nonlinear oscillators; in this case the dynamics is shown to exist as well. In the linear case the system is equivalent, on a dense subspace, to a wave equation on the half line with higher order boundary conditions, described by a differential polynomial $p(\partial_x)$ explicitly related to the model parameters. In terms of such structure the Lax-Phillips scattering of the system is studied. In particular we determine the incoming and outgoing translation representations, the scattering operator, which turns out to be unitarily equivalent to the multiplication operator given by the rational function $-p(i\kappa)^*/p(i\kappa)$, and the Lax-Phillips semigroup, which describes the evolution of the states which are neither incoming in the past nor outgoing in the future.

Submitted 27 November, 2006; v1 submitted 7 September, 2006; originally announced September 2006.

Journal ref: J. Phys. A: Math. Gen. 39 (2006), 15173-15195

arXiv:math-ph/0407002 [pdf, ps, other]

doi 10.1063/1.2009607

Rigorous Dynamics and Radiation Theory for a Pauli-Fierz Model in the Ultraviolet Limit

Authors: Massimo Bertini, Diego Noja, Andrea Posilicano

Abstract: The present paper is devoted to the detailed study of quantization and evolution of the point limit of the Pauli-Fierz model for a charged oscillator interacting with the electromagnetic field in dipole approximation. In particular, a well defined dynamics is constructed for the classical model, which is subsequently quantized according to the Segal scheme. To this end, the classical model in the point limit is reformulated as a second order abstract wave equation, and a consistent quantum evolution is given. This allows a study of the behaviour of the survival and transition amplitudes for the process of decay of the excited states of the charged particle, and the emission of photons in the decay process. In particular, for the survival amplitude the exact time behaviour is found. This is completely determined by the resonances of the system plus a tail term prevailing in the asymptotic, long time regime. Moreover, the survival amplitude exhibits in a fairly clear way the Lamb shift correction to the unperturbed frequencies of the oscillator.

Submitted 5 July, 2005; v1 submitted 2 July, 2004; originally announced July 2004.

Comments: Shortened version. To appear in J. Math. Phys

Journal ref: J.Math.Phys. 46 (2005) 102305

arXiv:math-ph/0010047 [pdf, ps, other]

doi 10.1063/1.1360194

Wave Equations with Point Interactions in Finite Energy Space

Authors: Massimo Bertini, Diego Noja, Andrea Posilicano

Abstract: Given the abstract wave equation $\ddotφ-Δ_αφ=0$, where $Δ_α$ is the Laplace operator with a point interaction of strength $α$, we define and study $\bar W_α$, the associated wave generator in the phase space of finite energy states. We prove the existence of the phase flow generated by $\bar W_α$, and describe its most relevant properties with particular emphasis on the associated symplectic structure and scattering theory.

Submitted 23 February, 2001; v1 submitted 27 October, 2000; originally announced October 2000.

Comments: misprints corrected. 26 pages. To appear in J. Math. Phys

arXiv:hep-ph/0006152 [pdf, ps, other]

doi 10.1016/S0010-4655(00)00206-X

Pythia version 7-0.0 - a proof-of-concept version

Authors: Marc Bertini, Leif Lonnblad, Torbjorn Sjostrand

Abstract: This document describes the first proof-of-concept version of the Pythia7 program. Pythia7 is a complete re-write of the Pythia program in C++. It is mainly intended to be a replacement for the `Lund' family of event generators, but is also a toolkit with a structure suitable for implementing any event generator model. In this document, the structure of the program is presented both from the user and the developer point of view. It is not intended to be a complete manual, but together with the documentation provided in the distribution, it should be sufficient to start working with the program.

Submitted 14 June, 2000; originally announced June 2000.

Comments: 39 pages, 3 figures

Journal ref: Comput.Phys.Commun.134:365-391,2001

arXiv:hep-ph/9904205 [pdf, ps, other]

doi 10.1007/s100520050652

Off-diagonal helicity density matrix elements for vector mesons produced in polarized e+e- processes

Authors: M. Anselmino, M. Bertini, F. Caruso, F. Murgia, P. Quintairos

Abstract: Final state quark-antiquark interactions give origin to non zero values of the off-diagonal element rho_{1,-1} of the helicity density matrix of vector mesons produced in e+e- annihilations, as confirmed by recent OPAL data on Phi, D^* and K^*'s. New predictions are given for rho_{1,-1} of several mesons produced at large x_E and small p_T -- i.e. collinear with the parent jet -- in the annihilation of polarized e+ and e-; the results depend strongly on the elementary dynamics and allow further non trivial tests of the Standard Model.

Submitted 1 April, 1999; originally announced April 1999.

Comments: LaTeX, 20 pages, 6 ps figures, uses epsfig.sty

Report number: DFTT 19/99, INFNCA-TH9905

Journal ref: Eur.Phys.J.C11:529-537,1999

arXiv:hep-ph/9810226 [pdf, ps, other]

Quark fragmentation and off-diagonal helicity density matrix elements for vector meson production

Authors: M. Anselmino, M. Bertini, F. Murgia, P. Quintairos

Abstract: As confirmed by some recent LEP data on phi, K^* and D^*0 production, final state interactions in quark fragmentation may give origin to non-zero values of the off-diagonal element rho_{1,-1} of the helicity density matrix of vector mesons produced in e^+ e^- annihilations: we give estimates for rho_{1,-1} of vector mesons with a large x_E and collinear with the parent jet, relating its size and sign to the associated hard constituent dynamics. We mention possible non-zero values of rho_{1,-1} in several other processes.

Submitted 2 October, 1998; originally announced October 1998.

Comments: 3 pages, LaTeX, no figures, uses sprocl.sty. Talk delivered by F. Murgia at the XIII International Symposium on High Energy Spin Physics (SPIN98), September 8-12, 1998, Protvino, Russia

Report number: INFNCA-TH9812, DFTT 59/98

arXiv:hep-ph/9806456 [pdf, ps, other]

doi 10.1016/S0370-2693(98)01237-4

Diffraction in Charged Current DIS

Authors: M. Bertini, M. Genovese, N. N. Nikolaev, B. G. Zakharov

Abstract: We present the QCD calculation of the diffractive structure function for charged current DIS. In particular we analyse the perturbatively tractable excitation of heavy quarks. We emphasize the peculiarities of the Regge factorization breaking in excitation of open charm.

Submitted 23 June, 1998; originally announced June 1998.

Comments: 16 pages LaTeX, 5 eps figures included

Report number: DFTT 33/97

Journal ref: Phys.Lett. B442 (1998) 398-406

arXiv:hep-ph/9805234 [pdf, ps, other]

doi 10.1016/S0370-2693(98)00978-2

Off-diagonal helicity density matrix elements for heavy vector mesons inclusively produced in N-N, gamma-N, l-N interactions

Authors: M. Anselmino, M. Bertini, F. Murgia, B. Pire

Abstract: Final state interactions in quark fragmentation may give origin to non zero values of the off-diagonal element rho_(1,-1) of the helicity density matrix of vector mesons V produced in current jets, with a large energy fraction x_E; the value of rho_(1,-1)(V) is related to the hard constituent dynamics and tests unusual properties of it. Some recent data on phi, K^* and D^* produced in e^+ e^- annihilations at LEP show such effects. Predictions are given here for rho_(1,-1) of heavy mesons produced in nucleon-nucleon, gamma-nucleon and lepton-nucleon interactions.

Submitted 3 August, 1998; v1 submitted 6 May, 1998; originally announced May 1998.

Comments: LaTeX, 10 pages, 1 postscript figure, uses epsfig.sty. Revised version, to be published in Phys. Lett. B. Some statements added to clarify text

Report number: DFTT 16/98, CPhT-S610-0498, INFNCA-TH9804

Journal ref: Phys.Lett. B438 (1998) 347-352

arXiv:hep-ph/9803423 [pdf, ps, other]

Charged Current Diffractive Structure Functions

Authors: M. Bertini, M. Genovese, N. N. Nikolaev, B. G. Zakharov

Abstract: We present our study of the diffraction in charged current DIS. We analyse the perturbatively tractable excitation of heavy quarks, emphasizing the peculiarities of the Regge factorization breaking in excitation of open charm.

Submitted 23 March, 1998; originally announced March 1998.

Comments: Proceeding of LISHEP98 workshop on diffractive physics

arXiv:hep-ph/9710547 [pdf, ps, other]

doi 10.1016/S0370-2693(98)00036-7

Twist-4 effects and $Q^{2}$ dependence of diffractive DIS

Authors: M. Bertini, M. Genovese, N. N. Nikolaev, A. V. Pronyaev, B. G. Zakharov

Abstract: In this letter we report the direct perturbative QCD evaluation of twist-4 effects in diffractive DIS. They are large and have a strong impact on the $Q^2$ dependence of diffractive structure functions at large $β$. Based on the AGK rules, we comment on the possible contribution from diffractive higher twists to $\propto {1 \over Q^{2}}$ corrections to proton structure function at small $x$. These corrections to the longitudinal structure function $F_{L}$ may be particularly large.

Submitted 30 October, 1997; originally announced October 1997.

Comments: 13 pages LaTeX including 4 PS figures

Report number: ISN 97.99, DFTT 46/97

Journal ref: Phys.Lett. B422 (1998) 238-246

arXiv:hep-ph/9710505 [pdf, ps, other]

doi 10.1016/S0370-2693(98)00351-7

Quark fragmentation into vector and pseudoscalar mesons at LEP

Authors: M. Anselmino, M. Bertini, C. Burgard, F. Caruso, P. Quintairos

Abstract: Some data on the ratio of vector to vector + pseudoscalar mesons, V/(V+P), and the probability of helicity zero vector states, rho_00, are now available from LEP. A possible relation between these two quantities and their interpretation in terms of polarized fragmentation functions are discussed; numerical estimates are given for the relative occupancies of K and K*, D and D*, B and B* states.

Submitted 28 October, 1997; originally announced October 1997.

Comments: 5 pages, no figures

Report number: CERN-PPE/97-136

Journal ref: Phys.Lett. B427 (1998) 356-360

arXiv:hep-ph/9704420 [pdf, ps, other]

doi 10.1007/s100520050159

Off-diagonal Helicity Density Matrix Elements for Vector Mesons Produced at LEP

Authors: Mauro Anselmino, Marc Bertini, Francesco Murgia, Paulo Quintairos

Abstract: Final state quark-antiquark interactions may give origin to non zero values of the off-diagonal element rho_{1,-1} of the helicity density matrix of vector mesons produced in e+e- annihilations, as confirmed by recent OPAL data on phi and D^*'s. Predictions are given for rho_{1,-1} of several mesons produced at large z and small p_T, i.e. collinear with the parent jet; the values obtained for phi and D^* are in agreement with data.

Submitted 28 April, 1997; originally announced April 1997.

Comments: LaTeX, 12 pages, no figures

Report number: INFNCA-TH9707, DFTT 25/97

Journal ref: Eur.Phys.J.C2:539-544,1998

arXiv:hep-ph/9511425 [pdf, ps, other]

doi 10.1007/BF02743817

The Pomeron in Elastic and Deep Inelastic Scattering

Authors: M. Bertini, M. Giffon, L. Jenkovszky, F. Paccanoni, E. Predazzi

Abstract: We discuss some properties of the Pomeron in high energy elastic hadron-hadron and deep inelastic lepton-hadron scattering. A number of issues concerning the nature and the origin of the Pomeron are briefly recalled here. The novelty in this paper resides essentially in its presentation; we strive at discussing all these various issues in the following unifying perspective: it is our contention that the Pomeron is one and the same in all reactions. Various examples will be provided illustrating why we do not believe that one should invoke additional tools to describe the data. For pedagogical convenience, we list below the topics to be covered in the following. -- 1. Introduction. How many Pomerons? -- 2. The Pomeron in the $S$-matrix theory -- 3. The Pomeron in QCD -- 4. The Pomeron in deep inelastic scattering -- 5. The Pomeron structure -- 6. (Temporary?) Conclusions

Submitted 27 November, 1995; originally announced November 1995.

Comments: 32 pages in TeX; 27 figures (available on request from [email protected])

Report number: LYCEN/95-37

Journal ref: Riv.Nuovo Cim. 19N1 (1996) 1-37

arXiv:hep-ph/9501254 [pdf, ps, other]

doi 10.1016/0370-2693(95)00245-G

Do we need two Pomerons?

Authors: M. Bertini, M. Giffon, E. Predazzi

Abstract: We show that one single Pomeron compatible with the Froissart limit, can account for all the present HERA data.

Submitted 1 February, 1995; v1 submitted 10 January, 1995; originally announced January 1995.

Comments: The title has been modified and some references added

Report number: LYCEN 9504

Journal ref: Phys.Lett. B349 (1995) 561-568

Showing 1–48 of 48 results for author: Bertini, M