
Learnable Data Augmentation for One-Shot Unsupervised Domain Adaptation

Authors: Julio Ivan Davila Carrazco, Pietro Morerio, Alessio Del Bue, Vittorio Murino

Abstract: This paper presents a classification framework based on learnable data augmentation to tackle the One-Shot Unsupervised Domain Adaptation (OS-UDA) problem. OS-UDA is the most challenging setting in Domain Adaptation, as only a single unlabeled target sample is assumed to be available for model adaptation. Driven by this single sample, our method LearnAug-UDA learns how to augment source data, making it perceptually similar to the target. As a result, a classifier trained on such augmented data will generalize well to the target domain. To achieve this, we designed an encoder-decoder architecture that exploits a perceptual loss and style transfer strategies to augment the source data. Our method achieves state-of-the-art performance on two well-known Domain Adaptation benchmarks, DomainNet and VisDA. The project code is available at https://github.com/IIT-PAVIS/LearnAug-UDA

Submitted 3 October, 2023; originally announced October 2023.

Comments: Accepted to The 34th British Machine Vision Conference (BMVC 2023)

arXiv:2308.08303

Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos

Authors: Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue

Abstract: Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of Short-Term Object interaction anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Transformer), a multi-modal end-to-end transformer network that attends to objects in observed frames in order to anticipate the next-active-object (NAO) and, eventually, to guide the model to predict context-aware future actions. The task is challenging since it requires anticipating the future action along with the object with which the action occurs and the time after which the interaction will begin, a.k.a. the time to contact (TTC). Compared to existing video modeling architectures for action anticipation, NAOGAT captures the relationship between objects and the global scene context in order to predict detections for the next active object and anticipate relevant future actions given these detections, leveraging the objects' dynamics to improve accuracy. One of the key strengths of our approach, in fact, is its ability to exploit the motion dynamics of objects within a given clip, which is often ignored by other models, and to separately decode the object-centric and motion-centric information.
Through our experiments, we show that our model outperforms existing methods on two separate datasets, Ego4D and EpicKitchens-100 ("Unseen Set"), as measured by several additional metrics, such as time to contact and next-active-object localization. The code will be available upon acceptance.

Submitted 5 October, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

Comments: Accepted in WACV'24

arXiv:2305.16066

Guided Attention for Next Active Object @ EGO4D STA Challenge

Authors: Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue

Abstract: In this technical report, we describe the Guided-Attention mechanism based solution for the short-term anticipation (STA) challenge for the EGO4D challenge. It combines the object detections and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decoding the object-centric and motion-centric information to address the problem of STA in egocentric videos. For the challenge, we build our model on top of StillFast with Guided Attention applied to the fast network. Our model obtains better performance on the validation set and also achieves state-of-the-art (SOTA) results on the challenge test set for EGO4D Short-Term Object Interaction Anticipation Challenge.

Submitted 4 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Winner of CVPR@2023 Ego4D STA challenge. arXiv admin note: substantial text overlap with arXiv:2305.12953

arXiv:2305.12953

Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention

Authors: Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue

Abstract: Short-term action anticipation (STA) in first-person videos is a challenging task that involves understanding the next active object interactions and predicting future actions. Existing action anticipation methods have primarily focused on utilizing features extracted from video clips, but often overlooked the importance of objects and their interactions. To this end, we propose a novel approach that applies a guided attention mechanism between the objects and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and further decoding the object-centric and motion-centric information to address the problem of STA in egocentric videos. Our method, GANO (Guided Attention for Next active Objects), is a multi-modal, end-to-end, single transformer-based network. The experimental results performed on the largest egocentric dataset demonstrate that GANO outperforms the existing state-of-the-art methods for the prediction of the next active object label, its bounding box location, the corresponding future action, and the time to contact the object. The ablation study shows the positive contribution of the guided attention mechanism compared to other fusion methods. Moreover, it is possible to improve the next active object location and class label prediction results of GANO by just appending the learnable object tokens with the region of interest embeddings.

Submitted 23 June, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to IEEE ICIP 2023, see project page here: https://sanketsans.github.io/guided-attention-egocentric.html

arXiv:2305.04628

Target-driven One-Shot Unsupervised Domain Adaptation

Authors: Julio Ivan Davila Carrazco, Suvarna Kishorkumar Kadam, Pietro Morerio, Alessio Del Bue, Vittorio Murino

Abstract: In this paper, we introduce a novel framework for the challenging problem of One-Shot Unsupervised Domain Adaptation (OSUDA), which aims to adapt to a target domain with only a single unlabeled target sample. Unlike existing approaches that rely on large labeled source and unlabeled target data, our Target-driven One-Shot UDA (TOS-UDA) approach employs a learnable augmentation strategy guided by the target sample's style to align the source distribution with the target distribution. Our method consists of three modules: an augmentation module, a style alignment module, and a classifier. Unlike existing methods, our augmentation module allows for strong transformations of the source samples, and the style of the single target sample available is exploited to guide the augmentation by ensuring perceptual similarity. Furthermore, our approach integrates augmentation with style alignment, eliminating the need for separate pre-training on additional datasets. Our method outperforms or performs comparably to existing OS-UDA methods on the Digits and DomainNet benchmarks.

Submitted 17 July, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted to 22nd International Conference on IMAGE ANALYSIS AND PROCESSING (ICIAP) 2023

Journal ref: 22nd International Conference on IMAGE ANALYSIS AND PROCESSING (ICIAP) 2023

arXiv:2304.07374

Continual Source-Free Unsupervised Domain Adaptation

Authors: Waqar Ahmed, Pietro Morerio, Vittorio Murino

Abstract: Existing Source-free Unsupervised Domain Adaptation (SUDA) approaches inherently exhibit catastrophic forgetting. Typically, models trained on a labeled source domain and adapted to unlabeled target data improve performance on the target while dropping performance on the source, which is not available during adaptation. In this study, our goal is to cope with the challenging problem of SUDA in a continual learning setting, i.e., adapting to the target(s) with varying distributional shifts while maintaining performance on the source. The proposed framework consists of two main stages: i) a SUDA model yielding cleaner target labels -- favoring good performance on target, and ii) a novel method for synthesizing class-conditioned source-style images by leveraging only the source model and pseudo-labeled target data as a prior. An extensive pool of experiments on major benchmarks, e.g., PACS, Visda-C, and DomainNet demonstrates that the proposed Continual SUDA (C-SUDA) framework enables preserving satisfactory performance on the source domain without exploiting the source data at all.

Submitted 14 April, 2023; originally announced April 2023.

Comments: Accepted at International Conference on Image Analysis and Processing, 2023

arXiv:2302.06358

Anticipating Next Active Objects for Egocentric Videos

Authors: Sanket Thakur, Cigdem Beyan, Pietro Morerio, Vittorio Murino, Alessio Del Bue

Abstract: This paper addresses the problem of anticipating the next-active-object location in the future, for a given egocentric video clip where the contact might happen, before any action takes place. The problem is considerably hard, as we aim at estimating the position of such objects in a scenario where the observed clip and the action segment are separated by the so-called "time to contact" (TTC) segment. Many methods have been proposed to anticipate the action of a person based on previous hand movements and interactions with the surroundings. However, there have been no attempts to investigate the next possible interactable object, and its future location with respect to the first-person's motion and the field-of-view drift during the TTC window. We define this as the task of Anticipating the Next ACTive Object (ANACTO). To this end, we propose a transformer-based self-attention framework to identify and locate the next-active-object in an egocentric clip. We benchmark our method on three datasets: EpicKitchens-100, EGTEA+ and Ego4D. We also provide annotations for the first two datasets. Our approach performs best compared to relevant baseline methods. We also conduct ablation studies to understand the effectiveness of the proposed and baseline methods on varying conditions. Code and ANACTO task annotations will be made available upon paper acceptance.

Submitted 1 May, 2024; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: Accepted by IEEE ACCESS, this paper carries the Manuscript DOI: 10.1109/ACCESS.2024.3395282. The complete peer-reviewed version is available via this DOI, while the arXiv version is a post-author manuscript without peer-review

arXiv:2207.12842

Unsupervised Domain Adaptation for Video Transformers in Action Recognition

Authors: Victor G. Turrisi da Costa, Giacomo Zara, Paolo Rota, Thiago Oliveira-Santos, Nicu Sebe, Vittorio Murino, Elisa Ricci

Abstract: Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, when compared to the extensive literature available for images, the field of videos is still relatively unexplored. On the other hand, the performance of a model in action recognition is heavily affected by domain shift. In this paper, we propose a simple and novel UDA approach for video action recognition. Our approach leverages recent advances on spatio-temporal transformers to build a robust source model that better generalises to the target domain. Furthermore, our architecture learns domain invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle. We report results on two video action recognition benchmarks for UDA, showing state-of-the-art performance on HMDB$\leftrightarrow$UCF, as well as on Kinetics$\rightarrow$NEC-Drone, which is more challenging. This demonstrates the effectiveness of our method in handling different levels of domain shift. The source code is available at https://github.com/vturrisi/UDAVT.

Submitted 26 July, 2022; originally announced July 2022.

Comments: Accepted at ICPR 2022

arXiv:2104.09191

Compact CNN Structure Learning by Knowledge Distillation

Authors: Waqar Ahmed, Andrea Zunino, Pietro Morerio, Vittorio Murino

Abstract: The concept of compressing deep Convolutional Neural Networks (CNNs) is essential to use limited computation, power, and memory resources on embedded devices. However, existing methods achieve this objective at the cost of a drop in inference accuracy in computer vision tasks. To address such a drawback, we propose a framework that leverages knowledge distillation along with customizable block-wise optimization to learn a lightweight CNN structure while preserving better control over the compression-performance tradeoff. Considering specific resource constraints, e.g., floating-point operations per inference (FLOPs) or model-parameters, our method results in state-of-the-art network compression while being capable of achieving better inference accuracy. In a comprehensive evaluation, we demonstrate that our method is effective, robust, and consistent with results over a variety of network architectures and datasets, at negligible training overhead. In particular, for the already compact network MobileNet_v2, our method offers up to 2x and 5.2x better model compression in terms of FLOPs and model-parameters, respectively, while getting 1.05% better model performance than the baseline network.

Submitted 19 April, 2021; originally announced April 2021.

Comments: This paper has been accepted to ICPR 2020

arXiv:2103.15973

Adaptive Pseudo-Label Refinement by Negative Ensemble Learning for Source-Free Unsupervised Domain Adaptation

Authors: Waqar Ahmed, Pietro Morerio, Vittorio Murino

Abstract: The majority of existing Unsupervised Domain Adaptation (UDA) methods presumes source and target domain data to be simultaneously available during training. Such an assumption may not hold in practice, as source data is often inaccessible (e.g., due to privacy reasons). On the contrary, a pre-trained source model is always considered to be available, even though performing poorly on target due to the well-known domain shift problem. This translates into a significant amount of misclassifications, which can be interpreted as structured noise affecting the inferred target pseudo-labels. In this work, we cast UDA as a pseudo-label refinery problem in the challenging source-free scenario. We propose a unified method to tackle adaptive noise filtering and pseudo-label refinement. A novel Negative Ensemble Learning technique is devised to specifically address noise in pseudo-labels, by enhancing diversity in ensemble members with different stochastic (i) input augmentation and (ii) feedback. In particular, the latter is achieved by leveraging the novel concept of Disjoint Residual Labels, which allow diverse information to be fed to the different members. A single target model is eventually trained with the refined pseudo-labels, which leads to a robust performance on the target domain. Extensive experiments show that the proposed method, named Adaptive Pseudo-Label Refinement, achieves state-of-the-art performance on major UDA benchmarks, such as Digit5, PACS, Visda-C, and DomainNet, without using source data at all.

Submitted 29 March, 2021; originally announced March 2021.

arXiv:2103.12437

Learning without Seeing nor Knowing: Towards Open Zero-Shot Learning

Authors: Federico Marmoreo, Julio Ivan Davila Carrazco, Vittorio Murino, Jacopo Cavazza

Abstract: In Generalized Zero-Shot Learning (GZSL), unseen categories (for which no visual data are available at training time) can be predicted by leveraging their class embeddings (e.g., a list of attributes describing them) together with a complementary pool of seen classes (paired with both visual data and class embeddings). Although GZSL is arguably challenging, we posit that knowing in advance the class embeddings, especially for unseen categories, is an actual limit of the applicability of GZSL towards real-world scenarios. To relax this assumption, we propose Open Zero-Shot Learning (OZSL) to extend GZSL towards the open-world settings. We formalize OZSL as the problem of recognizing seen and unseen classes (as in GZSL) while also rejecting instances from unknown categories, for which neither visual data nor class embeddings are provided. We formalize the OZSL problem introducing evaluation protocols, error metrics and benchmark datasets. We also suggest tackling the OZSL problem by performing unknown feature generation (instead of only unseen feature generation as done in GZSL). We achieve this by optimizing a generative process to sample unknown class embeddings as complementary to the seen and the unseen. We intend these results to be the ground to foster future research, extending the standard closed-world zero-shot learning (GZSL) with the novel open-world counterpart (OZSL).

Submitted 14 September, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

arXiv:2102.03266

Transductive Zero-Shot Learning by Decoupled Feature Generation

Authors: Federico Marmoreo, Jacopo Cavazza, Vittorio Murino

Abstract: In this paper, we address zero-shot learning (ZSL), the problem of recognizing categories for which no labeled visual data are available during training. We focus on the transductive setting, in which unlabelled visual data from unseen classes is available. State-of-the-art paradigms in ZSL typically exploit generative adversarial networks to synthesize visual features from semantic attributes. We posit that the main limitation of these approaches is to adopt a single model to face two problems: 1) generating realistic visual features, and 2) translating semantic attributes into visual cues. Differently, we propose to decouple such tasks, solving them separately. In particular, we train an unconditional generator to solely capture the complexity of the distribution of visual data and we subsequently pair it with a conditional generator devoted to enriching the prior knowledge of the data distribution with the semantic content of the class embeddings. We present a detailed ablation study to dissect the effect of our proposed decoupling approach, while demonstrating its superiority over the related state-of-the-art.

Submitted 14 September, 2021; v1 submitted 5 February, 2021; originally announced February 2021.

Comments: Published at the IEEE/CVF Winter Conference on Computer Vision (WACV) 2021

arXiv:2010.09557

A Versatile Crack Inspection Portable System based on Classifier Ensemble and Controlled Illumination

Authors: Milind G. Padalkar, Carlos Beltrán-González, Matteo Bustreo, Alessio Del Bue, Vittorio Murino

Abstract: This paper presents a novel setup for automatic visual inspection of cracks in ceramic tile and studies the effect of various classifiers and height-varying illumination conditions for this task. The intuition behind this setup is that cracks can be better visualized under specific lighting conditions than others. Our setup, which is designed for field work with constraints in its maximum dimensions, can acquire images for crack detection with multiple lighting conditions using the illumination sources placed at multiple heights. Crack detection is then performed by classifying patches extracted from the acquired images in a sliding window fashion. We study the effect of lights placed at various heights by training classifiers both on customized as well as state-of-the-art architectures and evaluate their performance both at patch-level and image-level, demonstrating the effectiveness of our setup. More importantly, ours is the first study that demonstrates how height-varying illumination conditions can affect crack detection with the use of existing state-of-the-art classifiers. We provide an insight about the illumination conditions that can help in improving crack detection in a challenging real-world industrial environment.

Submitted 19 October, 2020; originally announced October 2020.

Comments: Accepted in ICPR 2020

arXiv:2010.07906

DSLib: An open source library for the dominant set clustering method

Authors: Sebastiano Vascon, Samuel Rota Bulò, Vittorio Murino, Marcello Pelillo

Abstract: DSLib is an open-source implementation of the Dominant Set (DS) clustering algorithm written entirely in Matlab. The DS method is a graph-based clustering technique rooted in evolutionary game theory that is starting to gain a lot of interest in the computer science community. Thanks to its duality with game theory and its strict relation to the notion of maximal clique, it has been explored in several directions not only related to clustering problems. Applications in graph matching, segmentation, classification and medical imaging are common in the literature. This package provides an implementation of the original DS clustering algorithm, since no code has been officially released yet, together with a still growing collection of methods and variants related to it. Our library is integrable into a Matlab pipeline without dependencies, it is simple to use and easily extendable for upcoming works. The latest source code, the documentation and some examples can be downloaded from https://xwasco.github.io/DominantSetLibrary.

Submitted 15 October, 2020; originally announced October 2020.

arXiv:2005.04813

The Visual Social Distancing Problem

Authors: Marco Cristani, Alessio Del Bue, Vittorio Murino, Francesco Setti, Alessandro Vinciarelli

Abstract: One of the main and most effective measures to contain the recent viral outbreak is the maintenance of the so-called Social Distancing (SD). To comply with this constraint, workplaces, public institutions, transports and schools will likely adopt restrictions over the minimum inter-personal distance between people. Given this actual scenario, it is crucial to massively measure the compliance to such physical constraint in our life, in order to figure out the reasons of the possible breaks of such distance limitations, and understand if this implies a possible threat given the scene context. All of this must be done while complying with privacy policies and making the measurement acceptable. To this end, we introduce the Visual Social Distancing (VSD) problem, defined as the automatic estimation of the inter-personal distance from an image, and the characterization of the related people aggregations. VSD is pivotal for a non-invasive analysis of whether people comply with the SD restriction, and to provide statistics about the level of safety of specific areas whenever this constraint is violated. We then discuss how VSD relates with previous literature in Social Signal Processing and indicate which existing Computer Vision methods can be used to manage such problem. We conclude with future challenges related to the effectiveness of VSD systems, ethical implications and future application scenarios.

Submitted 10 May, 2020; originally announced May 2020.

Comments: 9 pages, 5 figures. All the authors equally contributed to this manuscript and they are listed by alphabetical order. Under submission

arXiv:2004.09374

Complex-Object Visual Inspection via Multiple Lighting Configurations

Authors: Maya Aghaei, Matteo Bustreo, Pietro Morerio, Nicolo Carissimi, Alessio Del Bue, Vittorio Murino

Abstract: The design of an automatic visual inspection system is usually performed in two stages. While the first stage consists in selecting the most suitable hardware setup for highlighting most effectively the defects on the surface to be inspected, the second stage concerns the development of algorithmic solutions to exploit the potentials offered by the collected data. In this paper, first, we present a novel illumination setup embedding four illumination configurations to resemble diffused, dark-field, and front lighting techniques. Second, we analyze the contributions brought by deploying the proposed setup in training phase only - mimicking the scenario in which an already developed visual inspection system cannot be modified on the customer site - and in evaluation phase. Along with an exhaustive set of experiments, in this paper, we demonstrate the suitability of the proposed setup for effective illumination of complex-objects, defined as manufactured items with variable surface characteristics that cannot be determined a priori. Moreover, we discuss the importance of multiple light configurations availability during training and their natural boosting effect which, without the need to modify the system design in evaluation phase, lead to improvements in the overall system performance.

Submitted 20 April, 2020; originally announced April 2020.

Comments: 8 pages, 7 figures, submitted to ICPR2020

arXiv:2004.08270 [pdf, other]

Weakly Supervised Geodesic Segmentation of Egyptian Mummy CT Scans

Authors: Avik Hati, Matteo Bustreo, Diego Sona, Vittorio Murino, Alessio Del Bue

Abstract: In this paper, we tackle the task of automatically analyzing 3D volumetric scans obtained from computed tomography (CT) devices. In particular, we address a particular task for which data is very limited: the segmentation of ancient Egyptian mummies CT scans. We aim at digitally unwrapping the mummy and identifying different segments such as body, bandages and jewelry. The problem is complex because of the lack of annotated data for the different semantic regions to segment, thus discouraging the use of strongly supervised approaches. We, therefore, propose a weakly supervised and efficient interactive segmentation method to solve this challenging problem. After segmenting the wrapped mummy from its exterior region using histogram analysis and template matching, we first design a voxel distance measure to find an approximate solution for the body and bandage segments. Here, we use geodesic distances since voxel features as well as spatial relationship among voxels is incorporated in this measure. Next, we refine the solution using a GrabCut based segmentation together with a tracking method on the slices of the scan that assigns labels to different regions in the volume, using limited supervision in the form of scribbles drawn by the user. The efficiency of the proposed method is demonstrated using visualizations and validated through quantitative measures and qualitative unwrapping of the mummy.

Submitted 17 April, 2020; originally announced April 2020.

arXiv:2003.06498 [pdf, other]

Explainable Deep Classification Models for Domain Generalization

Authors: Andrea Zunino, Sarah Adel Bargal, Riccardo Volpi, Mehrnoosh Sameki, Jianming Zhang, Stan Sclaroff, Vittorio Murino, Kate Saenko

Abstract: Conventionally, AI models are thought to trade off explainability for lower accuracy. We develop a training strategy that not only leads to a more explainable AI system for object classification, but as a consequence, suffers no perceptible accuracy degradation. Explanations are defined as regions of visual evidence upon which a deep classification network makes a decision. This is represented in the form of a saliency map conveying how much each pixel contributed to the network's decision. Our training strategy enforces a periodic saliency-based feedback to encourage the model to focus on the image regions that directly correspond to the ground-truth object. We quantify explainability using an automated metric, and using human judgement. We propose explainability as a means for bridging the visual-semantic gap between different domains where model explanations are used as a means of disentangling domain specific information from otherwise relevant features. We demonstrate that this leads to improved generalization to new domains without hindering performance on the original domain.

Submitted 13 March, 2020; originally announced March 2020.

arXiv:2003.06430 [pdf, other]

Learning Unbiased Representations via Mutual Information Backpropagation

Authors: Ruggero Ragonesi, Riccardo Volpi, Jacopo Cavazza, Vittorio Murino

Abstract: We are interested in learning data-driven representations that can generalize well, even when trained on inherently biased data. In particular, we face the case where some attributes (bias) of the data, if learned by the model, can severely compromise its generalization properties. We tackle this problem through the lens of information theory, leveraging recent findings for a differentiable estimation of mutual information. We propose a novel end-to-end optimization strategy, which simultaneously estimates and minimizes the mutual information between the learned representation and the data attributes. When applied on standard benchmarks, our model shows comparable or superior classification performance with respect to state-of-the-art approaches. Moreover, our method is general enough to be applicable to the problem of "algorithmic fairness", with competitive results.

Submitted 13 March, 2020; originally announced March 2020.

Comments: Code publicly available at https://github.com/rugrag/learn-unbiased

arXiv:2002.05046 [pdf, other]

Intra-Camera Supervised Person Re-Identification

Authors: ** Zhu, Xiatian Zhu, Minxian Li, Pietro Morerio, Vittorio Murino, Shaogang Gong

Abstract: Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand unsupervised re-id methods do not need identity label information, but they usually suffer from much inferior and insufficient model performance. To overcome these fundamental limitations, we propose a novel person re-identification paradigm based on an idea of independent per-camera identity annotation. This eliminates the most time-consuming and tedious inter-camera identity labelling process, significantly reducing the amount of human annotation efforts. Consequently, it gives rise to a more scalable and more feasible setting, which we call Intra-Camera Supervised (ICS) person re-id, for which we formulate a Multi-tAsk mulTi-labEl (MATE) deep learning method. Specifically, MATE is designed for self-discovering the cross-camera identity correspondence in a per-camera multi-task inference framework. Extensive experiments demonstrate the cost-effectiveness superiority of our method over the alternative approaches on three large person re-id datasets. For example, MATE yields 88.7% rank-1 score on Market-1501 in the proposed ICS person re-id setting, significantly outperforming unsupervised learning models and closely approaching conventional fully supervised learning competitors.

Submitted 16 January, 2021; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: Accepted to IJCV

arXiv:2001.02950 [pdf, other]

Generative Pseudo-label Refinement for Unsupervised Domain Adaptation

Authors: Pietro Morerio, Riccardo Volpi, Ruggero Ragonesi, Vittorio Murino

Abstract: We investigate and characterize the inherent resilience of conditional Generative Adversarial Networks (cGANs) against noise in their conditioning labels, and exploit this fact in the context of Unsupervised Domain Adaptation (UDA). In UDA, a classifier trained on the labelled source set can be used to infer pseudo-labels on the unlabelled target set. However, this will result in a significant amount of misclassified examples (due to the well-known domain shift issue), which can be interpreted as noise injection in the ground-truth labels for the target set. We show that cGANs are, to some extent, robust against such "shift noise". Indeed, cGANs trained with noisy pseudo-labels, are able to filter such noise and generate cleaner target samples. We exploit this finding in an iterative procedure where a generative model and a classifier are jointly trained: in turn, the generator allows to sample cleaner data from the target distribution, and the classifier allows to associate better labels to target samples, progressively refining target pseudo-labels. Results on common benchmarks show that our method performs better or comparably with the unsupervised domain adaptation state of the art.

Submitted 9 January, 2020; originally announced January 2020.

arXiv:1912.10982 [pdf, other]

DMCL: Distillation Multiple Choice Learning for Multimodal Action Recognition

Authors: Nuno C. Garcia, Sarah Adel Bargal, Vitaly Ablavsky, Pietro Morerio, Vittorio Murino, Stan Sclaroff

Abstract: In this work, we address the problem of learning an ensemble of specialist networks using multimodal data, while considering the realistic and challenging scenario of possible missing modalities at test time. Our goal is to leverage the complementary information of multiple modalities to the benefit of the ensemble and each individual network. We introduce a novel Distillation Multiple Choice Learning framework for multimodal data, where different modality networks learn in a cooperative setting from scratch, strengthening one another. The modality networks learned using our method achieve significantly higher accuracy than if trained separately, due to the guidance of other modalities. We evaluate this approach on three video action recognition benchmark datasets. We obtain state-of-the-art results in comparison to other approaches that work with missing modalities at test time.

Submitted 23 December, 2019; originally announced December 2019.

arXiv:1910.10859 [pdf, other]

doi 10.1109/TIP.2019.2940477

Aggregation Signature for Small Object Tracking

Authors: Chunlei Liu, Wenrui Ding, **yu Yang, Vittorio Murino, Baochang Zhang, Jungong Han, Guodong Guo

Abstract: Small object tracking is becoming an increasingly important task, which however has been largely unexplored in computer vision. The great challenges stem from the facts that: 1) small objects show extremely vague and variable appearances, and 2) they tend to be lost more easily than normal-sized ones due to the shaking of the lens. In this paper, we propose a novel aggregation signature suitable for small object tracking, especially aiming for the challenge of sudden and large drift. We make three-fold contributions in this work. First, technically, we propose a new descriptor, named aggregation signature, based on saliency, able to represent highly distinctive features for small objects. Second, theoretically, we prove that the proposed signature matches the foreground object more accurately with a high probability. Third, experimentally, the aggregation signature achieves a high performance on multiple datasets, outperforming the state-of-the-art methods by large margins. Moreover, we contribute with two newly collected benchmark datasets, i.e., small90 and small112, for visually small object tracking. The datasets will be available at https://github.com/bczhangbczhang/.

Submitted 23 October, 2019; originally announced October 2019.

Comments: IEEE Transactions on Image Processing, 2019

arXiv:1910.10035 [pdf, other]

Scanner Invariant Multiple Sclerosis Lesion Segmentation from MRI

Authors: Shahab Aslani, Vittorio Murino, Michael Dayan, Roger Tam, Diego Sona, Ghassan Hamarneh

Abstract: This paper presents a simple and effective generalization method for magnetic resonance imaging (MRI) segmentation when data is collected from multiple MRI scanning sites and as a consequence is affected by (site-)domain shifts. We propose to integrate a traditional encoder-decoder network with a regularization network. This added network includes an auxiliary loss term which is responsible for the reduction of the domain shift problem and for the resulting improved generalization. The proposed method was evaluated on multiple sclerosis lesion segmentation from MRI data. We tested the proposed model on an in-house clinical dataset including 117 patients from 56 different scanning sites. In the experiments, our method showed better generalization performance than other baseline networks.

Submitted 22 October, 2019; originally announced October 2019.

arXiv:1908.10359 [pdf, other]

Unsupervised Domain-Adaptive Person Re-identification Based on Attributes

Authors: ** Zhu, Pietro Morerio, Vittorio Murino

Abstract: Pedestrian attributes, e.g., hair length, clothes type and color, locally describe the semantic appearance of a person. Training person re-identification (ReID) algorithms under the supervision of such attributes has proven to be effective in extracting local features which are important for ReID. Unlike person identity, attributes are consistent across different domains (or datasets). However, most ReID datasets lack attribute annotations. On the other hand, there are several datasets labeled with sufficient attributes for the case of pedestrian attribute recognition. Exploiting such data for ReID purposes can be a way to alleviate the shortage of attribute annotations in the ReID case. In this work, an unsupervised domain adaptive ReID feature learning framework is proposed to make full use of attribute annotations. We propose to transfer attribute-related features from their original domain to the ReID one: to this end, we introduce an adversarial discriminative domain adaptation method in order to learn domain invariant features for encoding semantic attributes. Experiments on three large-scale datasets validate the effectiveness of the proposed ReID framework.

Submitted 27 August, 2019; originally announced August 2019.

Comments: 5 pages, accepted by ICIP2019

arXiv:1908.10344 [pdf, other]

Intra-Camera Supervised Person Re-Identification: A New Benchmark

Authors: ** Zhu, Xiatian Zhu, Minxian Li, Vittorio Murino, Shaogang Gong

Abstract: Existing person re-identification (re-id) methods rely mostly on a large set of inter-camera identity labelled training data, requiring a tedious data collection and annotation process therefore leading to poor scalability in practical re-id applications. To overcome this fundamental limitation, we consider person re-identification without inter-camera identity association but only with identity labels independently annotated within each individual camera-view. This eliminates the most time-consuming and tedious inter-camera identity labelling process in order to significantly reduce the amount of human efforts required during annotation. It hence gives rise to a more scalable and more feasible learning scenario, which we call Intra-Camera Supervised (ICS) person re-id. Under this ICS setting with weaker label supervision, we formulate a Multi-Task Multi-Label (MTML) deep learning method. Given no inter-camera association, MTML is specially designed for self-discovering the inter-camera identity correspondence. This is achieved by inter-camera multi-label learning under a joint multi-task inference framework. In addition, MTML can also efficiently learn the discriminative re-id feature representations by fully using the available identity labels within each camera-view. Extensive experiments demonstrate the performance superiority of our MTML model over the state-of-the-art alternative methods on three large-scale person re-id datasets in the proposed intra-camera supervised learning setting.

Submitted 27 August, 2019; originally announced August 2019.

Comments: 9 pages, 3 figures, accepted by ICCV Workshop on Real-World Recognition from Low-Quality Images and Videos, 2019

arXiv:1904.07933 [pdf, other]

Audio-Visual Model Distillation Using Acoustic Images

Authors: Andrés F. Pérez, Valentina Sanguineti, Pietro Morerio, Vittorio Murino

Abstract: In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality. Former models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval. However, such representations are not so robust towards variable environmental sound conditions. We tackle this drawback by exploiting a new multimodal labeled action recognition dataset acquired by a hybrid audio-visual sensor that provides RGB video, raw audio signals, and spatialized acoustic data, also known as acoustic images, where the visual and acoustic images are aligned in space and synchronized in time. Using this richer information, we train audio deep learning models in a teacher-student fashion. In particular, we distill knowledge into audio networks from both visual and acoustic image teachers. Our experiments suggest that the learned representations are more powerful and have better generalization capabilities than the features learned from models trained using just single-microphone audio data.

Submitted 11 February, 2020; v1 submitted 16 April, 2019; originally announced April 2019.

Comments: Accepted at WACV 2020; supplementary material at page 11; code available at https://github.com/afperezm/acoustic-images-distillation

arXiv:1903.11900 [pdf, other]

Addressing Model Vulnerability to Distributional Shifts over Image Transformation Sets

Authors: Riccardo Volpi, Vittorio Murino

Abstract: We are concerned with the vulnerability of computer vision models to distributional shifts. We formulate a combinatorial optimization problem that allows evaluating the regions in the image space where a given model is more vulnerable, in terms of image transformations applied to the input, and face it with standard search algorithms. We further embed this idea in a training procedure, where we define new data augmentation rules according to the image transformations that the current model is most vulnerable to, over iterations. An empirical evaluation on classification and semantic segmentation problems suggests that the devised algorithm allows to train models that are more robust against content-preserving image manipulations and, in general, against distributional shifts.

Submitted 20 August, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

Comments: ICCV 2019 (camera ready)

arXiv:1902.01395 [pdf]

Comparison of brain connectomes using geodesic distance on manifold: a twin study

Authors: A. Yamin, M. Dayan, L. Squarcina, P. Brambilla, V. Murino, V. Diwadkar, D. Sona

Abstract: fMRI is a unique non-invasive approach for understanding the functional organization of the human brain, and task-based fMRI promotes identification of functionally relevant brain regions associated with a given task. Here, we use fMRI (using the Poffenberger Paradigm) data collected in mono- and dizygotic twin pairs to propose a novel approach for assessing similarity in functional networks. In particular, we compared network similarity between pairs of twins in task-relevant and task-orthogonal networks. The proposed method measures the similarity between functional networks using a geodesic distance between graph Laplacians. With this method we show that networks are more similar in monozygotic twins compared to dizygotic twins. Furthermore, the similarity in monozygotic twins is higher for task-relevant than for task-orthogonal networks.

Submitted 4 February, 2019; originally announced February 2019.

Comments: Paper is accepted for presentation in ISBI 2019. Camera ready has been submitted on 15 Jan 2019

arXiv:1812.02626 [pdf, other]

Guided Zoom: Questioning Network Evidence for Fine-grained Classification

Authors: Sarah Adel Bargal, Andrea Zunino, Vitali Petsiuk, Jianming Zhang, Kate Saenko, Vittorio Murino, Stan Sclaroff

Abstract: We propose Guided Zoom, an approach that utilizes spatial grounding of a model's decision to make more informed predictions. It does so by making sure the model has "the right reasons" for a prediction, defined as reasons that are coherent with those used to make similar correct decisions at training time. The reason/evidence upon which a deep convolutional neural network makes a prediction is defined to be the spatial grounding, in the pixel space, for a specific class conditional probability in the model output. Guided Zoom examines how reasonable such evidence is for each of the top-k predicted classes, rather than solely trusting the top-1 prediction. We show that Guided Zoom improves the classification accuracy of a deep convolutional neural network model and obtains state-of-the-art results on three fine-grained classification benchmark datasets.

Submitted 23 March, 2020; v1 submitted 6 December, 2018; originally announced December 2018.

Comments: BMVC 2019 Camera Ready Version

arXiv:1811.02942 [pdf, other]

doi 10.1016/j.neuroimage.2019.03.068

Multi-branch Convolutional Neural Network for Multiple Sclerosis Lesion Segmentation

Authors: Shahab Aslani, Michael Dayan, Loredana Storelli, Massimo Filippi, Vittorio Murino, Maria A Rocca, Diego Sona

Abstract: In this paper, we present an automated approach for segmenting multiple sclerosis (MS) lesions from multi-modal brain magnetic resonance images. Our method is based on a deep end-to-end 2D convolutional neural network (CNN) for slice-based segmentation of 3D volumetric data. The proposed CNN includes a multi-branch downsampling path, which enables the network to encode information from multiple modalities separately. Multi-scale feature fusion blocks are proposed to combine feature maps from different modalities at different stages of the network. Then, multi-scale feature upsampling blocks are introduced to upsize combined feature maps to leverage information from lesion shape and location. We trained and tested the proposed model using orthogonal plane orientations of each 3D modality to exploit the contextual information in all directions. The proposed pipeline is evaluated on two different datasets: a private dataset including 37 MS patients and a publicly available dataset known as the ISBI 2015 longitudinal MS lesion segmentation challenge dataset, consisting of 14 MS patients. Considering the ISBI challenge, at the time of submission, our method was amongst the top performing solutions. On the private dataset, using the same array of performance metrics as in the ISBI challenge, the proposed approach shows high improvements in MS lesion segmentation compared with other publicly available tools.

Submitted 8 April, 2019; v1 submitted 7 November, 2018; originally announced November 2018.

Comments: This paper has been accepted for publication in NeuroImage

arXiv:1810.08437 [pdf, other]

doi 10.1109/TPAMI.2019.2929038

Learning with privileged information via adversarial discriminative modality distillation

Authors: Nuno C. Garcia, Pietro Morerio, Vittorio Murino

Abstract: Heterogeneous data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while training data can be accurately collected to include a variety of sensory modalities, it is often the case that not all of them are available in real life (testing) scenarios, where a model has to be deployed. This raises the challenge of how to extract information from multimodal data in the training stage, in a form that can be exploited at test time, considering limitations such as noisy or missing modalities. This paper presents a new approach in this direction for RGB-D vision tasks, developed within the adversarial learning and privileged information frameworks. We consider the practical case of learning representations from depth and RGB videos, while relying only on RGB data at test time. We propose a new approach to train a hallucination network that learns to distill depth information via adversarial learning, resulting in a clean approach without several losses to balance or hyperparameters. We report state-of-the-art results on object classification on the NYUD dataset and video action recognition on the largest multimodal dataset available for this task, the NTU RGB+D, as well as on the Northwestern-UCLA.

Submitted 26 July, 2019; v1 submitted 19 October, 2018; originally announced October 2018.

Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence

arXiv:1806.07110 [pdf, other]

Modality Distillation with Multiple Stream Networks for Action Recognition

Authors: Nuno Garcia, Pietro Morerio, Vittorio Murino

Abstract: Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be accurately designed to include a variety of sensory inputs, it is often the case that not all modalities could be available in real life (testing) scenarios, where a model has to be deployed. This raises the challenge of how to learn robust representations leveraging multimodal data in the training stage, while considering limitations at test time, such as noisy or missing modalities. This paper presents a new approach for multimodal video action recognition, developed within the unified frameworks of distillation and privileged information, named generalized distillation. Particularly, we consider the case of learning representations from depth and RGB videos, while relying on RGB data only at test time. We propose a new approach to train a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging soft labels and hard labels, as well as distance between feature maps. We report state-of-the-art results on video action classification on the largest multimodal dataset available for this task, the NTU RGB+D. Code available at https://github.com/ncgarcia/modality-distillation .

Submitted 29 October, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

Comments: Accepted at ECCV 2018; Supp. material at p.16; code available

arXiv:1805.12018 [pdf, other]

Generalizing to Unseen Domains via Adversarial Data Augmentation

Authors: Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John Duchi, Vittorio Murino, Silvio Savarese

Abstract: We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Only using training data from a single source distribution, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e.g., ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models that improve performance across a range of a priori unknown target domains.

Submitted 6 November, 2018; v1 submitted 30 May, 2018; originally announced May 2018.

Comments: Accepted to NIPS 2018 (camera ready)

arXiv:1805.09092 [pdf, other]

Excitation Dropout: Encouraging Plasticity in Deep Neural Networks

Authors: Andrea Zunino, Sarah Adel Bargal, Pietro Morerio, Jianming Zhang, Stan Sclaroff, Vittorio Murino

Abstract: We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction defined as the firing of neurons in specific paths. In this work, we utilize the evidence at each neuron to determine the probability of dropout, rather than dropping out neurons uniformly at random as in standard dropout. In essence, we drop out with higher probability those neurons which contribute more to decision making at training time. This approach penalizes high-saliency neurons that are most relevant for model prediction, i.e. those having stronger evidence. By dropping such high-saliency neurons, the network is forced to learn alternative paths in order to maintain loss minimization, resulting in a plasticity-like behavior, a characteristic of human brains too. We demonstrate better generalization ability, an increased utilization of network neurons, and a higher resilience to network compression using several metrics over four image/video recognition benchmarks.

Submitted 21 January, 2021; v1 submitted 23 May, 2018; originally announced May 2018.

Comments: This work is published in the International Journal of Computer Vision (IJCV) in 2021

arXiv:1711.10290 [pdf, other]

Scalable and Compact 3D Action Recognition with Approximated RBF Kernel Machines

Authors: Jacopo Cavazza, Pietro Morerio, Vittorio Murino

Abstract: Despite the recent deep learning (DL) revolution, kernel machines still remain powerful methods for action recognition. DL has brought the use of large datasets and this is typically a problem for kernel approaches, which do not scale up efficiently due to kernel Gram matrices. Nevertheless, kernel methods are still attractive and more generally applicable since they can equally manage different sizes of the datasets, also in cases where DL techniques show some limitations. This work investigates these issues by proposing an explicit approximated representation that, together with a linear model, is an equivalent, yet scalable, implementation of a kernel machine. Our approximation is directly inspired by the exact feature map that is induced by an RBF Gaussian kernel but, unlike the latter, it is finite dimensional and very compact. We justify the soundness of our idea with a theoretical analysis which proves the unbiasedness of the approximation, and provides a vanishing bound for its variance, which is shown to decrease much more rapidly than in alternative methods in the literature. In a broad experimental validation, we assess the superiority of our approximation in terms of 1) ease and speed of training, 2) compactness of the model, and 3) improvements with respect to the state-of-the-art performance.

Submitted 28 November, 2017; originally announced November 2017.

arXiv:1711.10288 [pdf, other]

Minimal-Entropy Correlation Alignment for Unsupervised Deep Domain Adaptation

Authors: Pietro Morerio, Jacopo Cavazza, Vittorio Murino

Abstract: In this work, we face the problem of unsupervised domain adaptation with a novel deep learning approach which leverages our finding that entropy minimization is induced by the optimal alignment of second order statistics between source and target domains. We formally demonstrate this hypothesis and, aiming at achieving an optimal alignment in practical cases, we adopt a more principled strategy which, unlike the current Euclidean approaches, deploys alignment along geodesics. Our pipeline can be implemented by adding to the standard classification loss (on the labeled source domain) a source-to-target regularizer that is weighted in an unsupervised and data-driven fashion. We provide extensive experiments to assess the superiority of our framework on standard domain and modality adaptation benchmarks.

Submitted 28 November, 2017; originally announced November 2017.

arXiv:1711.08561 [pdf, other]

Adversarial Feature Augmentation for Unsupervised Domain Adaptation

Authors: Riccardo Volpi, Pietro Morerio, Silvio Savarese, Vittorio Murino

Abstract: Recent works showed that Generative Adversarial Networks (GANs) can be successfully applied in unsupervised domain adaptation, where, given a labeled source dataset and an unlabeled target dataset, the goal is to train powerful classifiers for the target samples. In particular, it was shown that a GAN objective function can be used to learn target features indistinguishable from the source ones. In this work, we extend this framework by (i) forcing the learned feature extractor to be domain-invariant, and (ii) training it through data augmentation in the feature space, namely performing feature augmentation. While data augmentation in the image space is a well established technique in deep learning, feature augmentation has not yet received the same level of attention. We accomplish it by means of a feature generator trained by playing the GAN minimax game against source features. Results show that both enforcing domain-invariance and performing feature augmentation lead to superior or comparable performance to state-of-the-art results in several unsupervised domain adaptation benchmarks.

Submitted 4 May, 2018; v1 submitted 22 November, 2017; originally announced November 2017.

Comments: Accepted to CVPR 2018

arXiv:1711.06778 [pdf, other]

Excitation Backprop for RNNs

Authors: Sarah Adel Bargal, Andrea Zunino, Donghyun Kim, Jianming Zhang, Vittorio Murino, Stan Sclaroff

Abstract: Deep models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such studies are relatively lacking for models of spatiotemporal visual content - videos. In this work, we devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model's classification/captioning output using the model's internal representation. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks.

Submitted 8 March, 2018; v1 submitted 17 November, 2017; originally announced November 2017.

Comments: CVPR 2018 Camera Ready Version

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

arXiv:1710.05092 [pdf, other]

Dropout as a Low-Rank Regularizer for Matrix Factorization

Authors: Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, Rene Vidal

Abstract: Regularization for matrix factorization (MF) and approximation problems has been carried out in many different ways. Due to its popularity in deep learning, dropout has been applied also for this class of problems. Despite its solid empirical performance, the theoretical properties of dropout as a regularizer remain quite elusive for this class of problems. In this paper, we present a theoretical analysis of dropout for MF, where Bernoulli random variables are used to drop columns of the factors. We demonstrate the equivalence between dropout and a fully deterministic model for MF in which the factors are regularized by the sum of the product of squared Euclidean norms of the columns. Additionally, we inspect the case of a variable sized factorization and we prove that dropout achieves the global minimum of a convex approximation problem with (squared) nuclear norm regularization. As a result, we conclude that dropout can be used as a low-rank regularizer with data dependent singular-value thresholding.

Submitted 13 October, 2017; originally announced October 2017.

arXiv:1710.03487 [pdf, other]

An Analysis of Dropout for Matrix Factorization

Authors: Jacopo Cavazza, Connor Lane, Benjamin D. Haeffele, Vittorio Murino, René Vidal

Abstract: Dropout is a simple yet effective algorithm for regularizing neural networks by randomly dropping out units through Bernoulli multiplicative noise, and for some restricted problem classes, such as linear or logistic regression, several theoretical studies have demonstrated the equivalence between dropout and a fully deterministic optimization problem with data-dependent Tikhonov regularization. This work presents a theoretical analysis of dropout for matrix factorization, where Bernoulli random variables are used to drop a factor, thereby attempting to control the size of the factorization. While recent work has demonstrated the empirical effectiveness of dropout for matrix factorization, a theoretical understanding of the regularization properties of dropout in this context remains elusive. This work demonstrates the equivalence between dropout and a fully deterministic model for matrix factorization in which the factors are regularized by the sum of the product of the norms of the columns. While the resulting regularizer is closely related to a variational form of the nuclear norm, suggesting that dropout may limit the size of the factorization, we show that it is possible to trivially lower the objective value by doubling the size of the factorization. We show that this problem is caused by the use of a fixed dropout rate, which motivates the use of a rate that increases with the size of the factorization. Synthetic experiments validate our theoretical findings.

Submitted 10 October, 2017; originally announced October 2017.

arXiv:1709.01695 [pdf, other]

A Compact Kernel Approximation for 3D Action Recognition

Authors: Jacopo Cavazza, Pietro Morerio, Vittorio Murino

Abstract: 3D action recognition was shown to benefit from a covariance representation of the input data (joint 3D positions). A kernel machine fed with such features is an effective paradigm for 3D action recognition, yielding state-of-the-art results. Yet, the whole framework is affected by the well-known scalability issue. In fact, in general, the kernel function has to be evaluated for all pairs of instances, inducing a Gram matrix whose complexity is quadratic in the number of samples. In this work we reduce such complexity to be linear by proposing a novel and explicit feature map to approximate the kernel function. This allows training a linear classifier with an explicit feature encoding, which implicitly implements a Log-Euclidean machine in a scalable fashion. Not only do we prove that the proposed approximation is unbiased, but we also work out an explicit strong bound for its variance, attesting a theoretical superiority of our approach with respect to existing ones. Experimentally, we verify that our representation provides a compact encoding and outperforms other approximation schemes on a number of publicly available benchmark datasets for 3D action recognition.

Submitted 4 October, 2017; v1 submitted 6 September, 2017; originally announced September 2017.

Comments: Best paper award special mention at the 19th edition of the GIRPR International Conference on Image Analysis and Processing (ICIAP) 2017

arXiv:1708.01846 [pdf, other]

Manifold Constrained Low-Rank Decomposition

Authors: Chen Chen, Baochang Zhang, Alessio Del Bue, Vittorio Murino

Abstract: Low-rank decomposition (LRD) is a state-of-the-art method for visual data reconstruction and modelling. However, it is a very challenging problem when the image data contains significant occlusion, noise, illumination variation, and misalignment from rotation or viewpoint changes. We leverage the specific structure of data in order to improve the performance of LRD when the data are not ideal. To this end, we propose a new framework that embeds manifold priors into LRD. To implement the framework, we design an alternating direction method of multipliers (ADMM) algorithm which efficiently integrates the manifold constraints during the optimization process. The proposed approach is successfully used to calculate low-rank models from face images, hand-written digits and planar surface images. The results show a consistent increase of performance when compared to the state-of-the-art over a wide range of realistic image misalignments and corruptions.

Submitted 6 August, 2017; originally announced August 2017.

arXiv:1708.01034 [pdf, other]

doi 10.1109/CVPRW.2017.7

What Will I Do Next? The Intention from Motion Experiment

Authors: Andrea Zunino, Jacopo Cavazza, Atesh Koul, Andrea Cavallo, Cristina Becchio, Vittorio Murino

Abstract: In computer vision, video-based approaches have been widely explored for the early classification and the prediction of actions or activities. However, it remains unclear whether this modality (as compared to 3D kinematics) can still be reliable for the prediction of human intentions, defined as the overarching goal embedded in an action sequence. Since the same action can be performed with different intentions, this problem is more challenging, yet feasible, as proved by quantitative cognitive studies which exploit the 3D kinematics acquired through motion capture systems. In this paper, we bridge cognitive and computer vision studies by demonstrating the effectiveness of video-based approaches for the prediction of human intentions. Precisely, we propose Intention from Motion, a new paradigm where, without using any contextual information, we consider instantaneous grasping motor acts involving a bottle in order to forecast why the bottle itself has been reached (to pass it or to place it in a box, or to pour or to drink the liquid inside). We process only the grasping onsets, casting intention prediction as a classification framework. Leveraging our multimodal acquisition (3D motion capture data and 2D optical videos), we compare the most commonly used 3D descriptors from cognitive studies with state-of-the-art video-based techniques. Since the two analyses achieve an equivalent performance, we demonstrate that computer vision tools are effective in capturing the kinematics and facing the cognitive problem of human intention prediction.

Submitted 3 August, 2017; originally announced August 2017.

Comments: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops

arXiv:1708.01022 [pdf, other]

doi 10.1109/CVPRW.2017.165

When Kernel Methods meet Feature Learning: Log-Covariance Network for Action Recognition from Skeletal Data

Authors: Jacopo Cavazza, Pietro Morerio, Vittorio Murino

Abstract: Human action recognition from skeletal data is a hot research topic and important in many open domain applications of computer vision, thanks to recently introduced 3D sensors. In the literature, naive methods simply transfer off-the-shelf techniques from video to the skeletal representation. However, the current state-of-the-art is contended between two different paradigms: kernel-based methods and feature learning with (recurrent) neural networks. Both approaches show strong performances, yet they exhibit heavy, but complementary, drawbacks. Motivated by this fact, our work aims at combining the best of the two paradigms, by proposing an approach where a shallow network is fed with a covariance representation. Our intuition is that, as long as the dynamics is effectively modeled, there is no need for the classification network to be deep or recurrent in order to score favorably. We validate this hypothesis in a broad experimental analysis over 6 publicly available datasets.

Submitted 3 August, 2017; originally announced August 2017.

Comments: 2017 IEEE Computer Vision and Pattern Recognition (CVPR) Workshops

arXiv:1706.03112 [pdf, other]

Unsupervised Adaptive Re-identification in Open World Dynamic Camera Networks

Authors: Rameswar Panda, Amran Bhuiyan, Vittorio Murino, Amit K. Roy-Chowdhury

Abstract: Person re-identification is an open and challenging problem in computer vision. Existing approaches have concentrated on either designing the best feature representation or learning optimal matching metrics in a static setting where the number of cameras is fixed in a network. Most approaches have neglected the dynamic and open world nature of the re-identification problem, where a new camera may be temporarily inserted into an existing system to get additional information. To address such a novel and very practical problem, we propose an unsupervised adaptation scheme for re-identification models in a dynamic camera network. First, we formulate a domain perceptive re-identification method based on geodesic flow kernel that can effectively find the best source camera (already installed) to adapt with a newly introduced target camera, without requiring a very expensive training phase. Second, we introduce a transitive inference algorithm for re-identification that can exploit the information from the best source camera to improve the accuracy across other camera pairs in a network of multiple cameras. Extensive experiments on four benchmark datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art unsupervised learning based alternatives whilst being extremely efficient to compute.

Submitted 9 June, 2017; originally announced June 2017.

Comments: CVPR 2017 Spotlight

arXiv:1705.08180 [pdf, other]

Correlation Alignment by Riemannian Metric for Domain Adaptation

Authors: Pietro Morerio, Vittorio Murino

Abstract: Domain adaptation techniques address the problem of reducing the sensitivity of machine learning methods to the so-called domain shift, namely the difference between source (training) and target (test) data distributions. In particular, unsupervised domain adaptation assumes no labels are available in the target domain. To this end, aligning second order statistics (covariances) of target and source domains has proven to be an effective approach to fill the gap between the domains. However, covariance matrices do not form a subspace of the Euclidean space, but live in a Riemannian manifold with non-positive curvature, making the usual Euclidean metric suboptimal to measure distances. In this paper, we extend the idea of training a neural network with a constraint on the covariances of the hidden layer features, by rigorously accounting for the curved structure of the manifold of symmetric positive definite matrices. The resulting loss function exploits a theoretically sound geodesic distance on such manifold. Results indeed show the suboptimal nature of the Euclidean distance. This enables us to perform better than previous approaches on the standard Office dataset, a benchmark for domain adaptation techniques.

Submitted 23 May, 2017; originally announced May 2017.

arXiv:1703.06229 [pdf, other]

Curriculum Dropout

Authors: Pietro Morerio, Jacopo Cavazza, Riccardo Volpi, Rene Vidal, Vittorio Murino

Abstract: Dropout is a very effective way of regularizing neural networks. Stochastically "dropping out" units with a certain probability discourages over-specific co-adaptations of feature detectors, preventing overfitting and improving network generalization. Besides, Dropout can be interpreted as an approximate model aggregation technique, where an exponential number of smaller networks are averaged in order to get a more powerful ensemble. In this paper, we show that using a fixed dropout probability during training is a suboptimal choice. We thus propose a time scheduling for the probability of retaining neurons in the network. This induces an adaptive regularization scheme that smoothly increases the difficulty of the optimization problem. This idea of "starting easy" and adaptively increasing the difficulty of the learning problem has its roots in curriculum learning and allows one to train better models. Indeed, we prove that our optimization strategy implements a very general curriculum scheme, by gradually adding noise to both the input and intermediate feature representations within the network architecture. Experiments on seven image classification datasets and different network architectures show that our method, named Curriculum Dropout, frequently yields better generalization and, at worst, performs just as well as the standard Dropout method.

Submitted 3 August, 2017; v1 submitted 17 March, 2017; originally announced March 2017.

Comments: Accepted at ICCV (International Conference on Computer Vision) 2017

arXiv:1701.02898 [pdf, other]

Modeling Retinal Ganglion Cell Population Activity with Restricted Boltzmann Machines

Authors: Matteo Zanotto, Riccardo Volpi, Alessandro Maccione, Luca Berdondini, Diego Sona, Vittorio Murino

Abstract: The retina is a complex nervous system which encodes visual stimuli before higher order processing occurs in the visual cortex. In this study we evaluated whether information about the stimuli received by the retina can be retrieved from the firing rate distribution of Retinal Ganglion Cells (RGCs), exploiting High-Density 64x64 MEA technology. To this end, we modeled the RGC population activity using mean-covariance Restricted Boltzmann Machines, latent variable models capable of learning the joint distribution of a set of continuous observed random variables and a set of binary unobserved random units. The idea was to figure out if binary latent states encode the regularities associated with different visual stimuli, as modes in the joint distribution. We measured the goodness of mcRBM encoding by calculating the Mutual Information between the latent states and the stimuli shown to the retina. Results show that binary states can encode the regularities associated with different stimuli, using both gratings and natural scenes as stimuli. We also discovered that hidden variables encode interesting properties of retinal activity, interpreted as population receptive fields. We further investigated the ability of the model to learn different modes in population activity by comparing results associated with a retina in normal conditions and after pharmacologically blocking GABA receptors (GABAC at first, and then also GABAA and GABAB). As expected, Mutual Information tends to decrease if we pharmacologically block receptors. We finally stress that the computational method described in this work could potentially be applied to any kind of neural data obtained through MEA technology, though different techniques should be applied to interpret the results.

Submitted 17 January, 2017; v1 submitted 11 January, 2017; originally announced January 2017.

arXiv:1609.09251 [pdf, other]

Kernel Methods on Approximate Infinite-Dimensional Covariance Operators for Image Classification

Authors: Hà Quang Minh, Marco San Biagio, Loris Bazzani, Vittorio Murino

Abstract: This paper presents a novel framework for visual object recognition using infinite-dimensional covariance operators of input features in the paradigm of kernel methods on infinite-dimensional Riemannian manifolds. Our formulation provides in particular a rich representation of image features by exploiting their non-linear correlations. Theoretically, we provide a finite-dimensional approximation of the Log-Hilbert-Schmidt (Log-HS) distance between covariance operators that is scalable to large datasets, while maintaining an effective discriminating capability. This allows us to efficiently approximate any continuous shift-invariant kernel defined using the Log-HS distance. At the same time, we prove that the Log-HS inner product between covariance operators is only approximable by its finite-dimensional counterpart in a very limited scenario. Consequently, kernels defined using the Log-HS inner product, such as polynomial kernels, are not scalable in the same way as shift-invariant kernels. Computationally, we apply the approximate Log-HS distance formulation to covariance operators of both handcrafted and convolutional features, exploiting both the expressiveness of these features and the power of the covariance representation. Empirically, we tested our framework on the task of image classification on twelve challenging datasets. In almost all cases, the results obtained outperform other state of the art methods, demonstrating the competitiveness and potential of our framework.

Submitted 29 September, 2016; originally announced September 2016.

Comments: 18 double-column pages

Showing 1–50 of 61 results for author: Murino, V