-
Improving Object Detector Training on Synthetic Data by Starting With a Strong Baseline Methodology
Authors:
Frank A. Ruis,
Alma M. Liezenga,
Friso G. Heslinga,
Luca Ballan,
Thijs A. Eker,
Richard J. M. den Hollander,
Martin C. van Leeuwen,
Judith Dijk,
Wyke Huizinga
Abstract:
Collecting and annotating real-world data for the development of object detection models is a time-consuming and expensive process. In the military domain in particular, data collection can also be dangerous or infeasible. Training models on synthetic data may provide a solution for cases where access to real-world training data is restricted. However, bridging the reality gap between synthetic and real data remains a challenge. Existing methods usually build on top of baseline Convolutional Neural Network (CNN) models that have been shown to perform well when trained on real data, but have limited ability to perform well when trained on synthetic data. For example, some architectures allow for fine-tuning with the expectation of large quantities of training data and are prone to overfitting on synthetic data. Related work usually ignores various best practices from object detection on real data, e.g. by training on synthetic data from a single environment with relatively little variation. In this paper we propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images. Based on the state of the art, we incorporate data augmentation methods and a Transformer backbone. Besides reaching relatively strong performance without any specialized synthetic data transfer methods, we show that our methods improve the state of the art on synthetic data trained object detection for the RarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an in-house vehicle detection dataset.
Submitted 30 May, 2024;
originally announced May 2024.
-
Following the Human Thread in Social Navigation
Authors:
Luca Scofano,
Alessio Sampieri,
Tommaso Campari,
Valentino Sacco,
Indro Spinelli,
Lamberto Ballan,
Fabio Galasso
Abstract:
The success of collaboration between humans and robots in shared environments relies on the robot's real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human trajectories emerge as crucial cues in Social Navigation, but they are partially observable from the robot's egocentric view and computationally complex to process.
We propose the first Social Dynamics Adaptation model (SDA) based on the robot's state-action history to infer the social dynamics. We propose a two-stage Reinforcement Learning framework: the first stage learns to encode the human trajectories into social dynamics and learns a motion policy conditioned on this encoded information, the current status, and the previous action. Here, the trajectories are fully visible, i.e., assumed as privileged information. In the second stage, the trained policy operates without direct access to trajectories. Instead, the model infers the social dynamics solely from the history of previous actions and statuses in real time. Tested on the novel Habitat 3.0 platform, SDA sets a new state-of-the-art (SoA) performance in finding and following humans.
Submitted 17 April, 2024;
originally announced April 2024.
-
Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement
Authors:
Davide Rigoni,
Luca Parolari,
Luciano Serafini,
Alessandro Sperduti,
Lamberto Ballan
Abstract:
Using only image-sentence pairs, weakly-supervised visual-textual grounding aims to learn region-phrase correspondences of the respective entity mentions. Compared to the supervised approach, learning is more difficult since the correspondences between bounding boxes and textual phrases are unavailable. In light of this, we propose the Semantic Prior Refinement Model (SPRM), whose predictions are obtained by combining the output of two main modules. The first, untrained module aims to return a rough alignment between textual phrases and bounding boxes. The second, trained module is composed of two sub-components that refine the rough alignment to improve the accuracy of the final phrase-bounding box alignments. The model is trained to maximize the multimodal similarity between an image and a sentence, while minimizing the multimodal similarity of the same sentence and a new unrelated image, carefully selected to help the most during training. Our approach shows state-of-the-art results on two popular datasets, Flickr30k Entities and ReferIt, shining especially on ReferIt with a 9.6% absolute improvement. Moreover, thanks to the untrained component, it reaches competitive performance using only a small fraction of the training examples.
Submitted 26 September, 2023; v1 submitted 18 May, 2023;
originally announced May 2023.
-
Distilling Knowledge for Short-to-Long Term Trajectory Prediction
Authors:
Sourav Das,
Guglielmo Camporese,
Shaokang Cheng,
Lamberto Ballan
Abstract:
Long-term trajectory forecasting is an important and challenging problem in the fields of computer vision, machine learning, and robotics. One fundamental difficulty lies in the evolution of the trajectory, which becomes more and more uncertain and unpredictable as the time horizon grows, subsequently increasing the complexity of the problem. To overcome this issue, in this paper, we propose Di-Long, a new method that employs the distillation of a short-term trajectory forecaster that guides a student network for long-term trajectory prediction during the training process. Given a total sequence length that comprises the allowed observation for the student network and the complementary target sequence, we let the student and the teacher solve two different related tasks defined over the same full trajectory: the student observes a short sequence and predicts a long trajectory, whereas the teacher observes a longer sequence and predicts the remaining short target trajectory. The teacher's task is less uncertain, and we use its accurate predictions to guide the student through our knowledge distillation framework, reducing long-term future uncertainty. Our experiments show that our proposed Di-Long method is effective for long-term forecasting and achieves state-of-the-art performance on the Intersection Drone Dataset (inD) and the Stanford Drone Dataset (SDD).
Submitted 15 March, 2024; v1 submitted 15 May, 2023;
originally announced May 2023.
-
Exploiting Proximity-Aware Tasks for Embodied Social Navigation
Authors:
Enrico Cancelli,
Tommaso Campari,
Luciano Serafini,
Angel X. Chang,
Lamberto Ballan
Abstract:
Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability required for embodied agents to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Proximity-Aware Tasks (referred to as Risk and Proximity Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviors. To this end, our tasks exploit the notion of immediate and future dangers of collision. Furthermore, we propose an evaluation protocol specifically designed for the Social Navigation Task in simulated environments. This is done to capture fine-grained features and characteristics of the policy by analyzing the minimal unit of human-robot spatial interaction, called Encounter. We validate our approach on the Gibson4+ and Habitat-Matterport3D datasets.
Submitted 10 March, 2023; v1 submitted 1 December, 2022;
originally announced December 2022.
-
TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction
Authors:
Nada Osman,
Guglielmo Camporese,
Lamberto Ballan
Abstract:
Human intention prediction is a growing area of research where an activity in a video has to be anticipated by a vision-based system. To this end, the model creates a representation of the past, and subsequently, it produces future hypotheses about upcoming scenarios. In this work, we focus on pedestrians' early intention prediction in which, from a current observation of an urban scene, the model predicts the future activity of pedestrians that approach the street. Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times. Moreover, we propose to learn the attention masks of our transformer-based model (Temporal Adaptive Mask Transformer) in order to weigh present and past temporal dependencies differently. We investigate our method on several public benchmarks for early intention prediction, improving the prediction performance at different anticipation times compared to previous work.
Submitted 26 October, 2022;
originally announced October 2022.
-
Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer
Authors:
Guglielmo Camporese,
Elena Izzo,
Lamberto Ballan
Abstract:
Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple but still effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly with the supervised task. Differently from ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signals at each training step. We investigated our methods on several image benchmarks, finding that RelViT improves the SSL state-of-the-art methods by a large margin, especially on small datasets. Code is available at: https://github.com/guglielmocamporese/relvit.
Submitted 13 October, 2022; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Goal-driven Self-Attentive Recurrent Networks for Trajectory Prediction
Authors:
Luigi Filippo Chiara,
Pasquale Coscia,
Sourav Das,
Simone Calderara,
Rita Cucchiara,
Lamberto Ballan
Abstract:
Human trajectory forecasting is a key component of autonomous vehicles, social-aware robots and advanced video-surveillance applications. This challenging task typically requires knowledge about past motion, the environment and likely destination areas. In this context, multi-modality is a fundamental aspect and its effective modeling can be beneficial to any architecture. Inferring accurate trajectories is nevertheless challenging, due to the inherently uncertain nature of the future. To overcome these difficulties, recent models use different inputs and propose to model human intentions using complex fusion mechanisms. In this respect, we propose a lightweight attention-based recurrent backbone that acts solely on past observed positions. Although this backbone already provides promising results, we demonstrate that its prediction accuracy can be improved considerably when combined with a scene-aware goal-estimation module. To this end, we employ a common goal module, based on a U-Net architecture, which additionally extracts semantic information to predict scene-compliant destinations. We conduct extensive experiments on publicly-available datasets (i.e. SDD, inD, ETH/UCY) and show that our approach performs on par with state-of-the-art techniques while reducing model complexity.
Submitted 25 April, 2022;
originally announced April 2022.
-
How many Observations are Enough? Knowledge Distillation for Trajectory Forecasting
Authors:
Alessio Monti,
Angelo Porrello,
Simone Calderara,
Pasquale Coscia,
Lamberto Ballan,
Rita Cucchiara
Abstract:
Accurate prediction of future human positions is an essential task for modern video-surveillance systems. Current state-of-the-art models usually rely on a "history" of past tracked locations (e.g., 3 to 5 seconds) to predict a plausible sequence of future locations (e.g., up to the next 5 seconds). We feel that this common schema neglects critical traits of realistic applications: as the collection of input trajectories involves machine perception (i.e., detection and tracking), incorrect detection and fragmentation errors may accumulate in crowded scenes, leading to tracking drifts. On this account, the model would be fed with corrupted and noisy input data, thus fatally affecting its prediction performance.
In this regard, we focus on delivering accurate predictions when only a few input observations are used, thus potentially lowering the risks associated with automatic perception. To this end, we conceive a novel distillation strategy that allows a knowledge transfer from a teacher network to a student one, the latter fed with fewer observations (just two). We show that a properly defined teacher supervision allows a student network to perform comparably to state-of-the-art approaches that demand more observations. Besides, extensive experiments on common trajectory forecasting datasets highlight that our student network generalizes better to unseen scenarios.
Submitted 9 March, 2022;
originally announced March 2022.
-
Online Learning of Reusable Abstract Models for Object Goal Navigation
Authors:
Tommaso Campari,
Leonardo Lamanna,
Paolo Traverso,
Luciano Serafini,
Lamberto Ballan
Abstract:
In this paper, we present a novel approach to incrementally learn an Abstract Model of an unknown environment, and show how an agent can reuse the learned model for tackling the Object Goal Navigation task. The Abstract Model is a finite state machine in which each state is an abstraction of a state of the environment, as perceived by the agent in a certain position and orientation. The perceptions are high-dimensional sensory data (e.g., RGB-D images), and the abstraction is reached by exploiting image segmentation and the Taskonomy model bank. The learning of the Abstract Model is accomplished by executing actions, observing the reached state, and updating the Abstract Model with the acquired information. The learned models are memorized by the agent, and they are reused whenever it recognizes that it is in an environment that corresponds to a stored model. We investigate the effectiveness of the proposed approach for the Object Goal Navigation task, relying on public benchmarks. Our results show that the reuse of learned Abstract Models can boost performance on Object Goal Navigation.
Submitted 4 March, 2022;
originally announced March 2022.
-
SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos
Authors:
Nada Osman,
Guglielmo Camporese,
Pasquale Coscia,
Lamberto Ballan
Abstract:
Action anticipation in egocentric videos is a difficult task due to the inherently multi-modal nature of human actions. Additionally, some actions happen faster or slower than others depending on the actor or surrounding context, which could vary each time and lead to different predictions. Based on this idea, we build upon the RULSTM architecture, which is specifically designed for anticipating human actions, and propose a novel attention-based technique to evaluate, simultaneously, slow and fast features extracted from three different modalities, namely RGB, optical flow, and extracted objects. Two branches process information at different time scales, i.e., frame rates, and several fusion schemes are considered to improve prediction accuracy. We perform extensive experiments on the EpicKitchens-55 and EGTEA Gaze+ datasets, and demonstrate that our technique systematically improves the results of the RULSTM architecture for the Top-5 accuracy metric at different anticipation times.
Submitted 2 September, 2021;
originally announced September 2021.
-
Conditional Variational Capsule Network for Open Set Recognition
Authors:
Yunrui Guo,
Guglielmo Camporese,
Wenjing Yang,
Alessandro Sperduti,
Lamberto Ballan
Abstract:
In open set recognition, a classifier has to detect classes that are not known at training time. In order to recognize new categories, the classifier has to project the input samples of known classes in very compact and separated regions of the feature space for discriminating samples of unknown classes. Recently proposed Capsule Networks have been shown to outperform alternatives in many fields, particularly in image recognition; however, they have not yet been fully applied to open set recognition. In capsule networks, scalar neurons are replaced by capsule vectors or matrices, whose entries represent different properties of objects. In our proposal, during training, capsule features of the same known class are encouraged to match a pre-defined Gaussian, one for each class. To this end, we use the variational autoencoder framework, with a set of Gaussian priors as the approximation for the posterior distribution. In this way, we are able to control the compactness of the features of the same class around the center of the Gaussians, thus controlling the ability of the classifier to detect samples from unknown classes. We conducted several experiments and ablations of our model, obtaining state-of-the-art results on different datasets in the open set recognition and unknown detection tasks.
Submitted 17 August, 2021; v1 submitted 19 April, 2021;
originally announced April 2021.
-
Prediction of Tuberculosis using U-Net and segmentation techniques
Authors:
Dennis Núñez-Fernández,
Lamberto Ballan,
Gabriel Jiménez-Avalos,
Jorge Coronel,
Patricia Sheen,
Mirko Zimic
Abstract:
One of the most serious public health problems in Peru and worldwide is Tuberculosis (TB), which is caused by a bacterium known as Mycobacterium tuberculosis. The purpose of this work is to facilitate and automate the diagnosis of tuberculosis using the MODS method and lens-free microscopy, as the latter is easier to calibrate and easier for untrained personnel to use than lens microscopy. Therefore, we employed a U-Net network on our collected dataset to perform automatic segmentation of cord-shaped bacterial accumulation and then predict tuberculosis. Our results show promising evidence for automatic segmentation of TB cords, and thus good accuracy for TB prediction.
Submitted 2 April, 2021;
originally announced April 2021.
-
Exploiting Scene-specific Features for Object Goal Navigation
Authors:
Tommaso Campari,
Paolo Eccher,
Luciano Serafini,
Lamberto Ballan
Abstract:
Can the intrinsic relation between an object and the room in which it is usually located help agents in the Visual Navigation Task? We study this question in the context of Object Navigation, a problem in which an agent has to reach an object of a specific class while moving in a complex domestic environment. In this paper, we introduce a new reduced dataset that speeds up the training of navigation models, a notoriously complex task. Our proposed dataset permits the training of models that do not exploit online-built maps in reasonable times, even without the use of huge computational resources. Therefore, this reduced dataset provides a meaningful benchmark and can be used to identify promising models that could then be tried on bigger and more challenging datasets. Subsequently, we propose the SMTSC model, an attention-based model capable of exploiting the correlation between scenes and the objects contained in them, showing quantitatively that the idea is sound.
Submitted 21 August, 2020;
originally announced August 2020.
-
Automatic semantic segmentation for prediction of tuberculosis using lens-free microscopy images
Authors:
Dennis Núñez-Fernández,
Lamberto Ballan,
Gabriel Jiménez-Avalos,
Jorge Coronel,
Mirko Zimic
Abstract:
Tuberculosis (TB), caused by a germ called Mycobacterium tuberculosis, is one of the most serious public health problems in Peru and the world. This project seeks to facilitate and automate the diagnosis of tuberculosis by the MODS method using lens-free microscopy, since it is easier to calibrate and easier to use (by untrained personnel) in comparison with lens microscopy. Thus, we employ a U-Net network on our collected dataset to perform the automatic segmentation of the TB cords in order to predict tuberculosis. Our initial results show promising evidence for automatic segmentation of TB cords.
Submitted 5 July, 2020;
originally announced July 2020.
-
Using Capsule Neural Network to predict Tuberculosis in lens-free microscopic images
Authors:
Dennis Núñez-Fernández,
Lamberto Ballan,
Gabriel Jiménez-Avalos,
Jorge Coronel,
Mirko Zimic
Abstract:
Tuberculosis, caused by a bacterium called Mycobacterium tuberculosis, is one of the most serious public health problems worldwide. This work seeks to facilitate and automate the prediction of tuberculosis by the MODS method using lens-free microscopy, which is easy for untrained personnel to use. We employ the CapsNet architecture on our collected dataset and show that it achieves better accuracy than traditional CNN architectures.
Submitted 5 July, 2020;
originally announced July 2020.
-
AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction
Authors:
Alessia Bertugli,
Simone Calderara,
Pasquale Coscia,
Lamberto Ballan,
Rita Cucchiara
Abstract:
Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video surveillance applications. A key component of this task is the inherently multi-modal nature of human paths, which gives rise to multiple socially acceptable futures when human interactions are involved. To this end, we propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning mainly relies on prior belief maps, representing most likely moving directions and forcing the model to consider past observed dynamics in generating future positions. Human interactions are modeled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation. To corroborate our model, we perform extensive experiments on publicly-available datasets (e.g., ETH/UCY, Stanford Drone Dataset, STATS SportVU NBA, Intersection Drone Dataset and TrajNet++) and demonstrate its effectiveness in crowded scenes compared to several state-of-the-art methods.
Submitted 8 July, 2021; v1 submitted 17 May, 2020;
originally announced May 2020.
-
Knowledge Distillation for Action Anticipation via Label Smoothing
Authors:
Guglielmo Camporese,
Pasquale Coscia,
Antonino Furnari,
Giovanni Maria Farinella,
Lamberto Ballan
Abstract:
The human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living or autonomous driving, need to foresee future events to avoid crashes or help people. Egocentric scenarios are classic examples where action anticipation is applied due to their numerous applications. Such a challenging task demands capturing and modeling the domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels, extending the concept of label smoothing. This idea resembles the knowledge distillation process since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on the EPIC-Kitchens and EGTEA Gaze+ datasets, including more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically improves the performance of state-of-the-art models for action anticipation.
Submitted 18 December, 2020; v1 submitted 16 April, 2020;
originally announced April 2020.
-
A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata
Authors:
Tobia Tesan,
Pasquale Coscia,
Lamberto Ballan
Abstract:
Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or non-common images needing more context to be correctly annotated. Metadata accompanying images on social-media represent an ideal source of additional information for retrieving proper neighborhoods easing image annotation task. To this end, we blend visual features extracted from neighbors and their metadata to jointly leverage context and visual cues. Our models use multiple semantic embeddings to achieve the dual objective of being robust to vocabulary changes between train and test sets and decoupling the architecture from the low-level metadata representation. Convolutional and recurrent neural networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors and query images. We perform comprehensive experiments on the NUS-WIDE dataset showing that our models outperform state-of-the-art architectures based on images and metadata, and decrease both sensory and semantic gaps to better annotate images.
Submitted 30 March, 2020; v1 submitted 13 October, 2019;
originally announced October 2019.
-
Social and Scene-Aware Trajectory Prediction in Crowded Spaces
Authors:
Matteo Lisotto,
Pasquale Coscia,
Lamberto Ballan
Abstract:
Mimicking the human ability to forecast future positions or interpret complex interactions in urban scenarios, such as streets, shopping malls or squares, is essential to develop socially compliant robots or self-driving cars. Autonomous systems may gain an advantage by anticipating human motion to avoid collisions or to behave naturally alongside people. To foresee plausible trajectories, we construct an LSTM (long short-term memory)-based model considering three fundamental factors: people interactions, past observations in terms of previously crossed areas and semantics of surrounding space. Our model encompasses several pooling mechanisms to join the above elements defining multiple tensors, namely social, navigation and semantic tensors. The network is tested in unstructured environments where complex paths emerge according to both internal (intentions) and external (other people, not accessible areas) motivations. As demonstrated, modeling paths unaware of social interactions or context information is insufficient to correctly predict future positions. Experimental results corroborate the effectiveness of the proposed framework in comparison to LSTM-based models for human path prediction.
Submitted 19 September, 2019;
originally announced September 2019.
-
Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition
Authors:
Christian Rupprecht,
Ansh Kapil,
Nan Liu,
Lamberto Ballan,
Federico Tombari
Abstract:
Webly-supervised learning has recently emerged as an alternative paradigm to traditional supervised learning based on large-scale datasets with manual annotations. The key idea is that models such as CNNs can be learned from the noisy visual data available on the web. In this work we aim to exploit web data for video understanding tasks such as action recognition and detection. One of the main problems in webly-supervised learning is cleaning the noisy labeled data from the web. The state-of-the-art paradigm relies on training a first classifier on noisy data that is then used to clean the remaining dataset. Our key insight is that this procedure biases the second classifier towards samples that the first one understands. Here we train two independent CNNs, a RGB network on web images and video frames and a second network using temporal information from optical flow. We show that training the networks independently is vastly superior to selecting the frames for the flow classifier by using our RGB network. Moreover, we show benefits in enriching the training set with different data sources from heterogeneous public web databases. We demonstrate that our framework outperforms all other webly-supervised methods on two public benchmarks, UCF-101 and Thumos'14.
Submitted 14 June, 2017;
originally announced June 2017.
-
Localization of JPEG double compression through multi-domain convolutional neural networks
Authors:
Irene Amerini,
Tiberio Uricchio,
Lamberto Ballan,
Roberto Caldelli
Abstract:
When an attacker wants to falsify an image, in most cases she/he will perform a JPEG recompression. Different techniques have been developed based on diverse theoretical assumptions, but very effective solutions have not been developed yet. Recently, machine learning based approaches have started to appear in the field of image forensics to solve diverse tasks such as acquisition source identification and forgery detection. In this last case, the aim would be to obtain a trained neural network able, given a to-be-checked image, to reliably localize the forged areas. With this in mind, our paper proposes a step forward in this direction by analyzing how a single or double JPEG compression can be revealed and localized using convolutional neural networks (CNNs). Different kinds of input to the CNN have been taken into consideration, and various experiments have been carried out, also trying to highlight potential issues to be further investigated.
Submitted 6 June, 2017;
originally announced June 2017.
-
Context-Aware Trajectory Prediction
Authors:
Federico Bartoli,
Giuseppe Lisanti,
Lamberto Ballan,
Alberto Del Bimbo
Abstract:
Human motion and behaviour in crowded spaces is influenced by several factors, such as the dynamics of other moving agents in the scene, as well as the static elements that might be perceived as points of attraction or obstacles. In this work, we present a new model for human trajectory prediction which is able to take advantage of both human-human and human-space interactions. The future trajectories of humans are generated by observing their past positions and interactions with the surroundings. To this end, we propose a "context-aware" recurrent neural network LSTM model, which can learn and predict human motion in crowded spaces such as a sidewalk, a museum or a shopping mall. We evaluate our model on a public pedestrian dataset, and we contribute a new challenging dataset that collects videos of humans that navigate in a (real) crowded space such as a big museum. Results show that our approach can predict human trajectories better when compared to previous state-of-the-art forecasting models.
Submitted 6 May, 2017;
originally announced May 2017.
-
Am I Done? Predicting Action Progress in Videos
Authors:
Federico Becattini,
Tiberio Uricchio,
Lorenzo Seidenari,
Lamberto Ballan,
Alberto Del Bimbo
Abstract:
In this paper we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task since it can be valuable for a wide range of interaction applications. To this end we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.
Submitted 9 March, 2020; v1 submitted 4 May, 2017;
originally announced May 2017.
-
Automatic Image Annotation via Label Transfer in the Semantic Space
Authors:
Tiberio Uricchio,
Lamberto Ballan,
Lorenzo Seidenari,
Alberto Del Bimbo
Abstract:
Automatic image annotation is among the fundamental problems in computer vision and pattern recognition, and it is becoming increasingly important in order to develop algorithms that are able to search and browse large-scale image collections. In this paper, we propose a label propagation framework based on Kernel Canonical Correlation Analysis (KCCA), which builds a latent semantic space where correlation of visual and textual features are well preserved into a semantic embedding. The proposed approach is robust and can work either when the training set is well annotated by experts, as well as when it is noisy such as in the case of user-generated tags in social media. We report extensive results on four popular datasets. Our results show that our KCCA-based framework can be applied to several state-of-the-art label transfer methods to obtain significant improvements. Our approach works even with the noisy tags of social users, provided that appropriate denoising is performed. Experiments on a large scale setting show that our method can provide some benefits even when the semantic space is estimated on a subset of training images.
Submitted 1 June, 2017; v1 submitted 16 May, 2016;
originally announced May 2016.
-
Knowledge Transfer for Scene-specific Motion Prediction
Authors:
Lamberto Ballan,
Francesco Castaldo,
Alexandre Alahi,
Francesco Palmieri,
Silvio Savarese
Abstract:
When given a single frame of the video, humans can not only interpret the content of the scene, but also they are able to forecast the near future. This ability is mostly driven by their rich prior knowledge about the visual world, both in terms of (i) the dynamics of moving agents, as well as (ii) the semantic of the scene. In this work we exploit the interplay between these two key elements to predict scene-specific motion patterns. First, we extract patch descriptors encoding the probability of moving to the adjacent patches, and the probability of being in that particular patch or changing behavior. Then, we introduce a Dynamic Bayesian Network which exploits this scene specific knowledge for trajectory prediction. Experimental results demonstrate that our method is able to accurately predict trajectories and transfer predictions to a novel scene characterized by similar elements.
Submitted 25 July, 2016; v1 submitted 22 March, 2016;
originally announced March 2016.
-
Love Thy Neighbors: Image Annotation by Exploiting Image Metadata
Authors:
Justin Johnson,
Lamberto Ballan,
Fei-Fei Li
Abstract:
Some images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata. We build on this intuition to improve multilabel image annotation. Our model uses image metadata nonparametrically to generate neighborhoods of related images using Jaccard similarities, then uses a deep neural network to blend visual information from the image and its neighbors. Prior work typically models image metadata parametrically; in contrast, our nonparametric treatment allows our model to perform well even when the vocabulary of metadata changes between training and testing. We perform comprehensive experiments on the NUS-WIDE dataset, where we show that our model outperforms state-of-the-art methods for multilabel image annotation even when our model is forced to generalize to new types of metadata.
Submitted 21 September, 2015; v1 submitted 30 August, 2015;
originally announced August 2015.
-
Capturing Hands in Action using Discriminative Salient Points and Physics Simulation
Authors:
Dimitrios Tzionas,
Luca Ballan,
Abhilash Srikantha,
Pablo Aponte,
Marc Pollefeys,
Juergen Gall
Abstract:
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
Submitted 7 March, 2016; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval
Authors:
Xirong Li,
Tiberio Uricchio,
Lamberto Ballan,
Marco Bertini,
Cees G. M. Snoek,
Alberto Del Bimbo
Abstract:
Where previous reviews on content-based image retrieval emphasize on what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems, i.e., image tag assignment, refinement, and tag-based image retrieval is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, i.e. estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this paper introduces a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and difference, and recognize their merits and limitations. For a head-to-head comparison between the state-of-the-art, a new experimental protocol is presented, with training sets containing 10k, 100k and 1m images and an evaluation on three test sets, contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future.
Submitted 23 March, 2016; v1 submitted 27 March, 2015;
originally announced March 2015.
-
A Data-Driven Approach for Tag Refinement and Localization in Web Videos
Authors:
Lamberto Ballan,
Marco Bertini,
Giuseppe Serra,
Alberto Del Bimbo
Abstract:
Tagging of visual content is becoming more and more widespread as web-based services and social networks have popularized tagging functionalities among their users. These user-generated tags are used to ease browsing and exploration of media collections, e.g. using tag clouds, or to retrieve multimedia content. However, not all media are equally tagged by users. With current systems it is easy to tag a single photo, and even tagging a part of a photo, like a face, has become common in sites like Flickr and Facebook. On the other hand, tagging a video sequence is more complicated and time consuming, so users just tag the overall content of a video. In this paper we present a method for automatic video annotation that increases the number of tags originally provided by users, and localizes them temporally, associating tags to keyframes. Our approach exploits collective knowledge embedded in user-generated tags and web sources, and visual similarity of keyframes and images uploaded to social sites like YouTube and Flickr, as well as web sources like Google and Bing. Given a keyframe, our method is able to select on the fly from these visual sources the training exemplars that should be the most relevant for this test sample, and proceeds to transfer labels across similar images. Compared to existing video tagging approaches that require training classifiers for each tag, our system has few parameters, is easy to implement and can deal with an open vocabulary scenario. We demonstrate the approach on tag refinement and localization on DUT-WEBV, a large dataset of web videos, and show state-of-the-art results.
Submitted 28 May, 2015; v1 submitted 2 July, 2014;
originally announced July 2014.