Search | arXiv e-print repository

CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement

Authors: Carlos Plou, Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Ana C. Murillo

Abstract: The goal of the Step Grounding task is to locate temporal boundaries of activities based on natural language descriptions. This technical report introduces a Bayesian-VSLNet to address the challenge of identifying such temporal segments in lengthy, untrimmed egocentric videos. Our model significantly improves upon traditional models by incorporating a novel Bayesian temporal-order prior during inf… ▽ More The goal of the Step Grounding task is to locate temporal boundaries of activities based on natural language descriptions. This technical report introduces a Bayesian-VSLNet to address the challenge of identifying such temporal segments in lengthy, untrimmed egocentric videos. Our model significantly improves upon traditional models by incorporating a novel Bayesian temporal-order prior during inference, enhancing the accuracy of moment predictions. This prior adjusts for cyclic and repetitive actions within videos. Our evaluations demonstrate superior performance over existing methods, achieving state-of-the-art results on the Ego4D Goal-Step dataset with a 35.18 Recall Top-1 at 0.3 IoU and 20.48 Recall Top-1 at 0.5 IoU on the test set. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2404.01867 [pdf, other]

Active Exploration in Bayesian Model-based Reinforcement Learning for Robot Manipulation

Authors: Carlos Plou, Ana C. Murillo, Ruben Martinez-Cantin

Abstract: Efficiently tackling multiple tasks within complex environment, such as those found in robot manipulation, remains an ongoing challenge in robotics and an opportunity for data-driven solutions, such as reinforcement learning (RL). Model-based RL, by building a dynamic model of the robot, enables data reuse and transfer learning between tasks with the same robot and similar environment. Furthermore… ▽ More Efficiently tackling multiple tasks within complex environment, such as those found in robot manipulation, remains an ongoing challenge in robotics and an opportunity for data-driven solutions, such as reinforcement learning (RL). Model-based RL, by building a dynamic model of the robot, enables data reuse and transfer learning between tasks with the same robot and similar environment. Furthermore, data gathering in robotics is expensive and we must rely on data efficient approaches such as model-based RL, where policy learning is mostly conducted on cheaper simulations based on the learned model. Therefore, the quality of the model is fundamental for the performance of the posterior tasks. In this work, we focus on improving the quality of the model and maintaining the data efficiency by performing active learning of the dynamic model during a preliminary exploration phase based on maximize information gathering. We employ Bayesian neural network models to represent, in a probabilistic way, both the belief and information encoded in the dynamic model during exploration. With our presented strategies we manage to actively estimate the novelty of each transition, using this as the exploration reward. In this work, we compare several Bayesian inference methods for neural networks, some of which have never been used in a robotics context, and evaluate them in a realistic robot manipulation setup. Our experiments show the advantages of our Bayesian model-based RL approach, with similar quality in the results than relevant alternatives with much lower requirements regarding robot execution steps. Unlike related previous studies that focused the validation solely on toy problems, our research takes a step towards more realistic setups, tackling robotic arm end-tasks. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.01801 [pdf, other]

EventSleep: Sleep Activity Recognition with Event Cameras

Authors: Carlos Plou, Nerea Gallego, Alberto Sabater, Eduardo Montijano, Pablo Urcola, Luis Montesano, Ruben Martinez-Cantin, Ana C. Murillo

Abstract: Event cameras are a promising technology for activity recognition in dark environments due to their unique properties. However, real event camera datasets under low-lighting conditions are still scarce, which also limits the number of approaches to solve these kind of problems, hindering the potential of this technology in many applications. We present EventSleep, a new dataset and methodology to… ▽ More Event cameras are a promising technology for activity recognition in dark environments due to their unique properties. However, real event camera datasets under low-lighting conditions are still scarce, which also limits the number of approaches to solve these kind of problems, hindering the potential of this technology in many applications. We present EventSleep, a new dataset and methodology to address this gap and study the suitability of event cameras for a very relevant medical application: sleep monitoring for sleep disorders analysis. The dataset contains synchronized event and infrared recordings emulating common movements that happen during the sleep, resulting in a new challenging and unique dataset for activity recognition in dark environments. Our novel pipeline is able to achieve high accuracy under these challenging conditions and incorporates a Bayesian approach (Laplace ensembles) to increase the robustness in the predictions, which is fundamental for medical applications. Our work is the first application of Bayesian neural networks for event cameras, the first use of Laplace ensembles in a realistic problem, and also demonstrates for the first time the potential of event cameras in a new application domain: to enhance current sleep evaluation procedures. Our activity recognition results highlight the potential of event cameras under dark conditions, and its capacity and robustness for sleep activity recognition, and open problems as the adaptation of event data pre-processing techniques to dark environments. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.18033 [pdf, other]

SpectralWaste Dataset: Multimodal Data for Waste Sorting Automation

Authors: Sara Casao, Fernando Peña, Alberto Sabater, Rosa Castillón, Darío Suárez, Eduardo Montijano, Ana C. Murillo

Abstract: The increase in non-biodegradable waste is a worldwide concern. Recycling facilities play a crucial role, but their automation is hindered by the complex characteristics of waste recycling lines like clutter or object deformation. In addition, the lack of publicly available labeled data for these environments makes develo** robust perception systems challenging. Our work explores the benefits of… ▽ More The increase in non-biodegradable waste is a worldwide concern. Recycling facilities play a crucial role, but their automation is hindered by the complex characteristics of waste recycling lines like clutter or object deformation. In addition, the lack of publicly available labeled data for these environments makes develo** robust perception systems challenging. Our work explores the benefits of multimodal perception for object segmentation in real waste management scenarios. First, we present SpectralWaste, the first dataset collected from an operational plastic waste sorting facility that provides synchronized hyperspectral and conventional RGB images. This dataset contains labels for several categories of objects that commonly appear in sorting plants and need to be detected and separated from the main trash flow for several reasons, such as security in the management line or reuse. Additionally, we propose a pipeline employing different object segmentation architectures and evaluate the alternatives on our dataset, conducting an extensive analysis for both multimodal and unimodal alternatives. Our evaluation pays special attention to efficiency and suitability for real-time processing and demonstrates how HSI can bring a boost to RGB-only perception in these realistic industrial settings without much computational overhead. △ Less

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.13467 [pdf, other]

CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models

Authors: Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract: This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations based on natural language. The algorithm begins by enriching a provided word, to compose a text prompt that serves as input to an iterative approach to find the formation that best matches the provided word. The algorithm iteratively refines formations of robots to align with the textual de… ▽ More This paper introduces CLIPSwarm, a new algorithm designed to automate the modeling of swarm drone formations based on natural language. The algorithm begins by enriching a provided word, to compose a text prompt that serves as input to an iterative approach to find the formation that best matches the provided word. The algorithm iteratively refines formations of robots to align with the textual description, employing different steps for "exploration" and "exploitation". Our framework is currently evaluated on simple formation targets, limited to contour shapes. A formation is visually represented through alpha-shape contours and the most representative color is automatically found for the input word. To measure the similarity between the description and the visual representation of the formation, we use CLIP [1], encoding text and images into vectors and assessing their similarity. Subsequently, the algorithm rearranges the formation to visually represent the word more effectively, within the given constraints of available drones. Control actions are then assigned to the drones, ensuring robotic behavior and collision-free movement. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural language descriptions. The algorithm's versatility is showcased through the execution of drone shows in photorealistic simulation with varying shapes. We refer the reader to the supplementary video for a visual reference of the results. △ Less

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2401.05272 [pdf, other]

CineMPC: A Fully Autonomous Drone Cinematography System Incorporating Zoom, Focus, Pose, and Scene Composition

Authors: Pablo Pueyo, Juan Dendarieta, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract: We present CineMPC, a complete cinematographic system that autonomously controls a drone to film multiple targets recording user-specified aesthetic objectives. Existing solutions in autonomous cinematography control only the camera extrinsics, namely its position, and orientation. In contrast, CineMPC is the first solution that includes the camera intrinsic parameters in the control loop, which a… ▽ More We present CineMPC, a complete cinematographic system that autonomously controls a drone to film multiple targets recording user-specified aesthetic objectives. Existing solutions in autonomous cinematography control only the camera extrinsics, namely its position, and orientation. In contrast, CineMPC is the first solution that includes the camera intrinsic parameters in the control loop, which are essential tools for controlling cinematographic effects like focus, depth-of-field, and zoom. The system estimates the relative poses between the targets and the camera from an RGB-D image and optimizes a trajectory for the extrinsic and intrinsic camera parameters to film the artistic and technical requirements specified by the user. The drone and the camera are controlled in a nonlinear Model Predicted Control (MPC) loop by re-optimizing the trajectory at each time step in response to current conditions in the scene. The perception system of CineMPC can track the targets' position and orientation despite the camera effects. Experiments in a photorealistic simulation and with a real platform demonstrate the capabilities of the system to achieve a full array of cinematographic effects that are not possible without the control of the intrinsics of the camera. Code for CineMPC is implemented following a modular architecture in ROS and released to the community. △ Less

Submitted 10 January, 2024; originally announced January 2024.

arXiv:2311.11047 [pdf, other]

CLIPSwarm: Converting text into formations of robots

Authors: Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract: We present CLIPSwarm, an algorithm to generate robot swarm formations from natural language descriptions. CLIPSwarm receives an input text and finds the position of the robots to form a shape that corresponds to the given text. To do so, we implement a variation of the Montecarlo particle filter to obtain a matching formation iteratively. In every iteration, we generate a set of new formations and… ▽ More We present CLIPSwarm, an algorithm to generate robot swarm formations from natural language descriptions. CLIPSwarm receives an input text and finds the position of the robots to form a shape that corresponds to the given text. To do so, we implement a variation of the Montecarlo particle filter to obtain a matching formation iteratively. In every iteration, we generate a set of new formations and evaluate their Clip Similarity with the given text, selecting the best formations according to this metric. This metric is obtained using Clip, [1], an existing foundation model trained to encode images and texts into vectors within a common latent space. The comparison between these vectors determines how likely the given text describes the shapes. Our initial proof of concept shows the potential of this solution to generate robot swarm formations just from natural language descriptions and demonstrates a novel application of foundation models, such as CLIP, in the field of multi-robot systems. In this first approach, we create formations using a Convex-Hull approach. Next steps include more robust and generic representation and optimization steps in the process of obtaining a suitable swarm formation. △ Less

Submitted 18 November, 2023; originally announced November 2023.

Comments: Please cite this article as "P. Pueyo, E. Montijano, A. C. Murillo, and M. Schwager, CLIPSwarm: Converting text into formations of robots. ICRA 2023 Workshop on Multi-Robot Learning"

Journal ref: ICRA 2023, Workshop on Multi-Robot Learning

arXiv:2310.03953 [pdf, other]

CineTransfer: Controlling a Robot to Imitate Cinematographic Style from a Single Example

Authors: Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract: This work presents CineTransfer, an algorithmic framework that drives a robot to record a video sequence that mimics the cinematographic style of an input video. We propose features that abstract the aesthetic style of the input video, so the robot can transfer this style to a scene with visual details that are significantly different from the input video. The framework builds upon CineMPC, a tool… ▽ More This work presents CineTransfer, an algorithmic framework that drives a robot to record a video sequence that mimics the cinematographic style of an input video. We propose features that abstract the aesthetic style of the input video, so the robot can transfer this style to a scene with visual details that are significantly different from the input video. The framework builds upon CineMPC, a tool that allows users to control cinematographic features, like subjects' position on the image and the depth of field, by manipulating the intrinsics and extrinsics of a cinematographic camera. However, CineMPC requires a human expert to specify the desired style of the shot (composition, camera motion, zoom, focus, etc). CineTransfer bridges this gap, aiming a fully autonomous cinematographic platform. The user chooses a single input video as a style guide. CineTransfer extracts and optimizes two important style features, the composition of the subject in the image and the scene depth of field, and provides instructions for CineMPC to control the robot to record an output sequence that matches these features as closely as possible. In contrast with other style transfer methods, our approach is a lightweight and portable framework which does not require deep network training or extensive datasets. Experiments with real and simulated videos demonstrate the system's ability to analyze and transfer style between recordings, and are available in the supplementary video. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2304.07059 [pdf, other]

A Framework for Fast Prototy** of Photo-realistic Environments with Multiple Pedestrians

Authors: Sara Casao, Andrés Otero, Álvaro Serra-Gómez, Ana C. Murillo, Javier Alonso-Mora, Eduardo Montijano

Abstract: Robotic applications involving people often require advanced perception systems to better understand complex real-world scenarios. To address this challenge, photo-realistic and physics simulators are gaining popularity as a means of generating accurate data labeling and designing scenarios for evaluating generalization capabilities, e.g., lighting changes, camera movements or different weather co… ▽ More Robotic applications involving people often require advanced perception systems to better understand complex real-world scenarios. To address this challenge, photo-realistic and physics simulators are gaining popularity as a means of generating accurate data labeling and designing scenarios for evaluating generalization capabilities, e.g., lighting changes, camera movements or different weather conditions. We develop a photo-realistic framework built on Unreal Engine and AirSim to generate easily scenarios with pedestrians and mobile robots. The framework is capable to generate random and customized trajectories for each person and provides up to 50 ready-to-use people models along with an API for their metadata retrieval. We demonstrate the usefulness of the proposed framework with a use case of multi-target tracking, a popular problem in real pedestrian scenarios. The notable feature variability in the obtained perception data is presented and evaluated. △ Less

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2211.12222 [pdf, other]

doi 10.1109/TPAMI.2023.3311336

Event Transformer+. A multi-purpose solution for efficient event data processing

Authors: Alberto Sabater, Luis Montesano, Ana C. Murillo

Abstract: Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current topperforming methods often ignore specific event-data properties, leading to the development of generic but computationally expensive algorithms, while e… ▽ More Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current topperforming methods often ignore specific event-data properties, leading to the development of generic but computationally expensive algorithms, while event-aware methods do not perform as well. We propose Event Transformer+, that improves our seminal work EvT with a refined patch-based event representation and a more robust backbone to achieve more accurate results, while still benefiting from event-data sparsity to increase its efficiency. Additionally, we show how our system can work with different data modalities and propose specific output heads, for event-stream classification (i.e. action recognition) and per-pixel predictions (dense depth estimation). Evaluation results show better performance to the state-of-the-art while requiring minimal computation resources, both on GPU and CPU. △ Less

Submitted 3 September, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2204.03355

arXiv:2204.14240 [pdf, other]

doi 10.1038/s41597-023-02564-7

EndoMapper dataset of complete calibrated endoscopy procedures

Authors: Pablo Azagra, Carlos Sostres, Ángel Ferrandez, Luis Riazuelo, Clara Tomasini, Oscar León Barbed, Javier Morlana, David Recasens, Victor M. Batlle, Juan J. Gómez-Rodríguez, Richard Elvira, Julia López, Cristina Oriol, Javier Civera, Juan D. Tardós, Ana Cristina Murillo, Angel Lanas, José M. M. Montiel

Abstract: Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introdu… ▽ More Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Map** (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset vary from the anatomical landmarks, procedure labeling, segmentations, reconstructions, simulated sequences with ground truth and same patient procedures. The software used in this paper is publicly available. △ Less

Submitted 10 October, 2023; v1 submitted 29 April, 2022; originally announced April 2022.

Comments: 17 pages, 14 figures, 8 tables

Journal ref: Sci Data 10, 671 (2023)

arXiv:2204.03355 [pdf, other]

Event Transformer. A sparse-aware solution for efficient event data processing

Authors: Alberto Sabater, Luis Montesano, Ana C. Murillo

Abstract: Event cameras are sensors of great interest for many applications that run in low-resource and challenging environments. They log sparse illumination changes with high temporal resolution and high dynamic range, while they present minimal power consumption. However, top-performing methods often ignore specific event-data properties, leading to the development of generic but computationally expensi… ▽ More Event cameras are sensors of great interest for many applications that run in low-resource and challenging environments. They log sparse illumination changes with high temporal resolution and high dynamic range, while they present minimal power consumption. However, top-performing methods often ignore specific event-data properties, leading to the development of generic but computationally expensive algorithms. Efforts toward efficient solutions usually do not achieve top-accuracy results for complex tasks. This work proposes a novel framework, Event Transformer (EvT), that effectively takes advantage of event-data properties to be highly efficient and accurate. We introduce a new patch-based event representation and a compact transformer-like architecture to process it. EvT is evaluated on different event-based benchmarks for action and gesture recognition. Evaluation results show better or comparable accuracy to the state-of-the-art while requiring significantly less computation resources, which makes EvT able to work with minimal latency both on GPU and CPU. △ Less

Submitted 18 April, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

arXiv:2203.04302 [pdf, other]

doi 10.1007/978-3-031-21083-9_5

SuperPoint features in endoscopy

Authors: O. L. Barbed, F. Chadebecq, J. Morlana, J. M. Martínez-Montiel, A. C. Murillo

Abstract: There is often a significant gap between research results and applicability in routine medical practice. This work studies the performance of well-known local features on a medical dataset captured during routine colonoscopy procedures. Local feature extraction and matching is a key step for many computer vision applications, specially regarding 3D modelling. In the medical domain, handcrafted loc… ▽ More There is often a significant gap between research results and applicability in routine medical practice. This work studies the performance of well-known local features on a medical dataset captured during routine colonoscopy procedures. Local feature extraction and matching is a key step for many computer vision applications, specially regarding 3D modelling. In the medical domain, handcrafted local features such as SIFT, with public pipelines such as COLMAP, are still a predominant tool for this kind of tasks. We explore the potential of the well known self-supervised approach SuperPoint, present an adapted variation for the endoscopic domain and propose a challenging evaluation framework. SuperPoint based models achieve significantly higher matching quality than commonly used local features in this domain. Our adapted model avoids features within specularity regions, a frequent and problematic artifact in endoscopic images, with consequent benefits for matching and reconstruction results. △ Less

Submitted 9 January, 2023; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 9 pages, 5 figures

arXiv:2104.13415 [pdf, other]

Semi-Supervised Semantic Segmentation with Pixel-Level Contrastive Learning from a Class-wise Memory Bank

Authors: Inigo Alonso, Alberto Sabater, David Ferstl, Luis Montesano, Ana C. Murillo

Abstract: This work presents a novel approach for semi-supervised semantic segmentation. The key element of this approach is our contrastive learning module that enforces the segmentation network to yield similar pixel-level feature representations for same-class samples across the whole dataset. To achieve this, we maintain a memory bank continuously updated with relevant and high-quality feature vectors f… ▽ More This work presents a novel approach for semi-supervised semantic segmentation. The key element of this approach is our contrastive learning module that enforces the segmentation network to yield similar pixel-level feature representations for same-class samples across the whole dataset. To achieve this, we maintain a memory bank continuously updated with relevant and high-quality feature vectors from labeled data. In an end-to-end training, the features from both labeled and unlabeled data are optimized to be similar to same-class samples from the memory bank. Our approach outperforms the current state-of-the-art for semi-supervised semantic segmentation and semi-supervised domain adaptation on well-known public benchmarks, with larger improvements on the most challenging scenarios, i.e., less available labeled data. https://github.com/Shathe/SemiSeg-Contrastive △ Less

Submitted 6 August, 2021; v1 submitted 27 April, 2021; originally announced April 2021.

Journal ref: IEEE International Conference on Computer Vision 2021

arXiv:2104.03634 [pdf, other]

CineMPC: Controlling Camera Intrinsics and Extrinsics for Autonomous Cinematography

Authors: Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager

Abstract: We present CineMPC, an algorithm to autonomously control a UAV-borne video camera in a nonlinear Model Predicted Control (MPC) loop. CineMPC controls both the position and orientation of the camera -- the camera extrinsics -- as well as the lens focal length, focal distance, and aperture -- the camera intrinsics. While some existing solutions autonomously control the position and orientation of th… ▽ More We present CineMPC, an algorithm to autonomously control a UAV-borne video camera in a nonlinear Model Predicted Control (MPC) loop. CineMPC controls both the position and orientation of the camera -- the camera extrinsics -- as well as the lens focal length, focal distance, and aperture -- the camera intrinsics. While some existing solutions autonomously control the position and orientation of the camera, no existing solutions also control the intrinsic parameters, which are essential tools for rich cinematographic expression. The intrinsic parameters control the parts of the scene that are focused or blurred, the viewers' perception of depth in the scene and the position of the targets in the image. CineMPC closes the loop from camera images to UAV trajectory and lens parameters in order to follow the desired relative trajectory and image composition as the targets move through the scene. Experiments using a photo-realistic environment demonstrate the capabilities of the proposed control framework to successfully achieve a full array of cinematographic effects not possible without full camera control. △ Less

Submitted 22 February, 2022; v1 submitted 8 April, 2021; originally announced April 2021.

arXiv:2103.02303 [pdf, other]

Domain and View-point Agnostic Hand Action Recognition

Authors: Alberto Sabater, Iñigo Alonso, Luis Montesano, Ana C. Murillo

Abstract: Hand action recognition is a special case of action recognition with applications in human-robot interaction, virtual reality or life-logging systems. Building action classifiers able to work for such heterogeneous action domains is very challenging. There are very subtle changes across different actions from a given application but also large variations across domains (e.g. virtual reality vs lif… ▽ More Hand action recognition is a special case of action recognition with applications in human-robot interaction, virtual reality or life-logging systems. Building action classifiers able to work for such heterogeneous action domains is very challenging. There are very subtle changes across different actions from a given application but also large variations across domains (e.g. virtual reality vs life-logging). This work introduces a novel skeleton-based hand motion representation model that tackles this problem. The framework we propose is agnostic to the application domain or camera recording view-point. When working on a single domain (intra-domain action classification) our approach performs better or similar to current state-of-the-art methods on well-known hand action recognition benchmarks. And, more importantly, when performing hand action recognition for action domains and camera perspectives which our approach has not been trained for (cross-domain action classification), our proposed framework achieves comparable performance to intra-domain state-of-the-art methods. These experiments show the robustness and generalization capabilities of our framework. △ Less

Submitted 7 October, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

arXiv:2102.08997 [pdf, other]

One-shot action recognition in challenging therapy scenarios

Authors: Alberto Sabater, Laura Santos, Jose Santos-Victor, Alexandre Bernardino, Luis Montesano, Ana C. Murillo

Abstract: One-shot action recognition aims to recognize new action categories from a single reference example, typically referred to as the anchor example. This work presents a novel approach for one-shot action recognition in the wild that computes motion representations robust to variable kinematic conditions. One-shot action recognition is then performed by evaluating anchor and target motion representat… ▽ More One-shot action recognition aims to recognize new action categories from a single reference example, typically referred to as the anchor example. This work presents a novel approach for one-shot action recognition in the wild that computes motion representations robust to variable kinematic conditions. One-shot action recognition is then performed by evaluating anchor and target motion representations. We also develop a set of complementary steps that boost the action recognition performance in the most challenging scenarios. Our approach is evaluated on the public NTU-120 one-shot action recognition benchmark, outperforming previous action recognition models. Besides, we evaluate our framework on a real use-case of therapy with autistic people. These recordings are particularly challenging due to high-level artifacts from the patient motion. Our results provide not only quantitative but also online qualitative measures, essential for the patient evaluation and monitoring during the actual therapy. △ Less

Submitted 29 July, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

arXiv:2010.13701 [pdf, other]

Distributed Multi-Target Tracking in Camera Networks

Authors: Sara Casao, Abel Naya, Ana C. Murillo, Eduardo Montijano

Abstract: Most recent works on multi-target tracking with multiple cameras focus on centralized systems. In contrast, this paper presents a multi-target tracking approach implemented in a distributed camera network. The advantages of distributed systems lie in lighter communication management, greater robustness to failures and local decision making. On the other hand, data association and information fusio… ▽ More Most recent works on multi-target tracking with multiple cameras focus on centralized systems. In contrast, this paper presents a multi-target tracking approach implemented in a distributed camera network. The advantages of distributed systems lie in lighter communication management, greater robustness to failures and local decision making. On the other hand, data association and information fusion are more challenging than in a centralized setup, mostly due to the lack of global and complete information. The proposed algorithm boosts the benefits of the Distributed-Consensus Kalman Filter with the support of a re-identification network and a distributed tracker manager module to facilitate consistent information. These techniques complement each other and facilitate the cross-camera data association in a simple and effective manner. We evaluate the whole system with known public data sets under different conditions demonstrating the advantages of combining all the modules. In addition, we compare our algorithm to some existing centralized tracking methods, outperforming their behavior in terms of accuracy and bandwidth usage. △ Less

Submitted 16 April, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

arXiv:2010.12239 [pdf, other]

Domain Adaptation in LiDAR Semantic Segmentation by Aligning Class Distributions

Authors: Inigo Alonso, Luis Riazuelo, Luis Montesano, Ana C. Murillo

Abstract: LiDAR semantic segmentation provides 3D semantic information about the environment, an essential cue for intelligent systems during their decision making processes. Deep neural networks are achieving state-of-the-art results on large public benchmarks on this task. Unfortunately, finding models that generalize well or adapt to additional domains, where data distribution is different, remains a maj… ▽ More LiDAR semantic segmentation provides 3D semantic information about the environment, an essential cue for intelligent systems during their decision making processes. Deep neural networks are achieving state-of-the-art results on large public benchmarks on this task. Unfortunately, finding models that generalize well or adapt to additional domains, where data distribution is different, remains a major challenge. This work addresses the problem of unsupervised domain adaptation for LiDAR semantic segmentation models. Our approach combines novel ideas on top of the current state-of-the-art approaches and yields new state-of-the-art results. We propose simple but effective strategies to reduce the domain shift by aligning the data distribution on the input space. Besides, we propose a learning-based approach that aligns the distribution of the semantic classes of the target domain to the source domain. The presented ablation study shows how each part contributes to the final performance. Our strategy is shown to outperform previous approaches for domain adaptation with comparisons run on three different domains. △ Less

Submitted 3 December, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: 7 pages, 3 figures

arXiv:2009.11050 [pdf, other]

Robust and efficient post-processing for video object detection

Authors: Alberto Sabater, Luis Montesano, Ana C. Murillo

Abstract: Object recognition in video is an important task for plenty of applications, including autonomous driving perception, surveillance tasks, wearable devices or IoT networks. Object recognition using video data is more challenging than using still images due to blur, occlusions or rare object poses. Specific video detectors with high computational cost or standard image detectors together with a fast… ▽ More Object recognition in video is an important task for plenty of applications, including autonomous driving perception, surveillance tasks, wearable devices or IoT networks. Object recognition using video data is more challenging than using still images due to blur, occlusions or rare object poses. Specific video detectors with high computational cost or standard image detectors together with a fast post-processing algorithm achieve the current state-of-the-art. This work introduces a novel post-processing pipeline that overcomes some of the limitations of previous post-processing methods by introducing a learning-based similarity evaluation between detections across frames. Our method improves the results of state-of-the-art specific video detectors, specially regarding fast moving objects, and presents low resource requirements. And applied to efficient still image detectors, such as YOLO, provides comparable results to much more computationally intensive detectors. △ Less

Submitted 23 September, 2020; originally announced September 2020.

Comments: Submitted to the International Conference on Intelligent Robots and Systems, IROS 2020

arXiv:2009.04932 [pdf, other]

doi 10.1109/ETFA.2019.8869019

Performance of object recognition in wearable videos

Authors: Alberto Sabater, Luis Montesano, Ana C. Murillo

Abstract: Wearable technologies are enabling plenty of new applications of computer vision, from life logging to health assistance. Many of them are required to recognize the elements of interest in the scene captured by the camera. This work studies the problem of object detection and localization on videos captured by this type of camera. Wearable videos are a much more challenging scenario for object det… ▽ More Wearable technologies are enabling plenty of new applications of computer vision, from life logging to health assistance. Many of them are required to recognize the elements of interest in the scene captured by the camera. This work studies the problem of object detection and localization on videos captured by this type of camera. Wearable videos are a much more challenging scenario for object detection than standard images or even another type of videos, due to lower quality images (e.g. poor focus) or high clutter and occlusion common in wearable recordings. Existing work typically focuses on detecting the objects of focus or those being manipulated by the user wearing the camera. We perform a more general evaluation of the task of object detection in this type of video, because numerous applications, such as marketing studies, also need detecting objects which are not in focus by the user. This work presents a thorough study of the well known YOLO architecture, that offers an excellent trade-off between accuracy and speed, for the particular case of object detection in wearable video. We focus our study on the public ADL Dataset, but we also use additional public data for complementary evaluations. We run an exhaustive set of experiments with different variations of the original architecture and its training strategy. Our experiments drive to several conclusions about the most promising directions for our goal and point us to further research steps to improve detection in wearable videos. △ Less

Submitted 10 September, 2020; originally announced September 2020.

Comments: Emerging Technologies and Factory Automation, ETFA, 2019

arXiv:2002.10893 [pdf, other]

3D-MiniNet: Learning a 2D Representation from Point Clouds for Fast and Efficient 3D LIDAR Semantic Segmentation

Authors: Iñigo Alonso, Luis Riazuelo, Luis Montesano, Ana C. Murillo

Abstract: LIDAR semantic segmentation, which assigns a semantic label to each 3D point measured by the LIDAR, is becoming an essential task for many robotic applications such as autonomous driving. Fast and efficient semantic segmentation methods are needed to match the strong computational and temporal restrictions of many of these real-world applications. This work presents 3D-MiniNet, a novel approach… ▽ More LIDAR semantic segmentation, which assigns a semantic label to each 3D point measured by the LIDAR, is becoming an essential task for many robotic applications such as autonomous driving. Fast and efficient semantic segmentation methods are needed to match the strong computational and temporal restrictions of many of these real-world applications. This work presents 3D-MiniNet, a novel approach for LIDAR semantic segmentation that combines 3D and 2D learning layers. It first learns a 2D representation from the raw points through a novel projection which extracts local and global information from the 3D data. This representation is fed to an efficient 2D Fully Convolutional Neural Network (FCNN) that produces a 2D semantic segmentation. These 2D semantic labels are re-projected back to the 3D space and enhanced through a post-processing module. The main novelty in our strategy relies on the projection learning module. Our detailed ablation study shows how each component contributes to the final performance of 3D-MiniNet. We validate our approach on well known public benchmarks (SemanticKITTI and KITTI), where 3D-MiniNet gets state-of-the-art results while being faster and more parameter-efficient than previous methods. △ Less

Submitted 27 April, 2021; v1 submitted 25 February, 2020; originally announced February 2020.

Comments: 8 pages, 4 figures

arXiv:1912.09614 [pdf]

Features or Shape? Tackling the False Dichotomy of Time Series Classification

Authors: Sara Alaee, Alireza Abdoli, Christian Shelton, Amy C. Murillo, Alec C. Gerry, Eamonn Keogh

Abstract: Time series classification is an important task in its own right, and it is often a precursor to further downstream analytics. To date, virtually all works in the literature have used either shape-based classification using a distance measure or feature-based classification after finding some suitable features for the domain. It seems to be underappreciated that in many datasets it is the case tha… ▽ More Time series classification is an important task in its own right, and it is often a precursor to further downstream analytics. To date, virtually all works in the literature have used either shape-based classification using a distance measure or feature-based classification after finding some suitable features for the domain. It seems to be underappreciated that in many datasets it is the case that some classes are best discriminated with features, while others are best discriminated with shape. Thus, making the shape vs. feature choice will condemn us to poor results, at least for some classes. In this work, we propose a new model for classifying time series that allows the use of both shape and feature-based measures, when warranted. Our algorithm automatically decides which approach is best for which class, and at query time chooses which classifier to trust the most. We evaluate our idea on real world datasets and demonstrate that our ideas produce statistically significant improvement in classification accuracy. △ Less

Submitted 19 December, 2019; originally announced December 2019.

arXiv:1912.05913 [pdf]

doi 10.1109/BigData47090.2019.9005596

Time Series Classification: Lessons Learned in the (Literal) Field while Studying Chicken Behavior

Authors: Alireza Abdoli, Amy C. Murillo, Alec C. Gerry, Eamonn J. Keogh

Abstract: Poultry farms are a major contributor to the human food chain. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standards of animal welfare. Recent advances in sensing technologies and machine learning allow the possibility of monitoring birds, and employing the lessons learned to i… ▽ More Poultry farms are a major contributor to the human food chain. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standards of animal welfare. Recent advances in sensing technologies and machine learning allow the possibility of monitoring birds, and employing the lessons learned to improve the welfare for all birds. This task superficially appears to be easy, yet, studying behavioral patterns involves collecting enormous amounts of data, justifying the term Big Data. Before the big data can be used for analytical purposes to tease out meaningful, well-conserved behavioral patterns, the collected data needs to be pre-processed. The pre-processing refers to processes for cleansing and preparing data so that it is in the format ready to be analyzed by downstream algorithms, such as classification and clustering algorithms. However, as we shall demonstrate, efficient pre-processing of chicken big data is both non-trivial and crucial towards success of further analytics. △ Less

Submitted 20 December, 2019; v1 submitted 21 November, 2019; originally announced December 2019.

Comments: arXiv admin note: text overlap with arXiv:1811.03149

arXiv:1811.12039 [pdf, other]

EV-SegNet: Semantic Segmentation for Event-based Cameras

Authors: Iñigo Alonso, Ana C. Murillo

Abstract: Event cameras, or Dynamic Vision Sensor (DVS), are very promising sensors which have shown several advantages over frame based cameras. However, most recent work on real applications of these cameras is focused on 3D reconstruction and 6-DOF camera tracking. Deep learning based approaches, which are leading the state-of-the-art in visual recognition tasks, could potentially take advantage of the b… ▽ More Event cameras, or Dynamic Vision Sensor (DVS), are very promising sensors which have shown several advantages over frame based cameras. However, most recent work on real applications of these cameras is focused on 3D reconstruction and 6-DOF camera tracking. Deep learning based approaches, which are leading the state-of-the-art in visual recognition tasks, could potentially take advantage of the benefits of DVS, but some adaptations are needed still needed in order to effectively work on these cameras. This work introduces a first baseline for semantic segmentation with this kind of data. We build a semantic segmentation CNN based on state-of-the-art techniques which takes event information as the only input. Besides, we propose a novel representation for DVS data that outperforms previously used event representations for related tasks. Since there is no existing labeled dataset for this task, we propose how to automatically generate approximated semantic segmentation labels for some sequences of the DDD17 dataset, which we publish together with the model, and demonstrate they are valid to train a model for DVS data only. We compare our results on semantic segmentation from DVS data with results using corresponding grayscale images, demonstrating how they are complementary and worth combining. △ Less

Submitted 29 November, 2018; originally announced November 2018.

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019

arXiv:1811.03149 [pdf]

Time Series Classification to Improve Poultry Welfare

Authors: Alireza Abdoli, Amy C. Murillo, Chin-Chia M. Yeh, Alec C. Gerry, Eamonn J. Keogh

Abstract: Poultry farms are an important contributor to the human food chain. Worldwide, humankind keeps an enormous number of domesticated birds (e.g. chickens) for their eggs and their meat, providing rich sources of low-fat protein. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standard… ▽ More Poultry farms are an important contributor to the human food chain. Worldwide, humankind keeps an enormous number of domesticated birds (e.g. chickens) for their eggs and their meat, providing rich sources of low-fat protein. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standards of animal welfare. Recent advances in sensing technologies and machine learning allow the possibility of automatically assessing the health of some individual birds, and employing the lessons learned to improve the welfare for all birds. This task superficially appears to be easy, given the dramatic progress in recent years in classifying human behaviors, and given that human behaviors are presumably more complex. However, as we shall demonstrate, classifying chicken behaviors poses several unique challenges, chief among which is creating a generalizable dictionary of behaviors from sparse and noisy data. In this work we introduce a novel time series dictionary learning algorithm that can robustly learn from weakly labeled data sources. △ Less

Submitted 7 November, 2018; originally announced November 2018.

Showing 1–26 of 26 results for author: Murillo, A C