-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love,
et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
3D Vertebrae Measurements: Assessing Vertebral Dimensions in Human Spine Mesh Models Using Local Anatomical Vertebral Axes
Authors:
Ivanna Kramer,
Vinzent Rittel,
Lara Blomenkamp,
Sabine Bauer,
Dietrich Paulus
Abstract:
Vertebral morphological measurements are important across various disciplines, including spinal biomechanics and clinical applications, pre- and post-operatively. These measurements also play a crucial role in anthropological longitudinal studies, where spinal metrics are repeatedly documented over extended periods. Traditionally, such measurements have been manually conducted, a process that is time-consuming. In this study, we introduce a novel, fully automated method for measuring vertebral morphology using 3D meshes of lumbar and thoracic spine models. Our experimental results demonstrate the method's capability to accurately measure low-resolution patient-specific vertebral meshes with mean absolute error (MAE) of 1.09 mm and those derived from artificially created lumbar spines, where the average MAE value was 0.7 mm. Our qualitative analysis indicates that measurements obtained using our method on 3D spine models can be accurately reprojected back onto the original medical images if these images are available.
Submitted 2 February, 2024;
originally announced February 2024.
-
Generation of Synthetic Images for Pedestrian Detection Using a Sequence of GANs
Authors:
Viktor Seib,
Malte Roosen,
Ida Germann,
Stefan Wirtz,
Dietrich Paulus
Abstract:
Creating annotated datasets demands a substantial amount of manual effort. In this proof-of-concept work, we address this issue by proposing a novel image generation pipeline. The pipeline consists of three distinct generative adversarial networks (previously published), combined in a novel way to augment a dataset for pedestrian detection. Although the generated images are not always visually pleasing to the human eye, our detection benchmark reveals that the results substantially surpass the baseline. The presented proof-of-concept work was done in 2020 and is now published as a technical report after a three-year retention period.
Submitted 14 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee,
et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
On the interplay of data and cognitive bias in crisis information management -- An exploratory study on epidemic response
Authors:
David Paulus,
Ramian Fathi,
Frank Fiedrich,
Bartel Van de Walle,
Tina Comes
Abstract:
Humanitarian crises, such as the 2014 West Africa Ebola epidemic, challenge information management and thereby threaten the digital resilience of the responding organizations. Crisis information management (CIM) is characterised by the urgency to respond despite the uncertainty of the situation. Coupled with high stakes, limited resources and a high cognitive load, crises are prone to induce biases in the data and the cognitive processes of analysts and decision-makers. When biases remain undetected and untreated in CIM, they may lead to decisions based on biased information, increasing the risk of an inefficient response. Literature suggests that crisis response needs to address the initial uncertainty and possible biases by adapting to new and better information as it becomes available. However, we know little about whether adaptive approaches mitigate the interplay of data and cognitive biases.
We investigated this question in an exploratory, three-stage experiment on epidemic response. Our participants were experienced practitioners in the fields of crisis decision-making and information analysis. We found that analysts fail to successfully debias data, even when biases are detected, and that this failure can be attributed to undervaluing debiasing efforts in favor of rapid results. This failure leads to the development of biased information products that are conveyed to decision-makers, who consequently make decisions based on biased information. Confirmation bias reinforces the reliance on conclusions reached with biased data, leading to a vicious cycle, in which biased assumptions remain uncorrected. We suggest mindful debiasing as a possible counter-strategy against these bias effects in CIM.
Submitted 10 January, 2022;
originally announced January 2022.
-
Next-Best-View Estimation based on Deep Reinforcement Learning for Active Object Classification
Authors:
Christian Korbach,
Markus D. Solbach,
Raphael Memmesheimer,
Dietrich Paulus,
John K. Tsotsos
Abstract:
The presentation and analysis of image data from a single viewpoint are often not sufficient to solve a task. Several viewpoints are necessary to obtain more information. The next-best-view problem attempts to find the optimal viewpoint with the greatest information gain for the underlying task. In this work, a robot arm holds an object in its end-effector and searches for a sequence of next-best-views to explicitly identify the object. We use Soft Actor-Critic (SAC), a method of deep reinforcement learning, to learn these next-best-views for a specific set of objects. The evaluation shows that an agent can learn to determine an object pose to which the robot arm should move an object. This leads to a viewpoint that yields a more accurate prediction and better distinguishes the object from other objects. We make the code publicly available for the scientific community and for reproducibility.
Submitted 14 October, 2021; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Fusion-GCN: Multimodal Action Recognition using Graph Convolutional Networks
Authors:
Michael Duhme,
Raphael Memmesheimer,
Dietrich Paulus
Abstract:
In this paper, we present Fusion-GCN, an approach for multimodal action recognition using Graph Convolutional Networks (GCNs). Action recognition methods based around GCNs recently yielded state-of-the-art performance for skeleton-based action recognition. With Fusion-GCN, we propose to integrate various sensor data modalities into a graph that is trained using a GCN model for multi-modal action recognition. Additional sensor measurements are incorporated into the graph representation, either on a channel dimension (introducing additional node attributes) or spatial dimension (introducing new nodes). Fusion-GCN was evaluated on two publicly available datasets, UTD-MHAD and MMACT, and demonstrates flexible fusion of RGB sequences, inertial measurements and skeleton sequences. Our approach achieves comparable results on the UTD-MHAD dataset and improves the baseline on the large-scale MMACT dataset by a significant margin of up to 12.37% (F1-Measure) with the fusion of skeleton estimates and accelerometer measurements.
Submitted 27 September, 2021;
originally announced September 2021.
-
Skeleton-DML: Deep Metric Learning for Skeleton-Based One-Shot Action Recognition
Authors:
Raphael Memmesheimer,
Simon Häring,
Nick Theisen,
Dietrich Paulus
Abstract:
One-shot action recognition allows the recognition of human-performed actions with only a single training example. This can influence human-robot-interaction positively by enabling the robot to react to previously unseen behaviour. We formulate the one-shot action recognition problem as a deep metric learning problem and propose a novel image-based skeleton representation that performs well in a metric learning setting. Therefore, we train a model that projects the image representations into an embedding space. In the embedding space, similar actions have a low Euclidean distance while dissimilar actions have a higher distance. The one-shot action recognition problem becomes a nearest-neighbor search in a set of activity reference samples. We evaluate the performance of our proposed representation against a variety of other skeleton-based image representations. In addition, we present an ablation study that shows the influence of different embedding vector sizes, losses and augmentation. Our approach lifts the state-of-the-art by 3.3% for the one-shot action recognition protocol on the NTU RGB+D 120 dataset under a comparable training setup. With additional augmentation our result improved over 7.7%.
Submitted 8 March, 2021; v1 submitted 26 December, 2020;
originally announced December 2020.
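Illustrative note (not taken from the paper): the abstract above reduces one-shot recognition to a nearest-neighbor search over reference embeddings. A minimal sketch of that idea follows, assuming the embeddings have already been produced by some trained model; the function names, labels and vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_one_shot(query, references):
    """Return the label whose single reference embedding is closest to the query.

    `references` maps each action label to exactly one embedding
    (the one-shot reference sample for that class).
    """
    return min(references, key=lambda label: euclidean(query, references[label]))


# Hypothetical 2-D embeddings; real embeddings would have many more dimensions.
references = {"wave": [0.9, 0.1], "sit_down": [0.1, 0.8]}
print(classify_one_shot([0.8, 0.2], references))
```

In this toy setup the query lies closest to the "wave" reference, so that label is returned.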
-
SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition
Authors:
Raphael Memmesheimer,
Nick Theisen,
Dietrich Paulus
Abstract:
Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face-identification. We propose a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space. We encode signals into images and extract features using a deep residual CNN. Using triplet loss, we learn a feature embedding. The resulting encoder transforms features into an embedding space in which closer distances encode similar actions while higher distances encode different actions. Our approach is based on a signal level formulation and remains flexible across a variety of modalities. It further outperforms the baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by 5.6%. With just 60% of the training data, our approach still outperforms the baseline approach by 3.7%. With 40% of the training data, our approach performs comparably well to the second follow up. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data. Furthermore, our inter-joint and inter-sensor experiments suggest good capabilities on previously unseen setups.
Submitted 19 October, 2020; v1 submitted 23 April, 2020;
originally announced April 2020.
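Illustrative note (not taken from the paper): the abstract above mentions learning a feature embedding with triplet loss. The sketch below shows one common form of that loss, assuming Euclidean distance and a fixed margin; the paper's exact distance, margin and triplet-mining strategy are not stated in the abstract, and all names here are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss for a single (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is; otherwise it grows linearly with the violation.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Negative is far away: the margin is satisfied, so the loss is zero.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
# Negative is too close: the margin is violated, so the loss is positive.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))
```

Minimizing this quantity over many triplets pulls samples of the same action together and pushes different actions apart, which is what makes the nearest-neighbor search described in the abstract meaningful.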
-
Gimme Signals: Discriminative signal encoding for multimodal activity recognition
Authors:
Raphael Memmesheimer,
Nick Theisen,
Dietrich Paulus
Abstract:
We present a simple, yet effective and flexible method for action recognition supporting multiple sensor modalities. Multivariate signal sequences are encoded in an image and are then classified using a recently proposed EfficientNet CNN architecture. Our focus was to find an approach that generalizes well across different sensor modalities without specific adaptations while still achieving good results. We apply our method to 4 action recognition datasets containing skeleton sequences, inertial and motion capturing measurements as well as Wi-Fi fingerprints that range up to 120 action classes. Our method defines the current best CNN-based approach on the NTU RGB+D 120 dataset, lifts the state of the art on the ARIL Wi-Fi dataset by +6.78%, improves the UTD-MHAD inertial baseline by +14.4%, the UTD-MHAD skeleton baseline by 1.13% and achieves 96.11% on the Simitate motion capturing data (80/20 split). We further demonstrate experiments on both modality fusion on a signal level and signal reduction to prevent the representation from overloading.
Submitted 9 April, 2020; v1 submitted 13 March, 2020;
originally announced March 2020.
-
Effects of data ambiguity and cognitive biases on the interpretability of machine learning models in humanitarian decision making
Authors:
David Paulus,
Gerdien de Vries,
Bartel Van de Walle
Abstract:
The effectiveness of machine learning algorithms depends on the quality and amount of data and the operationalization and interpretation by the human analyst. In humanitarian response, data is often lacking or overburdening, thus ambiguous, and the time-scarce, volatile, insecure environments of humanitarian activities are likely to inflict cognitive biases. This paper proposes to research the effects of data ambiguity and cognitive biases on the interpretability of machine learning algorithms in humanitarian decision making.
Submitted 12 November, 2019;
originally announced November 2019.
-
Gesture Recognition in RGB Videos Using Human Body Keypoints and Dynamic Time Warping
Authors:
Pascal Schneider,
Raphael Memmesheimer,
Ivanna Kramer,
Dietrich Paulus
Abstract:
Gesture recognition opens up new ways for humans to intuitively interact with machines. Especially for service robots, gestures can be a valuable addition to the means of communication to, for example, draw the robot's attention to someone or something. Extracting a gesture from video data and classifying it is a challenging task and a variety of approaches have been proposed throughout the years. This paper presents a method for gesture recognition in RGB videos using OpenPose to extract the pose of a person and Dynamic Time Warping (DTW) in conjunction with One-Nearest-Neighbor (1NN) for time-series classification. The main features of this approach are the independence of any specific hardware and high flexibility, because new gestures can be added to the classifier by adding only a few examples of them. We utilize the robustness of the Deep Learning-based OpenPose framework while avoiding the data-intensive task of training a neural network ourselves. We demonstrate the classification performance of our method using a public dataset.
Submitted 25 June, 2019;
originally announced June 2019.
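Illustrative note (not taken from the paper): the abstract above combines Dynamic Time Warping with a one-nearest-neighbor classifier. The sketch below shows the textbook form of that combination on 1-D sequences, assuming absolute difference as the local cost; the paper works on body keypoints from OpenPose, and the gesture names and sequences here are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences.

    cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify_1nn(query, examples):
    """Return the label of the stored example with the smallest DTW distance."""
    return min(examples, key=lambda ex: dtw_distance(query, ex[1]))[0]


# Hypothetical labelled examples; adding a gesture means appending a few more.
examples = [("wave", [0, 1, 0, 1, 0]), ("raise", [0, 1, 2, 3, 3])]
print(classify_1nn([0, 0, 1, 0, 0, 1, 0], examples))
```

Because DTW tolerates differences in speed, the slower query still aligns with the stored "wave" example at zero cost.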
-
Simitate: A Hybrid Imitation Learning Benchmark
Authors:
Raphael Memmesheimer,
Ivanna Mykhalchyshyna,
Viktor Seib,
Dietrich Paulus
Abstract:
We present Simitate, a hybrid benchmarking suite targeting the evaluation of approaches for imitation learning. A dataset containing 1938 sequences where humans perform daily activities in a realistic environment is presented. The dataset is strongly coupled with an integration into a simulator. RGB and depth streams with a resolution of 960×540 at 30Hz and accurate ground truth poses for the demonstrator's hand, as well as the object in 6 DOF at 120Hz are provided. Along with our dataset we provide the 3D model of the used environment, labeled object images and pre-trained models. A benchmarking suite that aims at fostering comparability and reproducibility supports the development of imitation learning approaches. Further, we propose and integrate evaluation metrics for assessing the quality of effect and trajectory of the imitation performed in simulation. Simitate is available on our project website: https://agas.uni-koblenz.de/data/simitate/.
Submitted 15 May, 2019;
originally announced May 2019.
-
Scratchy: A Lightweight Modular Autonomous Robot for Robotic Competitions
Authors:
Raphael Memmesheimer,
Isabelle Kuhlmann,
Mark Mints,
Patrik Schmidt,
Christian Korbach,
Ida Germann,
Dietrich Paulus
Abstract:
We present Scratchy, a modular, lightweight robot built for low-budget competition attendance. Its base is mainly built with standard 4040 aluminium profiles and the robot is driven by four mecanum wheels on brushless DC motors. In combination with a laser range finder we use estimated odometry, which is calculated from encoders, for creating maps using a particle filter. An RGB-D camera is utilized for object detection and pose estimation. Additionally, there is the option to use a 6-DOF arm to grip objects from an estimated pose or generally for manipulation tasks. The robot can be assembled in less than one hour and fits into two pieces of hand luggage or one bigger suitcase. Therefore, it provides a huge advantage for student teams that participate in robot competitions like the European Robotics League or RoboCup. This keeps the funding required for participation, which is often a big hurdle for student teams to overcome, low. The software and additional hardware descriptions are available under: https://github.com/homer-robotics/scratchy.
Submitted 14 May, 2019;
originally announced May 2019.
-
Trends, Challenges and Adopted Strategies in RoboCup@Home (2019 version)
Authors:
Mauricio Matamoros,
Viktor Seib,
Dietrich Paulus
Abstract:
Scientific competitions are crucial in the field of service robotics. They foster knowledge exchange and benchmarking, allowing teams to test their research in unstandardized scenarios. In this paper, we summarize the trending solutions and approaches used in RoboCup@Home. Further on, we discuss the attained achievements and challenges to overcome in relation with the progress required to fulfill the long-term goal of the league. Consequently, we propose a set of milestones for upcoming competitions by considering the current capabilities of the robots and their limitations.
With this work we aim at laying the foundations towards the creation of roadmaps that can help to direct efforts in testing and benchmarking in robotics competitions.
Submitted 25 March, 2019;
originally announced March 2019.
-
Trends, Challenges and Adopted Strategies in RoboCup@Home
Authors:
Mauricio Matamoros,
Viktor Seib,
Dietrich Paulus
Abstract:
Scientific competitions are crucial in the field of service robotics. They foster knowledge exchange and allow teams to test their research in unstandardized scenarios and compare results. Such is the case of RoboCup@Home. However, keeping track of all the technologies and solution approaches used by teams to solve the tests can be a challenge in itself. Moreover, after eleven years of competitions, it's easy to delve too much into the field, losing perspective and forgetting about the user's needs and long term goals.
In this paper, we aim to tackle these problems by presenting a summary of the trending solutions and approaches used in RoboCup@Home, and discussing the attained achievements and challenges to overcome in relation with the progress required to fulfill the long-term goal of the league. Hence, considering the current capabilities of the robots and their limitations, we propose a set of milestones to address in upcoming competitions.
With this work we lay the foundations towards the creation of roadmaps that can help to direct efforts in testing and benchmarking in robotics competitions.
Submitted 6 March, 2019;
originally announced March 2019.
-
RoboCup@Home: Summarizing achievements in over eleven years of competition
Authors:
Mauricio Matamoros,
Viktor Seib,
Raphael Memmesheimer,
Dietrich Paulus
Abstract:
Scientific competitions are important in robotics because they foster knowledge exchange and allow teams to test their research in unstandardized scenarios and compare results. In the field of service robotics their role becomes crucial. Competitions like RoboCup@Home bring robots to people, a fundamental step to integrate them into society.
In this paper we summarize and discuss the differences between the achievements claimed by teams in their team description papers, and the results observed during the competition^1 from a qualitative perspective.
We conclude with a set of important challenges to be conquered first in order to take robots to people's homes. We believe that competitions are also an excellent opportunity to collect data of direct and unbiased interactions for further research.
^1 The authors belong to several teams who have participated in RoboCup@Home as early as 2007
Submitted 2 February, 2019;
originally announced February 2019.
-
From Commands to Goal-based Dialogs: A Roadmap to Achieve Natural Language Interaction in RoboCup@Home
Authors:
Mauricio Matamoros,
Karin Harbusch,
Dietrich Paulus
Abstract:
On the one hand, speech is a key aspect of people's communication. On the other, it is widely acknowledged that language proficiency is related to intelligence. Therefore, intelligent robots should be able to understand, at least, people's orders within their application domain. These insights are not new in RoboCup@Home, but we lack a long-term plan to evaluate this approach. In this paper we conduct a brief review of the achievements on automated speech recognition and natural language understanding in RoboCup@Home. Furthermore, we discuss the main challenges to tackle in spoken human-robot interaction within the scope of this competition. Finally, we contribute by presenting a pipelined road map to engender research in the area of natural language understanding applied to domestic service robotics.
Submitted 2 February, 2019;
originally announced February 2019.
-
Markerless Visual Robot Programming by Demonstration
Authors:
Raphael Memmesheimer,
Ivanna Mykhalchyshyna,
Viktor Seib,
Nick Theisen,
Dietrich Paulus
Abstract:
In this paper we present an approach for learning to imitate human behavior on a semantic level by markerless visual observation. We analyze a set of spatial constraints on human pose data extracted using convolutional pose machines and object information extracted from 2D image sequences. A scene analysis, based on an ontology of objects and affordances, is combined with continuous human pose estimation and spatial object relations. Using a set of constraints we associate the observed human actions with a set of executable robot commands. We demonstrate our approach in a kitchen task, where the robot learns to prepare a meal.
Submitted 30 July, 2018;
originally announced July 2018.
-
Simple Online and Realtime Tracking with a Deep Association Metric
Authors:
Nicolai Wojke,
Alex Bewley,
Dietrich Paulus
Abstract:
Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. In this paper, we integrate appearance information to improve the performance of SORT. Due to this extension we are able to track objects through longer periods of occlusions, effectively reducing the number of identity switches. In the spirit of the original framework we place much of the computational complexity into an offline pre-training stage where we learn a deep association metric on a large-scale person re-identification dataset. During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space. Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.
Submitted 21 March, 2017;
originally announced March 2017.