-
On the Foundations of Shortcut Learning
Authors:
Katherine L. Hermann,
Hossein Mobahi,
Thomas Fel,
Michael C. Mozer
Abstract:
Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity-how reliably a feature indicates train-set labels-but also on availability-how easily the feature can be extracted, or leveraged, from inputs. The literature on shortcut learning has noted examples in which models privilege one feature over another, for example tex…
▽ More
Deep-learning models can extract a rich assortment of features from data. Which features a model uses depends not only on predictivity-how reliably a feature indicates train-set labels-but also on availability-how easily the feature can be extracted, or leveraged, from inputs. The literature on shortcut learning has noted examples in which models privilege one feature over another, for example texture over shape and image backgrounds over foreground objects. Here, we test hypotheses about which input properties are more available to a model, and systematically study how predictivity and availability interact to shape models' feature use. We construct a minimal, explicit generative framework for synthesizing classification datasets with two latent features that vary in predictivity and in factors we hypothesize to relate to availability, and quantify a model's shortcut bias-its over-reliance on the shortcut (more available, less predictive) feature at the expense of the core (less available, more predictive) feature. We find that linear models are relatively unbiased, but introducing a single hidden layer with ReLU or Tanh units yields a bias. Our empirical findings are consistent with a theoretical account based on Neural Tangent Kernels. Finally, we study how models used in practice trade off predictivity and availability in naturalistic datasets, discovering availability manipulations which increase models' degree of shortcut bias. Taken together, these findings suggest that the propensity to learn shortcut features is a fundamental characteristic of deep nonlinear architectures warranting systematic study given its role in sha** how models solve tasks.
△ Less
Submitted 24 October, 2023;
originally announced October 2023.
-
Getting aligned on representational alignment
Authors:
Ilia Sucholutsky,
Lukas Muttenthaler,
Adrian Weller,
Andi Peng,
Andreea Bobu,
Been Kim,
Bradley C. Love,
Erin Grant,
Iris Groen,
Jascha Achterberg,
Joshua B. Tenenbaum,
Katherine M. Collins,
Katherine L. Hermann,
Kerem Oktar,
Klaus Greff,
Martin N. Hebart,
Nori Jacoby,
Qiuyi Zhang,
Raja Marjieh,
Robert Geirhos,
Sherol Chen,
Simon Kornblith,
Sunayana Rane,
Talia Konkle,
Thomas P. O'Connell
, et al. (5 additional authors not shown)
Abstract:
Biological and artificial information processing systems form representations that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the extent to which the representations formed by these diverse systems agree? Do similarities in representations then translate into similar behavior? How can a system's representations be modified to better match those of an…
▽ More
Biological and artificial information processing systems form representations that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the extent to which the representations formed by these diverse systems agree? Do similarities in representations then translate into similar behavior? How can a system's representations be modified to better match those of another system? These questions pertaining to the study of representational alignment are at the heart of some of the most active research areas in cognitive science, neuroscience, and machine learning. For example, cognitive scientists measure the representational alignment of multiple individuals to identify shared cognitive priors, neuroscientists align fMRI responses from multiple individuals into a shared representational space for group-level analyses, and ML researchers distill knowledge from teacher models into student models by increasing their alignment. Unfortunately, there is limited knowledge transfer between research communities interested in representational alignment, so progress in one field often ends up being rediscovered independently in another. Thus, greater cross-field communication would be advantageous. To improve communication between these fields, we propose a unifying framework that can serve as a common language between researchers studying representational alignment. We survey the literature from all three fields and demonstrate how prior work fits into this framework. Finally, we lay out open problems in representational alignment where progress can benefit all three of these fields. We hope that our work can catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and develo** information processing systems. We note that this is a working paper and encourage readers to reach out with their suggestions for future revisions.
△ Less
Submitted 2 November, 2023; v1 submitted 18 October, 2023;
originally announced October 2023.
-
RobotIO: A Python Library for Robot Manipulation Experiments
Authors:
Lukas Hermann,
Max Argus,
Adrian Roefer,
Abhinav Valada,
Thomas Brox
Abstract:
Setting up robot environments to quickly test newly developed algorithms is still a difficult and time consuming process. This presents a significant hurdle to researchers interested in performing real-world robotic experiments. RobotIO is a python library designed to solve this problem. It focuses on providing common, simple, and well structured python interfaces for robots, grippers, and cameras…
▽ More
Setting up robot environments to quickly test newly developed algorithms is still a difficult and time consuming process. This presents a significant hurdle to researchers interested in performing real-world robotic experiments. RobotIO is a python library designed to solve this problem. It focuses on providing common, simple, and well structured python interfaces for robots, grippers, and cameras, etc. These are provided with implementations of these interfaces for common hardware. This enables code using RobotIO to be portable across different robot setups. In terms of architecture, RobotIO is designed to be compatible with OpenAI gym environments, as well as ROS; examples of both of these are provided. The library comes together with a number of helpful tools, such as camera calibration scripts and episode recording functionality that further support algorithm development.
△ Less
Submitted 16 August, 2022; v1 submitted 27 July, 2022;
originally announced July 2022.
-
What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data
Authors:
Oier Mees,
Lukas Hermann,
Wolfram Burgard
Abstract:
A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making various design choices…
▽ More
A long-standing goal in robotics is to build robots that can perform a wide range of daily tasks from perceptions obtained with their onboard sensors and specified only via natural language. While recently substantial advances have been achieved in language-driven robotics by leveraging end-to-end learning from pixels, there is no clear and well-understood process for making various design choices due to the underlying variation in setups. In this paper, we conduct an extensive study of the most critical challenges in learning language conditioned policies from offline free-form imitation datasets. We further identify architectural and algorithmic techniques that improve performance, such as a hierarchical decomposition of the robot control learning, a multimodal transformer encoder, discrete latent plans and a self-supervised contrastive loss that aligns video and language representations. By combining the results of our investigation with our improved model components, we are able to present a novel approach that significantly outperforms the state of the art on the challenging language conditioned long-horizon robot manipulation CALVIN benchmark. We have open-sourced our implementation to facilitate future research in learning to perform many complex manipulation skills in a row specified with natural language. Codebase and trained models available at http://hulc.cs.uni-freiburg.de
△ Less
Submitted 30 August, 2022; v1 submitted 13 April, 2022;
originally announced April 2022.
-
Affordance Learning from Play for Sample-Efficient Policy Learning
Authors:
Jessica Borja-Diaz,
Oier Mees,
Gabriel Kalweit,
Lukas Hermann,
Joschka Boedecker,
Wolfram Burgard
Abstract:
Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal. To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy le…
▽ More
Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal. To this end, we propose a novel approach that extracts a self-supervised visual affordance model from human teleoperated play data and leverages it to enable efficient policy learning and motion planning. We combine model-based planning with model-free deep reinforcement learning (RL) to learn policies that favor the same object regions favored by people, while requiring minimal robot interactions with the environment. We evaluate our algorithm, Visual Affordance-guided Policy Optimization (VAPO), with both diverse simulation manipulation tasks and real world robot tidy-up experiments to demonstrate the effectiveness of our affordance-guided policies. We find that our policies train 4x faster than the baselines and generalize better to novel objects because our visual affordance model can anticipate their affordance regions.
△ Less
Submitted 1 March, 2022;
originally announced March 2022.
-
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Authors:
Oier Mees,
Lukas Hermann,
Erick Rosete-Beas,
Wolfram Burgard
Abstract:
General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from…
▽ More
General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets and supports flexible specification of sensor suites. We evaluate the agents in zero-shot to novel language instructions and to novel environments and objects. We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for develo** innovative agents that learn to relate human language to their world models with this benchmark.
△ Less
Submitted 13 July, 2022; v1 submitted 6 December, 2021;
originally announced December 2021.
-
Pre-training of Deep RL Agents for Improved Learning under Domain Randomization
Authors:
Artemij Amiranashvili,
Max Argus,
Lukas Hermann,
Wolfram Burgard,
Thomas Brox
Abstract:
Visual domain randomization in simulated environments is a widely used method to transfer policies trained in simulation to real robots. However, domain randomization and augmentation hamper the training of a policy. As reinforcement learning struggles with a noisy training signal, this additional nuisance can drastically impede training. For difficult tasks it can even result in complete failure…
▽ More
Visual domain randomization in simulated environments is a widely used method to transfer policies trained in simulation to real robots. However, domain randomization and augmentation hamper the training of a policy. As reinforcement learning struggles with a noisy training signal, this additional nuisance can drastically impede training. For difficult tasks it can even result in complete failure to learn. To overcome this problem we propose to pre-train a perception encoder that already provides an embedding invariant to the randomization. We demonstrate that this yields consistently improved results on a randomized version of DeepMind control suite tasks and a stacking environment on arbitrary backgrounds with zero-shot transfer to a physical robot.
△ Less
Submitted 29 April, 2021;
originally announced April 2021.
-
PIXLISE-C: Exploring The Data Analysis Needs of NASA Scientists for Mineral Identification
Authors:
Connie Ye,
Lukas Hermann,
Nur Yildirim,
Shravya Bhat,
Dominik Moritz,
Scott Davidoff
Abstract:
NASA JPL scientists working on the micro x-ray fluorescence (microXRF) spectroscopy data collected from Mars surface perform data analysis to look for signs of past microbial life on Mars. Their data analysis workflow mainly involves identifying mineral compounds through the element abundance in spatially distributed data points. Working with the NASA JPL team, we identified pain points and needs…
▽ More
NASA JPL scientists working on the micro x-ray fluorescence (microXRF) spectroscopy data collected from Mars surface perform data analysis to look for signs of past microbial life on Mars. Their data analysis workflow mainly involves identifying mineral compounds through the element abundance in spatially distributed data points. Working with the NASA JPL team, we identified pain points and needs to further develop their existing data visualization and analysis tool. Specifically, the team desired improvements for the process of creating and interpreting mineral composition groups. To address this problem, we developed an interactive tool that enables scientists to (1) cluster the data using either manual lasso-tool selection or through various machine learning clustering algorithms, and (2) compare the clusters and individual data points to make informed decisions about mineral compositions. Our preliminary tool supports a hybrid data analysis workflow where the user can manually refine the machine-generated clusters.
△ Less
Submitted 30 March, 2021;
originally announced March 2021.
-
Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction
Authors:
Iman Nematollahi,
Oier Mees,
Lukas Hermann,
Wolfram Burgard
Abstract:
A key challenge for an agent learning to interact with the world is to reason about physical properties of objects and to foresee their dynamics under the effect of applied forces. In order to scale learning through interaction to many objects and scenes, robots should be able to improve their own performance from real-world experience without requiring human supervision. To this end, we propose a…
▽ More
A key challenge for an agent learning to interact with the world is to reason about physical properties of objects and to foresee their dynamics under the effect of applied forces. In order to scale learning through interaction to many objects and scenes, robots should be able to improve their own performance from real-world experience without requiring human supervision. To this end, we propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images. Unlike previous approaches, our method does not require ground-truth data associations provided by a tracker or any pre-trained perception network. To learn from unlabeled real-world interaction data, we enforce consistency of estimated 3D clouds, actions and 2D images with observed ones. Our joint forward and inverse network learns to segment a scene into salient object parts and predicts their 3D motion under the effect of applied actions. Moreover, our object-centric model outputs action-conditioned 3D scene flow, object masks and 2D optical flow as emergent properties. Our extensive evaluation both in simulation and with real-world data demonstrates that our formulation leads to effective, interpretable models that can be used for visuomotor control and planning. Videos, code and dataset are available at http://hind4sight.cs.uni-freiburg.de
△ Less
Submitted 2 August, 2020;
originally announced August 2020.
-
FlowControl: Optical Flow Based Visual Servoing
Authors:
Max Argus,
Lukas Hermann,
Jon Long,
Thomas Brox
Abstract:
One-shot imitation is the vision of robot programming from a single demonstration, rather than by tedious construction of computer code. We present a practical method for realizing one-shot imitation for manipulation tasks, exploiting modern learning-based optical flow to perform real-time visual servoing. Our approach, which we call FlowControl, continuously tracks a demonstration video, using a…
▽ More
One-shot imitation is the vision of robot programming from a single demonstration, rather than by tedious construction of computer code. We present a practical method for realizing one-shot imitation for manipulation tasks, exploiting modern learning-based optical flow to perform real-time visual servoing. Our approach, which we call FlowControl, continuously tracks a demonstration video, using a specified foreground mask to attend to an object of interest. Using RGB-D observations, FlowControl requires no 3D object models, and is easy to set up. FlowControl inherits great robustness to visual appearance from decades of work in optical flow. We exhibit FlowControl on a range of problems, including ones requiring very precise motions, and ones requiring the ability to generalize.
△ Less
Submitted 1 July, 2020;
originally announced July 2020.
-
What shapes feature representations? Exploring datasets, architectures, and training
Authors:
Katherine L. Hermann,
Andrew K. Lampinen
Abstract:
In naturalistic learning problems, a model's input contains a wide range of features, some useful for the task at hand, and others not. Of the useful features, which ones does the model use? Of the task-irrelevant features, which ones does the model represent? Answers to these questions are important for understanding the basis of models' decisions, as well as for building models that learn versat…
▽ More
In naturalistic learning problems, a model's input contains a wide range of features, some useful for the task at hand, and others not. Of the useful features, which ones does the model use? Of the task-irrelevant features, which ones does the model represent? Answers to these questions are important for understanding the basis of models' decisions, as well as for building models that learn versatile, adaptable representations useful beyond the original training task. We study these questions using synthetic datasets in which the task-relevance of input features can be controlled directly. We find that when two features redundantly predict the labels, the model preferentially represents one, and its preference reflects what was most linearly decodable from the untrained model. Over training, task-relevant features are enhanced, and task-irrelevant features are partially suppressed. Interestingly, in some cases, an easier, weakly predictive feature can suppress a more strongly predictive, but more difficult one. Additionally, models trained to recognize both easy and hard features learn representations most similar to models that use only the easy feature. Further, easy features lead to more consistent representations across model runs than do hard features. Finally, models have greater representational similarity to an untrained model than to models trained on a different task. Our results highlight the complex processes that determine which features a model represents.
△ Less
Submitted 22 October, 2020; v1 submitted 22 June, 2020;
originally announced June 2020.
-
The Origins and Prevalence of Texture Bias in Convolutional Neural Networks
Authors:
Katherine L. Hermann,
Ting Chen,
Simon Kornblith
Abstract:
Recent work has indicated that, unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than by shape. How pervasive is this bias, and where does it come from? We find that, when trained on datasets of images with conflicting shape and texture, CNNs learn to classify by shape at least as easily as by texture. What factors, then, produce the texture bias in CNNs trained on Im…
▽ More
Recent work has indicated that, unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than by shape. How pervasive is this bias, and where does it come from? We find that, when trained on datasets of images with conflicting shape and texture, CNNs learn to classify by shape at least as easily as by texture. What factors, then, produce the texture bias in CNNs trained on ImageNet? Different unsupervised training objectives and different architectures have small but significant and largely independent effects on the level of texture bias. However, all objectives and architectures still lead to models that make texture-based classification decisions a majority of the time, even if shape information is decodable from their hidden representations. The effect of data augmentation is much larger. By taking less aggressive random crops at training time and applying simple, naturalistic augmentation (color distortion, noise, and blur), we train models that classify ambiguous images by shape a majority of the time, and outperform baselines on out-of-distribution test sets. Our results indicate that apparent differences in the way humans and ImageNet-trained CNNs process images may arise not primarily from differences in their internal workings, but from differences in the data that they see.
△ Less
Submitted 3 November, 2020; v1 submitted 20 November, 2019;
originally announced November 2019.
-
Adaptive Curriculum Generation from Demonstrations for Sim-to-Real Visuomotor Control
Authors:
Lukas Hermann,
Max Argus,
Andreas Eitel,
Artemij Amiranashvili,
Wolfram Burgard,
Thomas Brox
Abstract:
We propose Adaptive Curriculum Generation from Demonstrations (ACGD) for reinforcement learning in the presence of sparse rewards. Rather than designing shaped reward functions, ACGD adaptively sets the appropriate task difficulty for the learner by controlling where to sample from the demonstration trajectories and which set of simulation parameters to use. We show that training vision-based cont…
▽ More
We propose Adaptive Curriculum Generation from Demonstrations (ACGD) for reinforcement learning in the presence of sparse rewards. Rather than designing shaped reward functions, ACGD adaptively sets the appropriate task difficulty for the learner by controlling where to sample from the demonstration trajectories and which set of simulation parameters to use. We show that training vision-based control policies in simulation while gradually increasing the difficulty of the task via ACGD improves the policy transfer to the real world. The degree of domain randomization is also gradually increased through the task difficulty. We demonstrate zero-shot transfer for two real-world manipulation tasks: pick-and-stow and block stacking. A video showing the results can be found at https://lmb.informatik.uni-freiburg.de/projects/curriculum/
△ Less
Submitted 8 July, 2020; v1 submitted 17 October, 2019;
originally announced October 2019.