Showing 1–50 of 50 results for author: Toshev, A

  1. arXiv:2406.11794  [pdf, other]

    cs.LG cs.CL

    DataComp-LM: In search of the next generation of training sets for language models

    Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, et al. (34 additional authors not shown)

    Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with dat…

    Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Project page: https://www.datacomp.ai/dclm/

  2. arXiv:2406.07904  [pdf, other]

    cs.LG

    Grounding Multimodal Large Language Models in Actions

    Authors: Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev

    Abstract: Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2403.09611  [pdf, other]

    cs.CV cs.CL cs.LG

    MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, et al. (7 additional authors not shown)

    Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for la…

    Submitted 18 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  4. arXiv:2403.04750  [pdf, other]

    physics.flu-dyn cs.LG

    JAX-SPH: A Differentiable Smoothed Particle Hydrodynamics Framework

    Authors: Artur P. Toshev, Harish Ramachandran, Jonas A. Erbesdobler, Gianluca Galletti, Johannes Brandstetter, Nikolaus A. Adams

    Abstract: Particle-based fluid simulations have emerged as a powerful tool for solving the Navier-Stokes equations, especially in cases that include intricate physics and free surfaces. The recent addition of machine learning methods to the toolbox for solving such problems is pushing the boundary of the quality vs. speed tradeoff of such numerical simulations. In this work, we lead the way to Lagrangian fl…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: Accepted at the ICLR 2024 Workshop on AI4Differential Equations In Science

  5. arXiv:2402.06275  [pdf, other]

    physics.flu-dyn cs.LG

    Neural SPH: Improved Neural Modeling of Lagrangian Fluid Dynamics

    Authors: Artur P. Toshev, Jonas A. Erbesdobler, Nikolaus A. Adams, Johannes Brandstetter

    Abstract: Smoothed particle hydrodynamics (SPH) is omnipresent in modern engineering and scientific disciplines. SPH is a class of Lagrangian schemes that discretize fluid dynamics via finite material points that are tracked through the evolving velocity field. Due to the particle-like nature of the simulation, graph neural networks (GNNs) have emerged as appealing and successful surrogates. However, the pr…

    Submitted 9 February, 2024; originally announced February 2024.

  6. arXiv:2401.08541  [pdf, other]

    cs.CV

    Scalable Pre-training of Large Autoregressive Image Models

    Authors: Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, Armand Joulin

    Abstract: This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scale with both the model capacity and the quantity of data, (2) the value o…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: https://github.com/apple/ml-aim

  7. arXiv:2311.16201  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

    Authors: Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

    Abstract: Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation…

    Submitted 27 November, 2023; originally announced November 2023.

  8. arXiv:2310.17722  [pdf, other]

    cs.LG cs.AI cs.CL

    Large Language Models as Generalizable Policies for Embodied Tasks

    Authors: Andrew Szot, Max Schwarzer, Harsh Agrawal, Bogdan Mazoure, Walter Talbott, Katherine Metcalf, Natalie Mackraz, Devon Hjelm, Alexander Toshev

    Abstract: We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and…

    Submitted 16 April, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

  9. arXiv:2309.17425  [pdf, other]

    cs.AI cs.LG

    Data Filtering Networks

    Authors: Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, Vaishaal Shankar

    Abstract: Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we s…

    Submitted 5 November, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

  10. arXiv:2309.16342  [pdf, other]

    cs.LG physics.flu-dyn

    LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite

    Authors: Artur P. Toshev, Gianluca Galletti, Fabian Fritz, Stefan Adami, Nikolaus A. Adams

    Abstract: Machine learning has been successfully applied to grid-based PDE modeling in various scientific applications. However, learned PDE solvers based on Lagrangian particle discretizations, which are the preferred approach to problems with free surfaces or complex physics, remain largely unexplored. We present LagrangeBench, the first benchmarking suite for Lagrangian particle problems, focusing on tem…

    Submitted 28 October, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: Accepted at 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  11. arXiv:2309.04354  [pdf, other]

    cs.CV cs.LG stat.ML

    Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

    Authors: Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du

    Abstract: Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In thi…

    Submitted 8 September, 2023; originally announced September 2023.

  12. arXiv:2306.16740  [pdf, other]

    cs.RO cs.AI cs.HC cs.LG

    Principles and Guidelines for Evaluating Social Robot Navigation Algorithms

    Authors: Anthony Francis, Claudia Pérez-D'Arpino, Chengshu Li, Fei Xia, Alexandre Alahi, Rachid Alami, Aniket Bera, Abhijat Biswas, Joydeep Biswas, Rohan Chandra, Hao-Tien Lewis Chiang, Michael Everett, Sehoon Ha, Justin Hart, Jonathan P. How, Haresh Karnan, Tsang-Wei Edward Lee, Luis J. Manso, Reuth Mirksy, Sören Pirk, Phani Teja Singamaneni, Peter Stone, Ada V. Taylor, Peter Trautman, Nathan Tsoi, et al. (6 additional authors not shown)

    Abstract: A major challenge to deploying robots widely is navigation in human-populated environments, commonly referred to as social robot navigation. While the field of social navigation has advanced tremendously in recent years, the fair evaluation of algorithms that tackle social navigation remains hard because it involves not just robotic agents moving in static environments but also dynamic human agent…

    Submitted 19 September, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

    Comments: 42 pages, 11 figures, 6 tables

    ACM Class: I.2.9

  13. arXiv:2306.14818  [pdf, other]

    cs.LG physics.chem-ph

    Accelerating Molecular Graph Neural Networks via Knowledge Distillation

    Authors: Filip Ekström Kelvinius, Dimitar Georgiev, Artur Petrov Toshev, Johannes Gasteiger

    Abstract: Recent advances in graph neural networks (GNNs) have enabled more comprehensive modeling of molecules and molecular systems, thereby enhancing the precision of molecular property prediction and molecular simulations. Nonetheless, as the field has been progressing to bigger and more complex architectures, state-of-the-art GNNs have become largely prohibitive for many large-scale applications. In th…

    Submitted 28 October, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

    Comments: Accepted as a conference paper at NeurIPS 2023

  14. arXiv:2306.07290  [pdf, other]

    cs.LG cs.AI

    Value function estimation using conditional diffusion models for control

    Authors: Bogdan Mazoure, Walter Talbott, Miguel Angel Bautista, Devon Hjelm, Alexander Toshev, Josh Susskind

    Abstract: A fairly reliable trend in deep reinforcement learning is that the performance scales with the number of parameters, provided a complementary scaling in amount of training data. As the appetite for large models increases, it is imperative to address, sooner than later, the potential problem of running out of high-quality demonstrations. In this case, instead of collecting only new data via costly…

    Submitted 9 June, 2023; originally announced June 2023.

  15. arXiv:2305.15603  [pdf, other]

    cs.LG physics.flu-dyn

    Learning Lagrangian Fluid Mechanics with E($3$)-Equivariant Graph Neural Networks

    Authors: Artur P. Toshev, Gianluca Galletti, Johannes Brandstetter, Stefan Adami, Nikolaus A. Adams

    Abstract: We contribute to the vastly growing field of machine learning for engineering systems by demonstrating that equivariant graph neural networks have the potential to learn more accurate dynamic-interaction models than their non-equivariant counterparts. We benchmark two well-studied fluid-flow systems, namely 3D decaying Taylor-Green vortex and 3D reverse Poiseuille flow, and evaluate the models bas…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: GSI'23 6th International Conference on Geometric Science of Information; 10 pages; oral. arXiv admin note: substantial text overlap with arXiv:2304.00150

  16. arXiv:2304.04385  [pdf, other]

    cs.LG

    On Robustness in Multimodal Learning

    Authors: Brandon McKinzie, Joseph Cheng, Vaishaal Shankar, Yinfei Yang, Jonathon Shlens, Alexander Toshev

    Abstract: Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave as the type of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework…

    Submitted 10 April, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

  17. arXiv:2304.00150  [pdf, other]

    cs.LG physics.flu-dyn

    E($3$) Equivariant Graph Neural Networks for Particle-Based Fluid Mechanics

    Authors: Artur P. Toshev, Gianluca Galletti, Johannes Brandstetter, Stefan Adami, Nikolaus A. Adams

    Abstract: We contribute to the vastly growing field of machine learning for engineering systems by demonstrating that equivariant graph neural networks have the potential to learn more accurate dynamic-interaction models than their non-equivariant counterparts. We benchmark two well-studied fluid flow systems, namely the 3D decaying Taylor-Green vortex and the 3D reverse Poiseuille flow, and compare equivar…

    Submitted 31 March, 2023; originally announced April 2023.

    Comments: ICLR 2023 Workshop on Physics for Machine Learning

  18. arXiv:2304.00146  [pdf, other]

    cs.LG physics.flu-dyn

    On the Relationships between Graph Neural Networks for the Simulation of Physical Systems and Classical Numerical Methods

    Authors: Artur P. Toshev, Ludger Paehler, Andrea Panizza, Nikolaus A. Adams

    Abstract: Recent developments in Machine Learning approaches for modelling physical systems have begun to mirror the past development of numerical methods in the computational sciences. In this survey, we begin by providing an example of this with the parallels between the development trajectories of graph neural network acceleration for physical simulations and particle-based approaches. We then give an ov…

    Submitted 31 March, 2023; originally announced April 2023.

    Comments: 2nd AI4Science Workshop at the 39th International Conference on Machine Learning (ICML), 2022

  19. arXiv:2301.13081  [pdf, other]

    cs.CV

    STAIR: Learning Sparse Text and Image Representation in Grounded Tokens

    Authors: Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Madappally Jose, Alexander Toshev, Jonathon Shlens, Ruoming Pang, Yinfei Yang

    Abstract: Image and text retrieval is one of the foundational tasks in the vision and language domain with multiple real-world applications. State-of-the-art approaches, e.g. CLIP, ALIGN, represent images and texts as dense embeddings and calculate the similarity in the dense embedding space as the matching score. On the other hand, sparse semantic features like bag-of-words models are more interpretable, b…

    Submitted 7 February, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

  20. arXiv:2210.09996  [pdf, other]

    cs.CV cs.LG

    Perceptual Grouping in Contrastive Vision-Language Models

    Authors: Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens

    Abstract: Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how…

    Submitted 21 August, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted and presented at ICCV 2023

  21. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  22. arXiv:2209.09375  [pdf, other]

    cs.RO cs.CV

    Gesture2Path: Imitation Learning for Gesture-aware Navigation

    Authors: Catie Cuan, Edward Lee, Emre Fisher, Anthony Francis, Leila Takayama, Tingnan Zhang, Alexander Toshev, Sören Pirk

    Abstract: As robots increasingly enter human-centered environments, they must not only be able to navigate safely around humans, but also adhere to complex social norms. Humans often rely on non-verbal communication through gestures and facial expressions when navigating around other people, especially in densely occupied spaces. Consequently, robots also need to be able to interpret gestures as part of sol…

    Submitted 19 September, 2022; originally announced September 2022.

    Comments: 8 pages, 12 figures

  23. arXiv:2207.13751  [pdf, other]

    cs.CV cs.GR cs.LG

    GAUDI: A Neural Architect for Immersive 3D Scene Generation

    Authors: Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, Afshin Dehghan, Josh Susskind

    Abstract: We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generati…

    Submitted 27 July, 2022; originally announced July 2022.

    Comments: Project webpage: https://github.com/apple/ml-gaudi

  24. arXiv:2204.05443  [pdf, other]

    cs.RO cs.HC

    A Protocol for Validating Social Navigation Policies

    Authors: Sören Pirk, Edward Lee, Xuesu Xiao, Leila Takayama, Anthony Francis, Alexander Toshev

    Abstract: Enabling socially acceptable behavior for situated agents is a major goal of recent robotics research. Robots should not only operate safely around humans, but also abide by complex social norms. A key challenge for developing socially-compliant policies is measuring the quality of their behavior. Social behavior is enormously complex, making it difficult to create reliable metrics to gauge the pe…

    Submitted 29 April, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: IEEE International Conference on Robotics and Automation; Workshop: Social Robot Navigation: Advances and Evaluation

  25. arXiv:2204.01691  [pdf, other]

    cs.RO cs.CL cs.LG

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Authors: Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, et al. (20 additional authors not shown)

    Abstract: Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embo…

    Submitted 16 August, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: See website at https://say-can.github.io/ V1. Initial Upload. V2. Added PaLM results. Added study about new capabilities (drawer manipulation, chain of thought prompting, multilingual instructions). Added an ablation study of language model size. Added an open-source version of SayCan on a simulated tabletop environment. Improved readability.

  26. arXiv:2203.15041  [pdf, other]

    cs.RO cs.CV cs.LG eess.SY

    Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation

    Authors: Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Soeren Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, Peter Stone

    Abstract: Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans. With the emergence of autonomously navigating mobile robots in human populated environments (e.g., domestic service robots in homes and restaurants and food delivery robots on public sidewalks), incorporating socially…

    Submitted 8 June, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Journal ref: Robotics and Automation Letters (RA-L) 2022

  27. arXiv:2111.03189  [pdf, other]

    cs.LG cs.AI cs.RO

    Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning

    Authors: Dhruv Shah, Peng Xu, Yao Lu, Ted Xiao, Alexander Toshev, Sergey Levine, Brian Ichter

    Abstract: Reinforcement learning can train policies that effectively perform complex tasks. However for long-horizon tasks, the performance of these methods degrades with horizon, often necessitating reasoning over and chaining lower-level skills. Hierarchical reinforcement learning aims to enable this by providing a bank of low-level skills as action abstractions. Hierarchies can further improve on this by…

    Submitted 29 March, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: Accepted to ICLR 2022

  28. arXiv:2008.07792  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    ReLMoGen: Leveraging Motion Generation in Reinforcement Learning for Mobile Manipulation

    Authors: Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, Silvio Savarese

    Abstract: Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners…

    Submitted 26 March, 2021; v1 submitted 18 August, 2020; originally announced August 2020.

    Comments: First two authors contributed equally. Access project website at http://svl.stanford.edu/projects/relmogen

  29. arXiv:2008.04888  [pdf, other]

    cs.CV

    Adversarial Generative Grammars for Human Activity Prediction

    Authors: AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

    Abstract: In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal represe…

    Submitted 14 August, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

    Comments: ECCV 2020 (Oral)

  30. arXiv:2007.14545  [pdf, other]

    cs.RO

    Learning Object-conditioned Exploration using Distributed Soft Actor Critic

    Authors: Ayzaan Wahid, Austin Stone, Kevin Chen, Brian Ichter, Alexander Toshev

    Abstract: Object navigation is defined as navigating to an object of a given label in a complex, unexplored environment. In its general form, this problem poses several challenges for Robotics: semantic exploration of unknown environments in search of an object and low-level control. In this work we study object-guided exploration and low-level control, and present an end-to-end trained navigation policy ac…

    Submitted 30 July, 2020; v1 submitted 28 July, 2020; originally announced July 2020.

  31. arXiv:2006.13171  [pdf, other]

    cs.CV cs.RO

    ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects

    Authors: Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, Erik Wijmans

    Abstract: We revisit the problem of Object-Goal Navigation (ObjectNav). In its simplest form, ObjectNav is defined as the task of navigating to an object, specified by its label, in an unexplored environment. In particular, the agent is initialized at a random location and pose in an environment and asked to find an instance of an object category, e.g., find a chair, by navigating to it. As the community…

    Submitted 30 August, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

  32. arXiv:2006.04843  [pdf, other]

    cs.RO cs.LG

    Modeling Long-horizon Tasks as Sequential Interaction Landscapes

    Authors: Sören Pirk, Karol Hausman, Alexander Toshev, Mohi Khansari

    Abstract: Complex object manipulation tasks often span over long sequences of operations. Task planning over long-time horizons is a challenging and open problem in robotics, and its complexity grows exponentially with an increasing number of subtasks. In this paper we present a deep learning network that learns dependencies and transitions across subtasks solely from a set of demonstration videos. We repre…

    Submitted 23 October, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: Published at 4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA. More details available at: http://www.pirk.io

  33. arXiv:1910.14442  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Interactive Gibson Benchmark (iGibson 0.5): A Benchmark for Interactive Navigation in Cluttered Environments

    Authors: Fei Xia, William B. Shen, Chengshu Li, Priya Kasimbeg, Micael Tchapmi, Alexander Toshev, Li Fei-Fei, Roberto Martín-Martín, Silvio Savarese

    Abstract: We present Interactive Gibson Benchmark, the first comprehensive benchmark for training and evaluating Interactive Navigation: robot navigation strategies where physical interaction with objects is allowed and even encouraged to accomplish a task. For example, the robot can move objects if needed in order to clear a path leading to the goal location. Our benchmark comprises two novel elements: 1)…

    Submitted 9 August, 2021; v1 submitted 29 October, 2019; originally announced October 2019.

    Comments: 9 pages, 8 figures. Consider citing a newer version (https://arxiv.longhoe.net/abs/2012.02924) if you are using iGibson

    Journal ref: IEEE Robotics and Automation Letters, Vol. 5, No. 2, April 2020

  34. arXiv:1903.09870  [pdf, other]

    cs.RO cs.CV

    Long Range Neural Navigation Policies for the Real World

    Authors: Ayzaan Wahid, Alexander Toshev, Marek Fiser, Tsang-Wei Edward Lee

    Abstract: Learned Neural Network based policies have shown promising results for robot navigation. However, most of these approaches fall short of being used on a real robot due to the extensive simulated training they require. These simulations lack the visuals and dynamics of the real world, which makes it infeasible to deploy on a real robot. We present a novel Neural Net based policy, NavNet, which allo…

    Submitted 28 August, 2019; v1 submitted 23 March, 2019; originally announced March 2019.

  35. arXiv:1903.03878  [pdf, other]

    cs.LG cs.CV cs.RO stat.ML

    Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

    Authors: Kuan Fang, Alexander Toshev, Li Fei-Fei, Silvio Savarese

    Abstract: Many robotic applications require the agent to perform long-horizon tasks in partially observable environments. In such applications, decision making at any step can depend on observations received far in the past. Hence, being able to properly memorize and utilize the long-term history is crucial. In this work, we propose a novel memory-based policy, named Scene Memory Transformer (SMT). The prop…

    Submitted 9 March, 2019; originally announced March 2019.

    Comments: CVPR 2019 paper with supplementary material

  36. arXiv:1811.10636  [pdf, other]

    cs.CV cs.LG cs.NE

    Evolving Space-Time Neural Architectures for Videos

    Authors: AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

    Abstract: We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutions, obtained promising results by manually designing video CNN architectures. We here develop a novel evolutionary search algorithm that automatically explores models with different types and combinations of layers to jointly learn int…

    Submitted 20 August, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

    Journal ref: ICCV 2019

  37. arXiv:1806.03370  [pdf, other]

    cs.CV

    Self-supervisory Signals for Object Discovery and Detection

    Authors: Etienne Pot, Alexander Toshev, Jana Kosecka

    Abstract: In robotic applications, we often face the challenge of discovering new objects while having very little or no labelled training data. In this paper we explore the use of self-supervision provided by a robot traversing an environment to learn representations of encountered objects. Knowledge of ego-motion and depth perception enables the agent to effectively associate multiple object proposals, wh…

    Submitted 8 June, 2018; originally announced June 2018.

  38. arXiv:1805.06066  [pdf, other]

    cs.CV

    Visual Representations for Semantic Target Driven Navigation

    Authors: Arsalan Mousavian, Alexander Toshev, Marek Fiser, Jana Kosecka, Ayzaan Wahid, James Davidson

    Abstract: What is a good visual representation for autonomous agents? We address this question in the context of semantic visual navigation, which is the problem of a robot finding its way through a complex environment to a target object, e.g. go to the refrigerator. Instead of acquiring a metric semantic map of an environment and using planning for navigation, our approach learns navigation policies on top…

    Submitted 2 July, 2019; v1 submitted 15 May, 2018; originally announced May 2018.

    Comments: Accepted to ICRA 2019 and ECCV 2018 Workshop on Visual Learning and Embodied Agents in Simulation Environments

  39. arXiv:1712.07642  [pdf, other]

    cs.CV cs.LG cs.RO

    Sim2Real View Invariant Visual Servoing by Recurrent Control

    Authors: Fereshteh Sadeghi, Alexander Toshev, Eric Jang, Sergey Levine

    Abstract: Humans are remarkably proficient at controlling their limbs and tools from a wide range of viewpoints and angles, even in the presence of optical distortions. In robotics, this ability is referred to as visual servoing: moving a tool or end-point to a desired location using primarily visual feedback. In this paper, we study how viewpoint-invariant visual servoing skills can be learned automaticall…

    Submitted 20 December, 2017; originally announced December 2017.

    Comments: Supplementary video: https://fsadeghi.github.io/Sim2RealViewInvariantServo

  40. arXiv:1703.07464  [pdf, other]

    cs.CV

    No Fuss Distance Metric Learning using Proxies

    Authors: Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, Saurabh Singh

    Abstract: We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship -- an anchor point $x$ is similar to a set of positive points $Y$, and dissimilar to a set of negative points $Z$, and a loss defined over…

    Submitted 1 August, 2017; v1 submitted 21 March, 2017; originally announced March 2017.

    Comments: To be presented at ICCV 2017

  41. arXiv:1701.01779  [pdf, other]

    cs.CV

    Towards Accurate Multi-person Pose Estimation in the Wild

    Authors: George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy

    Abstract: We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster RCNN detector. In the second stage, we estimate the…

    Submitted 14 April, 2017; v1 submitted 6 January, 2017; originally announced January 2017.

    Comments: Paper describing an improved version of the G-RMI entry to the 2016 COCO keypoints challenge (http://image-net.org/challenges/ilsvrc+coco2016). Camera ready version to appear in the Proceedings of CVPR 2017

  42. Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge

    Authors: Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan

    Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The mod…

    Submitted 21 September, 2016; originally announced September 2016.

    Comments: arXiv admin note: substantial text overlap with arXiv:1411.4555

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: PP, Issue: 99, July 2016)

  43. arXiv:1605.02346  [pdf, other]

    cs.CV

    Chained Predictions Using Convolutional Neural Networks

    Authors: Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

    Abstract: In this paper, we present an adaptation of the sequence-to-sequence model for structured output prediction in vision tasks. In this model the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tas…

    Submitted 23 October, 2016; v1 submitted 8 May, 2016; originally announced May 2016.

    Comments: In submission to ECCV 2016

  44. arXiv:1511.06789  [pdf, other]

    cs.CV

    The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition

    Authors: Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, Li Fei-Fei

    Abstract: Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model utilizing this data. Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and…

    Submitted 18 October, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

    Comments: ECCV 2016, data is released

  45. arXiv:1511.02283  [pdf, other]

    cs.CV cs.CL cs.LG cs.RO

    Generation and Comprehension of Unambiguous Object Descriptions

    Authors: Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy

    Abstract: We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous object…

    Submitted 10 April, 2016; v1 submitted 6 November, 2015; originally announced November 2015.

    Comments: We have released the Google Refexp dataset together with a toolbox for visualization and evaluation, see https://github.com/mjhucla/Google_Refexp_toolbox. Camera ready version for CVPR 2016

    ACM Class: I.2.6; I.2.7; I.2.10

  46. arXiv:1507.00302  [pdf, other]

    cs.CV

    Pose Embeddings: A Deep Architecture for Learning to Match Human Poses

    Authors: Greg Mori, Caroline Pantofaru, Nisarg Kothari, Thomas Leung, George Toderici, Alexander Toshev, Weilong Yang

    Abstract: We present a method for learning an embedding that places images of humans in similar poses nearby. This embedding can be used as a direct method of comparing images based on human pose, avoiding potential challenges of estimating body joint positions. Pose embedding learning is formulated under a triplet-based distance criterion. A deep architecture is used to allow learning of a representation c…

    Submitted 1 July, 2015; originally announced July 2015.

  47. arXiv:1411.4555  [pdf, other]

    cs.CV

    Show and Tell: A Neural Image Caption Generator

    Authors: Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan

    Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The mod…

    Submitted 20 April, 2015; v1 submitted 17 November, 2014; originally announced November 2014.

  48. arXiv:1312.4894  [pdf, other]

    cs.CV

    Deep Convolutional Ranking for Multilabel Image Annotation

    Authors: Yunchao Gong, Yangqing Jia, Thomas Leung, Alexander Toshev, Sergey Ioffe

    Abstract: Multilabel image annotation is one of the most important challenges in computer vision with many real-world applications. While existing work usually uses conventional visual features for multilabel annotation, features based on Deep Neural Networks have shown potential to significantly boost performance. In this work, we propose to leverage the advantage of such features and analyze key components…

    Submitted 14 April, 2014; v1 submitted 17 December, 2013; originally announced December 2013.

  49. DeepPose: Human Pose Estimation via Deep Neural Networks

    Authors: Alexander Toshev, Christian Szegedy

    Abstract: We propose a method for human pose estimation based on Deep Neural Networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. We present a cascade of such DNN regressors which results in high precision pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple yet powerful formulation which capita…

    Submitted 20 August, 2014; v1 submitted 17 December, 2013; originally announced December 2013.

    Comments: IEEE Conference on Computer Vision and Pattern Recognition, 2014

  50. arXiv:1312.2249  [pdf, other]

    cs.CV stat.ML

    Scalable Object Detection using Deep Neural Networks

    Authors: Dumitru Erhan, Christian Szegedy, Alexander Toshev, Dragomir Anguelov

    Abstract: Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whol…

    Submitted 8 December, 2013; originally announced December 2013.