Showing 1–50 of 81 results for author: Ryoo, M

  1. arXiv:2406.20095  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

    Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo

    Abstract: Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with au…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.09396  [pdf, other]

    cs.CV

    Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

    Authors: Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, Michael S. Ryoo

    Abstract: Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large lan…

    Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2404.08515  [pdf]

    cs.CV eess.IV

    ChatGPT and general-purpose AI count fruits in pictures surprisingly well

    Authors: Konlavach Mengsuwan, Juan Camilo Rivera Palacio, Masahiro Ryo

    Abstract: Object counting is a popular task in deep learning applications in various domains, including agriculture. A conventional deep learning approach requires a large amount of training data, often a logistic problem in a real-world application. To address this issue, we examined how well ChatGPT (GPT4V) and a general-purpose AI (foundation model for object counting, T-Rex) can count the number of frui…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 12 pages, 3 figures

  4. arXiv:2404.07449  [pdf, other]

    cs.CV

    Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

    Authors: Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

    Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these…

    Submitted 10 April, 2024; originally announced April 2024.

  5. arXiv:2403.16998  [pdf, other]

    cs.CV

    Understanding Long Videos in One Multimodal Language Model Pass

    Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Large Language Models (LLMs), known to contain a strong awareness of world knowledge, have allowed recent approaches to achieve excellent performance on Long-Video Understanding benchmarks, but at high inference costs. In this work, we first propose Likelihood Selection, a simple technique that unlocks faster inference in autoregressive LLMs for multiple-choice tasks common in long-video benchmark…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: 24 pages

  6. arXiv:2403.14622  [pdf, other]

    cs.CV

    Language Repository for Long Video Understanding

    Authors: Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo

    Abstract: Language has become a prominent modality in computer vision with the rise of multi-modal LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs, that maintai…

    Submitted 21 March, 2024; originally announced March 2024.

  7. arXiv:2312.03817  [pdf, other]

    cs.CV

    Diffusion Illusions: Hiding Images in Plain Sight

    Authors: Ryan Burgert, Xiang Li, Abe Leite, Kanchana Ranasinghe, Michael S. Ryoo

    Abstract: We explore the problem of computationally generating special `prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing `…

    Submitted 6 December, 2023; originally announced December 2023.

  8. arXiv:2312.01990  [pdf, other]

    cs.RO cs.AI

    SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention

    Authors: Isabel Leal, Krzysztof Choromanski, Deepali Jain, Avinava Dubey, Jake Varley, Michael Ryoo, Yao Lu, Frederick Liu, Vikas Sindhwani, Quan Vuong, Tamas Sarlos, Ken Oslund, Karol Hausman, Kanishka Rao

    Abstract: We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (includi…

    Submitted 4 December, 2023; originally announced December 2023.

  9. arXiv:2311.05698  [pdf, other]

    cs.CV

    Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

    Authors: AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

    Abstract: One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volu…

    Submitted 3 April, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  10. arXiv:2310.20704  [pdf, other]

    cs.CV cs.AI

    Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders

    Authors: Srijan Das, Tanmay Jain, Dominick Reilly, Pranav Balaji, Soumyajit Karmakar, Shyam Marjit, Xiang Li, Abhijit Das, Michael S. Ryoo

    Abstract: Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which can make it difficult to train them with limited data. To address this challenge, prior studies suggest training ViTs with self-supervised learning (SSL) and fine-tuning sequentially. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervis…

    Submitted 27 December, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted to WACV 2024

  11. arXiv:2309.00696  [pdf, other]

    cs.CV

    AAN: Attributes-Aware Network for Temporal Action Detection

    Authors: Rui Dai, Srijan Das, Michael S. Ryoo, Francois Bremond

    Abstract: The challenge of long-term video understanding remains constrained by the efficient extraction of object semantics and the modelling of their relationships for downstream tasks. Although the CLIP visual features exhibit discriminative properties for various vision tasks, particularly in object encoding, they are suboptimal for long-term video understanding. To address this issue, we present the At…

    Submitted 1 September, 2023; originally announced September 2023.

  12. arXiv:2307.15818  [pdf, other]

    cs.RO cs.CL cs.CV cs.LG

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, et al. (29 additional authors not shown)

    Abstract: We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.…

    Submitted 28 July, 2023; originally announced July 2023.

    Comments: Website: https://robotics-transformer.github.io/

  13. arXiv:2307.10922  [pdf, other]

    cs.CV cs.LG

    Language-based Action Concept Spaces Improve Video Self-Supervised Learning

    Authors: Kanchana Ranasinghe, Michael Ryoo

    Abstract: Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling…

    Submitted 26 October, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Presented at NeurIPS 2023

  14. arXiv:2307.01849  [pdf, other]

    cs.RO cs.CV cs.LG

    Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

    Authors: Xiang Li, Varun Belagali, Jinghuan Shang, Michael S. Ryoo

    Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states.…

    Submitted 11 January, 2024; v1 submitted 4 July, 2023; originally announced July 2023.

    Comments: 15 pages, 13 figures. Code, pretrained checkpoints, and datasets are available at https://github.com/LostXine/crossway_diffusion ; video demo is at https://youtu.be/9deKHueZBuk

  15. arXiv:2306.04021  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Energy-Based Models for Cross-Modal Localization using Convolutional Transformers

    Authors: Alan Wu, Michael S. Ryoo

    Abstract: We present a novel framework using Energy-Based Models (EBMs) for localizing a ground vehicle mounted with a range sensor against satellite imagery in the absence of GPS. Lidar sensors have become ubiquitous on autonomous vehicles for describing its surrounding environment. Map priors are typically built using the same sensor modality for localization purposes. However, these map building endeavor…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: ICRA 2023

  16. arXiv:2306.00975  [pdf, other]

    cs.LG cs.CV cs.RO

    Active Vision Reinforcement Learning under Limited Visual Observability

    Authors: Jinghuan Shang, Michael S. Ryoo

    Abstract: In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as motor policy and the latter as sensory policy. For example, humans solve real world tasks by hand manipulation (motor policy) togethe…

    Submitted 5 November, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023. Project page at https://elicassion.github.io/sugarl/sugarl.html ; code at https://github.com/elicassion/sugarl ; environment library at https://github.com/elicassion/active-gym

  17. arXiv:2304.02560  [pdf, other]

    cs.CV

    VicTR: Video-conditioned Text Representations for Activity Recognition

    Authors: Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo

    Abstract: Vision-Language models (VLMs) have excelled in the image-domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely…

    Submitted 29 March, 2024; v1 submitted 5 April, 2023; originally announced April 2023.

    Comments: To appear at CVPR 2024

  18. arXiv:2212.06817  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    RT-1: Robotics Transformer for Real-World Control at Scale

    Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, et al. (26 additional authors not shown)

    Abstract: By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, wher…

    Submitted 11 August, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: See website at robotics-transformer1.github.io

  19. arXiv:2211.13224  [pdf, other]

    cs.CV cs.CL cs.LG

    Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

    Authors: Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo

    Abstract: Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases witho…

    Submitted 21 June, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: 19 pages; contains appendix

  20. arXiv:2211.09119  [pdf, other]

    cs.LG cs.CV cs.RO

    Token Turing Machines

    Authors: Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

    Abstract: We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the p…

    Submitted 13 April, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: CVPR 2023 camera-ready copy

    Journal ref: CVPR 2023

  21. arXiv:2210.15943  [pdf, other]

    cs.CV

    Grafting Vision Transformers

    Authors: Jongwoo Park, Kumara Kahatapitiya, Donghyun Kim, Shivchander Sudalairaj, Quanfu Fan, Michael S. Ryoo

    Abstract: Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better perfor…

    Submitted 3 April, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

  22. arXiv:2209.09874  [pdf, other]

    cs.RO cs.AI cs.CV

    Open-vocabulary Queryable Scene Representations for Real World Planning

    Authors: Boyuan Chen, Fei Xia, Brian Ichter, Kanishka Rao, Keerthana Gopalakrishnan, Michael S. Ryoo, Austin Stone, Daniel Kappler

    Abstract: Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary and queryable scene representation to address this problem. NLMap serves as a framework to gather and integrate conte…

    Submitted 15 October, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: v2, added references to concurrent work and acknowledgments

  23. arXiv:2208.00934  [pdf, other]

    cs.CV

    Video Question Answering with Iterative Video-Text Co-Tokenization

    Authors: AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

    Abstract: Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization…

    Submitted 1 August, 2022; originally announced August 2022.

    Comments: ECCV 2022

  24. arXiv:2207.00579  [pdf, other]

    cs.CV cs.LG

    Video + CLIP Baseline for Ego4D Long-term Action Anticipation

    Authors: Srijan Das, Michael S. Ryoo

    Abstract: In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model: CLIP and a video encoder Slowfast network. The CLIP embedding provides fine-grained understanding of objects relevant for an action whereas the slowfast network is responsible for modeling temporal information…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Secured second position in the Ego4D Challenge for Long-Term Action Anticipation track at CVPR 2022

  25. arXiv:2206.13500  [pdf, other]

    cs.CV cs.GR cs.LG cs.RO

    Neural Neural Textures Make Sim2Real Consistent

    Authors: Ryan Burgert, Jinghuan Shang, Xiang Li, Michael Ryoo

    Abstract: Unpaired image translation algorithms can be used for sim2real tasks, but many fail to generate temporally consistent results. We present a new approach that combines differentiable rendering with image translation to achieve temporal consistency over indefinite timescales, using surface consistency losses and neural neural textures. We call this algorithm TRITON (Texture Recovering Image T…

    Submitted 15 December, 2022; v1 submitted 27 June, 2022; originally announced June 2022.

    Comments: 9 pages, 10 figures (without references or appendix); 16 pages, 16 figures (with appendix)

  26. arXiv:2206.11895  [pdf, other]

    cs.CV cs.LG cs.RO

    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

    Authors: Jinghuan Shang, Srijan Das, Michael S. Ryoo

    Abstract: Humans are remarkably flexible in understanding viewpoint changes due to visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers,…

    Submitted 12 January, 2023; v1 submitted 23 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022. Our code is at https://github.com/elicassion/3DTRL ; our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html ; v3, v4 for minor updates on figures and visualizations

  27. arXiv:2206.05266  [pdf, other]

    cs.LG cs.CV cs.RO

    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?

    Authors: Xiang Li, Jinghuan Shang, Srijan Das, Michael S. Ryoo

    Abstract: We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful…

    Submitted 13 January, 2023; v1 submitted 10 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022. Code for ELo-SACv3 is at https://github.com/LostXine/elo-sac and code for ELo-Rainbow is at https://github.com/LostXine/elo-rainbow

  28. arXiv:2204.00598  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

    Authors: Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence

    Abstract: Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT quest…

    Submitted 27 May, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

    Comments: https://socraticmodels.github.io/

  29. arXiv:2112.03906  [pdf, other]

    cs.CV

    Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

    Authors: Srijan Das, Michael S. Ryoo

    Abstract: Contrastive representation learning of videos highly relies on the availability of millions of unlabelled videos. This is practical for videos available on web but acquiring such large scale of videos for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning, we first analyze the best strategy to mi…

    Submitted 27 July, 2023; v1 submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted at MVA 2023

  30. arXiv:2112.03905  [pdf, other]

    cs.CV

    ViewCLR: Learning Self-supervised Video Representation for Unseen Viewpoints

    Authors: Srijan Das, Michael S. Ryoo

    Abstract: Learning self-supervised video representation predominantly focuses on discriminating instances generated from simple data augmentation schemes. However, the learned representation often fails to generalize over unseen camera viewpoints. To this end, we propose ViewCLR, that learns self-supervised video representation invariant to camera viewpoint changes. We introduce a view-generator that can be…

    Submitted 7 December, 2021; originally announced December 2021.

    Comments: 13 pages. Code and models will be updated soon

  31. arXiv:2112.03902  [pdf, other]

    cs.CV

    MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

    Authors: Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael S. Ryoo, Francois Bremond

    Abstract: Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos. The temporal relation is complex in those datasets, including challenges like composite action, and co-occurring action. For detecting actions in those complex videos, efficiently capturing both short-term and long-term temporal information in the video is critical. To this end, we…

    Submitted 29 March, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted in CVPR 2022

  32. arXiv:2112.01514  [pdf, other]

    cs.CV

    Self-supervised Video Transformer

    Authors: Kanchana Ranasinghe, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Michael Ryoo

    Abstract: In this paper, we propose self-supervised training for video transformers using unlabeled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views representing the same video, to be invariant to spatiotemporal variations in actions. To the best of ou…

    Submitted 19 March, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: Accepted to CVPR '22

  33. arXiv:2111.13677  [pdf, other]

    cs.CV

    SWAT: Spatial Structure Within and Among Tokens

    Authors: Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: Modeling visual data as tokens (i.e., image patches) using attention mechanisms, feed-forward networks or convolutions has been highly effective in recent years. Such methods usually have a common pipeline: a tokenization method, followed by a set of layers/blocks for information mixing, both within and among tokens. When image patches are converted into tokens, they are often flattened, discardin…

    Submitted 20 November, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Accepted to be published at IJCAI23

  34. arXiv:2111.13675  [pdf, other]

    cs.CV

    Weakly-guided Self-supervised Pretraining for Temporal Activity Detection

    Authors: Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu, Michael S. Ryoo, Gang Hua

    Abstract: Temporal Activity Detection aims to predict activity classes per frame, in contrast to video-level predictions in Activity Classification (i.e., Activity Recognition). Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited. Thus, commonly, previous work on temporal activity detection resorts to fine-tuning a classification model pretrained o…

    Submitted 4 February, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Published as a conference paper at AAAI 2023

  35. StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning

    Authors: Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, Michael S. Ryoo

    Abstract: Reinforcement Learning (RL) can be considered as a sequence modeling task: given a sequence of past state-action-reward experiences, an agent predicts a sequence of next actions. In this work, we propose State-Action-Reward Transformer (StARformer) for visual RL, which explicitly models short-term state-action-reward representations (StAR-representations), essentially introducing a Markovian-like…

    Submitted 3 January, 2023; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted to ECCV 2022. Our code is available at https://github.com/elicassion/StARformer

  36. arXiv:2110.04367  [pdf, other]

    cs.LG stat.ML

    Hybrid Random Features

    Authors: Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, Adrian Weller

    Abstract: We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi and Recht, 2007) or (recently introduced in the…

    Submitted 30 January, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Published as a conference paper at ICLR 2022

  37. arXiv:2109.01066  [pdf, other]

    cs.CV

    4D-Net for Learned Multi-Modal Alignment

    Authors: AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova

    Abstract: We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines…

    Submitted 2 September, 2021; originally announced September 2021.

    Comments: ICCV 2021

  38. arXiv:2108.01069  [pdf, other]

    cs.RO cs.CV cs.LG

    Self-Supervised Disentangled Representation Learning for Third-Person Imitation Learning

    Authors: Jinghuan Shang, Michael S. Ryoo

    Abstract: Humans learn to imitate by observing others. However, robot imitation learning generally requires expert demonstrations in the first-person view (FPV). Collecting such FPV videos for every robot could be very expensive. Third-person imitation learning (TPIL) is the concept of learning action policies by observing other agents in a third-person view (TPV), similar to what humans do. This ultimately…

    Submitted 2 August, 2021; originally announced August 2021.

    Comments: Preprint. 8 pages. Accepted at IROS 2021

  39. arXiv:2106.14733  [pdf, other]

    cs.CV

    Unsupervised Discovery of Actions in Instructional Videos

    Authors: AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

    Abstract: In this paper we address the problem of automatically discovering atomic actions in unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as, autonomous robots or virtual assistants, which can, for example, automatically `read' the steps from an instructional video and execute them. However,…

    Submitted 28 June, 2021; originally announced June 2021.

    Comments: Full paper

  40. arXiv:2106.11297  [pdf, other]

    cs.CV cs.LG

    TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

    Authors: Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

    Abstract: In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual…

    Submitted 3 April, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

    Comments: This is the full version of the paper, extending its conference paper at NeurIPS 2021. Version 1.1 of the code is released

    Journal ref: NeurIPS 2021

  41. arXiv:2106.03738  [pdf, other]

    cs.CV

    Unsupervised Action Segmentation for Instructional Videos

    Authors: AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

    Abstract: In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos based on a sequential stochastic autoregressive model for temporal segmentation of videos. This…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: 4 page abstract for LUV workshop

  42. arXiv:2104.07135  [pdf, other]

    cs.CV

    Adaptive Intermediate Representations for Video Understanding

    Authors: Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova

    Abstract: A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow. In this work, we introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding and use it in a way that requires no additional labeling. Second, we propose a general framework which learns the inte…

    Submitted 14 April, 2021; originally announced April 2021.

  43. arXiv:2103.16516  [pdf, other]

    cs.CV

    Recognizing Actions in Videos from Unseen Viewpoints

    Authors: AJ Piergiovanni, Michael S. Ryoo

    Abstract: Standard methods for video recognition use large CNNs designed to capture spatio-temporal data. However, training these models requires a large amount of labeled training data, containing a wide variety of actions, scenes, settings and camera viewpoints. In this paper, we show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in the…

    Submitted 30 March, 2021; originally announced March 2021.

    Journal ref: CVPR 2021

  44. arXiv:2103.14633  [pdf, other]

    cs.RO cs.CV cs.LG cs.NE

    Visionary: Vision architecture discovery for robot learning

    Authors: Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo

    Abstract: We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs. Our approach automatically designs architectures while training on the task, discovering novel ways of combining and attending image feature representations with actions as well as features from previous layer…

    Submitted 26 March, 2021; originally announced March 2021.

    Journal ref: ICRA 2021

  45. arXiv:2103.01302  [pdf, other]

    cs.CV

    Coarse-Fine Networks for Temporal Activity Detection in Videos

    Authors: Kumara Kahatapitiya, Michael S. Ryoo

    Abstract: In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional video models process inputs at one (or few) fixed temporal resolution without any dynamic frame selection. However, we argue that processing multiple temporal resolutions of the input a…

    Submitted 1 April, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: To appear at CVPR 2021

  46. arXiv:2011.07092  [pdf, other]

    cs.CV

    Reducing Inference Latency with Concurrent Architectures for Image Recognition

    Authors: Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim

    Abstract: Satisfying the high computation demand of modern deep learning architectures while achieving low inference latency is challenging. Current approaches to decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with higher concurrency (i.e., simultaneous execution of one in…

    Submitted 13 November, 2020; originally announced November 2020.

  47. arXiv:2008.08072  [pdf, other]

    cs.CV cs.LG cs.NE

    AssembleNet++: Assembling Modality Representations via Attention Connections

    Authors: Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

    Abstract: We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using ano…

    Submitted 18 August, 2020; originally announced August 2020.

    Comments: ECCV 2020 camera-ready version

    Journal ref: ECCV 2020

  48. arXiv:2008.04888  [pdf, other]

    cs.CV

    Adversarial Generative Grammars for Human Activity Prediction

    Authors: AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

    Abstract: In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal represe…

    Submitted 14 August, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

    Comments: ECCV 2020 (Oral)

  49. arXiv:2007.12034  [pdf, other]

    cs.CV cs.LG eess.IV

    AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

    Authors: Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

    Abstract: Convolutional operations have two limitations: (1) they do not explicitly model where to focus, as the same filter is applied to all positions, and (2) they are unsuitable for modeling long-range dependencies, as they only operate on a small neighborhood. While both limitations can be alleviated by attention operations, many design choices remain to be determined to use attention, especially when applying…

    Submitted 31 July, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

    Comments: ECCV 2020

  50. arXiv:2007.05515  [pdf, other]

    cs.CV

    AViD Dataset: Anonymized Videos from Diverse Countries

    Authors: AJ Piergiovanni, Michael S. Ryoo

    Abstract: We introduce a new public video dataset for action recognition: Anonymized Videos from Diverse countries (AViD). Unlike existing public video datasets, AViD is a collection of action videos from many different countries. The motivation is to create a public dataset that would benefit training and pretraining of action recognition models for everybody, rather than making it useful for limited count…

    Submitted 3 November, 2020; v1 submitted 10 July, 2020; originally announced July 2020.

    Comments: https://github.com/piergiaj/AViD

    Journal ref: NeurIPS 2020