Showing 1–50 of 124 results for author: Grauman, K

Searching in archive cs.
  1. arXiv:2406.09272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

    Abstract: Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinat…

    Submitted 20 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound

  2. arXiv:2406.07754  [pdf, other]

    cs.CV

    HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

    Authors: Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman

    Abstract: We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits -- especially when objec…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Project website: https://vision.cs.utexas.edu/projects/HOI-Swap/

  3. arXiv:2405.02821  [pdf, other]

    cs.SD cs.AI cs.LG cs.RO eess.AS

    Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

    Authors: Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman

    Abstract: Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. The sound differs from light in that it spans a…

    Submitted 5 May, 2024; originally announced May 2024.

  4. arXiv:2404.16216  [pdf, other]

    cs.CV cs.RO cs.SD eess.AS

    ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

    Authors: Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman

    Abstract: An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to inte…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/active_rir/

  5. arXiv:2404.05206  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

    Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

    Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

  6. arXiv:2403.06351  [pdf, other]

    cs.CV

    Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos

    Authors: Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman

    Abstract: We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor based on a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: high-level structure transformation, which explicitly…

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: 22 pages

  7. arXiv:2401.01823  [pdf, other]

    cs.CV

    Detours for Navigating Instructional Videos

    Authors: Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman

    Abstract: We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve t…

    Submitted 4 May, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: CVPR 2024

  8. arXiv:2312.11782  [pdf, other]

    cs.CV

    Learning Object State Changes in Videos: An Open-World Perspective

    Authors: Zihui Xue, Kumar Ashutosh, Kristen Grauman

    Abstract: Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects, current approaches are confined to a closed vocabulary. Addressing this gap, we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC -- the object's initial state, i…

    Submitted 3 April, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

    Comments: Accepted by CVPR 2024, Project website: https://vision.cs.utexas.edu/projects/VidOSC/

  9. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  10. arXiv:2307.15064  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Self-Supervised Visual Acoustic Matching

    Authors: Arjun Somayazulu, Changan Chen, Kristen Grauman

    Abstract: Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised ap…

    Submitted 23 November, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: Project page: https://vision.cs.utexas.edu/projects/ss_vam/. Accepted at NeurIPS 2023

  11. arXiv:2307.08763  [pdf, other]

    cs.CV

    Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

    Authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman

    Abstract: Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a predefined sequ…

    Submitted 29 October, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  12. arXiv:2307.04760  [pdf, other]

    cs.CV cs.SD eess.AS

    Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downst…

    Submitted 5 May, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: Accepted to CVPR 2024

  13. arXiv:2306.15850  [pdf, other]

    cs.CV

    SpotEM: Efficient Video Search for Episodic Memory

    Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

    Abstract: The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve effici…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Published in ICML 2023

  14. arXiv:2306.09324  [pdf, other]

    cs.CV

    Single-Stage Visual Query Localization in Egocentric Videos

    Authors: Hanwen Jiang, Santhosh Kumar Ramakrishnan, Kristen Grauman

    Abstract: Visual Query Localization on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital to build episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is independently trained and the complexity of the pipeline re…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Winner of Ego4D VQ2D challenge 2023

  15. arXiv:2306.05526  [pdf, other]

    cs.CV

    Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

    Authors: Zihui Xue, Kristen Grauman

    Abstract: The egocentric and exocentric viewpoints of a human activity look dramatically different, yet invariant representations to link them are essential for many potential applications in robotics and augmented reality. Prior work is limited to learning view-invariant features from paired synchronized viewpoints. We relax that strong data assumption and propose to learn fine-grained action features that…

    Submitted 25 November, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Accepted by NeurIPS 2023, Project website: https://vision.cs.utexas.edu/projects/AlignEgoExo/

  16. arXiv:2302.01891  [pdf, other]

    cs.CV

    Egocentric Video Task Translation @ Ego4D Challenge 2022

    Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

    Abstract: This technical report describes the EgoTask Translation approach that explores relations among a set of egocentric video tasks in the Ego4D challenge. To improve the primary task of interest, we propose to leverage existing models developed for other related tasks and design a task translator that learns to "translate" auxiliary task features to the primary task. With no modification to the base…

    Submitted 3 February, 2023; originally announced February 2023.

    Comments: The technical report of ECCV@2022 Ego4D challenge

  17. arXiv:2301.08730  [pdf, other]

    cs.CV cs.SD eess.AS

    Novel-View Acoustic Synthesis

    Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

    Abstract: We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benc…

    Submitted 24 October, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

    Comments: Accepted at CVPR 2023. Project page: https://vision.cs.utexas.edu/projects/nvas

  18. A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

    Authors: Megan M. Baker, Alexander New, Mario Aguilar-Simon, Ziad Al-Halah, Sébastien M. R. Arnold, Ese Ben-Iwhiwhu, Andrew P. Brna, Ethan Brooks, Ryan C. Brown, Zachary Daniels, Anurag Daram, Fabien Delattre, Ryan Dellana, Eric Eaton, Haotian Fu, Kristen Grauman, Jesse Hostetler, Shariq Iqbal, Cassandra Kent, Nicholas Ketz, Soheil Kolouri, George Konidaris, Dhireesha Kudithipudi, Erik Learned-Miller, Seungwon Lee, et al. (22 additional authors not shown)

    Abstract: Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through th…

    Submitted 18 January, 2023; originally announced January 2023.

    Comments: To appear in Neural Networks

  19. arXiv:2301.02311  [pdf, other]

    cs.CV

    HierVL: Learning Hierarchical Video-Language Embeddings

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos acc…

    Submitted 8 June, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: CVPR 2023

  20. arXiv:2301.02307  [pdf, other]

    cs.CV

    What You Say Is What You Show: Visual Narration Detection in Instructional Videos

    Authors: Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman

    Abstract: Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies. However, this data is extremely noisy, as the narrations do not always describe the actions demonstrated in the video. To address this problem we introduce the novel task of visual narration detection, which entails determining w…

    Submitted 18 July, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Technical Report

  21. arXiv:2301.02217  [pdf, other]

    cs.CV

    EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

    Authors: Shuhan Tan, Tushar Nagarajan, Kristen Grauman

    Abstract: Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Tech report. Project page: https://vision.cs.utexas.edu/projects/egodistill

  22. arXiv:2301.02184  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

    Authors: Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu

    Abstract: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multi…

    Submitted 20 April, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: Accepted to CVPR 2023

  23. arXiv:2301.00746  [pdf, other]

    cs.CV

    NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

    Authors: Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

    Abstract: Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window output…

    Submitted 25 March, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

    Comments: 13 pages, 7 figures, appearing in CVPR 2023

  24. arXiv:2212.06301  [pdf, other]

    cs.CV

    Egocentric Video Task Translation

    Authors: Zihui Xue, Yale Song, Kristen Grauman, Lorenzo Torresani

    Abstract: Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, naviga…

    Submitted 6 April, 2023; v1 submitted 12 December, 2022; originally announced December 2022.

    Comments: Accepted by CVPR 2023 (Highlight), Project website: https://vision.cs.utexas.edu/projects/egot2/

  25. arXiv:2212.04492  [pdf, other]

    cs.CV

    Few-View Object Reconstruction with Unknown Categories and Camera Poses

    Authors: Hanwen Jiang, Zhenyu Jiang, Kristen Grauman, Yuke Zhu

    Abstract: While object reconstruction has made great strides in recent years, current methods typically require densely captured images and/or known camera poses, and generalize poorly to novel object categories. To step toward object reconstruction in the wild, this work explores reconstructing general real-world objects from a few images without known camera poses or object categories. The crux of our wor…

    Submitted 25 January, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

  26. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  27. arXiv:2207.11365  [pdf, other]

    cs.CV

    EgoEnv: Human-centric environment representations from egocentric video

    Authors: Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

    Abstract: First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocen…

    Submitted 9 November, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

    Comments: Published in NeurIPS 2023 (Oral)

  28. arXiv:2206.08312  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

    Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, Kristen Grauman

    Abstract: We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, m…

    Submitted 23 January, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces2

  29. arXiv:2206.04006  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Few-Shot Audio-Visual Learning of Environment Acoustics

    Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed…

    Submitted 24 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted to NeurIPS 2022

  30. arXiv:2202.06875  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Visual Acoustic Matching

    Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

    Abstract: We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal tr…

    Submitted 13 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: Project page: https://vision.cs.utexas.edu/projects/visual-acoustic-matching. Accepted at CVPR 2022

  31. arXiv:2202.02440  [pdf, other]

    cs.CV cs.AI cs.LG

    Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

    Authors: Ziad Al-Halah, Santhosh K. Ramakrishnan, Kristen Grauman

    Abstract: In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality.…

    Submitted 28 April, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: CVPR 2022. Project page: https://vision.cs.utexas.edu/projects/zsel/

  32. arXiv:2202.00850  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS eess.IV

    Active Audio-Visual Separation of Dynamic Sound Sources

    Authors: Sagnik Majumder, Kristen Grauman

    Abstract: We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs…

    Submitted 25 July, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: Accepted to ECCV 2022

  33. arXiv:2202.00164  [pdf, other]

    cs.RO cs.CV

    DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video

    Authors: Priyanka Mandikal, Kristen Grauman

    Abstract: Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing…

    Submitted 31 January, 2022; originally announced February 2022.

  34. arXiv:2201.10029  [pdf, other]

    cs.CV cs.AI

    PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

    Authors: Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, Kristen Grauman

    Abstract: State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'. Our key insight is that…

    Submitted 17 June, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

    Comments: 8 pages + supplementary. Accepted in CVPR 2022

  35. arXiv:2111.10882  [pdf, other]

    cs.CV cs.SD eess.AS

    Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

    Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

    Abstract: Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach ex…

    Submitted 21 November, 2021; originally announced November 2021.

    Comments: Published in BMVC 2021, project page: http://vision.cs.utexas.edu/projects/geometry-aware-binaural/

  36. arXiv:2110.07692  [pdf, other]

    cs.CV cs.RO

    Shaping embodied agent behavior with activity-context priors from egocentric video

    Authors: Tushar Nagarajan, Kristen Grauman

    Abstract: Complex physical tasks entail a sequence of object interactions, each with its own preconditions -- which can be difficult for robotic agents to learn efficiently solely through their own experience. We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human worn cameras. For a given object, an activity-context prior represents the set of oth…

    Submitted 14 October, 2021; originally announced October 2021.

  37. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  38. arXiv:2107.02739  [pdf, other]

    econ.EM cs.CV

    Shapes as Product Differentiation: Neural Network Embedding in the Analysis of Markets for Fonts

    Authors: Sukjin Han, Eric H. Schulman, Kristen Grauman, Santhosh Ramakrishnan

    Abstract: Many differentiated products have key attributes that are unstructured and thus high-dimensional (e.g., design, text). Instead of treating unstructured attributes as unobservables in economic models, quantifying them can be important to answer interesting economic questions. To propose an analytical framework for these types of products, this paper considers one of the simplest design products-fon…

    Submitted 7 March, 2024; v1 submitted 6 July, 2021; originally announced July 2021.

  39. arXiv:2106.07732  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Learning Audio-Visual Dereverberation

    Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman

    Abstract: Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry…

    Submitted 13 March, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted at ICASSP 2023. This is the longer version of the five-page camera-ready paper. Project page: https://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation

  40. arXiv:2106.02036  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Anticipative Video Transformer

    Authors: Rohit Girdhar, Kristen Grauman

    Abstract: We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal a…

    Submitted 22 September, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

    Comments: ICCV 2021. Ranked #1 in CVPR'21 EPIC-Kitchens-100 Action Anticipation challenge. Webpage/code/models: http://facebookresearch.github.io/AVT

  41. arXiv:2105.09544  [pdf, other]

    cs.CV

    Egocentric Activity Recognition and Localization on a 3D Map

    Authors: Miao Liu, Lingni Ma, Kiran Somasundaram, Yin Li, Kristen Grauman, James M. Rehg, Chao Li

    Abstract: Given a video captured from a first person perspective and the environment context of where the video is recorded, can we recognize what the person is doing and identify where the action occurs in the 3D space? We address this challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. To this end, we propose a novel deep probabilist…

    Submitted 12 August, 2022; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: European Conference on Computer Vision (ECCV) 2022

  42. arXiv:2105.07142  [pdf, other]

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Move2Hear: Active Audio-Visual Source Separation

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and it must use its eyes and ears to automatically separate out the sounds originating fro…

    Submitted 25 August, 2021; v1 submitted 15 May, 2021; originally announced May 2021.

    Comments: Accepted to ICCV 2021

  43. arXiv:2104.07905  [pdf, other]

    cs.CV

    Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

    Authors: Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman

    Abstract: We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific propertie…

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: Accepted to CVPR 2021

  44. arXiv:2104.00682  [pdf, other]

    cs.CV cs.AI cs.LG

    Multiview Pseudo-Labeling for Semi-supervised Learning from Video

    Authors: Bo Xiong, Haoqi Fan, Kristen Grauman, Christoph Feichtenhofer

    Abstract: We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video. The complementary views help obtain more reliable pseudo-labels on unlabeled video, to learn stronger video representations than from purely supervised data. Though our method capitalizes on multip…

    Submitted 1 April, 2021; originally announced April 2021.

    Comments: Technical report

  45. arXiv:2102.02337  [pdf, other]

    cs.CV

    Environment Predictive Coding for Embodied Agents

    Authors: Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

    Abstract: We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out porti…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: 9 pages, 6 figures, appendix

  46. arXiv:2102.01690  [pdf, other]

    cs.CV

    From Culture to Clothing: Discovering the World Events Behind A Century of Fashion Images

    Authors: Wei-Lin Hsiao, Kristen Grauman

    Abstract: Fashion is intertwined with external cultural factors, but identifying these links remains a manual process limited to only the most salient phenomena. We propose a data-driven approach to identify specific cultural factors affecting the clothes people wear. Using large-scale datasets of news articles and vintage photos spanning a century, we present a multi-modal statistical model to detect influ…

    Submitted 20 September, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

    Comments: Accepted to ICCV 2021

  47. arXiv:2101.03149  [pdf, other]

    cs.CV cs.SD eess.IV

    VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

    Authors: Ruohan Gao, Kristen Grauman

    Abstract: We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional…

    Submitted 6 April, 2021; v1 submitted 8 January, 2021; originally announced January 2021.

    Comments: In CVPR 2021. Project page: http://vision.cs.utexas.edu/projects/VisualVoice/

  48. arXiv:2012.15470  [pdf, other]

    cs.CV

    Audio-Visual Floorplan Reconstruction

    Authors: Senthil Purushwalkam, Sebastian Vicenc Amengual Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Gupta, Kristen Grauman

    Abstract: Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense…

    Submitted 31 December, 2020; originally announced December 2020.

  49. arXiv:2012.11583  [pdf, other]

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Semantic Audio-Visual Navigation

    Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based…

    Submitted 6 April, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

    Comments: Project page: http://vision.cs.utexas.edu/projects/semantic-audio-visual-navigation

  50. arXiv:2012.02897  [pdf, other]

    cs.CV

    Discovering Underground Maps from Fashion

    Authors: Utkarsh Mall, Kavita Bala, Tamara Berg, Kristen Grauman

    Abstract: The fashion sense -- meaning the clothing styles people wear -- in a geographical region can reveal information about that region. For example, it can reflect the kind of activities people do there, or the type of crowds that frequently visit the region (e.g., tourist hot spot, student neighborhood, business center). We propose a method to automatically create underground neighborhood maps of citi…

    Submitted 4 December, 2020; originally announced December 2020.