Showing 1–20 of 20 results for author: Nagarajan, T

  1. arXiv:2404.16222

    cs.CV

    Step Differences in Instructional Video

    Authors: Tushar Nagarajan, Lorenzo Torresani

    Abstract: Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos f…

    Submitted 27 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  2. arXiv:2402.13250

    cs.CV

    Video ReCap: Recursive Captioning of Hour-Long Videos

    Authors: Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

    Abstract: Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process v…

    Submitted 16 May, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted by CVPR 2024

  3. arXiv:2402.12239

    eess.SP cs.SD eess.AS

    Significance of Chirp MFCC as a Feature in Speech and Audio Applications

    Authors: S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan

    Abstract: A novel feature, based on the chirp z-transform, that offers an improved representation of the underlying true spectrum is proposed. This feature, the chirp MFCC, is derived by computing the Mel frequency cepstral coefficients from the chirp magnitude spectrum, instead of the Fourier transform magnitude spectrum. The theoretical foundations for the proposal, and the experimental validation using p…

    Submitted 19 February, 2024; originally announced February 2024.

  4. arXiv:2401.01823

    cs.CV

    Detours for Navigating Instructional Videos

    Authors: Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman

    Abstract: We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve t…

    Submitted 4 May, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

    Comments: CVPR 2024

  5. arXiv:2311.18259

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  6. arXiv:2309.16058

    cs.LG cs.CL cs.CV

    AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

    Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar

    Abstract: We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a…

    Submitted 27 September, 2023; originally announced September 2023.

  7. arXiv:2301.02217

    cs.CV

    EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

    Authors: Shuhan Tan, Tushar Nagarajan, Kristen Grauman

    Abstract: Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Tech report. Project page: https://vision.cs.utexas.edu/projects/egodistill

  8. arXiv:2207.11365

    cs.CV

    EgoEnv: Human-centric environment representations from egocentric video

    Authors: Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

    Abstract: First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocen…

    Submitted 9 November, 2023; v1 submitted 22 July, 2022; originally announced July 2022.

    Comments: Published in NeurIPS 2023 (Oral)

  9. arXiv:2110.07692

    cs.CV cs.RO

    Shaping embodied agent behavior with activity-context priors from egocentric video

    Authors: Tushar Nagarajan, Kristen Grauman

    Abstract: Complex physical tasks entail a sequence of object interactions, each with its own preconditions -- which can be difficult for robotic agents to learn efficiently solely through their own experience. We introduce an approach to discover activity-context priors from in-the-wild egocentric video captured with human worn cameras. For a given object, an activity-context prior represents the set of oth…

    Submitted 14 October, 2021; originally announced October 2021.

  10. arXiv:2110.07058

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons…

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  11. arXiv:2104.07905

    cs.CV

    Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos

    Authors: Yanghao Li, Tushar Nagarajan, Bo Xiong, Kristen Grauman

    Abstract: We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Learning from purely egocentric data is limited by low dataset scale and diversity, while using purely exocentric (third-person) data introduces a large domain mismatch. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific propertie…

    Submitted 16 April, 2021; originally announced April 2021.

    Comments: Accepted by CVPR-2021

  12. arXiv:2102.02337

    cs.CV

    Environment Predictive Coding for Embodied Agents

    Authors: Santhosh K. Ramakrishnan, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman

    Abstract: We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out porti…

    Submitted 3 February, 2021; originally announced February 2021.

    Comments: 9 pages, 6 figures, appendix

  13. arXiv:2010.06978

    cs.LG stat.ML

    Differentiable Causal Discovery Under Unmeasured Confounding

    Authors: Rohit Bhattacharya, Tushar Nagarajan, Daniel Malinsky, Ilya Shpitser

    Abstract: The data drawn from biological, economic, and social systems are often confounded due to the presence of unmeasured variables. Prior work in causal discovery has focused on discrete search procedures for selecting acyclic directed mixed graphs (ADMGs), specifically ancestral ADMGs, that encode ordinary conditional independence constraints among the observed variables of the system. However, confou…

    Submitted 24 February, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

    Comments: Main draft: 9 pages. Appendix: 9 pages

    ACM Class: G.3; J.3; F.2.2

  14. arXiv:2008.09241

    cs.CV

    Learning Affordance Landscapes for Interaction Exploration in 3D Environments

    Authors: Tushar Nagarajan, Kristen Grauman

    Abstract: Embodied agents operating in human spaces must be able to master how their environment works: what objects can the agent use, and how can it use them? We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (such as an unfamiliar kitchen). Given an egocentric RGB-D cam…

    Submitted 18 October, 2020; v1 submitted 20 August, 2020; originally announced August 2020.

    Comments: To be published in NeurIPS 2020

  15. arXiv:2001.04583

    cs.CV

    EGO-TOPO: Environment Affordances from Egocentric Video

    Authors: Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

    Abstract: First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a hu…

    Submitted 27 March, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

    Comments: Published in CVPR 2020, project page: http://vision.cs.utexas.edu/projects/ego-topo/

  16. arXiv:1906.01963

    cs.CV

    Grounded Human-Object Interaction Hotspots from Video (Extended Abstract)

    Authors: Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

    Abstract: Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching v…

    Submitted 3 June, 2019; originally announced June 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1812.04558

  17. arXiv:1812.04558

    cs.CV

    Grounded Human-Object Interaction Hotspots from Video

    Authors: Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

    Abstract: Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction "hotspots" directly from video. Rather than treat affordances as a manually supervised semantic segmentation task, our approach learns about interactions by watching v…

    Submitted 2 April, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

  18. arXiv:1803.09851

    cs.CV

    Attributes as Operators: Factorizing Unseen Attribute-Object Compositions

    Authors: Tushar Nagarajan, Kristen Grauman

    Abstract: We present a new approach to modeling visual attributes. Prior work casts attributes in a similar role as objects, learning a latent representation where properties (e.g., sliced) are recognized by classifiers much in the way objects (e.g., apple) are. However, this common approach fails to separate the attributes observed during training from the objects with which they are composed, making it in…

    Submitted 27 August, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: European Conference on Computer Vision (ECCV) 2018

  19. arXiv:1711.08393

    cs.CV cs.LG

    BlockDrop: Dynamic Inference Paths in Residual Networks

    Authors: Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, Rogerio Feris

    Abstract: Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications. We introduce BlockDrop, an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. Exploiting the robustness of R…

    Submitted 28 January, 2019; v1 submitted 22 November, 2017; originally announced November 2017.

    Comments: CVPR 2018

  20. arXiv:1710.09942

    cs.CL

    CANDiS: Coupled & Attention-Driven Neural Distant Supervision

    Authors: Tushar Nagarajan, Sharmistha, Partha Talukdar

    Abstract: Distant Supervision for Relation Extraction uses heuristically aligned text data with an existing knowledge base as training data. The unsupervised nature of this technique allows it to scale to web-scale relation extraction tasks, at the expense of noise in the training data. Previous work has explored relationships among instances of the same entity-pair to reduce this noise, but relationships a…

    Submitted 26 October, 2017; originally announced October 2017.

    Comments: WiNLP 2017