
Showing 1–32 of 32 results for author: Koppula, S

  1. arXiv:2407.05921  [pdf, other]

    cs.CV cs.AI cs.LG

    TAPVid-3D: A Benchmark for Tracking Any Point in 3D

    Authors: Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, Carl Doersch

    Abstract: We introduce a new benchmark, TAPVid-3D, for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). While point tracking in two dimensions (TAP) has many benchmarks measuring performance on real-world videos, such as TAPVid-DAVIS, three-dimensional point tracking has none. To this end, leveraging existing footage, we build a new benchmark for 3D point tracking featuring 4,000+ real-w…

    Submitted 8 July, 2024; originally announced July 2024.

  2. arXiv:2402.05861  [pdf, other]

    cs.CV

    Memory Consolidation Enables Long-Context Video Understanding

    Authors: Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff

    Abstract: Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity. While various attempts have been made to extend this context, this has often come at the cost of both conceptual and computational complexity. We propose to instead re-purpose existing pre-trained video transformers by simply fine-tuning them to attend to memories derived non-parametrica…

    Submitted 31 May, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  3. arXiv:2402.00847  [pdf, other]

    cs.CV stat.ML

    BootsTAP: Bootstrapped Training for Tracking-Any-Point

    Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman

    Abstract: To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale groundtruth training data for TAP is only available in simulat…

    Submitted 23 May, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  4. arXiv:2312.07395  [pdf, other]

    cs.CV cs.CL

    A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

    Authors: Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, Aida Nematzadeh

    Abstract: Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. However, we expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in stan…

    Submitted 12 December, 2023; originally announced December 2023.

  5. arXiv:2309.10707  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models

    Authors: Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Raviteja Vemulapalli, Jen-Hao Rick Chang, Karren Yang, Gautam Varma Mantena, Oncel Tuzel

    Abstract: While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data usually are not readily available in many scenarios. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from…

    Submitted 18 September, 2023; originally announced September 2023.

  6. arXiv:2305.13786  [pdf, other]

    cs.CV cs.AI cs.LG

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira

    Abstract: We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning…

    Submitted 30 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  7. arXiv:2304.06600  [pdf, other]

    cs.LG cs.CV cs.RO

    Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation

    Authors: Mohit Sharma, Claudio Fantacci, Yuxiang Zhou, Skanda Koppula, Nicolas Heess, Jon Scholz, Yusuf Aytar

    Abstract: Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: ICLR'23, Project page see https://sites.google.com/view/robo-adapters/

  8. arXiv:2303.14885  [pdf, other]

    eess.AS cs.LG cs.SD

    Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

    Authors: Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel

    Abstract: Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To…

    Submitted 26 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  9. arXiv:2209.15589  [pdf, other]

    cs.CV cs.LG

    Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods

    Authors: Skanda Koppula, Yazhe Li, Evan Shelhamer, Andrew Jaegle, Nikhil Parthasarathy, Relja Arandjelovic, João Carreira, Olivier Hénaff

    Abstract: Self-supervised methods have achieved remarkable success in transfer learning, often achieving the same or better accuracy than supervised pre-training. Most prior work has done so by increasing pre-training computation by adding complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related, but orthogonal question: given a fixed FLOP budget, what…

    Submitted 18 October, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: 11 pages. 36th Conference on Neural Information Processing Systems, Workshop on Self-Supervised Learning (2022)

  10. arXiv:2203.08777  [pdf, other]

    cs.CV cs.AI cs.LG

    Object discovery and representation networks

    Authors: Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, Relja Arandjelović

    Abstract: The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategie…

    Submitted 27 July, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: European Conference on Computer Vision (ECCV) 2022

  11. arXiv:2203.08220  [pdf, other]

    cs.CR cs.AR

    Power-Based Side-Channel Attack for AES Key Extraction on the ATMega328 Microcontroller

    Authors: Utsav Banerjee, Lisa Ho, Skanda Koppula

    Abstract: We demonstrate the extraction of an AES secret key from flash memory on the ATMega328 microcontroller (the microcontroller used on the popular Arduino Genuino/Uno board). We loaded a standard AVR-architecture AES-128 implementation onto the chip and encrypted randomly chosen plaintexts with several different keys. We measured the chip's power consumption during encryption, correlated observed powe…

    Submitted 13 March, 2022; originally announced March 2022.

    Comments: MIT 6.858 Class Project

  12. arXiv:2202.10890  [pdf, other]

    cs.CV

    HiP: Hierarchical Perceiver

    Authors: Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman, Andrew Jaegle

    Abstract: General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This however hinders them from scaling up to the inputs sizes required to process raw high-resolution images or video. In this paper, we show that some degree of l…

    Submitted 3 November, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

  13. arXiv:2202.02310  [pdf, other]

    cs.LG cs.AR

    EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators

    Authors: Lois Orosa, Skanda Koppula, Yaman Umuroglu, Konstantinos Kanellopoulos, Juan Gomez-Luna, Michaela Blott, Kees Vissers, Onur Mutlu

    Abstract: Dilated and transposed convolutions are widely used in modern convolutional neural networks (CNNs). These kernels are used extensively during CNN training and inference of applications such as image segmentation and high-resolution image generation. Although these kernels have grown in popularity, they stress current compute systems due to their high memory intensity, exascale compute demands, and…

    Submitted 4 February, 2022; originally announced February 2022.

  14. arXiv:2110.02891  [pdf, other]

    cs.LG cs.SD eess.AS

    Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

    Authors: Jen-Hao Rick Chang, Ashish Shrivastava, Hema Swetha Koppula, Xiaoshuai Zhang, Oncel Tuzel

    Abstract: Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, under an unsupervised-style setting, typical training algorithms f…

    Submitted 30 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: ICML 2022

  15. arXiv:2107.14795  [pdf, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Authors: Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira

    Abstract: A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data f…

    Submitted 15 March, 2022; v1 submitted 30 July, 2021; originally announced July 2021.

    Comments: ICLR 2022 camera ready. Code: https://dpmd.ai/perceiver-code

  16. arXiv:2103.10957  [pdf, other]

    cs.CV

    Efficient Visual Pretraining with Contrastive Detection

    Authors: Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira

    Abstract: Self-supervised pretraining has been shown to yield powerful representations for transfer learning. These performance gains come at a large computational cost however, with state-of-the-art methods requiring an order of magnitude more computation than supervised pretraining. We tackle this computational bottleneck by introducing a new self-supervised objective, contrastive detection, which tasks r…

    Submitted 5 August, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: Technical report

  17. arXiv:2102.05182  [pdf, other]

    astro-ph.GA cs.LG

    A Deep Learning Approach for Characterizing Major Galaxy Mergers

    Authors: Skanda Koppula, Victor Bapst, Marc Huertas-Company, Sam Blackwell, Agnieszka Grabska-Barwinska, Sander Dieleman, Andrea Huber, Natasha Antropova, Mikolaj Binkowski, Hannah Openshaw, Adria Recasens, Fernando Caro, Avishai Dekel, Yohan Dubois, Jesus Vega Ferrero, David C. Koo, Joel R. Primack, Trevor Back

    Abstract: Fine-grained estimation of galaxy merger stages from observations is a key problem useful for validation of our current theoretical understanding of galaxy formation. To this end, we demonstrate a CNN-based regression model that is able to predict, for the first time, using a single image, the merger stage relative to the first perigee passage with a median error of 38.3 million years (Myrs) over…

    Submitted 9 February, 2021; originally announced February 2021.

    Comments: Third Workshop on Machine Learning and the Physical Sciences (NeurIPS 2020), Vancouver, Canada

  18. arXiv:2101.09904  [pdf, other]

    cs.SD cs.NI eess.AS

    Using Angle of Arrival for Improving Indoor Localization

    Authors: Sai Koppula, Shivang Singh

    Abstract: In this paper, we primarily explore the improvement of single stream audio systems using Angle of Arrival calculations in both simulation and real life gathered data. We wanted to learn how to discern the direction of an audio source from gathered signal data to ultimately incorporate into a multi modal security system. We focused on the MUSIC algorithm for the estimation of the angle of arrival b…

    Submitted 25 January, 2021; originally announced January 2021.

  19. arXiv:2007.13971  [pdf, other]

    cs.CV cs.RO

    Accurate, Low-Latency Visual Perception for Autonomous Racing: Challenges, Mechanisms, and Practical Solutions

    Authors: Kieran Strobel, Sibo Zhu, Raphael Chang, Skanda Koppula

    Abstract: Autonomous racing provides the opportunity to test safety-critical perception pipelines at their limit. This paper describes the practical challenges and solutions to applying state-of-the-art computer vision algorithms to build a low-latency, high-accuracy perception system for DUT18 Driverless (DUT18D), a 4WD electric race car with podium finishes at all Formula Driverless competitions for which…

    Submitted 27 July, 2020; originally announced July 2020.

  20. SMASH: Co-designing Software Compression and Hardware-Accelerated Indexing for Efficient Sparse Matrix Operations

    Authors: Konstantinos Kanellopoulos, Nandita Vijaykumar, Christina Giannoula, Roknoddin Azizi, Skanda Koppula, Nika Mansouri Ghiasi, Taha Shahroodi, Juan Gomez Luna, Onur Mutlu

    Abstract: Important workloads, such as machine learning and graph analytics applications, heavily involve sparse linear algebra operations. These operations use sparse matrix compression as an effective means to avoid storing zeros and performing unnecessary computation on zero elements. However, compression techniques like Compressed Sparse Row (CSR) that are widely used today introduce significant instruc…

    Submitted 23 October, 2019; originally announced October 2019.

  21. arXiv:1910.05340  [pdf, other]

    cs.DC cs.LG

    EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM

    Authors: Skanda Koppula, Lois Orosa, Abdullah Giray Yağlıkçı, Roknoddin Azizi, Taha Shahroodi, Konstantinos Kanellopoulos, Onur Mutlu

    Abstract: The effectiveness of deep neural networks (DNN) in vision, speech, and language processing has prompted a tremendous demand for energy-efficient high-performance DNN inference systems. Due to the increasing memory intensity of most DNN workloads, main memory can dominate the system's energy consumption and stall time. One effective way to reduce the energy consumption and increase the performance…

    Submitted 11 October, 2019; originally announced October 2019.

    Comments: This work is to appear at MICRO 2019

  22. arXiv:1802.03816  [pdf, other]

    cs.CL

    Understanding Recurrent Neural State Using Memory Signatures

    Authors: Skanda Koppula, Khe Chai Sim, Kean Chin

    Abstract: We demonstrate a network visualization technique to analyze the recurrent state inside the LSTMs/GRUs used commonly in language and acoustic models. Interpreting intermediate state and network activations inside end-to-end models remains an open challenge. Our method allows users to understand exactly how much and what history is encoded inside recurrent state in grapheme sequence models. Our proc…

    Submitted 11 February, 2018; originally announced February 2018.

    Comments: Accepted to 2018 IEEE International Conference on Acoustics, Speech and Signal Processing

  23. arXiv:1712.04614  [pdf, other]

    cs.AR

    Applying the Residue Number System to Network Inference

    Authors: Mohamed Abdelhamid, Skanda Koppula

    Abstract: This work explores the lesser studied objective of optimizing the multiply-and-accumulates executed during evaluation of the network. In particular, we propose using the Residue Number System (RNS) as the internal number representation across all layer evaluations, allowing us to explore usage of the more power-efficient RNS multipliers and adders. Using results from simulation of our RNS arithmet…

    Submitted 13 December, 2017; originally announced December 2017.

  24. arXiv:1708.02215  [pdf, other]

    cs.CV

    Learning a CNN-based End-to-End Controller for a Formula SAE Racecar

    Authors: Skanda Koppula

    Abstract: We present a set of CNN-based end-to-end models for controls of a Formula SAE racecar, along with various benchmarking and visualization tools to understand model performance. We tackled three main problems in the context of cone-delineated racetrack driving: (1) discretized steering, which translates a first-person frame along to the track to a predicted steering direction. (2) real-value steerin…

    Submitted 12 July, 2017; originally announced August 2017.

  25. arXiv:1601.00740  [pdf, other]

    cs.RO cs.CV cs.LG

    Brain4Cars: Car That Knows Before You Do via Sensory-Fusion Deep Learning Architecture

    Authors: Ashesh Jain, Hema S Koppula, Shane Soh, Bharad Raghavan, Avi Singh, Ashutosh Saxena

    Abstract: Advanced Driver Assistance Systems (ADAS) have made driving safer over the last decade. They prepare vehicles for unsafe road conditions and alert drivers if they perform a dangerous maneuver. However, many accidents are unavoidable because by the time drivers are alerted, it is already too late. Anticipating maneuvers beforehand can alert drivers before they perform the maneuver and also give ADA…

    Submitted 5 January, 2016; originally announced January 2016.

    Comments: Journal Version (ICCV and ICRA combination with more system details) http://brain4cars.com

  26. arXiv:1509.05016  [pdf, other]

    cs.CV cs.AI cs.RO

    Recurrent Neural Networks for Driver Activity Anticipation via Sensory-Fusion Architecture

    Authors: Ashesh Jain, Avi Singh, Hema S Koppula, Shane Soh, Ashutosh Saxena

    Abstract: Anticipating the future actions of a human is a widely studied problem in robotics that requires spatio-temporal reasoning. In this work we propose a deep learning approach for anticipation in sensory-rich robotics applications. We introduce a sensory-fusion architecture which jointly learns to anticipate and fuse information from multiple sensory streams. Our architecture consists of Recurrent Ne…

    Submitted 16 September, 2015; originally announced September 2015.

    Comments: Follow-up of ICCV 2015 Brain4Cars http://www.brain4cars.com

  27. arXiv:1504.02789  [pdf, other]

    cs.CV

    Car that Knows Before You Do: Anticipating Maneuvers via Learning Temporal Driving Models

    Authors: Ashesh Jain, Hema S. Koppula, Bharad Raghavan, Shane Soh, Ashutosh Saxena

    Abstract: Advanced Driver Assistance Systems (ADAS) have made driving safer over the last decade. They prepare vehicles for unsafe road conditions and alert drivers if they perform a dangerous maneuver. However, many accidents are unavoidable because by the time drivers are alerted, it is already too late. Anticipating maneuvers beforehand can alert drivers before they perform the maneuver and also give ADA…

    Submitted 19 September, 2015; v1 submitted 10 April, 2015; originally announced April 2015.

    Comments: ICCV 2015, http://brain4cars.com

  28. arXiv:1412.0691  [pdf, other]

    cs.AI cs.RO

    RoboBrain: Large-Scale Knowledge Engine for Robots

    Authors: Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, Hema S. Koppula

    Abstract: In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, robot trajectories, visual features and many others. The knowledge stored in the engine comes from mult…

    Submitted 12 April, 2015; v1 submitted 1 December, 2014; originally announced December 2014.

    Comments: 10 pages, 9 figures

  29. arXiv:1210.1207  [pdf, other]

    cs.RO cs.AI cs.CV

    Learning Human Activities and Object Affordances from RGB-D Videos

    Authors: Hema Swetha Koppula, Rudhir Gupta, Ashutosh Saxena

    Abstract: Understanding human activities and object affordances are two very important skills, especially for personal robots which operate in human environments. In this work, we consider the problem of extracting a descriptive labeling of the sequence of sub-activities being performed by a human, and more importantly, of their interactions with the objects in the form of associated affordances. Given a RG…

    Submitted 5 May, 2013; v1 submitted 4 October, 2012; originally announced October 2012.

    Comments: arXiv admin note: substantial text overlap with arXiv:1208.0967

  30. arXiv:1208.0967  [pdf, ps, other]

    cs.CV

    Human Activity Learning using Object Affordances from RGB-D Videos

    Authors: Hema Swetha Koppula, Rudhir Gupta, Ashutosh Saxena

    Abstract: Human activities comprise several sub-activities performed in a sequence and involve interactions with various objects. This makes reasoning about the object affordances a central task for activity recognition. In this work, we consider the problem of jointly labeling the object affordances and human activities from RGB-D videos. We frame the problem as a Markov Random Field where the nodes repres…

    Submitted 4 August, 2012; originally announced August 2012.

  31. arXiv:1111.5358  [pdf, other]

    cs.RO cs.AI cs.CV

    Contextually Guided Semantic Labeling and Search for 3D Point Clouds

    Authors: Abhishek Anand, Hema Swetha Koppula, Thorsten Joachims, Ashutosh Saxena

    Abstract: RGB-D cameras, which give an RGB image together with depths, are becoming increasingly popular for robotic perception. In this paper, we address the task of detecting commonly found objects in the 3D point cloud of indoor scenes obtained from such cameras. Our method uses a graphical model that captures various features and contextual relations, including the local visual appearance and shape cu…

    Submitted 5 September, 2012; v1 submitted 22 November, 2011; originally announced November 2011.

    Comments: arXiv admin note: substantial text overlap with arXiv:1106.5551

  32. arXiv:1106.5551  [pdf, other]

    cs.RO

    Labeling 3D scenes for Personal Assistant Robots

    Authors: Hema Swetha Koppula, Abhishek Anand, Thorsten Joachims, Ashutosh Saxena

    Abstract: Inexpensive RGB-D cameras that give an RGB image together with depth data have become widely available. We use this data to build 3D point clouds of a full scene. In this paper, we address the task of labeling objects in this 3D point cloud of a complete indoor scene such as an office. We propose a graphical model that captures various features and contextual relations, including the local visual…

    Submitted 27 June, 2011; originally announced June 2011.