Showing 1–50 of 148 results for author: Parikh, D

Searching in archive cs.
  1. arXiv:2405.12413  [pdf, other]

    cs.CL

    Targeted Multilingual Adaptation for Low-resource Language Families

    Authors: C. M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld

    Abstract: The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best…

    Submitted 20 May, 2024; originally announced May 2024.

  2. arXiv:2405.10391  [pdf, other]

    cs.RO cs.AI eess.IV

    Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

    Authors: Anish Bhattacharya, Nishanth Rao, Dhruv Parikh, Pratik Kunapuli, Nikolai Matni, Vijay Kumar

    Abstract: We demonstrate the capabilities of an attention-based end-to-end approach for high-speed quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional vision-based navigation via independent mapping, p…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 8 pages, 10 figures, 3 tables

  3. arXiv:2404.04527  [pdf, other]

    cs.CV cs.AI cs.AR cs.DC

    VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA

    Authors: Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, Carl Busart

    Abstract: Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive trai…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: SPIE DCS 2024

  4. arXiv:2403.14047  [pdf, other]

    cs.DC cs.AR cs.CV

    Accelerating ViT Inference on FPGA through Static and Dynamic Pruning

    Authors: Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna

    Abstract: Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamic…

    Submitted 12 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: FCCM 2024

  5. arXiv:2403.09334  [pdf, other]

    cs.CV

    Video Editing via Factorized Diffusion Distillation

    Authors: Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman

    Abstract: We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing we introduce a new unsupervised distillation procedure,…

    Submitted 24 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  6. arXiv:2311.10709  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG cs.MM

    Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

    Authors: Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

    Abstract: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training--that enable us to directly generate high quality and high resolut…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: Project page: https://emu-video.metademolab.com

  7. arXiv:2311.10089  [pdf, other]

    cs.CV cs.AI cs.LG

    Emu Edit: Precise Image Editing via Recognition and Generation Tasks

    Authors: Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman

    Abstract: Instruction-based image editing holds immense potential for a variety of applications, as it enables users to perform any editing operation using a natural language instruction. However, current models in this domain often struggle with accurately executing user instructions. We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editin…

    Submitted 16 November, 2023; originally announced November 2023.

  8. arXiv:2309.15807  [pdf, other]

    cs.CV

    Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

    Authors: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, et al. (1 additional author not shown)

    Abstract: Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusivel…

    Submitted 27 September, 2023; originally announced September 2023.

  9. arXiv:2309.09142  [pdf, other]

    cs.DC

    Performance of Graph Neural Networks for Point Cloud Applications

    Authors: Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, Carl Busart

    Abstract: Graph Neural Networks (GNNs) have gained significant momentum recently due to their capability to learn on unstructured graph data. Dynamic GNNs (DGNNs) are the current state-of-the-art for point cloud applications; such applications (viz. autonomous driving) require real-time processing at the edge with tight latency and memory constraints. Conducting performance analysis on such DGNNs, thus, bec…

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: 27th Annual IEEE High Performance Extreme Computing Conference

  10. arXiv:2308.15502  [pdf, ps, other]

    cs.LG cs.CR cs.MM

    On the Steganographic Capacity of Selected Learning Models

    Authors: Rishit Agrawal, Kelvin Jou, Tanush Obili, Daksh Parikh, Samarth Prajapati, Yash Seth, Charan Sridhar, Nathan Zhang, Mark Stamp

    Abstract: Machine learning and deep learning models are potential vectors for various attack scenarios. For example, previous research has shown that malware can be hidden in deep learning models. Hiding information in a learning model can be viewed as a form of steganography. In this research, we consider the general question of the steganographic capacity of learning models. Specifically, for a wide range…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: arXiv admin note: text overlap with arXiv:2306.17189

  11. arXiv:2308.01125  [pdf, other]

    cs.CV cs.RO

    Stereo Visual Odometry with Deep Learning-Based Point and Line Feature Matching using an Attention Graph Neural Network

    Authors: Shenbagaraj Kannapiran, Nalin Bendapudi, Ming-Yuan Yu, Devarth Parikh, Spring Berman, Ankit Vora, Gaurav Pandey

    Abstract: Robust feature matching forms the backbone for most Visual Simultaneous Localization and Mapping (vSLAM), visual odometry, 3D reconstruction, and Structure from Motion (SfM) algorithms. However, recovering feature matches from texture-poor scenes is a major challenge and still remains an open area of research. In this paper, we present a Stereo Visual Odometry (StereoVO) technique based on point a…

    Submitted 2 August, 2023; originally announced August 2023.

  12. arXiv:2305.09662  [pdf, other]

    cs.CV cs.AI

    Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

    Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta

    Abstract: Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: arXiv admin note: text overlap with arXiv:2304.07410

  13. arXiv:2304.07410  [pdf, other]

    cs.CV cs.AI

    Text-Conditional Contextualized Avatars For Zero-Shot Personalization

    Authors: Samaneh Azadi, Thomas Hayes, Akbar Shah, Guan Pang, Devi Parikh, Sonal Gupta

    Abstract: Recent large-scale text-to-image generation models have made significant improvements in the quality, realism, and diversity of the synthesized images and enable users to control the created content through language. However, the personalization aspect of these generative models is still challenging and under-explored. In this work, we propose a pipeline that enables personalization of image gener…

    Submitted 14 April, 2023; originally announced April 2023.

  14. arXiv:2303.04353  [pdf, other]

    cs.MS

    Cascading GEMM: High Precision from Low Precision

    Authors: Devangi N. Parikh, Robert A. van de Geijn, Greg M. Henry

    Abstract: This paper lays out insights and opportunities for implementing higher-precision matrix-matrix multiplication (GEMM) from (in terms of) lower-precision high-performance GEMM. The driving case study approximates double-double precision (FP64x2) GEMM in terms of double precision (FP64) GEMM, leveraging how the BLAS-like Library Instantiation Software (BLIS) framework refactors the Goto Algorithm. Wi…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: 26 pages, 9 figures

    ACM Class: G.4

  15. arXiv:2301.11280  [pdf, other]

    cs.CV cs.AI cs.LG

    Text-To-4D Dynamic Scene Generation

    Authors: Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, Yaniv Taigman

    Abstract: We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera locat…

    Submitted 26 January, 2023; originally announced January 2023.

  16. arXiv:2301.03709  [pdf, other]

    cs.SE cs.LG

    Transfer learning for conflict and duplicate detection in software requirement pairs

    Authors: Garima Malik, Savas Yildirim, Mucahit Cevik, Ayse Bener, Devang Parikh

    Abstract: Consistent and holistic expression of software requirements is important for the success of software projects. In this study, we aim to enhance the efficiency of the software development processes by automatically identifying conflicting and duplicate software requirement specifications. We formulate the conflict and duplicate detection problem as a requirement pair classification task. We design…

    Submitted 9 January, 2023; originally announced January 2023.

  17. arXiv:2301.00495  [pdf, other]

    cs.SE

    Adaptive Fine-tuning for Multiclass Classification over Software Requirement Data

    Authors: Savas Yildirim, Mucahit Cevik, Devang Parikh, Ayse Basar

    Abstract: The analysis of software requirement specifications (SRS) using Natural Language Processing (NLP) methods has been an important study area in the software engineering field in recent years. Especially thanks to the advances brought by deep learning and transfer learning approaches in NLP, SRS data can be utilized for various learning tasks more easily. In this study, we employ a three-stage domain…

    Submitted 1 January, 2023; originally announced January 2023.

  18. GraffMatch: Global Matching of 3D Lines and Planes for Wide Baseline LiDAR Registration

    Authors: Parker C. Lusk, Devarth Parikh, Jonathan P. How

    Abstract: Using geometric landmarks like lines and planes can increase navigation accuracy and decrease map storage requirements compared to commonly-used LiDAR point cloud maps. However, landmark-based registration for applications like loop closure detection is challenging because a reliable initial guess is not available. Global landmark matching has been investigated in the literature, but these methods…

    Submitted 24 December, 2022; originally announced December 2022.

    Comments: accepted to RA-L; 8 pages. arXiv admin note: text overlap with arXiv:2205.08556

    Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 632-639, Feb. 2023

  19. SpaText: Spatio-Textual Representation for Controllable Image Generation

    Authors: Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin

    Abstract: Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText - a new method for text-to-image gen…

    Submitted 19 March, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: CVPR 2023. Project page available at: https://omriavrahami.com/spatext

  20. arXiv:2209.15352  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    AudioGen: Textually Guided Audio Generation

    Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

    Abstract: We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differe…

    Submitted 5 March, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023

  21. arXiv:2209.14792  [pdf, other]

    cs.CV cs.AI cs.LG

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Authors: Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman

    Abstract: We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V…

    Submitted 29 September, 2022; originally announced September 2022.

  22. arXiv:2206.13690  [pdf, other]

    cs.SE

    Supervised Semantic Similarity-based Conflict Detection Algorithm: S3CDA

    Authors: Garima Malik, Mucahit Cevik, Devang Parikh, Ayse Basar

    Abstract: In the realm of software development, the clarity, completeness, and comprehensiveness of requirements significantly impact the success of software systems. The Software Requirement Specification (SRS) document, a cornerstone of the software development life cycle, delineates both functional and nonfunctional requirements, playing a pivotal role in ensuring the quality and timely delivery of softw…

    Submitted 27 March, 2024; v1 submitted 27 June, 2022; originally announced June 2022.

  23. arXiv:2205.01652  [pdf, other]

    cs.CV cs.AI

    Episodic Memory Question Answering

    Authors: Samyak Datta, Sameer Dharur, Vincent Cartillier, Ruta Desai, Mukul Khanna, Dhruv Batra, Devi Parikh

    Abstract: Egocentric augmented reality devices such as wearable glasses passively capture visual data as a human wearer tours a home environment. We envision a scenario wherein the human communicates with an AI agent powering such a device by asking questions (e.g., where did you last see my keys?). In order to succeed at this task, the egocentric AI assistant must (1) construct semantically rich and effici…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: Published at CVPR 2022 (Oral presentation)

  24. arXiv:2204.08058  [pdf, other]

    cs.CV cs.AI

    MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration

    Authors: Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, Devi Parikh

    Abstract: Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich. The narrowness allows bite-sized challenges that the research community can make progress on. The richness ensures we are making progress along the core challenges. To this end, we present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun […

    Submitted 28 April, 2022; v1 submitted 17 April, 2022; originally announced April 2022.

  25. arXiv:2204.03638  [pdf, other]

    cs.CV cs.AI

    Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

    Authors: Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, Devi Parikh

    Abstract: Videos are created to express emotion, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the frames' quality and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a me…

    Submitted 24 September, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: In ECCV 2022

  26. arXiv:2203.13131  [pdf, other]

    cs.CV cs.AI cs.CL cs.GR cs.LG

    Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

    Authors: Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman

    Abstract: Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unanswered, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mech…

    Submitted 24 March, 2022; originally announced March 2022.

  27. arXiv:2110.14810  [pdf, other]

    cs.HC cs.AI

    Telling Creative Stories Using Generative Visual Aids

    Authors: Safinah Ali, Devi Parikh

    Abstract: Can visual artworks created using generative visual algorithms inspire human creativity in storytelling? We asked writers to write creative stories from a starting prompt, and provided them with visuals created by generative AI models from the same prompt. Compared to a control group, writers who used the visuals as story writing aid wrote significantly more creative, original, complete and visual…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Accepted in the Machine Learning for Creativity and Design Workshop at NeurIPS 2021

  28. arXiv:2110.04070  [pdf, other]

    cs.CV cs.LG

    Dataset Structural Index: Leveraging a machine's perspective towards visual data

    Authors: Dishant Parikh

    Abstract: With advances in vision and perception architectures, we have realized that working with data is equally crucial, if not more, than the algorithms. Till today, we have trained machines based on our knowledge and perspective of the world. The entire concept of Dataset Structural Index (DSI) revolves around understanding a machine's perspective of the dataset. With DSI, I show two meta values with wh…

    Submitted 23 January, 2023; v1 submitted 5 October, 2021; originally announced October 2021.

  29. arXiv:2110.02670  [pdf, other]

    cs.CV cs.LG

    S-Extension Patch: A simple and efficient way to extend an object detection model

    Authors: Dishant Parikh

    Abstract: While building convolutional network-based systems, the toll it takes to train the network is something that cannot be ignored. In cases where we need to append additional capabilities to the existing model, the attention immediately goes towards retraining techniques. In this paper, I show how to leverage knowledge about the dataset to append the class faster while maintaining the speed of infere…

    Submitted 20 January, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: Accepted and presented at ICDSMLA 2021. Proceedings to be published with Springer

  30. arXiv:2109.14065  [pdf, other]

    cs.RO

    Localization of a Smart Infrastructure Fisheye Camera in a Prior Map for Autonomous Vehicles

    Authors: Subodh Mishra, Armin Parchami, Enrique Corona, Punarjay Chakravarty, Ankit Vora, Devarth Parikh, Gaurav Pandey

    Abstract: This work presents a technique for localization of a smart infrastructure node, consisting of a fisheye camera, in a prior map. These cameras can detect objects that are outside the line of sight of the autonomous vehicles (AV) and send that information to AVs using V2X technology. However, in order for this information to be of any use to the AV, the detected objects should be provided in the ref…

    Submitted 28 September, 2021; originally announced September 2021.

    Comments: Submitted to ICRA 2022

  31. arXiv:2107.06252  [pdf, other]

    cs.SD cs.MM eess.AS

    Dance2Music: Automatic Dance-driven Music Generation

    Authors: Gunjan Aggarwal, Devi Parikh

    Abstract: Dance and music typically go hand in hand. The complexities in dance, music, and their synchronisation make them fascinating to study from a computational creativity perspective. While several works have looked at generating dance for a given music, automatically generating music for a given dance remains under-explored. This capability could have several creative expression and entertainment appl…

    Submitted 20 July, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

  32. arXiv:2106.14127  [pdf, other]

    cs.CL cs.AI cs.CV

    Visual Conceptual Blending with Large-scale Language and Vision Models

    Authors: Songwei Ge, Devi Parikh

    Abstract: We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations…

    Submitted 26 June, 2021; originally announced June 2021.

  33. arXiv:2106.13901  [pdf, other]

    cs.CY cs.AI

    Building Bridges: Generative Artworks to Explore AI Ethics

    Authors: Ramya Srinivasan, Devi Parikh

    Abstract: In recent years, there has been an increased emphasis on understanding and mitigating adverse impacts of artificial intelligence (AI) technologies on society. Across academia, industry, and government bodies, a variety of endeavours are being pursued towards enhancing AI ethics. A significant challenge in the design of ethical AI systems is that there are multiple stakeholders in the AI pipeline,…

    Submitted 25 June, 2021; originally announced June 2021.

    Journal ref: CVPR Workshop on Ethical Considerations in Creative Applications of Computer Vision, 2022

  34. arXiv:2106.02280  [pdf, other]

    cs.CV cs.CL

    Human-Adversarial Visual Question Answering

    Authors: Sasha Sheng, Amanpreet Singh, Vedanuj Goswami, Jose Alberto Lopez Magana, Wojciech Galuba, Devi Parikh, Douwe Kiela

    Abstract: Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each imag…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: 22 pages, 13 figures. First two authors contributed equally

  35. arXiv:2105.11589  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.RO

    VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator

    Authors: Ayush Shrivastava, Karthik Gopalakrishnan, Yang Liu, Robinson Piramuthu, Gokhan Tür, Devi Parikh, Dilek Hakkani-Tür

    Abstract: Interactive robots navigating photo-realistic environments need to be trained to effectively leverage and handle the dynamic nature of dialogue in addition to the challenges underlying vision-and-language navigation (VLN). In this paper, we present VISITRON, a multi-modal Transformer-based navigator better suited to the interactive regime inherent to Cooperative Vision-and-Dialog Navigation (CVDN)…

    Submitted 15 March, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

    Comments: Accepted at Findings of the Annual Meeting of the Association for Computational Linguistics (ACL) 2022, previous version accepted at Visually Grounded Interaction and Language (ViGIL) Workshop at NAACL 2021

    ACM Class: I.2.9

  36. arXiv:2103.01436  [pdf, other]

    cs.LG

    ForceNet: A Graph Neural Network for Large-Scale Quantum Calculations

    Authors: Weihua Hu, Muhammed Shuaibi, Abhishek Das, Siddharth Goyal, Anuroop Sriram, Jure Leskovec, Devi Parikh, C. Lawrence Zitnick

    Abstract: With massive amounts of atomic simulation data available, there is a huge opportunity to develop fast and accurate machine learning models to approximate expensive physics-based calculations. The key quantity to estimate is atomic forces, where the state-of-the-art Graph Neural Networks (GNNs) explicitly enforce basic physical constraints such as rotation-covariance. However, to strictly satisfy t…

    Submitted 1 March, 2021; originally announced March 2021.

  37. arXiv:2101.12059  [pdf, other]

    cs.CV cs.CL

    VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

    Authors: Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani

    Abstract: We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language s…

    Submitted 29 January, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

    Comments: Work in progress

  38. arXiv:2012.11587  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Object-Centric Diagnosis of Visual Reasoning

    Authors: Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

    Abstract: When answering questions about an image, it not only needs knowing what -- understanding the fine-grained contents (e.g., objects, relationships) in the image, but also telling why -- reasoning over grounding visual cues to derive the answer for a question. Over the last few years, we have seen significant progress on visual question answering. Though impressive as the accuracy grows, it still lag…

    Submitted 21 December, 2020; originally announced December 2020.

  39. arXiv:2012.11014  [pdf, other]

    cs.CV cs.CL

    KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA

    Authors: Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, Marcus Rohrbach

    Abstract: One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image. In this work we study open-domain knowledge, the setting when the knowledge required to answer a question is not given/annotated, neither at training nor test time. We tap into two types of knowledge representations and reasoning. First, implicit knowledge which can…

    Submitted 20 December, 2020; originally announced December 2020.

  40. arXiv:2011.10039  [pdf, other]

    cs.CV cs.AI

    Creative Sketch Generation

    Authors: Songwei Ge, Vedanuj Goswami, C. Lawrence Zitnick, Devi Parikh

    Abstract: Sketching or doodling is a popular creative activity that people engage in. However, most existing work in automatic sketch understanding or generation has focused on sketches that are quite mundane. In this work, we introduce two datasets of creative sketches -- Creative Birds and Creative Creatures -- containing 10k sketches each along with part annotations. We propose DoodlerGAN -- a part-based…

    Submitted 3 March, 2021; v1 submitted 19 November, 2020; originally announced November 2020.

    Comments: Published as a conference paper at ICLR 2021

  41. arXiv:2011.08277  [pdf, other]

    cs.CV cs.CL

    Where Are You? Localization from Embodied Dialog

    Authors: Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

    Abstract: We present Where Are You? (WAY), a dataset of ~6k dialogs in which two humans -- an Observer and a Locator -- complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions…

    Submitted 3 September, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

    Journal ref: EMNLP 2020

  42. arXiv:2011.04728  [pdf, other]

    cs.CV cs.GR cs.LG eess.IV

    Similarity-Based Clustering for Enhancing Image Classification Architectures

    Authors: Dishant Parikh

    Abstract: Convolutional networks are at the center of best-in-class computer vision applications for a wide assortment of undertakings. Since 2014, a profound amount of work began to make better convolutional architectures, yielding generous additions in different benchmarks. Albeit expanded model size and computational cost will, in general, mean prompt quality increases for most undertakings but, the arch…

    Submitted 6 October, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

  43. arXiv:2011.03807  [pdf, other]

    cs.CV cs.CL cs.RO

    Sim-to-Real Transfer for Vision-and-Language Navigation

    Authors: Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee

    Abstract: We study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions. Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation. To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot.…

    Submitted 7 November, 2020; originally announced November 2020.

    Comments: CoRL 2020

  44. arXiv:2010.10038  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency

    Authors: Sameer Dharur, Purva Tendulkar, Dhruv Batra, Devi Parikh, Ramprasaath R. Selvaraju

    Abstract: Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world -- they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower level visual concepts in the image that models ideally should understand to be able to answer the…

    Submitted 30 November, 2020; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: Accepted to the NeurIPS 2020 workshop on Interpretable Inductive Biases and Physically Structured Learning

  45. arXiv:2010.09990  [pdf, other]

    cond-mat.mtrl-sci cs.LG

    The Open Catalyst 2020 (OC20) Dataset and Community Challenges

    Authors: Lowik Chanussot, Abhishek Das, Siddharth Goyal, Thibaut Lavril, Muhammed Shuaibi, Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati, Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, Zachary Ulissi

    Abstract: Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both…

    Submitted 24 September, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

    Comments: 37 pages, 11 figures, submitted to ACS Catalysis

  46. arXiv:2010.09435  [pdf, other]

    cond-mat.mtrl-sci cs.CE cs.LG

    An Introduction to Electrocatalyst Design using Machine Learning for Renewable Energy Storage

    Authors: C. Lawrence Zitnick, Lowik Chanussot, Abhishek Das, Siddharth Goyal, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Thibaut Lavril, Aini Palizhati, Morgane Riviere, Muhammed Shuaibi, Anuroop Sriram, Kevin Tran, Brandon Wood, Junwoong Yoon, Devi Parikh, Zachary Ulissi

    Abstract: Scalable and cost-effective solutions to renewable energy storage are essential to addressing the world's rising energy needs while reducing climate change. As we increase our reliance on renewable energy sources such as wind and solar, which produce intermittent power, storage is needed to transfer power from times of peak generation to peak demand. This may require the storage of power for hours…

    Submitted 14 October, 2020; originally announced October 2020.

    Comments: 27 pages

    ACM Class: I.2.6; J.2

  47. arXiv:2010.06087  [pdf, other]

    cs.CV

    Contrast and Classify: Training Robust VQA Models

    Authors: Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal

    Abstract: Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by m…

    Submitted 18 April, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

  48. Improving the efficiency of spectral features extraction by structuring the audio files

    Authors: Dishant Parikh, Saurabh Sachdev

    Abstract: The extraction of spectral features from a music clip is a computationally expensive task. As in order to extract accurate features, we need to process the clip for its whole length. This preprocessing task creates a large overhead and also makes the extraction process slower. We show how formatting a dataset in a certain way, can help make the process more efficient by eliminating the need for pr…

    Submitted 6 October, 2020; originally announced October 2020.

  49. arXiv:2009.03231  [pdf, other]

    cs.CV cs.AI cs.LG

    Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents

    Authors: Samyak Datta, Oleksandr Maksymets, Judy Hoffman, Stefan Lee, Dhruv Batra, Devi Parikh

    Abstract: Recent work has presented embodied agents that can navigate to point-goal targets in novel indoor environments with near-perfect accuracy. However, these agents are equipped with idealized sensors for localization and take deterministic actions. This setting is practically sterile by comparison to the dirty reality of noisy sensors and actuations in the real world -- wheels can slip, motion sensor…

    Submitted 7 September, 2020; originally announced September 2020.

  50. arXiv:2007.12750  [pdf, other]

    cs.CV cs.AI cs.CL

    Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

    Authors: Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog model…

    Submitted 24 July, 2020; originally announced July 2020.

    Comments: 19 pages, 8 figures