
Showing 1–50 of 52 results for author: Chang, A X

  1. arXiv:2406.12723  [pdf, other]

    cs.LG

    BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

    Authors: Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang

    Abstract: As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by includin…

    Submitted 24 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2406.11579  [pdf, other]

    cs.CV

    Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

    Authors: Han-Hung Lee, Yiming Zhang, Angel X. Chang

    Abstract: We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point-clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirement…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2405.17537  [pdf, other]

    cs.AI cs.CL cs.CV

    BIOSCAN-CLIP: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

    Authors: ZeMing Gong, Austin T. Wang, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

    Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for the taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, DNA barcodes, and textual data in a unified embedding space. This allows for…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 16 pages with 9 figures

  4. arXiv:2405.10255  [pdf, other]

    cs.CV cs.RO

    When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

    Authors: Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

    Abstract: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context lear…

    Submitted 16 May, 2024; originally announced May 2024.

  5. arXiv:2405.05010  [pdf, other]

    cs.CV

    ${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields

    Authors: Ning Wang, Lefei Zhang, Angel X Chang

    Abstract: Neural fields (NeRF) have emerged as a promising approach for representing continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs poses a significant challenge for scene decomposition. To address this challenge, we present a single model, Multi-Modal Decomposition NeRF (${M^2D}$NeRF), that is capable of both text-based and visual patch-based edits. Specifically, we use multi-mo…

    Submitted 8 May, 2024; originally announced May 2024.

  6. arXiv:2403.13289  [pdf, other]

    cs.CV

    Text-to-3D Shape Generation

    Authors: Han-Hung Lee, Manolis Savva, Angel X. Chang

    Abstract: Recent years have seen an explosion of work and interest in text-to-3D shape generation. Much of the progress is driven by advances in 3D representations, large-scale pretraining and representation learning for text and image data enabling generative AI models, and differentiable rendering. Computational systems that can perform text-to-3D shape generation have captivated the popular imagination a…

    Submitted 20 March, 2024; originally announced March 2024.

  7. arXiv:2403.12301  [pdf, other]

    cs.CV

    R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding

    Authors: Qirui Wu, Sonia Raychaudhuri, Daniel Ritchie, Manolis Savva, Angel X Chang

    Abstract: We introduce the Reality-linked 3D Scenes (R3DS) dataset of synthetic 3D scenes mirroring the real-world scene arrangements from Matterport3D panoramas. Compared to prior work, R3DS has more complete and densely populated scenes with objects linked to real-world observations in panoramas. R3DS also provides an object support hierarchy, and matching object sets (e.g., same chairs around a dining ta…

    Submitted 18 March, 2024; originally announced March 2024.

  8. arXiv:2401.00405  [pdf, other]

    cs.CV

    Generalizing Single-View 3D Shape Retrieval to Occlusions and Unseen Objects

    Authors: Qirui Wu, Daniel Ritchie, Manolis Savva, Angel X. Chang

    Abstract: Single-view 3D shape retrieval is a challenging task that is increasingly important with the growth of available 3D data. Prior work that has studied this task has not focused on evaluating how realistic occlusions impact performance, and how shape retrieval methods generalize to scenarios where either the target 3D shape database contains unseen shapes, or the input image contains unseen objects.…

    Submitted 31 December, 2023; originally announced January 2024.

  9. arXiv:2311.02401  [pdf, other]

    cs.LG

    BarcodeBERT: Transformers for Biodiversity Analysis

    Authors: Pablo Millan Arias, Niousha Sadjadi, Monireh Safari, ZeMing Gong, Austin T. Wang, Scott C. Lowe, Joakim Bruslund Haurum, Iuliia Zarubiieva, Dirk Steinke, Lila Kari, Angel X. Chang, Graham W. Taylor

    Abstract: Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across…

    Submitted 4 November, 2023; originally announced November 2023.

    Comments: Main text: 5 pages, Total: 9 pages, 2 figures, accepted at the 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023)

  10. arXiv:2309.05251  [pdf, other]

    cs.CV

    Multi3DRefer: Grounding Text Description to Multiple 3D Objects

    Authors: Yiming Zhang, ZeMing Gong, Angel X. Chang

    Abstract: We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and ob…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  11. arXiv:2307.10455  [pdf, other]

    cs.CV cs.AI cs.LG

    A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset

    Authors: Zahra Gharaee, ZeMing Gong, Nicholas Pellegrino, Iuliia Zarubiieva, Joakim Bruslund Haurum, Scott C. Lowe, Jaclyn T. A. McKeown, Chris C. Y. Ho, Joschka McLeod, Yi-Yun C Wei, Jireh Agda, Sujeevan Ratnasingham, Dirk Steinke, Angel X. Chang, Graham W. Taylor, Paul Fieguth

    Abstract: In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a c…

    Submitted 13 November, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

  12. arXiv:2306.11290  [pdf, other]

    cs.CV

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Authors: Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, Manolis Savva

    Abstract: We contribute the Habitat Synthetic Scene Dataset, a dataset of 211 high-quality 3D scenes, and use it to test navigation agent generalization to realistic 3D environments. Our dataset represents real interiors and contains a diverse set of 18,656 models of real-world objects. We investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find…

    Submitted 7 December, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

  13. arXiv:2305.18557  [pdf, other]

    cs.CV

    Evaluating 3D Shape Analysis Methods for Robustness to Rotation Invariance

    Authors: Supriya Gadi Patil, Angel X. Chang, Manolis Savva

    Abstract: This paper analyzes the robustness of recent 3D shape descriptors to SO(3) rotations, something that is fundamental to shape modeling. Specifically, we formulate the task of rotated 3D object instance detection. To do so, we consider a database of 3D indoor scenes, where objects occur in different orientations. We benchmark different methods for feature extraction and classification in the context…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 20th Conference on Robots and Vision (CRV) 2023

  14. arXiv:2304.03696  [pdf, other]

    cs.RO cs.CV

    MOPA: Modular Object Navigation with PointGoal Agents

    Authors: Sonia Raychaudhuri, Tommaso Campari, Unnat Jain, Manolis Savva, Angel X. Chang

    Abstract: We propose a simple but effective modular approach MOPA (Modular ObjectNav with PointGoal agents) to systematically investigate the inherent modularity of the object navigation task in Embodied AI. MOPA consists of four modules: (a) an object detection module trained to identify objects from RGB images, (b) a map building module to build a semantic map of the observed objects, (c) an exploration m…

    Submitted 27 January, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

  15. arXiv:2303.14087  [pdf, other]

    cs.CV

    OPDMulti: Openable Part Detection for Multiple Objects

    Authors: Xiaohao Sun, Hanxiao Jiang, Manolis Savva, Angel Xuan Chang

    Abstract: Openable part detection is the task of detecting the openable parts of an object in a single-view image, and predicting corresponding motion parameters. Prior work investigated the unrealistic setting where all input images only contain a single openable object. We generalize this task to scenes with multiple objects each potentially possessing openable parts, and create a corresponding dataset ba…

    Submitted 24 March, 2023; originally announced March 2023.

  16. arXiv:2212.00836  [pdf, other]

    cs.CV

    UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding

    Authors: Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang

    Abstract: Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts on connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly depict their shared nature to learn them simultaneously. In this work, we propose UniT3D, a simple…

    Submitted 1 December, 2022; originally announced December 2022.

  17. arXiv:2212.00767  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    Exploiting Proximity-Aware Tasks for Embodied Social Navigation

    Authors: Enrico Cancelli, Tommaso Campari, Luciano Serafini, Angel X. Chang, Lamberto Ballan

    Abstract: Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability required for embodied agents to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Proximity-Aware Tasks (referred to as Risk and Proximity Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sen…

    Submitted 10 March, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

  18. arXiv:2210.06849  [pdf, other]

    cs.CV

    Retrospectives on the Embodied AI Workshop

    Authors: Matt Deitke, Dhruv Batra, Yonatan Bisk, Tommaso Campari, Angel X. Chang, Devendra Singh Chaplot, Changan Chen, Claudia Pérez D'Arpino, Kiana Ehsani, Ali Farhadi, Li Fei-Fei, Anthony Francis, Chuang Gan, Kristen Grauman, David Hall, Winson Han, Unnat Jain, Aniruddha Kembhavi, Jacob Krantz, Stefan Lee, Chengshu Li, Sagnik Majumder, Oleksandr Maksymets, Roberto Martín-Martín, Roozbeh Mottaghi, et al. (14 additional authors not shown)

    Abstract: We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of…

    Submitted 4 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

  19. arXiv:2210.05633  [pdf, other]

    cs.CV

    Habitat-Matterport 3D Semantics Dataset

    Authors: Karmesh Yadav, Ram Ramrakhya, Santhosh Kumar Ramakrishnan, Theo Gervet, John Turner, Aaron Gokaslan, Noah Maestre, Angel Xuan Chang, Dhruv Batra, Manolis Savva, Alexander William Clegg, Devendra Singh Chaplot

    Abstract: We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior…

    Submitted 12 October, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: 15 Pages, 11 Figures, 6 Tables

  20. arXiv:2209.15172  [pdf, other]

    cs.CV cs.GR cs.LG

    Understanding Pure CLIP Guidance for Voxel Grid NeRF Models

    Authors: Han-Hung Lee, Angel X. Chang

    Abstract: We explore the task of text to 3D object generation using CLIP. Specifically, we use CLIP for guidance without access to any datasets, a setting we refer to as pure CLIP guidance. While prior work has adopted this setting, there is no systematic study of mechanics for preventing adversarial generations within CLIP. We illustrate how different image-based augmentations prevent the adversarial gener…

    Submitted 29 September, 2022; originally announced September 2022.

  21. arXiv:2209.05612  [pdf, other]

    cs.CV

    Articulated 3D Human-Object Interactions from RGB Videos: An Empirical Analysis of Approaches and Challenges

    Authors: Sanjay Haresh, Xiaohao Sun, Hanxiao Jiang, Angel X. Chang, Manolis Savva

    Abstract: Human-object interactions with articulated objects are common in everyday life. Despite much progress in single-view 3D reconstruction, it is still challenging to infer an articulated 3D object model from an RGB video showing a person manipulating the object. We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video, and carry out a systematic benchmark of f…

    Submitted 12 September, 2022; originally announced September 2022.

    Comments: 3DV 2022

  22. arXiv:2205.13675  [pdf, other]

    cs.AR cs.AI cs.LG

    Reinforcement Learning Approach for Mapping Applications to Dataflow-Based Coarse-Grained Reconfigurable Array

    Authors: Andre Xian Ming Chang, Parth Khopkar, Bashar Romanous, Abhishek Chaurasia, Patrick Estep, Skyler Windh, Doug Vanesko, Sheik Dawood Beer Mohideen, Eugenio Culurciello

    Abstract: The Streaming Engine (SE) is a Coarse-Grained Reconfigurable Array which provides programming flexibility and high-performance with energy efficiency. An application program to be executed on the SE is represented as a combination of Synchronous Data Flow (SDF) graphs, where every instruction is represented as a node. Each node needs to be mapped to the right slot and array in the SE to ensure the…

    Submitted 26 May, 2022; originally announced May 2022.

    Comments: 10 pages, 12 figures

  23. arXiv:2203.16421  [pdf, other]

    cs.CV

    OPD: Single-view 3D Openable Part Detection

    Authors: Hanxiao Jiang, Yongsen Mao, Manolis Savva, Angel X. Chang

    Abstract: We address the task of predicting what parts of an object can open and how they move when they do so. The input is a single image of an object, and as output we detect what parts of the object can open, and the motion parameters describing the articulation of each openable part. To tackle this task, we create two datasets of 3D objects: OPDSynth based on existing synthetic objects, and OPDReal bas…

    Submitted 30 March, 2022; originally announced March 2022.

  24. arXiv:2201.07366  [pdf, other]

    cs.CV

    TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval

    Authors: Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X. Chang

    Abstract: Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data. Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at tasks such as retrieval and classification. Thus far, work on joint representation learning for 3D shapes and text has focused on improving embeddings through modeling of complex attention between r…

    Submitted 27 December, 2023; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: Accepted by WACV 2024

  25. Roominoes: Generating Novel 3D Floor Plans From Existing 3D Rooms

    Authors: Kai Wang, Xianghao Xu, Leon Lei, Selena Ling, Natalie Lindsay, Angel X. Chang, Manolis Savva, Daniel Ritchie

    Abstract: Realistic 3D indoor scene datasets have enabled significant recent progress in computer vision, scene understanding, autonomous navigation, and 3D reconstruction. But the scale, diversity, and customizability of existing datasets is limited, and it is time-consuming and expensive to scan and annotate more. Fortunately, combinatorics is on our side: there are enough individual rooms in existing 3D…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Symposium on Geometry Processing (SGP) 2021

    Journal ref: Computer Graphics Forum, 40: 57-69 (2021)

  26. arXiv:2112.01551  [pdf, other]

    cs.CV

    D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding

    Authors: Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang

    Abstract: Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Also, how to discriminatively describe objects in complex 3D environments is not fully studied yet. To address these challenges,…

    Submitted 22 July, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: Project website: https://daveredrum.github.io/D3Net/

  27. arXiv:2110.05769  [pdf, other]

    cs.CV cs.AI cs.LG cs.MA

    Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents

    Authors: Shivansh Patel, Saim Wani, Unnat Jain, Alexander Schwing, Svetlana Lazebnik, Manolis Savva, Angel X. Chang

    Abstract: Communication between embodied AI agents has received increasing attention in recent years. Despite its use, it is still unclear whether the learned communication is interpretable and grounded in perception. To study the grounding of emergent forms of communication, we first introduce the collaborative multi-object navigation task CoMON. In this task, an oracle agent has detailed environment infor…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Project page: https://shivanshpatel35.github.io/comon/ ; the first three authors contributed equally

  28. arXiv:2109.15207  [pdf, other]

    cs.CV cs.CL cs.RO

    Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments

    Authors: Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang

    Abstract: In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle 'off the path' scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent's location to the goal, but such goal-oriented supervision is ofte…

    Submitted 30 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  29. arXiv:2109.08238  [pdf, other]

    cs.CV cs.AI

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Authors: Santhosh K. Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, Dhruv Batra

    Abstract: We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of 1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each scene in the dataset consists of a textured 3D mesh reconstruction of interiors such as multi-floor residences, stores, and other private indoor spaces. HM3D surpasses existing datasets available for academic research in te…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: 21 pages, 14 figures

  30. arXiv:2106.06629  [pdf, other]

    cs.CV

    Mirror3D: Depth Refinement for Mirror Surfaces

    Authors: Jiaqi Tan, Weijie Lin, Angel X. Chang, Manolis Savva

    Abstract: Despite recent progress in depth sensing and 3D reconstruction, mirror surfaces are a significant source of errors. To address this problem, we create the Mirror3D dataset: a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2 and ScanNet) containing 7,011 mirror instance masks and 3D planes. We then develop Mirror3DNet: a module that refines raw sensor depth or estimated dep…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: Paper presented at CVPR 2021. For code, data and pretrained models, see https://3dlg-hcvc.github.io/mirror3d/

  31. arXiv:2106.05375  [pdf, other]

    cs.CV cs.GR

    Plan2Scene: Converting Floorplans to 3D Scenes

    Authors: Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X. Chang, Manolis Savva

    Abstract: We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene. Our system 1) lifts a floorplan image to a 3D mesh model; 2) synthesizes surface textures based on the input photos; and 3) infers textures for unobserved surfaces using a graph neural network architecture. To train and evaluate our system we c…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: This paper is accepted to CVPR 2021. For code, data and pretrained models, see https://3dlg-hcvc.github.io/plan2scene/

  32. arXiv:2012.03912  [pdf, other]

    cs.CV cs.AI cs.LG cs.RO

    MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation

    Authors: Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang, Manolis Savva

    Abstract: Navigation tasks in photorealistic 3D environments are challenging because they require perception and effective planning under partial observability. Recent work shows that map-like memory is useful for long-horizon navigation tasks. However, a focused investigation of the impact of maps on navigation tasks of varying complexity has not yet been performed. We propose the multiON task, which requi…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: Project page: https://shivanshpatel35.github.io/multi-ON/ ; the first three authors contributed equally

  33. arXiv:2012.02206  [pdf, other]

    cs.CV cs.LG eess.IV

    Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

    Authors: Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

    Abstract: We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in…

    Submitted 3 December, 2020; originally announced December 2020.

    Comments: Video: https://youtu.be/AgmIpDbwTCY

  34. arXiv:2011.01975  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    Rearrangement: A Challenge for Embodied AI

    Authors: Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, Hao Su

    Abstract: We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specifie…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Authors are listed in alphabetical order

  35. arXiv:2003.08515  [pdf, other]

    cs.CV cs.RO

    SAPIEN: A SimulAted Part-based Interactive ENvironment

    Authors: Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, Hao Su

    Abstract: Building home assistant robots has long been a pursuit for vision and robotics researchers. To achieve this task, a simulated environment with physically realistic simulation, sufficient articulated objects, and transferability to the real robot is indispensable. Existing environments achieve these requirements for robotics simulation with different levels of simplification and focus. We take one…

    Submitted 18 March, 2020; originally announced March 2020.

  36. arXiv:1912.08830  [pdf, other]

    cs.CV cs.CL cs.LG eess.IV

    ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language

    Authors: Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner

    Abstract: We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language express…

    Submitted 11 November, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: Project page: https://daveredrum.github.io/ScanRefer/

  37. arXiv:1903.03757  [pdf, other]

    cs.CV

    Hierarchy Denoising Recursive Autoencoders for 3D Scene Layout Prediction

    Authors: Yifei Shi, Angel Xuan Chang, Zhelun Wu, Manolis Savva, Kai Xu

    Abstract: Indoor scenes exhibit rich hierarchical structure in 3D object layouts. Many tasks in 3D scene understanding can benefit from reasoning jointly about the hierarchical context of a scene, and the identities of objects. We present a variational denoising recursive autoencoder (VDRAE) that generates and iteratively refines a hierarchical representation of 3D object layouts, interleaving bottom-up enc…

    Submitted 10 April, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

    Comments: CVPR 2019

  38. arXiv:1812.02713  [pdf, other]

    cs.CV

    PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding

    Authors: Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, Hao Su

    Abstract: We present PartNet: a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our dataset consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This dataset enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, a…

    Submitted 6 December, 2018; originally announced December 2018.

  39. arXiv:1811.11187  [pdf, other]

    cs.CV

    Scan2CAD: Learning CAD Model Alignment in RGB-D Scans

    Authors: Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, Matthias Nießner

    Abstract: We present Scan2CAD, a novel data-driven method that learns to align clean 3D CAD models from a shape database to the noisy and incomplete geometry of a commodity RGB-D scan. For a 3D reconstruction of an indoor scene, our method takes as input a set of CAD models, and predicts a 9DoF pose that aligns each model to the underlying scan geometry. To tackle this problem, we create a new scan-to-CAD a…

    Submitted 27 November, 2018; originally announced November 2018.

    Comments: Video: https://youtu.be/PiHSYpgLTfA

  40. arXiv:1803.08495  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings

    Authors: Kevin Chen, Christopher B. Choy, Manolis Savva, Angel X. Chang, Thomas Funkhouser, Silvio Savarese

    Abstract: We present a method for generating colored 3D shapes from natural language. To this end, we first learn joint embeddings of freeform text descriptions and colored 3D shapes. Our model combines and extends learning by association and metric learning approaches to learn implicit cross-modal connections, and produces a joint representation that captures the many-to-many relations between language and…

    Submitted 22 March, 2018; originally announced March 2018.

  41. arXiv:1712.04569  [pdf, other]

    cs.CV

    Im2Pano3D: Extrapolating 360 Structure and Semantics Beyond the Field of View

    Authors: Shuran Song, Andy Zeng, Angel X. Chang, Manolis Savva, Silvio Savarese, Thomas Funkhouser

    Abstract: We present Im2Pano3D, a convolutional neural network that generates a dense prediction of 3D structure and a probability distribution of semantic labels for a full 360 panoramic view of an indoor scene when given only a partial observation (<= 50%) in the form of an RGB-D image. To make this possible, Im2Pano3D leverages strong contextual priors learned from large-scale synthetic and real-world in…

    Submitted 12 December, 2017; originally announced December 2017.

    Comments: Video summary: https://youtu.be/Au3GmktK-So

  42. arXiv:1712.03931

    cs.LG cs.AI cs.CV cs.GR cs.RO

    MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments

    Authors: Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, Vladlen Koltun

    Abstract: We present MINOS, a simulator designed to support the development of multisensory models for goal-directed navigation in complex indoor environments. The simulator leverages large datasets of complex 3D environments and supports flexible configuration of multimodal sensor suites. We use MINOS to benchmark deep-learning-based navigation methods, to analyze the influence of environmental complexity…

    Submitted 11 December, 2017; originally announced December 2017.

    Comments: MINOS is a simulator designed to support research on end-to-end navigation

  43. arXiv:1708.02579

    cs.AR

    Snowflake: A Model Agnostic Accelerator for Deep Convolutional Neural Networks

    Authors: Vinayak Gokhale, Aliasger Zaidy, Andre Xian Ming Chang, Eugenio Culurciello

    Abstract: Deep convolutional neural networks (CNNs) are the deep learning model of choice for performing object detection, classification, semantic segmentation and natural language processing tasks. CNNs require billions of operations to process a frame. This computational complexity, combined with the inherent parallelism of the convolution operation, makes CNNs an excellent target for custom accelerators.…

    Submitted 8 August, 2017; originally announced August 2017.

  44. arXiv:1708.00117

    cs.DC cs.LG

    Compiling Deep Learning Models for Custom Hardware Accelerators

    Authors: Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, Eugenio Culurciello

    Abstract: Convolutional neural networks (CNNs) are the core of most state-of-the-art deep learning algorithms specialized for object detection and classification. CNNs are both computationally complex and embarrassingly parallel, two properties that leave room for potential software and hardware optimizations for embedded systems. Given a programmable hardware accelerator with a CNN oriented custom instruct…

    Submitted 10 December, 2017; v1 submitted 31 July, 2017; originally announced August 2017.

    Comments: 8 pages

  45. arXiv:1704.02393

    cs.CV

    Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes

    Authors: Kyle Genova, Manolis Savva, Angel X. Chang, Thomas Funkhouser

    Abstract: The use of rendered images, whether from completely synthetic datasets or from 3D reconstructions, is increasingly prevalent in vision tasks. However, little attention has been given to how the selection of viewpoints affects the performance of rendered training sets. In this paper, we propose a data-driven approach to view set selection. Given a set of example images, we extract statistics descri…

    Submitted 7 April, 2017; originally announced April 2017.

    Comments: ICCV submission, combined main paper and supplemental material

  46. arXiv:1703.00061

    cs.GR cs.HC

    SceneSuggest: Context-driven 3D Scene Design

    Authors: Manolis Savva, Angel X. Chang, Maneesh Agrawala

    Abstract: We present SceneSuggest: an interactive 3D scene design system providing context-driven suggestions for 3D model retrieval and placement. Using a point-and-click metaphor we specify regions in a scene in which to automatically place and orient relevant 3D models. Candidate models are ranked using a set of static support, position, and orientation priors learned from 3D scenes. We show that our sug…

    Submitted 28 February, 2017; originally announced March 2017.

  47. arXiv:1703.00050

    cs.GR cs.CL cs.HC

    SceneSeer: 3D Scene Design with Natural Language

    Authors: Angel X. Chang, Mihail Eric, Manolis Savva, Christopher D. Manning

    Abstract: Designing 3D scenes is currently a creative task that requires significant expertise and effort in using complex 3D design interfaces. This effortful design process stands in stark contrast to the ease with which people can use language to describe real and imaginary environments. We present SceneSeer: an interactive text to 3D scene generation system that allows a user to design 3D scenes usi…

    Submitted 28 February, 2017; originally announced March 2017.

  48. arXiv:1702.04405

    cs.CV

    ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes

    Authors: Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Nießner

    Abstract: A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available -- current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scen…

    Submitted 11 April, 2017; v1 submitted 14 February, 2017; originally announced February 2017.

  49. arXiv:1611.08974

    cs.CV

    Semantic Scene Completion from a Single Depth Image

    Authors: Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, Thomas Funkhouser

    Abstract: This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of…

    Submitted 27 November, 2016; originally announced November 2016.

  50. arXiv:1603.04767

    cs.CL

    Evaluating the word-expert approach for Named-Entity Disambiguation

    Authors: Angel X. Chang, Valentin I. Spitkovsky, Christopher D. Manning, Eneko Agirre

    Abstract: Named Entity Disambiguation (NED) is the task of linking a named-entity mention to an instance in a knowledge-base, typically Wikipedia. This task is closely related to word-sense disambiguation (WSD), where the supervised word-expert approach has prevailed. In this work we present the results of the word-expert approach to NED, where one classifier is built for each target entity mention string.…

    Submitted 15 March, 2016; originally announced March 2016.