
Showing 1–32 of 32 results for author: Gkioxari, G

  1. arXiv:2404.06507

    cs.CV

    Reconstructing Hand-Held Objects in 3D

    Authors: Jane Wu, Georgios Pavlakos, Georgia Gkioxari, Jitendra Malik

    Abstract: Objects manipulated by the hand (i.e., manipulanda) are particularly challenging to reconstruct from in-the-wild RGB images or videos. Not only does the hand occlude much of the object, but also the object is often only visible in a small number of image pixels. At the same time, two strong anchors emerge in this setting: (1) estimated 3D hands help disambiguate the location and scale of the objec…

    Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: Project page: https://janehwu.github.io/mcc-ho

  2. arXiv:2403.08997

    cs.CV cs.RO

    CART: Caltech Aerial RGB-Thermal Dataset in the Wild

    Authors: Connor Lee, Matthew Anderson, Nikhil Raganathan, Xingxing Zuo, Kevin Do, Georgia Gkioxari, Soon-Jo Chung

    Abstract: We present the first publicly available RGB-thermal dataset designed for aerial robotics operating in natural environments. Our dataset captures a variety of terrains across the continental United States, including rivers, lakes, coastlines, deserts, and forests, and consists of synchronized RGB, long-wave thermal, global positioning, and inertial data. Furthermore, we provide semantic segmentatio…

    Submitted 13 March, 2024; originally announced March 2024.

  3. arXiv:2402.16412

    cs.LG

    TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis

    Authors: Sabera Talukder, Yisong Yue, Georgia Gkioxari

    Abstract: The field of general time series analysis has recently begun to explore unified modeling, where a common architectural backbone can be retrained on a specific task for a specific dataset. In this work, we approach unification from a complementary vantage point: unification across tasks and domains. To this end, we explore the impact of discrete, learnt, time series data representations that enable…

    Submitted 26 February, 2024; originally announced February 2024.

  4. arXiv:2310.01401

    cs.CV

    Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection

    Authors: Yiming Xie, Huaizu Jiang, Georgia Gkioxari, Julian Straub

    Abstract: We present PARQ - a multi-view 3D object detector with transformer and pixel-aligned recurrent queries. Unlike previous works that use learnable features or only encode 3D point positions as queries in the decoder, PARQ leverages appearance-enhanced queries initialized from reference points in 3D space and updates their 3D location with recurrent cross-attention operations. Incorporating pixel-ali…

    Submitted 2 October, 2023; originally announced October 2023.

    Comments: ICCV 2023. Project page: https://ymingxie.github.io/parq

  5. arXiv:2307.05663

    cs.CV cs.AI

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Authors: Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, Ali Farhadi

    Abstract: Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects…

    Submitted 11 July, 2023; originally announced July 2023.

  6. arXiv:2301.08247

    cs.CV

    Multiview Compressive Coding for 3D Reconstruction

    Authors: Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, Georgia Gkioxari

    Abstract: A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. Comparatively, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models an…

    Submitted 19 January, 2023; originally announced January 2023.

    Comments: Project page: https://mcc3d.github.io/

  7. arXiv:2212.07401

    cs.CV cs.AI

    BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos

    Authors: Jennifer J. Sun, Lili Karashchuk, Amil Dravid, Serim Ryou, Sonia Fereidooni, John Tuthill, Aggelos Katsaggelos, Bingni W. Brunton, Georgia Gkioxari, Ann Kennedy, Yisong Yue, Pietro Perona

    Abstract: Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. However, current keypoint discovery approaches commonly process single 2D views and do not operate in the 3D space. We propose a ne…

    Submitted 2 June, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: CVPR 2023. Project page: https://sites.google.com/view/b-kind/3d Code: https://github.com/neuroethology/BKinD-3D

  8. arXiv:2207.10660

    cs.CV

    Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild

    Authors: Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, Georgia Gkioxari

    Abstract: Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. In 3D, existing benchmarks are small in size and approaches specialize in few object categories and specific domains, e.g. urban driving scenes. Motivated by the succ…

    Submitted 23 March, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

    Comments: CVPR 2023, Project website: https://omni3d.garrickbrazil.com/

  9. arXiv:2206.07028

    cs.CV

    Learning 3D Object Shape and Layout without 3D Supervision

    Authors: Georgia Gkioxari, Nikhila Ravi, Justin Johnson

    Abstract: A 3D scene consists of a set of objects, each with a shape and a layout giving their position in space. Understanding 3D scenes from 2D images is an important goal, with applications in robotics and graphics. While there have been recent advances in predicting 3D shape and layout from a single image, most approaches rely on 3D ground truth for training which is expensive to collect at scale. We ov…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: CVPR 2022, project page: https://gkioxari.github.io/usl/

  10. arXiv:2112.01520

    cs.CV

    Recognizing Scenes from Novel Viewpoints

    Authors: Shengyi Qian, Alexander Kirillov, Nikhila Ravi, Devendra Singh Chaplot, Justin Johnson, David F. Fouhey, Georgia Gkioxari

    Abstract: Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to efficiently interact with the scene and its objects. In this work, we attempt to endow machines with this ability. We propose a model which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoint…

    Submitted 2 December, 2021; originally announced December 2021.

  11. arXiv:2110.05472

    cs.CV

    Differentiable Stereopsis: Meshes from multiple views using differentiable rendering

    Authors: Shubham Goel, Georgia Gkioxari, Jitendra Malik

    Abstract: We propose Differentiable Stereopsis, a multi-view stereo approach that reconstructs shape and texture from few input views and noisy cameras. We pair traditional stereopsis and modern differentiable rendering to build an end-to-end model which predicts textured 3D meshes of objects with varying topologies and shape. We frame stereopsis as an optimization problem and simultaneously update shape an…

    Submitted 23 September, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: In CVPR 2022. Project webpage: https://shubham-goel.github.io/ds/

    Journal ref: In CVPR 2022 (pp. 8635-8644)

  12. arXiv:2102.02896

    cs.CV cs.LG eess.IV

    Compressed Object Detection

    Authors: Gedeon Muhawenayo, Georgia Gkioxari

    Abstract: Deep learning approaches have achieved unprecedented performance in visual recognition tasks such as object detection and pose estimation. However, state-of-the-art models have millions of parameters represented as floats which make them computationally expensive and constrain their deployment on hardware such as mobile phones and IoT nodes. Most commonly, activations of deep neural networks tend…

    Submitted 4 February, 2021; originally announced February 2021.

  13. arXiv:2007.08501

    cs.CV cs.GR cs.LG

    Accelerating 3D Deep Learning with PyTorch3D

    Authors: Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, Georgia Gkioxari

    Abstract: Deep learning has significantly improved 2D image recognition. Extending into 3D may advance many new applications including autonomous vehicles, virtual and augmented reality, authoring 3D content, and even improving 2D recognition. However despite growing interest, 3D deep learning remains relatively underexplored. We believe that some of this disparity is due to the engineering challenges invol…

    Submitted 16 July, 2020; originally announced July 2020.

    Comments: tech report

  14. arXiv:2007.03778

    cs.CV cs.RO

    3D Shape Reconstruction from Vision and Touch

    Authors: Edward J. Smith, Roberto Calandra, Adriana Romero, Georgia Gkioxari, David Meger, Jitendra Malik, Michal Drozdzal

    Abstract: When a toddler is presented a new toy, their instinctual behaviour is to pick it up and inspect it with their hand and eyes in tandem, clearly searching over its surface to properly understand what they are playing with. At any instance here, touch provides high fidelity localized information while vision provides complementary global context. However, in 3D shape reconstruction, the complementary…

    Submitted 2 November, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

    Comments: Accepted at NeurIPS 2020

  15. arXiv:1912.08804

    cs.CV

    SynSin: End-to-end View Synthesis from a Single Image

    Authors: Olivia Wiles, Georgia Gkioxari, Richard Szeliski, Justin Johnson

    Abstract: Single image view synthesis allows for the generation of new views of a scene given a single input image. This is challenging, as it requires comprehensively understanding the 3D scene from a single image. As a result, current methods typically use multiple images, train on ground-truth depth, or are limited to synthetic data. We propose a novel end-to-end model for this task; it is trained on rea…

    Submitted 18 April, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

    Comments: Project page: www.robots.ox.ac.uk/~ow/synsin.html

  16. arXiv:1909.04306

    cs.CV cs.LG cs.RO

    Bayesian Relational Memory for Semantic Visual Navigation

    Authors: Yi Wu, Yuxin Wu, Aviv Tamar, Stuart Russell, Georgia Gkioxari, Yuandong Tian

    Abstract: We introduce a new memory architecture, Bayesian Relational Memory (BRM), to improve the generalization ability for semantic visual navigation agents in unseen environments, where an agent is given a semantic target to navigate towards. BRM takes the form of a probabilistic relation graph over semantic entities (e.g., room types), which allows (1) capturing the layout prior from training environme…

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: Accepted at ICCV 2019

  17. arXiv:1906.02739

    cs.CV

    Mesh R-CNN

    Authors: Georgia Gkioxari, Jitendra Malik, Justin Johnson

    Abstract: Rapid advances in 2D perception have led to systems that accurately detect objects in real-world images. However, these systems make predictions in 2D, ignoring the 3D structure of the world. Concurrently, advances in 3D shape prediction have mostly focused on synthetic benchmarks and isolated objects. We unify advances in these two areas. We propose a system that detects objects in real-world ima…

    Submitted 25 January, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

  18. arXiv:1904.04686

    cs.CV

    Multi-Target Embodied Question Answering

    Authors: Licheng Yu, Xinlei Chen, Georgia Gkioxari, Mohit Bansal, Tamara L. Berg, Dhruv Batra

    Abstract: Embodied Question Answering (EQA) is a relatively new task where an agent is asked to answer questions about its environment from egocentric perception. EQA makes the fundamental assumption that every question, e.g., "what color is the car?", has exactly one target ("car") being inquired about. This assumption puts a direct limitation on the abilities of the agent. We present a generalization of E…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: 10 pages, 6 figures

  19. arXiv:1904.03461

    cs.CV cs.AI cs.CL cs.LG

    Embodied Question Answering in Photorealistic Environments with Point Cloud Perception

    Authors: Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, Dhruv Batra

    Abstract: To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception we instantiate a large-scale navigation task -- Embodied Question Answering [1] in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings.…

    Submitted 6 April, 2019; originally announced April 2019.

  20. arXiv:1810.11181

    cs.AI cs.CL cs.CV cs.LG

    Neural Modular Control for Embodied Question Answering

    Authors: Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: We present a modular approach for learning policies for navigation over long planning horizons from language input. Our hierarchical policy operates at multiple timescales, where the higher-level master policy proposes subgoals to be executed by specialized sub-policies. Our choice of subgoals is compositional and semantic, i.e. they can be sequentially combined in arbitrary orderings, and assume…

    Submitted 2 May, 2019; v1 submitted 25 October, 2018; originally announced October 2018.

    Comments: 10 pages, 3 figures, 2 tables. Published at CoRL 2018. Webpage: https://embodiedqa.org/

  21. arXiv:1809.10842

    cs.LG cs.AI stat.ML

    Learning and Planning with a Semantic Model

    Authors: Yi Wu, Yuxin Wu, Aviv Tamar, Stuart Russell, Georgia Gkioxari, Yuandong Tian

    Abstract: Building deep reinforcement learning agents that can generalize and adapt to unseen environments remains a fundamental challenge for AI. This paper describes progresses on this challenge in the context of man-made environments, which are visually diverse but contain intrinsic semantic regularities. We propose a hybrid model-based and model-free approach, LEArning and Planning with Semantics (LEAPS…

    Submitted 27 September, 2018; originally announced September 2018.

    Comments: submitted to ICLR 2019

  22. arXiv:1801.02209

    cs.LG cs.AI

    Building Generalizable Agents with a Realistic and Rich 3D Environment

    Authors: Yi Wu, Yuxin Wu, Georgia Gkioxari, Yuandong Tian

    Abstract: Teaching an agent to navigate in an unseen 3D environment is a challenging task, even in the event of simulated environments. To generalize to unseen environments, an agent needs to be robust to low-level variations (e.g. color, texture, object changes), and also high-level variations (e.g. layout changes of the environment). To improve overall generalization, all types of variations in the enviro…

    Submitted 8 April, 2018; v1 submitted 7 January, 2018; originally announced January 2018.

    Comments: updated with improved content and more experiments

  23. arXiv:1712.09184

    cs.CV

    Detect-and-Track: Efficient Pose Estimation in Videos

    Authors: Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

    Abstract: This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two-stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint p…

    Submitted 2 May, 2018; v1 submitted 26 December, 2017; originally announced December 2017.

    Comments: In CVPR 2018. Ranked first in ICCV 2017 PoseTrack challenge (keypoint tracking in videos). Code: https://github.com/facebookresearch/DetectAndTrack and webpage: https://rohitgirdhar.github.io/DetectAndTrack/

  24. arXiv:1712.04440

    cs.CV

    Data Distillation: Towards Omni-Supervised Learning

    Authors: Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, Kaiming He

    Abstract: We investigate omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we pro…

    Submitted 12 December, 2017; originally announced December 2017.

    Comments: tech report

  25. arXiv:1711.11543

    cs.CV cs.AI cs.CL cs.LG

    Embodied Question Answering

    Authors: Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

    Abstract: We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ("orange"). This challenging…

    Submitted 1 December, 2017; v1 submitted 30 November, 2017; originally announced November 2017.

    Comments: 20 pages, 13 figures, Webpage: https://embodiedqa.org/

  26. arXiv:1704.07333

    cs.CV

    Detecting and Recognizing Human-Object Interactions

    Authors: Georgia Gkioxari, Ross Girshick, Piotr Dollár, Kaiming He

    Abstract: To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting <human, verb, object> triplets in challenging everyday photos. We propose a novel model…

    Submitted 26 March, 2018; v1 submitted 24 April, 2017; originally announced April 2017.

  27. arXiv:1703.06870

    cs.CV

    Mask R-CNN

    Authors: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

    Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognit…

    Submitted 24 January, 2018; v1 submitted 20 March, 2017; originally announced March 2017.

    Comments: open source; appendix on more results

  28. arXiv:1605.02346

    cs.CV

    Chained Predictions Using Convolutional Neural Networks

    Authors: Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

    Abstract: In this paper, we present an adaptation of the sequence-to-sequence model for structured output prediction in vision tasks. In this model the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tas…

    Submitted 23 October, 2016; v1 submitted 8 May, 2016; originally announced May 2016.

    Comments: in submission to ECCV 2016

  29. arXiv:1505.01197

    cs.CV

    Contextual Action Recognition with R*CNN

    Authors: Georgia Gkioxari, Ross Girshick, Jitendra Malik

    Abstract: There are multiple cues in an image which reveal what action a person is performing. For example, a jogger has a pose that is characteristic for jogging, but the scene (e.g. road, trail) and the presence of other joggers can be an additional source of information. In this work, we exploit the simple observation that actions are accompanied by contextual cues to build a strong action recognition sy…

    Submitted 24 March, 2016; v1 submitted 5 May, 2015; originally announced May 2015.

  30. arXiv:1412.2604

    cs.CV

    Actions and Attributes from Wholes and Parts

    Authors: Georgia Gkioxari, Ross Girshick, Jitendra Malik

    Abstract: We investigate the importance of parts for the tasks of action and attribute classification. We develop a part-based approach by leveraging convolutional network features inspired by recent advances in computer vision. Our part detectors are a deep version of poselets and capture parts of the human body under a distinct set of poses. For the tasks of action and attribute classification, we train h…

    Submitted 5 May, 2015; v1 submitted 8 December, 2014; originally announced December 2014.

  31. arXiv:1411.6031

    cs.CV

    Finding Action Tubes

    Authors: Georgia Gkioxari, Jitendra Malik

    Abstract: We address the problem of action detection in videos. Driven by the latest progress in object detection from 2D images, we build action models using rich feature hierarchies derived from shape and kinematic cues. We incorporate appearance and motion in two ways. First, starting from image region proposals we select those that are motion salient and thus are more likely to contain the action. This…

    Submitted 21 November, 2014; originally announced November 2014.

  32. arXiv:1406.5212

    cs.CV

    R-CNNs for Pose Estimation and Action Detection

    Authors: Georgia Gkioxari, Bharath Hariharan, Ross Girshick, Jitendra Malik

    Abstract: We present convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images. Our approach involves training an R-CNN detector with loss functions depending on the task being tackled. We evaluate our method on the challenging PASCAL VOC dataset and compare it to previous leading approaches. Our method gives state-of-the-art result…

    Submitted 19 June, 2014; originally announced June 2014.