OCTO+: A Suite for Automatic Open-Vocabulary Object Placement in Mixed Reality

Authors: Aditya Sharma, Luke Yoffe, Tobias Höllerer

Abstract: One key challenge in Augmented Reality is the placement of virtual content in natural locations. Most existing automated techniques can only work with a closed-vocabulary, fixed set of objects. In this paper, we introduce and evaluate several methods for automatic object placement using recent advances in open-vocabulary vision-language models. Through a multifaceted evaluation, we identify a new state-of-the-art method, OCTO+. We also introduce a benchmark for automatically evaluating the placement of virtual objects in augmented reality, alleviating the need for costly user studies. Through this, in addition to human evaluations, we find that OCTO+ places objects in a valid region over 70% of the time, outperforming other methods on a range of metrics.

Submitted 16 January, 2024; originally announced January 2024.

Comments: 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIXVR)

arXiv:2312.12815

OCTOPUS: Open-vocabulary Content Tracking and Object Placement Using Semantic Understanding in Mixed Reality

Authors: Luke Yoffe, Aditya Sharma, Tobias Höllerer

Abstract: One key challenge in augmented reality is the placement of virtual content in natural locations. Existing automated techniques are only able to work with a closed-vocabulary, fixed set of objects. In this paper, we introduce a new open-vocabulary method for object placement. Our eight-stage pipeline leverages recent advances in segmentation models, vision-language models, and LLMs to place any virtual object in any AR camera frame or scene. In a preliminary user study, we show that our method performs at least as well as human experts 57% of the time.

Submitted 20 December, 2023; originally announced December 2023.

Comments: IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2023

arXiv:2204.10356

Interactive Segmentation and Visualization for Tiny Objects in Multi-megapixel Images

Authors: Chengyuan Xu, Boning Dong, Noah Stier, Curtis McCully, D. Andrew Howell, Pradeep Sen, Tobias Höllerer

Abstract: We introduce an interactive image segmentation and visualization framework for identifying, inspecting, and editing tiny objects (just a few pixels wide) in large multi-megapixel high-dynamic-range (HDR) images. Detecting cosmic rays (CRs) in astronomical observations is a cumbersome workflow that requires multiple tools, so we developed an interactive toolkit that unifies model inference, HDR image visualization, segmentation mask inspection and editing into a single graphical user interface. The feature set, initially designed for astronomical data, makes this work a useful research-supporting tool for human-in-the-loop tiny-object segmentation in scientific areas like biomedicine, materials science, remote sensing, etc., as well as computer vision. Our interface features mouse-controlled, synchronized, dual-window visualization of the image and the segmentation mask, a critical feature for locating tiny objects in multi-megapixel images. The browser-based tool can be readily hosted on the web to provide multi-user access and GPU acceleration for any device. The toolkit can also be used as a high-precision annotation tool, or adapted as the frontend for an interactive machine learning framework. Our open-source dataset, CR detection model, and visualization toolkit are available at https://github.com/cy-xu/cosmic-conn.

Submitted 21 April, 2022; originally announced April 2022.

Comments: 6 pages, 4 figures. Accepted by CVPR 2022 Demo Program

ACM Class: I.4

arXiv:2112.00236

VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion

Authors: Noah Stier, Alexander Rich, Pradeep Sen, Tobias Höllerer

Abstract: Recent volumetric 3D reconstruction methods can produce very accurate results, with plausible geometry even for unobserved surfaces. However, they face an undesirable trade-off when it comes to multi-view fusion. They can fuse all available view information by global averaging, thus losing fine detail, or they can heuristically cluster views for local fusion, thus restricting their ability to consider all views jointly. Our key insight is that greater detail can be retained without restricting view diversity by learning a view-fusion function conditioned on camera pose and image content. We propose to learn this multi-view fusion using a transformer. To this end, we introduce VoRTX, an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion. Our model is occlusion-aware, leveraging the transformer architecture to predict an initial, projective scene geometry estimate. This estimate is used to avoid backprojecting image features through surfaces into occluded regions. We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods. We also demonstrate generalization without any fine-tuning, outperforming the same state-of-the-art methods on two other datasets, TUM-RGBD and ICL-NUIM.

Submitted 30 November, 2021; originally announced December 2021.

Comments: 3DV 2021

arXiv:2112.00202

3DVNet: Multi-View Depth Prediction and Volumetric Refinement

Authors: Alexander Rich, Noah Stier, Pradeep Sen, Tobias Höllerer

Abstract: We present 3DVNet, a novel multi-view stereo (MVS) depth-prediction method that combines the advantages of previous depth-based and volumetric MVS approaches. Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions, resulting in highly accurate predictions which agree on the underlying scene geometry. Unlike existing depth-prediction techniques, our method uses a volumetric 3D convolutional neural network (CNN) that operates in world space on all depth maps jointly. The network can therefore learn meaningful scene-level priors. Furthermore, unlike existing volumetric MVS techniques, our 3D CNN operates on a feature-augmented point cloud, allowing for effective aggregation of multi-view information and flexible iterative refinement of depth maps. Experimental results show our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics on the ScanNet dataset, as well as a selection of scenes from the TUM-RGBD and ICL-NUIM datasets. This shows that our method is both effective and generalizes to new settings.

Submitted 30 November, 2021; originally announced December 2021.

Comments: 10 pages, 6 figures, 3 tables. Accepted to 3DV 2021

arXiv:2111.11992

Sparse Fusion for Multimodal Transformers

Authors: Yi Ding, Alex Rich, Mason Wang, Noah Stier, Matthew Turk, Pradeep Sen, Tobias Höllerer

Abstract: Multimodal classification is a core task in human-centric machine learning. We observe that information is highly complementary across modalities, thus unimodal information can be drastically sparsified prior to multimodal fusion without loss of accuracy. To this end, we present Sparse Fusion Transformers (SFT), a novel multimodal fusion method for transformers that performs comparably to existing state-of-the-art methods while having greatly reduced memory footprint and computation cost. Key to our idea is a sparse-pooling block that reduces unimodal token sets prior to cross-modality modeling. Evaluations are conducted on multiple multimodal benchmark datasets for a wide range of classification tasks. State-of-the-art performance is obtained on multiple benchmarks under similar experiment conditions, while reporting up to six-fold reduction in computational cost and memory requirements. Extensive ablation studies showcase our benefits of combining sparsification and multimodal learning over naive approaches. This paves the way for enabling multimodal learning on low-resource devices.

Submitted 24 November, 2021; v1 submitted 23 November, 2021; originally announced November 2021.

Comments: 11 pages, 4 figures, 5 tables, Yi Ding and Alex Rich contributed equally

arXiv:2107.02965

Telelife: The Future of Remote Living

Authors: Jason Orlosky, Misha Sra, Kenan Bektaş, Huaishu Peng, Jeeeun Kim, Nataliya Kos'myna, Tobias Hollerer, Anthony Steed, Kiyoshi Kiyokawa, Kaan Akşit

Abstract: In recent years, everyday activities such as work and socialization have steadily shifted to more remote and virtual settings. With the COVID-19 pandemic, the switch from physical to virtual has been accelerated, which has substantially affected various aspects of our lives, including business, education, commerce, healthcare, and personal life. This rapid and large-scale switch from in-person to remote interactions has revealed that our current technologies lack functionality and are limited in their ability to recreate interpersonal interactions. To help address these limitations in the future, we introduce "Telelife," a vision for the near future that depicts the potential means to improve remote living better aligned with how we interact, live and work in the physical world. Telelife encompasses novel synergies of technologies and concepts such as digital twins, virtual prototyping, and attention and context-aware user interfaces with innovative hardware that can support ultrarealistic graphics, user state detection, and more. These ideas will guide the transformation of our daily lives and routines soon, targeting the year 2035. In addition, we identify opportunities across high-impact applications in domains related to this vision of Telelife. Along with a recent survey of relevant fields such as human-computer interaction, pervasive computing, and virtual reality, the directions outlined in this paper will guide future research on remote living.

Submitted 6 July, 2021; originally announced July 2021.

arXiv:2103.02130

Augmentation Strategies for Learning with Noisy Labels

Authors: Kento Nishi, Yi Ding, Alex Rich, Tobias Höllerer

Abstract: Imperfect labels are ubiquitous in real-world datasets. Several recent successful methods for training deep neural networks (DNNs) robust to label noise have used two primary techniques: filtering samples based on loss during a warm-up phase to curate an initial set of cleanly labeled samples, and using the output of a network as a pseudo-label for subsequent loss calculations. In this paper, we evaluate different augmentation strategies for algorithms tackling the "learning with noisy labels" problem. We propose and examine multiple augmentation strategies and evaluate them using synthetic datasets based on CIFAR-10 and CIFAR-100, as well as on the real-world dataset Clothing1M. Due to several commonalities in these algorithms, we find that using one set of augmentations for loss modeling tasks and another set for learning is the most effective, improving results on the state-of-the-art and other previous methods. Furthermore, we find that applying augmentation during the warm-up period can negatively impact the loss convergence behavior of correctly versus incorrectly labeled samples. We introduce this augmentation strategy to the state-of-the-art technique and demonstrate that we can improve performance across all evaluated noise levels. In particular, we improve accuracy on the CIFAR-10 benchmark at 90% symmetric noise by more than 15% in absolute accuracy, and we also improve performance on the Clothing1M dataset. (K. Nishi and Y. Ding contributed equally to this work)

Submitted 1 April, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

arXiv:1711.11243

doi 10.1109/TVCG.2018.2868568

ARbis Pictus: A Study of Language Learning with Augmented Reality

Authors: Adam Ibrahim, Brandon Huynh, Jonathan Downey, Tobias Höllerer, Dorothy Chun, John O'Donovan

Abstract: This paper describes "ARbis Pictus" --a novel system for immersive language learning through dynamic labeling of real-world objects in augmented reality. We describe a within-subjects lab-based study (N=52) that explores the effect of our system on participants learning nouns in an unfamiliar foreign language, compared to a traditional flashcard-based approach. Our results show that the immersive experience of learning with virtual labels on real-world objects is both more effective and more enjoyable for the majority of participants, compared to flashcards. Specifically, when participants learned through augmented reality, they scored significantly better by 7% (p=0.011) on productive recall tests performed same-day, and significantly better by 21% (p=0.001) on 4-day delayed productive recall post tests than when they learned using the flashcard method. We believe this result is an indication of the strong potential for language learning in augmented reality, particularly because of the improvement shown in sustained recall compared to the traditional approach.

Submitted 17 June, 2019; v1 submitted 30 November, 2017; originally announced November 2017.

Comments: TVCG version

Journal ref: IEEE Transactions on Visualization and Computer Graphics (Volume: 24, Issue: 11, Nov. 2018)

arXiv:1702.06492

doi 10.1145/3027063.3053227

Automated Assistants to Identify and Prompt Action on Visual News Bias

Authors: Vishwajeet Narwal, Mohamed Hashim Salih, Jose Angel Lopez, Angel Ortega, John O'Donovan, Tobias Höllerer, Saiph Savage

Abstract: Bias is a common problem in today's media, appearing frequently in text and in visual imagery. Users on social media websites such as Twitter need better methods for identifying bias. Additionally, activists, those who are motivated to effect change related to some topic, need better methods to identify and counteract bias that is contrary to their mission. With both of these use cases in mind, in this paper we propose a novel tool called UnbiasedCrowd that supports identification of, and action on, bias in visual news media. In particular, it addresses the following key challenges: (1) identification of bias; (2) aggregation and presentation of evidence to users; (3) enabling activists to inform the public of bias and take action by engaging people in conversation with bots. We describe a preliminary study on the Twitter platform that explores the impressions that activists had of our tool, and how people reacted and engaged with online bots that exposed visual bias. We conclude by discussing design and implication of our findings for creating future systems to identify and counteract the effects of news bias.

Submitted 10 March, 2017; v1 submitted 21 February, 2017; originally announced February 2017.

Comments: 6 pages, 6 figures, (Accepted) CHI 17 Extended Abstracts, May 06-11, 2017, Denver, CO, USA

ACM Class: K.4.2

arXiv:1607.03949

Large Scale SfM with the Distributed Camera Model

Authors: Chris Sweeney, Victor Fragoso, Tobias Hollerer, Matthew Turk

Abstract: We introduce the distributed camera model, a novel model for Structure-from-Motion (SfM). This model describes image observations in terms of light rays with ray origins and directions rather than pixels. As such, the proposed model is capable of describing a single camera or multiple cameras simultaneously as the collection of all light rays observed. We show how the distributed camera model is a generalization of the standard camera model and describe a general formulation and solution to the absolute camera pose problem that works for standard or distributed cameras. The proposed method computes a solution that is up to 8 times more efficient and robust to rotation singularities in comparison with gDLS. Finally, this method is used in a novel large-scale incremental SfM pipeline where distributed cameras are accurately and robustly merged together. This pipeline is a direct generalization of traditional incremental SfM; however, instead of incrementally adding one camera at a time to grow the reconstruction, the reconstruction is grown by adding a distributed camera. Our pipeline produces highly accurate reconstructions efficiently by avoiding the need for many bundle adjustment iterations and is capable of computing a 3D model of Rome from over 15,000 images in just 22 minutes.

Submitted 30 November, 2016; v1 submitted 13 July, 2016; originally announced July 2016.

Comments: Published at 2016 3DV Conference

arXiv:1509.06026

doi 10.1145/2818048.2819985

Botivist: Calling Volunteers to Action Using Online Bots

Authors: Saiph Savage, Andres Monroy-Hernandez, Tobias Hollerer

Abstract: To help activists call new volunteers to action, we present Botivist: a platform that uses Twitter bots to find potential volunteers and request contributions. By leveraging different Twitter accounts, Botivist employs different strategies to encourage participation. We explore how people respond to bots calling them to action using a test case about corruption in Latin America. Our results show that the majority of volunteers (>80%) who responded to Botivist's calls to action contributed relevant proposals to address the assigned social problem. Different strategies produced differences in the quantity and relevance of contributions. Some strategies that work well offline and face-to-face appeared to hinder people's participation when used by an online bot. We analyze user behavior in response to being approached by bots with an activist purpose. We also provide strong evidence for the value of this type of civic media, and derive design implications.

Submitted 20 September, 2015; originally announced September 2015.

Comments: 9 pages, 3 figures, CSCW'16

ACM Class: H.5.2

arXiv:1509.01095

Tag Me Maybe: Perceptions of Public Targeted Sharing on Facebook

Authors: Saiph Savage, Andres Monroy-Hernandez, Kasturi Bhattacharjee, Tobias Hollerer

Abstract: Social network sites allow users to publicly tag people in their posts. These tagged posts allow users to share to both the general public and a targeted audience, dynamically assembled via notifications that alert the people mentioned. We investigate people's perceptions of this mixed sharing mode through a qualitative study with 120 participants. We found that individuals like this sharing modality as they believe it strengthens their relationships. Individuals also report using tags to have more control of Facebook's ranking algorithm, and to expose one another to novel information and people. This work helps us understand people's complex relationships with the algorithms that mediate their interactions with one another. We conclude by discussing the design implications of these findings.

Submitted 3 September, 2015; originally announced September 2015.

Comments: 5 pages, one figure, Hypertext 2016

ACM Class: H.5.3

Showing 1–13 of 13 results for author: Hollerer, T