HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections

Authors: Chen Dudai, Morris Alper, Hana Bezalel, Rana Hanocka, Itai Lang, Hadar Averbuch-Elor

Abstract: Internet image collections containing photos captured by crowds of photographers show promise for enabling digital exploration of large-scale tourist landmarks. However, prior works focus primarily on geometric reconstruction and visualization, neglecting the key role of language in providing a semantic interface for navigation and fine-grained understanding. In constrained 3D domains, recent methods have leveraged vision-and-language models as a strong prior of 2D visual semantics. While these models display an excellent understanding of broad visual semantics, they struggle with unconstrained photo collections depicting such tourist landmarks, as they lack expert knowledge of the architectural domain. In this work, we present a localization system that connects neural representations of scenes depicting large-scale landmarks with text describing a semantic region within the scene, by harnessing the power of SOTA vision-and-language models with adaptations for understanding landmark scene semantics. To bolster such models with fine-grained knowledge, we leverage large-scale Internet data containing images of similar landmarks along with weakly-related textual information. Our approach is built upon the premise that images physically grounded in space can provide a powerful supervision signal for localizing new concepts, whose semantics may be unlocked from Internet textual metadata with large language models. We use correspondences between views of scenes to bootstrap spatial understanding of these semantics, providing guidance for 3D-compatible segmentation that ultimately lifts to a volumetric scene representation. Our results show that HaLo-NeRF can accurately localize a variety of semantic concepts related to architectural landmarks, surpassing the results of other 3D models as well as strong 2D segmentation baselines. Our project page is at https://tau-vailab.github.io/HaLo-NeRF/.

Submitted 14 February, 2024; originally announced April 2024.

Comments: Eurographics 2024. Project page: https://tau-vailab.github.io/HaLo-NeRF/

arXiv:2403.01306

ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Authors: Moran Yanuka, Morris Alper, Hadar Averbuch-Elor, Raja Giryes

Abstract: Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, image caption concreteness, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and sentence-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings.

Submitted 11 June, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

Comments: Accepted to ACL 2024 (Finding). For Project webpage, see https://moranyanuka.github.io/icc/

arXiv:2312.03631

Mitigating Open-Vocabulary Caption Hallucinations

Authors: Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor

Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. We will release our code and models.

Submitted 19 April, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: Website Link: https://assafbk.github.io/mocha/

arXiv:2310.16781

Kiki or Bouba? Sound Symbolism in Vision-and-Language Models

Authors: Morris Alper, Hadar Averbuch-Elor

Abstract: Although the mapping between sound and meaning in human language is assumed to be largely arbitrary, research in cognitive science has shown that there are non-trivial correlations between particular sounds and meanings across languages and demographic groups, a phenomenon known as sound symbolism. Among the many dimensions of meaning, sound symbolism is particularly salient and well-demonstrated with regards to cross-modal associations between language and the visual domain. In this work, we address the question of whether sound symbolism is reflected in vision-and-language models such as CLIP and Stable Diffusion. Using zero-shot knowledge probing to investigate the inherent knowledge of these models, we find strong evidence that they do show this pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our work provides a novel method for demonstrating sound symbolism and understanding its nature using computational tools. Our code will be made publicly available.

Submitted 2 April, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: Accepted to NeurIPS 2023 (spotlight). Project webpage: https://kiki-bouba.github.io/

arXiv:2309.14821

Expedited Data Transfers for Serverless Clouds

Authors: Dmitrii Ustiugov, Shyam Jesalpura, Mert Bora Alper, Michal Baczun, Rustem Feyzkhanov, Edouard Bugnion, Boris Grot, Marios Kogias

Abstract: Serverless computing has emerged as a popular cloud deployment paradigm. In serverless, the developers implement their application as a set of chained functions that form a workflow in which functions invoke each other. The cloud providers are responsible for automatically scaling the number of instances for each function on demand and forwarding the requests in a workflow to the appropriate function instance. Problematically, today's serverless clouds lack efficient support for cross-function data transfers in a workflow, preventing the efficient execution of data-intensive serverless applications. In production clouds, functions transmit intermediate, i.e., ephemeral, data to other functions either as part of invocation HTTP requests (i.e., inline) or via third-party services, such as AWS S3 storage or AWS ElastiCache in-memory cache. The former approach is restricted to small transfer sizes, while the latter supports arbitrary transfers but suffers from performance and cost overheads. This work introduces Expedited Data Transfers (XDT), an API-preserving high-performance data communication method for serverless that enables direct function-to-function transfers. With XDT, a trusted component of the sender function buffers the payload in its memory and sends a secure reference to the receiver, which is picked by the load balancer and autoscaler based on the current load. Using the reference, the receiver instance pulls the transmitted data directly from the sender's memory. XDT is natively compatible with existing autoscaling infrastructure, preserves function invocation semantics, is secure, and avoids the cost and performance overheads of using an intermediate service for data transfers. We prototype our system in vHive/Knative deployed on a cluster of AWS EC2 nodes, showing that XDT improves latency, bandwidth, and cost over AWS S3 and ElastiCache.

Submitted 26 September, 2023; originally announced September 2023.

Comments: latest version

MSC Class: 68 ACM Class: D.4.4

arXiv:2304.14104

Learning Human-Human Interactions in Images from Weak Textual Supervision

Authors: Morris Alper, Hadar Averbuch-Elor

Abstract: Interactions between humans are diverse and context-dependent, but previous works have treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics that measure textual and semantic faithfulness and factual groundedness of our predictions. We further show that our approach outperforms SOTA image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually-curated test set for still image human-human interaction understanding.

Submitted 18 September, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: To be presented at ICCV 2023. Project webpage: https://learning-interactions.github.io

arXiv:2303.12513

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

Authors: Morris Alper, Michael Fiman, Hadar Averbuch-Elor

Abstract: Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being underperformed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.

Submitted 2 November, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023. Project webpage: https://isbertblind.github.io/

arXiv:2206.08874

SwarmHawk: Self-Sustaining Multi-Agent System for Landing on a Moving Platform through an Agent Supervision

Authors: Ayush Gupta, Ekaterina Dorzhieva, Ahmed Baza, Mert Alper, Aleksey Fedoseev, Dzmitry Tsetserukou

Abstract: Heterogeneous teams of mobile robots and UAVs are offering a substantial benefit in an autonomous exploration of the environment. Nevertheless, although joint exploration scenarios for such systems are widely discussed, they are still suffering from low adaptability to changes in external conditions and faults of swarm agents during the UAV docking. We propose a novel vision-based drone swarm docking system for robust landing on a moving platform when one of the agents lost its position signal. The proposed SwarmHawk system relies on vision-based detection for the mobile platform tracking and navigation of its agents. Each drone of the swarm carries an RGB camera and AprilTag3 QR-code marker on board. SwarmHawk can switch between two modes of operation, acting as a homogeneous swarm in case of global UAV localization or assigning leader drones to navigate its neighbors in case of a camera fault in one of the drones or global localization failure. Two experiments were performed to evaluate SwarmHawk's performance under the global and local localization with static and moving platforms. The experimental results revealed a sufficient accuracy in the swarm landing task on a static mobile platform (error of 4.2 cm in homogeneous formation and 1.9 cm in leader-follower formation) and on moving platform (error of 6.9 cm in homogeneous formation and 4.7 cm in leader-follower formation). Moreover, the drones showed a good landing on a platform moving along a complex trajectory (average error of 19.4 cm) in leader-follower formation. The proposed SwarmHawk technology can be potentially applied in various swarm scenarios, including complex environment exploration, inspection, and drone delivery.

Submitted 17 June, 2022; originally announced June 2022.

Comments: Accepted paper at IEEE International Conference on Unmanned Aircraft System (ICUAS 2022), IEEE copyright

arXiv:2206.08856

SwarmHive: Heterogeneous Swarm of Drones for Robust Autonomous Landing on Moving Robot

Authors: Ayush Gupta, Ahmed Baza, Ekaterina Dorzhieva, Mert Alper, Mariia Makarova, Stepan Perminov, Aleksey Fedoseev, Dzmitry Tsetserukou

Abstract: The paper focuses on a heterogeneous swarm of drones to achieve a dynamic landing of formation on a moving robot. This challenging task was not yet achieved by scientists. The key technology is that instead of facilitating each agent of the swarm of drones with computer vision that considerably increases the payload and shortens the flight time, we propose to install only one camera on the leader drone. The follower drones receive the commands from the leader UAV and maintain a collision-free trajectory with the artificial potential field. The experimental results revealed a high accuracy of the swarm landing on a static mobile platform (RMSE of 4.48 cm). RMSE of swarm landing on the mobile platform moving with the maximum velocities of 1.0 m/s and 1.5 m/s equals 8.76 cm and 8.98 cm, respectively. The proposed SwarmHive technology will allow the time-saving landing of the swarm for further drone recharging. This will make it possible to achieve self-sustainable operation of a multi-agent robotic system for such scenarios as rescue operations, inspection and maintenance, autonomous warehouse inventory, cargo delivery, etc.

Submitted 17 June, 2022; originally announced June 2022.

Comments: Accepted paper at IEEE Vehicular Technology Conference 2022 (IEEE VTC 2022), IEEE copyright

Showing 1–9 of 9 results for author: Alper, M