Search | arXiv e-print repository

Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following

Authors: Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, Jean-Marc Odobez

Abstract: Contextual cues related to a person's pose and interactions with objects and other people in the scene can provide valuable information for gaze following. While existing methods have focused on dedicated cue extraction methods, in this work we investigate the zero-shot capabilities of Vision-Language Models (VLMs) for extracting a wide array of contextual cues to improve gaze following performance. We first evaluate various VLMs, prompting strategies, and in-context learning (ICL) techniques for zero-shot cue recognition performance. We then use these insights to extract contextual cues for gaze following, and investigate their impact when incorporated into a state-of-the-art model for the task. Our analysis indicates that BLIP-2 is the overall top-performing VLM and that ICL can improve performance. We also observe that VLMs are sensitive to the choice of the text prompt, although ensembling over multiple text prompts can provide more robust performance. Additionally, we discover that using the entire image along with an ellipse drawn around the target person is the most effective strategy for visual prompting. For gaze following, incorporating the extracted cues results in better generalization performance, especially when considering a larger set of cues, highlighting the potential of this approach.

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted at the GAZE Workshop at CVPR 2024

arXiv:2403.10511

A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction

Authors: Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, Jean-Marc Odobez

Abstract: Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by considering social gaze inference as an ad-hoc post-processing of the gaze following task. Furthermore, the vast majority of gaze following approaches have proposed static models that can handle only one person at a time, therefore failing to take advantage of social interactions and temporal dynamics. In this paper, we address these limitations and introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. The framework comprises: (i) a temporal, transformer-based architecture that, in addition to image tokens, handles person-specific tokens capturing the gaze information related to each individual; (ii) a new dataset, VSGaze, that unifies annotation types across multiple gaze following and social gaze datasets. We show that our model trained on VSGaze can address all tasks jointly, and achieves state-of-the-art results for multi-person gaze following and social gaze prediction.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2203.10974

Towards Self-Supervised Gaze Estimation

Authors: Arya Farkhondeh, Cristina Palmero, Simone Scardapane, Sergio Escalera

Abstract: Recent joint embedding-based self-supervised methods have surpassed standard supervised approaches on various image recognition tasks such as image classification. These self-supervised methods aim at maximizing agreement between features extracted from two differently transformed views of the same image, which results in learning an invariant representation with respect to appearance and geometric image transformations. However, the effectiveness of these approaches remains unclear in the context of gaze estimation, a structured regression task that requires equivariance under geometric transformations (e.g., rotations, horizontal flip). In this work, we propose SwAT, an equivariant version of the online clustering-based self-supervised approach SwAV, to learn more informative representations for gaze estimation. We demonstrate that SwAT, with ResNet-50 and supported with uncurated unlabeled face images, outperforms state-of-the-art gaze estimation methods and supervised baselines in various experiments. In particular, we achieve up to 57% and 25% improvements in cross-dataset and within-dataset evaluation tasks on existing benchmarks (ETH-XGaze, Gaze360, and MPIIFaceGaze).

Submitted 23 November, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

Comments: BMVC 2022. For code and pre-trained models, visit https://github.com/aryafarkhondeh/SwAT

arXiv:1909.02380

A Private and Unlinkable Message Exchange Using a Public bulletin board in Opportunistic Networks

Authors: Ardalan Farkhondeh

Abstract: We plan to simulate a private and unlinkable exchange of messages by using a Public bulletin board and Mix networks in Opportunistic networks. This Opportunistic network uses a secure and privacy-friendly asynchronous unidirectional message transmission protocol. By using this protocol, we create a Public bulletin board in a network that makes individuals' send and receive events unlinkable to one another. With the design of a Public bulletin board in an Opportunistic network, the clients can use the benefits of this Public bulletin board in a safe environment. When this Opportunistic network uses the protocol, it can guarantee unlinkable communication based on the Mix networks. The protocol can work with the Public bulletin board exclusively with acceptable performance. Also, this simulation can be used for hiding metadata in the bidirectional message exchange in some messengers such as WhatsApp. As we know, one of the main goals of a messenger like WhatsApp is to protect the social graph. By using this protocol, a messenger can protect the social graph and a central Public bulletin board.

Submitted 5 September, 2019; originally announced September 2019.

Comments: 13 pages, 10 figures, conference

Showing 1–4 of 4 results for author: Farkhondeh, A