License: CC BY-NC-SA 4.0
arXiv:2206.09256v2 [cs.CV] 12 Feb 2024

Multistream Gaze Estimation with Anatomical Eye Region Isolation by Synthetic to Real Transfer Learning

Zunayed Mahmud, Paul Hungler, and Ali Etemad, Senior Member, IEEE

Z. Mahmud and A. Etemad are with the Department of Electrical and Computer Engineering and Ingenuity Labs Research Institute, Queen’s University, Kingston, Ontario, Canada. E-mails: {zunayed.mahmud, ali.etemad}@queensu.ca

P. Hungler is with Ingenuity Labs Research Institute, Queen’s University, Kingston, Ontario, Canada. E-mail: [email protected]

Abstract

We propose a novel neural pipeline, MSGazeNet, that learns gaze representations by taking advantage of eye anatomy information through a multistream framework. Our proposed solution comprises two components: a network for isolating anatomical eye regions, and a second network for multistream gaze estimation. The eye region isolation is performed with a U-Net style network which we train using a synthetic dataset that contains eye region masks for the visible eyeball and the iris region. The synthetic dataset used in this stage is procured using the UnityEyes simulator, and consists of 80,000 eye images. After training, the eye region isolation network is transferred to the real domain to generate masks for real-world eye images. To make the transfer successful, we exploit domain randomization in the training process, which allows the synthetic images to benefit from a larger variance with the help of augmentations that resemble artifacts. The generated eye region masks, along with the raw eye images, are then used together as a multistream input to our gaze estimation network, which consists of wide residual blocks. The output embeddings from these encoders are fused in the channel dimension before being fed into the gaze regression layers. We evaluate our framework on three gaze estimation datasets and achieve strong performance. Our method surpasses the state-of-the-art by 7.57% and 1.85% on two datasets, and obtains competitive results on the other. We also study the robustness of our method with respect to noise in the data and demonstrate that our model is less sensitive to noisy data. Lastly, we perform a variety of experiments, including ablation studies, to evaluate the contribution of different components and design choices in our solution.

Impact Statement

Gaze patterns can reveal meaningful information about a person’s behaviour and mental state and are often utilized by modern intelligent interactive systems to better understand users. Gaze can also be a useful communication cue for people with disabilities. The application of gaze estimation ranges from studying human behaviour and psychology to analyzing visual attention in autonomous driving, virtual reality, and remote classrooms. Many of these applications are sensitive to precision and lack user-specific calibration data. In this work, we aim to improve person-independent gaze estimation by presenting a novel framework that integrates eye region segmentation with multistream gaze estimation. Our experiments reveal that using anatomical features in the form of binary masks improves the accuracy of gaze estimation. Our model does not require any calibration samples yet can estimate gaze for unseen users with high accuracy and can be seamlessly integrated into real-time systems. Since gaze tracking involves the use of an individual’s eye image and has the potential to disclose sensitive details about where the user is looking, it is important to first obtain consent and ensure the maintenance of privacy before proceeding with the work.

Index Terms

Gaze estimation, eye region segmentation, multistream network, deep neural network, domain randomization, transfer learning.

1 Introduction

Eye gaze patterns can be used to characterize important eye-related movement events such as fixation, saccade, and smooth pursuit [1], which in turn can reveal meaningful information about human behaviour such as a person’s emotion, intention, desire, and state of mind. This classified eye movement data can further be exploited as useful features in cognitive load detection [2], mental fatigue detection [3], and stress level analysis [4]. The application of gaze extends to many different fields such as human-computer interaction (HCI) [5], human-robot interaction (HRI) [6], visual attention analysis [7], augmented or virtual reality (AR/VR) systems [8, 9], autonomous driving [10], and others. Hence, gaze estimation has become a widely acknowledged research area in computer vision due to its relevance and numerous contributions to various applications.

Early works on image-based gaze estimation were performed under constrained settings such as fixed head pose and unchanged illumination [11, 12]. Subsequently, with newer datasets such as UTMultiview [13], Eyediap [14], and MPIIGaze [15], some of the above-mentioned limitations were mitigated to make gaze estimation more realistic and compatible with in-the-wild scenarios. Such datasets were collected either in a laboratory environment [13, 14] or in daily life settings [15], offering continuous head pose, continuous gaze targets, variation in appearance, and illumination. These datasets are inherently more challenging for gaze estimation given the dynamic nature and variations in experimental conditions. To overcome these challenges, most recent methods [15, 16, 17, 18] have leveraged deep convolutional neural networks (CNNs) as they are comparatively more robust towards noise and changes in visual factors.

Deep learning solutions for gaze estimation [19, 15, 18] generally focus on regressing gaze angles directly from the raw eye image, and often do not consider additional information which may be found in different regions of the eye. For instance, the iris region contains important information that can aid in better estimation of gaze, if it were explicitly learned by the network. Moreover, distinct anatomical regions of the eye, for example the visible eyeball and the iris, are not highly complex to detect, making them insensitive to noise and illumination variations throughout the image, and thus highly beneficial for gaze learning. Yet, to the best of our knowledge, gaze estimation solutions have rarely focused on these properties, and explicit detection and learning of anatomical eye regions has so far been ignored. Eye landmarks have in fact been used to help gaze estimation [20, 21], but are difficult to learn, especially in noisy settings, and considerably increase the dimensionality of the data.

In this paper, we aim to improve person-independent gaze estimation by exploiting additional information extracted from raw input images. Our proposed solution consists of two key steps, namely anatomical eye region isolation and multistream gaze estimation, together called MSGazeNet. An overview of our proposed framework is depicted in Figure 1. Our model first uses a U-Net style network [22] to perform anatomical eye region isolation, outputting binary masks for two key eye regions, the iris and the visible eyeball. Subsequently, our model uses a multistream gaze estimation network that takes the raw input eye images along with the outputs of the U-Net as inputs to estimate the gaze. This component of MSGazeNet uses an encoder in each stream to learn effective gaze-related representations. Next, channel-wise feature fusion is performed. Finally, the fused representations are passed through additional convolutional layers and a regression block consisting of a set of fully connected (FC) layers to estimate the output gaze. Given that the existing real-world datasets that are used for gaze estimation do not contain detailed eye structure information such as the visible eye region or the iris mask, we train the anatomical eye region isolation network exclusively on a synthetic dataset which we procure using the UnityEyes simulator [23]. This synthetic dataset consists of 80,000 synthetic eye images and corresponding eye region masks. Once the isolation network is trained in the synthetic domain, we perform transfer learning and integrate it into MSGazeNet. We also perform domain randomization when training the isolation network to ensure that the domain gap between the synthetic images and the real images used to train the downstream network is reduced. The weights of the isolation network are then frozen and the rest of the network is trained on the real-world gaze datasets for gaze estimation. We use three publicly available gaze datasets, MPIIGaze [15], Eyediap [14], and UTMultiview [13], to evaluate our solution. MSGazeNet obtains strong results, outperforming the state-of-the-art on Eyediap and UTMultiview, and achieving competitive results on MPIIGaze. We also perform a number of ablation studies and qualitative analyses to test the impact of different components and parameters in our network. Lastly, we perform a robustness analysis to investigate the performance across different amounts of noise in the real-world datasets, and observe that MSGazeNet performs more robustly than prior works in this area.

Figure 1: Overview of our proposed method. The anatomical eye region isolation module is used to generate binary eye region masks for the real-world eye images. These masks along with the raw eye image are then used by the multistream gaze estimator to perform gaze estimation. The pre-trained mask generation is represented via dotted lines and the solid lines represent the training pipeline.

Our contributions in this paper are three-fold. (1) We propose a novel deep neural framework for gaze estimation which consists of two main components, an anatomical eye region isolation module followed by a gaze estimation network, together called MSGazeNet. The isolation network detects key areas of the eye which are informative toward gaze estimation. Our approach eliminates the need for additional inputs such as head pose and eye landmarks, which are often required by multimodal solutions in the area. (2) The eye region isolation network is solely trained using a synthetic dataset that contains over 80,000 eye images along with their iris and visible eyeball masks. The trained network is used to extract binary masks for real-world eye images. We perform domain randomization utilizing artifact-like augmentations to ensure a smooth transfer from the synthetic to the real domain. (3) Through rigorous experiments, we demonstrate that our model performs robust gaze estimation even in the presence of noisy data. Given the ability of our model to learn the critical regions of the eye, our results set a new state of the art on two benchmark gaze estimation datasets (Eyediap and UTMultiview) and achieve competitive performance on another dataset (MPIIGaze). We then validate our design choices through detailed ablation studies and by exploring several variants of our proposed solution. To encourage reproducibility and contribute to the field, we make our code public at: https://github.com/z-mahmud22/MSGazeNet.

The rest of this paper is organized as follows. In the next section, we describe the related work in the field. We then present our method, including the network and synthesized dataset. Next, we describe the experimental setup and implementation details. This is followed by detailed experimental results and various sensitivity/ablation studies. Lastly, we summarize our work and discuss potential future research directions.

Table 1: An overview of existing gaze estimation methods.
| Year | Method | Dataset | Input | Feature Extractor | Regressor |
|---|---|---|---|---|---|
| 2015 | Zhang et al. [19] | MPIIGaze, UTMultiview | Image, pose | LeNet | FC |
| 2017 | Zhang et al. [15] | MPIIGaze, UTMultiview | Image, pose | VGG-16 | FC |
| 2018 | Park et al. [24] | Eyediap, MPIIGaze | Image | Hourglass+DenseNet | FC |
| 2018 | Park et al. [20] | Columbia [25], Eyediap, MPIIGaze, UTMultiview | Image | Hourglass | SVR |
| 2018 | Yu et al. [18] | Eyediap, UTMultiview | Image, pose | 4-layer CNN | FC |
| 2019 | Yu et al. [26] | Columbia, MPIIGaze | Image | VGG-16 | FC |
| 2019 | Wang et al. [27] | Columbia, Eyediap, MPIIGaze, UTMultiview | Image | Bayesian CNN | FC |
| 2020 | Yu et al. [28] | Columbia, Eyediap, UTMultiview | Image | ResNet | FC |
| 2022 | Ghosh et al. [29] | Columbia, Gaze360 [30], MPIIGaze | Image | ResNet-50 | FC |
| 2022 | Mahmud et al. [31] | Eyediap | Image | U-Net+Multistream VGG-16 | FC |
| 2023 | Cai et al. [32] | Eyediap, ETH-XGaze [33], Gaze360, GazeCapture [34], MPIIGaze | Image | Real-ESRGAN [35]+ResNet-18 | FC |
| 2023 | Jin et al. [36] | Eyediap, MPIIGaze | Image, pose | VGG-16 | FC |
| 2023 | Jindal et al. [37] | EVE [38], Columbia, MPIIGaze | Image | ResNet-18 | FC |

2 Related Work

Gaze is often represented as a 2D screen coordinate representing the point of gaze (PoG), or an angular vector representing the gaze direction by pitch and yaw angles. Vision-based methods primarily fall under three categories: feature-based, model-based, and appearance-based. Feature-based approaches [39, 40, 41] use the geometric shape of the eye to extract hand-crafted features such as the eyeball centre, radius, pupil centre, and eye corners from eye images, which are then used in light-weight machine learning models to regress gaze. These methods were generally used prior to the emergence of deep learning solutions. Model-based methods [42, 43, 44] aim to fit 3D deformable eye region models to eye images. Both classical machine learning and more recent deep learning techniques have been used in this category of literature. Appearance-based methods [23, 15, 18, 24] aim to learn a direct mapping of gaze from input eye images. These methods, which our paper also falls under, mainly rely on deep learning models. Below, we present the related work on appearance-based gaze estimation methods. In particular, we review prior works on supervised learning, domain adaptation methods, as well as few-shot, semi-supervised, and self-supervised solutions. Lastly, since segmentation of eye regions is used in our study, we also briefly review the works in this area.

2.1 Supervised Methods

A multimodal network was proposed in [19] where a LeNet type architecture was adopted to perform gaze estimation from eye images and head pose. In a subsequent work [15], a VGG type architecture was extended into a multimodal model that also used eye images and head pose as multimodal inputs. In [18], a deep multitask network was proposed where the network aimed to learn eye gaze and eye landmarks from eye images and their corresponding head pose information. The work explored the correlation of eye landmarks and gaze direction and argued that eye landmarks provide informative cues for gaze estimation. Landmark detection in the form of Gaussian probability heatmaps of landmark coordinates from synthetic images was proposed in [20] using a stacked hourglass network [45]. The predicted landmarks were then fed into a support vector regressor (SVR) [46] to estimate gaze. The proposed solution improved iris localization, eyelid registration, and gaze estimation accuracies in both cross-dataset and person-specific settings. In [24], a pictorial representation of gaze was proposed, which was hypothesized to be an intermediate eye image representation. The proposed pipeline consisted of a stacked hourglass network [45] which was trained to predict the intermediate gazemaps from eye images, followed by a lightweight DenseNet architecture [47] to regress gaze from the gazemaps. Another multimodal approach was proposed in [21], which used both RGB eye images and their corresponding eye landmark heatmaps for gaze estimation. The two inputs were processed via separate CNN encoders to extract features which were then concatenated along with head pose information, and subsequently fed into dense layers that output the 3D gaze direction.

2.2 Domain Adaptation Methods

A synthetic dataset, SynthesEyes, was published in [48] to perform both eye shape registration and gaze estimation. The dataset offers a wide variation of synthesized eye images in terms of head pose, gaze, and illumination conditions. It was shown that when used for pre-training, the dataset results in significant performance improvement in a cross-dataset setting using the network proposed in [19]. Following this work, UnityEyes, a synthesis framework that can render eye region images in real time, was developed in [23]. The system can be used to generate large-scale synthetic eye image datasets and their corresponding landmarks, along with eye gaze annotations. A synthetic dataset consisting of millions of images generated by the simulator was used to train a simple kNN algorithm, outperforming their previous work [48] in cross-dataset experiments. To minimize the domain gap between synthetic and real images, a generative adversarial network (GAN) was proposed in [16] that used unlabeled synthetic and real eye images. The network consisted of a refiner network that refined the synthetic images to make them more realistic through adversarial learning. These refined images were then used to train a simple CNN to estimate gaze, which outperformed previous state-of-the-art methods by a large margin in cross-dataset settings. A further improvement was reported in [17], which relied on bidirectional mapping between synthetic and real eye images by leveraging a cyclic image-to-image translation framework. Highlighting the key challenges in cross-domain gaze estimation, a domain generalization technique was proposed in [49] where gaze-irrelevant features such as illumination and appearance factors were eliminated via self-adversarial learning to extract purified gaze-relevant features from facial images. The conducted experiments resulted in new state-of-the-art performance in cross-dataset settings across multiple gaze estimation datasets without any fine-tuning. In [32], a source-free domain adaptation method was introduced to adapt a gaze estimator to an unlabeled target domain without any source data. The neural architecture consisted of a face enhancer model that generates high-quality input images for the gaze estimator, leading to reduced variance and uncertainty of gaze predictions in the target domain.

2.3 Few-Shot and Unsupervised Methods

For person-specific gaze estimation, a differential CNN was proposed in [50], which output the gaze difference between two eyes of the same subject. During inference, the network used a set of calibration samples from the same subject and predicted the gaze difference between the input image and the calibration samples as the estimated gaze. With the intent of making more personalized gaze networks with lower gaze error, a person-specific gaze estimation network was proposed in [51] that worked with only a few ($\leq 9$) calibration samples from the test person. The proposed solution disentangled appearance features, gaze, and head pose information from facial images using a disentangling encoder-decoder (DT-ED) [51]. The network took an RGB face image as input and the decoder mapped it to three latent space vectors, which corresponded to eye region appearance, gaze, and head pose information respectively. The gaze latent vector was then fed to dense layers in order to make the gaze prediction. The scarcity of calibration samples in few-shot person-specific gaze estimation was addressed in [26] by generating more training samples via synthesis of gaze-redirected eye images from an available set of calibration samples. The framework relied on synthetic images, generated by [23], to learn the gaze redirection task. To better adapt to the real domain, the network was further trained with real images which were first redirected given a redirection angle, and supervised via a gaze redirection loss. Following this, an inverse redirection angle was applied to the gaze-redirected images to reconstruct the original images, which were supervised via a cycle consistency loss. In [36], a Kappa angle compensation method was proposed to neutralize the ocular counter-rolling response (OCR). The normalization process of eye images naturally induces the OCR, which redistributes the Kappa angle’s pitch and yaw components. This method, with a few calibration samples ($\leq 9$) from the test subject, regresses the Kappa angle to refine the estimated gaze.

A semi-supervised approach was presented in [27] where a Bayesian convolutional neural network (BCNN) relied on both labeled and unlabeled eye images to perform gaze estimation along with appearance classification and head pose estimation. The framework also included an adversarial component where the gaze-labeled images were used as the source domain and the unlabeled images were used as the target domain. The framework used the labeled images to supervise the gaze estimator, while the gaze estimator aimed to learn person-invariant features to oppose the adversarial module. Eye region segmentation was performed as an auxiliary task in [31], and the output eye segments and raw eye images were used as inputs to a multistream network for gaze estimation. The multistream network consisted of three encoders to extract features separately from the three inputs, and the eye image encoder was pretrained with self-supervised contrastive learning and then fine-tuned during the downstream gaze estimation task. A sensitivity experiment verified the stability of the proposed network while using a very limited amount of labeled data. In [29], a multitask network was developed to perform three auxiliary gaze-relevant tasks with limited supervision. Using off-the-shelf networks, pseudo-gaze, eye orientation, and head pose were extracted from large-scale facial image datasets [52, 53], which were later used to train a CNN backbone for the auxiliary tasks. To minimize noise in the generated labels, a noise distribution model was also incorporated in the framework. The network was then fine-tuned for downstream gaze estimation. A contrastive learning method, GazeCLR, was proposed in [37] which pre-trains a CNN encoder with both single-view and multi-view gaze samples. These different gaze samples help the network learn invariance and equivariance among gaze representations, improving the cross-domain gaze estimation performance.

In [28], an unsupervised approach was proposed to learn low dimensional gaze representations by utilizing a gaze redirection network. The proposed pipeline used an image pair of the same eye with different gaze directions as inputs to two separate networks for representations to be learned. The output latent vectors and their difference were then used in a gaze redirection network to reconstruct the latent representation of one of the images by redirecting the other image based on the gaze difference. The entire pipeline did not require ground truth gaze labels while training. The trained gaze representation network was then calibrated with randomly selected labeled samples (10-100) from the training data.

Figure 2: Our proposed framework, MSGazeNet. First, we perform anatomical eye region isolation using a U-Net style network which we train using the synthetic dataset. Next, we perform gaze estimation using real-world eye images and their corresponding eye region masks as input to our multistream gaze estimation network.

2.4 Eye Region Segmentation

Given that segmentation plays a role in our method, we briefly review the literature on this topic. In the context of segmentation focused on eye images [54, 55, 56], a large-scale eye segmentation dataset, OpenEDS, was released in [57]. Along with the dataset, some baseline experiments using deep convolutional encoder-decoder architectures [58] were performed to set eye segmentation benchmarks. Subsequently, an attention-based encoder-decoder network, Eyenet [59], was proposed to perform multi-scale supervision during eye region segmentation. The neural architecture consisted of slightly modified residual units and two types of attention modules that applied attention on both channel and spatial dimensions. In [60], a lightweight segmentation network based on MobileNetv2 [61] was proposed which significantly reduced the processing time by utilizing depthwise convolution. A real-time segmentation network, RITnet, was released in [62] to segment eye images at 300 Hz. The proposed solution combined DenseNet with U-Net to create the architecture which was supervised via a weighted combination of three different loss functions. Semi-supervised approaches were explored in [63, 64] where the solutions significantly improved the baseline performance with fewer annotated data and trainable parameters. In [65], three types of domain adaptation methods, supervised, semi-supervised, and unsupervised were explored using eye segmentation datasets collected from two different setups.

3 Method

In this section, we first discuss the problem statement. This is followed by an overview of the proposed network, and a detailed description of each component in our pipeline.

3.1 Problem Statement

Let’s assume we have input $x$ which contains a grey-scale eye image. Our goal is to develop a model $\mathcal{F}$ which can reliably estimate the gaze parameters, namely the pitch ($\phi_{p}$) and yaw ($\phi_{y}$) angles. Pitch angles refer to the up/down eye movements and the yaw angles refer to the left/right eye movements. We hypothesize that in addition to learning the representation of the overall eye image, $x$, extracting information explicitly from anatomical regions, namely the visible eyeball $x_{\text{vis}}$ and the iris $x_{\text{iris}}$, would result in more informative features for estimating the final $\phi_{p}$ and $\phi_{y}$.
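To make the regression target concrete, the following minimal sketch converts a (pitch, yaw) pair into a unit 3D gaze direction and measures the angle between two such directions, which is a common way of reporting gaze error in degrees. The axis and sign convention, as well as the function names `angles_to_vector` and `angular_error_deg`, are assumptions made for illustration; the text above only states that pitch is the up/down angle and yaw is the left/right angle.

```python
import math

def angles_to_vector(pitch, yaw):
    """Convert pitch/yaw (radians) to a unit 3D gaze vector.

    Assumed convention (not specified in the text): x to the right,
    y downwards, z away from the camera, so (0, 0) looks along -z.
    """
    x = -math.cos(pitch) * math.sin(yaw)
    y = -math.sin(pitch)
    z = -math.cos(pitch) * math.cos(yaw)
    return (x, y, z)

def angular_error_deg(pred, target):
    """Angle in degrees between two (pitch, yaw) gaze directions."""
    a = angles_to_vector(*pred)
    b = angles_to_vector(*target)
    dot = sum(p * q for p, q in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

# Example: a prediction that is off by 10 degrees in yaw only.
error = angular_error_deg((0.0, 0.0), (0.0, math.radians(10.0)))
```

Because both vectors have unit length, the dot product is directly the cosine of the angle between them, so no further normalization is needed.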

3.2 Proposed Solution

Solution Overview. To address the problem above, we design a network that first performs anatomical eye region isolation in order to separate key geometric sections of the eye for explicit processing and representation learning. This critical step in our pipeline, as we will describe later, relies on transfer learning from the simulated domain to the real domain. Next, our pipeline uses all the available information, i.e., the original input along with the isolated eye regions, to perform representation learning followed by fusion, and eventually gaze estimation. An overview of our method is presented in Figure 1. In the following, we describe each of these components in our proposed method.

Anatomical Eye Region Isolation. As touched upon above, a core component of our proposed method is the process of isolating different anatomical eye regions so that they can be individually used for gaze estimation. Here we first discuss our justification for including this step in our proposed method. Due to the inherent noise in real-world images, it is often quite difficult to recognize the gaze or orientation of the eye from raw images. Prior research [18] suggests that gaze direction has a strong correlation with eye landmarks, indicating that these landmarks could potentially contribute to gaze estimation as auxiliary information. However, obtaining such detailed and accurate landmarks is computationally expensive and is itself susceptible to noise. Nevertheless, some methods use off-the-shelf landmark detectors, which are primarily trained using synthetic data, to extract eye landmarks. Yet learning such high-dimensional information is very challenging, and those networks also suffer from the ‘synthetic to real’ domain gap, which hinders the robustness of their landmark predictions. As a result, even though some methods [20, 21] use these landmarks in the form of heatmaps, there still remains a considerable amount of noise in the training data.

In our proposed solution, as opposed to using the eye landmarks directly as auxiliary data, we propose and use the Anatomical Eye Region Isolation (AERI) network to extract and isolate anatomical eye regions, namely the visible eyeball and the iris. We denote this network by $\mathcal{F}^{\theta}_{\text{AERI}}$, where $\theta$ are learnable parameters. $\mathcal{F}^{\theta}_{\text{AERI}}$ takes the eye image $x$ as input, and outputs $m_{\text{vis}}$ and $m_{\text{iris}}$, which are binary masks corresponding to $x_{\text{vis}}$ and $x_{\text{iris}}$ respectively. This design allows our pipeline to gain additional information about the orientation of the eye and key regions therein, without having to rely on the high-dimensional and noisy landmarks.

As shown in Figure 2, $\mathcal{F}^{\theta}_{\text{AERI}}$ uses a U-Net style architecture [22] with a two-channel output $[m_{\text{iris}},~m_{\text{vis}}]^{T}$. Thus, to train the network we require a dataset of $[x,~m_{\text{iris}},~m_{\text{vis}}]^{T}$ tuples. Since such a dataset does not exist and its collection from real images is quite difficult and time-consuming given the difficulty of recording eye images and isolating the anatomical regions for every image manually, we introduce a synthetic dataset. For this purpose, we rely on an eye image simulator, UnityEyes [23], to procure the synthetic eye image dataset. The simulator can render synthetic eye images along with their detailed 2D landmark annotations and gaze labels in real-time. The simulator generates 32 landmarks for the iris region, denoted by $L_{\text{iris}}=[l_{1},l_{2},...,l_{32}]$, where $l_{i}\in\mathbb{R}^{2}$. Next, $m_{\text{iris}}$ is calculated by:

$$m_{\text{iris}}=Bin_{\text{enc}}\big(poly(L_{\text{iris}})\big), \tag{1}$$

where $poly$ is a function that takes a series of landmark coordinates and creates a polygon by connecting them sequentially, and $Bin_{\text{enc}}$ is a binarizing operator which creates a binary mask by taking the enclosed area in its input and setting it to 1, while the outside is set to 0.
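The composition of $poly$ and $Bin_{\text{enc}}$ in Eq. (1) can be sketched as a single polygon rasterization step. The snippet below is a minimal, dependency-free illustration using the even-odd ray-casting rule on pixel centres; it is not taken from the released code, and the function name `polygon_mask`, the 48×64 image size, and the circular stand-in landmarks are all hypothetical.

```python
import math

def polygon_mask(landmarks, height, width):
    """Return a height x width binary mask as nested lists of 0/1.

    `landmarks` is a sequence of (x, y) points joined in order, with the
    last point joined back to the first (the role of poly). A pixel is set
    to 1 when its centre lies inside the polygon (the role of Bin_enc).
    """
    n = len(landmarks)
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        py = row + 0.5  # y coordinate of the pixel centre
        for col in range(width):
            px = col + 0.5  # x coordinate of the pixel centre
            inside = False
            for i in range(n):
                x1, y1 = landmarks[i]
                x2, y2 = landmarks[(i + 1) % n]
                # Does this edge cross the horizontal ray going right from the pixel?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = 1 if inside else 0
    return mask

# Hypothetical stand-in for the 32 iris landmarks: points on a circle.
iris = [(32 + 10 * math.cos(2 * math.pi * k / 32),
         24 + 10 * math.sin(2 * math.pi * k / 32)) for k in range(32)]
m_iris = polygon_mask(iris, height=48, width=64)
```

In practice an image library's polygon-fill routine would do the same job far faster; the explicit loop is used here only to keep the rule visible.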

The simulator also provides 16 landmarks for the interior region of the eye, i.e., the visible eyeball, plus 6 additional 2D coordinates for the caruncle region (inner corner of the eye). Here, we first average the 6 caruncle coordinates to create a single caruncle representative landmark, bringing the total landmarks for the visible eyeball to 17, as Lvis=[l1,l2,,l17]subscript𝐿vissubscript𝑙1subscript𝑙2subscript𝑙17L_{\text{vis}}=[l_{1},l_{2},...,l_{17}]italic_L start_POSTSUBSCRIPT vis end_POSTSUBSCRIPT = [ italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_l start_POSTSUBSCRIPT 17 end_POSTSUBSCRIPT ], where li2subscript𝑙𝑖superscript2l_{i}\in\mathbb{R}^{2}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Next, similar to our approach for creating mirissubscript𝑚irism_{\text{iris}}italic_m start_POSTSUBSCRIPT iris end_POSTSUBSCRIPT, we use:

mvis=Binenc(poly(Lvis)),subscript𝑚vis𝐵𝑖subscript𝑛enc𝑝𝑜𝑙𝑦subscript𝐿vism_{\text{vis}}={Bin_{\text{enc}}}\big{(}poly(L_{\text{vis}})\big{)},italic_m start_POSTSUBSCRIPT vis end_POSTSUBSCRIPT = italic_B italic_i italic_n start_POSTSUBSCRIPT enc end_POSTSUBSCRIPT ( italic_p italic_o italic_l italic_y ( italic_L start_POSTSUBSCRIPT vis end_POSTSUBSCRIPT ) ) , (2)

to generate the visible eyeball mask. The process of obtaining [xmirismvis]Tsuperscriptdelimited-[]𝑥subscript𝑚irissubscript𝑚vis𝑇[x~{}m_{\text{iris}}~{}m_{\text{vis}}]^{T}[ italic_x italic_m start_POSTSUBSCRIPT iris end_POSTSUBSCRIPT italic_m start_POSTSUBSCRIPT vis end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT is also depicted in Figure 3.
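The mask construction in Eqs. (1) and (2) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the even-odd point-in-polygon test evaluated at pixel centres, and the position at which the averaged caruncle landmark is appended are all assumptions made for the example.

```python
def point_in_polygon(px, py, polygon):
    """Even-odd ray-casting test of a point against a closed polygon."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]  # connect landmarks sequentially, then close
        if (y1 > py) != (y2 > py):     # edge straddles the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def poly_mask(landmarks, height, width):
    """Sketch of Bin_enc(poly(L)): 1 inside the landmark polygon, 0 outside."""
    return [[1 if point_in_polygon(c + 0.5, r + 0.5, landmarks) else 0
             for c in range(width)]
            for r in range(height)]


def merge_caruncle(interior, caruncle):
    """Average the caruncle points into one landmark and append it (assumed order)."""
    cx = sum(p[0] for p in caruncle) / len(caruncle)
    cy = sum(p[1] for p in caruncle) / len(caruncle)
    return list(interior) + [(cx, cy)]
```

With 16 interior landmarks and 6 caruncle points, `merge_caruncle` returns the 17 landmarks of $L_{\text{vis}}$, which `poly_mask` then turns into $m_{\text{vis}}$. The polygon is only well-formed if the appended landmark falls at the right place in the contour ordering, which the text does not specify.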

Figure 3: The mask generation pipeline. The visible eyeball landmarks are marked in green dots and the iris landmarks are marked in red dots. The binary masks are extracted from the corresponding landmarks.

After procuring the synthetic dataset, we use it to train $\mathcal{F}^{\theta}_{\text{AERI}}$ using the mean squared error (MSE) loss:

$$\mathcal{L}_{\text{AERI}} = \frac{1}{P}\sum_{i=1}^{P} \lVert y_{i} - \hat{y}_{i} \rVert^{2}. \tag{3}$$

Here, $\mathcal{L}_{\text{AERI}}$ is the average MSE loss calculated between the predicted mask $\hat{y}_{i}=[\hat{m}_{\text{iris}},~\hat{m}_{\text{vis}}]^{T}$ and the ground truth mask $y_{i}=[m_{\text{iris}},~m_{\text{vis}}]^{T}$ over the $P$ pixels, indexed by $i = 1, \ldots, P$.
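As a concrete reading of Eq. (3), the following sketch computes the loss for masks stored as a flat list of per-pixel pairs. The data layout and the function name are assumptions for illustration. Note that this follows Eq. (3) literally, dividing by the number of pixels $P$; a framework loss that averages over every element (for example PyTorch's `MSELoss` with its default reduction) would also divide by the two channels, giving half this value.

```python
def aeri_mse(pred, target):
    """Eq. (3): mean over pixels of the squared L2 distance between the
    two-channel prediction and the two-channel ground truth.

    pred, target: lists of (m_iris, m_vis) pairs, one pair per pixel.
    """
    total = 0.0
    for y_hat, y in zip(pred, target):
        total += sum((a - b) ** 2 for a, b in zip(y_hat, y))
    return total / len(pred)
```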

As shown in Figure 2 (top), the AERI network consists of an encoder-decoder U-Net architecture. The architecture details presented in Table 2 are adapted from the classical U-Net model and are selected to maximize performance. The encoder contains 5 convolutional blocks (conv blocks) with $2\times2$ maxpool layers in between. The decoder consists of 4 conv blocks with $2\times2$ upsampling layers in between. A final $1\times1$ convolution layer followed by a sigmoid generates the network output. The feature maps from the upsampling layers of the decoder blocks are concatenated with the corresponding feature map outputs from the encoder blocks via skip connections. Each conv block of both the encoder and the decoder consists of two $3\times3$ convolution layers with zero-padding, followed by batch normalization and ReLU activation. The number of feature maps at the initial block of the encoder is 64, which is doubled following every maxpool to a maximum of 1024 feature maps at the last block of the encoder. Conversely, the number of feature maps in the initial decoder block is 1024, which is halved after every upsampling layer, reaching 64 at the final conv block.
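The channel and resolution schedule described above can be traced with a short helper. This is a sketch under stated assumptions: the function name is hypothetical, the max-pool is assumed to use stride 2 with floor rounding, and the decoder list is one reading of the text in which the four decoder blocks output 512, 256, 128, and 64 feature maps after receiving the 1024-channel bottleneck.

```python
def unet_schedule(base=64, n_enc=5, n_dec=4, in_hw=(36, 60)):
    """Return encoder channels, decoder output channels, and encoder map sizes."""
    enc_channels = [base * 2 ** k for k in range(n_enc)]            # 64 ... 1024
    dec_channels = [enc_channels[-1] // 2 ** (k + 1) for k in range(n_dec)]
    h, w = in_hw
    enc_sizes = [(h, w)]
    for _ in range(n_enc - 1):          # one 2x2 max-pool between encoder blocks
        h, w = h // 2, w // 2           # assumed: stride 2, floor rounding
        enc_sizes.append((h, w))
    return enc_channels, dec_channels, enc_sizes
```

Under these assumptions a $36\times60$ input shrinks to $18\times30$, $9\times15$, $4\times7$, and $2\times3$. Because $9\times15$ is odd in both dimensions, upsampling $4\times7$ by a factor of 2 yields $8\times14$, so the skip concatenation at that level would need padding or resizing; the text does not state how this is handled.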

Table 2: Architectural details for the anatomical eye region isolation network, $\mathcal{F}^{\theta}_{\text{AERI}}$.

| Modules | Parameters | Values |
|---|---|---|
| Conv Block(s) | Input shape | $1\times36\times60$ |
| | # Encoder blocks | 5 |
| | # Decoder blocks | 4 |
| | # Layers | 2 |
| | Layer type | Conv2D |
| | Kernel size | $3\times3$ |
| | Padding type | Zero-padding |
| | Activation | ReLU |
| Downsample | # Layers | 4 |
| | Layer type | Maxpool2D |
| | Kernel size | $2\times2$ |
| Upsample | # Layers | 4 |
| | Layer type | Bilinear upsample |
| | Scale factor | 2.0 |
| Output | # Layers | 1 |
| | Layer type | Conv2D |
| | Kernel size | $1\times1$ |
| | Activation | Sigmoid |
| Full Network | Batch size | 32 |
| | Loss function | MSE |
| | Optimizer | Adam |
| | Learning rate | 0.00001 |
| | Learning rate decay | 0.1 |

Multistream Gaze Estimation. After the network described above is trained, it is frozen and its weights are transferred to our final gaze estimation model. To ensure that the domain shift between the synthetic and real domains does not degrade the overall performance, we explore a variety of augmentations that bring the distribution of the synthetic dataset closer to that of the real datasets. These are discussed in detail in Section 3.3.

We use the frozen AERI network, $\mathcal{F}^{\theta^{*}}_{\text{AERI}}$, to generate $m_{\text{iris}}$ and $m_{\text{vis}}$. These two masks, along with $x$, are then used for gaze estimation. The layout of the gaze estimation network, $\mathcal{F}^{\theta}_{\text{GE}}$, is depicted in Figure 2 (bottom) and the architecture is detailed in Table 3. The network consists of three parallel branches for the three different input streams. The first branch processes the input eye image through a $3\times3$ convolution layer followed by two conv blocks with identical structures. We use wide residual blocks (WRBs) [66] for the conv block architectures. Each conv block is a combination of two WRBs, denoted by B(M), where M represents the list of kernel sizes of the convolution layers inside a WRB. The initial WRB is of type B(3,1,3), which contains two $3\times3$ convolution layers and one $1\times1$ convolution layer, while the subsequent WRB is of type B(3,3). The other two branches, which process the eye anatomy mask streams, use a similar architecture. The feature maps obtained from the three branches are then fused channel-wise through concatenation, and subsequently fed to a single conv block (conv block 3) to extract combined representations.

Table 3: Architectural details for the proposed gaze estimation network, $\mathcal{F}^{\theta}_{\text{GE}}$.

| Modules | Parameters | Values |
|---|---|---|
| Conv2D | Input shape | $1\times36\times60$ |
| | Kernel size | $3\times3$ |
| | Padding type | Zero-padding |
| | Output feature size | $16\times36\times60$ |
| Conv Block(s) | Architecture | Wide Residual Network |
| | # WRBs | 2 |
| | Block type | B(3,1,3), B(3,3) |
| | Activation | LeakyReLU |
| | Dropout rate | 0.5 |
| Full Network | Widen factor | 4 |
| | Depth | 16 |
| | Batch size | 32 |
| | Loss function | MSE |
| | Optimizer | Adam |
| | Learning rate | 0.0001 |
| | Learning rate decay | 0.5 |

Gaze Regression. The combined feature embeddings from the multistream backbone network are passed through a global average pooling layer. The final feature embeddings are then flattened and fed to a set of 3 FC layers to regress the output gaze in the form of pitch ($\phi_{p}$) and yaw ($\phi_{y}$) angles. We use dropout with a probability of 0.25 and ReLU activation in between the FC layers, with the exception of the output layer.

The gaze regression is supervised by the following loss:

$$\mathcal{L}_{\text{gaze}} = \frac{1}{n}\sum_{i=1}^{n} \lVert g_{i} - \hat{g}_{i} \rVert^{2}, \tag{4}$$

where $\mathcal{L}_{\text{gaze}}$ is the average MSE loss calculated between the predicted gaze $\hat{g}_{i} = [\hat{\phi}_{p}, \hat{\phi}_{y}]$ and the ground truth gaze $g_{i} = [\phi_{p}, \phi_{y}]$ over the $n$ sample images, indexed by $i = 1, \ldots, n$.

3.3 Domain Randomization

We apply a variety of augmentations during the training of $\mathcal{F}^{\theta}_{\text{AERI}}$ as well as $\mathcal{F}^{\theta}_{\text{GE}}$. Since the first component of the pipeline is trained solely on the synthetic domain, the network may not perform well in the real domain once it is integrated into the final MSGazeNet pipeline. To prevent this, we adopt domain randomization [67], applying visual transformations such as Gaussian noise, blur, and contrast changes to the simulated images so that they better reflect the distribution of real-world images. The purpose of this technique is to introduce a large number of relatively strong variations into the simulated images so that they represent the variations of the real-world datasets. The parameters of all the applied transformations, detailed in Table 4, are chosen via trial and error. During training, these transformations are randomly applied to augment the images. We apply Gaussian noise with a mean ($\mu$) of 0 and a standard deviation ($\sigma$) in the range [0 - 10.0]. We apply a blurring operation with $\sigma$ in the range [0 - 2.0] and a filter size of $3\times3$. We downscale the image by a random factor in the range [0 - 0.5] and then upscale it back to the original image size, essentially quantizing the image at the original resolution. As another augmentation, we add random lines (0 to 2 lines). We also randomly change the contrast of the input images with a minimum intensity value $r_{min}$ = [0 - 100] and a maximum intensity value $r_{max}$ = [155 - 255]. Lastly, we randomly remove regions with a dynamic square kernel of size 1 to 10 pixels in both height (h) and width (w). As we show later in the results section, applying strong augmentations enables both the anatomical eye region isolation network and the gaze estimation network to learn better representations for their respective tasks.
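To make the randomization concrete, the sketch below implements three of the listed transformations (Gaussian noise, contrast change, and cutout) on a grayscale image stored as a list of rows. It is an illustration under assumptions, not the authors' code: the function names, the application probability `p`, the zero fill value for removed regions, and the interpretation of the contrast change as a linear remapping of [0, 255] onto [$r_{min}$, $r_{max}$] are all hypothetical choices. Only the parameter ranges come from Table 4.

```python
import random


def clip(v, lo=0.0, hi=255.0):
    return max(lo, min(hi, v))


def gaussian_noise(img, rng, max_sigma=10.0):
    """Additive zero-mean Gaussian noise with sigma drawn from [0, max_sigma]."""
    sigma = rng.uniform(0.0, max_sigma)
    return [[clip(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def contrast_change(img, rng):
    """Assumed form: linearly remap [0, 255] onto a random [r_min, r_max]."""
    r_min = rng.uniform(0.0, 100.0)
    r_max = rng.uniform(155.0, 255.0)
    return [[r_min + (p / 255.0) * (r_max - r_min) for p in row] for row in img]


def cutout(img, rng, max_size=10):
    """Remove one random square region of side 1..max_size pixels (filled with 0)."""
    h, w = len(img), len(img[0])
    size = rng.randint(1, max_size)
    top = rng.randint(0, max(0, h - size))
    left = rng.randint(0, max(0, w - size))
    out = [row[:] for row in img]
    for r in range(top, min(h, top + size)):
        for c in range(left, min(w, left + size)):
            out[r][c] = 0.0
    return out


def randomize(img, rng, p=0.5):
    """Apply each transformation independently with probability p (assumed)."""
    for transform in (gaussian_noise, contrast_change, cutout):
        if rng.random() < p:
            img = transform(img, rng)
    return img
```

For the AERI network, any geometric change to the image would also have to be applied to the masks; the three transformations sketched here are photometric or occluding, so the ground truth masks can be left unchanged.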

Table 4: Description of transformations.

| Transformations | Parameters |
|---|---|
| Gaussian noise | $\sigma$ = [0 - 10.0], $\mu$ = 0 |
| Gaussian blur | $\sigma$ = [0 - 2.0], filter size = $3\times3$ |
| Cutout | h, w = [1 - 10.0] px |
| Downscale | [0 - 0.5]$\times$ |
| Random lines | 0 to 2 |
| Contrast change | $r_{min}$ = [0 - 100], $r_{max}$ = [155 - 255] |

Table 5: Overview of the real-world datasets used in this paper.

| Name | # Subjects | Size | Resolution | Head pose | Gaze target | Illumination condition | Collection duration |
|---|---|---|---|---|---|---|---|
| Eyediap [14] | 16 | 237 min | HD & VGA | Continuous | Continuous | 2 | 2 days |
| MPIIGaze [15] | 15 | 213,659 | $1280\times720$ | Continuous | Continuous | Daily life | $\sim$3 months |
| UTMultiview [13] | 50 | 64,000 | $1280\times1024$ | 8 + synthesized | 160 | 1 | 1 day |

4 Experiments

In this section, we describe the datasets used in our work, followed by data normalization. Next, the evaluation protocol is described, followed by the description of the metric commonly used for quantifying gaze estimation performance. Lastly, the implementation details of our model are described.

4.1 Datasets

Real Datasets. We evaluate our model's performance on three public gaze datasets, Eyediap [14], MPIIGaze [15], and UTMultiview [13]. An overview of these datasets is shown in Table 5. The Eyediap dataset contains 94 VGA and HD resolution videos captured from 16 subjects, of whom 12 are male and 4 are female. The dataset was collected in two different sessions for each subject. In one session, named the screen target session, small dots appeared on a screen which the users were asked to look at. In the other session, called the 3D floating ball session, a ball was continuously moved in front of the user, which the user was asked to look at and follow. Both of these sessions included static and mobile head pose scenarios. Although the dataset offers large variability in gaze and head pose, it contains similar illumination conditions across all sessions.

MPIIGaze is a benchmark dataset for in-the-wild appearance-based gaze estimation. It was collected from 15 subjects in an unconstrained manner over the course of several months. Among these subjects, 9 are male, 6 are female, and 5 wore glasses. The age of the participants ranges from 21 to 35 years. The dataset consists of 213,659 facial images of subjects during their everyday laptop usage and the corresponding ground truth gaze labels. The labels were collected using custom software that triggered a sequence of 20 dots every 10 minutes, which the users were asked to look at. The dataset contains significant variation in terms of appearance, head movement, gaze target, and illumination.

The UTMultiview dataset consists of 160 discrete gaze targets, in contrast to the two datasets mentioned above, which consist of continuous gaze targets. There are 50 subjects (35 male and 15 female), aged approximately 20 to 40 years. This dataset was collected in a laboratory environment, and the collection procedure used 8 cameras to capture facial images of the participants from multiple viewpoints in a synchronized manner. The gaze labels were collected by instructing the participants to follow a visual target on a monitor.

Synthetic Dataset. In addition to the three datasets above, we use the UnityEyes simulator [23] to procure a synthetic dataset, which consists of 80,000 synthetic eye images with a resolution of $1280\times768$ and their corresponding masks representing the iris and visible eyeball regions of the eye. The images are captured by varying eye appearances, illuminations (light intensity = [0.60 - 1.20]), shapes, head pose ($\phi_{p}$, $\phi_{y}$ = [$\pm20.00^{\circ}$, $\pm40.00^{\circ}$]), and gaze ($\phi_{p}$, $\phi_{y}$ = [$\pm49.49^{\circ}$, $\pm78.28^{\circ}$]) directions, all of which are controllable parameters in the simulator. The dataset contains images from 20 virtual subjects, of whom 5 are female and 15 are male. For training, 60,000 of the synthetic images are used, while we set aside 20,000 for hold-out validation. Finally, we present the distribution plots of both head pose and gaze for all the datasets used in this study in Figure 4. We observe that the distributions of the synthetic dataset are wider than those of the three real-world datasets in terms of both gaze and head pose, making this dataset suitable for training purposes.

Figure 4: The heatmap distribution of the gaze angles (top row) and the head pose angles (bottom row) of all the datasets.

4.2 Data Normalization

We normalize all the eye images and their corresponding gaze labels from the Eyediap and UTMultiview datasets, as well as the images and masks from the synthetic dataset, according to [68]. The MPIIGaze dataset directly provides normalized data. The normalization process requires the 3D head pose and the camera calibration matrix. A virtual camera is used to map the input images to a normalized space where a fixed distance of 600 mm is maintained between the eye centre and the virtual camera. The virtual camera is also rotated to point at the reference point (eye centre) and to cancel head rotation about the roll axis, which avoids ambiguity. Next, a perspective transformation is applied to the images, followed by resizing to $60\times36$ pixels. Finally, the eye images from all four datasets are converted to grayscale and then histogram equalized.
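The final step of this normalization, histogram equalization, can be sketched as follows. The paper does not state which implementation is used (an image library routine would be the typical choice), so this plain-Python version is only an assumed illustration of the standard CDF-based formulation for 8-bit grayscale images.

```python
def equalize_histogram(img, levels=256):
    """Histogram-equalize a grayscale image given as rows of ints in [0, levels)."""
    hist = [0] * levels
    for row in img:
        for p in row:
            hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)   # CDF at the darkest occupied level
    if total == cdf_min:                      # constant image: nothing to spread
        return [row[:] for row in img]
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
           for c in cdf]
    return [[lut[p] for p in row] for row in img]
```

After equalization the darkest occupied intensity maps to 0 and the brightest to 255, which reduces the effect of global illumination differences between datasets.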

4.3 Evaluation Protocol

For gaze estimation, we use the VGA videos from the screen target session of the Eyediap dataset. The eye images are sampled every 15 frames from each recording to create a subset which we later use for training and testing. As there were no screen target session data for two subjects, the evaluation set contains data from 14 subjects rather than 16. The MPIIGaze dataset provides a subset of 45,000 eye images, comprising 1,500 left-eye and 1,500 right-eye images from each of the 15 subjects. The UTMultiview dataset contains both real-world and synthesized images for each participant. We use the real-world part, which contains 64,000 eye images from 50 participants. We use the standard evaluation protocols for all three benchmark datasets to be consistent with previous methods [15, 24, 18, 28]. We follow leave-one-subject-out evaluation for the MPIIGaze dataset, 5-fold validation for the Eyediap dataset, and 3-fold validation for the UTMultiview dataset. Note that there is no subject overlap between the different folds for the Eyediap and UTMultiview datasets.
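The subject-disjoint protocols above share one structure, sketched here. The function names and the round-robin assignment of subjects to folds are assumptions made for illustration; the actual fold membership used in the standard protocols is not given in this section. Leave-one-subject-out is the special case where the number of folds equals the number of subjects.

```python
def subject_folds(subject_ids, k):
    """Split the unique subjects into k disjoint folds (round-robin, assumed)."""
    subjects = sorted(set(subject_ids))
    return [subjects[i::k] for i in range(k)]


def splits(subject_ids, k):
    """Yield (train_idx, test_idx) pairs with no subject shared between them.

    subject_ids: one subject identifier per sample, in dataset order.
    """
    for held_out in subject_folds(subject_ids, k):
        held = set(held_out)
        test = [i for i, s in enumerate(subject_ids) if s in held]
        train = [i for i, s in enumerate(subject_ids) if s not in held]
        yield train, test
```

With 14 Eyediap subjects and `k=5`, this produces folds of 3, 3, 3, 3, and 2 subjects; with 15 MPIIGaze subjects and `k=15`, each fold holds out exactly one subject.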

4.4 Evaluation Metric

Following prior works [15, 24, 28], the gaze angular error is used to measure the gaze estimation performance. The gaze angular error $\delta$ is calculated between the ground truth gaze $g$ and the predicted gaze $\hat{g}$. Before calculating $\delta$, the gaze angle values are converted to a three-dimensional vector $v_{g}$ of Cartesian coordinates by the following:

$$v_{g} = T(g) = [-\cos\phi_{p}\sin\phi_{y},\; -\sin\phi_{p},\; -\cos\phi_{p}\cos\phi_{y}], \tag{5}$$

where $T(\cdot)$ represents the conversion between two coordinate systems. Hence, $\delta$ is calculated by:

$$v_{g} = T(g), \quad v_{\hat{g}} = T(\hat{g}), \tag{6}$$

and

$$\delta = \arccos \frac{v_{g}^{T}\cdot v_{\hat{g}}}{\lVert v_{g}\rVert \cdot \lVert v_{\hat{g}}\rVert}. \tag{7}$$
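Eqs. (5) to (7) translate directly into code. The sketch below assumes angles are given in radians as (pitch, yaw) pairs and returns the error in degrees; the function names and the clamping of the cosine to [-1, 1], added to guard against floating-point rounding, are not from the paper.

```python
import math


def gaze_to_vector(pitch, yaw):
    """Eq. (5): (pitch, yaw) in radians -> 3D gaze direction vector."""
    return (-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))


def angular_error_deg(g, g_hat):
    """Eq. (7): angle between ground truth and predicted gaze, in degrees."""
    v, v_hat = gaze_to_vector(*g), gaze_to_vector(*g_hat)
    dot = sum(a * b for a, b in zip(v, v_hat))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in v_hat))
    cos_sim = max(-1.0, min(1.0, dot / norm))  # guard against rounding drift
    return math.degrees(math.acos(cos_sim))
```

For instance, a prediction that differs from the ground truth only by 10 degrees of yaw at zero pitch yields an angular error of 10 degrees. At non-zero pitch the same yaw offset gives a smaller angular error, because the $\cos\phi_{p}$ factor shortens the horizontal component of the vector.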

4.5 Implementation Details

Both the anatomical eye region isolation and the gaze estimation networks are implemented using the PyTorch library. It should be noted that the setup of the regression block is optimized for each dataset. While three FC layers are used for all three datasets, the number of nodes varies. For the MPIIGaze dataset, we use 512, 256, and 2 nodes for the three layers, respectively. For the Eyediap dataset, we use 256, 128, and 2 nodes. Lastly, for the UTMultiview dataset, we use 1024, 512, and 2 nodes. The AERI network is trained using an Adam optimizer [69] with an initial learning rate of $10^{-5}$ and a batch size of 32. The network is trained for 30 epochs with a step learning rate scheduler with a step size of 5 and a decay factor of 0.1. The gaze estimation network is also trained for 30 epochs using an Adam optimizer with an initial learning rate of $10^{-4}$ and a batch size of 32. At this step, a plateau learning rate scheduler is used with a decay factor of 0.5 and a patience of 3. We train the entire pipeline (both networks) using a single Nvidia 2080 Ti GPU. Training the AERI network took approximately 2 hours. The gaze estimator took about 0.5 hours per fold (out of 5) on the Eyediap dataset, 1.5 hours per fold (out of 15) on the MPIIGaze dataset, and 2.5 hours per fold (out of 3) on the UTMultiview dataset. The inference time of our framework is 6 ms/image on GPU and 20 ms/image on CPU, making our model highly suitable for real-time systems.
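The two learning rate schedules can be approximated in plain Python to see what they imply over 30 epochs. These functions are simplified sketches, not the scheduler classes the authors used: the names are hypothetical, the plateau rule omits details such as improvement thresholds and cooldown, and the quantity monitored by the plateau scheduler (assumed here to be a validation loss) is not stated in the text.

```python
def step_lr(epoch, base_lr=1e-5, step_size=5, gamma=0.1):
    """Step schedule: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def plateau_lr(val_losses, base_lr=1e-4, factor=0.5, patience=3):
    """Simplified plateau schedule: halve the rate once the monitored loss
    has failed to improve for more than `patience` consecutive epochs."""
    lr, best, bad = base_lr, float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad > patience:
                lr, bad = lr * factor, 0
    return lr
```

Under this reading, the AERI learning rate falls from $10^{-5}$ to $10^{-6}$ at epoch 5 and reaches $10^{-10}$ for epochs 25 to 29, so most of the effective optimization happens in the first few steps of the schedule.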

5 Results

In this section, we first present the quantitative results obtained by our model and compare them against existing methods in the area, notably the state-of-the-art. We then analyze the performance of our model in terms of robustness to noise. Next, we present qualitative results and visually demonstrate the performance of our method. This is followed by a series of thorough ablation studies to evaluate the impact of each major component of our solution. Finally, we perform additional experiments on several network variants, feature fusion, and the impact of augmentations in our solution.

5.1 Quantitative Results

We first present the quantitative results of our AERI network. To evaluate the performance of this module, we perform hold-out validation on the synthetic dataset. We use the MSE (Eq. 3) and the mean intersection over union (mIoU) as the evaluation metrics. It can be observed from Table 6 that we obtain low segmentation errors for both the training and validation sets, while the overlap between the ground truth and predicted masks is quite high.

Table 6: Performance of AERI.

| Dataset | MSE ($\downarrow$) | mIoU ($\uparrow$) |
|---|---|---|
| Train set | 0.011 | 0.875 |
| Validation set | 0.098 | 0.892 |

Table 7: Average gaze estimation error $\pm$ standard deviation for leave-one-subject-out evaluation on MPIIGaze dataset. The reported errors are in degrees ($^{\circ}$).

| Method | Input | Feature Extractor | Reg. | Err. |
|---|---|---|---|---|
| Zhang et al. [19] | Img+pose | LeNet | FC | 6.3 |
| Zhang et al. [15] | Img+pose | VGG-16 | FC | 5.5 |
| Park et al. [24] | Img | Hourglass+DenseNet | FC | 4.56 |
| Yu et al. [26] | Img | VGG-16 | FC | 5.35 |
| Wang et al. [27] | Img | Bayesian CNN | FC | 4.3 |
| Ghosh et al. [29] | Img | ResNet-50 | FC | 4.21 $\pm$ 1.90 |
| MSGazeNet (Ours) | Img | U-Net+M.S. w. ResNet | FC | 4.64 $\pm$ 0.73 |

Table 8: Average gaze estimation error $\pm$ standard deviation for 5-fold evaluation on Eyediap dataset. The reported errors are in degrees ($^{\circ}$).

| Method | Input | Feature Extractor | Reg. | Err. |
|---|---|---|---|---|
| Zhang et al. [15] | Img+pose | VGG-16 | FC | 6.3$^{\ast}$ |
| Park et al. [20] | Img | Hourglass | SVR | 11.9 |
| Park et al. [24] | Img | Hourglass+DenseNet | FC | 10.3 |
| Yu et al. [18] | Img+pose | 4 Layers CNN | FC | 6.5 |
| Wang et al. [27] | Img | Bayesian CNN | FC | 9.9 |
| Yu et al. [28] | Img | ResNet | FC | 6.79 |
| Mahmud et al. [31] | Img | U-Net+M.S. VGG-16 | FC | 6.34 |
| MSGazeNet (Ours) | Img | U-Net+M.S. w. ResNet | FC | 5.86 $\pm$ 0.80 |

$^{\ast}$ represents the error reported in [18]

Table 9: Average gaze estimation error $\pm$ standard deviation for 3-fold evaluation on UTMultiview dataset. The reported errors are in degrees ($^{\circ}$).

| Method | Input | Feature Extractor | Reg. | Err. |
|---|---|---|---|---|
| Sugano et al. [13] | Img+pose | Raster-scanner | kNN | 6.5 |
| Zhang et al. [19] | Img+pose | LeNet | FC | 5.9 |
| Zhang et al. [15] | Img+pose | VGG-16 | FC | 6.3$^{\ast}$ |
| Yu et al. [18] | Img+pose | 4 Layers CNN | FC | 5.7 |
| Wang et al. [27] | Img | Bayesian CNN | FC | 5.4 |
| Yu et al. [28] | Img | ResNet | FC | 5.52 |
| MSGazeNet (Ours) | Img | U-Net+M.S. w. ResNet | FC | 5.30 $\pm$ 0.57 |

$^{\ast}$ represents the error reported in [18]

We follow the standard evaluation protocol [15, 24, 18, 28] described in Section 4.3 to quantify the performance of our method and compare it with the existing state-of-the-art methods on different datasets. As our focus is on person-independent gaze estimation, we only compare our results against published prior works that have also adopted this validation scheme. The results are reported in Table 7, Table 8, and Table 9 in terms of angular error measured in degrees for the MPIIGaze, Eyediap, and UTMultiview datasets, respectively. It should be noted that the result for [15] on Eyediap and UTMultiview has been taken from [18], since the original paper [15] did not report performance on these datasets. It can be observed that our proposed model, MSGazeNet, outperforms all the existing state-of-the-art methods on both the Eyediap and UTMultiview datasets while achieving performance competitive with [29] on the MPIIGaze dataset. On the Eyediap dataset, we achieve a performance gain of 7.57% ($6.34^{\circ} \rightarrow 5.86^{\circ}$) over the existing state-of-the-art [31]. We also improve on the current state-of-the-art [27] on the UTMultiview dataset by 1.85% ($5.4^{\circ} \rightarrow 5.3^{\circ}$). Our results support the value of using anatomical eye region isolation along with raw eye images as a multistream input for gaze estimation. Although it is not very common practice in the literature, we also calculate and report the standard deviations of the performance of our method on each dataset. A considerably lower standard deviation compared to [29] on MPIIGaze (less than half) indicates the stability and certainty of our model.
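The reported gains are relative reductions in angular error, which can be checked with a one-line helper (the function name is illustrative only):

```python
def relative_gain(previous_error, new_error):
    """Relative reduction in error, in percent, with respect to the previous best."""
    return 100.0 * (previous_error - new_error) / previous_error
```

Applied to the numbers in Tables 8 and 9, `relative_gain(6.34, 5.86)` rounds to 7.57 and `relative_gain(5.4, 5.3)` rounds to 1.85, matching the figures quoted in the text.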

5.2 Robustness to Noise

In addition to comparing the results of our proposed method with existing state-of-the-art solutions, we also perform a robustness analysis to investigate the performance of our network in the presence of noisy data. In this analysis, following [70], we first estimate the inherent noise in all the images from the previously mentioned real-world datasets (Eyediap, MPIIGaze, and UTMultiview) which we used for gaze estimation. The method estimates the variance of additive zero mean Gaussian noise in a given image, I𝐼Iitalic_I by convolving a noise operator, N𝑁Nitalic_N on the image. Following is the noise operator N𝑁Nitalic_N, which is a mask of 3×\times×3 dimension as follows:

N=121242121𝑁missing-subexpressionmissing-subexpressionmissing-subexpression121missing-subexpressionmissing-subexpressionmissing-subexpression242missing-subexpressionmissing-subexpressionmissing-subexpression121N=\begin{array}[]{|c|c|c|}\hline\cr 1&-2&1\\ \hline\cr-2&4&-2\\ \hline\cr 1&-2&1\\ \hline\cr\end{array}italic_N = start_ARRAY start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 4 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL end_ROW end_ARRAY (8)

N𝑁Nitalic_N has zero mean and a variance of 36σn236superscriptsubscript𝜎𝑛236\sigma_{n}^{2}36 italic_σ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. The noise variance, σn2superscriptsubscript𝜎𝑛2\sigma_{n}^{2}italic_σ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT of I𝐼Iitalic_I after applying N𝑁Nitalic_N can be computed by the following:

σn2=136(W2)(H2)I(I(x,y)N)2,superscriptsubscript𝜎𝑛2136𝑊2𝐻2subscript𝐼superscript𝐼𝑥𝑦𝑁2\sigma_{n}^{2}=\frac{1}{36(W-2)(H-2)}\ \sum_{I}\ (I(x,y)\ast N)^{2},italic_σ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG 36 ( italic_W - 2 ) ( italic_H - 2 ) end_ARG ∑ start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT ( italic_I ( italic_x , italic_y ) ∗ italic_N ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (9)

where $W$ and $H$ are the width and height of the given image $I$, and $\ast$ denotes the convolution operation at position $(x,y)$ of $I$. Next, we evaluate the performance of our network against the estimated noise variances of the input images and compare it with prior works [19], [15], and [24], for which public implementations are available (at the time of performing this study, the code for the other techniques had not been made public). Our findings are plotted in Figure 5, where we observe that as the amount of noise in the data increases, our model deteriorates less than prior works, indicating better robustness to noise. However, at significantly higher levels of noise, such as a noise variance exceeding 10 for the MPIIGaze and UTMultiview datasets and 3.5 for the Eyediap dataset, the performance of all the tested solutions, including ours, deteriorates.
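For illustration, Eqs. (8) and (9) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the code used to produce the reported results: the function name `estimate_noise_variance` is hypothetical, the image is assumed to be a single-channel 2-D array, and the convolution is assumed to be evaluated only where the mask fits entirely inside the image, which yields the $(W-2)(H-2)$ responses that appear in the normalization of Eq. (9).

```python
import numpy as np

# Noise operator N from Eq. (8).
N = np.array([[1.0, -2.0, 1.0],
              [-2.0, 4.0, -2.0],
              [1.0, -2.0, 1.0]])


def estimate_noise_variance(image):
    """Estimate sigma_n^2 of a 2-D grayscale image following Eq. (9)."""
    img = np.asarray(image, dtype=np.float64)
    if img.ndim != 2 or min(img.shape) < 3:
        raise ValueError("expected a 2-D image of at least 3x3 pixels")
    H, W = img.shape
    # "Valid" convolution: accumulate shifted copies of the image weighted by
    # the mask entries. N is symmetric, so flipping the kernel is unnecessary.
    response = np.zeros((H - 2, W - 2))
    for dy in range(3):
        for dx in range(3):
            response += N[dy, dx] * img[dy:dy + H - 2, dx:dx + W - 2]
    return float(np.sum(response ** 2) / (36.0 * (W - 2) * (H - 2)))


# Synthetic check (not data from the paper): a flat image plus Gaussian noise
# with sigma_n = 5, so the estimate is expected to be near sigma_n^2 = 25.
rng = np.random.default_rng(0)
noisy = 128.0 + rng.normal(0.0, 5.0, size=(256, 256))
print(estimate_noise_variance(noisy))
```

Because $N$ is the outer product of two second-difference filters, it suppresses constant and linearly varying image content, so the response is dominated by noise in smooth regions.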

Figure 5: The outcome of the robustness analysis with respect to noise. Here we show the performance comparison of our proposed framework against Zhang et al. [19], Zhang et al. [15] and Park et al. [24] in the presence of different amounts of noise.
Figure 6: The heatmap distribution of the predicted gaze angles versus the ground truth gaze angles for pitch (top row) and yaw (bottom row).
Figure 7: The gaze predictions and eye region segmentation masks for samples from the three datasets. The predicted gaze angles are marked with green arrows and the ground truths are marked with red arrows.

5.3 Qualitative Results

To carry out a qualitative analysis of the performance of MSGazeNet across different datasets, following [21], we show a heatmap distribution of the predictions versus the ground truth gaze labels in Figure 6. The top row illustrates the relationship between predictions and ground truths for the pitch angles ($\phi_p$), while the bottom row shows the same relationship for the yaw angles ($\phi_y$). The distributions for all three datasets exhibit an almost linear relationship between the predicted and the ground truth gaze angles, which further verifies the performance of our proposed framework.

Next, we illustrate the output of our anatomical eye region isolation network and visualize the gaze angle predictions versus the ground truth gaze angles for the three benchmark datasets in Figure 7. The samples in each row are drawn from different participants within a dataset. The leftmost images of each row are comparatively less noisy with little to no occlusion, while the rightmost images are noisier or have more occlusions. From the figure, we can see that even in the presence of extreme noise, poor resolution, or partial occlusions, the eye region masks can still capture the eye anatomy and eventually aid in estimating gaze. As expected, as the amount of noise and artifacts increases, the performance of our method degrades, yet as shown in Figure 5, our model is relatively robust against such factors.

5.4 Ablation Studies

We conduct thorough ablation experiments to evaluate the design choices and various components of our proposed neural architecture. Below, we describe these experiments and their results.

Impact of Each Stream. In our work, we hypothesize that using information from isolated anatomical eye regions in the model aids gaze estimation. Here, we aim to explore the performance of our network using only one input stream at a time and subsequently adding the other streams for gaze estimation. To this end, we systematically remove each branch of MSGazeNet and evaluate the performance. First, we remove the two branches of the network responsible for learning representations from $m_{vis}$ and $m_{iris}$. The results presented in Table 10 show that by removing the branches related to the anatomical eye regions, our model suffers performance degradation compared to the full model. Next, we remove the branch of the network that learns representations from the full eye image, and present the results in Table 10, where we again observe a performance drop. Subsequently, we remove only the visible eyeball stream, and then only the iris stream, to compare the gaze estimation performance using different combinations of input data. Both ablations similarly result in performance drops, indicating that the full MSGazeNet is better capable of learning representations that are useful for gaze estimation.

Table 10: Ablation experiments on our proposed model.

| Input | MPIIGaze error (°) | Eyediap error (°) | UTMultiview error (°) |
|---|---|---|---|
| MSGazeNet (full network) | 4.64±0.73 | 5.86±0.80 | 5.30±0.57 |
| Eye regions removed | 4.95±0.69 | 6.19±0.78 | 5.51±0.72 |
| Eye image removed | 5.27±0.66 | 6.88±0.64 | 7.36±0.76 |
| Visible eyeball region removed | 4.83±0.65 | 6.27±0.78 | 6.97±0.79 |
| Iris region removed | 4.89±0.70 | 6.37±0.92 | 7.27±0.52 |
| Conv block 2 removed | 4.91±0.82 | 6.58±0.80 | 6.75±0.65 |
| Conv blocks 2 and 3 removed | 5.65±0.50 | 9.70±0.68 | 9.22±0.78 |
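
To make the size of each degradation in Table 10 easier to compare across datasets, the mean errors can be converted into relative increases over the full network. The short sketch below only re-expresses the mean values already listed in Table 10 (standard deviations are ignored); the helper name `relative_increase` is hypothetical.

```python
# Mean angular errors in degrees, copied from Table 10.
full = {"MPIIGaze": 4.64, "Eyediap": 5.86, "UTMultiview": 5.30}
ablations = {
    "Eye regions removed": {"MPIIGaze": 4.95, "Eyediap": 6.19, "UTMultiview": 5.51},
    "Eye image removed": {"MPIIGaze": 5.27, "Eyediap": 6.88, "UTMultiview": 7.36},
    "Visible eyeball region removed": {"MPIIGaze": 4.83, "Eyediap": 6.27, "UTMultiview": 6.97},
    "Iris region removed": {"MPIIGaze": 4.89, "Eyediap": 6.37, "UTMultiview": 7.27},
    "Conv block 2 removed": {"MPIIGaze": 4.91, "Eyediap": 6.58, "UTMultiview": 6.75},
    "Conv blocks 2 and 3 removed": {"MPIIGaze": 5.65, "Eyediap": 9.70, "UTMultiview": 9.22},
}


def relative_increase(ablated, reference):
    """Relative error increase of an ablated model over the reference, in percent."""
    return 100.0 * (ablated - reference) / reference


increase = {
    name: {ds: relative_increase(err, full[ds]) for ds, err in errs.items()}
    for name, errs in ablations.items()
}

for name, per_dataset in increase.items():
    print(name, {ds: f"{value:.1f}%" for ds, value in per_dataset.items()})
```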

Impact of AERI Network. The anatomical eye region isolation (AERI) network plays a vital role in our proposed framework. To evaluate the significance of this component, we conduct an experiment in which the AERI network is substituted with an equivalent counterpart. We utilize an off-the-shelf facial landmark extractor, OpenFace 2.0 [71], to extract eye landmarks and subsequently create the eye region masks from the extracted landmarks. These masks and the corresponding eye images are used to perform gaze estimation, enabling a comparative performance analysis. Due to the unavailability of face images, which are required for facial landmark extraction via OpenFace 2.0, we could not use the UTMultiview dataset for this experiment. Our findings, presented in Table 11, reveal that gaze estimation using the masks provided by the AERI network outperforms gaze estimation using the masks derived from the OpenFace 2.0 toolkit. While both models predict outputs for real-world images, OpenFace 2.0 estimates high-dimensional landmark coordinates, and its predictions evidently contain more noise. In contrast, we simplify this process by predicting low-dimensional binary masks for the eye regions, and we experimentally show that the predicted masks contribute more to accurate gaze estimation. This experiment illustrates that the AERI network can accurately capture crucial eye regions from real-world eye images to ensure more robust gaze estimation.
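
One simple way to turn a closed contour of eye landmarks into a binary region mask is to rasterize the polygon spanned by the landmarks. The sketch below illustrates this with even-odd ray casting on pixel centers in NumPy. It is an assumption about how such a conversion can be done, not OpenFace 2.0 code and not necessarily the exact procedure used in this experiment; the function name `polygon_to_mask`, the landmark coordinates, and the 36×60 image size are all hypothetical.

```python
import numpy as np


def polygon_to_mask(vertices, height, width):
    """Rasterize a closed polygon, given as (x, y) vertices, into a binary mask.

    A pixel is set to 1 if its center lies inside the polygon (even-odd rule).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinates of the pixel centers
    py = ys + 0.5  # y coordinates of the pixel centers
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        if y0 == y1:
            continue  # horizontal edges never cross a horizontal ray
        # Does the edge straddle the horizontal line through the pixel center?
        crosses = (y0 > py) != (y1 > py)
        # x coordinate at which the edge meets that horizontal line.
        x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        # Toggle membership for every crossing to the right of the pixel center.
        inside ^= crosses & (px < x_cross)
    return inside.astype(np.uint8)


# Hypothetical eyelid contour on a 36x60 eye patch (illustrative values only).
eyelid = [(5, 18), (20, 8), (40, 8), (55, 18), (40, 28), (20, 28)]
mask = polygon_to_mask(eyelid, height=36, width=60)
print(mask.shape, int(mask.sum()))
```

With this approach, any error in the predicted landmark positions shifts the polygon boundary directly, which is one way landmark noise can propagate into the masks.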

Table 11: Performance comparison between the proposed network and the variant with OpenFace 2.0 on two datasets.

| Method | MPIIGaze error (°) | Eyediap error (°) |
|---|---|---|
| MSGazeNet (w/ AERI) | 4.64±0.73 | 5.86±0.80 |
| MSGazeNet (w/ OpenFace 2.0) | 4.98±0.71 | 6.31±0.72 |

Impact of Network Depth. Here, we investigate the performance of gaze estimation by varying the depth of our network. To this end, we systematically remove the convolutional blocks from all three branches and study the variation in performance. First, we remove conv block 2 from each branch of our proposed network. Next, we further reduce the depth by removing conv block 3 from the remaining network. The results presented in Table 10 reveal that the removal of the two conv blocks considerably deteriorates the performance.

5.5 Network Variants

Here, we create a number of variants of our model in order to further validate our network design choices. A key element of our proposed neural network for gaze estimation is the use of separate conv blocks to encode each individual input and extract optimal input-specific representations. Therefore, we create three different variants of our network to explore the various possible encoding strategies for the three networks. These variants, which we describe below along with their performance, have all been trained and evaluated following the same setup and protocol as our proposed model.

First, we construct network Variant 1, in which a single series of conv blocks is used to encode all three inputs (eye image and the two anatomical eye regions). The results presented in Table 12 show that for all three datasets, this strategy gives results that are inferior to our initial design, and in fact below state-of-the-art methods such as [24, 27, 29] for the MPIIGaze dataset, and [15, 18, 27, 28] for the UTMultiview dataset.

Next, we create Variant 2, in which the two anatomical eye region inputs are encoded together, while the full eye image is encoded separately via a set of separate conv blocks. Similar to Variant 1, the results in Table 12 indicate that deviating from our initial design choice of encoding each input separately degrades the performance, and places the results below [24, 27, 29] for the MPIIGaze dataset, and [27, 28] for the UTMultiview dataset.

Lastly, we explore the notion of extracting representations from the three branches with encoders that share weights. This strategy would considerably reduce the number of trainable parameters in the overall network. To this end, we create Variant 3, which encodes each input stream separately with a series of conv blocks that share weights. The results presented in Table 12 show that this variant, similar to the other variants, degrades the performance, and falls below [24, 27, 29] for the MPIIGaze dataset, and [27] for the UTMultiview dataset.
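
The four encoding strategies differ only in how inputs are routed to encoders and in whether encoder weights are reused. The toy NumPy sketch below makes this routing and its effect on the parameter count concrete. It is only a schematic under strong assumptions: each "encoder" is a hypothetical stack of two bias-free 1×1 convolutions with ReLU, standing in for the actual conv blocks, and the 36×60 input size and width of 16 channels are illustrative values, not the configuration of MSGazeNet.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_encoder(in_ch, width=16, depth=2):
    """Toy encoder: a list of 1x1-convolution weight matrices (no biases)."""
    chans = [in_ch] + [width] * depth
    return [rng.standard_normal((o, i)) * 0.1 for i, o in zip(chans[:-1], chans[1:])]


def encode(encoder, x):
    """Apply each 1x1 convolution followed by ReLU; x has shape (C, H, W)."""
    for w in encoder:
        x = np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)
    return x


def n_params(*encoders):
    """Total number of weights over the given encoders."""
    return sum(w.size for enc in encoders for w in enc)


# Random stand-ins for the eye image and the two eye region masks.
eye, m_vis, m_iris = (rng.random((1, 36, 60)) for _ in range(3))

# Proposed design: one independent encoder per input.
enc_eye, enc_vis, enc_iris = (make_encoder(1) for _ in range(3))
feats_proposed = [encode(enc_eye, eye), encode(enc_vis, m_vis), encode(enc_iris, m_iris)]

# Variant 1: all three inputs stacked in the channel dimension, one encoder.
enc_all = make_encoder(3)
feats_v1 = encode(enc_all, np.concatenate([eye, m_vis, m_iris], axis=0))

# Variant 2: the two masks are encoded together, the eye image separately.
enc_masks, enc_img = make_encoder(2), make_encoder(1)
feats_v2 = [encode(enc_img, eye), encode(enc_masks, np.concatenate([m_vis, m_iris], axis=0))]

# Variant 3: three streams, but a single set of weights is reused for all.
enc_shared = make_encoder(1)
feats_v3 = [encode(enc_shared, x) for x in (eye, m_vis, m_iris)]

params = {
    "proposed": n_params(enc_eye, enc_vis, enc_iris),
    "variant 1": n_params(enc_all),
    "variant 2": n_params(enc_masks, enc_img),
    "variant 3": n_params(enc_shared),
}
print(params)
```

In this toy setting, sharing weights (Variant 3) uses one third of the encoder parameters of the proposed design, which mirrors the parameter reduction discussed above.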

Table 12: Performance comparison between the proposed network and the variants on all three datasets.

| Method | MPIIGaze error (°) | Eyediap error (°) | UTMultiview error (°) |
|---|---|---|---|
| MSGazeNet | 4.64±0.73 | 5.86±0.80 | 5.30±0.57 |
| Var. 1 (1 encoding br.) | 4.85±0.76 | 5.97±0.85 | 5.94±0.89 |
| Var. 2 (2 encoding br.) | 4.81±0.77 | 5.93±0.85 | 5.57±0.73 |
| Var. 3 (3 encoding br., shared) | 4.83±0.66 | 5.97±0.91 | 5.45±0.57 |

Table 13: Performance comparison between the proposed network and variants with different feature fusion modules on all three datasets.

| Method | MPIIGaze error (°) | Eyediap error (°) | UTMultiview error (°) |
|---|---|---|---|
| MSGazeNet | 4.64±0.73 | 5.86±0.80 | 5.30±0.57 |
| Late feature fusion | 4.87±0.82 | 6.12±0.79 | 6.08±1.02 |
| Early feature fusion | 4.81±0.69 | 6.02±0.95 | 5.52±0.64 |

5.6 Impact of Feature Fusion

Given that our network uses separate conv blocks to extract features from each input, a feature fusion stage in our proposed gaze estimation architecture concatenates the individual feature maps in the channel dimension. Next, the fused features are fed to another conv block, which is used to extract combined features. In this experiment, we create two baseline networks by modifying the position of the feature fusion stage to create late and early feature fusion variants. To implement late feature fusion, we place the feature fusion stage after conv block 3 instead of at its position in the original design. Conversely, for early feature fusion, we apply feature fusion prior to conv block 2. The results are presented in Table 13, where we observe that for all the datasets, our proposed network obtains the best performance compared to the network variants. Specifically, we see a performance drop in the range of 2.66% to a maximum of $\sim$4% in terms of mean angular error across different datasets using the different fusion strategies.
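
Channel-wise fusion itself is a simple operation. As a minimal sketch, assuming per-stream feature maps of shape (channels, height, width) with hypothetical sizes, the fusion stage can be written with NumPy as follows; the helper name `fuse` and the constant-valued feature maps are illustrative only.

```python
import numpy as np


def fuse(feature_maps):
    """Concatenate per-stream feature maps of shape (C, H, W) along the channel axis."""
    return np.concatenate(feature_maps, axis=0)


# Hypothetical per-stream features (32 channels each on a 9x15 grid); constant
# values are used only so that each stream is easy to identify after fusion.
f_eye = np.full((32, 9, 15), 1.0)
f_vis = np.full((32, 9, 15), 2.0)
f_iris = np.full((32, 9, 15), 3.0)

fused = fuse([f_eye, f_vis, f_iris])
print(fused.shape)  # (96, 9, 15): channels add up, spatial size is unchanged
```

Concatenation preserves the spatial layout of every stream, so the conv block that follows the fusion stage can combine eye image and eye region features at the same spatial location. The early and late fusion variants apply this same operation at a different depth of the network.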

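Table 14: Effect of domain randomization with different augmentation types on gaze estimation performance across all three datasets.

| Noise | Blur | Cutout | Scale | Lines | Contrast | MPIIGaze | Eyediap | UTM |
|---|---|---|---|---|---|---|---|---|
| | | | | | | 4.85 | 6.14 | 5.53 |
| | | | | | | 4.73 | 5.98 | 5.42 |
| | | | | | | 4.77 | 6.14 | 5.39 |
| | | | | | | 4.74 | 6.11 | 5.40 |
| | | | | | | 4.76 | 6.02 | 5.42 |
| | | | | | | 4.74 | 5.94 | 5.38 |
| | | | | | | 4.73 | 5.97 | 5.45 |
| | | | | | | 4.64 | 5.86 | 5.30 |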

5.7 Impact of Domain Randomization

Domain randomization is an essential step in our solution for transferring the anatomical eye region isolation network, which was trained with synthetic and simulated data, into the real domain. This synthetic data does not contain the various types of noise and artifacts that would normally be encountered in real datasets, in this case MPIIGaze, Eyediap, and UTMultiview. Hence, to ensure that a significant domain mismatch does not occur, we apply different strong augmentations when training the eye region isolation network. Here, we study the influence of the applied augmentations. To this end, we first train the network without any augmentations, and then add one augmentation at a time. In the end, all the augmentations are applied. Throughout these individual steps, the size of the training data remains the same as the original 60,000 images. The trained network is then integrated into the MSGazeNet framework. It can be observed from the results shown in Table 14 that when no augmentations are used, the gaze estimation performance drops for all the datasets. Had this performance drop not been prevented, it would have resulted in lower performance in comparison to [24, 27, 29] for the MPIIGaze dataset, and [27, 28] for the UTMultiview dataset. Our experiments further show that adding ‘random lines’ seems to be the most effective augmentation in the context of our study. Nonetheless, the combination of all the augmentations yields the best results, pushing the performance of our model past the state-of-the-art on Eyediap and UTMultiview.
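
To give a concrete picture of artifact-like augmentations, the sketch below implements four of the augmentation types named in Table 14 (noise, cutout, lines, and contrast) on a grayscale image with values in [0, 1]. It is a minimal illustration under assumptions, not the augmentation pipeline used for training: all function names, parameter values (noise level, patch size, contrast range, application probability), and the 36×60 image size are hypothetical, blur and scale are omitted for brevity, and only photometric changes to the image are shown.

```python
import numpy as np


def add_noise(img, rng, sigma=0.05):
    """Additive zero-mean Gaussian noise, clipped back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


def cutout(img, rng, size=8):
    """Zero out one randomly placed square patch."""
    out = img.copy()
    h, w = img.shape
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    out[y:y + size, x:x + size] = 0.0
    return out


def random_line(img, rng, value=1.0):
    """Draw a one-pixel-wide line from the left edge to the right edge."""
    out = img.copy()
    h, w = img.shape
    y0, y1 = rng.integers(0, h, size=2)
    xs = np.arange(w)
    ys = np.round(np.linspace(y0, y1, w)).astype(int)
    out[ys, xs] = value
    return out


def adjust_contrast(img, rng, low=0.5, high=1.5):
    """Scale deviations from the mean intensity by a random factor."""
    factor = rng.uniform(low, high)
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)


def randomize(img, rng, p=0.5):
    """Apply each augmentation independently with probability p."""
    for augment in (add_noise, cutout, random_line, adjust_contrast):
        if rng.random() < p:
            img = augment(img, rng)
    return img


rng = np.random.default_rng(0)
synthetic_eye = rng.random((36, 60))  # random stand-in for a rendered eye image
augmented = randomize(synthetic_eye, rng)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```

Because the augmentations are sampled anew for every image, the number of training images stays fixed while the variety of appearances seen by the network grows.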

6 Conclusion and Future Work

In this work, we present MSGazeNet, a novel gaze estimator. Our solution performs person-independent gaze estimation and consists of two integral parts, namely anatomical eye region isolation and multistream gaze estimation. The anatomical eye region isolation is a crucial component of our framework and is trained solely with synthetic data, due to the scarcity of detailed and accurate eye region annotations in real-world gaze estimation datasets. To this end, we procure a synthetic dataset using the UnityEyes eye-gaze simulator. Our dataset consists of 80,000 images along with the visible eyeball region and iris masks. This dataset is used to train a U-Net style model to isolate eye regions given an input eye image. To allow this network to then be integrated into a downstream model for real-world (not synthetic) gaze estimation, we perform domain randomization using a variety of artifact-like augmentations, which helps to narrow the domain gap. The eye region isolation network is then transferred into our gaze estimation pipeline, which consists of a multistream architecture. The network takes raw real-world eye images along with the eye region masks to predict the gaze direction. We perform various experiments and demonstrate that our solution achieves strong results, reaching the state-of-the-art on the Eyediap and UTMultiview datasets and exhibiting competitive performance on the MPIIGaze dataset. Our robustness experiments show that our model is more robust to noise than other methods. Detailed and thorough ablation studies and comparisons with several variants quantify the impact of each component in our network and validate our design choices.

In this study, we demonstrated the relevance and importance of using information pertaining to anatomical eye regions towards gaze estimation. This work can serve to motivate further research into using various regions of the eye for gaze estimation. We believe such approaches can lead to learning better and more generalized gaze representations. In addition to the above, a key area to investigate could be the implementation of semi-supervised learning to take advantage of large amounts of unlabeled real-world eye images while training the eye region isolation network. This could improve the eye region isolation module and further enhance the overall performance. Another important scope of research could be the consideration of using 3D representations of key eye regions (i.e. depth maps) which can be constructed from 3D landmarks. It is likely that the 3D representations would contain richer anatomical information about the eye in terms of eyeball curvature or iris contour, which would better aid gaze estimation.

Acknowledgment. The authors would like to thank Innovation for Defence Excellence and Security (IDEaS) program for funding this project. The authors would also like to thank Dr. Dirk Rodenburg for his help throughout the project.

References

  • [1] L. R. Young and D. Sheena, “Survey of eye movement recording methods,” Behavior Research Methods & Instrumentation, vol. 7, no. 5, pp. 397–429, 1975.
  • [2] J. Zagermann, U. Pfeil, and H. Reiterer, “Measuring cognitive load using eye tracking technology in visual computing,” in Proceedings of the 6th Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization, 2016, pp. 78–85.
  • [3] Y. Yamada and M. Kobayashi, “Detecting mental fatigue from eye-tracking data gathered while watching video: Evaluation in younger and older adults,” Artificial Intelligence in Medicine, vol. 91, pp. 39–48, 2018.
  • [4] C. Jyotsna and J. Amudha, “Eye gaze as an indicator for stress level analysis in students,” International Conference on Advances in Computing, Communications and Informatics, pp. 1588–1593, 2018.
  • [5] P. Majaranta and A. Bulling, “Eye tracking and eye-based human–computer interaction,” Advances in Physiological Computing, pp. 39–65, 2014.
  • [6] S. Andrist, X. Z. Tan, M. Gleicher, and B. Mutlu, “Conversational gaze aversion for humanlike robots,” Proceedings of the 9th ACM/IEEE International Conference on Human-Robot Interaction, pp. 25–32, 2014.
  • [7] H. Liu and I. Heynderickx, “Visual attention in objective image quality assessment: Based on eye-tracking data,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 7, pp. 971–982, 2011.
  • [8] H. M. Park, S. H. Lee, and J. S. Choi, “Wearable augmented reality system using gaze interaction,” in 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, 2008, pp. 175–176.
  • [9] V. Clay, P. König, and S. Koenig, “Eye tracking in virtual reality,” Journal of Eye Movement Research, vol. 12, no. 1, 2019.
  • [10] T. Louw and N. Merat, “Are you in the loop? using gaze dispersion to understand driver visual attention during vehicle automation,” Transportation Research Part C: Emerging Technologies, vol. 76, pp. 35–50, 2017.
  • [11] S. Baluja and D. Pomerleau, “Non-intrusive gaze tracking using artificial neural networks,” Advances in Neural Information Processing Systems, vol. 6, 1993.
  • [12] K.-H. Tan, D. J. Kriegman, and N. Ahuja, “Appearance-based eye gaze estimation,” in 6th IEEE Workshop on Applications of Computer Vision, 2002, pp. 191–195.
  • [13] Y. Sugano, Y. Matsushita, and Y. Sato, “Learning-by-synthesis for appearance-based 3d gaze estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1821–1828.
  • [14] K. A. Funes Mora, F. Monay, and J.-M. Odobez, “Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras,” in Proceedings of the Symposium on Eye Tracking Research and Applications, 2014, pp. 255–258.
  • [15] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, “Mpiigaze: Real-world dataset and deep appearance-based gaze estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 162–175, 2017.
  • [16] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
  • [17] K. Lee, H. Kim, and C. Suh, “Simulated+unsupervised learning with adaptive data generation and bidirectional mappings,” in International Conference on Learning Representations, 2018.
  • [18] Y. Yu, G. Liu, and J.-M. Odobez, “Deep multitask gaze estimation with a constrained landmark-gaze model,” in Proceedings of the European Conference on Computer Vision Workshops, 2018.
  • [19] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, “Appearance-based gaze estimation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
  • [20] S. Park, X. Zhang, A. Bulling, and O. Hilliges, “Learning to find eye region landmarks for remote gaze estimation in unconstrained settings,” in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2018, pp. 1–10.
  • [21] N. Sinha, M. Balazia, and F. Bremond, “Flame: Facial landmark heatmap activated multimodal gaze estimation,” in 17th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2021, pp. 1–8.
  • [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [23] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling, “Learning an appearance-based gaze estimator from one million synthesised images,” in Proceedings of the 9th Biennial ACM Symposium on Eye Tracking Research & Applications, 2016, pp. 131–138.
  • [24] S. Park, A. Spurr, and O. Hilliges, “Deep pictorial gaze estimation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 721–738.
  • [25] B. A. Smith, Q. Yin, S. K. Feiner, and S. K. Nayar, “Gaze locking: passive eye contact detection for human-object interaction,” in Proceedings of the 26th annual ACM symposium on User interface software and technology, 2013, pp. 271–280.
  • [26] Y. Yu, G. Liu, and J.-M. Odobez, “Improving few-shot user-specific gaze adaptation via gaze redirection synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 937–11 946.
  • [27] K. Wang, R. Zhao, H. Su, and Q. Ji, “Generalizing eye tracking with bayesian adversarial learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 907–11 916.
  • [28] Y. Yu and J.-M. Odobez, “Unsupervised representation learning for gaze estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7314–7324.
  • [29] S. Ghosh, M. Hayat, A. Dhall, and J. Knibbe, “Mtgls: Multi-task gaze estimation with limited supervision,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3223–3234.
  • [30] P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba, “Gaze360: Physically unconstrained gaze estimation in the wild,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6912–6921.
  • [31] Z. Mahmud, P. Hungler, and A. Etemad, “Gaze estimation with eye region segmentation and self-supervised multistream learning,” AAAI Workshop on Human-Centric Self-Supervised Learning, 2022.
  • [32] X. Cai, J. Zeng, S. Shan, and X. Chen, “Source-free adaptive gaze estimation by uncertainty reduction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 035–22 045.
  • [33] X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges, “Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation,” in Proceedings of the 16th European Conference on Computer Vision.   Springer, 2020, pp. 365–381.
  • [34] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba, “Eye tracking for everyone,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2176–2184.
  • [35] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1905–1914.
  • [36] S. Jin, J. Dai, and T. Nguyen, “Kappa angle regression with ocular counter-rolling awareness for gaze estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2658–2667.
  • [37] S. Jindal and R. Manduchi, “Contrastive representation learning for gaze estimation,” in Proceedings of The 1st Gaze Meets ML Workshop.   PMLR, 2023, pp. 37–49.
  • [38] S. Park, E. Aksan, X. Zhang, and O. Hilliges, “Towards end-to-end video-based eye-tracking,” in Proceedings of the 16th European Conference on Computer Vision.   Springer, 2020, pp. 747–763.
  • [39] F. Martinez, A. Carbone, and E. Pissaloux, “Gaze estimation using local features and non-linear regression,” in 19th IEEE International Conference on Image Processing, 2012, pp. 1961–1964.
  • [40] M. X. Huang, T. C. Kwok, G. Ngai, H. V. Leong, and S. C. Chan, “Building a self-learning eye gaze model from user interaction data,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 1017–1020.
  • [41] X. Xiong, Z. Liu, Q. Cai, and Z. Zhang, “Eye gaze tracking using an rgbd camera: A comparison with a rgb solution,” in Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 2014, pp. 1113–1121.
  • [42] E. Wood and A. Bulling, “Eyetab: Model-based gaze estimation on unmodified tablet computers,” in Proceedings of the Symposium on Eye Tracking Research and Applications, 2014, pp. 207–210.
  • [43] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling, “A 3d morphable eye region model for gaze estimation,” in European Conference on Computer Vision.   Springer, 2016, pp. 297–313.
  • [44] K. Wang and Q. Ji, “Real time eye gaze tracking with 3d deformable eye-face model,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1003–1011.
  • [45] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision.   Springer, 2016, pp. 483–499.
  • [46] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
  • [47] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
  • [48] E. Wood, T. Baltrušaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling, “Rendering of eyes for eye-shape registration and gaze estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3756–3764.
  • [49] Y. Cheng, Y. Bao, and F. Lu, “Puregaze: Purifying gaze feature for generalizable gaze estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • [50] G. Liu, Y. Yu, K. A. F. Mora, and J.-M. Odobez, “A differential approach for gaze estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 1092–1099, 2019.
  • [51] S. Park, S. D. Mello, P. Molchanov, U. Iqbal, O. Hilliges, and J. Kautz, “Few-shot adaptive gaze estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9368–9377.
  • [52] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision.   Springer, 2016, pp. 87–102.
  • [53] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 13th IEEE International Conference on Automatic Face and Gesture Recognition, 2018, pp. 67–74.
  • [54] “Casia iris image database,” https://hycasia.github.io/dataset/casia-irisv4/, 2004.
  • [55] H. Proença and L. A. Alexandre, “Ubiris: A noisy iris image database,” in 13th International Conference on Image Analysis and Processing.   Springer, 2005, pp. 970–977.
  • [56] H. Proença, S. Filipe, R. Santos, J. Oliveira, and L. A. Alexandre, “The ubiris. v2: A database of visible wavelength iris images captured on-the-move and at-a-distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1529–1535, 2009.
  • [57] S. J. Garbin, Y. Shen, I. Schuetz, R. Cavin, G. Hughes, and S. S. Talathi, “Openeds: Open eye dataset,” arXiv preprint arXiv:1905.03702, 2019.
  • [58] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [59] P. Kansal and S. Devanathan, “Eyenet: Attention based convolutional encoder-decoder network for eye region segmentation,” in IEEE/CVF International Conference on Computer Vision Workshop, pp. 3688–3693.
  • [60] S.-H. Kim, G.-S. Lee, H.-J. Yang et al., “Eye semantic segmentation with a lightweight model,” in IEEE/CVF International Conference on Computer Vision Workshop, 2019, pp. 3694–3697.
  • [61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
  • [62] A. K. Chaudhary, R. Kothari, M. Acharya, S. Dangi, N. Nair, R. Bailey, C. Kanan, G. Diaz, and J. B. Pelz, “Ritnet: Real-time semantic segmentation of the eye for gaze tracking,” in IEEE/CVF International Conference on Computer Vision Workshop, 2019, pp. 3698–3702.
  • [63] J. Perry and A. S. Fernandez, “Eyeseg: Fast and efficient few-shot semantic segmentation,” in Proceedings of the European Conference on Computer Vision Workshops.   Springer, 2020, pp. 570–582.
  • [64] A. K. Chaudhary, P. K. Gyawali, L. Wang, and J. B. Pelz, “Semi-supervised learning for eye image segmentation,” in ACM Symposium on Eye Tracking Research and Applications, 2021, pp. 1–7.
  • [65] Y. Shen, O. Komogortsev, and S. S. Talathi, “Domain adaptation for eye segmentation,” in Proceedings of the European Conference on Computer Vision Workshops.   Springer, 2020, pp. 555–569.
  • [66] S. Zagoruyko and N. Komodakis, “Wide residual networks,” 27th British Machine Vision Conference, 2016.
  • [67] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017, pp. 23–30.
  • [68] X. Zhang, Y. Sugano, and A. Bulling, “Revisiting data normalization for appearance-based gaze estimation,” in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2018, pp. 1–9.
  • [69] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2015.
  • [70] J. Immerkaer, “Fast noise variance estimation,” Computer Vision and Image Understanding, vol. 64, no. 2, pp. 300–302, 1996.
  • [71] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “Openface 2.0: Facial behavior analysis toolkit,” in 13th IEEE International Conference on Automatic Face & Gesture Recognition, 2018, pp. 59–66.