Multistream Gaze Estimation with Anatomical Eye Region Isolation by Synthetic to Real
Transfer Learning
Abstract
We propose a novel neural pipeline, MSGazeNet, that learns gaze representations by taking advantage of the eye anatomy information through a multistream framework. Our proposed solution comprises two components, first a network for isolating anatomical eye regions, and a second network for multistream gaze estimation. The eye region isolation is performed with a U-Net style network which we train using a synthetic dataset that contains eye region masks for the visible eyeball and the iris region. The synthetic dataset used in this stage is procured using the UnityEyes simulator, and consists of 80,000 eye images. Successive to training, the eye region isolation network is then transferred to the real domain for generating masks for the real-world eye images. In order to successfully make the transfer, we exploit domain randomization in the training process, which allows for the synthetic images to benefit from a larger variance with the help of augmentations that resemble artifacts. The generated eye region masks along with the raw eye images are then used together as a multistream input to our gaze estimation network, which consists of wide residual blocks. The output embeddings from these encoders are fused in the channel dimension before feeding into the gaze regression layers. We evaluate our framework on three gaze estimation datasets and achieve strong performances. Our method surpasses the state-of-the-art by 7.57% and 1.85% on two datasets, and obtains competitive results on the other. We also study the robustness of our method with respect to the noise in the data and demonstrate that our model is less sensitive to noisy data. Lastly, we perform a variety of experiments including ablation studies to evaluate the contribution of different components and design choices in our solution.
Gaze patterns can reveal meaningful information about a person’s behaviour and mental state and are often utilized by modern intelligent interactive systems to better understand users. Gaze can also be a useful communication cue for people with disabilities. The application of gaze estimation ranges from studying human behaviour and psychology to analyzing visual attention in autonomous driving, virtual reality, and remote classrooms. Many of these applications are sensitive to precision and lack user-specific calibration data. In this work, we aim to improve person-independent gaze estimation by presenting a novel framework that integrates eye region segmentation with multistream gaze estimation. Our experiments reveal that using anatomical features in the form of binary masks improves the accuracy of gaze estimation. Our model does not require any calibration samples yet can estimate gaze for unseen users with high accuracy and can be seamlessly integrated into real-time systems. Since gaze tracking involves the use of an individual’s eye image and has the potential to disclose sensitive details about where the user is looking, it is important to first obtain consent and ensure the maintenance of privacy before proceeding with the work.
Gaze estimation, eye region segmentation, multistream network, deep neural network, domain randomization, transfer learning.
1 Introduction
Eye gaze patterns can be used to characterize important eye-related movement events such as fixation, saccade, and smooth pursuit [1], which in turn can reveal meaningful information about human behaviour such as a person’s emotion, intention, desire, and state of mind. This classified eye movement data can further be exploited as useful features in cognitive load detection [2], mental fatigue detection [3], and stress level analysis [4]. The application of gaze extends to many different fields such as human-computer interaction (HCI) [5], human-robot interaction (HRI) [6], visual attention analysis [7], augmented or virtual reality (AR/VR) systems [8, 9], autonomous driving [10], and others. Hence, gaze estimation has become a widely acknowledged research area in computer vision due to its relevance and numerous contributions to various applications.
Early works on image-based gaze estimation were performed under constrained settings such as fixed head pose and unchanged illumination [11, 12]. Subsequently, with newer datasets such as UTMultiview [13], Eyediap [14], and MPIIGaze [15], some of the above-mentioned limitations were mitigated to make gaze estimation more realistic and compatible with in-the-wild scenarios. Such datasets were collected either in a laboratory environment [13, 14] or in daily life settings [15], offering continuous head pose, continuous gaze targets, variation in appearance, and illumination. These datasets are inherently more challenging for gaze estimation given the dynamic nature and variations in experimental conditions. To overcome these challenges, most recent methods [15, 16, 17, 18] have leveraged deep convolutional neural networks (CNNs) as they are comparatively more robust towards noise and changes in visual factors.
Deep learning solutions for gaze estimation [19, 15, 18] generally focus on regressing gaze angles directly from the raw eye image, and often do not consider additional information which may be found in different regions of the eye. For instance, the iris region contains important information that can aid in better estimation of gaze, if it were explicitly learned by the network. Moreover, distinct anatomical regions of the eye, for example the visible eyeball and the iris, are not highly complex to detect, making them insensitive to noise and illumination variations throughout the image, and thus highly beneficial for gaze learning. Yet, to the best of our knowledge, gaze estimation solutions have rarely focused on these properties, and explicit detection and learning of anatomical eye regions has so far been ignored. Eye landmarks have in fact been used to help gaze estimation [20, 21], but are difficult to learn, especially in noisy settings, and considerably increase the dimensionality of the data.
In this paper, we aim to improve person-independent gaze estimation by exploiting additional information extracted from raw input images. Our proposed solution consists of two key steps, namely anatomical eye region isolation and multistream gaze estimation, together called MSGazeNet. An overview of our proposed framework is depicted in Figure 1. Our model first uses a U-Net style network [22] to perform anatomical eye region isolation, outputting binary masks for two key eye regions, namely the iris and the visible eyeball. Next, our model uses a multistream gaze estimation network that takes the raw input eye images, along with the outputs of the U-Net, as inputs to estimate the gaze. This component of MSGazeNet uses an encoder in each stream to learn effective gaze-related representations. Next, channel-wise feature fusion is performed. Finally, the fused representations are passed through additional convolutional layers and a regression block consisting of a set of fully connected (FC) layers to estimate the output gaze. Given that the existing real-world datasets that are used for gaze estimation do not contain detailed eye structure information such as the visible eye region or the iris mask, we train the anatomical eye region isolation network exclusively on a synthetic dataset which we procure using the UnityEyes simulator [23]. This synthetic dataset consists of 80,000 synthetic eye images and corresponding eye region masks. Once the isolation network is trained in the synthetic domain, we perform transfer learning and integrate it into MSGazeNet. We also perform domain randomization when training the isolation network to reduce the domain gap between the synthetic images and the real images used to train the downstream network. The weights of the isolation network are then frozen and the rest of the network is trained on the real-world gaze datasets for gaze estimation.
We use three publicly available gaze datasets, MPIIGaze [15], Eyediap [14], and UTMultiview [13], to evaluate our solution. MSGazeNet obtains strong results, outperforming the state-of-the-art on Eyediap and UTMultiview, and achieving competitive results on MPIIGaze. We also perform a number of ablation studies and qualitative analyses to test the impact of different components and parameters in our network. Lastly, we perform a robustness analysis to investigate the performance across different amounts of noise present in the real-world datasets, and observe that MSGazeNet performs more robustly in comparison to prior works in this area.
Our contributions in this paper are three-fold. (1) We propose a novel deep neural framework for gaze estimation which consists of two main components, an anatomical eye region isolation module, followed by a gaze estimation network, together called MSGazeNet. The isolation network detects key areas of the eye which are informative toward gaze estimation. Our approach eliminates the need for additional inputs such as head pose and eye landmarks, which are often required by multimodal solutions in the area. (2) The eye region isolation network is solely trained using a synthetic dataset that contains over 80,000 eye images along with their iris and visible eyeball masks. The trained network is used to extract binary masks for real-world eye images. We perform domain randomization utilizing artifact-like augmentations to ensure a smooth transfer from the synthetic to the real domain. (3) Through rigorous experiments, we demonstrate that our model performs robust gaze estimation even in the presence of noisy data. Given the ability of our model to learn the critical regions of the eye, our results set a new state of the art on two benchmark gaze estimation datasets (Eyediap and UTMultiview) and achieve competitive performance on another dataset (MPIIGaze). We then validate our design choices through detailed ablation studies and by exploring several variants of our proposed solution. To encourage reproducibility and contribute to the field, we make our code public at: https://github.com/z-mahmud22/MSGazeNet.
The rest of this paper is organized as follows. In the next section, we describe the related work in the field. Following, we present our method, including the network and synthesized dataset. Next, we describe the experimental setup and implementation details. This is then followed by detailed experimental results and various sensitivity/ablation studies. Lastly, we summarize our work and discuss the potential future research directions.
Year | Method | Dataset | Input | Feature Extractor | Regressor |
---|---|---|---|---|---|
2015 | Zhang et al. [19] | MPIIGaze, UTMultiview | Image, pose | LeNet | FC |
2017 | Zhang et al. [15] | MPIIGaze, UTMultiview | Image, pose | VGG-16 | FC |
2018 | Park et al. [24] | Eyediap, MPIIGaze | Image | Hourglass+DenseNet | FC |
2018 | Park et al. [20] | Columbia [25], Eyediap, MPIIGaze, UTMultiview | Image | Hourglass | SVR |
2018 | Yu et al. [18] | Eyediap, UTMultiview | Image, pose | 4 Layers CNN | FC |
2019 | Yu et al. [26] | Columbia, MPIIGaze | Image | VGG-16 | FC |
2019 | Wang et al. [27] | Columbia, Eyediap, MPIIGaze, UTMultiview | Image | Bayesian CNN | FC |
2020 | Yu et al. [28] | Columbia, Eyediap, UTMultiview | Image | ResNet | FC |
2022 | Ghosh et al. [29] | Columbia, Gaze360 [30], MPIIGaze | Image | ResNet-50 | FC |
2022 | Mahmud et al. [31] | Eyediap | Image | U-Net+Multistream VGG-16 | FC |
2023 | Cai et al. [32] | Eyediap, ETH-XGaze [33], Gaze360, GazeCapture [34], MPIIGaze | Image | Real-ESRGAN [35]+ResNet-18 | FC |
2023 | Jin et al. [36] | Eyediap, MPIIGaze | Image, pose | VGG-16 | FC |
2023 | Jindal et al. [37] | EVE [38], Columbia, MPIIGaze | Image | ResNet-18 | FC |
2 Related Work
Gaze is often represented as a 2D screen coordinate representing the point of gaze (PoG), or an angular vector representing the gaze direction by pitch and yaw angles. Vision-based methods primarily fall under three categories which are feature-based, model-based, and appearance-based. Feature-based approaches [39, 40, 41] use the geometric shape of the eye to extract hand-crafted features such as eyeball centre, radius, pupil centre, and eye corners from eye images, which are then used in light-weight machine learning models to regress gaze. These methods were generally used prior to the emergence of deep learning solutions. Model-based methods [42, 43, 44] aim to fit 3D deformable eye region models to eye images. Both classical machine learning and more recent deep learning techniques have been used in this category of literature. Appearance-based methods [23, 15, 18, 24] aim to learn a direct mapping of gaze from input eye images. These methods, which our paper also falls under, mainly rely on deep learning models. In the following, we present the related work under appearance-based gaze estimation methods. In particular, we review prior works on supervised learning, domain adaptation methods, as well as few-shot, semi-supervised, and self-supervised solutions. Lastly, since segmentation of eye regions is used in our study, we also briefly review the works in this area.
2.1 Supervised Methods
A multimodal network was proposed in [19] where a LeNet type architecture was adopted to perform gaze estimation from eye images and head pose. In a subsequent work [15], a VGG type architecture was extended into a multimodal model that also used eye images and head pose as multimodal inputs. In [18], a deep multitask network was proposed where the network aimed to learn eye gaze and eye landmarks from eye images and their corresponding head pose information. The work explored the correlation of eye landmarks and gaze direction and argued for the eye landmarks to provide information cues for gaze estimation. Landmark detection in the form of Gaussian probability heatmaps of landmark coordinates from synthetic images was proposed in [20] using a stacked hourglass network [45]. The predicted landmarks were then fed into a support vector regressor (SVR) [46] to estimate gaze. The proposed solution improved iris localization, eyelid registration, and gaze estimation accuracies in both cross-dataset and person-specific settings. In [24], a pictorial representation of gaze was proposed, which was hypothesized to be an intermediate eye image representation. The proposed pipeline consisted of a stacked hourglass network [45] which was trained to predict the intermediate gazemaps from eye images, followed by a lightweight DenseNet architecture [47] to regress gaze from the gazemaps. Another multimodal approach was proposed in [21], which used both RGB eye images and their corresponding eye landmark heatmaps for gaze estimation. The two inputs were processed via separate CNN encoders to extract features which were then concatenated along with head pose information, and subsequently fed into dense layers that output 3D gaze direction.
2.2 Domain Adaptation Methods
A synthetic dataset, SynthesEyes, was published in [48] to perform both eye shape registration and gaze estimation. The dataset offers a wide variation of synthesized eye images in terms of head pose, gaze, and illumination conditions. It was shown that when used for pre-training, the dataset results in significant performance improvement in a cross-dataset setting using the network proposed in [19]. Following the prior work, UnityEyes, a synthesis framework that can render eye region images in real-time, was developed in [23]. The system can be used to generate large scale synthetic eye image datasets and their corresponding landmarks, along with eye gaze annotations. A synthetic dataset consisting of millions of images which were generated by the simulator was used to train a simple kNN algorithm, outperforming their previous work [48] in cross-dataset experiments. To minimize the domain gap between synthetic and real images, a generative adversarial network (GAN) was proposed in [16] that used unlabeled synthetic and real eye images. The network consisted of a refiner network that refined the synthetic images to make them more realistic through adversarial learning. These refined images were then used to train a simple CNN to estimate gaze, which outperformed previous state-of-the-art methods by a large margin in cross-dataset settings. A further improvement was reported in [17] which relied on bidirectional mapping between synthetic and real eye images by leveraging a cyclic image-to-image translation framework. Highlighting the key challenges in cross-domain gaze estimation, a domain generalization technique was proposed in [49] where gaze-irrelevant features such as illuminations and appearance factors were eliminated via self-adversarial learning to extract purified gaze-relevant features from facial images. The conducted experiments resulted in new state-of-the-art performances in cross-dataset settings across multiple gaze estimation datasets without any fine-tuning. In [32], a source-free domain adaptation method was introduced to adapt a gaze estimator to an unlabeled target domain without any source data. The neural architecture consisted of a face enhancer model that generates high-quality input images for the gaze estimator, leading to reduced variance and uncertainty of gaze predictions in the target domain.
2.3 Few-Shot and Unsupervised Methods
For person-specific gaze estimation, a differential CNN was proposed in [50], which output the gaze difference between two eyes of the same subject. During inference, the network used a set of calibration samples from the same subject and predicted the gaze difference between the input image and the calibration samples as the estimated gaze. With the intent of making more personalized gaze networks with lower gaze error, a person-specific gaze estimation network was proposed in [51] that worked with only a few (≤ 9) calibration samples from the test person. The proposed solution disentangled appearance features, gaze, and head pose information from facial images using a disentangling encoder-decoder (DT-ED) [51]. The network took an RGB face image as input and the decoder mapped it to three latent space vectors, which corresponded to eye region appearance, gaze, and head pose information respectively. The gaze latent vector was then fed to dense layers in order to make the gaze prediction. The scarcity of calibration samples in few-shot person-specific gaze estimation was addressed in [26] by generating more training samples via synthesis of gaze-redirected eye images from an available set of calibration samples. The framework relied on synthetic images, generated by [23], to learn the gaze redirection task. To better adapt to the real domain, the network was further trained with real images which were first redirected given a redirection angle, and supervised via gaze redirection loss. Following this, an inverse redirection angle was applied to the gaze redirected images to reconstruct the original images, which were supervised via a cycle consistency loss. In [36], a Kappa angle compensation method was proposed to neutralize the ocular counter-rolling response (OCR). The normalization process of eye images naturally induces the OCR, which redistributes the Kappa angle’s pitch and yaw components. Using a few calibration samples (≤ 9) from the test subject, this method regresses the Kappa angle to refine the estimated gaze.
A semi-supervised approach was presented in [27] where a Bayesian convolutional neural network (BCNN) relied on both labeled and unlabeled eye images to perform gaze estimation along with appearance classification and head pose estimation. The framework also included an adversarial component where the gaze labeled images were used as the source domain and the unlabeled images were used as the target domain. The framework used the labeled images to supervise the gaze estimator, while the gaze estimator aimed to learn person-invariant features to oppose the adversarial module. Eye region segmentation was performed as an auxiliary task in [31] and the output eye segments and raw eye images were used as inputs to a multistream network for gaze estimation. The multistream network consisted of three encoders to extract features separately from the three inputs, and the eye image encoder was pretrained with self-supervised contrastive learning and then fine-tuned during the downstream gaze estimation task. A sensitivity experiment verified the stability of the proposed network while using a very limited amount of labeled data. In [29], a multitask network was developed to perform three auxiliary gaze-relevant tasks with limited supervision. Using off-the-shelf networks, pseudo-gaze, eye orientation, and head pose were extracted from large scale facial image datasets [52, 53], which were later used to train a CNN backbone for the auxiliary tasks. To minimize noise in the generated labels, a noise distribution model was also incorporated in the framework. The network was then fine-tuned for downstream gaze estimation. A contrastive learning method, GazeCLR, was proposed in [37] which pre-trains a CNN encoder with both single-view and multi-view gaze samples. These different gaze samples help the network learn invariance and equivariance among gaze representations, improving the cross-domain gaze estimation performance.
In [28], an unsupervised approach was proposed to learn low dimensional gaze representations by utilizing a gaze redirection network. The proposed pipeline used an image pair of the same eye with different gaze directions as inputs to two separate networks for representations to be learned. The output latent vectors and their difference were then used in a gaze redirection network to reconstruct the latent representation of one of the images by redirecting the other image based on the gaze difference. The entire pipeline did not require ground truth gaze labels while training. The trained gaze representation network was then calibrated with randomly selected labeled samples (10-100) from the training data.
2.4 Eye Region Segmentation
Given that segmentation plays a role in our method, we briefly review the literature on this topic. In the context of segmentation focused on eye images [54, 55, 56], a large-scale eye segmentation dataset, OpenEDS, was released in [57]. Along with the dataset, some baseline experiments using deep convolutional encoder-decoder architectures [58] were performed to set eye segmentation benchmarks. Subsequently, an attention-based encoder-decoder network, Eyenet [59], was proposed to perform multi-scale supervision during eye region segmentation. The neural architecture consisted of slightly modified residual units and two types of attention modules that applied attention on both channel and spatial dimensions. In [60], a lightweight segmentation network based on MobileNetv2 [61] was proposed which significantly reduced the processing time by utilizing depthwise convolution. A real-time segmentation network, RITnet, was released in [62] to segment eye images at 300 Hz. The proposed solution combined DenseNet with U-Net to create the architecture which was supervised via a weighted combination of three different loss functions. Semi-supervised approaches were explored in [63, 64] where the solutions significantly improved the baseline performance with fewer annotated data and trainable parameters. In [65], three types of domain adaptation methods, supervised, semi-supervised, and unsupervised were explored using eye segmentation datasets collected from two different setups.
3 Method
In this section, we first discuss the problem statement. This is followed by an overview of the proposed network, and a detailed description of each component in our pipeline.
3.1 Problem Statement
Let’s assume we have input $x_e$ which contains a grey-scale eye image. Our goal is to develop a model which can reliably estimate the gaze parameters, namely the pitch ($\theta$) and yaw ($\phi$) angles. Pitch angles refer to the up/down eye movements and the yaw angles refer to the left/right eye movements. We hypothesize that in addition to learning the representation of the overall eye image, $x_e$, extracting information explicitly from anatomical regions, namely the visible eyeball and the iris, would result in more informative features for estimating the final $\theta$ and $\phi$.
3.2 Proposed Solution
Solution Overview. To address the problem above, we design a network that first performs anatomical eye region isolation in order to separate key geometric sections of the eye for explicit processing and representation learning. This critical step in our pipeline, as we will describe later, relies on transfer learning from the simulated domain to the real domain. Next, our pipeline uses all the available information, i.e., the original input along with the isolated eye regions, to perform representation learning followed by fusion, and eventually gaze estimation. An overview of our method is presented in Figure 1. In the following, we describe each of these components in our proposed method.
Anatomical Eye Region Isolation. As touched upon above, a core component of our proposed method is the process of isolating different anatomical eye regions so that they could be individually used for gaze estimation. Here we first discuss our justification for including this step in our proposed method. Due to the inherent noise in real-world images, it is often quite difficult to recognize the gaze or orientation of the eye from raw images. Prior research [18] suggests that gaze direction has a strong correlation with eye landmarks, indicating that these landmarks could potentially contribute to gaze estimation as auxiliary information. However, obtaining such detailed and accurate landmarks is computationally expensive and is itself susceptible to noise. Nevertheless, some methods use off-the-shelf landmark detectors, which are primarily trained using synthetic data, to extract eye landmarks. Learning such high dimensional information is very challenging, and those networks also suffer from the ‘synthetic to real’ domain gap, which hinders the robustness of their landmark predictions. As a result, even though some methods [20, 21] use these landmarks in the form of heatmaps, there still remains a considerable amount of noise in the training data.
In our proposed solution, as opposed to using the eye landmarks directly as auxiliary data, we propose and use the Anatomical Eye Region Isolation (AERI) network to extract and isolate anatomical eye regions, namely the visible eyeball and the iris. We denote this network by $\mathcal{F}_{\omega}$, where $\omega$ are learnable parameters. $\mathcal{F}_{\omega}$ takes the eye image $x_e$ as input, and outputs $\hat{m}_v$ and $\hat{m}_i$, which are binary masks corresponding to the visible eyeball and the iris, respectively. This design allows our pipeline to gain additional information about the orientation of the eye and key regions therein, without having to rely on the high-dimensional and noisy landmarks.
As shown in Figure 2, $\mathcal{F}_{\omega}$ uses a U-Net style architecture [22] with a two-channel output $[\hat{m}_v, \hat{m}_i]$. Thus, to train the network we require a dataset of $(x_e, m_v, m_i)$ tuples, where $m_v$ and $m_i$ are the ground truth masks for the visible eyeball and the iris. Since such a dataset does not exist and its collection from real images is quite difficult and time-consuming given the difficulty of recording eye images and isolating the anatomical regions for every image manually, we introduce a synthetic dataset. For this purpose, we rely on an eye image simulator, UnityEyes [23], to procure the synthetic eye image dataset. The simulator can render synthetic eye images along with their detailed 2D landmark annotations and gaze labels in real-time. The simulator generates 32 landmarks for the iris region, denoted by $l^{(i)}_k$, where $k \in \{1, \dots, 32\}$. Next, $m_i$ is calculated by:

$$m_i = \mathcal{B}\big(\mathcal{P}(l^{(i)}_1, l^{(i)}_2, \dots, l^{(i)}_{32})\big) \tag{1}$$

where $\mathcal{P}$ is a function that takes a series of landmark coordinates and creates a polygon by connecting them sequentially, and $\mathcal{B}$ is a binarizing operator which creates a binary mask by taking the enclosed area in its input and setting it to 1, while the outside is set to 0.
The simulator also provides 16 landmarks for the interior region of the eye, i.e., the visible eyeball, plus 6 additional 2D coordinates for the caruncle region (inner corner of the eye). Here, we first average the 6 caruncle coordinates to create a single caruncle representative landmark, bringing the total landmarks for the visible eyeball to 17, as $l^{(v)}_k$, where $k \in \{1, \dots, 17\}$. Next, similar to our approach for creating $m_i$, we use:

$$m_v = \mathcal{B}\big(\mathcal{P}(l^{(v)}_1, l^{(v)}_2, \dots, l^{(v)}_{17})\big) \tag{2}$$

to generate the visible eyeball mask. The process of obtaining $m_v$ is also depicted in Figure 3.
Successive to procuring the synthetic dataset, we use it to train $\mathcal{F}_{\omega}$ using mean squared error (MSE) loss:

$$\mathcal{L}_{\text{AERI}} = \frac{1}{P} \sum_{j=1}^{P} \left(\hat{m}^{(j)} - m^{(j)}\right)^2 \tag{3}$$

Here, $\mathcal{L}_{\text{AERI}}$ is the average MSE loss calculated between the predicted mask $\hat{m} = [\hat{m}_v, \hat{m}_i]$ and the ground truth mask $m = [m_v, m_i]$, for pixels $j \in \{1, \dots, P\}$.
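The mask construction in Eqs. (1) and (2) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `polygon_mask` and `eyeball_landmarks` are hypothetical, the inside test uses the even-odd rule on pixel centres, and the averaged caruncle point is simply appended after the 16 interior landmarks (the text does not say where it sits in the polygon order).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def polygon_mask(points: Sequence[Point], height: int, width: int) -> List[List[int]]:
    """Return a binary mask: 1 for pixel centres inside the polygon, 0 outside.

    Connecting consecutive landmarks stands in for the polygon function, and
    the even-odd inside test stands in for the binarizing operator.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(points)
    for row in range(height):
        for col in range(width):
            px, py = col + 0.5, row + 0.5  # pixel centre
            inside = False
            for k in range(n):
                (x1, y1), (x2, y2) = points[k], points[(k + 1) % n]
                # Toggle when a rightward horizontal ray crosses edge k.
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


def eyeball_landmarks(interior: Sequence[Point], caruncle: Sequence[Point]) -> List[Point]:
    """Combine interior landmarks with the mean caruncle point (16 + 1 = 17)."""
    cx = sum(p[0] for p in caruncle) / len(caruncle)
    cy = sum(p[1] for p in caruncle) / len(caruncle)
    # Assumption: the averaged caruncle point is appended last.
    return list(interior) + [(cx, cy)]
```

In practice a library rasterizer would be faster; the explicit loop is used here only to keep the sketch dependency-free.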
As for the architectural details of the AERI network, as shown in Figure 2 (top), it consists of an encoder-decoder U-Net architecture. The architecture details presented in Table 2 are adapted from the classical U-Net model and are selected to maximize performance. The encoder contains 5 convolutional blocks (conv blocks) with 2×2 maxpool layers in-between. The decoder consists of 4 conv blocks with 2×2 upsampling layers in-between. A final 1×1 convolution layer followed by sigmoid generates the network output. The feature maps from the upsampling layers of the decoder blocks are concatenated with their corresponding feature map outputs from the encoder blocks via skip-connections. Each conv block of both encoder and decoder consists of two 3×3 convolution layers followed by batch normalization, zero-padding, and ReLU activation. The number of feature maps at the initial block of the encoder is 64, which is doubled following every maxpool to a maximum of 1024 feature maps at the last block of the encoder. Conversely, the number of feature maps in the initial decoder block is 1024, which is halved after every upsampling layer, reaching 64 at the final conv block.
Modules | Parameters | Values |
---|---|---|
Conv Block(s) | Input shape | 1×36×60 |
Conv Block(s) | # Encoder blocks | 5 |
Conv Block(s) | # Decoder blocks | 4 |
Conv Block(s) | # Layers | 2 |
Conv Block(s) | Layer type | Conv2D |
Conv Block(s) | Kernel size | 3×3 |
Conv Block(s) | Padding type | Zero-padding |
Conv Block(s) | Activation | ReLU |
Downsample | # Layers | 4 |
Downsample | Layer type | Maxpool2D |
Downsample | Kernel size | 2×2 |
Upsample | # Layers | 4 |
Upsample | Layer type | Bilinear upsample |
Upsample | Scale factor | 2.0 |
Output | # Layers | 1 |
Output | Layer type | Conv2D |
Output | Kernel size | 1×1 |
Output | Activation | Sigmoid |
Full Network | Batch size | 32 |
Full Network | Loss function | MSE |
Full Network | Optimizer | Adam |
Full Network | Learning rate | 0.00001 |
Full Network | Learning rate decay | 0.1 |
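To make the encoder-decoder layout easier to follow, the sketch below tracks the feature-map shapes implied by the description: 64 feature maps at the first encoder block, doubling after each 2×2 maxpool up to 1024, then halving back to 64 in the decoder. It is bookkeeping only, not a trainable network. The function name `unet_shapes` is hypothetical, and two details are assumptions that the text does not specify: pooling uses floor division, and each upsampled map is brought to the spatial size of its skip connection before concatenation.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)


def unet_shapes(
    in_shape: Shape = (1, 36, 60), base: int = 64, depth: int = 5
) -> Tuple[List[Shape], List[Shape]]:
    """Return the output shapes of the encoder blocks and the decoder blocks."""
    _, h, w = in_shape
    encoder: List[Shape] = []
    ch = base
    for level in range(depth):
        if level > 0:
            # 2x2 maxpool halves the spatial size (floor division assumed);
            # the following conv block doubles the number of feature maps.
            h, w, ch = h // 2, w // 2, ch * 2
        encoder.append((ch, h, w))

    decoder: List[Shape] = []
    # Walk back up through the skip connections, deepest first.
    for skip_ch, skip_h, skip_w in reversed(encoder[:-1]):
        # Assumption: the upsampled map is matched to the skip's spatial size,
        # and the conv block outputs as many channels as the skip provides.
        decoder.append((skip_ch, skip_h, skip_w))
    return encoder, decoder


if __name__ == "__main__":
    enc, dec = unet_shapes()
    for name, shapes in (("encoder", enc), ("decoder", dec)):
        for i, s in enumerate(shapes, start=1):
            print(f"{name} block {i}: {s}")
```

With the 1×36×60 input of Table 2, the odd intermediate sizes (for example a height of 9) mean that plain ×2 upsampling does not exactly reproduce the encoder sizes, which is why the sketch has to assume some form of size matching at the skip connections.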
Multistream Gaze Estimation. After the AERI network described above is trained, its weights are frozen and transferred to our final gaze estimation model. In order to ensure that domain-shift issues between the synthetic and real domains do not negatively impact the overall performance, we explore a variety of augmentations that bring the distribution of the synthetic dataset closer to that of the real datasets. This will be discussed in detail later in Section 3.3.
We use the frozen AERI network, $\mathcal{F}_{\omega}$, to generate the visible eyeball mask $\hat{m}_v$ and the iris mask $\hat{m}_i$. These two masks, along with the eye image $x_e$, are then used for gaze estimation. The layout of the gaze estimation network, $\mathcal{G}$, is depicted in Figure 2 (bottom) and the architecture is detailed in Table 3. The network consists of three parallel branches for the three different input streams. The first branch processes the input eye image through a 3×3 convolution layer followed by two conv blocks with identical structures. We use wide residual blocks (WRBs) [66] for the conv block architectures. Each conv block is a combination of two WRBs, denoted by B(M), where M represents the list of kernel sizes of the convolution layers inside a WRB. The initial WRB is of type B(3,1,3) which consists of two 3×3 convolution layers followed by a 1×1 convolution layer, while the subsequent WRB is of type B(3,3). As mentioned earlier, the first conv block is followed by a second identical conv block. A similar architecture is used for the other two branches responsible for processing the eye anatomy mask streams. The feature maps obtained from the three branches are then fused in a channel-wise manner through concatenation, and subsequently fed to a single conv block (conv block 3) to extract combined representations.
Modules | Parameters | Values |
---|---|---|
Conv2D | Input shape | 1×36×60 |
Conv2D | Kernel size | 3×3 |
Conv2D | Padding type | Zero-padding |
Conv2D | Output feature size | 16×36×60 |
Conv Block(s) | Architecture | Wide Residual Network |
Conv Block(s) | # WRBs | 2 |
Conv Block(s) | Block type | B(3,1,3), B(3,3) |
Conv Block(s) | Activation | LeakyReLU |
Conv Block(s) | Dropout rate | 0.5 |
Full Network | Widen factor | 4 |
Full Network | Depth | 16 |
Full Network | Batch size | 32 |
Full Network | Loss function | MSE |
Full Network | Optimizer | Adam |
Full Network | Learning rate | 0.0001 |
Full Network | Learning rate decay | 0.5 |
Gaze Regression. The combined feature embeddings from the multistream backbone network are passed through a global average pooling layer. The final feature embeddings are then flattened and fed to a set of 3 FC layers to regress the output gaze in the form of pitch ($\theta$) and yaw ($\phi$) angles. We use dropout with a probability of 0.25 and ReLU activation between the FC layers, with the exception of the output layer.
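The gaze regression is supervised by the following loss:
$$\mathcal{L}_{gaze} = \frac{1}{N}\sum_{i=1}^{N} \lVert \hat{\mathbf{g}}_i - \mathbf{g}_i \rVert_2^2 \tag{4}$$
where $\mathcal{L}_{gaze}$ is the average MSE loss calculated between the predicted gaze $\hat{\mathbf{g}}$ = [$\hat{\theta}$, $\hat{\phi}$] and the ground truth gaze $\mathbf{g}$ = [$\theta$, $\phi$] for $N$ sample images.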
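As a small illustration of this loss, the sketch below computes the average squared error between predicted and ground-truth (pitch, yaw) pairs with NumPy. The function name is hypothetical, and the choice to sum the squared error over the two angles before averaging over samples is an assumption; a framework MSE that also averages over the two components would differ by a constant factor of 2.

```python
import numpy as np

def gaze_mse_loss(pred, target):
    """Average over N samples of the squared L2 distance between gaze pairs.

    pred, target: array-likes of shape (N, 2) holding (pitch, yaw) per sample.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape or pred.ndim != 2 or pred.shape[1] != 2:
        raise ValueError("expected two arrays of shape (N, 2)")
    # Squared error summed over the two angles (assumption, see lead-in).
    squared_norm = np.sum((pred - target) ** 2, axis=1)
    return float(np.mean(squared_norm))
```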
3.3 Domain Randomization
We apply a variety of augmentations during the training of both the AERI network and the gaze estimation network. Since the first component of the pipeline is trained solely on the synthetic domain, there remains a possibility that the network will not perform well in the real domain when integrated into the final MSGazeNet pipeline. To prevent this, we adopt domain randomization [67] by applying visual transformations such as Gaussian noise, blur, and change of contrast to augment the simulated images so that they better reflect the distribution of the real-world images. The purpose of this technique is to add a large number of relatively strong variations to the simulated images so that they cover the variations of the real-world datasets. The parameters of all the applied transformations, detailed in Table 4, are chosen via trial and error. During training, these transformations are randomly applied to augment the images. We apply Gaussian noise with a mean ($\mu$) of 0 and a standard deviation ($\sigma$) in the range [0, 10.0]. We apply a blurring operation with $\sigma$ in the range [0, 2.0] and a filter size of 3×3. Downscaling by a random factor in the range [0, 0.5] is applied, followed by up-scaling back to the original image size, essentially quantizing the image at the original resolution. As another augmentation, we add random lines (0 to 2 lines). We also randomly change the contrast of the input images with a minimum intensity value in the range [0, 100] and a maximum intensity value in the range [155, 255]. Lastly, we randomly remove regions with a dynamic square kernel of size 1 to 10 pixels in both height (h) and width (w). As we show later in the results section, applying strong augmentations enables both the anatomical eye region isolation network and the gaze estimation network to learn better representations for their respective tasks.
Transformations | Parameters |
---|---|
Gaussian noise | $\sigma$ = [0 - 10.0], $\mu$ = 0 |
Gaussian blur | $\sigma$ = [0 - 2.0], filter size = 3×3 |
Cutout | h,w = [1 - 10.0] px |
Downscale | [0-0.5] |
Random lines | 0 to 2 |
Contrast change | min = [0-100], max = [155-255] |
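The transformations listed above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the per-transform application probability of 0.5, the nearest-neighbour resampling, the reading of the downscale factor as a fractional size reduction, and the linear remapping used for the contrast change are all assumptions.

```python
import numpy as np

def gaussian_noise(img, rng, max_sigma=10.0):
    """Add zero-mean Gaussian noise with sigma drawn from [0, max_sigma]."""
    sigma = rng.uniform(0.0, max_sigma)
    return img + rng.normal(0.0, sigma, size=img.shape)

def gaussian_blur(img, rng, max_sigma=2.0):
    """Blur with a separable 3x3 Gaussian kernel, sigma drawn from [0, max_sigma]."""
    sigma = rng.uniform(0.0, max_sigma)
    if sigma < 1e-3:  # sigma close to 0 means no blur
        return img.copy()
    k = np.exp(-np.array([-1.0, 0.0, 1.0]) ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    p = np.pad(img, 1, mode="edge")
    rows = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]  # horizontal pass
    return k[0] * rows[:-2] + k[1] * rows[1:-1] + k[2] * rows[2:]  # vertical pass

def cutout(img, rng, max_size=10):
    """Zero out one random square patch with side length in [1, max_size] px."""
    h, w = img.shape
    s = int(rng.integers(1, max_size + 1))
    y = int(rng.integers(0, max(1, h - s + 1)))
    x = int(rng.integers(0, max(1, w - s + 1)))
    out = img.copy()
    out[y:y + s, x:x + s] = 0.0
    return out

def downscale(img, rng, max_reduction=0.5):
    """Shrink by a random fraction, then resize back (nearest neighbour)."""
    h, w = img.shape
    scale = 1.0 - rng.uniform(0.0, max_reduction)  # assumed meaning of the factor
    sh, sw = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    small = img[np.ix_(np.arange(sh) * h // sh, np.arange(sw) * w // sw)]
    return small[np.ix_(np.arange(h) * sh // h, np.arange(w) * sw // w)]

def random_lines(img, rng, max_lines=2):
    """Draw between 0 and max_lines straight lines of random intensity."""
    h, w = img.shape
    out = img.copy()
    for _ in range(int(rng.integers(0, max_lines + 1))):
        y0, y1 = rng.integers(0, h, size=2)
        x0, x1 = rng.integers(0, w, size=2)
        n = max(h, w)
        ys = np.linspace(y0, y1, n).round().astype(int)
        xs = np.linspace(x0, x1, n).round().astype(int)
        out[ys, xs] = rng.uniform(0.0, 255.0)
    return out

def contrast_change(img, rng):
    """Linearly remap [0, 255] to [lo, hi], lo in [0, 100], hi in [155, 255]."""
    lo, hi = rng.uniform(0.0, 100.0), rng.uniform(155.0, 255.0)
    return lo + (img / 255.0) * (hi - lo)

AUGMENTATIONS = (gaussian_noise, gaussian_blur, cutout,
                 downscale, random_lines, contrast_change)

def augment(img, rng, p=0.5):
    """Apply each transformation independently with probability p."""
    out = np.asarray(img, dtype=np.float64)
    for fn in AUGMENTATIONS:
        if rng.random() < p:
            out = fn(out, rng)
    return np.clip(out, 0.0, 255.0)
```

A call such as `augment(eye_image, np.random.default_rng(0))` expects a single-channel image with intensities in [0, 255] and returns an array of the same shape.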
Name | # Subjects | Size | Resolution | Head pose | Gaze target | Illumination condition | Collection duration |
---|---|---|---|---|---|---|---|
Eyediap [14] | 16 | 237 min | HD & VGA | Continuous | Continuous | 2 | 2 days |
MPIIGaze [15] | 15 | 213,659 | 1280×720 | Continuous | Continuous | Daily life | 3 months |
UTMultiview [13] | 50 | 64,000 | 1280×1024 | 8 + synthesized | 160 | 1 | 1 day |
4 Experiments
In this section, we describe the datasets used in our work, followed by data normalization. Next, the evaluation protocol is described, followed by the description of the metric commonly used for quantifying gaze estimation performance. Lastly, the implementation details of our model are described.
4.1 Datasets
Real Datasets. We evaluate our model's performance on three public gaze datasets: Eyediap [14], MPIIGaze [15], and UTMultiview [13]. An overview of these datasets is shown in Table 5. The Eyediap dataset contains 94 VGA and HD resolution videos captured from 16 subjects, of whom 12 are male and 4 are female. The dataset was collected in two different sessions for each subject. In one session, named the screen target session, small dots appeared on a screen which the users were asked to look at. In the other session, called the 3D floating ball session, a ball was continuously moved in front of the user, who was asked to look at and follow it. Both sessions included static and mobile head pose scenarios. Although the dataset offers large variability in gaze and head pose, the illumination conditions are similar across all sessions.
MPIIGaze is a benchmark dataset for in-the-wild appearance-based gaze estimation. It was collected from 15 subjects in an unconstrained manner over the course of several months. Among these subjects, 9 are male, 6 are female, and 5 wore glasses. The age of the participants ranges from 21 to 35 years. The dataset consists of 213,659 facial images of subjects during their everyday laptop usage and the corresponding ground truth gaze labels. The labels were collected using custom software that triggered a sequence of 20 dots every 10 minutes, which the users were asked to look at. The dataset contains significant variation in terms of appearance, head movement, gaze target, and illumination.
The UTMultiview dataset consists of 160 discrete gaze targets, in contrast to the two datasets mentioned above, which consist of continuous gaze targets. There are 50 subjects (35 male and 15 female) aged between approximately 20 and 40 years. This dataset was collected in a laboratory environment, and the collection procedure used 8 synchronized cameras to capture facial images of the participants from multiple viewpoints. The gaze labels were collected by instructing the participants to follow a visual target on a monitor.
Synthetic Dataset. In addition to the three datasets above, we use the UnityEyes simulator [23] to procure a synthetic dataset, which consists of 80,000 synthetic eye images with a resolution of 1280×768 and their corresponding masks representing the iris and visible eyeball regions of the eye. The images are captured by varying eye appearance, illumination (light intensity = [0.60 - 1.20]), shape, head pose ([20.00, 40.00]), and gaze direction ([49.49, 78.28]), all of which are controllable parameters in the simulator. The dataset contains images from 20 virtual subjects, of whom 5 are female and 15 are male. For training, 60,000 of the synthetic images are used, while we set aside 20,000 for hold-out validation. Finally, we present the distribution plots of both head pose and gaze for all the datasets used in this study in Figure 4. We observe that the distributions of the synthetic dataset are wider than those of the three real-world datasets in terms of both gaze and head pose, making this dataset suitable for training purposes.
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
![Refer to caption](x7.png)
![Refer to caption](x8.png)
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
4.2 Data Normalization
We normalize all the eye images and their corresponding gaze labels from the Eyediap and UTMultiview datasets, as well as the images and masks from the synthetic dataset, according to [68]. The MPIIGaze dataset directly provides normalized data. The normalization process requires the 3D head pose and the camera calibration matrix. A virtual camera is used to map the input images to a normalized space where a fixed distance of 600 mm is maintained between the eye centre and the virtual camera. The virtual camera is also rotated to point at the reference point (eye centre) and to cancel head rotation about the roll axis, which avoids any ambiguity. Next, a perspective transformation is applied to the images, followed by resizing to 60×36 pixels. Finally, the eye images from all four datasets are converted to gray-scale and then histogram equalized.
4.3 Evaluation Protocol
For gaze estimation, we use the VGA videos from the screen target session of the Eyediap dataset. The eye images are sampled every 15 frames from each recording to create a subset which we later use for training and testing. As there were no screen target session data for two subjects, the evaluation set contains data from 14 subjects rather than 16. The MPIIGaze dataset provides a subset of 45,000 eye images, comprising 1,500 left-eye and 1,500 right-eye images from each of the 15 subjects. The UTMultiview dataset contains both real-world and synthesized images for each participant. We use the real-world part, which contains 64,000 eye images from 50 participants. We use the standard evaluation protocols for all three benchmark datasets to be consistent with previous methods [15, 24, 18, 28]. We follow leave-one-subject-out evaluation for the MPIIGaze dataset, 5-fold validation for the Eyediap dataset, and 3-fold validation for the UTMultiview dataset. Note that there is no subject overlap between the different folds for the Eyediap and UTMultiview datasets.
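The subject-wise protocol described above can be sketched in plain Python. The helper names are hypothetical, and the round-robin assignment of subjects to folds is an assumption, since the exact fold composition is not given here. Setting the number of folds equal to the number of subjects yields leave-one-subject-out evaluation.

```python
def subject_folds(subject_ids, n_folds):
    """Group sample indices into n_folds folds so that no subject spans two folds.

    subject_ids: one subject label per sample, in sample order.
    """
    subjects = sorted(set(subject_ids))
    if not 1 <= n_folds <= len(subjects):
        raise ValueError("n_folds must be between 1 and the number of subjects")
    # Round-robin assignment of subjects to folds (an assumption).
    fold_of = {s: i % n_folds for i, s in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for idx, s in enumerate(subject_ids):
        folds[fold_of[s]].append(idx)
    return folds

def train_test_splits(subject_ids, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = subject_folds(subject_ids, n_folds)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx
```

For example, `train_test_splits(ids, 15)` with 15 distinct subjects gives the leave-one-subject-out splits used for MPIIGaze, while `train_test_splits(ids, 5)` gives a subject-disjoint 5-fold split of the kind used for Eyediap.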
4.4 Evaluation Metric
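Following prior works [15, 24, 28], the gaze angular error is used to measure the gaze estimation performance. The gaze angular error, $\epsilon$, is calculated between the ground truth gaze $\mathbf{g}$ and the predicted gaze $\hat{\mathbf{g}}$. Before calculating $\epsilon$, the gaze angle values are converted to a three-dimensional vector, $\mathbf{v}$, of Cartesian coordinates by the following:
$$\mathbf{v} = \mathcal{T}(\mathbf{g}) \tag{5}$$
where $\mathcal{T}$ represents the conversion between two coordinate systems. Hence, $\epsilon$ is calculated by:
$$\cos\epsilon = \frac{\mathbf{v} \cdot \hat{\mathbf{v}}}{\lVert \mathbf{v} \rVert \, \lVert \hat{\mathbf{v}} \rVert} \tag{6}$$
and
$$\epsilon = \arccos\left(\frac{\mathbf{v} \cdot \hat{\mathbf{v}}}{\lVert \mathbf{v} \rVert \, \lVert \hat{\mathbf{v}} \rVert}\right) \tag{7}$$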
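The angular error computation can be sketched with NumPy as follows. The pitch/yaw-to-vector convention used here (x = -cos(pitch)·sin(yaw), y = -sin(pitch), z = -cos(pitch)·cos(yaw)) is a commonly used one in appearance-based gaze estimation and is an assumption about the conversion intended in this section; the function names are hypothetical.

```python
import numpy as np

def angles_to_vector(gaze):
    """Convert (pitch, yaw) angles in radians to 3D unit gaze vectors.

    gaze: array-like of shape (..., 2). The sign convention is an assumption.
    """
    gaze = np.asarray(gaze, dtype=np.float64)
    pitch, yaw = gaze[..., 0], gaze[..., 1]
    return np.stack([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)], axis=-1)

def angular_error_deg(pred, target):
    """Angle in degrees between predicted and ground-truth gaze directions."""
    v_pred = angles_to_vector(pred)
    v_true = angles_to_vector(target)
    cos_sim = np.sum(v_pred * v_true, axis=-1) / (
        np.linalg.norm(v_pred, axis=-1) * np.linalg.norm(v_true, axis=-1))
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
```

Averaging `angular_error_deg` over a test fold gives the mean angular error in degrees for that fold.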
4.5 Implementation Details
Both the anatomical eye region isolation and the gaze estimation networks are implemented using the PyTorch library. The setup of the regression block is tuned per dataset: while three FC layers are used for all three datasets, the number of nodes varies. For the MPIIGaze dataset, we use 512, 256, and 2 nodes for the three layers, respectively. For the Eyediap dataset, we use 256, 128, and 2 nodes, and for the UTMultiview dataset, we use 1024, 512, and 2 nodes. The AERI network is trained using an Adam optimizer [69] with an initial learning rate of 0.00001 and a batch size of 32. The network is trained for 30 epochs with a step learning rate scheduler with a step size of 5 and a decay factor of 0.1. The gaze estimation network is also trained for 30 epochs using an Adam optimizer with an initial learning rate of 0.0001 and a batch size of 32. At this step, a plateau learning rate scheduler is used with a decay factor of 0.5 and a patience of 3. We train the entire pipeline (both networks) using a single Nvidia 2080 Ti GPU. Training the AERI network took approximately 2 hours. The gaze estimator took about 0.5 hours per fold (out of 5) on the Eyediap dataset, 1.5 hours per fold (out of 15) on the MPIIGaze dataset, and 2.5 hours per fold (out of 3) on the UTMultiview dataset. The inference time of our framework is 6 ms/image on GPU and 20 ms/image on CPU, making our model well suited to real-time systems.
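To make the two learning rate schedules concrete, the sketch below reproduces their behaviour in plain Python. It is a simplified stand-in rather than the PyTorch schedulers themselves: `step_lr` and `PlateauLR` are hypothetical names, the initial learning rates in the usage note are taken from the architecture tables, and the plateau rule (decay once the monitored loss has failed to improve for more than `patience` consecutive epochs) omits details such as improvement thresholds and cooldown.

```python
def step_lr(initial_lr, epoch, step_size=5, gamma=0.1):
    """Learning rate at a given epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

class PlateauLR:
    """Simplified reduce-on-plateau schedule (assumed behaviour, see lead-in)."""

    def __init__(self, initial_lr, factor=0.5, patience=3):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the updated rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr
```

With `step_lr(0.00001, epoch)`, the rate is divided by 10 every 5 epochs over the 30-epoch run; with `PlateauLR(0.0001)`, the rate is halved only when the validation loss stalls.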
5 Results
In this section, we first present the quantitative results obtained by our model and compare them against existing methods in the area, notably the state-of-the-art. We then analyze the performance of our model in terms of robustness to noise. Next, we present qualitative results and visually demonstrate the performance of our method. This is followed by a series of ablation studies that evaluate the impact of each major component of our solution. Finally, we perform additional experiments on several network variants, feature fusion, and the impact of augmentations in our solution.
5.1 Quantitative Results
We first present the quantitative results of our AERI network. To evaluate the performance of this module, we perform hold-out validation on the synthetic dataset. We use the MSE (Eq. 3) and mean intersection over union (mIoU) as the evaluation metrics. As shown in Table 6, we obtain low segmentation errors for both the training and validation sets, while the overlap between the ground truth and predicted masks is high.
Dataset | MSE (↓) | mIoU (↑) |
---|---|---|
Train set | 0.011 | 0.875 |
Validation set | 0.098 | 0.892 |
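For reference, the mIoU metric reported above can be computed for binary masks as in the following sketch. The function names, the 0.5 binarization threshold applied to the network's sigmoid outputs, and the convention of scoring two empty masks as a perfect match are assumptions.

```python
import numpy as np

def binary_iou(pred, target, threshold=0.5):
    """Intersection over union of two masks after thresholding."""
    p = np.asarray(pred) >= threshold
    t = np.asarray(target) >= threshold
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(p, t).sum() / union)

def mean_iou(preds, targets, threshold=0.5):
    """Mean IoU over paired sequences of predicted and ground-truth masks."""
    return float(np.mean([binary_iou(p, t, threshold)
                          for p, t in zip(preds, targets)]))
```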
Method | Input | Feature Extractor | Reg. | Err. |
---|---|---|---|---|
Zhang et al. [19] | Img+pose | LeNet | FC | 6.3 |
Zhang et al. [15] | Img+pose | VGG-16 | FC | 5.5 |
Park et al. [24] | Img | Hourglass+DenseNet | FC | 4.56 |
Yu et al. [26] | Img | VGG-16 | FC | 5.35 |
Wang et al. [27] | Img | Bayesian CNN | FC | 4.3 |
Ghosh et al. [29] | Img | ResNet-50 | FC | 4.21 ± 1.90 |
MSGazeNet (Ours) | Img | U-Net+M.S. w. ResNet | FC | 4.64 ± 0.73 |
Method | Input | Feature Extractor | Reg. | Err. |
---|---|---|---|---|
Zhang et al. [15] | Img+pose | VGG-16 | FC | 6.3 |
Park et al. [20] | Img | Hourglass | SVR | 11.9 |
Park et al. [24] | Img | Hourglass+DenseNet | FC | 10.3 |
Yu et al. [18] | Img+pose | 4 Layers CNN | FC | 6.5 |
Wang et al. [27] | Img | Bayesian CNN | FC | 9.9 |
Yu et al. [28] | Img | ResNet | FC | 6.79 |
Mahmud et al. [31] | Img | U-Net+M.S. VGG-16 | FC | 6.34 |
MSGazeNet (Ours) | Img | U-Net+M.S. w. ResNet | FC | 5.86 ± 0.80 |
The error for Zhang et al. [15] is the value reported in [18].
Method | Input | Feature Extractor | Reg. | Err. |
---|---|---|---|---|
Sugano et al. [13] | Img+pose | Raster-scanner | kNN | 6.5 |
Zhang et al. [19] | Img+pose | LeNet | FC | 5.9 |
Zhang et al. [15] | Img+pose | VGG-16 | FC | 6.3 |
Yu et al. [18] | Img+pose | 4 Layers CNN | FC | 5.7 |
Wang et al. [27] | Img | Bayesian CNN | FC | 5.4 |
Yu et al. [28] | Img | ResNet | FC | 5.52 |
MSGazeNet (Ours) | Img | U-Net+M.S. w. ResNet | FC | 5.30 ± 0.57 |
The error for Zhang et al. [15] is the value reported in [18].
We follow the standard evaluation protocol [15, 24, 18, 28] described in Section 4.3 to quantify the performance of our method and compare it with the existing state-of-the-art methods on different datasets. As our focus is on person-independent gaze estimation, we only compare our results against published prior works that have adopted this validation scheme. The results are reported in Table 7, Table 8, and Table 9 in terms of angular error measured in degrees for the MPIIGaze, Eyediap, and UTMultiview datasets, respectively. It should be noted that the results for [15] on Eyediap and UTMultiview are taken from [18], since the original paper [15] did not report performance on these datasets. Our proposed model, MSGazeNet, outperforms all existing state-of-the-art methods on both the Eyediap and UTMultiview datasets while achieving performance competitive with [29] on the MPIIGaze dataset. On the Eyediap dataset, we achieve a performance gain of 7.57% (6.34° → 5.86°) over the existing state-of-the-art [31]. We also improve on the current state-of-the-art [27] on the UTMultiview dataset by 1.85% (5.4° → 5.3°). Our results support the value of using anatomical eye region isolation along with raw eye images as a multistream input for gaze estimation. Although it is not common practice in the literature, we also calculate and report the standard deviations of the performance of our method on each dataset. A considerably lower standard deviation compared to [29] on MPIIGaze (less than half) indicates the stability and certainty of our model.
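The two relative gains quoted in this paragraph follow directly from the tabulated errors when the previous best error is taken as the reference:

```latex
\frac{6.34 - 5.86}{6.34} \times 100 \approx 7.57\%,
\qquad
\frac{5.4 - 5.3}{5.4} \times 100 \approx 1.85\%
```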
5.2 Robustness to Noise
In addition to comparing the results of our proposed method with existing state-of-the-art solutions, we also perform a robustness analysis to investigate the performance of our network in the presence of noisy data. In this analysis, following [70], we first estimate the inherent noise in all the images from the previously mentioned real-world datasets (Eyediap, MPIIGaze, and UTMultiview) which we used for gaze estimation. The method estimates the variance of additive zero-mean Gaussian noise in a given image $I$ by convolving a noise operator, $N$, with the image. The noise operator $N$ is a 3×3 mask defined as follows:
$$N = \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix} \tag{8}$$
$N$ has zero mean and a variance of $36\sigma_n^2$. The noise variance, $\sigma_n^2$, of $I$ after applying $N$ can be computed by the following:
$$\sigma_n^2 = \frac{1}{36\,(W-2)(H-2)} \sum_{x,y} \left( I(x,y) * N \right)^2 \tag{9}$$
where $W$ and $H$ are the width and height of the given image $I$, and $I(x,y) * N$ represents the convolution operation at position $(x,y)$ of $I$. Next, we evaluate the performance of our network against the estimated noise variances in the input images and compare with prior works [19], [15], and [24], for which public implementations are available (at the time of performing this study, the code for the other techniques had not been made public). Our findings are plotted in Figure 5, where we observe that as the amount of noise in the data increases, our model shows comparatively less deterioration than prior works, indicating better robustness to noise. However, in the presence of significantly higher levels of noise, such as a noise variance exceeding 10 for the MPIIGaze and UTMultiview datasets and 3.5 for the Eyediap dataset, the performance of all the tested solutions, including ours, deteriorates.
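The noise estimate used in this analysis can be sketched with NumPy as below. The 3×3 operator is the difference-of-Laplacians mask commonly used for fast noise variance estimation, which we assume is the operator intended by [70]; the function name is hypothetical, and the convolution is evaluated only at the (W-2)×(H-2) positions where the mask fits entirely inside the image.

```python
import numpy as np

# Assumed noise operator: zero-sum 3x3 mask whose squared entries sum to 36.
NOISE_OPERATOR = np.array([[1.0, -2.0, 1.0],
                           [-2.0, 4.0, -2.0],
                           [1.0, -2.0, 1.0]])

def estimate_noise_variance(image):
    """Estimate the variance of additive zero-mean Gaussian noise in an image."""
    img = np.asarray(image, dtype=np.float64)
    if img.ndim != 2 or img.shape[0] < 3 or img.shape[1] < 3:
        raise ValueError("expected a 2D image of at least 3x3 pixels")
    h, w = img.shape
    # "Valid" 3x3 convolution; the mask is symmetric, so flipping is a no-op.
    response = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            response += NOISE_OPERATOR[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return float(np.sum(response ** 2) / (36.0 * (w - 2) * (h - 2)))
```

Because the mask sums to zero and cancels linear intensity ramps, smooth image content contributes little to the estimate, whereas independent pixel noise of variance σ² produces a response of variance 36σ².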
![Refer to caption](x12.png)
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](x19.png)
![Refer to caption](x20.png)
![Refer to caption](x21.png)
5.3 Qualitative Results
To carry out a qualitative analysis of the performance of MSGazeNet across different datasets, following [21], we show a heatmap distribution of the predictions versus the ground truth gaze labels in Figure 6. The top row illustrates the relationship between predictions and ground truths for the pitch angles ($\theta$), while the bottom row shows the same relationship for the yaw angles ($\phi$). The plots for all three datasets exhibit an almost linear relationship between the predicted and the ground truth gaze angles. This further verifies the performance of our proposed framework.
Next, we illustrate the output of our anatomical eye region isolation network and visualize the gaze angle predictions versus the ground truth gaze angles from the three benchmark datasets in Figure 7. The samples in each row are drawn from different participants within a dataset. The leftmost images of each row are comparatively less noisy with little to no occlusion, while the rightmost images are noisier or have more occlusions. From the figure, we can see that even in the presence of extreme noise, poor resolution, or partial occlusions, the eye region masks can still capture the eye anatomy and eventually aid in estimating gaze. As expected, the performance of our method degrades as the amount of noise and artifacts increases, yet as shown in Figure 5, our model is relatively robust against such factors.
5.4 Ablation Studies
We conduct thorough ablation experiments to evaluate the design choices and various components in our proposed neural architecture. Below, we describe these experiments and their results.
Impact of Each Stream. In our work, we hypothesize that using information from isolated anatomical eye regions in the model would aid gaze estimation. Here, we explore the performance of our network using only one input stream at a time and subsequently adding the other streams for gaze estimation. To this end, we systematically remove each branch of MSGazeNet and evaluate the performance. First, we remove the two branches of the network responsible for learning representations from the visible eyeball and iris masks. The results presented in Table 10 show that by removing the branches related to the anatomical eye regions, our model suffers performance degradation compared to the full model. Next, we remove the branch of the network that learns representations from the full eye image, and present the results in Table 10, where again we observe a performance drop. Subsequently, we remove only the visible eyeball stream, followed by the removal of only the iris stream, to compare the gaze estimation performance using different combinations of input data. Both ablations likewise result in performance drops, indicating that the full MSGazeNet is better able to learn representations that are useful for gaze estimation.
Model | MPIIGaze | Eyediap | UTMultiview |
---|---|---|---|
MSGazeNet (full network) | 4.64 ± 0.73 | 5.86 ± 0.80 | 5.30 ± 0.57 |
Eye regions removed | 4.95 ± 0.69 | 6.19 ± 0.78 | 5.51 ± 0.72 |
Eye image removed | 5.27 ± 0.66 | 6.88 ± 0.64 | 7.36 ± 0.76 |
Visible eyeball region removed | 4.83 ± 0.65 | 6.27 ± 0.78 | 6.97 ± 0.79 |
Iris region removed | 4.89 ± 0.70 | 6.37 ± 0.92 | 7.27 ± 0.52 |
Conv block 2 removed | 4.91 ± 0.82 | 6.58 ± 0.80 | 6.75 ± 0.65 |
Conv blocks 2 and 3 removed | 5.65 ± 0.50 | 9.70 ± 0.68 | 9.22 ± 0.78 |
Impact of AERI Network. The isolation network for the anatomical eye regions plays a vital role in our proposed framework. To evaluate the significance of this component, we substitute the AERI network with an equivalent counterpart. We use an off-the-shelf facial landmark extractor, OpenFace 2.0 [71], to extract eye landmarks and subsequently create the eye region masks from the extracted landmarks. These masks and the corresponding eye images are used to perform gaze estimation, enabling a comparative performance analysis. Because OpenFace 2.0 requires face images for landmark extraction and these are unavailable for the UTMultiview dataset, we could not use that dataset for this experiment. Our findings, presented in Table 11, reveal that gaze estimation using the masks provided by the AERI network outperforms gaze estimation using masks derived from the OpenFace 2.0 toolkit. While both models predict outputs for real-world images, OpenFace 2.0 estimates high-dimensional landmark coordinates, and its predictions evidently contain more noise. In contrast, we simplify the process by predicting low-dimensional binary masks for the eye regions, and experimentally show that the predicted masks contribute more to accurate gaze estimation. This experiment illustrates that the AERI network can accurately capture crucial eye regions from real-world eye images to ensure more robust gaze estimation.
Model | MPIIGaze | Eyediap |
---|---|---|
MSGazeNet (w/ AERI) | 4.64 ± 0.73 | 5.86 ± 0.80 |
MSGazeNet (w/ OpenFace 2.0) | 4.98 ± 0.71 | 6.31 ± 0.72 |
Impact of Network Depth. Here, we investigate the performance of gaze estimation by varying the depth of our network. To this end, we systematically remove the convolutional blocks from all three branches and study the variation in performance. First, we remove conv block 2 from each branch of our proposed network. Next, we further reduce the depth by removing conv block 3 from the remaining network. The results presented in Table 10 reveal that the removal of the two conv blocks considerably deteriorates the performance.
5.5 Network Variants
Here, we create a number of variants of our model in order to further validate our network design choices. A key element of our proposed neural network for gaze estimation is the use of separate conv blocks to encode each individual input and extract input-specific representations. We therefore create three different variants of our network to explore the possible encoding strategies for the three inputs. These variants, which we describe below along with their performance, have all been trained and evaluated following the same setup and protocol as our proposed model.
First, we construct Variant 1, in which a single series of conv blocks is used to encode all three inputs (eye image and the two anatomical eye regions). The results presented in Table 12 show that for all three datasets, this strategy gives results that are worse than our initial design, and in fact below state-of-the-art methods such as [24, 27, 29] for the MPIIGaze dataset and [15, 18, 27, 28] for the UTMultiview dataset.
Next, we create Variant 2, in which the two anatomical eye region inputs are encoded together, while the full eye image is encoded separately via a separate set of conv blocks. Similar to Variant 1, the results in Table 12 indicate that deviating from our initial design choice of encoding each input separately degrades the performance and places the results below [24, 27, 29] for the MPIIGaze dataset and [27, 28] for the UTMultiview dataset.
Lastly, we explore the notion of extracting representations from the three branches with encoders that share weights. This strategy would considerably reduce the number of trainable parameters in the overall network. To this end, we create Variant 3 that encodes each input stream separately with a series of conv blocks that share weights. The results presented in Table 12 show that this variant, similar to the other variants, degrades the performance, and falls below [24, 27, 29] for MPIIGaze dataset, and [27] for UTMultiview dataset.
Model | MPIIGaze | Eyediap | UTMultiview |
---|---|---|---|
MSGazeNet | 4.64 ± 0.73 | 5.86 ± 0.80 | 5.30 ± 0.57 |
Var. 1 (1 encoding br.) | 4.85 ± 0.76 | 5.97 ± 0.85 | 5.94 ± 0.89 |
Var. 2 (2 encoding br.) | 4.81 ± 0.77 | 5.93 ± 0.85 | 5.57 ± 0.73 |
Var. 3 (3 encoding br., shared) | 4.83 ± 0.66 | 5.97 ± 0.91 | 5.45 ± 0.57 |
Model | MPIIGaze | Eyediap | UTMultiview |
---|---|---|---|
MSGazeNet | 4.64 ± 0.73 | 5.86 ± 0.80 | 5.30 ± 0.57 |
Late feature fusion | 4.87 ± 0.82 | 6.12 ± 0.79 | 6.08 ± 1.02 |
Early feature fusion | 4.81 ± 0.69 | 6.02 ± 0.95 | 5.52 ± 0.64 |
5.6 Impact of Feature Fusion
Given that our network uses separate conv blocks to extract features from each input, a feature fusion stage in our proposed gaze estimation architecture concatenates the individual feature maps in the channel dimension. Next, the fused features are fed to another conv block which is used to extract combined features. In this experiment, we create two baseline networks by modifying the position of the feature fusion stage to create late and early feature fusion variants. To implement the late feature fusion, we place the feature fusion stage after conv block 3 instead of the original design. On the contrary for early feature fusion, we apply feature fusion prior to conv block 2. The results are presented in Table 13, where we observe that for all the datasets, we obtain the best performance from our proposed network compared to the network variants. Specifically, we see a performance drop in the range of 2.66% to a maximum of 4% in terms of mean angular error across different datasets using the different fusion strategies.
Noise | Blur | Cutout | Scale | Lines | Contrast | MPIIGaze | Eyediap | UTM |
---|---|---|---|---|---|---|---|---|
– | – | – | – | – | – | 4.85 | 6.14 | 5.53 |
✓ | – | – | – | – | – | 4.73 | 5.98 | 5.42 |
– | ✓ | – | – | – | – | 4.77 | 6.14 | 5.39 |
– | – | ✓ | – | – | – | 4.74 | 6.11 | 5.40 |
– | – | – | ✓ | – | – | 4.76 | 6.02 | 5.42 |
– | – | – | – | ✓ | – | 4.74 | 5.94 | 5.38 |
– | – | – | – | – | ✓ | 4.73 | 5.97 | 5.45 |
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 4.64 | 5.86 | 5.30 |
5.7 Impact of Domain Randomization
Domain randomization is an essential step in our solution for transferring the anatomical eye region isolation network, which was trained with synthetic and simulated data, into the real domain. This synthetic data does not contain the various types of noise and artifacts which would normally be encountered with real datasets, in this case MPIIGaze, Eyediap, and UTMultiview. Hence, to ensure a significant domain mismatch does not occur, we apply different strong augmentations when training the eye region isolation network. Here, we study the influence of the applied augmentations. To this end, we first train the network without any augmentations, and then add one augmentation at a time. In the end, all the augmentations are applied. Throughout these individual steps, the size of the training data remained the same as the original 60,000 images. The trained network is then integrated into the MSGazeNet framework. It can be observed from the results shown in Table 14 that when no augmentations are used, the gaze performance drops for all the datasets. This performance drop, should it have not been prevented, would have resulted in lower performances in comparison to [24, 27, 29] for MPIIGaze, and [27, 28] for UTMultiview dataset. Our experiments further show that adding ‘random lines’ seems to be the most effective augmentation in the context of our study. Nonetheless, the combination of all the augmentations yields the best results, pushing the performance of our model past the state-of-the-art on Eyediap and UTMultiview.
6 Conclusion and Future Work
In this work, we present MSGazeNet, a novel gaze estimator. Our solution performs person-independent gaze estimation and consists of two integral parts, namely anatomical eye region isolation and multistream gaze estimation. The anatomical eye region isolation is a crucial component of our framework and is trained solely with synthetic data due to the scarcity of detailed and accurate eye region annotations in real-world gaze estimation datasets. To this end, we procure a synthetic dataset using the UnityEyes eye-gaze simulator. Our dataset consists of 80,000 images along with the visible eyeball region and iris masks. This dataset is used to train a U-Net style model to isolate eye regions given an input eye image. To allow this network to be integrated downstream into a model for real-world (not synthetic) gaze estimation, we perform domain randomization using a variety of artifact-like augmentations, which helps to narrow the domain gap. The eye region isolation network is then transferred into our gaze estimation pipeline, which consists of a multistream architecture. The network takes raw real-world eye images along with the eye region masks to predict the gaze direction. We perform various experiments and demonstrate that our solution achieves strong results, reaching the state-of-the-art on the Eyediap and UTMultiview datasets and exhibiting competitive performance on the MPIIGaze dataset. Our robustness experiments show that our model is more robust to noise than the other tested methods. Detailed ablation studies and comparisons with several variants quantify the impact of each component in our network and validate our design choices.
In this study, we demonstrated the relevance and importance of using information pertaining to anatomical eye regions towards gaze estimation. This work can serve to motivate further research into using various regions of the eye for gaze estimation. We believe such approaches can lead to learning better and more generalized gaze representations. In addition to the above, a key area to investigate could be the implementation of semi-supervised learning to take advantage of large amounts of unlabeled real-world eye images while training the eye region isolation network. This could improve the eye region isolation module and further enhance the overall performance. Another important scope of research could be the consideration of using 3D representations of key eye regions (i.e. depth maps) which can be constructed from 3D landmarks. It is likely that the 3D representations would contain richer anatomical information about the eye in terms of eyeball curvature or iris contour, which would better aid gaze estimation.
Acknowledgment. The authors would like to thank Innovation for Defence Excellence and Security (IDEaS) program for funding this project. The authors would also like to thank Dr. Dirk Rodenburg for his help throughout the project.
References
- [1] L. R. Young and D. Sheena, “Survey of eye movement recording methods,” Behavior Research Methods & Instrumentation, vol. 7, no. 5, pp. 397–429, 1975.
- [2] J. Zagermann, U. Pfeil, and H. Reiterer, “Measuring cognitive load using eye tracking technology in visual computing,” in Proceedings of the 6th Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization, 2016, pp. 78–85.
- [3] Y. Yamada and M. Kobayashi, “Detecting mental fatigue from eye-tracking data gathered while watching video: Evaluation in younger and older adults,” Artificial Intelligence in Medicine, vol. 91, pp. 39–48, 2018.
- [4] C. Jyotsna and J. Amudha, “Eye gaze as an indicator for stress level analysis in students,” International Conference on Advances in Computing, Communications and Informatics, pp. 1588–1593, 2018.
- [5] P. Majaranta and A. Bulling, “Eye tracking and eye-based human–computer interaction,” Advances in Physiological Computing, pp. 39–65, 2014.
- [6] S. Andrist, X. Z. Tan, M. Gleicher, and B. Mutlu, “Conversational gaze aversion for humanlike robots,” Proceedings of the 9th ACM/IEEE International Conference on Human-Robot Interaction, pp. 25–32, 2014.
- [7] H. Liu and I. Heynderickx, “Visual attention in objective image quality assessment: Based on eye-tracking data,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 7, pp. 971–982, 2011.
- [8] H. M. Park, S. H. Lee, and J. S. Choi, “Wearable augmented reality system using gaze interaction,” in 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, 2008, pp. 175–176.
- [9] V. Clay, P. König, and S. Koenig, “Eye tracking in virtual reality,” Journal of Eye Movement Research, vol. 12, no. 1, 2019.
- [10] T. Louw and N. Merat, “Are you in the loop? using gaze dispersion to understand driver visual attention during vehicle automation,” Transportation Research Part C: Emerging Technologies, vol. 76, pp. 35–50, 2017.
- [11] S. Baluja and D. Pomerleau, “Non-intrusive gaze tracking using artificial neural networks,” Advances in Neural Information Processing Systems, vol. 6, 1993.
- [12] K.-H. Tan, D. J. Kriegman, and N. Ahuja, “Appearance-based eye gaze estimation,” in 6th IEEE Workshop on Applications of Computer Vision, 2002, pp. 191–195.
- [13] Y. Sugano, Y. Matsushita, and Y. Sato, “Learning-by-synthesis for appearance-based 3d gaze estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1821–1828.
- [14] K. A. Funes Mora, F. Monay, and J.-M. Odobez, “Eyediap: A database for the development and evaluation of gaze estimation algorithms from rgb and rgb-d cameras,” in Proceedings of the Symposium on Eye Tracking Research and Applications, 2014, pp. 255–258.
- [15] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, “Mpiigaze: Real-world dataset and deep appearance-based gaze estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 1, pp. 162–175, 2017.
- [16] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
- [17] K. Lee, H. Kim, and C. Suh, “Simulated+unsupervised learning with adaptive data generation and bidirectional map**s,” in International Conference on Learning Representations, 2018.
- [18] Y. Yu, G. Liu, and J.-M. Odobez, “Deep multitask gaze estimation with a constrained landmark-gaze model,” in Proceedings of the European Conference on Computer Vision Workshops, 2018.
- [19] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, “Appearance-based gaze estimation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
- [20] S. Park, X. Zhang, A. Bulling, and O. Hilliges, “Learning to find eye region landmarks for remote gaze estimation in unconstrained settings,” in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2018, pp. 1–10.
- [21] N. Sinha, M. Balazia, and F. Bremond, “Flame: Facial landmark heatmap activated multimodal gaze estimation,” in 17th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2021, pp. 1–8.
- [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer Assisted Intervention. Springer, 2015, pp. 234–241.
- [23] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling, “Learning an appearance-based gaze estimator from one million synthesised images,” in Proceedings of the 9th Biennial ACM Symposium on Eye Tracking Research & Applications, 2016, pp. 131–138.
- [24] S. Park, A. Spurr, and O. Hilliges, “Deep pictorial gaze estimation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 721–738.
- [25] B. A. Smith, Q. Yin, S. K. Feiner, and S. K. Nayar, “Gaze locking: passive eye contact detection for human-object interaction,” in Proceedings of the 26th annual ACM symposium on User interface software and technology, 2013, pp. 271–280.
- [26] Y. Yu, G. Liu, and J.-M. Odobez, “Improving few-shot user-specific gaze adaptation via gaze redirection synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 937–11 946.
- [27] K. Wang, R. Zhao, H. Su, and Q. Ji, “Generalizing eye tracking with bayesian adversarial learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 907–11 916.
- [28] Y. Yu and J.-M. Odobez, “Unsupervised representation learning for gaze estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7314–7324.
- [29] S. Ghosh, M. Hayat, A. Dhall, and J. Knibbe, “Mtgls: Multi-task gaze estimation with limited supervision,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3223–3234.
- [30] P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba, “Gaze360: Physically unconstrained gaze estimation in the wild,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6912–6921.
- [31] Z. Mahmud, P. Hungler, and A. Etemad, “Gaze estimation with eye region segmentation and self-supervised multistream learning,” AAAI Workshop on Human-Centric Self-Supervised Learning, 2022.
- [32] X. Cai, J. Zeng, S. Shan, and X. Chen, “Source-free adaptive gaze estimation by uncertainty reduction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 035–22 045.
- [33] X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges, “Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation,” in Proceedings of the 16th European Conference on Computer Vision. Springer, 2020, pp. 365–381.
- [34] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba, “Eye tracking for everyone,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2176–2184.
- [35] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1905–1914.
- [36] S. **, J. Dai, and T. Nguyen, “Kappa angle regression with ocular counter-rolling awareness for gaze estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2658–2667.
- [37] S. **dal and R. Manduchi, “Contrastive representation learning for gaze estimation,” in Proceedings of The 1st Gaze Meets ML Workshop. PMLR, 2023, pp. 37–49.
- [38] S. Park, E. Aksan, X. Zhang, and O. Hilliges, “Towards end-to-end video-based eye-tracking,” in Proceedings of the 16th European Conference on Computer Vision. Springer, 2020, pp. 747–763.
- [39] F. Martinez, A. Carbone, and E. Pissaloux, “Gaze estimation using local features and non-linear regression,” in 19th IEEE International Conference on Image Processing, 2012, pp. 1961–1964.
- [40] M. X. Huang, T. C. Kwok, G. Ngai, H. V. Leong, and S. C. Chan, “Building a self-learning eye gaze model from user interaction data,” in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 1017–1020.
- [41] X. Xiong, Z. Liu, Q. Cai, and Z. Zhang, “Eye gaze tracking using an rgbd camera: A comparison with a rgb solution,” in Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 2014, pp. 1113–1121.
- [42] E. Wood and A. Bulling, “Eyetab: Model-based gaze estimation on unmodified tablet computers,” in Proceedings of the Symposium on Eye Tracking Research and Applications, 2014, pp. 207–210.
- [43] E. Wood, T. Baltrušaitis, L.-P. Morency, P. Robinson, and A. Bulling, “A 3d morphable eye region model for gaze estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 297–313.
- [44] K. Wang and Q. Ji, “Real time eye gaze tracking with 3d deformable eye-face model,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1003–1011.
- [45] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
- [46] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
- [47] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
- [48] E. Wood, T. Baltrušaitis, X. Zhang, Y. Sugano, P. Robinson, and A. Bulling, “Rendering of eyes for eye-shape registration and gaze estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3756–3764.
- [49] Y. Cheng, Y. Bao, and F. Lu, “Puregaze: Purifying gaze feature for generalizable gaze estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
- [50] G. Liu, Y. Yu, K. A. F. Mora, and J.-M. Odobez, “A differential approach for gaze estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 1092–1099, 2019.
- [51] S. Park, S. D. Mello, P. Molchanov, U. Iqbal, O. Hilliges, and J. Kautz, “Few-shot adaptive gaze estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9368–9377.
- [52] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 87–102.
- [53] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 13th IEEE International Conference on Automatic Face and Gesture Recognition), 2018, pp. 67–74.
- [54] “Casia iris image database,” https://hycasia.github.io/dataset/casia-irisv4/, 2004.
- [55] H. Proença and L. A. Alexandre, “Ubiris: A noisy iris image database,” in 13th International Conference on Image Analysis and Processing. Springer, 2005, pp. 970–977.
- [56] H. Proença, S. Filipe, R. Santos, J. Oliveira, and L. A. Alexandre, “The ubiris. v2: A database of visible wavelength iris images captured on-the-move and at-a-distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1529–1535, 2009.
- [57] S. J. Garbin, Y. Shen, I. Schuetz, R. Cavin, G. Hughes, and S. S. Talathi, “Openeds: Open eye dataset,” arXiv preprint arXiv:1905.03702, 2019.
- [58] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
- [59] P. Kansal and S. Devanathan, “Eyenet: Attention based convolutional encoder-decoder network for eye region segmentation,” in IEEE/CVF International Conference on Computer Vision Workshop, pp. 3688–3693.
- [60] S.-H. Kim, G.-S. Lee, H.-J. Yang et al., “Eye semantic segmentation with a lightweight model,” in IEEE/CVF International Conference on Computer Vision Workshop, 2019, pp. 3694–3697.
- [61] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
- [62] A. K. Chaudhary, R. Kothari, M. Acharya, S. Dangi, N. Nair, R. Bailey, C. Kanan, G. Diaz, and J. B. Pelz, “Ritnet: Real-time semantic segmentation of the eye for gaze tracking,” in IEEE/CVF International Conference on Computer Vision Workshop, 2019, pp. 3698–3702.
- [63] J. Perry and A. S. Fernandez, “Eyeseg: Fast and efficient few-shot semantic segmentation,” in in Proceedings of the European Conference on Computer Vision Workshops. Springer, 2020, pp. 570–582.
- [64] A. K. Chaudhary, P. K. Gyawali, L. Wang, and J. B. Pelz, “Semi-supervised learning for eye image segmentation,” in ACM Symposium on Eye Tracking Research and Applications, 2021, pp. 1–7.
- [65] Y. Shen, O. Komogortsev, and S. S. Talathi, “Domain adaptation for eye segmentation,” in in Proceedings of the European Conference on Computer Vision Workshops. Springer, 2020, pp. 555–569.
- [66] S. Zagoruyko and N. Komodakis, “Wide residual networks,” 27th British Machine Vision Conference, 2016.
- [67] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017, pp. 23–30.
- [68] X. Zhang, Y. Sugano, and A. Bulling, “Revisiting data normalization for appearance-based gaze estimation,” in Proceedings of the ACM Symposium on Eye Tracking Research & Applications, 2018, pp. 1–9.
- [69] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2015.
- [70] J. Immerkaer, “Fast noise variance estimation,” Computer Vision and Image Understanding, vol. 64, no. 2, pp. 300–302, 1996.
- [71] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “Openface 2.0: Facial behavior analysis toolkit,” in 13th IEEE International Conference on Automatic Face & Gesture Recognition, 2018, pp. 59–66.