Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking

Jiading Li, Zhiyu Zhu, **hui Hou, Junhui Hou, Senior Member, IEEE, and **jian Wu, Senior Member, IEEE

This work was supported in part by Hong Kong Research Grants Council under Grant 11218121, and in part by Hong Kong Innovation and Technology Fund under Grant MHP/117/21. J. Li and Z. Zhu contributed equally to this work. J. Li, Z. Zhu, J. Hou, and J. Hou are with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR. Email: [email protected]; [email protected]; [email protected]; [email protected]. Wu is with the School of Artificial Intelligence, Xidian University, Xi’an 710071, China. Email: [email protected]

Abstract

This paper tackles the problem of passive gaze estimation using both event and frame data. Considering the inherently different physiological structures, it is intractable to accurately estimate gaze purely based on a given state. Thus, we reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine approach involving anchor state selection and final gaze location. Moreover, to improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced to utilize denoising diffusion techniques to iteratively remove inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15%. The code will be publicly available at https://github.com/jdjdli/Denoise_distill_EF_gazetracker.

Index Terms:

Event-based vision, gaze estimation, latent distillation.

I Introduction

Eye gaze tracking represents a critical component in non-verbal communication, offering profound insights into an individual’s intentions and emotional states by capturing visual information from the user’s eyes. This technology can infer desires and needs by measuring the user’s attention on specific objects, making it invaluable in fields like human-computer interaction [1, 2, 3], virtual reality [4, 5], and intelligent transportation [6, 7].

Over recent decades, gaze estimation has seen rapid growth in diverse methodologies, which can be generally divided into three categories, i.e., 2D eye feature regression-based approaches [8, 9, 10, 11], 3D model-based eye movement reconstruction algorithms [12], and appearance-based gaze estimation techniques [13]. Although the first two categories excel in tracking the eye’s position and movement, they typically rely on specialized hardware systems and lighting conditions. Conversely, appearance-based methods only utilize face or eye images as input to train a mapping model between appearance and gaze, thereby determining the corresponding gaze based on new appearance data [14, 15, 13, 16]. However, appearance-based methods are sensitive to individual differences and head movements in unconstrained environments. Addressing these challenges often necessitates high-speed, high-resolution RGB and optical cameras to capture more detailed visual information, leading to significant costs and energy demands [17, 18].

Figure 1: Left: Overview of our gaze estimation setup: our framework emphasizes the modeling of gaze shifts from a registered anchor state to the currently acquired state captured during actual use. Our approach takes input in the form of a frame coupled with corresponding event data to infer the position of the directional gaze point as the output. Right: Beyond the confines of static frame-based gaze estimation, studying dynamic ocular movements constitutes an additional research trajectory within computer vision.

Recently, event-based cameras have surged in popularity due to their exceptional temporal resolution, low latency, and expansive dynamic range, making them ideal for tracking fast eye movements like saccades with high fidelity [19]. Additionally, the high dynamic range allows them to operate well under varying lighting conditions, which is crucial for accurate gaze estimation in real-world scenarios. Despite the advantages, typical event-based sensors provide limited visual information, missing out on color, texture, and comprehensive contextual details that are readily available through conventional RGB imaging [20, 21, 22, 23]. Therefore, the emerging field of cross-modal gaze estimation promises to enhance the robustness and accuracy of gaze tracking by capitalizing on the complementary strengths of both data streams [24, 25, 26, 27, 28]. However, most existing methods leverage only one modality to assist the other, failing to take full advantage of the information contained within both data types.

In this paper, we propose a novel transformer-based gaze estimation framework that leverages the complementary strengths of event cameras and traditional frame-based imaging, as shown in Fig. 1. By synergizing the high temporal resolution of event data with the rich spatial information of frames via a local-global distillation process, our method achieves a new level of performance in gaze tracking. Technically, we reformulate gaze estimation as the quantification of eye motion transitions from the current state to several prior registered anchor states. Based on this, we first partition the entire gaze point region into several sub-regions and employ vision transformers to pre-train a set of models on different sub-regions, yielding several local expert networks with relatively high accuracy for localized gaze estimation. Furthermore, we introduce a local-global latent denoising distillation method to distill knowledge from the set of local expert networks to a global student network, diminishing the adverse effects of inherent noise from event data on student network performance. Extensive experiments demonstrate the significant superiority of the proposed method over state-of-the-art methods.

In summary, the main contributions of this paper are three-fold:

• we formulate gaze estimation as an end-to-end prediction of state shifting from a registered anchor state;

• we distill multiple pre-trained local expert networks into a more robust student network to combat overfitting in gaze estimation; and

• we propose a self-supervised latent denoising method to mitigate the adverse effects of noise from local expert networks to improve the performance of the student network.

The remainder of this paper is organized as follows. Section II briefly reviews related works in this field. Section III presents the proposed method in detail, followed by extensive experiments and analysis in Section IV. Finally, Section V concludes this paper.

II Related Work

In this section, we give a review of event-based vision, eye tracking, event-frame methods, and distillation networks.

Event-based Vision. Neuromorphic event-based cameras, inspired by the Silicon Retina concept [29, 30], are key for fast vision tasks due to their quick response and low latency [31, 32, 33, 34]. They support a wide range of applications including object recognition [35, 21], navigation [36], pose estimation [37], 3D reconstruction [38], SLAM [39, 40], gesture tracking [41], and object tracking [42, 43, 44, 45]. Initially, these cameras used event patterns to detect motion [46, 47, 48, 49] and track simple shapes [50, 51]. Later improvements introduced event-driven algorithms [52] and optimization techniques like gradient descent to refine tracking [53]. Algorithms such as mean-shift and Monte Carlo [54, 55] have further enhanced tracking by adjusting to changes in the model. For instance, part-based models [55] have segmented subjects into parts, enabling quick and accurate tracking of facial or body movements [56]. Chen et al. [57] introduced a sparse Change-Based ConvLSTM model for efficient event-based eye tracking. Stoffregen et al. [58] presented a novel method for high-frequency, low-power eye tracking using event cameras and a coded differential lighting scheme to enhance corneal glint detection while suppressing non-glint events. Wang et al. [59] introduced a unified single-stage transformer-based framework for efficient and accurate color-event object tracking. Ryan et al. [60] introduced a novel method for real-time face and eye tracking, as well as blink detection by using event cameras. It leverages a fully convolutional recurrent neural network architecture to enhance driver monitoring systems. However, the sparse nature of event data can be problematic in low-contrast settings, and using algorithms designed for dense data in sparse situations may still increase computational demands.

Eye Tracking. Progressing from initial camera-based systems of eye tracking that monitored Purkinje images [61, 62, 63, 64, 65, 66], Morimoto et al. [67] and Duchowski et al. [68] provided detailed examinations of these pupil modeling and gaze estimation processes. Contemporary research focuses on deep learning to deduce gaze orientation from complex facial datasets obtained via standard webcams, directly mapping the visual characteristics of the eye to gaze coordinates, with Chen et al. [69] enhancing accuracy through dilated convolution techniques. Advancing the field, Cheng et al. [16] integrated full face and eye region data for more accurate gaze inference and incorporated transformer models to exploit their superior handling of data dependencies. Nonetheless, these advanced models are best suited to full-face images and are not ideal for eye-only cameras, which require custom-designed networks for accurate data analysis. Also, their effectiveness is limited by the camera’s frame rate, which can affect their real-time accuracy and performance.

Event-frame Methods. Hybrid methods combine the detailed intensity data from a standard frame with the rapid detection of intensity changes from asynchronous event streams [25, 26, 27, 28]. Wang et al. [44] introduced a cross-modality transformer algorithm for enhancing reliable object tracking by combining visible and event camera data, demonstrating improved performance in challenging scenarios. Feng et al. [25] developed an event-driven eye segmentation algorithm that overcomes the limitations of standard frame rates, maintaining high accuracy despite a lower resolution. The auto-segmentation model was designed to combine with a previous gaze estimation model to improve estimation accuracy. In fact, it was not an end-to-end estimation model and the prediction latency was very significant. Angelopoulos et al. [19] enhanced the temporal resolution of gaze tracking by integrating event cameras close to the eyes. These cameras provided constant updates to the initial pupil location determined by traditional algorithms, allowing for precise adjustments at high temporal resolutions. However, the conventional baseline can significantly affect the performance of these methods [70, 71]. This approach resulted in gaze estimation accuracy varying within a wide range.

Distillation Networks. Knowledge distillation [72] is intended for the transfer of learned features from a "teacher" model to an efficient "student" model. Wang et al. [73] presented a novel hierarchical knowledge distillation framework for high-speed, low-latency visual object tracking using event cameras. Lopez-Paz et al. [74] expanded upon this concept by introducing privileged information, wherein the student model leverages insights from multiple teacher models, each accessing different data sources. Xiang et al. [75] utilized the collective intelligence of several teacher models to tackle the long-tailed distribution problems. Guo et al. [76] formulated a collaborative learning framework, similar to a congregation of local experts sharing their knowledge. Therefore, the essence of these expert networks’ collective intelligence is distilled into a singular, more generalized student network [77, 78, 79], with an enhanced level of accuracy, surpassing what could be achieved by an individual model trained on a uniform dataset. This kind of distillation has been proven to be suitable for the end-to-end gaze estimation task, with the detailed information provided in Sec. III.

This study fuses conventional intensity frames with dynamic event camera data into a unified network architecture, leveraging pre-trained local expert models. It aims to utilize this integration to create an end-to-end, advanced, and accurate eye-tracking method, harnessing the unique benefits of both traditional frame modality and innovative event-based sensor input.

III Proposed Method

Gaze tracking aims to determine the gaze location, given the measured state. However, the high speed of eye movement and the subtle pattern of the eyeball make it hard to derive accurate predictions. Inspired by the high-temporal resolution and low-latency characteristics of event-based vision [31, 32, 33, 34], we propose to utilize frames together with event data for building an accurate near-eye gaze estimation pipeline, as illustrated in Fig. 2.

Specifically, due to the different individual biometric characteristics, e.g., pupil distance and size of the eyeball, there would be a significant bias in the gaze estimation process [68]. Consequently, instead of directly calculating the absolute location of gaze focus from a single observational state, we propose to calculate the relative shift of the measured state compared with pre-registered anchor states. Moreover, to learn the correlation between those two states in an end-to-end manner, we delve into the potential of utilizing pre-trained vision transformers for cross-modal eye tracking. However, directly training the gaze estimation network in a large region usually leads to overfitting phenomena, as shown in Fig. 3. Thus, we train a set of sub-region gaze estimation models for different anchor states, which are called local expert networks (see Sec. III-A).

Subsequently, to further boost the capacity for accurate gaze prediction across diverse scenarios, we design a distillation-based algorithm to ensemble knowledge of pre-trained local expert networks into a large student network. However, the presence of noise in the measured inputs (especially for the event data) can disrupt neural network training and negatively impact performance. To alleviate the potential noise influencing the neural network training, we propose a self-supervised latent denoising neural network for feature maps from experts and then apply knowledge distillation to the student network (see Sec. III-B).

In what follows, we will detail the proposed pipeline.

III-A State Correlation Modeling by Local Expert

Event-Frame Tokenization. Our framework takes a paired near-eye frame and the corresponding event stream as input. To effectively fuse these two data modalities, we transform them into a unified representation. Specifically, we first aggregate events within specific time intervals to convert the asynchronous event flow into a synchronous format, aligning with frame exposure duration. We then voxelize the original event stream into a grid of voxels using PointNet [80]. This voxelization enables us to represent the event data in a structured format analogous to the frame data. Finally, we tokenize both the frame and the voxelized event data into sequences of spatial tokens, where each token corresponds to a specific location in the frame and its associated event voxel. This tokenized representation facilitates seamless fusion and processing of event and frame information within our model.
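As a rough illustration of the aggregation step, the sketch below bins the events of one exposure window into a fixed number of temporal slices and accumulates polarities per pixel. It is a plain binning scheme rather than the PointNet-based voxelization described above; the event layout `(x, y, t, polarity)`, the function name `voxelize_events`, and the bin count are assumptions made for the example.

```python
def voxelize_events(events, width, height, num_bins, t_start, t_end):
    """Accumulate event polarities into a [num_bins][height][width] grid.

    `events` is an iterable of (x, y, t, polarity) tuples with polarity in
    {+1, -1}; this layout is assumed for illustration only.
    """
    duration = t_end - t_start
    if duration <= 0:
        raise ValueError("t_end must be greater than t_start")
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    for x, y, t, polarity in events:
        # Skip events outside the exposure window or the sensor area.
        if not (t_start <= t < t_end and 0 <= x < width and 0 <= y < height):
            continue
        # Map the timestamp to a temporal slice index.
        b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        grid[b][y][x] += polarity
    return grid


# Three toy events inside a 1-second window, split into two temporal bins.
events = [(0, 0, 0.0, 1), (1, 0, 0.5, -1), (0, 0, 0.99, 1)]
voxels = voxelize_events(events, width=2, height=1, num_bins=2,
                         t_start=0.0, t_end=1.0)
```

The resulting grid has the same spatial layout as a frame, which is what allows both modalities to be cut into spatially aligned tokens.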

Correlation Modeling. After creating the unified representation for both data modalities, we construct the embedding representations of two gaze points, i.e., the estimation gaze point indicates the current state and the template gaze point represents the registered anchor state, by concatenating the token embeddings of two modalities from the same state.

We then apply a Vision Transformer (ViT) with the multi-head self-attention (MSA) mechanism to model the correlation between the current and registered anchor states. The processed features are subsequently flattened into a single feature representation, which is analyzed by a convolutional layer to predict the final class label.

Our experiments confirm the model’s effectiveness in distinguishing between the current and registered anchor states for gaze estimation. However, we observed limitations when a single registered anchor state was used for predictions across a large gaze area, leading to significant over-fitting as shown in Figs. 3 and 4. To address this, we propose partitioning the entire gaze point area into $N$ smaller sub-regions, each with its own regional registered anchor state. This allows us to train $N$ specialized local expert networks. For our task, we achieve a good balance between training costs and accuracy by dividing the region into five sub-regions, as shown in Fig. 5, each with a dedicated local expert network.

Anchor Selection Mechanism. Since the local expert networks only perceive data from their own sub-regions, we further design a network that determines globally which expert (anchor) to use during the gaze estimation process. As illustrated in Fig. 6, the network uses MLPs to choose the nearest registered anchor state for the input current state, thereby determining which sub-region the current state belongs to.
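The selection step can be pictured as a nearest-neighbor lookup over the registered anchors. The sketch below replaces the learned MLPs with a plain Euclidean nearest-anchor rule on feature vectors, purely to illustrate the routing; the two-dimensional features, the anchor values, and the function name `select_anchor` are hypothetical.

```python
import math


def select_anchor(current_feature, anchor_features):
    """Return the index of the anchor closest to the current state.

    Stand-in for the learned selector: Euclidean distance in a
    hypothetical feature space instead of MLP scores.
    """
    if not anchor_features:
        raise ValueError("at least one anchor is required")
    distances = [math.dist(current_feature, a) for a in anchor_features]
    return min(range(len(distances)), key=distances.__getitem__)


# Five made-up anchor features, one per sub-region.
anchors = [(0.0, 0.0), (-1.0, 1.0), (1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
expert_index = select_anchor((0.9, 0.8), anchors)
```

The returned index would then pick which local expert (during training) or which anchor state (during inference) is paired with the current input.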

III-B Local-Global Latent Denoising Distillation

Algorithm 1 Training Denoiser

1: Repeat
2:   $\mathbf{X}_{i}\sim q(\mathbf{X}_{i})$
3:   $t\sim\mathrm{Uniform}(\{i,i+1,...,T\})$
4:   $\epsilon\sim\mathcal{N}(0,\mathbf{I})$
5:   $i\sim\mathcal{U}(27,32)$
6:   Take gradient descent step on
7:     $\delta=\epsilon-\epsilon_{\theta}(\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\epsilon,t)$
8:     $\nabla_{\theta}\|\sum\delta\|^{2}+\|\sigma(\delta)-\sqrt{2}\|^{2}$
9: Until converged

Algorithm 2 Reverse process

1: $\epsilon\sim\mathcal{N}(0,\mathbf{I})$
2: $\mathbf{X}_{T^{\prime}}=\sqrt{\frac{\bar{\alpha}_{T^{\prime}}}{\bar{\alpha}_{i}}}\mathbf{X}_{i}+\sqrt{1-\frac{\bar{\alpha}_{T^{\prime}}}{\bar{\alpha}_{i}}}\epsilon$
3: For $t=T^{\prime},...,i$
4:   $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$ if $t>i$, else $\mathbf{z}=0$
5:   $\sigma_{t}=\sqrt{\frac{\hat{\beta}_{t}(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}}$
6:   $\mathbf{X}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}(\mathbf{X}_{t}-\frac{(1-\alpha_{t})\sqrt{\bar{\alpha}_{i}}}{\sqrt{\bar{\alpha}_{i}-\bar{\alpha}_{t}}}\mathbf{\epsilon}_{\theta}(\mathbf{X}_{t},t))+\sigma_{t}\mathbf{z}$
7: End for
8: Return $\mathbf{X}_{i}$

The proposed denoising diffusion algorithm consists of the following two steps:

1. training a denoising diffusion network for removing potential noise as illustrated by Eq. (1) and Algorithm 1; and

2. distilling the knowledge from local expert networks into a student network via Eq. (7) with the denoising process as Algorithm 2.

Given the feature maps of measurement $\widetilde{\mathbf{X}}=\mathbf{X}+\gamma\mathbf{\epsilon}$, where $\mathbf{X}$ indicates the expectation and $\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ represents the random Gaussian noise, we aim to recover the distribution of $\widetilde{\mathbf{X}}\sim\mathcal{N}(\mathbf{X},\gamma\mathbf{I})$ from a given measurement $\widetilde{\mathbf{X}}$. Specifically, we train a denoising diffusion network that can generate latent $\mathbf{\mu}$ from Gaussian noise. Then, by progressively removing the noise from the latent of a given noised measurement, we can finally derive the expectation by averaging several denoising results.

We formulate the distribution of the latent feature maps as $p(\widetilde{\mathbf{X}}|\mathbf{X})=\mathcal{N}(\widetilde{\mathbf{X}}|\mathbf{X},\gamma\mathbf{I})$, where $\widetilde{\mathbf{X}}$ indicates the measured noisy latent, $\mathbf{X}$ denotes the corresponding noise-free expectation, and $\gamma\mathbf{I}$ represents the covariance matrix under the independence assumption. Since we only know the noised sample, we further re-parameterize with $\gamma^{\prime}=1-\gamma$ and $\mathbf{X}=\sqrt{\gamma^{\prime}}\mathbf{Y}$.

Subsequently, we have $p(\widetilde{\mathbf{X}}|\mathbf{Y})=\mathcal{N}(\widetilde{\mathbf{X}}|\sqrt{\gamma^{\prime}}\mathbf{Y},\gamma\mathbf{I})$. Recall the intermediate sample of DDPM [81], $\mathbf{X}_{i}=\sqrt{\bar{\alpha}_{i}}\mathbf{X}_{0}+\sqrt{1-\bar{\alpha}_{i}}\mathbf{X}_{T}$, where $\bar{\alpha}_{i}=\prod_{j=0}^{i}\alpha_{j}$. Thus, the noised latent $\widetilde{\mathbf{X}}$ could be approximated by the result $\mathbf{X}_{i}$ of a certain step in a diffusion reverse process ($\mathbf{X}_{T}\rightarrow\mathbf{X}_{0}$), where $\mathbf{X}_{0}=\mathbf{Y}$ and $\bar{\alpha}_{i}=\gamma^{\prime}$. We then introduce a process to derive the distribution of $\mathbf{X}_{i}$ from a sample $\widetilde{\mathbf{X}}$ via diffusion models.

The forward process can be constructed simply by adding noise following $\mathbf{X}_{t}=\sqrt{\alpha_{t}}\mathbf{X}_{t-1}+\sqrt{1-\alpha_{t}}\epsilon_{t}$, where $\epsilon_{t}$ indicates random Gaussian noise. Thus, we can train the denoiser as in Algorithm 1 to learn the potential noise from the noised sample.
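To make the noising step concrete, the sketch below chains single Gaussian steps with signal scale $\sqrt{\alpha_{j}}$ and noise variance $1-\alpha_{j}$ from level $i$ to level $t$, and compares the result with the closed-form coefficients $\sqrt{\bar{\alpha}_{t}/\bar{\alpha}_{i}}$ and $\sqrt{1-\bar{\alpha}_{t}/\bar{\alpha}_{i}}$ used in Algorithm 1. The linear $\beta$ schedule, the number of steps, and the function names are assumptions for illustration; the schedule actually used is not specified in the text.

```python
import math


def make_alphas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step alphas from a linear beta schedule (assumed, not from the paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [1.0 - (beta_start + k * step) for k in range(num_steps)]


def cumulative_products(alphas):
    """alpha_bar[k] = alphas[0] * ... * alphas[k]."""
    out, running = [], 1.0
    for a in alphas:
        running *= a
        out.append(running)
    return out


def renoise_coeffs(alpha_bar, i, t):
    """Closed-form (signal, noise) scales for jumping from level i to t >= i."""
    ratio = alpha_bar[t] / alpha_bar[i]
    return math.sqrt(ratio), math.sqrt(1.0 - ratio)


def stepwise_coeffs(alphas, i, t):
    """Same scales obtained by chaining the single steps i+1, ..., t."""
    signal, variance = 1.0, 0.0
    for j in range(i + 1, t + 1):
        signal *= math.sqrt(alphas[j])
        variance = alphas[j] * variance + (1.0 - alphas[j])
    return signal, math.sqrt(variance)


alphas = make_alphas(1000)
alpha_bar = cumulative_products(alphas)
closed = renoise_coeffs(alpha_bar, i=30, t=200)
chained = stepwise_coeffs(alphas, i=30, t=200)
```

Under these assumptions, `closed` and `chained` agree up to floating-point error, which is why a measurement at level $i$ can be re-noised to any later level $t$ in a single step.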

In Algorithm 1, we optimize the diffusion denoiser with the following loss term

$$L_{Diff}=\|\sum\delta\|^{2}+\|\sigma(\delta)-\sqrt{2}\|^{2}. \tag{1}$$

Since $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, we expect $\epsilon_{\theta}(\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\epsilon,t)\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Thus, we have

$$\mathbf{\delta}=\mathbf{\epsilon}-\mathbf{\epsilon}_{\theta}(\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{\epsilon},t),\quad\mathbf{\delta}\sim\mathcal{N}(\mathbf{0},2\mathbf{I}). \tag{2}$$

We then regularize $\sum\delta\rightarrow 0$ and $\sigma(\delta)\rightarrow\sqrt{2}$, which is exactly the training objective in Eq. (1).
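One possible reading of Eq. (1) is sketched below: the residual $\delta$ is flattened, its sum is pushed toward zero, and its standard deviation is pushed toward $\sqrt{2}$. Treating $\|\sum\delta\|^{2}$ as the squared sum over all elements and $\sigma(\cdot)$ as the population standard deviation is an assumption, as is the function name `diffusion_residual_loss`.

```python
import math
import statistics


def diffusion_residual_loss(eps, eps_pred):
    """Squared sum of delta plus squared gap between std(delta) and sqrt(2)."""
    delta = [e - p for e, p in zip(eps, eps_pred)]
    sum_term = sum(delta) ** 2
    std_term = (statistics.pstdev(delta) - math.sqrt(2.0)) ** 2
    return sum_term + std_term


# A residual with zero sum and standard deviation sqrt(2) gives (near) zero loss.
r = math.sqrt(2.0)
loss = diffusion_residual_loss([r, -r], [0.0, 0.0])
```

In an actual training loop the same two terms would be computed on tensors so that gradients can flow back to $\epsilon_{\theta}$; the list-based version here only illustrates the arithmetic.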

However, our algorithm starts from a noised measurement and iteratively adds and then removes noise. Thus, the reverse process of our algorithm differs from that of the standard denoising diffusion model. We therefore examine the reverse distribution of the Markov chain that starts from $\mathbf{X}_{i}$. We begin with one reverse transition:

$$q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})=\frac{q(\mathbf{X}_{t}|\mathbf{X}_{t-1},\mathbf{X}_{i})q(\mathbf{X}_{t-1}|\mathbf{X}_{i})}{q(\mathbf{X}_{t}|\mathbf{X}_{i})}, \tag{3}$$

where $\mathbf{X}_{t}$ indicates the noised sample with inherent noise from collected event data. We then expand the term $q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})$ as follows.

$$\begin{aligned}q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})&=\frac{q(\mathbf{X}_{t}|\mathbf{X}_{t-1},\mathbf{X}_{i})q(\mathbf{X}_{t-1}|\mathbf{X}_{i})}{q(\mathbf{X}_{t}|\mathbf{X}_{i})}\\&=\frac{\mathcal{N}(\mathbf{X}_{t},\sqrt{\alpha_{t}}\mathbf{X}_{t-1},\sqrt{1-\alpha_{t}}\mathbf{I})\,\mathcal{N}(\mathbf{X}_{t-1},\sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{i}}}\mathbf{X}_{i},\sqrt{1-\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{i}}}\mathbf{I})}{\mathcal{N}(\mathbf{X}_{t},\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i},\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{I})}\end{aligned} \tag{4}$$

$$\propto\mathcal{N}\Big(\mathbf{X}_{t-1},\frac{(1-\alpha_{t})(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}\Big(\frac{\sqrt{\alpha_{t}}}{1-\alpha_{t}}\mathbf{X}_{t}+\frac{\sqrt{\bar{\alpha}_{i}\bar{\alpha}_{t-1}}}{\bar{\alpha}_{i}-\bar{\alpha}_{t-1}}\mathbf{X}_{i}\Big),\sqrt{\frac{(1-\alpha_{t})(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}}\Big). \tag{5}$$

Moreover, with $\hat{\beta}_{t}=1-\alpha_{t}$ , we have

$$q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})\propto\mathcal{N}\Big(\mathbf{X}_{t-1},\frac{\hat{\beta}_{t}(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}\Big(\frac{\sqrt{\alpha_{t}}}{\hat{\beta}_{t}}\mathbf{X}_{t}+\frac{\sqrt{\bar{\alpha}_{i}\bar{\alpha}_{t-1}}}{\bar{\alpha}_{i}-\bar{\alpha}_{t-1}}\mathbf{X}_{i}\Big),\sqrt{\frac{\hat{\beta}_{t}(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}}\Big). \tag{6}$$
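As a numerical sanity check on Eq. (6), the sketch below evaluates the posterior variance and the two mean weights, and confirms that they reduce to the standard DDPM posterior when $\bar{\alpha}_{i}=1$ (i.e., when $\mathbf{X}_{i}$ is noise-free). The function name `posterior_coeffs` and the sample values are hypothetical, and the mean weights are written in an algebraically simplified form that multiplies out the common prefactor.

```python
import math


def posterior_coeffs(alpha_t, alpha_bar_prev, alpha_bar_i):
    """Variance and mean weights of q(X_{t-1} | X_t, X_i) for t > i.

    Returns (variance, weight_on_X_t, weight_on_X_i), following Eq. (6)
    with beta_hat_t = 1 - alpha_t and alpha_bar_t = alpha_t * alpha_bar_prev.
    """
    if not (0.0 < alpha_t < 1.0 and 0.0 < alpha_bar_prev <= alpha_bar_i <= 1.0):
        raise ValueError("expected 0 < alpha_t < 1 and alpha_bar_{t-1} <= alpha_bar_i <= 1")
    beta_hat = 1.0 - alpha_t
    alpha_bar_t = alpha_t * alpha_bar_prev
    denom = alpha_bar_i - alpha_bar_t  # strictly positive under the checks above
    variance = beta_hat * (alpha_bar_i - alpha_bar_prev) / denom
    weight_xt = math.sqrt(alpha_t) * (alpha_bar_i - alpha_bar_prev) / denom
    weight_xi = beta_hat * math.sqrt(alpha_bar_i * alpha_bar_prev) / denom
    return variance, weight_xt, weight_xi


# With alpha_bar_i = 1 the expressions match the usual DDPM posterior.
a_t, ab_prev = 0.98, 0.5
var, w_t, w_i = posterior_coeffs(a_t, ab_prev, alpha_bar_i=1.0)
ab_t = a_t * ab_prev
ddpm_var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
ddpm_w_t = math.sqrt(a_t) * (1 - ab_prev) / (1 - ab_t)
ddpm_w_0 = math.sqrt(ab_prev) * (1 - a_t) / (1 - ab_t)
```

The square root of the returned variance corresponds to $\sigma_{t}$ in Algorithm 2.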

However, the noise level indicated by $i$ is hard to determine. Thus, during the training phase, we relax $\bar{\alpha}_{i}$ by letting $i$ vary within the range $[27,32]$. Note that through the aforementioned reverse process, we derive the distribution of $\mathbf{X}_{i}$ from a single measurement by iteratively sampling $\mathbf{X}_{t-1}$, $t=T^{\prime},\cdots,i$, according to Eq. (6), as detailed in Algorithm 2. Although we do not explicitly remove the noise from the data, we expect that, during training, the variable $\mathbf{X}$ converges to the noise-free expectation $\mathbf{X}_{0}$ as

$$\mathcal{L}_{d}(\mathbf{X}^{\theta},\mathbf{X}^{\phi}_{i})=\sum_{\mathbf{X}^{\phi}_{i}\sim p(\mathbf{X}^{\phi}_{i})}\Big\|\mathbf{X}^{\theta}-\frac{\mathbf{X}^{\phi}_{i}}{\sqrt{\bar{\alpha}_{i}}}\Big\|_{2}^{2}, \tag{7}$$

where $\sum_{\mathbf{X}^{\phi}_{i}\sim p(\mathbf{X}^{\phi}_{i})}$ indicates summation across different diffusion reconstruction samples of the teacher network $\phi$, and $\mathbf{X}^{\theta}$ denotes the output feature of the student network $\theta$.
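The distillation term of Eq. (7) can be sketched as follows: each denoised teacher sample is rescaled by $1/\sqrt{\bar{\alpha}_{i}}$ and compared with the student feature, and the squared distances are summed over samples. Features are flattened to plain lists here, and the function name `denoising_distill_loss` and the toy values are assumptions for illustration.

```python
import math


def denoising_distill_loss(student_feature, teacher_samples, alpha_bar_i):
    """Sum over teacher samples of ||X_student - X_teacher / sqrt(alpha_bar_i)||^2."""
    scale = math.sqrt(alpha_bar_i)
    total = 0.0
    for sample in teacher_samples:
        total += sum((s - x / scale) ** 2 for s, x in zip(student_feature, sample))
    return total


student = [1.0, 2.0]
teacher_draws = [[0.5, 1.0], [0.0, 0.0]]  # toy denoised teacher features
loss = denoising_distill_loss(student, teacher_draws, alpha_bar_i=0.25)
```

In this toy case the first teacher draw matches the student exactly after rescaling, so only the second draw contributes to the loss.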

III-C Training Objective

We train our gaze estimation framework in a two-stage manner. In the first stage, we train five local experts, each specializing in a sub-region of the gaze point area. Each expert is trained using the cross-entropy loss function, $\mathcal{L}_{e}$ , with corresponding subsets of gaze point labels and regional registered anchor states:

$$\mathcal{L}_{e}=\mathcal{L}_{CE}(\hat{\mathbf{Y}},\mathbf{Y}), \tag{8}$$

where $\hat{\mathbf{Y}}\in\mathbb{R}^{L}$ and $\mathbf{Y}\in\mathbb{R}^{L}$ represent the predicted and ground-truth locations, respectively.

The second stage involves distilling the knowledge from these local experts into a student network designed to handle the entire gaze movement region. This distillation process utilizes three loss functions. We maintain the use of $\mathcal{L}_{e}$ as a hard loss to ensure stable performance of the student network. Additionally, we employ a soft loss, the Kullback-Leibler Divergence loss function $\mathcal{L}_{s}$ to guide the learning of the student network:

$$\mathcal{L}_{s}=\mathcal{L}_{KL}(\mathbf{T}_{S},\mathbf{T}_{E}), \tag{9}$$

where $\mathbf{T}_{S},\mathbf{T}_{E}\in\mathbb{R}^{H\times W}$ represent the attention matrices of the student and local experts, respectively. Finally, the total loss for the second stage is a weighted sum of these three losses:

$$\mathcal{L}=\alpha\cdot\mathcal{L}_{e}+\beta\cdot\mathcal{L}_{s}+\lambda\cdot\mathcal{L}_{d}, \tag{10}$$

where $\alpha$ , $\beta$ , and $\lambda$ are corresponding weights for balancing different loss terms. Based on our extensive ablation experiments, we set $\alpha$ , $\beta$ , and $\lambda$ to 1, 1, and 500, respectively.
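A minimal sketch of how the three terms of Eq. (10) might be combined is given below, using the reported weights as defaults. The per-sample cross-entropy and the discrete KL divergence are textbook definitions; which of $\mathbf{T}_{S}$ and $\mathbf{T}_{E}$ plays the role of `p` in the KL term, and all function names, are assumptions.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample (hard loss L_e)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target_index]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions (soft loss L_s)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def total_loss(l_e, l_s, l_d, alpha=1.0, beta=1.0, lam=500.0):
    """Weighted sum of Eq. (10) with the reported weights as defaults."""
    return alpha * l_e + beta * l_s + lam * l_d


l_e = cross_entropy([2.0, 0.0, 0.0], target_index=0)
l_s = kl_divergence([0.5, 0.5], [0.5, 0.5])
loss = total_loss(l_e, l_s, l_d=0.001)
```

The large weight on $\mathcal{L}_{d}$ means that even a small feature-space discrepancy contributes on the same scale as the classification terms.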

IV Experiment

IV-A Experimental Settings

Dataset. We utilized a hybrid event-based IR near-eye gaze tracking dataset [19] to evaluate our proposed system. The dataset integrated a sophisticated DAVIS-346b sensor (iniVation) with a high-resolution 25 mm f/1.4 VIS-NIR C-mount lens (EO-#67-715), further augmented with a UV/VIS cutoff filter (EO-#89-834) to capture the ocular dynamics of subjects using an ophthalmic headrest coupled with a restraining apparatus to minimize potential head movement artifacts.

The dataset recorded the gaze movements from 24 subjects using a $40\times40$-pixel luminous green fixation cross rendered on a 40-inch-diagonal, $1920\times1080$-pixel display (Sceptre 1080p X415BV-FSR) with a visual field of view (FoV) of $64^{\circ}\times96^{\circ}$.

The dataset was divided into two distinct experimental conditions tailored to elicit different oculomotor responses: stochastic saccadic movements and controlled smooth pursuit tracking. During the first experimental paradigm, subjects were instructed to direct their gaze towards the stimulus. The stimulus materialized at random within a grid matrix of 121 discrete points (an 11 $\times$ 11 grid pattern projected onto the display), with each point being presented for 1.5 seconds. This sequence of locations was uniformly randomized and remained consistent across all participants. Some sample images are shown in Fig. 7. The second experimental paradigm was used for free gaze point estimation. In this paradigm, the subjects’ task was to maintain visual fixation on the stimulus as it traversed a predefined square-wave trajectory. This trajectory commenced at the upper boundary of the display and proceeded in a downward motion, covering the full horizontal extent of the screen, with a vertical displacement amplitude of 150 pixels. Despite the intentional induction of saccadic jumps and smooth pursuit movements within the experimental framework, the resulting dataset encapsulates a plethora of involuntary eye dynamics, including microsaccades and ocular tremors, thereby offering a comprehensive profile of ocular motion behavior.

Implementation details. 1) Local expert network. We set the input frame and event voxel size to $224\times224$. We directly adopt the first two layers of the base vision transformer (ViT-B). We configure the mini-batch size to 80 and adopt the AdamW optimization algorithm due to its efficacy in handling sparse gradients and incorporating weight decay for regularization. We initiate the training with a learning rate of 0.0001, which is attenuated following a cosine annealing schedule, descending to a factor of 0.1 of the original learning rate. The training epochs are set to 350 to ensure the local expert networks adequately learn the intricate patterns within the data. 2) Distilled student network. In the subsequent stage, the five local expert networks are distilled into the student network, again using the AdamW optimizer. The student network consists of the first two layers of ViT-B. This phase is conducted with a reduced learning rate of $10^{-5}$ and a momentum coefficient of 0.9, extending over 500 epochs. Due to the computationally intensive nature of knowledge distillation, the mini-batch size is adjusted to 4 to accommodate the increased model complexity and ensure stable convergence. The code is implemented in PyTorch, and the models are trained on RTX 3090 GPUs.
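The learning-rate schedule of the first stage can be sketched with the standard cosine annealing formula. The sketch assumes that "descending to a factor of 0.1" means the minimum learning rate is $0.1\times$ the initial one and that the schedule is stepped once per epoch; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=350, base_lr=1e-4, final_factor=0.1):
    """Cosine-annealed learning rate from base_lr down to final_factor * base_lr."""
    min_lr = final_factor * base_lr
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, the midpoint, and the end of training.
schedule = [cosine_lr(e) for e in (0, 175, 350)]
```

Under these assumptions the rate starts at $10^{-4}$, passes through $5.5\times10^{-5}$ at the midpoint, and ends at $10^{-5}$.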

Inference. Inference differs from training. During training, latent denoising is required to convert the feature maps into a distribution before distilling their knowledge into the student model. During inference, the model only uses the MLPs to determine which sub-region the current state belongs to and feeds the fused input directly into the student network without latent denoising, since the student model already possesses complete inference capabilities, as shown by the pink arrow in Fig. 2.

Metrics. Following previous works [82], we employed the Mean Angle Error (MAE) to evaluate the performance quantitatively, computed as

$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\arccos{\frac{\langle\overrightarrow{p}_{i},\overrightarrow{t}_{i}\rangle}{\|\overrightarrow{p}_{i}\|\|\overrightarrow{t}_{i}\|}}, \tag{11}$$

where $N$ is the number of samples in the dataset, $\overrightarrow{p}_{i}$ is the predicted gaze vector for the $i$-th sample, $\overrightarrow{t}_{i}$ is the corresponding ground truth vector, and $\langle\cdot,\cdot\rangle$ computes the inner product of two input vectors.
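Eq. (11) translates directly into a few lines of code. The sketch below converts the angle to degrees, on the assumption that this matches how the errors are reported in the tables; the function name `mean_angle_error_deg` and the toy vectors are made up for the example.

```python
import math


def mean_angle_error_deg(predictions, targets):
    """Mean angle (in degrees) between paired 3D gaze vectors, as in Eq. (11)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for p, t in zip(predictions, targets):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.hypot(*p) * math.hypot(*t)
        # Clamp to guard against round-off pushing the cosine outside [-1, 1].
        cos_angle = max(-1.0, min(1.0, dot / norm))
        total += math.degrees(math.acos(cos_angle))
    return total / len(predictions)


# One orthogonal pair (90 degrees) and one parallel pair (0 degrees).
preds = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
truth = [(0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]
mae = mean_angle_error_deg(preds, truth)
```

Because the cosine is normalized by both vector lengths, the metric depends only on direction, so the second pair contributes zero error despite the different magnitudes.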

IV-B Comparison with State-of-the-Art Methods

We compared our method with the following four methods:

• S-T GE [83] leverages temporal sequences of eye images to enhance the accuracy of an end-to-end appearance-based deep-learning model for gaze estimation.

• Dilated-Net [69] integrates dilated convolutional layers to enhance feature extraction for gaze estimation.

• EventGT [19] achieves an equilibrium of computational efficiency and accuracy, a key consideration for real-time gaze tracking.

• HE-Tracker [82] employs a sophisticated pipeline that commences with the E-Tracker’s encoding of eye imagery.

The quantitative results are presented in Table I. Our method achieves the lowest MAE ($1.928^{\circ}$), a reduction of about $36\%$ relative to the best competing MAE (EventGT, $3.000^{\circ}$) and about $54\%$ relative to HE-Tracker ($4.170^{\circ}$). It also raises tracking accuracy by about 15 percentage points over the strongest baseline (HE-Tracker, $72.87\%$ to $87.67\%$), confirming the effectiveness of our framework. We also refer the reviewers to the video demo for more visual results.

TABLE I: Quantitative results of different methods. $\downarrow$ (resp. $\uparrow$) indicates the smaller (resp. larger), the better.

Method	S-T GE [83]	Dilated-Net [69]	EventGT [19]	HE-Tracker [82]	Ours
MAE $\downarrow$	$7.650^{\circ}$	$4.020^{\circ}$	$3.000^{\circ}$	$4.170^{\circ}$	$1.928^{\circ}$
Accuracy $\uparrow$	61.88%	66.63%	72.06%	72.87%	87.67%
Time (ms)	288.34	562.03	—	20.65	191.25

IV-C Ablation Study

Data Modality. We analyzed the impact of different data modalities on gaze estimation performance in Table II. Using only frames resulted in a substantial angular error of $40.16^{\circ}$, far higher than the $1.93^{\circ}$ obtained with both modalities. Likewise, relying solely on event data yielded a poor prediction accuracy of $1.45\%$. These findings show that although both frames and event data correlate with the directional vector of gaze, neither modality alone is sufficient for accurate gaze estimation. The best performance is achieved by fusing frames and event data, underscoring the value of our multimodal approach.

Latent Denoising. We also conducted experiments to validate the effectiveness of latent denoising. As shown in Table II, incorporating latent denoising improves accuracy by about 3 percentage points ($84.64\%$ to $87.67\%$) and reduces MAE by $44.5\%$ ($3.472^{\circ}$ to $1.928^{\circ}$), showcasing the potential of the proposed denoising distillation strategy.

TABLE II: Gaze estimation performance across various data modalities and denoising, where 'F' indicates the frame and 'E' represents the event.

Metric	F	E	Ours (F+E)	Ours w/o Denoising
MAE $\downarrow$	$40.161^{\circ}$	—	$1.928^{\circ}$	$3.472^{\circ}$
Accuracy $\uparrow$	53.00%	1.45%	87.67%	84.64%

Anchor State. We also conducted ablation studies to assess the necessity of registered anchor states for our gaze estimation model. Without any anchor state, our model yielded an angular error of $32.00^{\circ}$. With a single registered anchor state, the error dropped to $15.19^{\circ}$, highlighting the importance of the registered anchor state for accurate gaze prediction. These experiments also confirmed the effectiveness of introducing multiple registered anchor states for tasks involving large gaze movement regions, as mentioned in Sec. III-A. The results are summarized in Fig. 8 (a).

Weight of Feature Map Loss. Our model employs a feature map loss $\mathcal{L}_{d}$ to guide learning. Because the gradient from the MSE loss (feature distillation) is typically smaller than that from the KL-divergence (task loss), we assign a large factor to the distillation loss $\mathcal{L}_{d}$ to balance the two terms. As shown in Fig. 8 (b), the weight of the distillation loss is quite large. Such a large weight is needed to strengthen the network’s learning and thereby improve the precision of gaze estimation.
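The balancing described above can be sketched as a weighted sum of the two terms. This is only an illustration under stated assumptions: the additive form `task + weight * distillation`, the helper names, and the example weight of 100 are hypothetical, and the weights actually studied are those reported in Fig. 8 (b).

```python
import math


def kl_divergence(target_probs, predicted_probs, eps=1e-12):
    """KL(target || predicted) for two discrete distributions (task loss)."""
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target_probs, predicted_probs)
        if t > 0.0
    )


def mse(student_features, teacher_features):
    """Mean squared error between flattened feature maps (distillation loss)."""
    return sum((s - t) ** 2 for s, t in zip(student_features, teacher_features)) / len(
        student_features
    )


def total_loss(target_probs, predicted_probs, student_features, teacher_features,
               distill_weight):
    """Task loss plus the distillation loss scaled by a (large) weight."""
    task = kl_divergence(target_probs, predicted_probs)
    distill = mse(student_features, teacher_features)
    return task + distill_weight * distill


if __name__ == "__main__":
    # distill_weight=100.0 is a placeholder, not a value from the paper.
    print(total_loss([0.5, 0.5], [0.5, 0.5], [1.0, 2.0], [0.0, 0.0], 100.0))
```

Scaling the MSE term in this way raises the magnitude of its gradient relative to the task loss, which is the effect the large weight is intended to achieve.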

Number of Distillation Samples. As shown in Eq. (7), we expect the network to learn noise-free samples from multiple noisy measurements. Enlarging the batch size is therefore necessary for the solution of Eq. (7) to converge to the expectation. We investigated the impact of the number of distillation samples (i.e., the number of summed samples in Eq. (7)), using 4, 8, and 16 samples. As shown in Table III, accuracy increases and MAE decreases as the number of distillation samples grows. This indicates that too few samples may reduce the model’s predictive capability, since the gradients of Eq. (7) computed on small batches may lead to deteriorated model weight distributions. In other words, using more samples effectively mitigates the over-fitting issue.
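The convergence argument can be illustrated with a toy experiment that does not reproduce Eq. (7) itself. Assuming a scalar clean value corrupted by zero-mean Gaussian noise (both assumptions of this sketch, as are the function name and parameters), averaging more noisy measurements yields an estimate whose spread around the clean value shrinks roughly as $1/\sqrt{n}$.

```python
import random
import statistics


def spread_of_sample_mean(num_samples, trials=2000, clean_value=1.0,
                          noise_std=1.0, seed=0):
    """Standard deviation of the average of `num_samples` noisy measurements.

    Each trial draws `num_samples` noisy copies of `clean_value` and averages
    them; the spread of these averages over all trials is returned.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        noisy = [clean_value + rng.gauss(0.0, noise_std) for _ in range(num_samples)]
        estimates.append(sum(noisy) / num_samples)
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, round(spread_of_sample_mean(n), 3))
```

With unit noise, the spread is expected to be close to 0.5, 0.35, and 0.25 for 4, 8, and 16 samples, respectively, which mirrors the trend in Table III without claiming to explain its exact values.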

TABLE III: Gaze estimation performance across various reconstruction samples. The adopted strategy is underlined.

Metric	4 Samples	8 Samples	16 Samples
MAE $\downarrow$	$3.987^{\circ}$	$2.684^{\circ}$	$\underline{1.928^{\circ}}$
Accuracy $\uparrow$	82.01%	85.48%	$\underline{87.67\%}$

IV-D Aligning Continuous Location Prediction into Pre-trained Model

An essential requirement of eye tracking technology is the capability to dynamically estimate the point of gaze. After distilling the local experts into a comprehensive model, our system can provide a rough, low-resolution estimate of the gaze locus. Building upon this pre-trained model, we generate accurate three-dimensional coordinates for the free gaze point [84, 85, 86]. To produce the actual location where the gaze intersects the screen, we designed a branch for gaze projection coordinates, as depicted in Fig. 9. Specifically, we adapted the final output layer, which originally generates class labels, to directly predict the two-dimensional coordinates of the gaze point on the screen [87, 88]. Accordingly, we replaced the optimization objective with a loss function $\mathcal{L}_{c}$ to fine-tune the pre-trained model parameters. We selected the MSE loss to train the gaze position regression model, as it directly measures the discrepancy between the predicted and actual gaze coordinates:

\mathcal{L}_{c}=\|\hat{\mathbf{P}}-\mathbf{P}\|_{2}^{2},

(12)

where $\hat{\mathbf{P}}\in\mathbb{R}^{2}$ and $\mathbf{P}\in\mathbb{R}^{2}$ represent the predicted value of the model and the ground-truth value of the gaze point, respectively.
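Eq. (12) amounts to the squared Euclidean distance between two screen points. The sketch below illustrates it in plain Python; averaging over a batch is an assumption of this sketch, and the function names `coordinate_loss` and `mean_coordinate_loss` are hypothetical.

```python
def coordinate_loss(predicted, target):
    """Squared L2 distance between predicted and true 2D gaze points (Eq. 12)."""
    if len(predicted) != 2 or len(target) != 2:
        raise ValueError("gaze points must be two-dimensional")
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def mean_coordinate_loss(batch_predicted, batch_target):
    """Average of Eq. (12) over a batch (batch averaging is an assumption)."""
    if not batch_predicted or len(batch_predicted) != len(batch_target):
        raise ValueError("batches must be non-empty and equal in length")
    losses = [coordinate_loss(p, t) for p, t in zip(batch_predicted, batch_target)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(coordinate_loss((1.0, 2.0), (4.0, 6.0)))
```

In the usage example the prediction is offset by 3 horizontally and 4 vertically, so the loss is $3^{2}+4^{2}=25$ in squared coordinate units.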

This modification enables the system to translate the abstract understanding of where a person is looking into a concrete set of screen coordinates, facilitating applications that require precise tracking of the user’s point of gaze. Next, our system establishes a spatial coordinate system with the human eye as the origin in three-dimensional space. Subsequently, using randomly captured near-eye frame and event data, our system outputs continuous location predictions. The quantitative results of this assessment are systematically presented in Table IV. We also refer reviewers to the video demo for more results.

TABLE IV: Quantitative results of our methods of continuous location prediction. $\downarrow$ indicates the smaller, the better.

Method	S-T GE [83]	Dilated-Net [69]	EventGT [19]	HE-Tracker [82]	Ours
MAE $\downarrow$	$6.980^{\circ}$	$3.589^{\circ}$	$3.900^{\circ}$	$3.655^{\circ}$	$3.184^{\circ}$

V Conclusion and Discussion

We have presented a novel coarse-to-fine dual-stage model for gaze estimation that leverages frame data and event data, utilizing anchor states to enhance precision. Technically, we employed an event-frame transformer as the backbone and introduced a global-local latent denoising knowledge distillation scheme to effectively merge the unique attributes of frame and event data. Our extensive experiments confirm the model’s capability to tackle challenges in multimodal data fusion and reduce overfitting tendencies. Our approach achieves reliable gaze estimation, maintaining an angular error below $2^{\circ}$. This outperforms various contemporary state-of-the-art gaze estimation methods, setting a new standard for this intricate task.

Despite the substantial superiority of the proposed method compared to state-of-the-art approaches, additional considerations must be addressed for further advancement. Focusing on the development of lightweight neural networks to optimize the inference process is essential. This can be accomplished through techniques like distillation or neural network pruning. Furthermore, there is scope for enhancing accuracy at the retina level (below $1^{\circ}$ ), akin to the exceptional capabilities demonstrated by Apple Vision Pro and HTC Vive.

References

[1] K. Reiter, K. Pfeuffer, A. Esteves, T. Mittermeier, and F. Alt, “Look & turn: One-handed and expressive menu interaction by gaze and arm turns in vr,” in Proc. Symposium on Eye Tracking Research and Applications, 2022.
[2] M. Choi, D. Sakamoto, and T. Ono, “Kuiper belt: Utilizing the “out-of-natural angle” region in the eye-gaze interaction for virtual reality,” in Proc. CHI Conference on Human Factors in Computing Systems, 2022.
[3] T. Kim, A. Ham, S. Ahn, and G. Lee, “Lattice menu: A low-error gaze-based marking menu utilizing target-assisted gaze gestures on a lattice of visual anchors,” in Proc. CHI Conference on Human Factors in Computing Systems, 2022.
[4] M. Kassner, W. Patera, and A. Bulling, “Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction,” in Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 2014.
[5] S. Ahn, S. Santosa, M. Parent, D. Wigdor, T. Grossman, and M. Giordano, “Stickypie: A gaze-based, scale-invariant marking menu optimized for ar/vr,” in Proc. CHI Conference on Human Factors in Computing Systems, 2021.
[6] T. K. Wee, E. Cuervo, and R. Balan, “Focusvr: Effective & usable vr display power management,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 2, no. 3, sep 2018.
[7] X. Ma, Z. Yao, Y. Wang, W. Pei, and H. Chen, “Combining brain-computer interface and eye tracking for high-speed text entry in virtual reality,” in Proc. International Conference on Intelligent User Interfaces, 2018.
[8] C. Hennessey, B. Noureddin, and P. Lawrence, “A single camera eye-gaze tracking system with free head motion,” in Proc. ACM Symposium on Eye Tracking Research and Applications, 2006.
[9] Z. Zhu, Q. Ji, and K. Bennett, “Nonlinear eye gaze mapping function estimation via support vector regression,” in Proc. International Conference on Pattern Recognition, vol. 1, 2006.
[10] Wang, Sung, and R. Venkateswarlu, “Eye gaze estimation from a single image of one eye,” in Proc. IEEE International Conference on Computer Vision, 2003.
[11] E. Wood and A. Bulling, “Eyetab: Model-based gaze estimation on unmodified tablet computers,” in Proc. ACM Symposium on Eye Tracking Research and Applications, 2014.
[12] C. Lu, P. Chakravarthula, K. Liu, X. Liu, S. Li, and H. Fuchs, “Neural 3d gaze: 3d pupil localization and gaze tracking based on anatomical eye model and neural refraction correction,” in Proc. IEEE International Symposium on Mixed and Augmented Reality, 2022.
[13] R. Ranjan, S. De Mello, and J. Kautz, “Light-weight head pose invariant gaze tracking,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.
[14] S. Baluja and D. Pomerleau, “Non-intrusive gaze tracking using artificial neural networks,” in Proc. Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector, Eds., vol. 6, 1993.
[15] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba, “Eye tracking for everyone,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[16] Y. Cheng and F. Lu, “Dvgaze: Dual-view gaze estimation,” in Proc. IEEE/CVF International Conference on Computer Vision, 2023.
[17] K. Roy and D. Chanda, “A robust webcam-based eye gaze estimation system for human-computer interaction,” in Proc. International Conference on Innovations in Science, Engineering and Technology, 2022.
[18] M. N. Lystbaek, K. Pfeuffer, J. E. S. Gronbaek, and H. Gellersen, “Exploring gaze for assisting freehand selection-based text entry in ar,” Proc. ACM Hum.-Comput. Interact., vol. 6, no. ETRA, may 2022.
[19] A. N. Angelopoulos, J. N. Martel, A. P. Kohli, J. Conradt, and G. Wetzstein, “Event-based near-eye gaze tracking beyond 10,000 hz,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 5, 2021.
[20] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, Asynchronous, Photometric Feature Tracking Using Events and Frames. Springer International Publishing, 2018.
[21] H. Liu, D. P. Moeys, G. Das, D. Neil, S.-C. Liu, and T. Delbrück, “Combined frame and event based detection and tracking,” in Proc. IEEE International Symposium on Circuits and Systems, 2016.
[22] M. Mokatren, T. Kuflik, and I. Shimshoni, “3d gaze estimation using rgb-ir cameras,” Sensors, vol. 23, no. 1, 2023.
[23] Y. Cheng and F. Lu, “Gaze estimation using transformer,” in Proc. International Conference on Pattern Recognition, 2022.
[24] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, 2015.
[25] Y. Feng, N. Goulding-Hotta, A. Khan, H. Reyserhove, and Y. Zhu, “Real-time gaze tracking with event-driven eye segmentation,” in Proc. IEEE Conference on Virtual Reality and 3D User Interfaces, 2022.
[26] Y. Lei, S. He, M. Khamis, and J. Ye, “An end-to-end review of gaze estimation and its interactive applications on handheld mobile devices,” ACM Comput. Surv., vol. 56, no. 2, sep 2023.
[27] M. F. Ansari, P. Kasprowski, and P. Peer, “Person-specific gaze estimation from low-quality webcam images,” Sensors, vol. 23, no. 8, 2023.
[28] G. Zhao, Y. Yang, J. Liu, N. Chen, Y. Shen, H. Wen, and G. Lan, “EV-eye: Rethinking high-frequency eye tracking through the lenses of event cameras,” in Proc. Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[29] M. A. Mahowald, An Analog VLSI System for Stereoscopic Vision. USA: Kluwer Academic Publishers, 1994.
[30] C. A. Mead and M. Mahowald, “A silicon model of early visual processing,” Neural Networks, vol. 1, no. 1, 1988.
[31] T. Delbruck, “Silicon retina with correlation-based, velocity-tuned pixels,” IEEE Transactions on Neural Networks, vol. 4, no. 3, 1993.
[32] T. Delbruck, B. Linares-Barranco, E. Culurciello, and C. Posch, “Activity-driven, event-based vision sensors,” in Proc. IEEE International Symposium on Circuits and Systems, 2010.
[33] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 $\times$ 128 120 dB 15 $\mu$s latency asynchronous temporal contrast vision sensor,” IEEE Journal of Solid State Circuits, vol. 43, 03 2008.
[34] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck, “Retinomorphic event-based vision sensors: Bioinspired cameras with spiking output,” Proceedings of the IEEE, vol. 102, no. 10, 2014.
[35] A. Mitrokhin, C. Fermuller, C. Parameshwara, and Y. Aloimonos, “Event-based moving object detection and tracking,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2018.
[36] H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization,” in Proc. British Machine Vision Conference, 2017.
[37] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, “The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and slam,” The International Journal of Robotics Research, vol. 36, no. 2, Feb. 2017.
[38] J. N. P. Martel, J. Müller, J. Conradt, and Y. Sandamirskaya, “An active approach to solving the stereo matching problem using event-based sensors,” in Proc. IEEE International Symposium on Circuits and Systems, 2018.
[39] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, 2007.
[40] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt, “Event-based 3d slam with a depth-augmented dynamic vision sensor,” in Proc. IEEE International Conference on Robotics and Automation, 2014.
[41] J. Lee, P. Park, C.-W. Shin, H. Ryu, B.-C. Kang, and T. Delbruck, “Touchless hand gesture ui with instantaneous responses,” in Proc. International Conference on Image Processing, 09 2012.
[42] Z. Zhu, J. Hou, and D. O. Wu, “Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers,” in Proc. IEEE/CVF International Conference on Computer Vision, 2023.
[43] Z. Zhu, J. Hou, and X. Lyu, “Learning graph-embedded key-event back-tracing for object tracking in event clouds,” Advances in Neural Information Processing Systems, vol. 35, 2022.
[44] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” IEEE Transactions on Cybernetics, 2023.
[45] C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, J. Zhang, Y. Wang, and Y. Tian, “Revisiting color-event based tracking: A unified network, dataset, and metric,” arXiv preprint arXiv:2211.11010, 2022.
[46] T. Delbruck and P. Lichtsteiner, “Fast sensory motor control based on event-based hybrid neuromorphic-procedural system,” in Proc. IEEE International Symposium on Circuits and Systems, 2007.
[47] T. Delbruck and M. Lang, “Robotic goalie with 3ms reaction time at 4% cpu load using event-based dynamic vision sensor,” Frontiers in Neuroscience, vol. 7, 2013.
[48] M. Litzenberger, B. Kohn, A. Belbachir, N. Donath, G. Gritsch, H. Garn, C. Posch, and S. Schraml, “Estimation of vehicle speed based on asynchronous data from a silicon retina optical sensor,” in Proc. IEEE Intelligent Transportation Systems Conference, 2006.
[49] M. Litzenberger, C. Posch, D. Bauer, A. Belbachir, P. Schon, B. Kohn, and H. Garn, “Embedded vision system for real-time object tracking using an asynchronous transient vision sensor,” in Proc. IEEE Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop, 2006.
[50] X. Lagorce, C. Meyer, S.-H. Ieng, D. Filliat, and R. Benosman, “Asynchronous event-based multikernel algorithm for high-speed visual features tracking,” IEEE transactions on neural networks and learning systems, vol. 26, 09 2014.
[51] J. Conradt, M. Cook, R. Berner, P. Lichtsteiner, R. Douglas, and T. Delbruck, “A pencil balancing robot using a pair of aer dynamic vision sensors,” in Proc. IEEE International Symposium on Circuits and Systems, 06 2009.
[52] Z. Ni, A. Bolopion, J. Agnus, R. Benosman, and S. Regnier, “Asynchronous event-based visual shape tracking for stable haptic feedback in microrobotics,” IEEE Transactions on Robotics, vol. 28, no. 5, 2012.
[53] Z. Ni, S.-H. Ieng, C. Posch, S. Régnier, and R. Benosman, “Visual tracking using neuromorphic asynchronous event-based cameras,” Neural Computation, vol. 27, no. 4, 04 2015.
[54] A. Glover and C. Bartolozzi, “Robust visual tracking with a freely-moving event camera,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
[55] D. Reverter Valeiras, X. Lagorce, X. Clady, C. Bartolozzi, S.-H. Ieng, and R. Benosman, “An asynchronous neuromorphic event-driven visual part-based shape tracking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, 2015.
[56] N. Li, M. Chang, and A. Raychowdhury, “E-gaze: Gaze estimation with event camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[57] Q. Chen, Z. Wang, S.-C. Liu, and C. Gao, “3et: Efficient event-based eye tracking using a change-based convlstm network,” in Proc. IEEE Biomedical Circuits and Systems Conference, 2023.
[58] T. Stoffregen, H. Daraei, C. Robinson, and A. Fix, “Event-based kilohertz eye tracking using coded differential lighting,” in Proc. IEEE/CVF Winter Conference on Applications of Computer Vision, 2022.
[59] X. Wang, J. Huang, S. Wang, C. Tang, B. Jiang, Y. Tian, J. Tang, and B. Luo, “Long-term frame-event visual tracking: Benchmark dataset and baseline,” 2024.
[60] C. Ryan, B. O’Sullivan, A. Elrasad, A. Cahill, J. Lemley, P. Kielty, C. Posch, and E. Perot, “Real-time face & eye tracking and blink detection using event cameras,” Neural Networks, vol. 141, 2021.
[61] L. R. Young and D. Sheena, “Survey of eye movement recording methods,” Behavior research methods and instrumentation, vol. 7, no. 5, 1975.
[62] T. N. Cornsweet and H. D. Crane, “Accurate two-dimensional eye tracker using first and fourth purkinje images,” Journal of the Optical Society of America, vol. 63, no. 8, 1973.
[63] H. D. Crane and C. M. Steele, “Generation-v dual-purkinje-image eyetracker,” Applied optics, vol. 24, no. 4, 1985.
[64] Y. Li, S. Wang, and X. Ding, “Eye/eyes tracking based on a unified deformable template and particle filtering,” Pattern Recognition Letters, vol. 31, no. 11, 2010.
[65] Y. Tian, T. Kanade, and J. Cohn, “Dual-state parametric eye tracking,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, March 2000.
[66] K. Wang and Q. Ji, “Real time eye gaze tracking with 3d deformable eye-face model,” in Proc. IEEE International Conference on Computer Vision, 2017.
[67] C. H. Morimoto and M. R. Mimica, “Eye gaze tracking techniques for interactive applications,” Computer vision and image understanding, vol. 98, no. 1, 2005.
[68] A. T. Duchowski, Eye tracking: methodology theory and practice. Springer, 2017.
[69] Z. Chen and B. E. Shi, “Appearance-based gaze estimation using dilated-convolutions,” in Proc. Asian Conference on Computer Vision, 2019.
[70] J. Bao, B. Liu, and J. Yu, “An individual-difference-aware model for cross-person gaze estimation,” IEEE Transactions on Image Processing, vol. 31, 2022.
[71] W. Shin, H. Park, S.-P. Kim, and S. Sul, “Individual differences in gaze-cuing effect are associated with facial emotion recognition and social conformity,” Frontiers in Psychology, vol. 14, 2023.
[72] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. NIPS Deep Learning and Representation Learning Workshop, 2015.
[73] X. Wang, S. Wang, C. Tang, L. Zhu, B. Jiang, Y. Tian, and J. Tang, “Event stream-based visual object tracking: A high-resolution benchmark dataset and a novel baseline,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2024.
[74] D. Lopez-Paz, L. Bottou, B. Scholkopf, and V. N. Vapnik, “Unifying distillation and privileged information,” CoRR, vol. abs/1511.03643, 2015.
[75] L. Xiang, G. Ding, and J. Han, Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-Tailed Classification, 10 2020.
[76] Q. Guo, X. Wang, Y. Wu, Z. Yu, D. Liang, X. Hu, and P. Luo, “Online knowledge distillation via collaborative learning,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
[77] L. Bicsi, B. Alexe, R. T. Ionescu, and M. Leordeanu, “JEDI: joint expert distillation in a semi-supervised multi-dataset student-teacher scenario for video action recognition,” in Proc. IEEE/CVF International Conference on Computer Vision, 2023.
[78] N. Shen, T. Xu, S. Huang, F. Mu, and J. Li, “Expert-guided knowledge distillation for semi-supervised vessel segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 11, 2023.
[79] A. Sochopoulos, I. Mademlis, E. Charalampakis, S. Papadopoulos, and I. Pitas, “Deep reinforcement learning with semi-expert distillation for autonomous uav cinematography,” in Proc. IEEE International Conference on Multimedia and Expo, jul 2023.
[80] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[81] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, 2020.
[82] L. Chen, Y. Li, X. Bai, X. Wang, Y. Hu, M. Song, L. Xie, Y. Yan, and E. Yin, “Real-time gaze tracking with head-eye coordination for head-mounted displays,” in Proc. IEEE International Symposium on Mixed and Augmented Reality, 2022.
[83] C. Palmero Cantarino, O. V. Komogortsev, and S. S. Talathi, “Benefits of temporal information for appearance-based gaze estimation,” in Proc. ACM Symposium on Eye Tracking Research and Applications, Jun. 2020.
[84] A. T. Duchowski, K. Krejtz, M. Volonte, C. J. Hughes, M. Brescia-Zapata, and P. Orero, “3d gaze in virtual reality: Vergence, calibration, event detection,” Procedia Computer Science, vol. 207, 2022, Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 26th International Conference KES2022.
[85] K. Wang and Q. Ji, “3d gaze estimation without explicit personal calibration,” Pattern Recognition, vol. 79, 2018.
[86] M. Mansouryar, J. Steil, Y. Sugano, and A. Bulling, “3d gaze estimation from 2d pupil positions on monocular head-mounted eye trackers,” in Proc. ACM Symposium on Eye Tracking Research and Applications, Mar. 2016.
[87] C. Elmadjian, P. Shukla, A. D. Tula, and C. H. Morimoto, “3d gaze estimation in the scene volume with a head-mounted eye tracker,” in Proc. ACM Workshop on Communication by Gaze Interaction, 2018.
[88] Y. Man, X. Zhao, and K. Zhang, “3D gaze estimation based on facial feature tracking,” in Proc. International Conference on Graphic and Image Processing, Z. Zhu, Ed., vol. 8768, 2013.