Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking

Jiading Li, Zhiyu Zhu, **hui Hou, Junhui Hou, Senior Member, IEEE, and **jian Wu, Senior Member, IEEE This work was supported in part by Hong Kong Research Grants Council under Grant 11218121, and in part by Hong Kong Innovation and Technology Fund under Grant MHP/117/21. J. Li and Z. Zhu contributed equally to this work. J. Li, Z. Zhu, J. Hou, and J. Hou are with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR. Email: [email protected]; [email protected]; [email protected]; [email protected]. Wu is with the School of Artificial Intelligence, Xidian University, Xi’an 710071, China. Email: [email protected]
Abstract

This paper tackles the problem of passive gaze estimation using both event and frame data. Considering the inherently different physiological structures, it is intractable to accurately estimate gaze purely based on a given state. Thus, we reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine approach involving anchor state selection and final gaze location. Moreover, to improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced to utilize denoising diffusion techniques to iteratively remove inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15%. The code will be publicly available at https://github.com/jdjdli/Denoise_distill_EF_gazetracker.

Index Terms:
Event-based vision, gaze estimation, latent distillation.

I Introduction

Eye gaze tracking represents a critical component in non-verbal communication, offering profound insights into an individual’s intentions and emotional states by capturing visual information from the user’s eyes. This technology can infer desires and needs by measuring the user’s attention on specific objects, making it invaluable in fields like human-computer interaction [1, 2, 3], virtual reality [4, 5], and intelligent transportation [6, 7].

Over recent decades, gaze estimation has witnessed rapid growth in the development of diverse methodologies, which can be generally divided into three categories, i.e., 2D eye feature regression-based approaches [8, 9, 10, 11], 3D model-based eye movement reconstruction algorithms [12], and appearance-based gaze estimation techniques [13]. Although the first two categories excel in tracking the eye’s position and movement, they typically rely on specialized hardware systems and lighting conditions. Conversely, appearance-based methods only utilize face or eye images as input to train a mapping model between appearance and gaze, thereby determining the corresponding gaze based on new appearance data [14, 15, 13, 16]. However, appearance-based methods are sensitive to individual differences and head movements in unconstrained environments. Addressing these challenges often necessitates high-speed, high-resolution RGB and optical cameras to capture more detailed visual information, leading to significant costs and energy demands [17, 18].

Figure 1: Left: Overview of our gaze estimation setup: our framework emphasizes the modeling of gaze shifts from a registered anchor state to the currently acquired state captured during actual use. Our approach takes input in the form of a frame coupled with corresponding event data to infer the position of the directional gaze point as the output. Right: Beyond the confines of static frame-based gaze estimation, studying dynamic ocular movements constitutes an additional research trajectory within computer vision.

Recently, event-based cameras have surged in popularity due to their exceptional temporal resolution, low latency, and expansive dynamic range, making them ideal for tracking fast eye movements like saccades with high fidelity [19]. Additionally, the high dynamic range allows them to operate well under varying lighting conditions, which is crucial for accurate gaze estimation in real-world scenarios. Despite the advantages, typical event-based sensors provide limited visual information, missing out on color, texture, and comprehensive contextual details that are readily available through conventional RGB imaging [20, 21, 22, 23]. Therefore, the emerging field of cross-modal gaze estimation promises to enhance the robustness and accuracy of gaze tracking by capitalizing on the complementary strengths of both data streams [24, 25, 26, 27, 28]. However, most existing methods leverage only one modality to assist the other, failing to take full advantage of the information contained within both data types.

In this paper, we propose a novel transformer-based gaze estimation framework that leverages the complementary strengths of event cameras and traditional frame-based imaging, as shown in Fig. 1. By synergizing the high temporal resolution of event data with the rich spatial information of frames via a local-global distillation process, our method achieves a new level of performance in gaze tracking. Technically, we reformulate gaze estimation as the quantification of eye motion transitions from the current state to several prior registered anchor states. Based on this, we first partition the entire gaze point region into several sub-regions and employ vision transformers to pre-train a set of models on different sub-regions, yielding several local expert networks with relatively high accuracy for localized gaze estimation. Furthermore, we introduce a local-global latent denoising distillation method to distill knowledge from the set of local expert networks into a global student network, diminishing the adverse effects of the inherent noise in event data on student network performance. Extensive experiments demonstrate the significant superiority of the proposed method over state-of-the-art methods.

In summary, the main contributions of this paper are three-fold:

  • we formulate gaze estimation as an end-to-end prediction of state shifting from registered anchor states;

  • we distill multiple pre-trained local expert networks into a more robust student network to combat overfitting in gaze estimation; and

  • we propose a self-supervised latent denoising method to mitigate the adverse effects of noise from local expert networks to improve the performance of the student network.

The remainder of this paper is organized as follows. Section II briefly reviews related works in this field. Section III presents the proposed method in detail, followed by extensive experiments and analysis in Section IV. Finally, Section V concludes this paper.

Figure 2: Illustration of the workflow of the proposed framework, where black (resp. pink) arrows represent the training (resp. testing) pipeline. The First Stage (Sec. III-A): State Correlation Modeling by Local Expert. We first partition the entire gaze point region into several sub-regions, and the data of each sub-region is used to train a local expert network. Each expert network is simultaneously fed with the anchor state and a search state and utilizes transformers to explicitly model the correlation between the anchor and search states. The Second Stage (Sec. III-B): Local-Global Latent Denoising Distillation. A latent denoising knowledge distillation method is introduced to merge the expertise of these local expert networks into a single, comprehensive student network. Note that the latent denoising and knowledge distillation are utilized in the training phase only (see details in Sec. IV). Anchor selection in the light pink box is illustrated in detail in Fig. 6.

II Related work

In this section, we give a review of event-based vision, eye tracking, event-frame methods, and distillation networks.

Event-based Vision. Neuromorphic event-based cameras, inspired by the Silicon Retina concept [29, 30], are key for fast vision tasks due to their quick response and low latency [31, 32, 33, 34]. They support a wide range of applications including object recognition [35, 21], navigation [36], pose estimation [37], 3D reconstruction [38], SLAM [39, 40], gesture tracking [41], and object tracking [42, 43, 44, 45]. Initially, these cameras used event patterns to detect motion [46, 47, 48, 49] and track simple shapes [50, 51]. Later improvements introduced event-driven algorithms [52] and optimization techniques like gradient descent to refine tracking [53]. Algorithms such as mean-shift and Monte Carlo [54, 55] have further enhanced tracking by adjusting to changes in the model. For instance, part-based models [55] have segmented subjects into parts, enabling quick and accurate tracking of facial or body movements [56]. Chen et al.  [57] introduced a sparse Change-Based ConvLSTM model for efficient event-based eye tracking. Stoffregen et al.  [58] presented a novel method for high-frequency, low-power eye tracking using event cameras and a coded differential lighting scheme to enhance corneal glint detection while suppressing non-glint events. Wang et al. [59] introduced a unified single-stage transformer-based framework for efficient and accurate color-event object tracking. Ryan et al.  [60] introduced a novel method for real-time face and eye tracking, as well as blink detection by using event cameras. It leverages a fully convolutional recurrent neural network architecture to enhance driver monitoring systems. However, the sparse nature of event data can be problematic in low-contrast settings, and using algorithms designed for dense data in sparse situations may still increase computational demands.

Eye Tracking. Early camera-based eye tracking systems monitored Purkinje images [61, 62, 63, 64, 65, 66]; Morimoto et al. [67] and Duchowski et al. [68] provided detailed examinations of the associated pupil modeling and gaze estimation processes. Contemporary research focuses on deep learning to deduce gaze orientation from complex facial datasets obtained via standard webcams, directly mapping the visual characteristics of the eye to gaze coordinates, with Chen et al. [69] enhancing accuracy through dilated convolution techniques. Advancing the field, Cheng et al. [16] integrated full-face and eye region data for more accurate gaze inference and incorporated transformer models to exploit their superior handling of data dependencies. Nonetheless, these advanced models are best suited to full-face images and not ideal for eye-only cameras, which require custom-designed networks for accurate data analysis. Also, their effectiveness is limited by the camera’s frame rate, which can affect their real-time accuracy and performance.

Event-frame Methods. Hybrid methods combine the detailed intensity data from a standard frame with the rapid detection of intensity changes from asynchronous event streams [25, 26, 27, 28]. Wang et al. [44] introduced a cross-modality transformer algorithm for enhancing reliable object tracking by combining visible and event camera data, demonstrating improved performance in challenging scenarios. Feng et al. [25] developed an event-driven eye segmentation algorithm that overcomes the limitations of standard frame rates, maintaining high accuracy despite a lower resolution. The auto-segmentation model was designed to combine with a previous gaze estimation model to improve estimation accuracy. In fact, it was not an end-to-end estimation model and the prediction latency was very significant. Angelopoulos et al. [19] enhanced the temporal resolution of gaze tracking by integrating event cameras close to the eyes. These cameras provided constant updates to the initial pupil location determined by traditional algorithms, allowing for precise adjustments at high temporal resolutions. However, the conventional baseline can significantly affect the performance of these methods [70, 71]. This approach resulted in gaze estimation accuracy varying within a wide range.

Distillation Networks. Knowledge distillation [72] is intended for the transfer of learned features from a "teacher" model to an efficient "student" model. Wang et al. [73] presented a novel hierarchical knowledge distillation framework for high-speed, low-latency visual object tracking using event cameras. Lopez-Paz et al. [74] expanded upon this concept by introducing privileged information, wherein the student model leverages insights from multiple teacher models, each accessing different data sources. Xiang et al. [75] utilized the collective intelligence of several teacher models to tackle the long-tailed distribution problems. Guo et al. [76] formulated a collaborative learning framework, similar to a congregation of local experts sharing their knowledge. Therefore, the essence of these expert networks’ collective intelligence is distilled into a singular, more generalized student network [77, 78, 79], with an enhanced level of accuracy, surpassing what could be achieved by an individual model trained on a uniform dataset. This kind of distillation has been proven to be suitable for the end-to-end gaze estimation task, with the detailed information provided in Sec. III.

This study fuses conventional intensity frames with dynamic event camera data into a unified network architecture, leveraging pre-trained local expert models. It aims to utilize this integration to create an end-to-end, advanced, and accurate eye-tracking method, harnessing the unique benefits of both traditional frame modality and innovative event-based sensor input.

III Proposed Method

Gaze tracking aims to determine the gaze location, given the measured state. However, the high speed of eye movement and the subtle pattern of the eyeball make it hard to derive accurate predictions. Inspired by the high-temporal resolution and low-latency characteristics of event-based vision [31, 32, 33, 34], we propose to utilize frames together with event data for building an accurate near-eye gaze estimation pipeline, as illustrated in Fig. 2.

Specifically, due to the different individual biometric characteristics, e.g., pupil distance and size of the eyeball, there would be a significant bias in the gaze estimation process [68]. Consequently, instead of directly calculating the absolute location of gaze focus from a single observational state, we propose to calculate the relative shift of the measured state compared with pre-registered anchor states. Moreover, to learn the correlation between those two states in an end-to-end manner, we delve into the potential of utilizing pre-trained vision transformers for cross-modal eye tracking. However, directly training the gaze estimation network in a large region usually leads to overfitting phenomena, as shown in Fig. 3. Thus, we train a set of sub-region gaze estimation models for different anchor states, which are called local expert networks (see Sec. III-A).

Subsequently, to further boost the capacity for accurate gaze prediction across diverse scenarios, we design a distillation-based algorithm to ensemble knowledge of pre-trained local expert networks into a large student network. However, the presence of noise in the measured inputs (especially for the event data) can disrupt neural network training and negatively impact performance. To alleviate the potential noise influencing the neural network training, we propose a self-supervised latent denoising neural network for feature maps from experts and then apply knowledge distillation to the student network (see Sec. III-B).

In what follows, we will detail the proposed pipeline.

Figure 3: Gaze estimation accuracy of models trained on perceived regions of different sizes, denoted as $n\times n$. All models are evaluated on data with perceived regions identical to those in their respective training sets. The experimental results indicate that enlarging the training region leads to a substantial degradation in network performance. The number of network parameters is increased with the region size so that the network can fit the dataset (otherwise, the network hardly converges). Meanwhile, as shown in the rightmost example, directly training with multiple anchors in the $11\times 11$ region also fails to converge to an accurate result. This observation suggests that, instead of directly training on the whole region, we can distill the small but accurate local-region models into a large student network for accurate modeling of gaze motion. $\uparrow$ (resp. $\downarrow$) indicates that larger (resp. smaller) values are better.
Figure 4: (a) illustrates the outcome of training a single model on a large region, which exhibits pronounced over-fitting: the heatmap shows attention dispersed away from the ocular region. (b) shows the performance of our model, which distills knowledge from a set of local experts, with a heatmap that is distinctly concentrated on the ocular region. The visualization indicates that, through the proposed local-global distillation, the network attends accurately to the relevant region.

III-A State Correlation Modeling by Local Expert

Event-Frame Tokenization. Our framework takes a paired near-eye frame and the corresponding event stream as input. To effectively fuse these two data modalities, we transform them into a unified representation. Specifically, we first aggregate events within specific time intervals to convert the asynchronous event flow into a synchronous format, aligning with frame exposure duration. We then voxelize the original event stream into a grid of voxels using PointNet [80]. This voxelization enables us to represent the event data in a structured format analogous to the frame data. Finally, we tokenize both the frame and the voxelized event data into sequences of spatial tokens, where each token corresponds to a specific location in the frame and its associated event voxel. This tokenized representation facilitates seamless fusion and processing of event and frame information within our model.
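
The tokenization steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it bins events by timestamp and accumulates polarities rather than using the PointNet-based voxelization cited above, and the names `voxelize_events` and `patch_tokens` as well as the `(t, x, y, polarity)` event layout are hypothetical.

```python
from typing import List, Tuple

# Assumed event layout: (timestamp, x, y, polarity), with polarity in {-1, +1}.
Event = Tuple[float, int, int, int]


def voxelize_events(events: List[Event], t_start: float, t_end: float,
                    num_bins: int, height: int, width: int):
    """Accumulate event polarities into a (num_bins, height, width) grid.

    Only events inside the window [t_start, t_end) are used, which mimics
    aligning the asynchronous stream with one frame exposure.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    duration = t_end - t_start
    for t, x, y, p in events:
        if not (t_start <= t < t_end):
            continue  # event falls outside the exposure window
        b = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        grid[b][y][x] += p
    return grid


def patch_tokens(grid, patch: int):
    """Cut the grid into non-overlapping patch x patch spatial tokens.

    Each token flattens all temporal bins of one spatial patch, so token k
    corresponds to one location on the sensor plane.
    """
    num_bins, height, width = len(grid), len(grid[0]), len(grid[0][0])
    tokens = []
    for py in range(0, height - patch + 1, patch):
        for px in range(0, width - patch + 1, patch):
            tokens.append([grid[b][y][x]
                           for b in range(num_bins)
                           for y in range(py, py + patch)
                           for x in range(px, px + patch)])
    return tokens
```

Frame patches cut on the same spatial grid would then pair one-to-one with these event tokens, which is the property the unified representation relies on.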

Correlation Modeling. After creating the unified representation for both data modalities, we construct the embedding representations of two gaze points, i.e., the estimation gaze point indicates the current state and the template gaze point represents the registered anchor state, by concatenating the token embeddings of two modalities from the same state.

We then apply a Vision Transformer (ViT) with the multi-head self-attention (MSA) mechanism to model the correlation between the current and registered anchor states. The processed features are subsequently flattened into a single feature representation, which is analyzed by a convolutional layer to predict the final class label.
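
To make the correlation step concrete, the sketch below joins the tokens of the anchor and current states into one sequence and applies scaled dot-product self-attention over it. It is a simplified illustration, not the ViT used in the paper: it uses a single head with identity query/key/value projections, and it assumes that the two modalities of one state are concatenated feature-wise; `build_joint_sequence` and `self_attention` are hypothetical names.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def build_joint_sequence(anchor_frame, anchor_event, current_frame, current_event):
    """Concatenate frame and event embeddings per token, then stack both states.

    Assumption: modalities are concatenated along the feature axis and the
    two states along the sequence axis.
    """
    anchor = [f + e for f, e in zip(anchor_frame, anchor_event)]
    current = [f + e for f, e in zip(current_frame, current_event)]
    return anchor + current


def self_attention(tokens):
    """Single-head self-attention with identity projections (illustrative only)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, tokens))
                    for j in range(dim)])
    return out
```

Because every token of the current state attends to every token of the anchor state (and vice versa), the output features encode the relation between the two states rather than either state alone.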

Our experiments confirm the model’s effectiveness in distinguishing between the current and registered anchor states for gaze estimation. However, we observed limitations when a single registered anchor state was used for predictions across a large gaze area, leading to significant over-fitting as shown in Figs. 3 and 4. To address this, we propose partitioning the entire gaze point area into $N$ smaller sub-regions, each with its own regional registered anchor state. This allows us to train $N$ specialized local expert networks. For our task, we achieve a good balance between training costs and accuracy by dividing the region into five sub-regions, as shown in Fig. 5, each with a dedicated local expert network.

Figure 5: Impact of the number of registered anchor states on prediction accuracy. The results demonstrate that increasing the number of anchor states generally improves prediction accuracy, with an optimal performance achieved when using 5 anchor states. Note that product accuracy can be observed on both vertical axes.

Anchor Selection Mechanism. Since each local expert network only perceives data from its own sub-region, we further design a network that globally determines which expert (anchor) to use during gaze estimation. As illustrated in Fig. 6, the network uses MLPs to select the registered anchor state nearest to the input current state, which determines the sub-region that the current state belongs to.
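
A minimal sketch of such a selector is given below: a small MLP scores each anchor and the highest-scoring anchor identifies the expert to run. The layer layout, the ReLU activation, and the names `mlp_forward` and `select_anchor` are assumptions for illustration; the paper does not specify them in this section.

```python
def mlp_forward(x, layers):
    """Run a plain MLP; `layers` is a list of (weights, biases) pairs.

    ReLU is applied between layers and omitted after the last one, so the
    output can be read as one unnormalized score per anchor.
    """
    h = list(x)
    for idx, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h


def select_anchor(state_feature, layers):
    """Return the index of the anchor (and hence expert) with the top score."""
    scores = mlp_forward(state_feature, layers)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy weights for three anchors and a 2-D state feature (hypothetical values).
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.5]),
]
expert_index = select_anchor([2.0, 0.5], toy_layers)
```

At test time only the selected anchor's registered state is paired with the current state, so the selection cost is one small forward pass.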

III-B Local-Global Latent Denoising Distillation

Figure 6: Illustration of the anchor selection mechanism. We employ an MLP-driven anchor selection mechanism to dynamically identify the nearest anchor state for the input state. The correlation with other modules is illustrated in the training part in Fig. 2.
Algorithm 1 Training Denoiser
1: Repeat
2:     $\mathbf{X}_i \sim q(\mathbf{X}_i)$
3:     $t \sim \mathrm{Uniform}(\{i, i+1, \ldots, T\})$
4:     $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
5:     $i \sim \mathcal{U}(27, 32)$
6:     Take gradient descent step on
7:         $\delta = \epsilon - \epsilon_{\theta}\left(\sqrt{\frac{\bar{\alpha}_t}{\bar{\alpha}_i}}\mathbf{X}_i + \sqrt{1-\frac{\bar{\alpha}_t}{\bar{\alpha}_i}}\epsilon,\; t\right)$
8:         $\nabla_{\theta}\|\sum\delta\|^2 + \|\sigma(\delta) - \sqrt{2}\|^2$
9: Until converged

Algorithm 2 Reverse process
1: $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
2: $\mathbf{X}_{T'} = \sqrt{\frac{\bar{\alpha}_{T'}}{\bar{\alpha}_i}}\mathbf{X}_i + \sqrt{1-\frac{\bar{\alpha}_{T'}}{\bar{\alpha}_i}}\epsilon$
3: For $t = T', \ldots, i$ do
4:     $\mathbf{z} \sim \mathcal{N}(0, \mathbf{I})$ if $t > i$, else $\mathbf{z} = 0$
5:     $\sigma_t = \sqrt{\frac{\hat{\beta}_t(\bar{\alpha}_i - \bar{\alpha}_{t-1})}{\bar{\alpha}_i - \bar{\alpha}_t}}$
6:     $\mathbf{X}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{X}_t - \frac{(1-\alpha_t)\sqrt{\bar{\alpha}_i}}{\sqrt{\bar{\alpha}_i - \bar{\alpha}_t}}\epsilon_{\theta}(\mathbf{X}_t, t)\right) + \sigma_t\mathbf{z}$
7: End for
8: Return $\mathbf{X}_i$
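
The reverse process of Algorithm 2 can be sketched in plain Python as below. This is an illustration under several assumptions that are not stated in the text: a linear $\beta$ schedule as in DDPM, $\hat{\beta}_t = 1 - \alpha_t$, and a loop that stops at $t = i+1$ so that the last update produces $\mathbf{X}_i$ and the denominator $\bar{\alpha}_i - \bar{\alpha}_t$ stays nonzero. The denoiser `eps_theta` is a hypothetical callable standing in for the trained network.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed); index 0 is a no-noise placeholder."""
    alphas, alpha_bars = [1.0], [1.0]
    for t in range(1, num_steps + 1):
        beta = beta_start + (beta_end - beta_start) * (t - 1) / max(num_steps - 1, 1)
        alphas.append(1.0 - beta)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alphas, alpha_bars


def reverse_process(x_i, i, t_prime, eps_theta, alphas, alpha_bars, rng):
    """Re-noise a latent from step i to step T', then denoise back to step i."""
    # Step 2 of Algorithm 2: jump from X_i to X_{T'} in closed form.
    ratio = alpha_bars[t_prime] / alpha_bars[i]
    x = [math.sqrt(ratio) * v + math.sqrt(1.0 - ratio) * rng.gauss(0.0, 1.0)
         for v in x_i]
    # Steps 3-7: iterate t = T', ..., i+1 (assumed end point, see lead-in).
    for t in range(t_prime, i, -1):
        gap = alpha_bars[i] - alpha_bars[t]
        beta_hat = 1.0 - alphas[t]  # assumption: beta_hat_t equals beta_t
        sigma = math.sqrt(beta_hat * (alpha_bars[i] - alpha_bars[t - 1]) / gap)
        coef = (1.0 - alphas[t]) * math.sqrt(alpha_bars[i]) / math.sqrt(gap)
        eps_hat = eps_theta(x, t)
        is_last = (t == i + 1)  # no fresh noise on the final update
        x = [(xv - coef * ev) / math.sqrt(alphas[t])
             + (0.0 if is_last else sigma * rng.gauss(0.0, 1.0))
             for xv, ev in zip(x, eps_hat)]
    return x
```

With these assumptions the update is the standard DDPM posterior step written for the rescaled schedule $\bar{\alpha}_t / \bar{\alpha}_i$, so a denoiser that predicts the injected noise exactly returns the input latent unchanged. Averaging the outputs of several calls with different random seeds corresponds to the expectation estimate described in the text.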

The proposed denoising diffusion algorithm consists of the following two steps:

  1. training a denoising diffusion network for removing potential noise as illustrated by Eq. (1) and Algorithm 1; and

  2. distilling the knowledge from local expert networks into a student network via Eq. (7) with the denoising process as Algorithm 2.

Given the feature maps of measurement $\widetilde{\mathbf{X}} = \mathbf{X} + \gamma\epsilon$, where $\mathbf{X}$ indicates the expectation and $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ represents the random Gaussian noise, we aim to recover the distribution of $\widetilde{\mathbf{X}} \sim \mathcal{N}(\mathbf{X}, \gamma\mathbf{I})$ from a given measurement $\widetilde{\mathbf{X}}$. Specifically, we train a denoising diffusion network that can generate latent $\mu$ from Gaussian noise. Then, by progressively removing the noise from the latent of a given noised measurement, we can finally derive the expectation by averaging several denoising results.

We formulate the distribution of the latent feature maps as $p(\widetilde{\mathbf{X}}|\mathbf{X}) = \mathcal{N}(\widetilde{\mathbf{X}}|\mathbf{X}, \gamma\mathbf{I})$, where $\widetilde{\mathbf{X}}$ indicates the measured noisy latent, $\mathbf{X}$ denotes the corresponding noise-free expectation, and $\gamma\mathbf{I}$ represents the covariance matrix under the independence assumption. Since we only know the noised sample, we further re-parameterize $\gamma' = 1 - \gamma$ and $\mathbf{X} = \sqrt{\gamma'}\mathbf{Y}$.

Subsequently, we have $p(\widetilde{\mathbf{X}}|\mathbf{Y}) = \mathcal{N}(\widetilde{\mathbf{X}}|\sqrt{\gamma'}\mathbf{Y}, \gamma\mathbf{I})$. Recall the intermediate sample of DDPM [81], $\mathbf{X}_i = \sqrt{\bar{\alpha}_i}\mathbf{X}_0 + \sqrt{1-\bar{\alpha}_i}\mathbf{X}_T$, where $\bar{\alpha}_i = \prod_{j=0}^{i}\alpha_j$. Thus, the noised latent $\widetilde{\mathbf{X}}$ can be approximated by a certain step result $\mathbf{X}_i$ in a diffusion reverse process ($\mathbf{X}_T \rightarrow \mathbf{X}_0$), where $\mathbf{X}_0 = \mathbf{Y}$ and $\bar{\alpha}_i = \gamma'$. We then introduce a process to derive the distribution of $\mathbf{X}_i$ from a sample $\widetilde{\mathbf{X}}$ via diffusion models.

The forward process can be simply constructed by adding noise following $\mathbf{X}_t = \sqrt{\bar{\alpha}_t}\mathbf{X}_{t-1} + \sqrt{1-\bar{\alpha}_t}\epsilon_t$, where $\epsilon_t$ indicates a random noise. Thus, we can train the denoiser as Algorithm 1 to learn the potential noise from the noised sample.
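
As an illustration of how one training input for Algorithm 1 might be drawn, the sketch below samples the steps $i$ and $t$ and applies the closed-form jump $\mathbf{X}_t = \sqrt{\bar{\alpha}_t/\bar{\alpha}_i}\,\mathbf{X}_i + \sqrt{1-\bar{\alpha}_t/\bar{\alpha}_i}\,\epsilon$ that appears inside $\epsilon_\theta$. The linear $\beta$ schedule, the reading of $\mathcal{U}(27, 32)$ as a uniform draw over integers, and the function names are assumptions made for the example.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    alpha_bars = [1.0]  # index 0: no noise added yet
    for t in range(1, num_steps + 1):
        beta = beta_start + (beta_end - beta_start) * (t - 1) / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def sample_steps(rng, total_steps, i_low=27, i_high=32):
    """Draw the start step i, then a target step t in {i, ..., T}."""
    i = rng.randint(i_low, i_high)
    t = rng.randint(i, total_steps)
    return i, t


def noised_training_input(x_i, i, t, alpha_bars, rng):
    """Jump from the measured latent X_i to X_t and return (X_t, eps)."""
    ratio = alpha_bars[t] / alpha_bars[i]
    eps = [rng.gauss(0.0, 1.0) for _ in x_i]
    x_t = [math.sqrt(ratio) * v + math.sqrt(1.0 - ratio) * e
           for v, e in zip(x_i, eps)]
    return x_t, eps
```

The denoiser would then be evaluated on the returned `x_t` at step `t`, and its output compared with the returned `eps` to form the residual $\delta$.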

In Algorithm 1, we optimize the diffusion denoiser with the following loss term

$$L_{Diff}=\nabla_{\theta}\|\sum\delta\|^{2}+\|\sigma(\delta)-\sqrt{2}\|^{2}. \tag{1}$$

Since $\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, we expect $\epsilon_{\theta}(\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\epsilon,t)\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Thus, we have

$$\delta=\epsilon-\epsilon_{\theta}(\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i}+\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\epsilon,t),\quad\delta\sim\mathcal{N}(\mathbf{0},2\mathbf{I}). \tag{2}$$

Then we regularize $\sum\delta\rightarrow 0$ and $\sigma(\delta)\rightarrow\sqrt{2}$, which is exactly the training objective in Eq. (1).
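The two regularizers can be illustrated with a small scalar sketch. It assumes that $\sigma(\cdot)$ is the population standard deviation over the elements of $\delta$ and evaluates only the scalar value of the objective (the leading $\nabla_{\theta}$ is left to an autodiff framework); the function name `diff_loss` is hypothetical.

```python
import math
import statistics

def diff_loss(eps, eps_pred):
    """Scalar value of the objective in Eq. (1) for flat lists of noise values.

    eps      : sampled noise epsilon
    eps_pred : denoiser output epsilon_theta(...)
    """
    delta = [e - p for e, p in zip(eps, eps_pred)]
    sum_term = sum(delta) ** 2                                   # pushes sum(delta) -> 0
    std_term = (statistics.pstdev(delta) - math.sqrt(2.0)) ** 2  # pushes std(delta) -> sqrt(2)
    return sum_term + std_term

r = math.sqrt(2.0)
loss_matched = diff_loss([r, -r], [0.0, 0.0])       # zero sum, std sqrt(2)
loss_mismatched = diff_loss([1.0, 1.0], [0.0, 0.0])  # sum 2, std 0
```

In this toy case the first residual already has zero sum and standard deviation $\sqrt{2}$, so its loss vanishes, whereas the second is penalized by both terms.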

However, our algorithm starts from a noised measurement and iteratively adds and then removes noise. Thus, the reverse process of our algorithm is explicitly different from that of the standard denoising diffusion model. We therefore carefully examine the reverse distribution of the Markov chain, which starts from $\mathbf{X}_{i}$. We begin with one reverse transition:

$$q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})=\frac{q(\mathbf{X}_{t}|\mathbf{X}_{t-1},\mathbf{X}_{i})\,q(\mathbf{X}_{t-1}|\mathbf{X}_{i})}{q(\mathbf{X}_{t}|\mathbf{X}_{i})}, \tag{3}$$

where $\mathbf{X}_{t}$ indicates the noised sample with inherent noise from collected event data. We then expand the term $q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})$:

\begin{align}
q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})
&=\frac{q(\mathbf{X}_{t}|\mathbf{X}_{t-1},\mathbf{X}_{i})\,q(\mathbf{X}_{t-1}|\mathbf{X}_{i})}{q(\mathbf{X}_{t}|\mathbf{X}_{i})} \tag{4}\\
&=\frac{\mathcal{N}(\mathbf{X}_{t},\sqrt{\alpha_{t}}\mathbf{X}_{t-1},\sqrt{1-\alpha_{t}}\mathbf{I})\,\mathcal{N}(\mathbf{X}_{t-1},\sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{i}}}\mathbf{X}_{i},\sqrt{1-\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{i}}}\mathbf{I})}{\mathcal{N}(\mathbf{X}_{t},\sqrt{\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{X}_{i},\sqrt{1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{i}}}\mathbf{I})} \notag\\
&\propto\mathcal{N}\Big(\mathbf{X}_{t-1},\frac{(1-\alpha_{t})(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}\Big(\frac{\sqrt{\alpha_{t}}}{1-\alpha_{t}}\mathbf{X}_{t}+\frac{\sqrt{\bar{\alpha}_{i}\bar{\alpha}_{t-1}}}{\bar{\alpha}_{i}-\bar{\alpha}_{t-1}}\mathbf{X}_{i}\Big), \tag{5}\\
&\qquad\sqrt{\frac{(1-\alpha_{t})(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}}\Big). \notag
\end{align}

Moreover, with $\hat{\beta}_{t}=1-\alpha_{t}$, we have

\begin{align}
q(\mathbf{X}_{t-1}|\mathbf{X}_{t},\mathbf{X}_{i})\propto\mathcal{N}\Big(\mathbf{X}_{t-1},\ &\frac{\hat{\beta}_{t}(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}\Big(\frac{\sqrt{\alpha_{t}}}{\hat{\beta}_{t}}\mathbf{X}_{t}+\frac{\sqrt{\bar{\alpha}_{i}\bar{\alpha}_{t-1}}}{\bar{\alpha}_{i}-\bar{\alpha}_{t-1}}\mathbf{X}_{i}\Big), \tag{6}\\
&\sqrt{\frac{\hat{\beta}_{t}(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})}{\bar{\alpha}_{i}-\bar{\alpha}_{t}}}\Big). \notag
\end{align}
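As a sanity check on Eq. (6), the sketch below evaluates its mean and standard deviation for scalar inputs and compares them with the generic precision-weighted rule for combining a Gaussian prior $q(\mathbf{X}_{t-1}|\mathbf{X}_{i})$ with a Gaussian likelihood $q(\mathbf{X}_{t}|\mathbf{X}_{t-1})$. The function names and the numeric values are hypothetical; the sketch assumes $\bar{\alpha}_{t-1}<\bar{\alpha}_{i}$, i.e., step $t-1$ lies strictly after step $i$ in the noising direction.

```python
import math

def reverse_posterior(x_t, x_i, alpha_t, abar_prev, abar_i):
    """Mean and std of q(X_{t-1} | X_t, X_i) written as in Eq. (6), for scalars.

    abar_prev is alpha_bar_{t-1}; alpha_bar_t = alpha_t * alpha_bar_{t-1}.
    """
    beta_hat = 1.0 - alpha_t
    abar_t = alpha_t * abar_prev
    var = beta_hat * (abar_i - abar_prev) / (abar_i - abar_t)
    mean = var * (
        math.sqrt(alpha_t) / beta_hat * x_t
        + math.sqrt(abar_i * abar_prev) / (abar_i - abar_prev) * x_i
    )
    return mean, math.sqrt(var)

def precision_weighted_posterior(x_t, x_i, alpha_t, abar_prev, abar_i):
    """The same posterior from the standard Gaussian conjugacy rule."""
    prior_mean = math.sqrt(abar_prev / abar_i) * x_i   # q(X_{t-1} | X_i)
    prior_var = 1.0 - abar_prev / abar_i
    lik_var = 1.0 - alpha_t                            # q(X_t | X_{t-1})
    precision = 1.0 / prior_var + alpha_t / lik_var
    mean = (prior_mean / prior_var + math.sqrt(alpha_t) * x_t / lik_var) / precision
    return mean, math.sqrt(1.0 / precision)

args = dict(x_t=0.3, x_i=-1.2, alpha_t=0.98, abar_prev=0.6, abar_i=0.8)
m6, s6 = reverse_posterior(**args)
mg, sg = precision_weighted_posterior(**args)
```

Both routes give the same mean and standard deviation for these values, and the variance works out to $\hat{\beta}_{t}(\bar{\alpha}_{i}-\bar{\alpha}_{t-1})/(\bar{\alpha}_{i}-\bar{\alpha}_{t})$.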

However, the noise level indicated by $i$ is quite hard to derive. Thus, during the training phase, we relax $\bar{\alpha}_{i}$ to a certain range of $i\in[27,32]$. Note that through the aforementioned reverse process, we derive the distribution of $\mathbf{X}_{i}$ from a given measurement by iteratively sampling $\mathbf{X}_{t-1}$, $t=T^{\prime},\cdots,i$, as in Eq. (6), which is illustrated in detail in Algorithm 2. Although we do not explicitly remove the noise from the data, we expect that, during the training process, the variable $\mathbf{X}$ converges to the noise-free expectation $\mathbf{X}_{0}$ as

$$\mathcal{L}_{d}(\mathbf{X}^{\theta},\mathbf{X}^{\phi}_{i})=\sum_{\mathbf{X}^{\phi}_{i}\sim p(\mathbf{X}^{\phi}_{i})}\Big\|\mathbf{X}^{\theta}-\frac{\mathbf{X}^{\phi}_{i}}{\sqrt{\bar{\alpha}_{i}}}\Big\|_{2}^{2}, \tag{7}$$

where $\sum_{\mathbf{X}^{\phi}_{i}\sim p(\mathbf{X}^{\phi}_{i})}$ indicates summation across different diffusion reconstruction samples of the teacher network $\phi$, and $\mathbf{X}^{\theta}$ denotes the output feature of the student network $\theta$.
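The role of the summation in Eq. (7) can be seen in a small sketch on plain Python lists: the student feature that minimizes the loss is the mean of the rescaled teacher samples, so more samples give a better estimate of the expectation. The function names and toy values are hypothetical, and features are flattened to one-dimensional lists for brevity.

```python
import math

def distill_loss(student, teacher_samples, abar_i):
    """Eq. (7): sum over teacher samples of ||X_theta - X_phi_i / sqrt(abar_i)||_2^2."""
    scale = 1.0 / math.sqrt(abar_i)
    total = 0.0
    for sample in teacher_samples:
        total += sum((s - scale * x) ** 2 for s, x in zip(student, sample))
    return total

def best_student(teacher_samples, abar_i):
    """Closed-form minimiser of Eq. (7): the mean of the rescaled teacher samples."""
    scale = 1.0 / math.sqrt(abar_i)
    n = len(teacher_samples)
    return [scale * sum(column) / n for column in zip(*teacher_samples)]

samples = [[0.5, 1.0], [1.5, 1.0]]   # two toy reconstructions from the teacher
optimum = best_student(samples, abar_i=0.25)
loss_at_optimum = distill_loss(optimum, samples, abar_i=0.25)
loss_elsewhere = distill_loss([1.0, 2.0], samples, abar_i=0.25)
```

Here the rescaled samples are `[1, 2]` and `[3, 2]`, so the minimiser is their mean `[2, 2]`; any other student feature yields a larger loss.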

III-C Training Objective

We train our gaze estimation framework in a two-stage manner. In the first stage, we train five local experts, each specializing in a sub-region of the gaze point area. Each expert is trained using the cross-entropy loss function, $\mathcal{L}_{e}$, with the corresponding subsets of gaze point labels and regional registered anchor states:

$$\mathcal{L}_{e}=\mathcal{L}_{CE}(\hat{\mathbf{Y}},\mathbf{Y}), \tag{8}$$

where $\hat{\mathbf{Y}}\in\mathbb{R}^{L}$ and $\mathbf{Y}\in\mathbb{R}^{L}$ represent the predicted and ground-truth locations, respectively.

The second stage involves distilling the knowledge from these local experts into a student network designed to handle the entire gaze movement region. This distillation process utilizes three loss functions. We retain $\mathcal{L}_{e}$ as a hard loss to ensure stable performance of the student network. Additionally, we employ a soft loss, the Kullback-Leibler divergence loss $\mathcal{L}_{s}$, to guide the learning of the student network:

$$\mathcal{L}_{s}=\mathcal{L}_{KL}(\mathbf{T}_{S},\mathbf{T}_{E}), \tag{9}$$

where $\mathbf{T}_{S},\mathbf{T}_{E}\in\mathbb{R}^{H\times W}$ represent the attention matrices of the student and local experts, respectively. Finally, the total loss for the second stage is a weighted sum of $\mathcal{L}_{e}$, $\mathcal{L}_{s}$, and the feature distillation loss $\mathcal{L}_{d}$ of Eq. (7):

$$\mathcal{L}=\alpha\cdot\mathcal{L}_{e}+\beta\cdot\mathcal{L}_{s}+\lambda\cdot\mathcal{L}_{d}, \tag{10}$$

where $\alpha$, $\beta$, and $\lambda$ are the corresponding weights for balancing the different loss terms. Based on our extensive ablation experiments, we set $\alpha$, $\beta$, and $\lambda$ to 1, 1, and 500, respectively.
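The weighted objective of Eq. (10) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the KL direction (teacher relative to student) is an assumption, and the attention matrices are treated as flattened, normalized probability vectors.

```python
import math

def cross_entropy(logits, target_index):
    """Hard loss L_e: negative log-softmax at the ground-truth class."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """Soft loss L_s: KL(teacher || student); the direction is an assumption."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

def total_loss(l_e, l_s, l_d, alpha=1.0, beta=1.0, lam=500.0):
    """Eq. (10) with the weights reported in the text (1, 1, 500)."""
    return alpha * l_e + beta * l_s + lam * l_d

l_e = cross_entropy([0.0, 0.0], target_index=0)      # uniform logits over 2 classes
l_s = kl_divergence([0.25, 0.75], [0.25, 0.75])      # identical maps
l_total = total_loss(l_e=0.5, l_s=0.1, l_d=0.002)
```

With $\lambda=500$, a feature loss of only 0.002 already contributes 1.0 to the total, which shows how the large weight brings the small-magnitude distillation term up to the scale of the other two.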

IV Experiment

IV-A Experimental Settings

Dataset.  We used the hybrid event-based IR near-eye gaze tracking dataset [19] to evaluate our proposed system. The dataset was captured with a DAVIS-346b sensor (iniVation) equipped with a 25 mm f/1.4 VIS-NIR C-mount lens (EO-#67-715) and a UV/VIS cutoff filter (EO-#89-834). Subjects used an ophthalmic headrest coupled with a restraining apparatus to minimize head movement artifacts.

The dataset recorded the gaze movements of 24 subjects following a $40\times 40$-pixel luminous green fixation cross rendered on a 40-inch-diagonal, $1920\times 1080$-pixel display (Sceptre 1080p X415BV-FSR) with a visual field of view (FoV) of $64^{\circ}\times 96^{\circ}$.

The dataset was divided into two experimental conditions designed to elicit different oculomotor responses: random saccadic movements and controlled smooth pursuit tracking. In the first paradigm, subjects were instructed to direct their gaze towards the stimulus, which appeared at random within a grid of 121 discrete points (an $11\times 11$ grid pattern projected onto the display), with each point presented for 1.5 seconds. The sequence of locations was uniformly randomized and remained consistent across all participants. Some sample images are shown in Fig. 7. The second paradigm was used for free gaze point estimation. Here, the subjects’ task was to maintain visual fixation on the stimulus as it traversed a predefined square-wave trajectory. This trajectory started at the upper boundary of the display and proceeded downward, covering the full horizontal extent of the screen, with a vertical displacement amplitude of 150 pixels. Although the experiments intentionally induced saccadic jumps and smooth pursuit movements, the resulting dataset also contains many involuntary eye dynamics, including microsaccades and ocular tremors, thereby offering a comprehensive profile of ocular motion behavior.

Figure 7: Illustrative depictions of ocular motion trajectories and the associated gaze coordinates on the visual field.

Implementation details.  1) Local expert network. We set the input frame and event voxel size to $224\times 224$. We directly adopt the first two layers of the base vision transformer (ViT-B). We set the mini-batch size to 80 and adopt the AdamW optimizer for its efficacy in handling sparse gradients and its weight decay regularization. Training starts with a learning rate of 0.0001, which is decayed following a cosine annealing schedule down to 0.1 times the original learning rate. The number of training epochs is set to 350 so that the local expert networks adequately learn the patterns within the data. 2) Distilled student network. In the subsequent stage, the five local expert networks are distilled into the student network, again using the AdamW optimizer. The student network consists of the first two layers of ViT-B. This phase uses a reduced learning rate of $10^{-5}$ and a momentum coefficient of 0.9, and runs for 500 epochs. Due to the computationally intensive nature of knowledge distillation, the mini-batch size is reduced to 4 to accommodate the increased model complexity and ensure stable convergence. The code is implemented in PyTorch and trained on RTX 3090 GPUs.
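The learning-rate schedule of the local experts can be sketched as follows. This is a minimal illustration assuming the common half-cosine form with one decay cycle over all 350 epochs; the function name `cosine_lr` and the per-epoch update granularity are assumptions, not details given in the text.

```python
import math

def cosine_lr(epoch, total_epochs=350, base_lr=1e-4, final_ratio=0.1):
    """Cosine annealing from base_lr down to final_ratio * base_lr."""
    min_lr = base_lr * final_ratio
    progress = min(max(epoch, 0), total_epochs) / total_epochs  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0)      # learning rate at the first epoch
lr_mid = cosine_lr(175)      # halfway through training
lr_end = cosine_lr(350)      # learning rate at the last epoch
```

Under these assumptions the rate starts at $10^{-4}$, passes through the midpoint $5.5\times 10^{-5}$, and ends at $10^{-5}$.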

Inference.  Inference differs from training. During training, latent denoising is required to convert the feature maps into a distribution before distilling their knowledge into the student model. During inference, we only use the MLPs to determine which sub-region the current state belongs to and feed the fused input directly into the student network without latent denoising, since the student model already possesses complete inference capabilities, as shown by the pink arrow in Fig. 2.

Metrics. Following previous works [82], we employed the Mean Angle Error (MAE) to evaluate the performance quantitatively, computed as

$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\arccos\frac{\langle\overrightarrow{p}_{i},\overrightarrow{t}_{i}\rangle}{\|\overrightarrow{p}_{i}\|\|\overrightarrow{t}_{i}\|}, \tag{11}$$

where $N$ is the number of samples in the dataset, $\overrightarrow{p}_{i}$ is the predicted gaze vector for the $i$-th sample, $\overrightarrow{t}_{i}$ is the corresponding ground-truth vector, and $\langle\cdot,\cdot\rangle$ computes the inner product of two input vectors.
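The metric in Eq. (11) can be computed with a few lines of standard-library Python. The sketch below reports the angle in degrees, which is an assumption based on the degree values quoted in the ablation study; the function name and the clamping of the cosine (to guard against rounding slightly outside $[-1,1]$) are illustrative choices.

```python
import math

def mean_angle_error(preds, targets):
    """Mean angle (in degrees) between predicted and ground-truth gaze vectors."""
    total = 0.0
    for p, t in zip(preds, targets):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
        total += math.degrees(math.acos(cos))
    return total / len(preds)

preds = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
targets = [(0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]
mae = mean_angle_error(preds, targets)
```

The first pair is orthogonal and the second is parallel, so the two angles are 90 and 0 degrees; the vector lengths do not matter because the cosine is normalized.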

IV-B Comparison with State-of-the-Art Methods

We compared our method with the following four methods:

  • S-T GE [83] leverages temporal sequences of eye images to enhance the accuracy of an end-to-end appearance-based deep-learning model for gaze estimation.

  • Dilated-Net [69] integrates dilated convolutional layers to enhance feature extraction for gaze estimation.

  • EventGT [19] achieves an equilibrium of computational efficiency and accuracy—a key consideration for real-time gaze tracking.

  • HE-Tracker [82] employs a sophisticated pipeline that commences with the E-Tracker’s encoding of eye imagery.

The quantitative results are presented in Table I, where it can be seen that our method achieves superior performance, reducing the mean angle error by nearly $50\%$ and boosting tracking accuracy by $15\%$ compared to recent state-of-the-art methods, confirming the effectiveness of our framework. We also refer readers to the video demo for more visual results.

TABLE I: Quantitative results of different methods. $\downarrow$ (resp. $\uparrow$) indicates the smaller (resp. larger), the better.

| Method | S-T GE [83] | Dilated-Net [69] | EventGT [19] | HE-Tracker [82] | Ours |
|---|---|---|---|---|---|
| MAE $\downarrow$ | 7.650 | 4.020 | 3.000 | 4.170 | 1.928 |
| Accuracy $\uparrow$ | 61.88% | 66.63% | 72.06% | 72.87% | 87.67% |
Time (ms) 288.34 562.03 20.65 191.25

IV-C Ablation Study

Data Modality.  We analyzed the impact of different data modalities on gaze estimation performance in Table II. Using only frames resulted in a substantial angular error of $40.16^{\circ}$, significantly higher than the $1.93^{\circ}$ of the two-modality baseline. Conversely, relying solely on event data yielded a poor prediction accuracy of $1.45\%$. These findings highlight that although both frames and event data correlate with the directional vector of gaze, their individual use is insufficient for accurate gaze estimation. The best performance is achieved by fusing frames and event data, underscoring the value of our multimodal approach.

Latent Denoising.  We also conducted experiments to validate the effectiveness of latent denoising. As shown in Table II, incorporating latent denoising enables our method to achieve nearly a $3\%$ improvement in accuracy and a $44.5\%$ reduction in MAE, showcasing the potential of the proposed denoising distillation strategy.

TABLE II: Gaze estimation performance across various data modalities and denoising, where 'F' indicates the frame and 'E' represents the event.

| Metric | F | E | Ours (F+E) | Ours w/o Denoising |
|---|---|---|---|---|
| MAE $\downarrow$ | 40.161 | - | 1.928 | 3.472 |
| Accuracy $\uparrow$ | 53.00% | 1.45% | 87.67% | 84.64% |
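The improvements attributed to latent denoising follow directly from the Table II entries for the full model and the variant without denoising, as this short worked calculation shows (the accuracy gain is read here as a difference in percentage points):

```latex
\frac{3.472-1.928}{3.472}=\frac{1.544}{3.472}\approx 0.445
\quad\text{(a $44.5\%$ reduction in MAE)},
\qquad
87.67\%-84.64\%=3.03\ \text{percentage points}.
```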

Anchor State.  We also conducted a series of ablation studies to assess the necessity of registered anchor states for our gaze estimation model. Without any anchor state, our model yielded an angular error of $32.00^{\circ}$. In contrast, a single registered anchor state reduced the angular error to $15.19^{\circ}$, highlighting the importance of the registered anchor state for gaze prediction. These ablation experiments also confirmed the effectiveness of introducing multiple registered anchor states for tasks involving large gaze movement regions, as mentioned in Sec. III-A. These findings are summarized in Fig. 8 (a).

Figure 8: (a) Performance across different numbers of registered anchor states. (b) Comparative analysis of model performance subject to varying weights assigned to feature map loss.

Weight of Feature Map Loss. Within our model, a feature map loss $\mathcal{L}_{d}$ is employed to guide learning. Since the gradient from the MSE loss (feature distillation) is typically smaller than that from the KL-divergence (task loss), we assign a large factor to the distillation loss $\mathcal{L}_{d}$ to balance these terms. As shown in Fig. 8 (b), the preferred weight of the distillation loss is quite large. Such an increase is necessary to strengthen the network’s learning capacity, thereby enhancing the precision of gaze estimation.

Number of Distillation Samples. As shown in Eq. (7), we expect the network to learn the noise-free samples from multiple noised measurements. Thus, enlarging the batch size is necessary for the solution of Eq. (7) to converge to the expectation. We investigated the impact of varying the number of distillation samples (i.e., the number of summation samples in Eq. (7)) among 4, 8, and 16. The results in Table III indicate a positive correlation between the number of distillation samples and the accuracy of gaze estimation, together with an observable decrease in the MAE. This indicates that too few samples may reduce the model’s predictive capability, since the gradients of Eq. (7) in small batches may lead to deteriorated model weight distributions. In other words, the results demonstrate that this approach effectively reduces the over-fitting issue.

TABLE III: Gaze estimation performance across various numbers of reconstruction samples. The adopted setting is 16 samples.

| Metric | 4 Samples | 8 Samples | 16 Samples |
|---|---|---|---|
| MAE $\downarrow$ | 3.987 | 2.684 | 1.928 |
| Accuracy $\uparrow$ | 82.01% | 85.48% | 87.67% |

IV-D Aligning Continuous Location Prediction into Pre-trained Model

An essential requirement of eye tracking technology is the capability to dynamically estimate the point of gaze. After distilling local experts into a comprehensive model, our system can provide a rough estimate of the gaze locus with low resolution. Building upon the foundation of a pre-trained model, we generate accurate three-dimensional coordinates for the free gaze point [84, 85, 86]. To produce the actual location of the gaze intersection with the screen, we have designed a branch for gaze projection coordinates, as depicted in Fig. 9. Specifically, we have adapted the final output layer, traditionally responsible for generating class labels, to directly predict the two-dimensional coordinates of the gaze point on the screen [87, 88]. Concurrently, we changed the optimization objective of the model to an alternative loss function, designated as $\mathcal{L}_{c}$, to fine-tune the pre-trained model parameters. The MSE loss function was selected to guide the training of the gaze position regression model, as it directly measures the discrepancy between the predicted and actual gaze coordinates. This modification is intended to enhance the model’s precision in capturing the subtleties of gaze behavior.

Figure 9: Integration of a continuous location prediction network for eye tracking within a pre-trained model via fully connected layer units.
$$\mathcal{L}_{c}=\|\hat{\mathbf{P}}-\mathbf{P}\|_{2}^{2}, \qquad (12)$$

where $\hat{\mathbf{P}}\in\mathbb{R}^{2}$ and $\mathbf{P}\in\mathbb{R}^{2}$ represent the gaze point predicted by the model and the ground-truth gaze point, respectively.
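
As a minimal sketch of this coordinate branch, the snippet below replaces a classification output with a two-unit linear head and fine-tunes it with the squared L2 loss of Eq. (12). It is illustrative only: the feature vector, the learning rate, and the names `linear_head`, `coordinate_loss`, and `sgd_step` are assumptions, and the actual model fine-tunes a pre-trained event-frame network rather than a single linear layer.

```python
def linear_head(features, weights, bias):
    """Map a feature vector to 2-D screen coordinates (x, y).

    weights has two rows (one per coordinate), each of len(features).
    """
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def coordinate_loss(pred, target):
    """Squared L2 distance between predicted and true gaze points."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def sgd_step(features, target, weights, bias, lr):
    """One gradient-descent step on the loss w.r.t. the head parameters."""
    pred = linear_head(features, weights, bias)
    # Gradient of ||pred - target||^2 with respect to pred is 2 * (pred - target).
    grad_out = [2.0 * (p - t) for p, t in zip(pred, target)]
    new_weights = [
        [w - lr * g * f for w, f in zip(row, features)]
        for row, g in zip(weights, grad_out)
    ]
    new_bias = [b - lr * g for b, g in zip(bias, grad_out)]
    return new_weights, new_bias


if __name__ == "__main__":
    features = [1.0, 0.5, -0.2]   # hypothetical backbone features
    target = [0.3, -0.7]          # hypothetical ground-truth gaze point
    weights = [[0.0] * 3, [0.0] * 3]
    bias = [0.0, 0.0]
    for _ in range(20):
        weights, bias = sgd_step(features, target, weights, bias, lr=0.1)
    print(coordinate_loss(linear_head(features, weights, bias), target))
```

Predicting continuous coordinates with a regression loss, rather than picking a class, lets the output vary smoothly between the coarse anchor locations.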

This modification enables the system to translate the abstract understanding of where a person is looking into a concrete set of screen coordinates, facilitating applications that require precise tracking of the user’s point of gaze. Next, our system establishes a spatial coordinate system with the human eye as the origin in three-dimensional space. Subsequently, using randomly captured near-eye frame and event data, our system outputs continuous location predictions. The quantitative results of this assessment are systematically presented in Table IV. We also refer reviewers to the video demo for more results.

TABLE IV: Quantitative results of our method for continuous location prediction. ↓ indicates that smaller is better.
Method   S-T GE [83]   Dilated-Net [69]   EventGT [19]   HE-Tracker [82]   Ours
MAE ↓    6.980         3.589              3.900          3.655             3.184

V Conclusion and Discussion

We have presented a novel coarse-to-fine dual-stage model for gaze estimation that leverages frame data and event data, utilizing anchor states to enhance precision. Technically, we employed an event-frame transformer as the backbone and introduced a global-local latent denoising knowledge distillation scheme to effectively merge the unique attributes of frame and event data. Our extensive experiments confirm the model’s capability to tackle challenges in multimodal data fusion and to reduce overfitting. Our approach achieves reliable gaze estimation, maintaining an angular error below $2^{\circ}$. This outperforms various contemporary state-of-the-art gaze estimation methods, setting a new standard for this intricate task.

Despite the substantial superiority of the proposed method over state-of-the-art approaches, additional considerations must be addressed for further advancement. Developing lightweight neural networks to optimize the inference process is essential, which can be accomplished through techniques such as distillation or neural network pruning. Furthermore, there is scope for enhancing accuracy to the retina level (below $1^{\circ}$), akin to the exceptional capabilities demonstrated by Apple Vision Pro and HTC Vive.

References

  • [1] K. Reiter, K. Pfeuffer, A. Esteves, T. Mittermeier, and F. Alt, “Look & turn: One-handed and expressive menu interaction by gaze and arm turns in vr,” in Proc. Symposium on Eye Tracking Research and Applications, 2022.
  • [2] M. Choi, D. Sakamoto, and T. Ono, “Kuiper belt: Utilizing the “out-of-natural angle” region in the eye-gaze interaction for virtual reality,” in Proc. CHI Conference on Human Factors in Computing Systems, 2022.
  • [3] T. Kim, A. Ham, S. Ahn, and G. Lee, “Lattice menu: A low-error gaze-based marking menu utilizing target-assisted gaze gestures on a lattice of visual anchors,” in Proc. CHI Conference on Human Factors in Computing Systems, 2022.
  • [4] M. Kassner, W. Patera, and A. Bulling, “Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction,” in Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 2014.
  • [5] S. Ahn, S. Santosa, M. Parent, D. Wigdor, T. Grossman, and M. Giordano, “Stickypie: A gaze-based, scale-invariant marking menu optimized for ar/vr,” in Proc. CHI Conference on Human Factors in Computing Systems, 2021.
  • [6] T. K. Wee, E. Cuervo, and R. Balan, “Focusvr: Effective & usable vr display power management,” Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 2, no. 3, sep 2018.
  • [7] X. Ma, Z. Yao, Y. Wang, W. Pei, and H. Chen, “Combining brain-computer interface and eye tracking for high-speed text entry in virtual reality,” in Proc. International Conference on Intelligent User Interfaces, 2018.
  • [8] C. Hennessey, B. Noureddin, and P. Lawrence, “A single camera eye-gaze tracking system with free head motion,” in Proc. ACM Symposium on Eye Tracking Research and Applications, 2006.
  • [9] Z. Zhu, Q. Ji, and K. Bennett, “Nonlinear eye gaze mapping function estimation via support vector regression,” in Proc. International Conference on Pattern Recognition, vol. 1, 2006.
  • [10] Wang, Sung, and R. Venkateswarlu, “Eye gaze estimation from a single image of one eye,” in Proc. IEEE International Conference on Computer Vision, 2003.
  • [11] E. Wood and A. Bulling, “Eyetab: Model-based gaze estimation on unmodified tablet computers,” in Proc. ACM Symposium on Eye Tracking Research and Applications, 2014.
  • [12] C. Lu, P. Chakravarthula, K. Liu, X. Liu, S. Li, and H. Fuchs, “Neural 3d gaze: 3d pupil localization and gaze tracking based on anatomical eye model and neural refraction correction,” in Proc. IEEE International Symposium on Mixed and Augmented Reality, 2022.
  • [13] R. Ranjan, S. De Mello, and J. Kautz, “Light-weight head pose invariant gaze tracking,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2018.
  • [14] S. Baluja and D. Pomerleau, “Non-intrusive gaze tracking using artificial neural networks,” in Proc. Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector, Eds., vol. 6, 1993.
  • [15] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba, “Eye tracking for everyone,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [16] Y. Cheng and F. Lu, “Dvgaze: Dual-view gaze estimation,” in Proc. IEEE/CVF International Conference on Computer Vision, 2023.
  • [17] K. Roy and D. Chanda, “A robust webcam-based eye gaze estimation system for human-computer interaction,” in Proc. International Conference on Innovations in Science, Engineering and Technology, 2022.
  • [18] M. N. Lystbaek, K. Pfeuffer, J. E. S. Gronbaek, and H. Gellersen, “Exploring gaze for assisting freehand selection-based text entry in ar,” Proc. ACM Hum.-Comput. Interact., vol. 6, no. ETRA, may 2022.
  • [19] A. N. Angelopoulos, J. N. Martel, A. P. Kohli, J. Conradt, and G. Wetzstein, “Event-based near-eye gaze tracking beyond 10,000 hz,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 5, 2021.
  • [20] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, Asynchronous, Photometric Feature Tracking Using Events and Frames.   Springer International Publishing, 2018.
  • [21] H. Liu, D. P. Moeys, G. Das, D. Neil, S.-C. Liu, and T. Delbrück, “Combined frame and event based detection and tracking,” in Proc. IEEE International Symposium on Circuits and Systems, 2016.
  • [22] M. Mokatren, T. Kuflik, and I. Shimshoni, “3d gaze estimation using rgb-ir cameras,” Sensors, vol. 23, no. 1, 2023.
  • [23] Y. Cheng and F. Lu, “Gaze estimation using transformer,” in Proc. International Conference on Pattern Recognition, 2022.
  • [24] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, 2015.
  • [25] Y. Feng, N. Goulding-Hotta, A. Khan, H. Reyserhove, and Y. Zhu, “Real-time gaze tracking with event-driven eye segmentation,” in Proc. IEEE Conference on Virtual Reality and 3D User Interfaces, 2022.
  • [26] Y. Lei, S. He, M. Khamis, and J. Ye, “An end-to-end review of gaze estimation and its interactive applications on handheld mobile devices,” ACM Comput. Surv., vol. 56, no. 2, sep 2023.
  • [27] M. F. Ansari, P. Kasprowski, and P. Peer, “Person-specific gaze estimation from low-quality webcam images,” Sensors, vol. 23, no. 8, 2023.
  • [28] G. Zhao, Y. Yang, J. Liu, N. Chen, Y. Shen, H. Wen, and G. Lan, “EV-eye: Rethinking high-frequency eye tracking through the lenses of event cameras,” in Proc. Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  • [29] M. A. Mahowald, An Analog VLSI System for Stereoscopic Vision.   USA: Kluwer Academic Publishers, 1994.
  • [30] C. A. Mead and M. Mahowald, “A silicon model of early visual processing,” Neural Networks, vol. 1, no. 1, 1988.
  • [31] T. Delbruck, “Silicon retina with correlation-based, velocity-tuned pixels,” IEEE Transactions on Neural Networks, vol. 4, no. 3, 1993.
  • [32] T. Delbruck, B. Linares-Barranco, E. Culurciello, and C. Posch, “Activity-driven, event-based vision sensors,” in Proc. IEEE International Symposium on Circuits and Systems, 2010.
  • [33] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 × 128 120 db 15 μs latency asynchronous temporal contrast vision sensor,” IEEE Journal of Solid State Circuits, vol. 43, 03 2008.
  • [34] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck, “Retinomorphic event-based vision sensors: Bioinspired cameras with spiking output,” Proceedings of the IEEE, vol. 102, no. 10, 2014.
  • [35] A. Mitrokhin, C. Fermuller, C. Parameshwara, and Y. Aloimonos, “Event-based moving object detection and tracking,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2018.
  • [36] H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Real-time visual-inertial odometry for event cameras using keyframe-based nonlinear optimization,” in Proc. British Machine Vision Conference, 2017.
  • [37] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, “The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and slam,” The International Journal of Robotics Research, vol. 36, no. 2, Feb. 2017.
  • [38] J. N. P. Martel, J. Müller, J. Conradt, and Y. Sandamirskaya, “An active approach to solving the stereo matching problem using event-based sensors,” in Proc. IEEE International Symposium on Circuits and Systems, 2018.
  • [39] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, 2007.
  • [40] D. Weikersdorfer, D. B. Adrian, D. Cremers, and J. Conradt, “Event-based 3d slam with a depth-augmented dynamic vision sensor,” in Proc. IEEE International Conference on Robotics and Automation, 2014.
  • [41] J. Lee, P. Park, C.-W. Shin, H. Ryu, B.-C. Kang, and T. Delbruck, “Touchless hand gesture ui with instantaneous responses,” in Proc. International Conference on Image Processing, 09 2012.
  • [42] Z. Zhu, J. Hou, and D. O. Wu, “Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers,” in Proc. IEEE/CVF International Conference on Computer Vision, 2023.
  • [43] Z. Zhu, J. Hou, and X. Lyu, “Learning graph-embedded key-event back-tracing for object tracking in event clouds,” Advances in Neural Information Processing Systems, vol. 35, 2022.
  • [44] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” IEEE Transactions on Cybernetics, 2023.
  • [45] C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, J. Zhang, Y. Wang, and Y. Tian, “Revisiting color-event based tracking: A unified network, dataset, and metric,” arXiv preprint arXiv:2211.11010, 2022.
  • [46] T. Delbruck and P. Lichtsteiner, “Fast sensory motor control based on event-based hybrid neuromorphic-procedural system,” in Proc. IEEE International Symposium on Circuits and Systems, 2007.
  • [47] T. Delbruck and M. Lang, “Robotic goalie with 3ms reaction time at 4% cpu load using event-based dynamic vision sensor,” Frontiers in Neuroscience, vol. 7, 2013.
  • [48] M. Litzenberger, B. Kohn, A. Belbachir, N. Donath, G. Gritsch, H. Garn, C. Posch, and S. Schraml, “Estimation of vehicle speed based on asynchronous data from a silicon retina optical sensor,” in Proc. IEEE Intelligent Transportation Systems Conference, 2006.
  • [49] M. Litzenberger, C. Posch, D. Bauer, A. Belbachir, P. Schon, B. Kohn, and H. Garn, “Embedded vision system for real-time object tracking using an asynchronous transient vision sensor,” in Proc. IEEE Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop, 2006.
  • [50] X. Lagorce, C. Meyer, S.-H. Ieng, D. Filliat, and R. Benosman, “Asynchronous event-based multikernel algorithm for high-speed visual features tracking,” IEEE transactions on neural networks and learning systems, vol. 26, 09 2014.
  • [51] J. Conradt, M. Cook, R. Berner, P. Lichtsteiner, R. Douglas, and T. Delbruck, “A pencil balancing robot using a pair of aer dynamic vision sensors,” in Proc. IEEE International Symposium on Circuits and Systems, 06 2009.
  • [52] Z. Ni, A. Bolopion, J. Agnus, R. Benosman, and S. Regnier, “Asynchronous event-based visual shape tracking for stable haptic feedback in microrobotics,” IEEE Transactions on Robotics, vol. 28, no. 5, 2012.
  • [53] Z. Ni, S.-H. Ieng, C. Posch, S. Régnier, and R. Benosman, “Visual tracking using neuromorphic asynchronous event-based cameras,” Neural Computation, vol. 27, no. 4, 04 2015.
  • [54] A. Glover and C. Bartolozzi, “Robust visual tracking with a freely-moving event camera,” in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
  • [55] D. Reverter Valeiras, X. Lagorce, X. Clady, C. Bartolozzi, S.-H. Ieng, and R. Benosman, “An asynchronous neuromorphic event-driven visual part-based shape tracking,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, 2015.
  • [56] N. Li, M. Chang, and A. Raychowdhury, “E-gaze: Gaze estimation with event camera,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [57] Q. Chen, Z. Wang, S.-C. Liu, and C. Gao, “3et: Efficient event-based eye tracking using a change-based convlstm network,” in Proc. IEEE Biomedical Circuits and Systems Conference, 2023.
  • [58] T. Stoffregen, H. Daraei, C. Robinson, and A. Fix, “Event-based kilohertz eye tracking using coded differential lighting,” in Proc. IEEE/CVF Winter Conference on Applications of Computer Vision, 2022.
  • [59] X. Wang, J. Huang, S. Wang, C. Tang, B. Jiang, Y. Tian, J. Tang, and B. Luo, “Long-term frame-event visual tracking: Benchmark dataset and baseline,” 2024.
  • [60] C. Ryan, B. O’Sullivan, A. Elrasad, A. Cahill, J. Lemley, P. Kielty, C. Posch, and E. Perot, “Real-time face & eye tracking and blink detection using event cameras,” Neural Networks, vol. 141, 2021.
  • [61] L. R. Young and D. Sheena, “Survey of eye movement recording methods,” Behavior research methods and instrumentation, vol. 7, no. 5, 1975.
  • [62] T. N. Cornsweet and H. D. Crane, “Accurate two-dimensional eye tracker using first and fourth purkinje images,” Journal of the Optical Society of America, vol. 63, no. 8, 1973.
  • [63] H. D. Crane and C. M. Steele, “Generation-v dual-purkinje-image eyetracker,” Applied optics, vol. 24, no. 4, 1985.
  • [64] Y. Li, S. Wang, and X. Ding, “Eye/eyes tracking based on a unified deformable template and particle filtering,” Pattern Recognition Letters, vol. 31, no. 11, 2010.
  • [65] Y. Tian, T. Kanade, and J. Cohn, “Dual-state parametric eye tracking,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, March 2000.
  • [66] K. Wang and Q. Ji, “Real time eye gaze tracking with 3d deformable eye-face model,” in Proc. IEEE International Conference on Computer Vision, 2017.
  • [67] C. H. Morimoto and M. R. Mimica, “Eye gaze tracking techniques for interactive applications,” Computer vision and image understanding, vol. 98, no. 1, 2005.
  • [68] T. A. Duchowski, Eye tracking: methodology theory and practice.   Springer, 2017.
  • [69] Z. Chen and B. E. Shi, “Appearance-based gaze estimation using dilated-convolutions,” in Proc. Asian Conference on Computer Vision, 2019.
  • [70] J. Bao, B. Liu, and J. Yu, “An individual-difference-aware model for cross-person gaze estimation,” IEEE Transactions on Image Processing, vol. 31, 2022.
  • [71] W. Shin, H. Park, S.-P. Kim, and S. Sul, “Individual differences in gaze-cuing effect are associated with facial emotion recognition and social conformity,” Frontiers in Psychology, vol. 14, 2023.
  • [72] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [73] X. Wang, S. Wang, C. Tang, L. Zhu, B. Jiang, Y. Tian, and J. Tang, “Event stream-based visual object tracking: A high-resolution benchmark dataset and a novel baseline,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2024.
  • [74] D. Lopez-Paz, L. Bottou, B. Scholkopf, and V. N. Vapnik, “Unifying distillation and privileged information,” CoRR, vol. abs/1511.03643, 2015.
  • [75] L. Xiang, G. Ding, and J. Han, Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-Tailed Classification, 10 2020.
  • [76] Q. Guo, X. Wang, Y. Wu, Z. Yu, D. Liang, X. Hu, and P. Luo, “Online knowledge distillation via collaborative learning,” in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
  • [77] L. Bicsi, B. Alexe, R. T. Ionescu, and M. Leordeanu, “JEDI: joint expert distillation in a semi-supervised multi-dataset student-teacher scenario for video action recognition,” in Proc. IEEE/CVF International Conference on Computer Vision, 2023.
  • [78] N. Shen, T. Xu, S. Huang, F. Mu, and J. Li, “Expert-guided knowledge distillation for semi-supervised vessel segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 11, 2023.
  • [79] A. Sochopoulos, I. Mademlis, E. Charalampakis, S. Papadopoulos, and I. Pitas, “Deep reinforcement learning with semi-expert distillation for autonomous uav cinematography,” in Proc. IEEE International Conference on Multimedia and Expo, jul 2023.
  • [80] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [81] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, 2020.
  • [82] L. Chen, Y. Li, X. Bai, X. Wang, Y. Hu, M. Song, L. Xie, Y. Yan, and E. Yin, “Real-time gaze tracking with head-eye coordination for head-mounted displays,” in Proc. IEEE International Symposium on Mixed and Augmented Reality, 2022.
  • [83] C. Palmero Cantarino, O. V. Komogortsev, and S. S. Talathi, “Benefits of temporal information for appearance-based gaze estimation,” in Proc. ACM Symposium on Eye Tracking Research and Applications, Jun. 2020.
  • [84] A. T. Duchowski, K. Krejtz, M. Volonte, C. J. Hughes, M. Brescia-Zapata, and P. Orero, “3d gaze in virtual reality: Vergence, calibration, event detection,” Procedia Computer Science, vol. 207, 2022, Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 26th International Conference KES2022.
  • [85] K. Wang and Q. Ji, “3d gaze estimation without explicit personal calibration,” Pattern Recognition, vol. 79, 2018.
  • [86] M. Mansouryar, J. Steil, Y. Sugano, and A. Bulling, “3d gaze estimation from 2d pupil positions on monocular head-mounted eye trackers,” in Proc. ACM Symposium on Eye Tracking Research and Applications, Mar. 2016.
  • [87] C. Elmadjian, P. Shukla, A. D. Tula, and C. H. Morimoto, “3d gaze estimation in the scene volume with a head-mounted eye tracker,” in Proc. ACM Workshop on Communication by Gaze Interaction, 2018.
  • [88] Y. Man, X. Zhao, and K. Zhang, “3D gaze estimation based on facial feature tracking,” in Proc. International Conference on Graphic and Image Processing, Z. Zhu, Ed., vol. 8768, 2013.