Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

Zidong Cao1, Zhan Wang1, Yexin Liu1, Yan-Pei Cao3, Ying Shan3, Wei Zeng1,2, and Lin Wang1,2∗
1HKUST(GZ), 2HKUST, 3ARC Lab, Tencent PCG. ∗Corresponding author.
Abstract

Viewing omnidirectional images (ODIs) in virtual reality (VR) represents a novel form of media that provides immersive experiences for users to navigate and interact with digital content. Nonetheless, this sense of immersion can be greatly compromised by a blur effect that masks details and hampers the user’s ability to engage with objects of interest. In this paper, we present a novel system, called OmniVR, designed to enhance visual clarity during VR navigation. Our system enables users to effortlessly locate and zoom in on the objects of interest in VR. It captures user commands for navigation and zoom, converting these inputs into parameters for the Möbius transformation matrix. Leveraging these parameters, the ODI is refined using a learning-based algorithm. The resultant ODI is presented within the VR media, effectively reducing blur and increasing user engagement. To verify the effectiveness of our system, we first compare our algorithm with state-of-the-art methods on public datasets, where it achieves the best performance. Furthermore, we undertake a comprehensive user study to evaluate viewer experiences across diverse scenarios and to gather qualitative feedback from multiple perspectives. The outcomes reveal that our system enhances user engagement by improving the viewers’ recognition, reducing discomfort, and improving the overall immersive experience. Our system makes navigation and zoom more user-friendly. For more details, e.g., demos, please refer to the project page http://vlislab22.github.io/OmniVR/.

Index Terms:
Virtual reality, Image Processing and Computer Vision, Human-computer interaction

I Introduction

Figure 1: Comparing the VR experience under the baseline condition and our OmniVR. Users can freely navigate and zoom in/out to see the object of interest. With our proposed algorithm, the objects can be refined with clear textural details, thus enhancing the engagement and immersive experience.

Omnidirectional images (ODIs), also called $360^{\circ}$ images, have increasingly attracted interest for their capability to capture extensive content within a single frame. The recent surge in integrating such visual content into virtual reality (VR) environments is noteworthy [1, 2, 3]. This integration represents a novel form of media that allows users to navigate freely in any direction, offering an immersive and interactive experience akin to being in a real environment [4, 5, 6]. This immersive media has led to various applications, including but not limited to virtual tours [7, 8], real estate showcases [9], educational tools [10], and remote meeting solutions [3]. Moreover, the rapid advancement and wide availability of consumer-level VR devices [11] have made these experiences increasingly accessible to a broader audience. In this context, viewers transform from mere observers to active participants, who can navigate and zoom into objects of interest, thereby making significant progress in media consumption and user experience [11].

The ODIs are usually stored with the equirectangular projection (ERP) type and displayed in VR with perspective projection. A critical issue with ODIs is their relatively low angular resolution [12], which results in local regions appearing blurry. This blurriness intensifies when the images are zoomed in and navigated (see Fig. 1 (top)), potentially compromising the immersive experience by reducing local and fine details [13]. Consequently, this not only impedes user engagement with objects of interest but also detracts from the overall immersive experience, potentially causing mental and physical discomfort [14]. Several methods have been explored to mitigate this discomfort in VR navigation, including the incorporation of spatial blur [15], defocus blur [16], depth-of-field blur [17], and foveated rendering [18]. However, these methods fall short of enhancing the clear textural details of objects. Despite the freedom to navigate and zoom in/out on objects, the enlarged objects remain blurry.

In this paper, we introduce a novel system, dubbed OmniVR, that lets viewers navigate and zoom in/out in the VR space effortlessly while simultaneously enhancing the visual quality to recover clear local details, as illustrated in Fig. 1 (bottom). As shown in Fig. 2, the viewer first views the original ODI displayed in a VR headset. With OmniVR, the viewer is free to navigate and find objects of interest. Then, the viewer can use the headset and controller to issue rotation and zoom in/out commands. Our system captures these user commands and converts them into parameters for the Möbius transformation matrix (Sec. III-B). Leveraging these parameters, we propose a learning-based algorithm, built on our conference work OmniZoomer [19], to achieve high-quality ODIs after transformation with two key techniques. First, OmniVR integrates the Möbius transformation into the network, enabling free navigation and zoom within ODIs. By learning transformed feature maps in various conditions, the network is enhanced to handle the increasing curves caused by navigation and zoom, thus alleviating the blurry effect (Sec. III-C). Second, we propose enhancing the feature maps to high-resolution (HR) space before the transformation. The HR feature maps contain more fine-grained textural details, which could compensate for the lack of pixels for describing curves (Sec. III-C1). After obtaining the HR feature maps, we also propose a spatial index generation module (Sec. III-C2) and a spherical resampling module (Sec. III-C3) to accomplish the feature transformation process. Finally, these feature maps are processed with a decoder to output the high-quality transformed ODI in ERP format. The ERP output is then transformed to the perspective format to be displayed in VR, effectively reducing blur and increasing user engagement.

For supervised learning, we create a dataset based on the ODI-SR dataset [20], called the ODIM dataset, including transformed ODIs under various Möbius transformations. We evaluate the effectiveness of OmniVR on the ODIM dataset under various Möbius transformations and up-sampling factors. Furthermore, we report the results of a user study for the VR experience to evaluate the effectiveness of our proposed system in quantitative and qualitative ways. Quantitatively, we record accuracy, response time, and confidence level in a series of scenarios and questions. Qualitatively, we conduct interviews about the subjects’ feelings, such as mental and physical costs, and immersive experience, etc. The results demonstrate that: 1) OmniVR is beneficial for participants to improve the recognition and understanding of the scenarios. 2) OmniVR can reduce the discomfort of participants. 3) OmniVR can significantly improve the immersive experience, and make navigation and zoom in the VR media more user-friendly.

The main contributions of this paper can be summarized as follows: (I) We propose a novel system OmniVR to enhance the visual clarity during VR navigation; (II) We propose a learning-based algorithm to enhance the ODI quality controlled by user commands; (III) We establish the ODIM dataset for supervised training. Compared with existing methods, OmniVR achieves state-of-the-art performance under various user commands and up-sampling factors; (IV) We conduct a user study, demonstrating the effectiveness of our system in both quantitative and qualitative ways.

Figure 2: System overview. Our system collects the user commands for navigation and zooming in/out, which are converted to the parameters of the Möbius transformation matrix. The parameters together with the ODI are processed with a learning-based algorithm to generate high-quality transformed ODIs, which can be displayed with the perspective format in VR.

II Related Works

Navigation in VR. VR is an emerging medium that provides more immersive and interactive experiences compared with traditional media [4, 5, 6]. Based on this immersive experience, there are already various applications, spanning virtual tourism [7], education [10], and entertainment [3]. High spatial resolution is important for the user experience because the blur caused by low spatial resolution reduces viewer engagement and might further cause discomfort. To alleviate the discomfort from the blur, current methods have proposed a series of techniques, including spatial blur [15], defocus blur [16], depth-of-field blur [17], and foveated rendering [18]. However, these methods mainly focus on the blurry regions but do not improve the local details of objects. As a result, if the viewer finds an object of interest, the object remains blurry with no additional detail, whatever operations the viewer tries.

ODI Super-Resolution. [21] and [22] take distortion maps as additional input to alleviate the distortions. LAU-Net [20] splits an ODI into several bands along the latitude because ODIs in different latitudes present different distortions. SphereSR [12] proposes to reconstruct an ODI with arbitrary projection formats, including ERP, cube maps, and perspective formats. Recently, OSRT [23] and OPDN [24] employ transformers to construct global context and obtain good performance. However, their outputs are restricted to ERP format with no transformation.

Möbius Transformation. The Möbius transformation has been used for straight line rectification [25, 26, 27], and for rotation and zoom [28]. Recently, [29] employs the Möbius transformation to transform feature maps into different forms to enhance learning robustness. However, [29] does not explore generating transformed ODIs with high quality. Beyond ODI applications, the Möbius transformation is widely applied in data augmentation [30], activation functions [31, 32], pose estimation [33], and convolutions [34]. In this paper, we propose a learning-based algorithm, built on our conference version [19], to improve the textural details of ODIs when navigating and zooming in to an object of interest in VR.

III OmniVR System

III-A System Overview

As shown in Fig. 2, we design a system, namely OmniVR, to help users effortlessly navigate and zoom in the VR media, aiming to enhance the visual quality and subsequently improve the immersion and interaction experience. First, our system displays an original ODI for the viewer in the VR media. The viewer can freely navigate the scenario and find the object of interest. Having found it, the viewer might feel that the object is too small or not in the center of the field of view (FoV). Our system allows viewers to send commands through the VR headset and controllers. Then, the user command is utilized to generate the parameters of the Möbius transformation matrix. Leveraging these parameters, we propose a learning-based algorithm (see Fig. 4), built on our conference version [19], to transform the original ODI with high quality. In the end, the transformed ODI is displayed in VR to provide finer details for the viewer. Our system supports various projection formats for the transformed ODI to adapt to the visual contents in VR. Below, we describe the user command, algorithm, and view transformation of our system in detail.

Figure 3: The rotation of the VR headset generates horizontal and vertical angle parameters, while the trigger button of the right controller is used for generating zoom level parameters.

III-B User Command and Parameter Conversion

Our system first collects data about navigation and zoom operations through the VR headset and controllers. The rotation angle of the headset represents the navigation direction, and the trigger of the right controller controls the zoom in/out operation, as shown in Fig. 3. Specifically, the zoom operation is achieved by setting the UI button with arrow patterns, such as up arrows (zoom in), down arrows (zoom out), and left/right/circulation arrows (scene transition), as shown in Fig. 7. Once the raycast emitted from the controller touches the regions of UI buttons, the command is sent by clicking the trigger. Then, the collected commands are summarized with three parameters, which together form the user command: zoom level $s$, horizontal rotation angle $\beta$, and vertical rotation angle $\gamma$. Next, the user command is converted to the parameters $\{a, b, c, d\}$ of the Möbius transformation matrix. We choose the Möbius transformation because it is the only bijective transformation on the sphere that preserves shape. Specifically, when performing horizontal rotation with angle $\beta$, the parameters of the Möbius transformation can be represented as follows:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} \cos(\beta) + j\sin(\beta) & 0 \\ 0 & 1 \end{pmatrix}. \tag{1}$$

Similarly, for vertical rotation with angle $\gamma$, the parameters of the Möbius transformation can be represented as follows:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} \cos(\frac{\gamma}{2}) & \sin(\frac{\gamma}{2}) \\ -\sin(\frac{\gamma}{2}) & \cos(\frac{\gamma}{2}) \end{pmatrix}. \tag{2}$$

An arbitrary rotation can be divided into horizontal rotation and vertical rotation. In addition, Möbius transformations can be composed to give a new Möbius transformation. Therefore, we can achieve arbitrary navigation on ODIs with horizontal rotation angle $\beta$ and vertical rotation angle $\gamma$.

For zoom with level $s$, if the pole is the North pole, the parameters of the Möbius transformation can be written as follows:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} s & 0 \\ 0 & 1 \end{pmatrix}. \tag{3}$$
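The parameter conversion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, angles are assumed to be in radians, and the order in which the three matrices are composed (horizontal rotation, then vertical rotation, then zoom) is an assumption, since the text does not fix an order.

```python
import cmath
import math


def horizontal_rotation(beta):
    """Matrix of Eq. (1): a = cos(beta) + j*sin(beta), b = c = 0, d = 1."""
    return ((cmath.exp(1j * beta), 0j), (0j, 1 + 0j))


def vertical_rotation(gamma):
    """Matrix of Eq. (2), built from the half angle gamma / 2."""
    c, s = math.cos(gamma / 2), math.sin(gamma / 2)
    return ((complex(c), complex(s)), (complex(-s), complex(c)))


def zoom(s):
    """Matrix of Eq. (3): zoom with level s about the North pole."""
    return ((complex(s), 0j), (0j, 1 + 0j))


def compose(m2, m1):
    """2x2 matrix product m2 * m1; as a Moebius map: apply m1 first, then m2."""
    (a2, b2), (c2, d2) = m2
    (a1, b1), (c1, d1) = m1
    return ((a2 * a1 + b2 * c1, a2 * b1 + b2 * d1),
            (c2 * a1 + d2 * c1, c2 * b1 + d2 * d1))


def apply_mobius(m, z):
    """Evaluate (a*z + b) / (c*z + d) for a complex point z."""
    (a, b), (c, d) = m
    return (a * z + b) / (c * z + d)


def user_command_to_matrix(s, beta, gamma):
    """Combine a user command (s, beta, gamma) into one matrix.

    The composition order here is an assumption made for illustration.
    """
    rotation = compose(vertical_rotation(gamma), horizontal_rotation(beta))
    return compose(zoom(s), rotation)
```

Because composing Möbius transformations corresponds to multiplying their 2x2 matrices, a whole user command reduces to a single set of parameters $\{a, b, c, d\}$, which is what the rest of the pipeline consumes.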

III-C The Proposed Algorithm

Figure 4: The overall pipeline of the proposed algorithm. With the spatial index generation module and spherical resampling module, OmniVR can provide viewers with a flexible way to zoom in and out to objects of interest, such as the sculpture.

Overview. As depicted in Fig. 4, we propose a novel algorithm that allows for free navigation to objects of interest and zooming in with preserved shapes and high-quality details, based on our OmniZoomer [19]. Initially, we extract HR feature maps $F_{\text{UP}} \in \mathbb{R}^{H \times W \times C}$ from the input ODI $I_{\text{IN}} \in \mathbb{R}^{h \times w \times 3}$ through an encoder and an up-sampling block (Sec. III-C1). With $F_{\text{UP}}$’s index map $X \in \mathbb{R}^{H \times W \times 2}$ as the input, we propose the spatial index generation module (Sec. III-C2) to apply the Möbius transformation [35] according to the user command on $X$ to generate the transformed spatial index map $Y \in \mathbb{R}^{H \times W \times 2}$. Note that the two channels of $X$ and $Y$ store the longitude and latitude, respectively. Subsequently, we introduce a spherical resampling module (Sec. III-C3) that generates the transformed HR feature maps $F_{\text{M}} \in \mathbb{R}^{H \times W \times C}$ by resampling the pixels on the sphere guided by $Y$. Finally, we decode the transformed feature maps to output the transformed ODI. The decoder consists of three ResBlocks [36] and a convolution layer. We use the same parameters as in the spatial index generation module to transform the HR ground truth ODIs and employ the $L1$ loss for supervision. The proposed algorithm in OmniVR has two main differences from OmniZoomer [19]: 1) To stabilize the convergence during training, we employ a skip connection by applying the Möbius transformation to $I_{\text{IN}}$ and adding the result to the output of the decoder. 2) The proposed algorithm in our OmniVR supports arbitrary spherical projections, i.e., ERP and perspective projection, to meet the demands of viewing ODIs in VR with different FoVs. We now provide detailed descriptions of these components.

III-C1 Feature Extraction

Given an ODI $I_{\text{IN}} \in \mathbb{R}^{h \times w \times 3}$ in ERP format, our initial step involves the use of an encoder composed of several convolution layers. This encoder extracts the feature maps $F_{\text{IN}} \in \mathbb{R}^{h \times w \times C}$. Subsequently, we employ an upsampling block, equipped with multiple pixel-shuffle layers [37], to produce HR feature maps $F_{\text{UP}} \in \mathbb{R}^{H \times W \times C}$. Here, $H = s \cdot h$ and $W = s \cdot w$ represent the spatial dimensions of the up-sampled image, where $s$ denotes the scale factor and $C$ indicates the number of channels. Notably, we apply the Möbius transformation in the HR space to address the aliasing issue. This issue arises due to inadequate pixel representation for accurately describing continuous curves post-transformation, potentially leading to object shape distortion.
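To make the up-sampling step more concrete, the rearrangement performed by a single pixel-shuffle layer can be sketched as follows. This is an illustrative pure-Python version on nested lists in height-width-channel layout; the function name and the channel ordering are assumptions, and the convolution that normally precedes the shuffle to expand the channels is omitted.

```python
def pixel_shuffle(feat, r):
    """Rearrange an h x w x (C*r*r) nested list into (h*r) x (w*r) x C.

    Each group of r*r input channels at pixel (y, x) is spread over an
    r x r block of output pixels, so resolution grows by r per axis.
    """
    h, w, ch = len(feat), len(feat[0]), len(feat[0][0])
    if ch % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = ch // (r * r)
    out = [[[0.0] * c_out for _ in range(w * r)] for _ in range(h * r)]
    for y in range(h):
        for x in range(w):
            for c in range(c_out):
                for i in range(r):
                    for j in range(r):
                        src = c * r * r + i * r + j  # assumed channel ordering
                        out[y * r + i][x * r + j][c] = feat[y][x][src]
    return out
```

Stacking several such layers, each preceded by a channel-expanding convolution, would give the overall scale factor $s$; for example, two layers with $r = 2$ yield $s = 4$.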

III-C2 Spatial Index Generation

We apply the Möbius transformation on the spatial index map $X$ of HR feature maps $F_{\text{UP}}$ and generate the transformed spatial index map $Y$ for the subsequent resampling operation. Möbius transformation is known as the only conformal bijective transformation between the sphere and the complex plane. To apply the Möbius transformation on the HR feature maps $F_{\text{UP}}$, we first use spherical projection (SP) to project the spatial index map $X$ from spherical coordinates $(\theta, \phi)$ (where $\theta$ represents the longitude and $\phi$ represents the latitude) to the Riemann sphere $\mathbb{S}^2 = \{(x, y, z) \in \mathbb{R}^3 \mid x^2 + y^2 + z^2 = 1\}$, formulated as:

$$\text{SP}: \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \cos(\phi)\cos(\theta) \\ \cos(\phi)\sin(\theta) \\ \sin(\phi) \end{pmatrix}. \tag{4}$$

Then, with stereographic projection (STP) [38], we can project a point $(x, y, z)$ of the Riemann sphere $\mathbb{S}^2$ onto the complex plane and obtain the projected point $(x', y')$. Letting the point $(0, 0, 1)$ be the pole, STP can be formulated as:

$$\text{STP}: x' = \frac{x}{1 - z}, \quad y' = \frac{y}{1 - z}. \tag{5}$$

Subsequently, given the projected point $p$ ($Z_p = x' + i y'$) on the complex plane, we can conduct the Möbius transformation with the following formulation:

$$f(Z_p) = \frac{a Z_p + b}{c Z_p + d}, \tag{6}$$

where $a$, $b$, $c$, and $d$ are complex numbers satisfying $ad - bc \neq 0$. Finally, we apply the inverse stereographic projection $\text{STP}^{-1}$ and inverse spherical projection $\text{SP}^{-1}$ to re-project the complex plane into the ERP plane:

$$\begin{split} \text{STP}^{-1}: \begin{pmatrix} x \\ y \\ z \end{pmatrix} &= \begin{pmatrix} \frac{2x'}{1 + x'^2 + y'^2} \\ \frac{2y'}{1 + x'^2 + y'^2} \\ \frac{-1 + x'^2 + y'^2}{1 + x'^2 + y'^2} \end{pmatrix}; \\ \text{SP}^{-1}: \begin{pmatrix} \theta \\ \phi \end{pmatrix} &= \begin{pmatrix} \arctan(y/x) \\ \arcsin(z) \end{pmatrix}. \end{split} \tag{7}$$

Figure 5: The illustration of the proposed spatial index generation module. With the HR feature map as input, the spatial index generation module generates a transformed index map to accomplish the feature transformation process.

In summary, as shown in Fig. 5, we first project the index map $X$ of the input feature $F_{\text{UP}}$ to the complex plane using SP (Eq. 4) and STP (Eq. 5), then conduct the Möbius transformation with Eq. 6, and finally generate the transformed index map $Y$ through the inverse STP and inverse SP (Eq. 7).
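The chain SP, STP, Möbius transformation, inverse STP, inverse SP can be sketched in plain Python as below. This is a minimal sketch, not the authors' module: all function names are hypothetical, the ERP grid uses pixel centers with longitude in $(-\pi, \pi)$ and latitude in $(-\pi/2, \pi/2)$ (an assumed convention), and `atan2` replaces $\arctan(y/x)$ from Eq. 7 so that the quadrant of the longitude is resolved.

```python
import math


def sp(theta, phi):
    """Eq. (4): spherical coordinates -> point on the unit sphere."""
    return (math.cos(phi) * math.cos(theta),
            math.cos(phi) * math.sin(theta),
            math.sin(phi))


def stp(x, y, z):
    """Eq. (5): stereographic projection from the pole (0, 0, 1)."""
    return complex(x / (1.0 - z), y / (1.0 - z))  # undefined exactly at z = 1


def mobius(zp, a, b, c, d):
    """Eq. (6): Moebius transformation on the complex plane."""
    return (a * zp + b) / (c * zp + d)  # undefined where c*zp + d == 0


def stp_inv(w):
    """Eq. (7), first part: complex plane -> unit sphere."""
    r2 = w.real ** 2 + w.imag ** 2
    return (2.0 * w.real / (1.0 + r2),
            2.0 * w.imag / (1.0 + r2),
            (r2 - 1.0) / (1.0 + r2))


def sp_inv(x, y, z):
    """Eq. (7), second part; atan2 is an assumption replacing arctan(y/x)."""
    return math.atan2(y, x), math.asin(max(-1.0, min(1.0, z)))


def transform_index(theta, phi, a, b, c, d):
    """Map one spatial index (theta, phi) through the full chain."""
    return sp_inv(*stp_inv(mobius(stp(*sp(theta, phi)), a, b, c, d)))


def erp_index_map(height, width):
    """Index map X at pixel centers (assumed convention)."""
    return [[((j + 0.5) / width * 2.0 * math.pi - math.pi,
              math.pi / 2.0 - (i + 0.5) / height * math.pi)
             for j in range(width)] for i in range(height)]


def generate_index_map(height, width, a, b, c, d):
    """Transformed index map Y: every entry of X passed through the chain."""
    return [[transform_index(theta, phi, a, b, c, d) for theta, phi in row]
            for row in erp_index_map(height, width)]
```

As a sanity check on the formulas, the zoom matrix of Eq. 3 with $s = 2$ sends a point on the equator ($\phi = 0$, so $|Z_p| = 1$) to $|f(Z_p)| = 2$, which Eq. 7 maps to $z = 3/5$, i.e. the point moves toward the pole $(0, 0, 1)$ while its longitude is unchanged.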

III-C3 Spherical Resampling

Inspired by the inherent spherical representation of ODIs and the spherical conformality of Möbius transformation, we propose the spherical resampling module to generate the transformed feature maps $F_{\text{M}}$. The spherical resampling module resamples on the curved sphere based on the spherical geodesic of two points on the sphere. Given a query pixel $q$ with the spatial index $(\theta_q, \phi_q)$ from the index map $Y$, we choose its four corner pixels $\{p_i, i = 0, 1, 2, 3\}$ as the neighboring pixels, which are located on the feature maps $F_{\text{UP}}$ (as shown in the left of Fig. 6). The indices of the neighboring pixels satisfy the following conditions: $\theta_0 = \theta_3$, $\theta_1 = \theta_2$, $\phi_0 = \phi_1$, and $\phi_2 = \phi_3$. To obtain the feature value of the query pixel $q$, we employ the spherical linear interpolation (Slerp) [39], which is a constant-speed motion along the spherical geodesic of two points on the sphere, formulated as follows:

$$\text{Slerp}(a, b) = \frac{\sin((1 - t)\beta)}{\sin\beta} a + \frac{\sin(t\beta)}{\sin\beta} b, \tag{8}$$

where $\beta$ is the angle subtended by $a$ and $b$, and $t$ is the resampling weight. Note that $t$ is easy to determine if $a$ and $b$ are located on the same longitude. Therefore, we calculate the feature value of pixel $q$ with two steps. Firstly, we resample $p_0, p_1$ and $p_2, p_3$ to $p_{01}$ and $p_{23}$, respectively, as shown in the right of Fig. 6. Taking the resampling of $p_{01}$ as an example, the formulation can be described as:

$$F(p_{01}) = \frac{\sin((1 - t_{01})\alpha_{01})}{\sin\alpha_{01}} F(p_0) + \frac{\sin(t_{01}\alpha_{01})}{\sin\alpha_{01}} F(p_1), \tag{9}$$

Figure 6: Spherical resampling considers the angles (i.e., $\alpha_{01}$, $\alpha_{23}$, $\Omega$) between points on the sphere, which correspond to the red solid curves.

Figure 7: The samples in the third scenario. The participant can freely choose the viewing direction and zoom level to get a better viewing experience.

where $\alpha_{01}$ is the angle subtended by $p_0$ and $p_1$, and the weight $t_{01}$ is decided by the location of $p_{01}$ on the curve $\overset{\frown}{p_0 p_1}$. Notably, $t_{01}$ should ensure that $p_{01}$ has the same longitude as the query pixel $q$. Similarly, $\alpha_{23}$ is the angle subtended by $p_2$ and $p_3$, and the weight $t_{23}$ is calculated so that $p_{23}$ also has the same longitude as the query pixel $q$. After that, we follow the Slerp (Eq. 8) to calculate the feature value $F(q)$ as follows:

$$F(q)=\frac{\sin\big((1-t_{q})\Omega\big)}{\sin\Omega}F(p_{01})+\frac{\sin(t_{q}\Omega)}{\sin\Omega}F(p_{23}), \tag{10}$$

where $\Omega$ is the angle subtended by $p_{01}$ and $p_{23}$, and $t_{q}$ is determined by the location of $q$ on the arc $\overset{\frown}{p_{01}p_{23}}$. If we assume that $p_{0}$, $p_{1}$, and $p_{01}$ have the same latitude, the calculation of $t_{01}$ can be simplified into $\frac{\theta_{q}-\theta_{0}}{\theta_{1}-\theta_{0}}$. Similarly, $t_{23}$ can be simplified into $\frac{\theta_{q}-\theta_{2}}{\theta_{3}-\theta_{2}}$. Also, $t_{q}$ can be simplified into $\frac{\phi_{q}-\phi_{01}}{\phi_{23}-\phi_{01}}$.
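As a rough illustration of Eqs. 9 and 10, the following minimal Python sketch interpolates a feature vector at a query point from its four neighbors. It assumes that $\theta$ denotes longitude and $\phi$ latitude (both in radians), that $p_0$, $p_1$ share one latitude and $p_2$, $p_3$ share another, and that features are plain lists of floats. The function names (`slerp_weights`, `arc_angle`, `spherical_resample`) are hypothetical and are not taken from the released code.

```python
import math


def slerp_weights(t, angle, eps=1e-8):
    """Slerp weights for position t in [0, 1] on an arc spanning `angle` radians."""
    s = math.sin(angle)
    if abs(s) < eps:
        # Degenerate arc: fall back to linear interpolation weights.
        return 1.0 - t, t
    return math.sin((1.0 - t) * angle) / s, math.sin(t * angle) / s


def arc_angle(lon_a, lat_a, lon_b, lat_b):
    """Central angle between two points on the unit sphere."""
    c = (math.sin(lat_a) * math.sin(lat_b)
         + math.cos(lat_a) * math.cos(lat_b) * math.cos(lon_a - lon_b))
    return math.acos(max(-1.0, min(1.0, c)))


def blend(w_a, f_a, w_b, f_b):
    """Weighted sum of two feature vectors."""
    return [w_a * a + w_b * b for a, b in zip(f_a, f_b)]


def spherical_resample(q, corners, feats):
    """Interpolate the feature at q = (theta_q, phi_q) from p0..p3.

    corners: [(theta, phi)] for p0, p1 (one row) and p2, p3 (the other row).
    feats:   feature vectors F(p0), F(p1), F(p2), F(p3).
    """
    th_q, ph_q = q
    (th0, ph0), (th1, ph1), (th2, ph2), (th3, ph3) = corners
    f0, f1, f2, f3 = feats

    # Eq. 9 along each row, using the simplified longitude-based weights.
    t01 = (th_q - th0) / (th1 - th0)
    t23 = (th_q - th2) / (th3 - th2)
    w0, w1 = slerp_weights(t01, arc_angle(th0, ph0, th1, ph1))
    w2, w3 = slerp_weights(t23, arc_angle(th2, ph2, th3, ph3))
    f01 = blend(w0, f0, w1, f1)
    f23 = blend(w2, f2, w3, f3)

    # Eq. 10 between p01 and p23, which share the longitude of q.
    omega = arc_angle(th_q, ph0, th_q, ph2)
    t_q = (ph_q - ph0) / (ph2 - ph0)
    w_a, w_b = slerp_weights(t_q, omega)
    return blend(w_a, f01, w_b, f23)
```

Note that, unlike bilinear weights, the two Slerp weights do not sum exactly to one for interior points; for the small angles between neighboring pixels the difference is minor.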

III-D View Transformation in VR

As shown in Fig. 7, the transformed ODI is displayed by our proposed system. As OmniZoomer [19] only supports generating ODIs in the ERP format, its output cannot meet the demands of viewing ODIs in VR with different FoVs. In contrast, the proposed algorithm in OmniVR supports various projection formats by adding a projection transformation layer. The transformed ODI can thus be displayed in VR in a perspective format that fits the user’s FoV with less distortion. For details on projection transformations, we refer readers to the recent survey [40]. The ODI transformed by our algorithm contains finer details, which can help the user recognize and understand the scene, thus improving the immersive and interactive experience in VR. Our system can display the generated HR ODIs as perspective views in VR under various user commands, including zooming in/out and moving in different directions, e.g., left, right, and up.
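The projection transformation layer itself is not specified in this section. As an illustrative assumption only, the sketch below uses the standard rectilinear (gnomonic) mapping from a perspective-view pixel to fractional ERP pixel coordinates. The function name, the axis conventions (x right, y up, z forward, positive yaw turning right, positive pitch looking up), and the half-pixel offsets are all hypothetical choices.

```python
import math


def perspective_pixel_to_erp(u, v, out_w, out_h, fov_deg, yaw, pitch, erp_w, erp_h):
    """Map pixel (u, v) of a perspective view to fractional ERP (col, row).

    fov_deg is the horizontal field of view; yaw and pitch are in radians.
    """
    # Focal length in pixels derived from the horizontal FoV.
    f = 0.5 * out_w / math.tan(math.radians(fov_deg) / 2.0)

    # Viewing ray in camera coordinates (x right, y up, z forward).
    x = (u + 0.5) - out_w / 2.0
    y = out_h / 2.0 - (v + 0.5)
    z = f

    # Rotate by pitch about the x-axis, then by yaw about the y-axis.
    cp, sp = math.cos(pitch), math.sin(pitch)
    y1 = y * cp + z * sp
    z1 = -y * sp + z * cp
    cy, sy = math.cos(yaw), math.sin(yaw)
    x2 = x * cy + z1 * sy
    z2 = -x * sy + z1 * cy

    # Ray direction to longitude/latitude, then to ERP pixel coordinates.
    lon = math.atan2(x2, z2)
    lat = math.atan2(y1, math.hypot(x2, z2))
    col = (lon / (2.0 * math.pi) + 0.5) * erp_w - 0.5
    row = (0.5 - lat / math.pi) * erp_h - 0.5
    return col, row
```

The returned coordinates are fractional, so rendering a view would additionally require interpolating the ERP image at those positions and wrapping columns across the left/right seam; both steps are omitted here.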

IV Experiment for Algorithm

IV-A Dataset and Implementation Details

Datasets. No datasets exist for ODIs under Möbius transformations, and collecting real-world ODI pairs with corresponding Möbius transformation matrices is difficult. Thus, we propose the ODI-Möbius (ODIM) dataset to train our OmniVR and the compared methods in a supervised manner. Our dataset is based on the ODI-SR dataset [41], with 1191 images in the training set, 100 images in the validation set, and 100 images in the test set. Note that the proposed dataset is consistent with the conference version [19]. During training, as our system aims to support free navigation and zooming in VR, we generate user commands that include horizontal rotation, vertical rotation, and zoom-in/out operations. The user commands are converted to the parameters $\{a,b,c,d\}$ of the Möbius transformation matrix according to Eq. 6. During validation and testing, we assign a fixed user command and the corresponding Möbius transformation matrix to each ODI. In addition, we test on the SUN360 [42] dataset with 100 images.

| Method | ×8 ODI-SR WS-PSNR | ×8 ODI-SR WS-SSIM | ×8 SUN 360 WS-PSNR | ×8 SUN 360 WS-SSIM | ×16 ODI-SR WS-PSNR | ×16 ODI-SR WS-SSIM | ×16 SUN 360 WS-PSNR | ×16 SUN 360 WS-SSIM |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 26.77 | 0.7725 | 25.87 | 0.7103 | 24.79 | 0.7404 | 23.87 | 0.6802 |
| RCAN (+Transform) [43] | 27.46 | 0.7906 | 27.04 | 0.7443 | 25.45 | 0.7541 | 24.70 | 0.7001 |
| LAU-Net (+Transform) [20] | 27.25 | 0.7813 | 26.77 | 0.7363 | 25.23 | 0.7455 | 24.49 | 0.6921 |
| OmniZoomer-RCAN | 27.53 | 0.7970 | 27.34 | 0.7592 | 25.50 | 0.7584 | 24.84 | 0.7034 |
| OmniVR-RCAN | 27.62 | 0.8005 | 27.50 | 0.7662 | 25.52 | 0.7629 | 24.89 | 0.7094 |

TABLE I: Quantitative comparison of Möbius transformation results on ODIs. (+Transform) denotes that we first employ a scale-specific SR model for image SR and then conduct image-level Möbius transformation on the SR image. We report on ODI-SR dataset and SUN360 dataset with up-sampling factors ×8 and ×16. Bold indicates the best results.

| Method | ×8 ODI-SR WS-PSNR | ×8 ODI-SR WS-SSIM | ×8 SUN 360 WS-PSNR | ×8 SUN 360 WS-SSIM | ×16 ODI-SR WS-PSNR | ×16 ODI-SR WS-SSIM | ×16 SUN 360 WS-PSNR | ×16 SUN 360 WS-SSIM |
|---|---|---|---|---|---|---|---|---|
| Bicubic | 19.64 | 0.5908 | 19.72 | 0.5403 | 17.12 | 0.4332 | 17.56 | 0.4638 |
| EDSR [36] | 23.97 | 0.6483 | 23.79 | 0.6472 | 22.24 | 0.6090 | 21.83 | 0.5974 |
| RCAN [43] | 24.26 | 0.6554 | 23.88 | 0.6542 | 22.49 | 0.6176 | 21.86 | 0.5938 |
| 360-SS [44] | 24.14 | 0.6539 | 24.19 | 0.6536 | 22.35 | 0.6102 | 22.10 | 0.5947 |
| SphereSR [12] | 24.37 | 0.6777 | 24.17 | 0.6820 | 22.51 | 0.6370 | 21.95 | 0.6342 |
| LAU-Net [20] | 24.36 | 0.6602 | 24.24 | 0.6708 | 22.52 | 0.6284 | 22.05 | 0.6058 |
| LAU-Net+ [41] | 24.63 | 0.6815 | 24.37 | 0.6710 | 22.97 | 0.6316 | 22.22 | 0.6111 |
| OmniVR-RCAN | 24.61 | 0.6822 | 24.53 | 0.7152 | 22.68 | 0.6324 | 22.14 | 0.6483 |

TABLE II: Quantitative comparison on the ODI SR task. The numbers are excerpted from [41], except for those of [12], because its reported results were obtained using 800 training images in the ODI-SR dataset. We report ×8 and ×16 SR results on the ODI-SR and SUN360 datasets. Bold indicates the best.

Implementation details. We mainly evaluate on the ERP format. The resolution of the HR ERP images is 1024×2048, and the up-sampling factors we choose are ×8 and ×16. We use the L1 loss, optimized by the Adam optimizer [45] with an initial learning rate of 1e-4. The batch size is 1 when using RCAN [43] as the backbone. In particular, considering the spherical nature of ODIs, we use the WS-PSNR [46] and WS-SSIM [47] metrics for evaluation.
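To make the evaluation metric concrete, the following minimal sketch computes WS-PSNR for a single-channel ERP image, where each row is weighted by the cosine of its latitude so that the over-sampled polar regions contribute less to the error. It is a plain-Python illustration on nested lists, not the evaluation code used for the tables; the function name `ws_psnr` and the default `max_val` of 255 are assumptions.

```python
import math


def ws_psnr(ref, out, max_val=255.0):
    """WS-PSNR between two single-channel ERP images given as lists of rows."""
    h = len(ref)
    weighted_err = 0.0
    weight_sum = 0.0
    for i in range(h):
        # Row weight: cosine of the latitude at the center of row i.
        w = math.cos((i + 0.5 - h / 2.0) * math.pi / h)
        for r, o in zip(ref[i], out[i]):
            weighted_err += w * (r - o) ** 2
            weight_sum += w
    wmse = weighted_err / weight_sum
    if wmse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / wmse)
```

With this weighting, the same pixel error lowers the score less when it occurs near a pole than when it occurs near the equator.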

Figure 8: Visual comparisons of Möbius transformation results on ODI-SR (1st row) and SUN360 (2nd row) datasets.

IV-B Quantitative and Qualitative Evaluation

Navigate and Zoom in: Apart from OmniZoomer [19], no prior methods can be directly compared. For a fair comparison, we combine existing image SR models [43, 20] with image-level Möbius transformations. The SR models designed for 2D planar images are retrained based on their official settings.

In Tab. I, with RCAN as the backbone, OmniVR outperforms the compared methods on all metrics, up-sampling factors, and test sets. This shows the effectiveness of incorporating the Möbius transformation into the neural network. Note that LAU-Net shows inferior performance because it is limited to vertically-captured ODIs. Importantly, OmniVR outperforms OmniZoomer on all metrics, demonstrating the importance of the skip connection for stable convergence during training. As shown in the first row of Fig. 8 on the ODI-SR dataset, OmniVR predicts clearer picture frames. Similarly, in the second row of Fig. 8 on the SUN360 dataset, OmniVR reconstructs clearer structures of the chairs than other methods.

Direct SR: OmniVR can perform the plain SR task by setting the Möbius transformation matrix to the identity matrix. Tab. II shows that OmniVR with RCAN as the backbone obtains the best result on 5 of the 8 metrics, which demonstrates the strong capability of our method to handle the inherent distortions.

Figure 9: Overview of the five scenarios and their order. The user study contains the baseline condition and OmniVR. The two techniques both use Möbius transformation for zoom in/out, while OmniVR additionally combines image enhancement.
Figure 10: Sample scenes during the user study. (a) The participant is equipped with the VR headset and controllers. (b) The participant is viewing the scenarios, which are displayed on the screen simultaneously. (c) Before and after equipping the VR headset, the participant fills in basic information and takes part in the interview, respectively.

V User Study

We conduct a within-subject user study in VR to explore the effectiveness of OmniVR and the overall user experience of our proposed system. Note that the user study involves no more than minimal risk, and the IRB has granted a waiver for the review process. To establish a comparative baseline for assessing the image enhancement performance of OmniVR, we include a condition that only enables Möbius transformation with Bicubic interpolation, as shown in Fig. 9.

V-A Experiment Set-Up & Participants

Participants. We recruited 18 participants (P1-P18) through the university mailing list, including 8 males and 10 females. 56% of them are between 18 and 24 years old, and 44% are between 25 and 34 years old. Nine participants have viewed ODIs on VR devices and mobile phones, and six have viewed ODIs on computers. Furthermore, their self-rated familiarity with 3D games or 3D models is moderate (on a scale where 1 represents very low and 7 represents very high, the average is 3.94). Each participant received 3 dollars as compensation.

Apparatus & Data. We employ a Meta Quest 2 for the experiment, as shown in Fig. 10(a). We select five ODIs as the five scenarios in the VR experience. The ODIs are from the training and testing sets of the Flickr360 dataset [48]. The selected ODIs contain text or textures in the equatorial regions whose appearance differs across zoom levels and spatial resolutions. The ODIs range from indoor to outdoor scenarios. The transitions among scenarios and zooming in/out are implemented using the CenarioVR software (https://www.cenariovr.com/) and controlled through specific arrow keys. During the VR experience, the view direction is controlled through head movements, while the scenario transition is controlled with the trigger of the right VR controller.

Figure 11: Means and standard deviations of five scenarios from two groups, i.e., the baseline condition and OmniVR. We also present the difference between the two groups using a t-test.

Experimental Conditions. As shown in Fig. 9, we set a baseline condition for comparison. Like OmniVR, the baseline condition uses the Möbius transformation so that users can zoom in/out to find details at various zoom levels. However, the Möbius transformation in the baseline condition is conducted with Bicubic interpolation, while in OmniVR it is achieved by deep learning. Moreover, OmniVR additionally applies image enhancement to recover more details.

V-B Design & Procedure

The experiment was within-subjects: 2 techniques × 5 scenarios × 3 answer ranges (i.e., accuracy, confidence, and response time) = 30 responses per participant. We fixed the order of the two conditions because the ODIs obtained from OmniVR exhibit higher quality than those of the baseline condition (Sec. IV). Viewing the ODIs generated with OmniVR first would leak information to the subsequent baseline condition and thus bias the comparison of task performance. To further avoid this memorization issue when comparing the two conditions, we design two questions of similar difficulty for each scenario. These questions concern text recognition in four scenarios and counting a specific object in one scenario. Participants respond to one question in the baseline condition and to the other question in the OmniVR condition. The order of the two questions is random, while the number of participants receiving each order is balanced. We also fixed the order of scenarios because there is no relationship between them. The experiment lasted about 20 minutes on average.

Figure 12: Visualization of scenario 4. The first row is the HR ground truth ODI. The second and third rows represent the areas related to the two questions, respectively. The left patches are generated with the baseline condition, while the right patches are generated with OmniVR.

This study contains two parts. Before starting the study, each participant was given a short introduction to the study and our system. The VR headset and controllers were adjusted for each participant to ensure that the testing text was clear. The first part focused on assessing the image enhancement performance of our proposed technique. The participants first viewed the five scenarios generated with the baseline condition and then viewed those generated with OmniVR, as shown in Fig. 10(b). This order would help participants forget specific impressions for a fair comparison. For each technique × scenario combination, participants were asked to answer the question as quickly as possible. The second part focused on the overall user experience of the whole system. We asked participants to fill in a 7-point Likert questionnaire to measure their cybersickness, mental and physical costs, immersive experience, and the usability of the zoom-in technique under the two techniques, as shown in Fig. 10(c). For a fair comparison, the participants were only informed that the first five and the last five scenarios were generated with “Technique 1” and “Technique 2”, respectively. We also conducted a post-study interview to collect user feedback, including suggestions and expectations for our technique and system.

V-C Measures

In the first part, we set three metrics for evaluation: accuracy, response time, and confidence level. Ideally, higher accuracy, shorter response time, and higher confidence levels indicate higher image quality. We set questions about text recognition and about counting a specific object. For the text recognition question, the accuracy is the ratio of correct words. For the counting question, the accuracy is calculated as $e^{-|N_{1}-N_{2}|}$, where $N_{1}$ and $N_{2}$ are the responded and correct numbers, respectively. The confidence level is measured on a 7-point Likert scale. In addition, we allow participants to respond “Invisible” if the scenario is too blurry to identify. In this case, the accuracy and confidence level are set to zero. In the second part, we collected their 7-point Likert scale ratings of cybersickness, mental and physical costs, immersive experience, and usability (1 = very low, 7 = very high).
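The two accuracy measures can be sketched in a few lines of Python. The counting accuracy follows the formula above directly. For text recognition, the exact matching rule is not stated, so the sketch assumes a case-insensitive check of whether each ground-truth word appears in the response. An “Invisible” response is represented by `None`; both function names are hypothetical.

```python
import math


def text_accuracy(response, truth):
    """Ratio of ground-truth words found in the response (assumed matching rule)."""
    if response is None:
        # "Invisible" response: accuracy is set to zero.
        return 0.0
    truth_words = truth.lower().split()
    if not truth_words:
        return 0.0
    response_words = set(response.lower().split())
    hits = sum(1 for word in truth_words if word in response_words)
    return hits / len(truth_words)


def counting_accuracy(n_response, n_true):
    """Accuracy exp(-|N1 - N2|) for the counting question."""
    if n_response is None:
        # "Invisible" response: accuracy is set to zero.
        return 0.0
    return math.exp(-abs(n_response - n_true))
```

Under this formula, a count that is off by one scores about 0.37 and a count that is off by two scores about 0.14, so the measure penalizes even small counting errors strongly.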

V-D Results & Findings

V-D1 Quantitative Evaluation

We utilize a t-test to analyze the participants’ VR experience in five scenarios. The main reason is that we want to compare the difference between the baseline condition and OmniVR.

Figure 13: Qualitative results of user study, including cybersickness, mental and physical costs, immersion, and usability. We also present the difference between the two groups using a t-test.

In Fig. 11, we report the comparison results between the baseline condition and OmniVR across five scenarios and three metrics. For accuracy in Fig. 11(a), OmniVR consistently surpasses the baseline condition in all five scenarios. In particular, in scenario 1, OmniVR outperforms the baseline condition by 0.48. This is mainly because in scenario 1 the baseline condition recovers few local details, and thus 5 participants respond “Invisible”. A similar situation occurs in scenario 4, where 8 participants respond “Invisible” under the baseline condition. For response time in Fig. 11(b), OmniVR takes less time than the baseline condition in most scenarios and a comparable time in scenario 4. For confidence level in Fig. 11(c), OmniVR obtains higher confidence levels in all scenarios. This demonstrates that OmniVR helps participants identify local details easily through the zoom-in function and image enhancement, which significantly reduces the ambiguous cases that degrade the confidence level.

We further utilize a t-test to explore the difference between the baseline condition and OmniVR. In Fig. 11, the baseline condition and OmniVR show significant differences in most scenarios and metrics. For scenario 3, the accuracy and response time show no significant difference between the two conditions. We attribute this to the simple and easily identifiable words in scenario 3. For scenario 4, the response time shows no significant difference between the two groups. In Fig. 11(b), OmniVR takes slightly more time than the baseline condition in scenario 4. We think this is related to the task design: the task in scenario 4 is high-level and involves counting a specific object. As a result, if the scene is difficult to identify, participants tend to give a very fast “Invisible” response. For example, P16 takes 2.5 seconds to give the “Invisible” response under the baseline condition, but 16.0 seconds to finish the counting task with OmniVR.
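The variant of the t-test is not stated. Because every participant experiences both conditions, one natural choice would be a paired t-test on the per-participant differences; the sketch below assumes that choice and computes only the test statistic and degrees of freedom using the Python standard library. The function name `paired_t_statistic` and the sample values in the usage note are illustrative, not data from the study.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two matched samples with at least two pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```

For example, `paired_t_statistic([1, 2, 3, 4], [2, 4, 3, 7])` yields a statistic of about 2.32 with 3 degrees of freedom. Converting the statistic into a p-value requires the Student t distribution, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.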

V-D2 Qualitative Evaluation

We present participants’ feedback on the reasons for their subjective ratings of VR experiences under two experimental conditions (Fig. 13) as well as their expectations and suggestions for further improvement.

Usability. Generally, participants rated the usability of the zoom-in technique higher in OmniVR than in the baseline condition, with a significant difference ($p<.001$). Seven participants thought that the zoom-in technique is “less effective in the baseline condition, but very helpful in OmniVR”, while only one participant (P15) thought that the zoom-in technique is “useful in both groups”. In contrast, one participant (P18) complained about “the more blurry effects” of the zoomed-in regions compared with the original regions. Interestingly, P13 presented a different view on the usability of the zoom-in technique: “The second group of scenarios is clear enough, and I have no need to zoom in; but for the first group of scenarios, I still need to zoom in to find more details.”

Immersion. Overall, participants were more satisfied with the immersive experience under OmniVR than under the baseline condition, with a significant difference ($p<.001$). The main reason given for the improved immersive experience is the high-quality details recovered by OmniVR. For example, P17 said that “The scenarios are natural and I feel interesting when watching these scenarios.” However, a few participants (N=3) also noted that some cases still break immersion. For example, “the spherical distortion” (P2) makes the scene “unrealistic” (P3), and “the seriously blurry effect exists out of the focused regions” (P6). These effects arise from the inherent properties of ODIs and the Möbius transformation. In addition, one participant (P8) complained about the scenario selection and experienced less immersion in the first scenario because it is “about a clock and looks like a planar scenario.”

Cybersickness. Participants reported less cybersickness with OmniVR. Although there is no significant difference between the two conditions, some participants (N=3) indicated that OmniVR could “recover more local details” and thus alleviate the “blurry effect”, which is the main cause of cybersickness. Some other participants explained that they did not feel an obvious difference in cybersickness as “the duration of this VR experience was not very long”.

Workload. In line with the lower cybersickness, participants reported lower mental and physical loads with OmniVR. In particular, OmniVR shows a significant difference ($p<.05$) from the baseline condition in mental cost. Some participants (N=4) gave the main reason as “the blurry effect would increase the workload simultaneously”. Specifically, two participants noted that their mental costs would increase correspondingly “if the scenario is hard to identify”.

VI Discussion

About the zoom-in function. In Sec. V-D2, we have demonstrated the effectiveness of the zoom-in function, especially in OmniVR. However, four participants (P7, P8, P12, P16) noted that the zoom-in function is limited to fixed zoom levels. Specifically, although the zoomed-in regions provide more local details, the objects of interest often extend beyond the FoV or are not at its center. In this case, participants have to adjust their viewing directions continuously to find the optimal direction, which degrades the immersive experience and increases both the mental and physical burden. To improve this, we plan to learn how to select an optimal transformation given only the objects of interest. This might involve scene understanding techniques, such as ODI object detection.

About the accuracy. In four scenarios (1, 2, 3, 5), the questions are about text recognition. There are two issues with these questions. First, P12 said that “In scenario 3, some words on the bridge are written with scrawl, making it ambiguous for recognition”. Second, P16 said that “Some words might be guessed by associating them with prior knowledge”. That is, although some words are difficult to identify due to blur, participants might still answer them correctly if they understand the meaning of the sentence.

Failure cases. There is an interesting phenomenon in the second question of scenario 4. The question asks how many pipes are held by the three standing people. As shown at the top of Fig. 12, only one person (middle) holds a pipe. Only one participant gives the correct response using OmniVR, while three participants give the correct response using the baseline condition. On further analysis, we find that two railings run alongside the standing people. OmniVR recovers clearer railings but fails to recover the detail of the pipe (see Fig. 12, middle), so the railings might be mistaken for pipes. As a result, most participants said that “Two pipes are in the hands of standing people”. In contrast, the baseline condition cannot recover the detail of the pipe, and only one railing is faintly visible. As a result, three participants said that “One pipe is in the hands of standing people”. In the other question, about the pipes held by the lying people (see Fig. 12, bottom), OmniVR recovers the details of the pipes clearly, and the number of correct responses with OmniVR increases markedly.

Limitation. Our system can recover HR and high-quality details under various user commands. However, the user commands are entirely determined by the user’s operation, as discussed in the zoom-in paragraph of Sec. VI. In addition, the streaming speed of our algorithm is a limitation. At the current stage, we can only generate the transformed ODI in advance on GPUs according to the user commands and then display it in VR.

VII Conclusion

In this paper, we have presented a novel OmniVR system that enables viewers to navigate and zoom in/out effortlessly in VR, and have developed a learning-based algorithm to refine the visual fidelity. A comprehensive user study showed the following benefits: 1) our system improved scenario recognition for viewers by recovering the details of objects of interest; 2) our system reduced discomfort and helped viewers gain confidence during VR navigation; 3) our system was user-friendly under various user commands, i.e., navigation and zoom in/out. Our study revealed the importance of visual quality under various user commands in VR navigation, especially when the objects of interest are small and require zooming in. We release the project code of our OmniVR system at http://vlislab22.github.io/OmniVR/ to inspire future studies in the community.

References

  • [1] A. Vermast and W. Hürst, “Introducing 3d thumbnails to access 360-degree videos in virtual reality,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2547–2556, 2023.
  • [2] Z. Luo, B. Chai, Z. Wang, M. Hu, and D. Wu, “Masked360: Enabling robust 360-degree video streaming with ultra low bandwidth consumption,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2690–2699, 2023.
  • [3] M. Dasari, E. Lu, M. W. Farb, N. Pereira, I. Liang, and A. Rowe, “Scaling vr video conferencing,” in 2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR).   IEEE, 2023, pp. 648–657.
  • [4] Q. Zhang, J. Wei, S. Wang, S. Ma, and W. Gao, “Realvr: Efficient, economical, and quality-of-experience-driven vr video system based on mpeg omaf,” IEEE Transactions on Multimedia, 2022.
  • [5] J. Lou, Y. Wang, C. Nduka, M. Hamedi, I. Mavridou, F.-Y. Wang, and H. Yu, “Realistic facial expression reconstruction for vr hmd users,” IEEE Transactions on Multimedia, vol. 22, no. 3, pp. 730–743, 2019.
  • [6] P. Szabo, A. Simiscuka, S. Masneri, M. Zorrilla, and G.-M. Muntean, “A cnn-based framework for enhancing 360 vr experiences with multisensorial effects,” IEEE Transactions on Multimedia, 2022.
  • [7] S. Verma, L. Warrier, B. Bolia, and S. Mehta, “Past, present, and future of virtual tourism-a literature review,” International Journal of Information Management Data Insights, vol. 2, no. 2, p. 100085, 2022.
  • [8] A. Mohammad and H. Ismail, “Development and evaluation of an interactive 360 virtual tour for tourist destinations,” J. Inform. Technol. Impact, vol. 9, pp. 137–182, 2009.
  • [9] A. Azmi, R. Ibrahim, M. Abdul Ghafar, and A. Rashidi, “Smarter real estate marketing using virtual reality to influence potential homebuyers’ emotions and purchase intention,” Smart and Sustainable Built Environment, vol. 11, no. 4, pp. 870–890, 2022.
  • [10] J. Singh, M. Malhotra, and N. Sharma, “Metaverse in education: An overview,” Applying metalytics to measure customer experience in the metaverse, pp. 135–142, 2022.
  • [11] H. Chang and M. F. Cohen, “Panning and zooming high-resolution panoramas in virtual reality devices,” in Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017, pp. 279–288.
  • [12] Y. Yoon, I. Chung, L. Wang, and K.-J. Yoon, “Spheresr: 360deg image super-resolution with arbitrary projection via continuous spherical image representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5677–5686.
  • [13] M. Kwon, R. Liu, and L. Chien, “Compensation for blur requires increase in field of view and viewing time,” PLoS One, vol. 11, no. 9, p. e0162711, 2016.
  • [14] L. O’Hare and P. B. Hibbard, “Visual discomfort and blur,” Journal of vision, vol. 13, no. 5, pp. 7–7, 2013.
  • [15] D. M. Hoffman, A. R. Girshick, K. Akeley, and M. S. Banks, “Vergence–accommodation conflicts hinder visual performance and cause visual fatigue,” Journal of vision, vol. 8, no. 3, pp. 33–33, 2008.
  • [16] S. Ang and J. Quarles, “Gingervr: An open source repository of cybersickness reduction techniques for unity,” in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW).   IEEE, 2020, pp. 460–463.
  • [17] R. Hussain, M. Chessa, and F. Solari, “Mitigating cybersickness in virtual reality systems through foveated depth-of-field blur,” Sensors, vol. 21, no. 12, p. 4006, 2021.
  • [18] X. Meng, R. Du, M. Zwicker, and A. Varshney, “Kernel foveated rendering,” Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 1, no. 1, pp. 1–20, 2018.
  • [19] Z. Cao, H. Ai, Y.-P. Cao, Y. Shan, X. Qie, and L. Wang, “Omnizoomer: Learning to move and zoom in on sphere at high-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 897–12 907.
  • [20] X. Deng, H. Wang, M. Xu, Y. Guo, Y. Song, and L. Yang, “Lau-net: Latitude adaptive upscaling network for omnidirectional image super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9189–9198.
  • [21] V. Fakour-Sevom, E. Guldogan, and J.-K. Kämäräinen, “360 panorama super-resolution using deep convolutional networks,” in Int. Conf. on Computer Vision Theory and Applications (VISAPP), vol. 1, 2018.
  • [22] A. Nishiyama, S. Ikehata, and K. Aizawa, “360 single image super resolution via distortion-aware network and distorted perspective images,” in 2021 IEEE International Conference on Image Processing (ICIP).   IEEE, 2021, pp. 1829–1833.
  • [23] F. Yu, X. Wang, M. Cao, G. Li, Y. Shan, and C. Dong, “Osrt: Omnidirectional image super-resolution with distortion-aware transformer,” arXiv preprint arXiv:2302.03453, 2023.
  • [24] X. Sun, W. Li, Z. Zhang, Q. Ma, X. Sheng, M. Cheng, H. Ma, S. Zhao, J. Zhang, J. Li et al., “Opdn: Omnidirectional position-aware deformable network for omnidirectional image super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1293–1301.
  • [25] L. Penaranda, L. Velho, and L. Sacht, “Real-time correction of panoramic images using hyperbolic möbius transformations,” Journal of Real-Time Image Processing, vol. 15, pp. 725–738, 2018.
  • [26] L. S. Ferreira, L. Sacht, and L. Velho, “Local moebius transformations applied to omnidirectional images,” Computers & Graphics, vol. 68, pp. 77–83, 2017.
  • [27] L. S. Ferreira and L. Sacht, “Bounded biharmonic blending of möbius transformations for flexible omnidirectional image rectification,” Computers & Graphics, vol. 93, pp. 51–60, 2020.
  • [28] S. Schleimer and H. Segerman, “Squares that look round: transforming spherical images,” arXiv preprint arXiv:1605.01396, 2016.
  • [29] J. Wu, C. Xia, T. Yu, and J. Li, “View-aware salient object detection for 360° omnidirectional image,” arXiv preprint arXiv:2209.13222, 2022.
  • [30] S. Zhou, J. Zhang, H. Jiang, T. Lundh, and A. Y. Ng, “Data augmentation with mobius transformations,” Machine Learning: Science and Technology, vol. 2, no. 2, p. 025016, 2021.
  • [31] D. P. Mandic and V. S. L. Goh, Complex valued nonlinear adaptive filters: noncircularity, widely linear and neural models.   John Wiley & Sons, 2009.
  • [32] N. Özdemir, B. B. İskender, and N. Y. Özgür, “Complex valued neural network with möbius activation function,” Communications in Nonlinear Science and Numerical Simulation, vol. 16, no. 12, pp. 4698–4703, 2011.
  • [33] N. Azizi, H. Possegger, E. Rodolà, and H. Bischof, “3d human pose estimation using möbius graph convolutional networks,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I. Springer, 2022, pp. 160–178.
  • [34] T. W. Mitchel, N. Aigerman, V. G. Kim, and M. Kazhdan, “Möbius convolutions for spherical cnns,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–9.
  • [35] S. Kato and P. McCullagh, “Möbius transformation and a cauchy family on the sphere,” arXiv: Statistics Theory, 2015.
  • [36] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 136–144.
  • [37] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
  • [38] K. Eybpoosh, M. Rezghi, and A. Heydari, “Applying inverse stereographic projection to manifold learning and clustering,” Applied Intelligence, vol. 52, pp. 4443–4457, 2021.
  • [39] J. P. Fatelo and N. Martins-Ferreira, “Mobility spaces and geodesics for the n-sphere,” 2021.
  • [40] H. Ai, Z. Cao, J. Zhu, H. Bai, Y. Chen, and L. Wang, “Deep learning for omnidirectional vision: A survey and new perspectives,” arXiv preprint arXiv:2205.10468, 2022.
  • [41] X. Deng, H. Wang, M. Xu, L. Li, and Z. Wang, “Omnidirectional image super-resolution via latitude adaptive network,” IEEE Transactions on Multimedia, 2022.
  • [42] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing scene viewpoint using panoramic place representation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [43] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 286–301.
  • [44] C. Ozcinar, A. Rana, and A. Smolic, “Super-resolution of omnidirectional images using adversarial learning,” in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
  • [45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [46] Y. Sun, A. Lu, and L. Yu, “Weighted-to-spherically-uniform quality evaluation for omnidirectional video,” IEEE signal processing letters, vol. 24, no. 9, pp. 1408–1412, 2017.
  • [47] Y. Zhou, M. Yu, H. Ma, H. Shao, and G. Jiang, “Weighted-to-spherically-uniform ssim objective quality evaluation for panoramic video,” in 2018 14th IEEE International Conference on Signal Processing (ICSP). IEEE, 2018, pp. 54–57.
  • [48] M. Cao, C. Mou, F. Yu, X. Wang, Y. Zheng, J. Zhang, C. Dong, G. Li, Y. Shan, R. Timofte et al., “Ntire 2023 challenge on 360° omnidirectional image and video super-resolution: Datasets, methods and results,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1731–1745.