Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

Zidong Cao¹, Zhan Wang¹, Yexin Liu¹, Yan-Pei Cao³, Ying Shan³, Wei Zeng¹,², and Lin Wang¹,²,* (¹HKUST(GZ), ²HKUST, ³ARC Lab, Tencent PCG; *corresponding author)

Abstract

Viewing omnidirectional images (ODIs) in virtual reality (VR) represents a novel form of media that provides immersive experiences for users to navigate and interact with digital content. Nonetheless, this sense of immersion can be greatly compromised by a blur effect that masks details and hampers the user’s ability to engage with objects of interest. In this paper, we present a novel system, called OmniVR, designed to enhance visual clarity during VR navigation. Our system enables users to effortlessly locate and zoom in on the objects of interest in VR. It captures user commands for navigation and zoom, converting these inputs into parameters for the Möbius transformation matrix. Leveraging these parameters, the ODI is refined using a learning-based algorithm. The resultant ODI is presented within the VR media, effectively reducing blur and increasing user engagement. To verify the effectiveness of our system, we first compare our algorithm against state-of-the-art methods on public datasets, where it achieves the best performance. Furthermore, we undertake a comprehensive user study to evaluate viewer experiences across diverse scenarios and to gather qualitative feedback from multiple perspectives. The outcomes reveal that our system enhances user engagement by improving the viewers’ recognition, reducing discomfort, and improving the overall immersive experience. Our system makes navigation and zoom more user-friendly. For more details, e.g., demos, please refer to the project page http://vlislab22.github.io/OmniVR/.

Index Terms:

Virtual reality, Image Processing and Computer Vision, Human-computer interaction

I Introduction

Figure 1: Comparing the VR experience under the baseline condition and our OmniVR. Users can freely navigate and zoom in/out to see the object of interest. With our proposed algorithm, the objects can be refined with clear textural details, thus enhancing the engagement and immersive experience.

Omnidirectional images (ODIs), also called $360^{\circ}$ images, have increasingly attracted interest for their capability to capture extensive content within a single frame. The recent surge in integrating such visual content into virtual reality (VR) environments is noteworthy [1, 2, 3]. This integration represents a novel form of media that allows users to navigate freely in any direction, offering an immersive and interactive experience akin to being in a real environment [4, 5, 6]. This immersive media has led to various applications, including but not limited to virtual tours [7, 8], real estate showcases [9], educational tools [10], and remote meeting solutions [3]. Moreover, the rapid advancement and wide availability of consumer-level VR devices [11] have made these experiences increasingly accessible to a broader audience. In this context, viewers transform from mere observers to active participants, who can navigate and zoom into objects of interest, thereby making significant progress in media consumption and user experience [11].

The ODIs are usually stored with the equirectangular projection (ERP) type and displayed in VR with perspective projection. A critical issue with ODIs is their relatively low angular resolution [12], which results in local regions appearing blurry. This blurriness intensifies when the images are zoomed in and navigated (see Fig. 1 (top)), potentially compromising the immersive experience by reducing local and fine details [13]. Consequently, this not only impedes user engagement with objects of interest but also detracts from the overall immersive experience, potentially causing mental and physical discomfort [14]. Several methods have been explored to mitigate this discomfort in VR navigation, including the incorporation of spatial blur [15], defocus blur [16], depth-of-field blur [17], and foveated rendering [18]. However, these methods fall short of enhancing the clear textural details of objects. Despite the freedom to navigate and zoom in/out on objects, the enlarged objects remain blurry.

In this paper, we introduce a novel system, dubbed OmniVR, aiming to make viewers navigate and zoom in/out in the VR space effortlessly, while simultaneously enhancing the visual quality to recover clear local details, as illustrated in Fig. 1 (bottom). As shown in Fig. 2, the viewer first views the original ODI displayed in a VR headset. With OmniVR, the viewer is free to navigate and find some objects of interest. Then, the viewer can use the headset and controller to give commands about rotation and zoom in/out. Our system captures these user commands and converts these commands into parameters for the Möbius transformation matrix (Sec. III-A). Leveraging these parameters, we propose a learning-based algorithm, which is built based on our conference work OmniZoomer [19], to achieve high-quality ODIs after transformation with two key techniques. First, OmniVR integrates the Möbius transformation into the network, enabling free navigation and zoom within ODIs. By learning transformed feature maps in various conditions, the network is enhanced to handle the increasing curves caused by navigation and zoom, thus alleviating the blurry effect (Sec. III-C). Secondly, we propose enhancing the feature maps to high-resolution (HR) space before the transformation. The HR feature maps contain more fine-grained textural details, which could compensate for the lack of pixels for describing curves (Sec. III-C1). After obtaining the HR feature maps, we also propose a spatial index generation module (Sec. III-C2) and a spherical resampling module (Sec. III-C3) to accomplish the feature transformation process. Finally, these feature maps are processed with a decoder to output the high-quality transformed ODI in ERP format. The ERP output is then transformed to the perspective format to be displayed in VR, effectively reducing blur and increasing user engagement.

For supervised learning, we create a dataset based on the ODI-SR dataset [20], called the ODIM dataset, including transformed ODIs under various Möbius transformations. We evaluate the effectiveness of OmniVR on the ODIM dataset under various Möbius transformations and up-sampling factors. Furthermore, we report the results of a user study for the VR experience to evaluate the effectiveness of our proposed system in quantitative and qualitative ways. Quantitatively, we record accuracy, response time, and confidence level in a series of scenarios and questions. Qualitatively, we conduct interviews about the subjects’ feelings, such as mental and physical costs, and immersive experience, etc. The results demonstrate that: 1) OmniVR is beneficial for participants to improve the recognition and understanding of the scenarios. 2) OmniVR can reduce the discomfort of participants. 3) OmniVR can significantly improve the immersive experience, and make navigation and zoom in the VR media more user-friendly.

The main contributions of this paper can be summarized as follows: (I) We propose a novel system OmniVR to enhance the visual clarity during VR navigation; (II) We propose a learning-based algorithm to enhance the ODI quality controlled by user commands; (III) We establish the ODIM dataset for supervised training. Compared with existing methods, OmniVR achieves state-of-the-art performance under various user commands and up-sampling factors; (IV) We conduct a user study, demonstrating the effectiveness of our system in both quantitative and qualitative ways.

II Related Works

Navigation in VR. VR is an emerging medium that provides more immersive and interactive experiences compared with traditional media [4, 5, 6]. Based on this immersive experience, there are already various applications, spanning virtual tourism [7], education [10], and entertainment [3]. High spatial resolution is important to the user experience because the blur caused by low spatial resolution reduces viewer engagement and might further cause discomfort. To alleviate the discomfort from the blur, current methods have proposed a series of techniques, including spatial blur [15], defocus blur [16], depth-of-field blur [17], and foveated rendering [18]. However, these methods mainly focus on the blurry regions but do not improve the local details of objects. As a result, once the viewer finds an object of interest, the object remains blurry and reveals no further detail, whatever operation the viewer tries.

ODI Super-Resolution. [21] and [22] take distortion maps as additional input to alleviate the distortions. LAU-Net [20] splits an ODI into several bands along the latitude because ODIs in different latitudes present different distortions. SphereSR [12] proposes to reconstruct an ODI with arbitrary projection formats, including ERP, cube maps, and perspective formats. Recently, OSRT [23] and OPDN [24] employ transformers to construct global context and obtain good performance. However, their outputs are restricted to ERP format with no transformation.

Möbius Transformation. It has been utilized for straight-line rectification [25, 26, 27], and for rotation and zoom [28]. Recently, [29] employs Möbius transformation to transform the feature maps into different forms to enhance the learning robustness. However, [29] does not explore generating transformed ODIs with high quality. Beyond ODI applications, Möbius transformation is widely applied in data augmentation [30], activation functions [31, 32], pose estimation [33], and convolutions [34]. In this paper, we propose a learning-based algorithm, built on our conference version [19], to improve the textural details of ODIs when navigating and zooming in to an object of interest in VR.

III OmniVR System

III-A System Overview

As shown in Fig. 2, we design a system, namely OmniVR, to help users effortlessly navigate and zoom in the VR media, aiming to enhance the visual quality and subsequently improve the immersion and interaction experience. Firstly, our system displays an original ODI for the viewer in the VR media. The viewer can freely navigate the scenario and find the object of interest. Once the object of interest is found, the viewer might feel that it is too small or not in the center of the field of view (FoV). Our system allows viewers to send commands through the VR headset and controllers. Then, the user command is utilized to generate the parameters of the Möbius transformation matrix. Leveraging these parameters, we propose a learning-based algorithm (see Fig. 4), built on our conference version [19], to transform the original ODI with high quality. In the end, the transformed ODI is displayed in VR to provide finer details for the viewer. Our system allows the transformed ODI to take various projection formats to adapt to the visual contents in VR. Below, we describe the user command, algorithm, and view transformation of our system in detail.

III-B User Command and Parameter Conversion

Our system first collects data about navigation and zoom operations through the VR headset and controllers. The rotation angle of the headset represents the navigation direction, and the trigger of the right controller controls the zoom in/out operation, as shown in Fig. 3. Specifically, the zoom operation is achieved through UI buttons with arrow patterns, such as up arrows (zoom in), down arrows (zoom out), and left/right/circulation arrows (scene transition), as shown in Fig. 7. Once the ray cast from the controller hits the region of a UI button, the command is sent by clicking the trigger. Then, the collected commands are summarized with three parameters, together named the user command: zoom level $s$, horizontal rotation angle $\beta$, and vertical rotation angle $\gamma$. Next, the user command is converted to the parameters $\{a,b,c,d\}$ of the Möbius transformation matrix. We choose Möbius transformation as it is the only bijective transformation on the sphere that preserves shape. Specifically, when performing horizontal rotation with angle $\beta$, the parameters of Möbius transformations can be represented as follows:

$$\begin{pmatrix}a&b\\ c&d\end{pmatrix}=\begin{pmatrix}\cos(\beta)+i\sin(\beta)&0\\ 0&1\end{pmatrix}. \tag{1}$$

Similarly, for vertical rotation with angle $\gamma$ , the parameters of Möbius transformations can be represented as follows:

$$\begin{pmatrix}a&b\\ c&d\end{pmatrix}=\begin{pmatrix}\cos(\frac{\gamma}{2})&\sin(\frac{\gamma}{2})\\ -\sin(\frac{\gamma}{2})&\cos(\frac{\gamma}{2})\end{pmatrix}. \tag{2}$$

An arbitrary rotation can be divided into horizontal rotation and vertical rotation. In addition, Möbius transformations can be composed to give a new Möbius transformation. Therefore, we can achieve arbitrary navigation on ODIs with horizontal rotation angle $\beta$ and vertical rotation angle $\gamma$ .

For zoom with level $s$, if the pole is the North pole, the parameters of Möbius transformations can be represented as follows:

$$\begin{pmatrix}a&b\\ c&d\end{pmatrix}=\begin{pmatrix}s&0\\ 0&1\end{pmatrix}. \tag{3}$$
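The conversion from a user command to the Möbius parameters can be sketched as follows. This is a minimal illustration, not the authors' implementation: it builds the matrices of Eqs. 1 to 3 and relies on the fact that composing Möbius transformations corresponds to multiplying their $2\times 2$ matrices. The function names and the composition order (horizontal rotation, then vertical rotation, then zoom) are assumptions made for this example; the text does not state the order.

```python
import cmath
import math


def matmul2(m, n):
    """Multiply two 2x2 complex matrices given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    (e, f), (g, h) = n
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def horizontal_rotation(beta):
    """Eq. 1: a = cos(beta) + i*sin(beta) = exp(i*beta), b = c = 0, d = 1."""
    return ((cmath.exp(1j * beta), 0j), (0j, 1 + 0j))


def vertical_rotation(gamma):
    """Eq. 2: half-angle rotation matrix."""
    c, s = math.cos(gamma / 2), math.sin(gamma / 2)
    return ((complex(c), complex(s)), (complex(-s), complex(c)))


def zoom(level):
    """Eq. 3: zoom about the North pole with level s."""
    return ((complex(level), 0j), (0j, 1 + 0j))


def user_command_to_mobius(level, beta, gamma):
    """Compose the three transformations into one matrix ((a, b), (c, d)).

    ASSUMED order: horizontal rotation first, then vertical rotation,
    then zoom. Applying N first and M second corresponds to M * N.
    """
    rotation = matmul2(vertical_rotation(gamma), horizontal_rotation(beta))
    return matmul2(zoom(level), rotation)
```

With this sketch, `user_command_to_mobius(1.0, 0.0, 0.0)` yields the identity matrix, i.e., the ODI is left unchanged. For any positive zoom level the determinant $ad-bc$ has modulus $s$, so the result is always a valid Möbius transformation.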

III-C The Proposed Algorithm

Overview. As depicted in Fig. 4, we propose a novel algorithm that allows for free navigation to objects of interest and zooming in with preserved shapes and high-quality details, based on our OmniZoomer [19]. Initially, we extract HR feature maps $F_{\text{UP}}\in\mathbb{R}^{H\times W\times C}$ from the input ODI $I_{\text{IN}}\in\mathbb{R}^{h\times w\times 3}$ through an encoder and an up-sampling block (Sec. III-C1). With $F_{\text{UP}}$’s index map $X\in\mathbb{R}^{H\times W\times 2}$ as the input, we propose the spatial index generation module (Sec. III-C2) to apply the Möbius transformation [35] according to the user command on $X$ to generate the transformed spatial index map $Y\in\mathbb{R}^{H\times W\times 2}$. Note that the two channels of $X$ and $Y$ store the longitude and latitude, respectively. Subsequently, we introduce a spherical resampling module (Sec. III-C3) that generates the transformed HR feature maps $F_{\text{M}}\in\mathbb{R}^{H\times W\times C}$ by resampling the pixels on the sphere guided by $Y$. Finally, we decode the transformed feature maps to output the transformed ODI. The decoder consists of three ResBlocks [36] and a convolution layer. We take the same parameters in the spatial index generation module to transform the HR ground truth ODIs and employ the $L_1$ loss for supervision. The proposed algorithm in OmniVR has two main differences from OmniZoomer [19]: 1) To stabilize the convergence during training, we employ a skip connection by applying the Möbius transformation to $I_{\text{IN}}$ and adding the result to the output of the decoder. 2) The proposed algorithm in our OmniVR supports arbitrary spherical projection, i.e., ERP and perspective projection, to meet the demands of viewing ODIs in VR with different FoVs. We now provide detailed descriptions of these components.

III-C1 Feature Extraction

Given an ODI $I_{\text{IN}}\in\mathbb{R}^{h\times w\times 3}$ in ERP format, our initial step involves the use of an encoder composed of several convolution layers. This encoder extracts the feature maps $F_{\text{IN}}\in\mathbb{R}^{h\times w\times C}$. Subsequently, we employ an up-sampling block, equipped with multiple pixel-shuffle layers [37], to produce HR feature maps $F_{\text{UP}}\in\mathbb{R}^{H\times W\times C}$. Here, $H=s\times h$ and $W=s\times w$ represent the spatial dimensions of the up-sampled feature maps, where $s$ denotes the up-sampling scale factor (not the zoom level of Sec. III-B) and $C$ indicates the number of channels. Notably, we apply the Möbius transformation in the HR space to address the aliasing issue. This issue arises because too few pixels are available to accurately describe continuous curves after the transformation, potentially leading to object shape distortion.
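To make the pixel-shuffle step concrete, the following is a small, dependency-free sketch of the rearrangement a single pixel-shuffle layer performs. It is an illustration under assumptions, not the network's actual layer: it uses channel-first nested lists (the text writes shapes as $H\times W\times C$), the channel ordering follows the common deep-learning convention, and the name `pixel_shuffle` and the factor `r` are chosen for this example. A layer with factor `r` trades `r*r` channels for an `r`-times larger spatial grid, so several such layers can be stacked to reach the overall factor $s$.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, h, w) nested list into (C, h*r, w*r).

    Output pixel (c, y*r + i, z*r + j) is taken from input channel
    c*r*r + i*r + j at position (y, z).
    """
    channels_in = len(x)
    h, w = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out
```

For instance, four $1\times 1$ channels with values 1 to 4 become a single $2\times 2$ channel `[[1, 2], [3, 4]]` when `r = 2`.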

III-C2 Spatial Index Generation

We apply the Möbius transformation on the spatial index map $X$ of HR feature maps $F_{\text{UP}}$ and generate the transformed spatial index map $Y$ for the subsequent resampling operation. Möbius transformation is known as the only conformal bijective transformation between the sphere and the complex plane. To apply the Möbius transformation on the HR feature maps $F_{\text{UP}}$, we first use spherical projection (SP) to project the spatial index map $X$ from spherical coordinates $(\theta,\phi)$ (where $\theta$ represents the longitude and $\phi$ represents the latitude) to the Riemann sphere $\mathbb{S}^{2}=\{(x,y,z)\in\mathbb{R}^{3}\mid x^{2}+y^{2}+z^{2}=1\}$, formulated as:

$$\text{SP}:\begin{pmatrix}x\\ y\\ z\end{pmatrix}=\begin{pmatrix}\cos(\phi)\cos(\theta)\\ \cos(\phi)\sin(\theta)\\ \sin(\phi)\end{pmatrix}. \tag{4}$$

Then, with stereographic projection (STP) [38], we can project a point $(x,y,z)$ of the Riemann sphere $\mathbb{S}^{2}$ onto the complex plane and obtain the projected point $(x^{\prime},y^{\prime})$. Taking the point $(0,0,1)$ as the pole, STP can be formulated as:

$$\text{STP}:\ x^{\prime}=\frac{x}{1-z},\quad y^{\prime}=\frac{y}{1-z}. \tag{5}$$

Subsequently, given the projected point $p$ ($Z_{p}=x^{\prime}+iy^{\prime}$) on the complex plane, we can conduct the Möbius transformation with the following formulation:

$$f(Z_{p})=\frac{aZ_{p}+b}{cZ_{p}+d}, \tag{6}$$

where $a$ , $b$ , $c$ , and $d$ are complex numbers satisfying $ad-bc\neq 0$ . Finally, we apply the inverse stereographic projection $\text{STP}^{-1}$ and inverse spherical projection $\text{SP}^{-1}$ to re-project the complex plane into the ERP plane:

$$\begin{aligned}\text{STP}^{-1}:\begin{pmatrix}x\\ y\\ z\end{pmatrix}&=\begin{pmatrix}\frac{2x^{\prime}}{1+x^{\prime 2}+y^{\prime 2}}\\ \frac{2y^{\prime}}{1+x^{\prime 2}+y^{\prime 2}}\\ \frac{-1+x^{\prime 2}+y^{\prime 2}}{1+x^{\prime 2}+y^{\prime 2}}\end{pmatrix};\\ \text{SP}^{-1}:\begin{pmatrix}\theta\\ \phi\end{pmatrix}&=\begin{pmatrix}\arctan(y/x)\\ \arcsin(z)\end{pmatrix}.\end{aligned} \tag{7}$$

In summary, as shown in Fig. 5, we first project the index map $X$ of input feature $F_{\text{UP}}$ to the complex plane using SP (Eq. 4) and STP (Eq. 5), and then conduct the Möbius transformation with Eq. 6, and generate the transformed index map $Y$ through the inverse STP (Eq. 7) and inverse SP (Eq. 7).
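The chain SP, STP, Möbius transformation, inverse STP, and inverse SP can be sketched per index as below. This is a minimal sketch under stated assumptions rather than the module's actual code: it processes one $(\theta,\phi)$ pair at a time instead of a whole index map, the function names are invented for this example, and it uses `atan2(y, x)` in place of $\arctan(y/x)$ so that the quadrant of the longitude is resolved. The pole itself ($z=1$) and points mapped to infinity ($cZ_{p}+d=0$) are not handled.

```python
import math


def sp(theta, phi):
    """Eq. 4: longitude/latitude to a point on the unit sphere."""
    return (math.cos(phi) * math.cos(theta),
            math.cos(phi) * math.sin(theta),
            math.sin(phi))


def stp(x, y, z):
    """Eq. 5: stereographic projection with pole (0, 0, 1); z == 1 is undefined."""
    return complex(x / (1 - z), y / (1 - z))


def mobius(zp, a, b, c, d):
    """Eq. 6: Moebius transformation on the complex plane (requires ad - bc != 0)."""
    return (a * zp + b) / (c * zp + d)


def stp_inv(w):
    """Eq. 7 (first part): complex plane back to the unit sphere."""
    n = w.real ** 2 + w.imag ** 2
    return (2 * w.real / (1 + n), 2 * w.imag / (1 + n), (n - 1) / (1 + n))


def sp_inv(x, y, z):
    """Eq. 7 (second part), with atan2 used to resolve the quadrant."""
    return (math.atan2(y, x), math.asin(max(-1.0, min(1.0, z))))


def transform_index(theta, phi, a, b, c, d):
    """Map one entry of the index map X to the corresponding entry of Y."""
    return sp_inv(*stp_inv(mobius(stp(*sp(theta, phi)), a, b, c, d)))
```

As a sanity check on the conventions, the horizontal rotation of Eq. 1 shifts only the longitude, and a zoom level of 2 (Eq. 3) moves a point on the equator to latitude $\arcsin(0.6)$, i.e., toward the pole.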

III-C3 Spherical Resampling

Inspired by the inherent spherical representation of ODIs and the spherical conformality of Möbius transformation, we propose the spherical resampling module to generate the transformed feature maps $F_{\text{M}}$. The spherical resampling module resamples on the curved sphere based on the spherical geodesic of two points on the sphere. Given a query pixel $q$ with the spatial index $(\theta_{q},\phi_{q})$ from the index map $Y$, we choose its four corner pixels $\{p_{i},i=0,1,2,3\}$ as the neighboring pixels, which are located on the feature maps $F_{\text{UP}}$ (as shown in the left of Fig. 6). The indices of the neighboring pixels satisfy the following conditions: $\theta_{0}=\theta_{3}$, $\theta_{1}=\theta_{2}$, $\phi_{0}=\phi_{1}$, and $\phi_{2}=\phi_{3}$. To obtain the feature value of the query pixel $q$, we employ the spherical linear interpolation (Slerp) [39], which is a constant-speed motion along the spherical geodesic of two points on the sphere, formulated as follows:

$$\text{Slerp}(a,b)=\frac{\sin((1-t)\beta)}{\sin\beta}a+\frac{\sin(t\beta)}{\sin\beta}b, \tag{8}$$

where $\beta$ is the angle subtended by $a$ and $b$, and $t$ is the resampling weight. Note that $t$ is easy to determine if $a$ and $b$ are located on the same longitude. Therefore, we calculate the feature value of pixel $q$ in two steps. Firstly, we resample $p_{0},p_{1}$ and $p_{2},p_{3}$ to $p_{01}$ and $p_{23}$, respectively, as shown in the right of Fig. 6. Taking the resampling of $p_{01}$ as an example, the formulation can be described as:

$$F(p_{01})=\frac{\sin((1-t_{01})\alpha_{01})}{\sin\alpha_{01}}F(p_{0})+\frac{\sin(t_{01}\alpha_{01})}{\sin\alpha_{01}}F(p_{1}), \tag{9}$$

where $\alpha_{01}$ is the angle subtended by $p_{0}$ and $p_{1}$, and the weight $t_{01}$ is decided by the location of $p_{01}$ on the curve $\overset{\LARGE{\frown}}{p_{0}p_{1}}$. Notably, $t_{01}$ should ensure that $p_{01}$ has the same longitude as the query pixel $q$. Similarly, $\alpha_{23}$ is the angle subtended by $p_{2}$ and $p_{3}$, and the weight $t_{23}$ is chosen so that $p_{23}$ also has the same longitude as the query pixel $q$. After that, we follow the Slerp (Eq. 8) to calculate the feature value $F(q)$ as follows:

$$F(q)=\frac{\sin((1-t_{q})\Omega)}{\sin\Omega}F(p_{01})+\frac{\sin(t_{q}\Omega)}{\sin\Omega}F(p_{23}), \tag{10}$$

where $\Omega$ is the angle subtended by $p_{01}$ and $p_{23}$ , and $t_{q}$ is decided by the location of $q$ on the curve $\overset{\LARGE{\frown}}{p_{01}p_{23}}$ . If we assume that $p_{0}$ , $p_{1}$ , and $p_{01}$ have the same latitude, the calculation of $t_{01}$ can be simplified into $\frac{\theta_{q}-\theta_{0}}{\theta_{1}-\theta_{0}}$ . Similarly, $t_{23}$ can be simplified into $\frac{\theta_{q}-\theta_{2}}{\theta_{3}-\theta_{2}}$ . Also, $t_{q}$ can be simplified into $\frac{\phi_{q}-\phi_{01}}{\phi_{23}-\phi_{01}}$ .
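The two-step resampling of Eqs. 8 to 10 with the simplified weights can be sketched for a single query pixel as follows. This is an illustrative sketch, not the module itself: features are scalars rather than $C$-channel vectors, the function names are invented here, and the simplifying assumption from the text is used, so $p_{01}$ and $p_{23}$ are taken to lie at latitudes $\phi_{0}$ and $\phi_{2}$ and at the query longitude. Under that assumption $\Omega$ reduces to $|\phi_{2}-\phi_{0}|$.

```python
import math


def unit_vector(theta, phi):
    """Longitude/latitude to a point on the unit sphere (same convention as Eq. 4)."""
    return (math.cos(phi) * math.cos(theta),
            math.cos(phi) * math.sin(theta),
            math.sin(phi))


def subtended_angle(p, q):
    """Angle between two unit vectors, clamped for numerical safety."""
    dot = sum(u * v for u, v in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))


def slerp_weights(t, angle):
    """Weights of Eq. 8; falls back to linear weights for a vanishing angle."""
    if angle < 1e-12:
        return 1.0 - t, t
    s = math.sin(angle)
    return math.sin((1.0 - t) * angle) / s, math.sin(t * angle) / s


def spherical_resample(q, corners, feats):
    """Resample a scalar feature at q = (theta_q, phi_q).

    corners: [(theta_i, phi_i)] for p0..p3 with theta0 == theta3,
    theta1 == theta2, phi0 == phi1, phi2 == phi3; feats: F(p0)..F(p3).
    """
    theta_q, phi_q = q
    (th0, ph0), (th1, _), (th2, ph2), (th3, _) = corners
    # Step 1: resample along the two latitude arcs (Eq. 9 and its analogue).
    t01 = (theta_q - th0) / (th1 - th0)
    t23 = (theta_q - th2) / (th3 - th2)
    a01 = subtended_angle(unit_vector(*corners[0]), unit_vector(*corners[1]))
    a23 = subtended_angle(unit_vector(*corners[2]), unit_vector(*corners[3]))
    w0, w1 = slerp_weights(t01, a01)
    w2, w3 = slerp_weights(t23, a23)
    f01 = w0 * feats[0] + w1 * feats[1]
    f23 = w2 * feats[2] + w3 * feats[3]
    # Step 2: resample along the shared longitude (Eq. 10).
    omega = abs(ph2 - ph0)
    tq = (phi_q - ph0) / (ph2 - ph0)
    wa, wb = slerp_weights(tq, omega)
    return wa * f01 + wb * f23
```

Querying exactly at a corner returns that corner's feature. Note that, unlike bilinear weights, Slerp weights do not sum exactly to one, so a value sampled between the corners can slightly exceed a plain bilinear average.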

III-D View Transformation in VR

As shown in Fig. 7, the transformed ODI is displayed by our proposed system. As OmniZoomer [19] only supports generating ODIs in ERP format, its output cannot meet the demands of viewing ODIs in VR with different FoVs. In contrast, the proposed algorithm in our OmniVR allows for various projection formats by adding a projection transformation layer. In this case, the transformed ODI can be displayed in VR in the perspective format that fits the FoV of the user with less distortion. For detailed projection transformations, we refer readers to the recent survey [40]. The transformed ODI produced by our algorithm contains finer details, which could assist the user in recognizing and understanding the scenario, thus improving the immersive and interactive experience in VR. As shown in Fig. 7, our system allows displaying the generated HR ODIs with perspective views in the VR media under various user commands, including zooming in/out and moving in different directions, e.g., left, right, and up.
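As an illustration of what a projection from ERP to a perspective view involves, the sketch below maps a pixel of a perspective image to the ERP location it should sample from. It is not the paper's projection transformation layer, whose details are not given here. It assumes a standard pinhole (gnomonic) model, a camera looking along the $+x$ axis of Eq. 4, normalized image coordinates `u`, `v` in $[-1, 1]$, and an ERP layout with row 0 at the North pole and longitude $-\pi$ at column 0. All names are chosen for this example.

```python
import math


def perspective_ray_to_lonlat(u, v, fov_deg):
    """Longitude/latitude seen by a perspective pixel.

    u (rightward) and v (upward) are normalized to [-1, 1]; fov_deg is the
    field of view spanned by that range. ASSUMES the camera looks along +x.
    """
    half = math.tan(math.radians(fov_deg) / 2)
    x, y, z = 1.0, u * half, v * half
    norm = math.sqrt(x * x + y * y + z * z)
    theta = math.atan2(y, x)
    phi = math.asin(z / norm)
    return theta, phi


def lonlat_to_erp_pixel(theta, phi, height, width):
    """Fractional (row, col) in an ERP image under the ASSUMED layout."""
    col = (theta + math.pi) / (2 * math.pi) * width
    row = (math.pi / 2 - phi) / math.pi * height
    return row, col
```

For a $1024\times 2048$ ERP image, the center of the perspective view samples from row 512 and column 1024. Other view directions could be handled by rotating the ray before the conversion, or by rotating the ODI itself with the Möbius parameters of Eqs. 1 and 2.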

IV Experiment for Algorithm

IV-A Dataset and Implementation Details

Datasets. No datasets exist for ODIs under Möbius transformations, and collecting real-world ODI pairs with corresponding Möbius transformation matrices is difficult. Thus, we propose the ODI-Möbius (ODIM) dataset to train our OmniVR and the compared methods in a supervised manner. Our dataset is based on the ODI-SR dataset [41] with 1191 images in the train set, 100 images in the validation set, and 100 images in the test set. Note that the proposed dataset is consistent with the conference version [19]. During training, as our system aims to support free navigation and zoom in the VR media, we generate user commands that include horizontal rotation, vertical rotation, and zoom-in/out operations. The user commands are converted to the parameters $\{a,b,c,d\}$ of the Möbius transformation matrix according to Eq. 6. During validation and testing, we assign a fixed user command and the corresponding Möbius transformation matrix to each ODI. Besides, we test on the SUN360 [42] dataset with 100 images.

Scale	$\times 8$				$\times 16$
Dataset	ODI-SR		SUN360		ODI-SR		SUN360
Method	WS-PSNR	WS-SSIM	WS-PSNR	WS-SSIM	WS-PSNR	WS-SSIM	WS-PSNR	WS-SSIM
Bicubic	26.77	0.7725	25.87	0.7103	24.79	0.7404	23.87	0.6802
RCAN ${\rm(+Transform)}$ [43]	27.46	0.7906	27.04	0.7443	25.45	0.7541	24.70	0.7001
LAU-Net ${\rm(+Transform)}$ [20]	27.25	0.7813	26.77	0.7363	25.23	0.7455	24.49	0.6921
OmniZoomer-RCAN	27.53	0.7970	27.34	0.7592	25.50	0.7584	24.84	0.7034
OmniVR-RCAN	27.62	0.8005	27.50	0.7662	25.52	0.7629	24.89	0.7094

TABLE I: Quantitative comparison of Möbius transformation results on ODIs. ${\rm(+Transform)}$ denotes that we first employ a scale-specific SR model for image SR and then conduct image-level Möbius transformation on the SR image. We report on the ODI-SR dataset and the SUN360 dataset with up-sampling factors $\times 8$ and $\times 16$. Bold indicates the best results.

Scale	$\times 8$				$\times 16$
Dataset	ODI-SR		SUN360		ODI-SR		SUN360
Method	WS-PSNR	WS-SSIM	WS-PSNR	WS-SSIM	WS-PSNR	WS-SSIM	WS-PSNR	WS-SSIM
Bicubic	19.64	0.5908	19.72	0.5403	17.12	0.4332	17.56	0.4638
EDSR [36]	23.97	0.6483	23.79	0.6472	22.24	0.6090	21.83	0.5974
RCAN [43]	24.26	0.6554	23.88	0.6542	22.49	0.6176	21.86	0.5938
360-SS [44]	24.14	0.6539	24.19	0.6536	22.35	0.6102	22.10	0.5947
SphereSR [12]	24.37	0.6777	24.17	0.6820	22.51	0.6370	21.95	0.6342
LAU-Net [20]	24.36	0.6602	24.24	0.6708	22.52	0.6284	22.05	0.6058
LAU-Net+ [41]	24.63	0.6815	24.37	0.6710	22.97	0.6316	22.22	0.6111
OmniVR-RCAN	24.61	0.6822	24.53	0.7152	22.68	0.6324	22.14	0.6483

TABLE II: Quantitative comparison on the ODI SR task. The numbers are excerpted from [41] except for [12], because its reported results are obtained by utilizing 800 training images in the ODI-SR dataset. We report $\times 8$ and $\times 16$ SR results on the ODI-SR and SUN360 datasets. Bold indicates the best.

Implementation details. We mainly evaluate the ERP format. The resolution of the HR ERP images is $1024\times 2048$, and the up-sampling factors we choose are $\times 8$ and $\times 16$. We use the L1 loss, optimized by the Adam optimizer [45] with an initial learning rate of 1e-4. The batch size is 1 when using RCAN [43] as the backbone. In particular, considering the spherical nature of ODIs, we use the WS-PSNR [46] and WS-SSIM [47] metrics for evaluation.
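For readers unfamiliar with WS-PSNR, the following sketch shows the usual form of the metric for a single-channel ERP image: the squared error of each pixel is weighted by the cosine of its latitude, so over-sampled polar rows contribute less. This is written from the common definition of the metric and is not the evaluation code used for the tables; the function name, the nested-list image format, and the default peak value of 255 are assumptions of this example.

```python
import math


def ws_psnr(ref, test, max_value=255.0):
    """Weighted-to-spherically-uniform PSNR for single-channel ERP images.

    ref, test: nested lists of shape (height, width), rows indexed by latitude.
    """
    height, width = len(ref), len(ref[0])
    weighted_error = 0.0
    weight_sum = 0.0
    for i in range(height):
        # Row weight: cosine of the latitude at the row center.
        w = math.cos((i + 0.5 - height / 2) * math.pi / height)
        for j in range(width):
            weighted_error += w * (ref[i][j] - test[i][j]) ** 2
            weight_sum += w
    ws_mse = weighted_error / weight_sum
    if ws_mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / ws_mse)
```

With a uniform error the weights cancel and the value equals the ordinary PSNR. An error confined to a row near the pole yields a higher WS-PSNR than the same error in a row near the equator.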

IV-B Quantitative and Qualitative Evaluation

Navigate and Zoom in: Except for OmniZoomer [19], there is no prior art that can be directly compared. For a fair comparison, we combine the existing image SR models [43, 20] with image-level Möbius transformations. The SR models designed for 2D planar images are retrained based on their official settings.

In Tab. I, by applying RCAN as the backbone, OmniVR outperforms current methods in all metrics, all up-sampling factors, and test sets. It reveals the effectiveness of our OmniVR incorporating Möbius transformation into the neural network. Note that LAU-Net shows inferior performance because it is limited to vertically-captured ODIs. Importantly, our OmniVR outperforms OmniZoomer in all metrics, demonstrating the importance of skip connection for stable convergence during training. As shown in the first row of Fig. 8 on the ODI-SR dataset, OmniVR predicts clearer picture frames. Similarly, in the second row of Fig. 8 on the SUN360 dataset, OmniVR reconstructs clearer structures of the chairs than other methods.

Direct SR: Our OmniVR can perform the plain SR task by setting the Möbius transformation matrix to the identity matrix. Tab. II shows that OmniVR with RCAN as the backbone achieves the best result on 5 of the 8 metrics. It demonstrates the strong capability of our method to handle the inherent distortions.

V User Study

We conduct a within-subject user study in VR to explore the effectiveness of OmniVR and the overall user experience of our proposed system. Note that the user study involves no more than minimal risk, and the IRB board has granted a waiver for the review process. To establish a comparative baseline for assessing the image enhancement performance of OmniVR, we include a condition that only enables Möbius transformation with Bicubic interpolation in Fig. 9.

V-A Experiment Set-Up & Participants

Participants. We recruited 18 participants (P1-P18) through the university mailing list, including 8 males and 10 females. 56% of them are between 18 and 24 years old, and 44% of them are between 25 and 34 years old. 9 participants have viewed ODIs on VR devices and mobile phones, and 6 participants have viewed ODIs on computers. Furthermore, their familiarity with 3D games or 3D models is moderate (on a scale where 1 represents very low and 7 represents very high, the average is 3.94). Each participant received 3 dollars as compensation.

Apparatus & Data. We employ a Meta Quest 2 for the experiment, as shown in Fig. 10(a). We select five ODIs as five scenarios in the VR experience. The ODIs are from the training and testing sets of the Flickr360 dataset [48]. The selected ODIs have texts or textures in the equator regions, which look distinct at different zoom levels and spatial resolutions. The ODIs vary from indoor scenarios to outdoor scenarios. The transitions among different scenarios and zooming in/out are achieved using the CenarioVR software (https://www.cenariovr.com/) and controlled through specific arrow keys. During the VR experience, the view direction is controlled through head movements, while the scenario transition is controlled with the trigger of the right VR controller.

Experimental Conditions. As shown in Fig. 9, we set a baseline condition for comparison. The baseline condition applies the Möbius transformation so that users can zoom in/out to find details at various zoom levels. The difference is that the Möbius transformation in the baseline condition is conducted with Bicubic interpolation, while in OmniVR it is achieved by deep learning. Moreover, OmniVR additionally contains image enhancement to recover more details.

V-B Design & Procedure

The experiment was within-subjects: 2 techniques $\times$ 5 scenarios $\times$ 3 measures (i.e., accuracy, confidence, and response time) = 30 responses per participant. We fixed the order of the two conditions, with the baseline first, because ODIs obtained from OmniVR exhibit higher quality than the baseline condition (Sec. IV). Viewing the ODIs generated with OmniVR first would leak information to the subsequent baseline condition and thus bias the comparison of task performance. To further avoid the influence of this memorization issue when comparing two conditions, we design two questions with similar difficulty in each scenario. These questions are about text recognition in four scenarios, and number counting of a specific object in one scenario. Participants respond to one question in the baseline condition and to the other question in the OmniVR condition. The order of the two questions is random, while the number of participants receiving each order is balanced. We also fixed the order of scenarios because there is no relationship between them. The experiment lasted for about 20 minutes on average.

This study contains two parts. Before starting the study, each participant was given a short introduction to the study and our system. The VR headset and controllers were adjusted for each participant to ensure that the testing text was clear. The first part focused on assessing the image enhancement performance of our proposed technique. The participants first viewed the five scenarios generated with the baseline condition and then viewed them generated with OmniVR, as shown in Fig. 10(b). This order would help participants forget specific impressions for a fair comparison. For each technique $\times$ scenario, participants were asked to answer the question as quickly as possible. The second part focused on the overall user experience of the whole system. We asked participants to fill in a 7-point Likert questionnaire to measure their cybersickness, mental and physical costs, immersive experience, and usability of zoom-in techniques for the two techniques, as shown in Fig. 10(c). For a fair comparison, the participants were only informed that the first and second passes through the five scenarios were generated with "Technique 1" and "Technique 2", respectively. We also conducted a post-study interview to collect user feedback about the suggestions and expectations for our technique and system.

V-C Measures

In the first part, we set three metrics for evaluation: accuracy, response time, and confidence level. Ideally, higher accuracy, shorter response time, and higher confidence levels demonstrate higher image quality. We set questions about text recognition and number counting for a specific object. For the text recognition question, the accuracy is the ratio of correct words. For the number counting question, the accuracy is calculated as $e^{-|N_{1}-N_{2}|}$, where $N_{1}$ and $N_{2}$ are the responded and correct numbers, respectively. The confidence level is measured on a 7-point Likert scale. In addition, we allow participants to respond “Invisible” if the scenario is too blurry to identify. In this case, the accuracy and confidence level are set to zero. In the second part, we collected their 7-point Likert scale ratings about cybersickness, mental and physical costs, immersive experience, and usability (1 – very low, 7 – very high).
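The two accuracy measures can be written down in a few lines. The sketch below is illustrative only: the counting formula $e^{-|N_{1}-N_{2}|}$ and the zero score for “Invisible” come from the text, whereas the word-matching rule for text recognition (position-wise, case-insensitive, divided by the number of correct words) is an assumption, since the text only states "the ratio of correct words". The function names are invented for this example.

```python
import math

INVISIBLE = "Invisible"


def text_accuracy(response, answer):
    """Ratio of correct words.

    ASSUMPTION: words are compared position by position, ignoring case,
    and the ratio is taken over the words of the correct answer.
    """
    if response == INVISIBLE:
        return 0.0
    truth = answer.lower().split()
    given = response.lower().split()
    correct = sum(1 for g, t in zip(given, truth) if g == t)
    return correct / len(truth)


def counting_accuracy(response, correct_count):
    """exp(-|N1 - N2|), with an "Invisible" response scored as zero."""
    if response == INVISIBLE:
        return 0.0
    return math.exp(-abs(int(response) - correct_count))
```

Under the counting measure, an exact answer scores 1, an answer that is off by one scores $e^{-1}\approx 0.37$, and the score decays quickly for larger errors.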

V-D Results & Findings

V-D1 Quantitative Evaluation

We use a t-test to analyze the participants’ VR experience in the five scenarios, because our goal is to compare the baseline condition with OmniVR.

In Fig. 11, we report the comparison results between the baseline condition and OmniVR across five scenarios and three metrics. For accuracy in Fig. 11(a), OmniVR consistently surpasses the baseline condition in all five scenarios. In particular, in scenario 1, OmniVR outperforms the baseline condition by 0.48. This is mainly because in scenario 1 the baseline condition recovers few local details, and thus 5 participants respond “Invisible”. A similar situation occurs in scenario 4, where 8 participants respond “Invisible” under the baseline condition. For response time in Fig. 11(b), OmniVR takes less time than the baseline condition in most of the scenarios and a comparable time in scenario 4. For confidence level in Fig. 11(c), OmniVR obtains higher confidence levels in all scenarios. This suggests that OmniVR helps participants identify the local details easily through the zoom-in function and image enhancement. As a result, the ambiguous cases that lower the confidence level are greatly reduced.

We further use a t-test to examine the difference between the baseline condition and OmniVR. In Fig. 11, the baseline condition and OmniVR show significant differences in most scenarios and metrics. For scenario 3, the accuracy and response time of the two conditions show no significant difference. We attribute this to the simple and easily identifiable words in scenario 3. For scenario 4, the response time of the two conditions shows no significant difference. In Fig. 11(b), we can see that OmniVR takes slightly more time than the baseline condition in scenario 4. We think this is related to the task design in scenario 4. The task in scenario 4 is a high-level one that involves counting the instances of a specific object. As a result, if the scenario is difficult to identify, the participants tend to give very fast “Invisible” responses. For example, P16 took 2.5 seconds to give the “Invisible” response with the baseline condition but 16.0 seconds to finish the counting task with OmniVR.
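To make the comparison concrete, the following is a minimal sketch of a t statistic for one scenario and one metric. The text does not state which variant of the t-test was used; since every participant experienced both conditions, the sketch assumes a paired (dependent-samples) test. The function name and the response times below are hypothetical and are not data from the study.

```python
import math
import statistics


def paired_t_statistic(baseline, omnivr):
    """Return (t, degrees of freedom) for a paired t-test.

    Each position holds one participant's value under both conditions.
    t = mean(d) / (stdev(d) / sqrt(n)), where d are per-participant differences.
    """
    if len(baseline) != len(omnivr):
        raise ValueError("both conditions need one value per participant")
    diffs = [o - b for b, o in zip(baseline, omnivr)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two participants are required")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical response times in seconds for five participants.
baseline_times = [14.2, 11.8, 16.5, 12.9, 15.1]
omnivr_times = [9.7, 10.4, 11.2, 10.8, 10.9]
t, dof = paired_t_statistic(baseline_times, omnivr_times)
print(f"t = {t:.3f}, degrees of freedom = {dof}")
```

The p-value then follows from the Student t distribution with the returned degrees of freedom, which a statistics library can evaluate.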

V-D2 Qualitative Evaluation

We present participants’ feedback on the reasons for their subjective ratings of the VR experience under the two experimental conditions (Fig. 13), as well as their expectations and suggestions for further improvement.

Usability. Generally, participants rated the usability of the zoom-in technique higher in OmniVR than in the baseline condition, with a significant difference ( $p<.001$ ). Most participants (N=7) thought that the zoom-in technique is “less effective in the baseline condition, but very helpful in OmniVR”, while only one participant (P15) thought that the zoom-in technique is “useful in both groups”. In contrast, one participant (P18) complained about “the more blurry effects” of the zoomed-in regions compared with the original regions. Interestingly, P13 presented a different view on the usability of the zoom-in technique: “The second group of scenarios is clear enough, and I have no need to zoom in; but for the first group of scenarios, I still need to zoom in to find more details.”

Immersion. Overall, participants are more satisfied with the immersive experience under OmniVR than under the baseline condition, with a significant difference ( $p<.001$ ). The main reason given for the improved immersive experience is the high-quality details recovered by OmniVR. For example, P17 said that “The scenarios are natural and I feel interesting when watching these scenarios.” However, a few participants (N=3) also noted that some cases still break immersion. For example, “the spherical distortion” (P2) makes the scene “unrealistic” (P3), and “the seriously blurry effect exists out of the focused regions” (P6). These effects arise from the inherent properties of ODIs and the Möbius transformation. In addition, one participant (P8) complained about the scenario selection and experienced less immersion in the first scenario because it is “about a clock and looks like a planar scenario.”

Cybersickness. Participants reported less cybersickness with OmniVR. Although the difference between the two conditions is not significant, some participants (N=3) indicated that OmniVR could “recover more local details” and thus alleviate the “blurry effect”, which they regarded as the main cause of cybersickness. Some other participants explained that they did not feel an obvious difference in cybersickness because “the duration of this VR experience was not very long”.

Workload. In line with the lower cybersickness, participants reported lower mental and physical loads with OmniVR. In particular, OmniVR shows a significant difference ( $p<.05$ ) compared with the baseline condition on mental cost. Some participants (N=4) gave the main reason as “the blurry effect would increase the workload simultaneously”. Specifically, two participants noted that their mental costs would increase correspondingly “if the scenario is hard to identify”.

VI Discussion

About the zoom-in function. In Sec. V-D2, we have demonstrated the effectiveness of the zoom-in function, especially in OmniVR. However, four participants (P7, P8, P12, P16) reported that the zoom-in function is limited to fixed zoom levels. Specifically, although the zoomed-in regions could provide more local details, the objects of interest often extend beyond the FoV and are not centered in it. In this case, the participants have to adjust their viewing directions continuously to find the optimal direction. This results in a poor immersive experience and increases both the mental and physical burdens. To address this, we plan to learn how to select an optimal transformation given only the objects of interest. This might involve scene understanding techniques, such as object detection on ODIs.

About the accuracy. In four scenarios (1, 2, 3, 5), the questions are about text recognition. There are two issues with these questions. Firstly, P12 said that “In scenario 3, some words on the bridge are written with scrawl, making it ambiguous for recognition”. Secondly, P16 said that “Some words might be guessed by associating them with prior knowledge”. That is, although some words are difficult to identify due to the blurry effect, participants might still answer them correctly if they understand the meaning of the sentence.

Failure cases. There is an interesting phenomenon in the second question of scenario 4. The question asks how many pipes the three standing people hold in their hands. As shown at the top of Fig. 12, only one person (middle) holds a pipe. In total, only one participant gives the correct response using OmniVR, while three participants give the correct response using the baseline condition. On further analysis, we find that two railings run alongside the standing people. OmniVR recovers clearer railings but fails to recover the detail of the pipe (see Fig. 12, middle). The railings might therefore be mistaken for pipes. As a result, most participants said that “Two pipes are in the hands of standing people”. In contrast, the baseline condition cannot recover the detail of the pipe, and only one railing might be seen indistinctly. As a result, three participants said that “One pipe is in the hands of standing people”. In the other question, about the pipes held by the lying people (see Fig. 12, bottom), OmniVR recovers the details of the pipes clearly, and the number of correct responses with OmniVR increases markedly.

Limitation. Our system can recover HR and high-quality details under various user commands. However, the user commands are entirely determined by the user’s operation, as discussed in the zoom-in paragraph of Sec. VI. In addition, the streaming speed of our algorithm is also a limitation. At the current stage, we can only generate the transformed ODI in advance on GPUs according to the user commands and then display the transformed ODI in VR.

VII Conclusion

In this paper, we have presented a novel OmniVR system that enables viewers to navigate and zoom in/out effortlessly in VR media, and have developed a learning-based algorithm to refine the visual fidelity. A comprehensive user study demonstrated the following benefits: 1) our system improved scenario recognition for viewers by recovering the details of objects of interest; 2) our system reduced discomfort and helped viewers gain confidence during VR navigation; 3) our system was user-friendly under various user commands, i.e., navigation and zoom in/out. Our study revealed the importance of visual quality under various user commands in VR navigation, especially when the objects of interest were too small and required zooming in. We release the project code of our OmniVR system to inspire future studies in the community at http://vlislab22.github.io/OmniVR/.

References

[1] A. Vermast and W. Hürst, “Introducing 3d thumbnails to access 360-degree videos in virtual reality,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2547–2556, 2023.
[2] Z. Luo, B. Chai, Z. Wang, M. Hu, and D. Wu, “Masked360: Enabling robust 360-degree video streaming with ultra low bandwidth consumption,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2690–2699, 2023.
[3] M. Dasari, E. Lu, M. W. Farb, N. Pereira, I. Liang, and A. Rowe, “Scaling vr video conferencing,” in 2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR). IEEE, 2023, pp. 648–657.
[4] Q. Zhang, J. Wei, S. Wang, S. Ma, and W. Gao, “Realvr: Efficient, economical, and quality-of-experience-driven vr video system based on mpeg omaf,” IEEE Transactions on Multimedia, 2022.
[5] J. Lou, Y. Wang, C. Nduka, M. Hamedi, I. Mavridou, F.-Y. Wang, and H. Yu, “Realistic facial expression reconstruction for vr hmd users,” IEEE Transactions on Multimedia, vol. 22, no. 3, pp. 730–743, 2019.
[6] P. Szabo, A. Simiscuka, S. Masneri, M. Zorrilla, and G.-M. Muntean, “A cnn-based framework for enhancing 360 vr experiences with multisensorial effects,” IEEE Transactions on Multimedia, 2022.
[7] S. Verma, L. Warrier, B. Bolia, and S. Mehta, “Past, present, and future of virtual tourism-a literature review,” International Journal of Information Management Data Insights, vol. 2, no. 2, p. 100085, 2022.
[8] A. Mohammad and H. Ismail, “Development and evaluation of an interactive 360 virtual tour for tourist destinations,” J. Inform. Technol. Impact, vol. 9, pp. 137–182, 2009.
[9] A. Azmi, R. Ibrahim, M. Abdul Ghafar, and A. Rashidi, “Smarter real estate marketing using virtual reality to influence potential homebuyers’ emotions and purchase intention,” Smart and Sustainable Built Environment, vol. 11, no. 4, pp. 870–890, 2022.
[10] J. Singh, M. Malhotra, and N. Sharma, “Metaverse in education: An overview,” Applying metalytics to measure customer experience in the metaverse, pp. 135–142, 2022.
[11] H. Chang and M. F. Cohen, “Panning and zooming high-resolution panoramas in virtual reality devices,” in Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017, pp. 279–288.
[12] Y. Yoon, I. Chung, L. Wang, and K.-J. Yoon, “Spheresr: 360deg image super-resolution with arbitrary projection via continuous spherical image representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5677–5686.
[13] M. Kwon, R. Liu, and L. Chien, “Compensation for blur requires increase in field of view and viewing time,” PLoS One, vol. 11, no. 9, p. e0162711, 2016.
[14] L. O’Hare and P. B. Hibbard, “Visual discomfort and blur,” Journal of vision, vol. 13, no. 5, pp. 7–7, 2013.
[15] D. M. Hoffman, A. R. Girshick, K. Akeley, and M. S. Banks, “Vergence–accommodation conflicts hinder visual performance and cause visual fatigue,” Journal of vision, vol. 8, no. 3, pp. 33–33, 2008.
[16] S. Ang and J. Quarles, “Gingervr: An open source repository of cybersickness reduction techniques for unity,” in 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 2020, pp. 460–463.
[17] R. Hussain, M. Chessa, and F. Solari, “Mitigating cybersickness in virtual reality systems through foveated depth-of-field blur,” Sensors, vol. 21, no. 12, p. 4006, 2021.
[18] X. Meng, R. Du, M. Zwicker, and A. Varshney, “Kernel foveated rendering,” Proceedings of the ACM on Computer Graphics and Interactive Techniques, vol. 1, no. 1, pp. 1–20, 2018.
[19] Z. Cao, H. Ai, Y.-P. Cao, Y. Shan, X. Qie, and L. Wang, “Omnizoomer: Learning to move and zoom in on sphere at high-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 897–12 907.
[20] X. Deng, H. Wang, M. Xu, Y. Guo, Y. Song, and L. Yang, “Lau-net: Latitude adaptive upscaling network for omnidirectional image super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9189–9198.
[21] V. Fakour-Sevom, E. Guldogan, and J.-K. Kämäräinen, “360 panorama super-resolution using deep convolutional networks,” in Int. Conf. on Computer Vision Theory and Applications (VISAPP), vol. 1, 2018.
[22] A. Nishiyama, S. Ikehata, and K. Aizawa, “360 single image super resolution via distortion-aware network and distorted perspective images,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 1829–1833.
[23] F. Yu, X. Wang, M. Cao, G. Li, Y. Shan, and C. Dong, “Osrt: Omnidirectional image super-resolution with distortion-aware transformer,” arXiv preprint arXiv:2302.03453, 2023.
[24] X. Sun, W. Li, Z. Zhang, Q. Ma, X. Sheng, M. Cheng, H. Ma, S. Zhao, J. Zhang, J. Li et al., “Opdn: Omnidirectional position-aware deformable network for omnidirectional image super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1293–1301.
[25] L. Penaranda, L. Velho, and L. Sacht, “Real-time correction of panoramic images using hyperbolic möbius transformations,” Journal of Real-Time Image Processing, vol. 15, pp. 725–738, 2018.
[26] L. S. Ferreira, L. Sacht, and L. Velho, “Local moebius transformations applied to omnidirectional images,” Computers & Graphics, vol. 68, pp. 77–83, 2017.
[27] L. S. Ferreira and L. Sacht, “Bounded biharmonic blending of möbius transformations for flexible omnidirectional image rectification,” Computers & Graphics, vol. 93, pp. 51–60, 2020.
[28] S. Schleimer and H. Segerman, “Squares that look round: transforming spherical images,” arXiv preprint arXiv:1605.01396, 2016.
[29] J. Wu, C. Xia, T. Yu, and J. Li, “View-aware salient object detection for 360 $\{$ $\backslash$ deg $\}$ omnidirectional image,” arXiv preprint arXiv:2209.13222, 2022.
[30] S. Zhou, J. Zhang, H. Jiang, T. Lundh, and A. Y. Ng, “Data augmentation with mobius transformations,” Machine Learning: Science and Technology, vol. 2, no. 2, p. 025016, 2021.
[31] D. P. Mandic and V. S. L. Goh, Complex valued nonlinear adaptive filters: noncircularity, widely linear and neural models. John Wiley & Sons, 2009.
[32] N. Özdemir, B. B. İskender, and N. Y. Özgür, “Complex valued neural network with möbius activation function,” Communications in Nonlinear Science and Numerical Simulation, vol. 16, no. 12, pp. 4698–4703, 2011.
[33] N. Azizi, H. Possegger, E. Rodolà, and H. Bischof, “3d human pose estimation using möbius graph convolutional networks,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I. Springer, 2022, pp. 160–178.
[34] T. W. Mitchel, N. Aigerman, V. G. Kim, and M. Kazhdan, “Möbius convolutions for spherical cnns,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–9.
[35] S. Kato and P. McCullagh, “Möbius transformation and a cauchy family on the sphere,” arXiv: Statistics Theory, 2015.
[36] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 136–144.
[37] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1874–1883, 2016.
[38] K. Eybpoosh, M. Rezghi, and A. Heydari, “Applying inverse stereographic projection to manifold learning and clustering,” Applied Intelligence, vol. 52, pp. 4443–4457, 2021.
[39] J. P. Fatelo and N. Martins-Ferreira, “Mobility spaces and geodesics for the n-sphere,” 2021.
[40] H. Ai, Z. Cao, J. Zhu, H. Bai, Y. Chen, and L. Wang, “Deep learning for omnidirectional vision: A survey and new perspectives,” arXiv preprint arXiv:2205.10468, 2022.
[41] X. Deng, H. Wang, M. Xu, L. Li, and Z. Wang, “Omnidirectional image super-resolution via latitude adaptive network,” IEEE Transactions on Multimedia, 2022.
[42] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing scene viewpoint using panoramic place representation,” computer vision and pattern recognition, 2012.
[43] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 286–301.
[44] C. Ozcinar, A. Rana, and A. Smolic, “Super-resolution of omnidirectional images using adversarial learning,” in 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[46] Y. Sun, A. Lu, and L. Yu, “Weighted-to-spherically-uniform quality evaluation for omnidirectional video,” IEEE signal processing letters, vol. 24, no. 9, pp. 1408–1412, 2017.
[47] Y. Zhou, M. Yu, H. Ma, H. Shao, and G. Jiang, “Weighted-to-spherically-uniform ssim objective quality evaluation for panoramic video,” in 2018 14th IEEE International Conference on Signal Processing (ICSP). IEEE, 2018, pp. 54–57.
[48] M. Cao, C. Mou, F. Yu, X. Wang, Y. Zheng, J. Zhang, C. Dong, G. Li, Y. Shan, R. Timofte et al., “Ntire 2023 challenge on 360deg omnidirectional image and video super-resolution: Datasets, methods and results,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1731–1745.