Automatic Camera Trajectory Control with Enhanced Immersion for Virtual Cinematography

Xinyi Wu, Haohong Wang, and Aggelos K. Katsaggelos

This work was supported by TCL Research America through the Interactive Hyperstory project. X. Wu and A. K. Katsaggelos are with the Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208 USA (e-mail: [email protected]; [email protected]). H. Wang is with TCL Research America, San Jose, CA 95110 USA ([email protected]). Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

User-generated cinematic creations are gaining popularity as daily entertainment, yet mastering cinematography to produce immersive content remains a challenge. Many existing automatic methods focus on roughly controlling predefined shot types or movement patterns, and thus struggle to engage viewers with the actor’s circumstances. Real-world cinematographic rules show that directors can create immersion by comprehensively synchronizing the camera with the actor. Inspired by this strategy, we propose a deep camera control framework that enables actor-camera synchronization in three aspects, considering frame aesthetics, spatial action, and emotional status in the 3D virtual stage. Following rule-of-thirds, our framework first modifies the initial camera placement to position the actor aesthetically. This adjustment is facilitated by a self-supervised adjustor that analyzes frame composition via camera projection. We then design a GAN model that can adversarially synthesize fine-grained camera movement based on the actor’s action and psychological state, using an encoder-decoder generator to map kinematics and emotional variables into camera trajectories. Moreover, we incorporate a regularizer to align the generated stylistic variances with specific emotional categories and intensities. The experimental results show that our proposed method yields immersive cinematic videos of high quality, both quantitatively and qualitatively. Live examples can be found in the supplementary video.

Index Terms:
Automatic cinematography, camera trajectory control, cinematographic immersion, actor-camera synchronization.

I Introduction

Figure 1: We propose a virtual camera controller that automates camera movements to produce footage with improved immersive experiences. This is achieved by conducting actor-camera synchronization in three key aspects: maintaining aesthetic rule-of-thirds composition, tracking the spatial action of the focused character, and stylizing the camera trajectory based on a specific emotion variable.

User-generated content (UGC) has become increasingly popular on social media platforms, with creators often adopting immersive techniques from film directors to enhance the quality and appeal of their cinematic outputs. Even with interesting scripts, creators may struggle to produce engaging works due to limited cinematographic knowledge and skills, which in turn makes it difficult to create satisfactory immersive experiences for viewers.

Cinematography is essential in developing high-quality cinematic creations. Given a certain scenario, directors follow cinematographic rules as guidance to optimize camera manipulation, which is a key component for effectively immersing the audience into the story. Such rules are defined by combining objective aesthetic principles with subjective viewer feedback [1]. Although comprehensive guidelines are available for controlling camera behavior, manual cinematography remains a challenging and labor-intensive way to obtain satisfactory outputs.

Auto-cinematography has emerged as a solution to reduce labor costs and professional barriers during production. Some approaches efficiently generate film sequences by editing existing cinematic clips from multiple sources [2, 3]. Lacking camera controllability, these editing-based methods create immersion only at a basic level, ensuring either spatio-temporal [4] or semantic consistency [5] for the story plot. Another feasible way is to automate camera manipulation by mimicking human directorial techniques [6, 7]. Traditional camera planning methods encode specific behavior rules for varied scenarios [8, 9], yet such encoded control patterns are difficult to maintain and extend.

As deep neural networks dominate generative tasks, they show promise for enabling flexible camera control with enhanced viewer immersion. By designing reward functions based on scene analysis, camera placements that maximize frame aesthetics can be decided via reinforcement learning [10, 11]. Nevertheless, these methods overlook the crucial role of actor behavior in shaping the overall immersive experience [1]. The methods in [12, 13] address this by predicting camera trajectories learned from human motion videos with an end-to-end deep model. While facilitating continuous camera tracking, such long-distance drone-based approaches struggle with refined control over detailed actor dynamics, both physical and psychological. To achieve fine-grained immersive shooting, [14, 15] extract high-level cinematic features from footage of masterpieces and transfer them into retargetable camera rails. However, their dependency on references constrains user customization and yields implicit immersion representations. Despite the popularity of AI-based cinematography, a controllable-camera method that can explicitly and comprehensively immerse the audience is still needed [16].

In real-world cinematography, directors have observed that immersion can be significantly enhanced by simulating the viewpoint of an invisible character within the scene [17]. Empirical studies confirm this effect is often achieved through shooting techniques—aligning camera movement with the focused actor [18]—across multiple perceptual-sensitive dimensions, including aesthetics [19], action [1] and emotion [20]. Therefore, to thoroughly create immersive feelings for viewers, these factors should be collectively considered.

In this paper, we propose a deep camera control framework that comprehensively enables cinematographic immersion by achieving actor-camera synchronization in three aspects within the 3D virtual environment. Specifically, we generate actor-driven camera trajectories to ensure aesthetic frame composition, refined spatial movement tracking, and stylistic variances in alignment with emotion variables. Our framework first leverages a self-supervised adjustment network to modify the initial camera placement following rule-of-thirds [21] for optimizing composition. Unlike previous methods that focus on scene images [22, 23], our adjustor analyzes camera projection to efficiently position the actor in a way that enhances compositional aesthetics. Subsequently, we design a GAN model to synthesize camera movements, adversarially mimicking ground-truth samples from human artists. Our generator employs an encoder-decoder architecture, with the encoder capturing the key kinematic features and their saliencies from disarticulated actor poses in the input. The decoder then transforms these features into camera trajectories, conditioned on an emotion variable that indicates the emotional category and intensity of the actor. Moreover, to better align the generated stylistic variances with the given emotion variable, we incorporate a regularizer that keeps the overall shape of our synthesized camera trajectory consistent with that of the manual ground truth. Both quantitative and qualitative evaluations demonstrate that our camera control framework can perform immersive shooting resembling that of professional artists and produce high-quality cinematic videos with improved viewer immersion.

The contributions of our work are summarized as follows:

  • We propose a deep camera control framework that learns actor-camera synchronization based on specific cinematographic techniques across three aspects: frame aesthetics, spatial action, and emotional status, to significantly enhance immersion in user-generated video creations.

  • We design a self-supervised adjustor to efficiently adjust aesthetic compositions via camera projection analysis. We build a GAN model for fine-grained camera trajectory generation using actor-to-camera behavior transformation. A shape-based regularizer is further integrated to control stylistic variances in our generated trajectories.

  • We conduct both quantitative and qualitative evaluations to show our superior capability of creating comprehensive immersive experiences that deeply engage the audience with the physical and psychological behaviors of the actor within aesthetically appealing frames.

II Related work

In this section, we review the progress made in auto-cinematography and explore advancements that analyze cinematographic-level factors for enhancing the immersive experience of viewers.

II-A Enabling automatic cinematography

The manual production of high-quality 2D cinematic creation is costly, requiring much labor and professional knowledge. To address this, methods such as multi-source clip editing, camera behavior planning, and text-to-video generation have been proposed to automatically obtain cinematic videos. Despite significant improvements in the capability of recent LLM-based text-to-video solutions [24], they still lack stability and detailed cinematographic controls during the generation. Thus, here we primarily discuss the former two auto-cinematography methods.

Multi-source clip editing: Editing serves as a feasible approach for producing a desired cinematic video by composing various short clips. To ensure the generation of required content, [5] annotated video clips with key content details, conducting retrieval through script keyword matching. [4] took one more step to improve spatio-temporal consistency during the retrieval using graph optimization. Moreover, [25] introduced attention maps of video frames and tracked human gaze behavior to guarantee the consistent appearance of salient objects across frames. For achieving semantic consistency, [2] compared script and video annotations via TF-IDF [26] scores, whereas [3] further performed text alignment by regularizing probability distribution between high-level semantic features. Instead of aligning text, [27] focused on maintaining frame-level semantic coherence within and between clips based on constraining extracted image-based embeddings. However, the effectiveness of these editing-based methods is heavily dependent on the diversity and quality of the available clip database, potentially limiting flexibility and posing difficulties for general user-generated application scenarios.

Camera behavior planning: Another trend of auto-cinematography imitates the workflow of film directors. These methods first design camera control patterns and then apply these behaviors in drones or virtual cameras to create final cinematic videos. In the early stage, research in this domain performed pre-defined camera movements like pan, tilt, and zoom automatically, using analytical methods based on the given script annotations [28, 9]. This ensures that the focused character always stays within the frame. Going a step further, optimization methods have emerged to manage more complex camera behaviors, adjusting either camera intrinsics, such as focal length [7], or extrinsics, from single placements [6] to dynamic rails [8]. Additionally, aesthetic principles regarding frame composition [29], actor viewpoint [30], and action continuity [31] are incorporated as constraints during the optimization process, enhancing the overall viewing experience.

Recently, neural networks have been effective in generation tasks. [10] utilized Reinforcement Learning (RL) to obtain a deep camera movement agent supervised by real-time human preference scores. [11] further designed reward functions that ensure aesthetics and fidelity, avoiding occlusion and poor shot angles following specific director rules. To offer high-quality tracking shots, [32] introduced a visual detection network for precise camera movement guidance, while [33] leveraged transformers to track based on forming the optimal placement and orientation of the actor. Moreover, [34] focused on improving camera control in scenes with interactive actions of the characters using a GAN model.

Given that cinematic videos are widely accessible online, [12] proposed a novel approach to replicate shooting patterns in reference videos, transferring human kinematics and optical flows into reusable camera movements. This method was extended by [13] and [35], who added a filming style extractor, using low-dimensional vector and RL-based path analysis, respectively, for stylistic alignment between the reference and generated footage. Yet, [12, 13, 35] handle primarily long-distance aerial shots and lack detailed camera controls to meet artist-level cinematic production. To alleviate this problem, [14] created a fine-grained cinematic feature space that learns shooting patterns from film masterpieces by analyzing the inter-relationships of character poses in the frame. Building on this, [15] introduced a keyframe control strategy to enforce stricter cinematic constraints. Similarly, [36] refined key cinematic features in Neural Radiance Fields (NeRF) with heatmap guidance, which contributes to improved views for the output cinematic videos. Despite these advances in auto-cinematography, few camera-based methods address the sense of immersion, a critical factor of cinematic video quality that significantly influences viewer experiences.

II-B Investigating cinematographic immersion

The popularity of short videos on social media drives creators to produce high-quality content that not only attracts the audience but also deeply immerses them in the story. Underscoring the importance of visual perception in the viewer experience, research reveals that the use of cinematographic techniques such as view range manipulation [37, 38], staging [6, 39] and lighting [40] can effectively convey the sense of immersion. Among these, camera control stands out as one of the most crucial and commonly used approaches for achieving cinematographic immersion.

In practice, directors discovered that camera behavior can benefit viewer engagement and foster resonance between actors and the audience [17], particularly through precisely synchronizing the camera with the actor [18]. Due to multiple factors affecting human vision, such camera-actor synchronization in manual cinematography usually involves spatial [1], emotional [41], and aesthetic aspects [19] for creating immersive experiences thoroughly.

Imitating human directors, auto-cinematographic methods utilized shooting techniques such as Orbiting [14, 15] and Tracking [12, 32] to synchronize with the actor’s physical movement. By doing so, the camera behaves in a way that the viewer immersively observes the scene, ensuring consistent presence of the actor regardless of the action. In terms of emotional resonance, Zooming [7] and Shaking [42] can handle the synchronization between the camera and the mental state of the actor by adjusting focal length and stability to reflect specific emotional categories and intensities, respectively. In addition, [43] learned a semantic space for emotion representation using a large crowd-sourced footage dataset. Such a semantic space is subsequently combined during camera behavior generation to guide the synthesis of stylistic variance aligning with the input emotion type. For aesthetic-level actor-camera synchronization, the camera should follow aesthetic principles like center-framing [9, 7], the 180-degree rule [31, 34], or the rule-of-thirds [8, 32]. This immerses the audience aesthetically by shooting the actor in well-designed frame compositions. There are also methods controlling frame aesthetics based on subjective user preferences [10, 29] or replicating styles from film masterpieces [14, 15, 36].

Though various automatic camera control methods can partially address the synchronization with the actor, thereby implicitly improving the sense of immersion, there is still the need for a unified approach. This approach would systematically combine frame aesthetics, spatial action, and emotional state to explicitly solve actor-camera synchronization and produce cinematic creations that offer comprehensive immersive experiences.

III Method

Figure 2: The overview of our proposed camera control framework, which takes user-specific data from a virtual environment to generate camera movements. Through flexible two-stage processing, it ensures actor-camera synchronization across multiple aspects for producing customized immersive cinematic videos.

To facilitate the production of high-quality user-generated cinematic videos, we propose a novel auto-cinematography method that can handle fine-grained camera control to significantly enhance immersion of the generated works. The detailed methodology is presented in the following section.

III-A Immersive camera control framework

Camera movement, especially when precisely synchronized with the actor, has been recognized by directors as one of the essentials for creating immersive feelings, thus engaging the audience in the story [18]. Building on such an empirical rule, we propose a two-stage camera control framework to address actor-camera synchronization in terms of aesthetic, action, and emotional levels. This design aligns with user logic, supporting flexible and repeated applications based on individual needs, allowing the user to thoroughly enhance immersion for the output cinematic creation. Please refer to supplementary materials for more information regarding the foundational cinematographic knowledge.

In our virtual stage, users can customize their characters as well as camera placements, where all objects in the 3D environment share $Q=6$ degrees of freedom: three for the position (x, y, z) and three for the rotation (yaw, pitch, roll) axes. These specified setups are crucial in regard to automating a comprehensive synchronization between the actor and the camera. We capture the movement of the focused actor by obtaining $T$ frames of poses $M \in \mathbb{R}^{T \times JQ}$, with $J$ denoting the number of joints per pose, to enable refined spatial action tracking. To represent the mental state of the actor, a non-negative emotion variable $E$ is utilized to compactly depict both the category and its intensity for facilitating emotional styling. Meanwhile, an initial camera placement $C_0 \in \mathbb{R}^{Q}$ is required from the user so that we are able to improve frame aesthetics and initialize the generation of camera trajectories. Therefore, given such user-specific data input, our camera control framework $\Upsilon$ can be formulated as:

$$C' = \Upsilon(M, E, C_0), \tag{1}$$

where $C' \in \mathbb{R}^{T \times Q}$ represents the synthesized camera movement sequence for rendering the cinematic video.

Fig. 2 demonstrates the workflow of our camera control framework $\Upsilon$, which consists of two key modules. Realizing that users may not possess professional insights of aesthetics, we first design a deep adjustment network $\psi$, acting as an aesthetics adjustor for constructing aesthetic frame composition. Utilizing the initial camera placement $C_0$ and actor pose $M_0 \in \mathbb{R}^{JQ}$ provided by the user, the adjustor $\psi$ modifies the camera placement to $C_0^*$ by analyzing the actor location within the frame under camera projection. This adjustment is guided by a self-supervised hybrid loss $\mathcal{L}_{aes}$, which offers aesthetic constraints based on the rule-of-thirds principle and simultaneously minimizes potential visual changes. Note that the replacement by any other aesthetic principles does not affect the current procedure.

Starting with a more aesthetic initialization, we employ a GAN-based deep model to generate camera movements that can synchronize with the actor’s spatial action and emotional state. Our generator $G$ adopts an encoder-decoder architecture. The encoder extracts local kinematics from the actor pose sequence $M$ and captures the hidden director tracking strategies using saliency maps. These obtained features are then transformed into a camera-space trajectory $C'$ in the decoder, conditioned on the initial camera $C_0^*$ and emotion variable $E$. Notably, $E$ here controls the amplitude of camera movement to match different emotional states with specific styles. The overall generation is regularized using a hybrid trajectory loss $\mathcal{L}_{tj}$, which includes point-level, shape-level, and feature-level constraints, along with an adversarial loss $\mathcal{L}_{adv}$ derived from a discriminator, compared to ground-truth human artist samples.

In this way, Equation (1) can be reformulated as:

$$C' = G(M, E, \psi(M_0, C_0)) = G(M, E, C_0^*). \tag{2}$$
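As a reading aid, the two-stage composition in Equation (2) can be sketched as plain function composition. This is only an illustration of the input and output shapes, not the authors' implementation: `identity_adjustor` and `static_generator` are hypothetical placeholders standing in for the learned networks $\psi$ and $G$, and the sample values of `J`, `T`, and `C0` are made up.

```python
from typing import Callable, List

Pose = List[float]      # one frame of actor pose, length J * Q
Camera = List[float]    # one camera placement, length Q
Q = 6                   # x, y, z, yaw, pitch, roll


def framework(
    poses: List[Pose],
    emotion: float,
    camera0: Camera,
    adjustor: Callable[[Pose, Camera], Camera],
    generator: Callable[[List[Pose], float, Camera], List[Camera]],
) -> List[Camera]:
    """C' = G(M, E, psi(M_0, C_0)): adjust the first placement, then synthesize."""
    if emotion < 0:
        raise ValueError("the emotion variable E is non-negative")
    camera0_star = adjustor(poses[0], camera0)        # stage 1: C_0 -> C_0^*
    return generator(poses, emotion, camera0_star)    # stage 2: T x Q trajectory


# Hypothetical placeholders, NOT the learned networks from the paper.
def identity_adjustor(pose0: Pose, camera0: Camera) -> Camera:
    return list(camera0)


def static_generator(poses: List[Pose], emotion: float, camera0: Camera) -> List[Camera]:
    return [list(camera0) for _ in poses]


J, T = 3, 4
M = [[0.0] * (J * Q) for _ in range(T)]
C0 = [0.0, 1.6, -4.0, 0.0, 0.0, 0.0]
trajectory = framework(M, 0.5, C0, identity_adjustor, static_generator)
```

Passing the two stages in as callables mirrors the statement that the adjustor and the generator are separate modules that can be applied flexibly.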

We dive into more processing details in the next two subsections.

III-B Aesthetic composition adjustment

This module modifies the user-chosen initial camera placement in order to shoot the actor in a way that achieves composition-based frame aesthetics for improving immersion. Instead of relying on image analysis [22, 23], which causes a high rendering cost, our approach conducts efficient adjustments by leveraging the 3D-to-2D camera projection following the widely-used rule-of-thirds [21] aesthetic principle.

III-B1 Rule-of-thirds with camera projection

The rule-of-thirds principle provides camera control guidance over various scenarios to ensure the actor is optimally positioned within the frame, forming an aesthetic frame composition. By projecting the actor’s 3D pose onto the 2D frame from a given camera placement, we can assess whether the on-frame actor contributes to compositional aesthetics according to the rule-of-thirds. Focusing on actor-camera synchronization, here we refine scenario factors by considering only the actor pose and shot side (i.e., from which side the actor is shot by the camera) to categorize common cinematographic cases.
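The 3D-to-2D projection used here can be illustrated with a minimal pinhole-camera sketch. Several conventions below are assumptions chosen for the example and are not specified in the text: the rotation order `Ry(yaw) @ Rx(pitch) @ Rz(roll)`, a camera looking along its local +z axis, a normalized frame with (0.5, 0.5) at the center, and the `focal` parameter.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def camera_rotation(yaw: float, pitch: float, roll: float) -> List[List[float]]:
    """Camera-to-world rotation R = Ry(yaw) @ Rx(pitch) @ Rz(roll) (assumed order)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul(matmul(ry, rx), rz)


def project(point: Vec3, camera: Sequence[float], focal: float = 1.0) -> Optional[Tuple[float, float]]:
    """Project a world point to normalized frame coordinates; None if behind the camera."""
    x, y, z, yaw, pitch, roll = camera
    r = camera_rotation(yaw, pitch, roll)
    d = (point[0] - x, point[1] - y, point[2] - z)
    # World-to-camera transform: p_cam = R^T (p - t).
    pc = [sum(r[k][i] * d[k] for k in range(3)) for i in range(3)]
    if pc[2] <= 0.0:
        return None
    u = 0.5 + focal * pc[0] / pc[2]
    v = 0.5 - focal * pc[1] / pc[2]   # frame v grows downward
    return (u, v)
```

Applying `project` to every joint of a pose gives the on-frame joint positions from which composition can be judged without rendering an image.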

Figure 3: Our rule-of-thirds decision tree. Based on different situations, the on-frame body center of the actor (marked as yellow dot) should stay on a certain alignment line (marked in green) to achieve compositional aesthetics.

Fig. 3 illustrates a decision tree that specifies the aesthetic regulations utilized in our method. We divide each frame into a grid of four lines to mark one-third alignments. The actor’s body, simplified to a weighted mean of joints projected onto the frame, should closely approach a certain alignment line under different cases for aesthetic framing. This transforms our adjustment into a problem discussing the distance between the projected body center and the alignment line. Building on this idea, we train a render-free adjustment network $\psi$ to understand camera projection during the modification process with aesthetic regularizers for self-supervision.
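The body-center and alignment-line idea can be made concrete with a short sketch. It assumes normalized frame coordinates in [0, 1] x [0, 1], and the line names and joint weights are hypothetical; which line applies in a given case is decided by the tree in Fig. 3, so the sketch takes the line name as an input rather than deriving it.

```python
from typing import Dict, Sequence, Tuple

Point2 = Tuple[float, float]

# The four one-third lines of the frame (assumed normalized coordinates).
THIRDS: Dict[str, Tuple[str, float]] = {
    "left": ("vertical", 1.0 / 3.0),
    "right": ("vertical", 2.0 / 3.0),
    "upper": ("horizontal", 1.0 / 3.0),
    "lower": ("horizontal", 2.0 / 3.0),
}


def body_center(joints_2d: Sequence[Point2], weights: Sequence[float]) -> Point2:
    """Weighted mean of the projected joints."""
    total = sum(weights)
    u = sum(w * p[0] for w, p in zip(weights, joints_2d)) / total
    v = sum(w * p[1] for w, p in zip(weights, joints_2d)) / total
    return (u, v)


def distance_to_line(center: Point2, line_name: str) -> float:
    """Perpendicular distance from the body center to an axis-aligned thirds line."""
    orientation, offset = THIRDS[line_name]
    coord = center[0] if orientation == "vertical" else center[1]
    return abs(coord - offset)


joints = [(0.4, 0.2), (0.4, 0.5), (0.4, 0.8)]   # head, torso, feet (illustrative)
center = body_center(joints, [1.0, 2.0, 1.0])
gap = distance_to_line(center, "left")
```

Here the actor stands slightly to the right of the left one-third line, so `gap` is the small horizontal offset that an adjustment would aim to shrink.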

Figure 4: The architecture of our adjustment network $\psi$.

III-B2 Network and self-supervised losses

The network architecture of $\psi$ incorporates attention blocks [44], leveraging their ability to weight features with high importance. These blocks are utilized to analyze the initial actor pose $M_0$ and camera placement $C_0$, aiming to emphasize on-frame actor joints and specific camera axes that significantly influence shot composition, respectively. As depicted in Fig. 4, our model $\psi$ uses a dense layer to extract latent features $Z^{\gamma}$, which are then merged with self-attention features $Z^{\omega}$ through a scale-add operation, followed by layer normalization and a leaky ReLU activation. The captured camera and actor features are then fused together to imitate camera projection, enabling the estimation of an adjusted camera placement $C_0^*$ that facilitates immersive frame aesthetics.
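The merge step described above can be sketched on plain vectors. The exact form of the scale-add operation is not given in this section, so the sketch assumes it is $Z^{\gamma} + \alpha Z^{\omega}$ with a scalar $\alpha$; the value of `alpha`, the leaky ReLU slope, and the omission of learnable layer-norm parameters are all assumptions, and the dense and attention layers that produce the two inputs are left out.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def leaky_relu(x: List[float], slope: float = 0.01) -> List[float]:
    return [v if v >= 0.0 else slope * v for v in x]


def fuse(z_gamma: List[float], z_omega: List[float], alpha: float = 0.5) -> List[float]:
    """Assumed scale-add (z_gamma + alpha * z_omega), then layer norm and leaky ReLU."""
    merged = [g + alpha * o for g, o in zip(z_gamma, z_omega)]
    return leaky_relu(layer_norm(merged))


features = fuse([0.0, 0.0], [-2.0, 2.0], alpha=0.5)
```

In this toy call the attention features alone determine the output, and the negative component is damped by the leaky ReLU rather than zeroed.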

The training of $\psi$ is self-supervised using a hybrid loss function $\mathcal{L}_{aes}$, which not only regularizes aesthetic frame composition but also ensures minimal adjustment in order to maintain user preference.

Composition loss: Given $M_0$ and $C_0^*$, we use camera projection to calculate the projected 2D actor pose $M^*_p \in \mathbb{R}^{J \times 2}$. The on-frame body center is thus located by computing the weighted mean joint of $M^*_p$, which is denoted as $\overline{M^*_p} \in \mathbb{R}^{2}$. Using the decision tree shown in Fig. 3, we obtain candidates of alignment lines as $A_l \in \mathbb{R}^{2 \times 2}$. The composition-driven rule-of-thirds constraint can be effectively formed via the point-to-line distance as:

$$\mathcal{L}_{cmp} = \min\left(\|\overline{M^*_p} - A_l^0\|_2,\ \|\overline{M^*_p} - A_l^1\|_2\right), \tag{3}$$

where $A_l^0$ and $A_l^1$ represent the two possible alignment candidates, both in $\mathbb{R}^{2}$. If only one alignment line is determined as a candidate based on the decision tree, $A_l^0$ and $A_l^1$ are set equal.
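Equation (3) can be evaluated directly once the body center and the two candidates are available. The sketch below follows the equation as written and treats each candidate as a 2D point in normalized frame coordinates; how such a point is picked on each alignment line is detailed in the supplementary materials and is only assumed here (the example uses the point on the line at the same height as the body center).

```python
import math
from typing import Tuple

Point2 = Tuple[float, float]


def composition_loss(center: Point2, a0: Point2, a1: Point2) -> float:
    """L_cmp = min(||center - A_l^0||_2, ||center - A_l^1||_2)."""
    return min(math.dist(center, a0), math.dist(center, a1))


center = (0.4, 0.5)
# Two candidates: the left and right one-third lines at the center's height.
loss_two = composition_loss(center, (1.0 / 3.0, 0.5), (2.0 / 3.0, 0.5))
# Single candidate: both arguments are set equal, as stated in the text.
loss_one = composition_loss(center, (2.0 / 3.0, 0.5), (2.0 / 3.0, 0.5))
```

Taking the minimum lets the adjustor move toward whichever admissible line is closer, which keeps the required camera change small.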

Adjustment loss: To prevent excessive modification over the inputted $C_0$, we constrain the extent of adjustment as:

$$\mathcal{L}_{adj} = \|C_0 - C_0^*\|_2. \tag{4}$$

Visualization loss: We make efforts to preserve the original shot type (e.g. full, medium, close shot) in $C_0$ to further avoid over-adjustment at the visualization level. Hence, a contrastive regularizer is designed to monitor the consistency of the actor’s on-frame and off-frame joints before and after modification as:

$$\mathcal{L}_{vis} = M_b^* \cdot (1 - M_b) + (1 - M_b^*) \cdot M_b, \tag{5}$$

where $\cdot$ denotes the dot product. $M_b^*$ and $M_b$ are both binary vectors in $\mathbb{R}^{J}$, and each element represents whether the corresponding joint is visible in the shot. These vectors are obtained by binarizing the 2D actor pose projections under the camera placement $C_0^*$ and $C_0$, respectively.
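For binary visibility vectors, Equation (5) simply counts the joints whose visibility changes between the two placements. The sketch below illustrates this with a hard in-frame test on normalized coordinates; the binarization rule and the frame bounds are assumptions made for the example, since the actual derivation is deferred to the supplementary materials.

```python
from typing import List, Optional, Sequence, Tuple

Point2 = Tuple[float, float]


def visibility(joints_2d: Sequence[Optional[Point2]]) -> List[int]:
    """1 if a projected joint lies inside the normalized frame [0, 1]^2, else 0."""
    return [
        1 if p is not None and 0.0 <= p[0] <= 1.0 and 0.0 <= p[1] <= 1.0 else 0
        for p in joints_2d
    ]


def visualization_loss(m_b_star: Sequence[int], m_b: Sequence[int]) -> int:
    """L_vis = M_b^* . (1 - M_b) + (1 - M_b^*) . M_b."""
    gained = sum(s * (1 - b) for s, b in zip(m_b_star, m_b))   # newly visible joints
    lost = sum((1 - s) * b for s, b in zip(m_b_star, m_b))     # newly hidden joints
    return gained + lost


before = visibility([(0.5, 0.5), (0.5, 0.9), (0.5, 1.2)])   # feet out of frame
after = visibility([(0.5, 0.3), (0.5, 0.7), (0.5, 1.0)])    # feet now in frame
loss = visualization_loss(after, before)
```

A medium shot that turns into a full shot after adjustment therefore incurs a penalty of one per joint that enters the frame, which is what discourages a change of shot type.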

In summary, the complete aesthetic loss function $\mathcal{L}_{aes}$ for optimizing $\psi$ can be formulated as:

$$\mathcal{L}_{aes}=\lambda_{cmp}\mathcal{L}_{cmp}+\lambda_{adj}\mathcal{L}_{adj}+\lambda_{vis}\mathcal{L}_{vis}, \qquad (6)$$

where each $\lambda$ denotes the weight of the corresponding loss component.

The derivation of all variables mentioned above is detailed in supplementary materials. By conducting such compositional adjustment, we obtain an aesthetic initialization of camera placement for the subsequent camera trajectory synthesis.

III-C Camera trajectory synthesis

In this module, we generate camera trajectories that provide precise spatial and emotional synchronization with the actor to enhance immersive viewer experiences. Leveraging the power of adversarial learning, our GAN-based generative model extracts features from the actor space $X^{h}$, including both physical movements and psychological states preprocessed from $M$, $E$, and $C_{0}^{*}$. These features are then transformed into the final camera trajectory $C'$, so that the model learns tracking and styling techniques that mimic those of a human director in cinematic production. We specify $X^{h}$ as follows:

$$X^{h}=\{M,M^{v},M^{d},E,C^{f}\}, \qquad (7)$$

where

  • $M$, $M^{v}$, and $M^{d}$ are all in $\mathbb{R}^{T\times JQ}$. $M$ represents joint locations per frame, whereas $M^{v}$ and $M^{d}$ denote the joint velocity and its absolute value, respectively. Random noise is padded to $M^{d}$ and $M^{v}$ for temporal alignment with $M$.

  • $E\in(0,E_{max}]$ with $E_{max}>1$. Following [45], we categorize actor emotion as tense or relaxed: $E<1$ indicates a relaxed emotion, with lower values for greater relaxation, while $E>1$ denotes tension, with higher values for greater tension.

  • $C^{f}\in\mathbb{R}^{T\times Q}$ is obtained by repeating $C_{0}^{*}$ $T$ times along the temporal dimension, capturing initial correlations between the focused actor and the camera.
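The preprocessing of the inputs above can be sketched as follows. This is only an illustration under stated assumptions: velocity is taken to be a first-order frame difference, the noise padding is a single row appended at the end, and the function name `build_actor_features` and the noise scale are hypothetical. The paper defers the exact derivation to its supplementary materials.

```python
import random


def build_actor_features(M, c0_star, noise_scale=1e-3, seed=0):
    """Assemble M^v, M^d, and C^f from joint locations M (T x JQ) and C_0^* (Q)."""
    rng = random.Random(seed)
    T, width = len(M), len(M[0])
    # Assumed velocity: difference between consecutive frames (T - 1 rows).
    Mv = [[M[t + 1][k] - M[t][k] for k in range(width)] for t in range(T - 1)]
    # Element-wise absolute value of the velocity.
    Md = [[abs(v) for v in row] for row in Mv]
    # Pad one row of small random noise so both match the T rows of M
    # (appending at the end is an assumption).
    Mv.append([rng.gauss(0.0, noise_scale) for _ in range(width)])
    Md.append([abs(rng.gauss(0.0, noise_scale)) for _ in range(width)])
    # C^f: the adjusted initial placement repeated T times.
    Cf = [list(c0_star) for _ in range(T)]
    return Mv, Md, Cf


M = [[0.0, 1.0], [0.5, 0.5], [1.5, 0.0]]  # T = 3 frames, JQ = 2
Mv, Md, Cf = build_actor_features(M, [0.0, 1.6, -3.0])
```

After padding, all three arrays share the temporal length $T$ of $M$, which is what allows them to be fed to the generator frame by frame.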

III-C1 Network architecture

To improve efficiency in actor-to-camera processing, our generator $G$ utilizes an encoder-decoder architecture with an intermediate latent representation to facilitate cross-space transformation.

Figure 5: The design of the encoder in $G$. See text descriptions for details.

Actor encoder: During the encoding phase, actor poses are disarticulated into four parts (head, arms, torso, and legs) to enable a fine-grained, low-cost analysis of kinematic features for precise spatial tracking. As illustrated in Fig. 5, the encoder takes $M^{d}$ as input, which is divided into $M^{d}_{h}$, $M^{d}_{a}$, $M^{d}_{t}$, and $M^{d}_{l}$, corresponding to the different body regions. We leverage several linear blocks, each comprising a dense layer, a layer normalization, and a leaky ReLU activation, to extract region-wise kinematics. These local motion features are concatenated to form the overall kinematic embedding $Z^{k}$. We then compress $Z^{k}$ through a couple of linear blocks to generate the latent feature $Z^{l}$. Additionally, a softmax operator is applied to derive a saliency map $Z^{s}$ that implicitly learns the hidden director tracking strategy.

Figure 6: The design of the decoder in $G$. See text descriptions for details.

Camera decoder: The decoder additionally incorporates $M^{v}$, $M$, and $C^{f}$ to further strengthen the actor-camera synchronization during feature transformation, with random noise merged into $C^{f}$ to enhance the model's robustness. The encoded tracking strategy $Z^{s}$ and actor kinematics $Z^{l}$ are integrated into the latent decoding space via element-wise multiplication and addition, respectively. To synthesize camera sequences, we employ Gated Recurrent Units (GRUs) [46], a refined variant of recurrent neural networks (RNNs) that is well suited to sequential data. Moreover, the variable $E$, which indicates the user-desired emotional category and intensity, is repeatedly introduced during generation. This enables appropriate emotional styling by adjusting the overall amplitude of the generated camera trajectory to match a given actor emotion. Finally, a skip connection of features obtained from $C^{f}$ is applied at the end of the decoder to improve efficiency, allowing the network to focus on learning only the temporal evolution when estimating the output $C'$.

Discriminator: Our discriminator $D$ adopts a Siamese [47] architecture, which is effective at distinguishing between real and fake samples, particularly for sequence-based data [48]. It compares the generated $C'$ against the ground-truth $C$ through two identical shared-weight branches, each consisting of three-layer linear blocks to capture cinematic dynamics. The features extracted from the paired samples are then concatenated and fed into a classifier to estimate the similarity, where a higher similarity score denotes greater generation accuracy. In this way, $D$ supervises our generator $G$ in an adversarial manner.

III-C2 Loss functions

Our trajectory generator $G$ is trained using the adversarial loss $\mathcal{L}_{adv}$ and the trajectory loss $\mathcal{L}_{tj}$. The latter comprises three regularizers at the point level, shape level, and feature level, each computed against real professional samples.

Point loss: We calculate the L2 distance between the ground-truth $C$ and our generated $C'$, with a total variation term [49] for temporal smoothness, as:

$$\mathcal{L}_{mse}=||C'-C||_{2}^{2}+\sum_{t=1}^{T-2}\left(||C'_{t+1}-C'_{t}||+||C'_{t}-C'_{t-1}||\right), \qquad (8)$$

where $T$ denotes the total number of frames and the frame index $t$ ranges from $0$ to $T-1$.
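Equation (8) can be illustrated with a small pure-Python sketch. The norm in the total variation term is not specified in the text, so the Euclidean norm is assumed here, and the function names are hypothetical.

```python
import math


def l2(u, v):
    """Euclidean distance between two camera poses (assumed norm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def point_loss(C_gen, C_real):
    """Equation (8): squared L2 error plus a total-variation smoothness term."""
    T = len(C_gen)
    sq_err = sum(
        (a - b) ** 2 for g, r in zip(C_gen, C_real) for a, b in zip(g, r)
    )
    tv = sum(
        l2(C_gen[t + 1], C_gen[t]) + l2(C_gen[t], C_gen[t - 1])
        for t in range(1, T - 1)  # t = 1 .. T-2, as in the summation bounds
    )
    return sq_err + tv


C_real = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
C_gen = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
loss = point_loss(C_gen, C_real)
```

Note that the smoothness term is non-zero even for a perfect reconstruction of a moving camera, so it acts as a penalty on frame-to-frame displacement rather than on error.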

Shape loss: Following [41], we control the amplitude of the camera trajectory to express varied psychological states of the actor. For precise emotional styling in the generated $C'$, we align its amplitude-based trajectory shape with that of the well-designed real sample $C$.

Figure 7: An example visualizing the calculation of amplitude in the pitch axis. See text below for the detailed algorithm.

As shown in Fig. 7, we collect the time points of peaks and valleys along a specific camera axis, denoted as $Pv=\{pv_{0},pv_{1},...,pv_{N}\}$. The axis-wise amplitude can be analytically measured using $f_{amp}$ as:

$$f_{amp}(C^{q})=\sum_{i=1}^{N}\frac{|C_{pv_{i}}^{q}-C_{pv_{i-1}}^{q}|}{pv_{i}-pv_{i-1}}, \qquad (9)$$

where $C^{q}$ denotes the trajectory in axis $q$ among the $Q$ degrees of freedom, and $i$ iterates through the time points collected in $Pv$. The overall amplitude of $C$ is then obtained by applying $f_{amp}$ to all camera axes, yielding $C_{amp}\in\mathbb{R}^{Q}$. This allows us to regularize the trajectory shape generation as:

$$\mathcal{L}_{shp}=||C'_{amp}-C_{amp}||. \qquad (10)$$
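A possible reading of Equations (9) and (10) is sketched below. The peak and valley detector (`find_extrema`, strict sign changes of the frame difference) is an assumed stand-in, since the paper does not describe how $Pv$ is extracted, and the Euclidean norm in the shape loss is likewise an assumption.

```python
def find_extrema(series):
    """Indices of strict local peaks and valleys (a simple assumed detector)."""
    return [
        t for t in range(1, len(series) - 1)
        if (series[t] - series[t - 1]) * (series[t + 1] - series[t]) < 0
    ]


def f_amp(series, pv=None):
    """Equation (9): summed absolute slope between consecutive extrema."""
    pv = find_extrema(series) if pv is None else pv
    return sum(
        abs(series[pv[i]] - series[pv[i - 1]]) / (pv[i] - pv[i - 1])
        for i in range(1, len(pv))
    )


def shape_loss(C_gen, C_real):
    """Equation (10): norm of the difference between per-axis amplitudes."""
    Q = len(C_real[0])
    diffs = []
    for q in range(Q):
        gen_axis = [frame[q] for frame in C_gen]
        real_axis = [frame[q] for frame in C_real]
        diffs.append(f_amp(gen_axis) - f_amp(real_axis))
    return sum(d * d for d in diffs) ** 0.5


# Hypothetical single-axis (pitch) trajectories; the generated one is half as strong.
C_real = [[0.0], [2.0], [0.0], [4.0], [0.0]]
C_gen = [[0.0], [1.0], [0.0], [2.0], [0.0]]
loss = shape_loss(C_gen, C_real)
```

Because each term divides the height difference by the time gap, the measure grows both with larger swings and with faster oscillation, which matches the intuition that tenser emotions call for a more agitated camera.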

Feature loss: Given the effectiveness of VGG [50] in extracting perception-sensitive features, we use it to supervise the similarity between the real $C$ and our generated $C'$ in the feature space, as:

$$\mathcal{L}_{feat}=||VGG(C')-VGG(C)||_{2}^{2}. \qquad (11)$$

Adversarial loss: Taking advantage of the GAN framework, our discriminator $D$ tries its best to distinguish between real and fake pairs by maximizing the loss:

$$\mathcal{L}_{adv}^{d}=\mathbb{E}[\log D(C,C)]+\mathbb{E}[\log(1-D(C',C))]. \qquad (12)$$

Through adversarial training, our generator $G$ is forced to improve the synthesized results in order to fool $D$ by minimizing the function:

$$\mathcal{L}_{adv}^{g}=\mathbb{E}[-\log(D(C',C))]. \qquad (13)$$
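To make the two objectives concrete, the sketch below evaluates Equations (12) and (13) on batches of discriminator scores, with expectations approximated by batch means. The function names are hypothetical, and the scores are assumed to lie strictly between 0 and 1 (for example, the output of a sigmoid).

```python
import math


def discriminator_objective(real_scores, fake_scores):
    """Equation (12): E[log D(C, C)] + E[log(1 - D(C', C))], maximized by D."""
    real_term = sum(math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_adv_loss(fake_scores):
    """Equation (13): E[-log D(C', C)], minimized by G."""
    return sum(-math.log(s) for s in fake_scores) / len(fake_scores)


# Hypothetical similarity scores for real pairs (C, C) and fake pairs (C', C).
confident_d = discriminator_objective([0.9, 0.8], [0.1, 0.2])
confused_d = discriminator_objective([0.5, 0.5], [0.5, 0.5])
```

A discriminator that separates the pairs well obtains a higher objective than one that outputs 0.5 everywhere, while the generator loss falls as the fake pairs receive higher similarity scores.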

In summary, combining all the regularizers above, the final loss function for the generator $G$ can be formulated as:

$$\mathcal{L}_{g}=\lambda_{mse}\mathcal{L}_{mse}+\lambda_{feat}\mathcal{L}_{feat}+\lambda_{shp}\mathcal{L}_{shp}+\lambda_{adv}\mathcal{L}_{adv}^{g}, \qquad (14)$$

where each $\lambda$ denotes the weight of the corresponding loss component.
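As a small worked example of Equation (14), the weights below are the ones reported in the training details (Section IV-C). The individual loss values are hypothetical and serve only to show how the weighting shifts the balance between terms.

```python
# Weights reported in the training details (Section IV-C).
LAMBDAS = {"mse": 10.0, "feat": 0.5, "shp": 1.0, "adv": 0.2}


def generator_loss(terms, lambdas=LAMBDAS):
    """Equation (14): weighted sum of the individual loss terms."""
    return sum(lambdas[name] * terms[name] for name in lambdas)


# Hypothetical per-batch loss values, for illustration only.
example_terms = {"mse": 0.01, "feat": 0.04, "shp": 0.2, "adv": 0.7}
total = generator_loss(example_terms)
contributions = {k: LAMBDAS[k] * example_terms[k] for k in LAMBDAS}
```

With these illustrative values, the large weight on the point loss brings a numerically small error to the same order of magnitude as the shape and adversarial terms.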

Using the aesthetic adjustor $\psi$ and the trajectory generator $G$, our camera control system outputs camera movements that achieve comprehensive actor-camera synchronization in terms of aesthetics, action, and emotion, thereby enhancing immersive viewer experiences.

IV Experiments

In this section, we provide training details of our camera control system and assess its generation accuracy via ablation studies. Then, we describe quantitative experiments as well as a qualitative user study that evaluates the viewing quality of cinematic videos produced by our system, focusing on immersion in spatial, emotional, and aesthetic aspects.

IV-A Experimental environment

Figure 8: Main functional zones in the environment. Zone 1 is used for the overall scene view and acting display, while Zone 2 visualizes the camera view. Zone 3 allows users to manipulate camera and character behaviors by dragging and dropping blocks on the timeline. Zone 4 and Zone 5 offer detailed management of the camera and character, respectively, via pop-up menus. They control configurations such as the focused character of the camera, the character action, and the emotion variable.

As shown in Fig. 8, a Unity3D application is developed to create a 3D virtual environment with cinematic resources including scenes, characters, and cameras. This application provides users with interfaces to customize their cinematic works and monitor shooting through information panels or a real-time camera view window. In particular, it supports the export and import of designed actor and camera behaviors for research purposes. Our camera control system is integrated as a plugin within the environment, enhancing the sense of immersion in user-generated cinematic videos. For more information, please refer to the supplementary materials.

IV-B Dataset

Due to different training strategies, we build separate datasets for the self-supervised aesthetic adjustor $\psi$ and the supervised GAN-based camera trajectory synthesis model $G$.

Samples for training $\psi$: To facilitate the self-supervised training of $\psi$, we combine synthetic camera placements $C_{0}$ and actor poses $M_{0}$ to simulate diverse shooting scenarios for adjusting frame aesthetics. Apart from a few frequently used locations collected from professional artists, we use sphere meshes around the focused actor at various radial distances, yielding 481,536 potential camera placements. Furthermore, we retrieve 57 typical initial poses from action resources as pose input. By pairing these $C_{0}$ and $M_{0}$ and filtering out invalid shots, we obtain an aesthetic adjustment dataset of 13,066,689 samples in total, which is subsequently split into 80% for training and 20% for testing.

Samples for training $G$: In our virtual environment, we ask artists to design camera movements given several director screenplays. During production, they are required to use cinematographic techniques such as spatial tracking and emotional styling to achieve actor-camera synchronization for immersive viewer experiences. Each training pair includes the scene settings (action sequence $M$, emotion variable $E$, and aesthetic initial camera placement $C_{0}^{*}$) combined with the artist-designed camera trajectory $C$ as ground truth, all exported via the environment interface. Due to the limits of human labor, these manual pairs constitute 20% of our dataset, with the rest synthesized from artist samples through offsetting, sequence flipping, and emotion-driven amplitude modification to increase data diversity. This results in a total of 25,230 five-second footage samples, involving 583 character actions and 5 typical emotion variables $\{0.5, 0.75, 1, 1.5, 2\}$, semantically indicating "relax-more," "relax-less," "neutral," "tense-less," and "tense-more," respectively. In the testing phase, we randomly select 5,550 samples excluded from training to evaluate the performance of our system.

IV-C Training details

We implemented all the networks and losses in PyTorch, with some of our loss functions mathematically smoothed to ensure differentiability. Our adjustment network $\psi$ converges after 35 epochs of training with a batch size of 1024 and the Adam optimizer [51]. We initialize the learning rate to 0.002 and decrease it by a factor of 10 every 10 epochs. The weights for the loss components in $\mathcal{L}_{aes}$ are set to $\lambda_{cmp}=1$, $\lambda_{adj}=0.25$, and $\lambda_{vis}=0.01$ to maximize aesthetic composition while preserving user input.

For the GAN model, we pretrain the generator $G$ for 100 epochs using only $\mathcal{L}_{mse}$ with the Adam optimizer and a batch size of 10. The initial learning rate is set to 0.005 and decreased by a factor of 10 every 25 epochs until convergence. Then, with the same optimizer and batch size, we train $G$ and the discriminator $D$ adversarially for 45 epochs, with both learning rates set to 0.0001. During this process, the loss components of $\mathcal{L}_{g}$ are weighted as follows: $\lambda_{mse}=10$, $\lambda_{feat}=0.5$, $\lambda_{shp}=1$, and $\lambda_{adv}=0.2$, balancing trajectory accuracy across the different levels.
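The two step-decay schedules described above can be written out explicitly. This sketch assumes zero-based epoch counting with the decay applied at the start of every `step_size`-th epoch; the helper name `step_decay` is hypothetical, and the text does not say which scheduler implementation was actually used.

```python
def step_decay(initial_lr, epoch, step_size, factor=10.0):
    """Learning rate after dividing by `factor` every `step_size` epochs."""
    return initial_lr / (factor ** (epoch // step_size))


# Adjustor psi: 0.002, divided by 10 every 10 epochs, 35 epochs in total.
psi_schedule = [step_decay(0.002, e, step_size=10) for e in range(35)]
# Generator pretraining: 0.005, divided by 10 every 25 epochs, 100 epochs.
pretrain_schedule = [step_decay(0.005, e, step_size=25) for e in range(100)]
```

In PyTorch, `torch.optim.lr_scheduler.StepLR` with `gamma=0.1` produces a schedule of this kind, although whether the authors used it is not stated.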

Note that all the $\lambda$ values mentioned above are determined experimentally based on optimal model performance.

IV-D Ablation study

To demonstrate the effectiveness of our aesthetic adjustor $\psi$ and camera trajectory generator $G$, we compare them with variant models on our testing dataset across multiple metrics.

IV-D1 Evaluation of aesthetic composition adjustment

We evaluate the performance of our aesthetic adjustor $\psi$ by comparing it with variants trained on different loss components. Our adjustments are assessed by rule-of-thirds shift (RoTSft), adjustment distance (AdjDis), and visibility accuracy (VisAcc). RoTSft offers a direct aesthetic evaluation by computing the on-frame distance between the actor's body center and the one-third alignment lines. Additionally, AdjDis and VisAcc capture practical adjustment factors that are crucial to user preference. AdjDis denotes the total adjustment distance calculated via mean absolute error (MAE), while VisAcc measures the percentage of body joints whose visibility matches their original visibility. For computational details, please refer to Equations (3), (4), and (5).
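Two of these metrics can be sketched in a few lines. The exact definitions are given by equations referenced in the text, so the versions below are simplified assumptions: `rot_shift` measures only the horizontal distance to the nearest vertical one-third line in a 1920-pixel-wide frame, and `vis_acc` counts joints whose binary visibility flag is unchanged. Both function names are hypothetical.

```python
def rot_shift(center_x, width=1920):
    """Horizontal pixel distance from the body center to the nearest third line."""
    thirds = (width / 3.0, 2.0 * width / 3.0)
    return min(abs(center_x - line) for line in thirds)


def vis_acc(m_star, m):
    """Percentage of joints whose visibility flag is unchanged after adjustment."""
    same = sum(1 for a, b in zip(m_star, m) if a == b)
    return 100.0 * same / len(m)


centered_shift = rot_shift(960)   # actor in the middle of the frame
aligned_shift = rot_shift(640)    # actor exactly on the left third line
accuracy = vis_acc([1, 1, 0, 0], [1, 0, 0, 0])
```

Under this simplified definition, a perfectly centered actor in a 1080p frame is 320 pixels from either third line, which gives a sense of scale for the pixel values reported in Table I.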

TABLE I: Comparison of aesthetic adjustment performance

| Model Name | RoTSft (px) $\downarrow$ | AdjDis (m) $\downarrow$ | VisAcc (%) $\uparrow$ |
| --- | --- | --- | --- |
| Original | 412.4 | n/a | n/a |
| $\psi$ w/o $\mathcal{L}_{adj}$ & $\mathcal{L}_{vis}$ | 97.8 | 0.7860 | 46.11 |
| $\psi$ w/o $\mathcal{L}_{vis}$ | 116.0 | 0.1191 | 47.87 |
| $\psi$ | 129.5 | 0.1473 | 71.08 |

As demonstrated in Table I, our $\psi$ model achieves a good balance across all metrics. Compared with the best-performing variants in RoTSft and AdjDis, our model falls short by only about 32 pixels of shift from the alignment line in 1080p frames and about 3 cm of camera adjustment in the virtual environment, differences that are practically negligible. Notably, our $\psi$ model outperforms the others in VisAcc, which significantly affects human perception and determines the shot type (e.g., full, medium, close-up), by more than 23 percentage points of the body joints. This indicates that our aesthetic adjustor $\psi$ effectively modifies the camera placement to align with rule-of-thirds aesthetics while best maintaining the original user-preferred shot design.
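The gaps quoted in this paragraph follow directly from the entries of Table I, as the short check below shows. The dictionary keys are informal labels chosen for this sketch.

```python
table_i = {
    "w/o adj & vis": {"rot": 97.8, "adj": 0.7860, "vis": 46.11},
    "w/o vis": {"rot": 116.0, "adj": 0.1191, "vis": 47.87},
    "full": {"rot": 129.5, "adj": 0.1473, "vis": 71.08},
}
variants = [v for k, v in table_i.items() if k != "full"]
full = table_i["full"]

# Shortfall against the best variant in each "lower is better" metric.
rot_gap_px = full["rot"] - min(v["rot"] for v in variants)
adj_gap_cm = 100 * (full["adj"] - min(v["adj"] for v in variants))
# Gain over the best variant in visibility accuracy (percentage points).
vis_gain = full["vis"] - max(v["vis"] for v in variants)
```

Rounded to whole units, these are the 32 pixels, 3 cm, and 23 percentage points discussed above.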

Figure 9: Example of qualitative comparison of aesthetic adjustment performance. The rule-of-thirds alignment candidates are labeled in green, whereas the orange dot represents the on-frame body center of the actor.

The qualitative example in Fig. 9 illustrates the superior aesthetic adjustment capability of our $\psi$ model. Unlike the other two variant models, it prevents the over-adjustment that arises from overfitting the rule-of-thirds compositional constraint or from excessive changes to the actor's joint visibility.

IV-D2 Evaluation of camera trajectory synthesis

We compare our generator $G$ against its $\mathcal{L}_{mse}$-based pretraining model and two other GAN variants with partial loss components. Metrics including MSE, $\text{CosD}_{\text{A}}$, LPIPS [50], and FID [52] are used to evaluate the models, where $\text{CosD}_{\text{A}}$ calculates the cosine distance between the overall amplitudes of the real and synthesized trajectories, $C_{amp}$ and $C'_{amp}$, respectively (see Equation (9)). This enables us to comprehensively assess the generated camera trajectories against ground-truth samples in terms of point accuracy, shape consistency, and feature-space similarity, from low to high levels.
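The $\text{CosD}_{\text{A}}$ metric can be sketched as follows, assuming the common definition of cosine distance as one minus cosine similarity. The function name and the amplitude vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two amplitude vectors (assumed definition)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical per-axis amplitudes for Q = 3 degrees of freedom.
real_amp = [1.0, 2.0, 3.0]
scaled_amp = [2.0, 4.0, 6.0]      # same shape, twice the strength
distance = cosine_distance(real_amp, scaled_amp)
```

Since the measure depends only on direction, it compares how amplitude is distributed across camera axes rather than its absolute magnitude. The sketch does not handle all-zero amplitude vectors, for which the distance is undefined.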

TABLE II: Comparison of camera trajectory synthesis

| Model Name | MSE $\downarrow$ | $\text{CosD}_{\text{A}}$ $\downarrow$ | LPIPS $\downarrow$ | FID $\downarrow$ |
| --- | --- | --- | --- | --- |
| Pretrain | **0.0082** | 0.2935 | 0.0510 | 0.0702 |
| $G$ w/o $\mathcal{L}_{feat}$ & $\mathcal{L}_{shp}$ | 0.0149 | 0.2569 | 0.0477 | 0.0739 |
| $G$ w/o $\mathcal{L}_{shp}$ | 0.0135 | *0.2509* | **0.0385** | *0.0681* |
| $G$ | *0.0124* | **0.2380** | *0.0405* | **0.0634** |

In Table II, the best results are marked in bold, while the second-best are in italics. The table reveals that the pretraining model leads in MSE but falls behind on the other medium-to-high-level metrics, indicating its limitation in generating finely detailed camera trajectories. The performance of the two GAN variants highlights the benefits of adversarial training and $\mathcal{L}_{feat}$ for an enhanced feature-space representation. However, they perform less effectively in $\text{CosD}_{\text{A}}$ than our $G$ model, which integrates $\mathcal{L}_{shp}$, a term that strongly influences the shape of the trajectory. Overall, our $G$ model achieves the best trade-off, leading in $\text{CosD}_{\text{A}}$ and FID with a slight compromise in MSE and LPIPS.

Figure 10: A qualitative comparison of trajectories from different models, showing camera positions over the temporal evolution.

Meanwhile, the qualitative example shown in Fig. 10 further supports the conclusions drawn from the table analysis. By utilizing a hybrid loss function, our $G$ model can accurately capture high-level features and refine the generated camera trajectory to match the shape of the real artist sample, thereby enabling the learning of immersive shooting techniques.

IV-E Quantitative evaluation of immersion

We quantitatively assess the immersive performance of our camera control system $\Upsilon$, which combines both the $\psi$ and $G$ models. This evaluation, following actor-camera synchronization, specifically focuses on three aspects: spatial action, emotional status, and frame aesthetics, all of which significantly impact immersive viewer experiences. We conduct separate experiments to target these immersion-critical factors in comparison with other feasible methods.

IV-E1 Spatial immersion

To evaluate the immersion arising from spatial actor-camera synchronization, we measure the consistency between character action features $Z_{s}^{a}$ and camera movement features $Z_{s}^{c}$ to quantify the spatial tracking performance. Because of cross-platform difficulties, we implement two advanced auto-cinematography methods from Burelli et al. [20] and Yu et al. [31] within our environment for comparison. Burelli et al. [20] propose attaching the camera rigidly to a specific actor joint based on the desired shot type for efficient action tracking. Yu et al. [31] offer an optimization-based camera controller that follows the focused actor while considering fidelity and aesthetics. We test these models across 37 virtually staged screenplays, designed with character action $M$, initial camera placement $C_{0}^{*}$, and a neutral emotion variable (i.e., $E=1$). The mean action velocity, serving as the ground-truth actor feature $Z_{s}^{a}$, and the corresponding camera velocity $Z_{s}^{c}$ are compared using the Hausdorff Distance (HD) to calculate feature-level similarity. Both features are normalized to $[0,1]$ and structured as $\mathbb{R}^{(T-1)\times Q}$.
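For reference, a symmetric Hausdorff distance between two velocity features can be sketched as below. It assumes that each of the $T-1$ rows is treated as a point in $\mathbb{R}^{Q}$ and that the ground distance is Euclidean; neither detail is stated in the text, and the function names and sample values are hypothetical.

```python
import math


def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest neighbor in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical normalized velocity features with T - 1 = 3 rows and Q = 2.
actor_velocity = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
camera_velocity = [[0.0, 0.0], [0.5, 0.5], [1.0, 0.6]]
hd = hausdorff(actor_velocity, camera_velocity)
```

A lower value means every camera velocity sample has a close counterpart among the actor velocity samples and vice versa. Note that this set-based measure ignores temporal ordering.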

Figure 11: Comparison of immersive performance based on spatial tracking accuracy. Left: HD results, the lower the better. Right: A qualitative example showing spatial actor-camera synchronization from different models over time.

The Hausdorff Distance (HD) results in the left of Fig. 11 demonstrate that our system significantly surpasses the other two models in generating camera features consistent with the performed action. This improvement is attributed to our actor-to-camera generation framework. The qualitative example on the right side of Fig. 11 further verifies that our method can smoothly track actor movements in real time, which outperforms the periodic updating strategy in Yu et al. [31]. Moreover, compared to Burelli et al. [20], the use of saliency maps in our method avoids constant focus on a fixed body joint during camera trajectory synthesis, thereby better preserving the overall trend of character action.

IV-E2 Emotional immersion

We assess the immersive performance of emotional actor-camera synchronization by analyzing the feature-based correlation between the actor’s emotional status and camera behavior. Given the lack of existing methods addressing such emotional immersion, we build two variants of our method as feasible competitors, named PlainCam and EmoCam. PlainCam is obtained using our dataset with emotion-related variances removed, while for EmoCam we disable $\mathcal{L}_{shp}$ during training. The evaluation uses the same screenplays as the spatial immersion analysis, with each screenplay tested under 5 typical emotion variables $E\in\{0.5,0.75,1,1.5,2\}$, representing diverse psychological states of the actor. These emotion variables constitute a vector that serves as the ground-truth emotional actor feature $Z^{a}_{e}$. Correspondingly, for each $E$, we average the trajectory amplitude $C'_{amp}$ and combine the results into a vector as our camera stylistic feature $Z^{c}_{e}$. Hence, the immersion from emotional styling can be evaluated by calculating various correlation coefficients between $Z^{a}_{e}$ and $Z^{c}_{e}$, both of which are in $\mathbb{R}^{5}$.
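As an illustration of this protocol, the three coefficients can be computed between the emotion vector and a vector of mean amplitudes as follows. This is a hedged sketch: the amplitude values are invented for demonstration and are not results from the paper, the function names are hypothetical, and the Kendall implementation is the simple tau-a variant, which assumes no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall rank correlation (KRCC), tau-a variant; assumes no ties."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)

# Ground-truth emotional feature: the five tested emotion variables.
z_a = [0.5, 0.75, 1.0, 1.5, 2.0]
# Hypothetical mean trajectory amplitudes per E (made-up numbers).
z_c = [0.11, 0.17, 0.22, 0.36, 0.45]
pcc, srcc, krcc = pearson(z_a, z_c), spearman(z_a, z_c), kendall(z_a, z_c)
```

The two rank-based coefficients reach 1 whenever the amplitude increases monotonically with $E$, whereas PCC additionally rewards a linear relationship between the two vectors.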

TABLE III: Comparison of emotional immersion over behavior correlation
| Model Name | PCC ↑ | SRCC ↑ | KRCC ↑ |
| --- | --- | --- | --- |
| PlainCam | 0.7748 | 0.7822 | 0.7607 |
| EmoCam | 0.8017 | 0.8112 | 0.7652 |
| Our System $\Upsilon$ | 0.9235 | 0.9356 | 0.9153 |

Table III presents three correlation coefficients, Pearson (PCC), Spearman Rank (SRCC), and Kendall Rank (KRCC), which measure the relationship between actor emotion and the generated camera behavior. Compared to the other models, our method is clearly better at synthesizing camera trajectories whose adaptive amplitudes closely align with the actor’s emotional states. This benefits from the emotion-related specifics incorporated into our network architecture and loss functions, which enable effective emotional actor-camera synchronization and an enhanced sense of immersion.

Figure 12: Visualization of our generated adaptive camera trajectory amplitude. Left: Mean amplitude per sample over emotion variables. Right: A qualitative example showing camera trajectory under different $E$ over time.

Fig. 12 provides an in-depth look at the emotional styling ability of our system. On the left, we can observe a transition from low to high trajectory amplitudes in response to varied emotion variables $E$, ranging from relaxation to tension. This imitates professional cinematographic techniques to enhance emotional immersion. On the right, despite the amplitude adjustments for different $E$ settings, the generated trajectories maintain consistent motion trends, which are crucial for accurate spatial tracking. Note that although the experiment focuses on a set of pre-defined $E$ values, in practice our camera control system can process any positive $E$ provided by users.

IV-E3 Aesthetic immersion

To verify the ability of our method to achieve composition-based aesthetic immersion, we compare it with the state-of-the-art camera controller of Yu et al. [31] and two leading aesthetic cropping models, Li et al. [22] and Hong et al. [23]. These models are all known for enabling photography-level aesthetic framing. Utilizing the Aesthetic Visual Analysis (AVA) model [53], we evaluate the aesthetic quality of the produced cinematic videos by analyzing compositional features from the given frames and predicting scores that reflect human aesthetic judgments. Both Yu et al. [31] and our system are tested across 24 randomly selected screenplays with varied 3D stage settings, yielding 216 frames evenly sampled from the videos generated by each method for evaluation. Meanwhile, the image-based cropping models of Li et al. [22] and Hong et al. [23] are assessed on their outputs when applied to frames from Yu et al. [31].

TABLE IV: Comparison of aesthetic immersion over AVA score
| Model \ Scene | Library | Forest | Park | Bedroom | Average |
| --- | --- | --- | --- | --- | --- |
| Yu et al. [31] | 4.6017 | 4.8571 | 4.5665 | 4.2329 | 4.5645 |
| Li et al. [22] | 4.5512 | 4.8643 | 4.5070 | 4.2274 | 4.5375 |
| Hong et al. [23] | 4.5645 | 4.8695 | 4.5185 | 4.2498 | 4.5506 |
| Our System $\Upsilon$ | 4.8077 | 5.0970 | 4.8556 | 4.4089 | 4.7923 |

Table IV shows the aesthetic scores for 4 typical indoor and outdoor scenes, where higher scores indicate better frame aesthetics. The two cropping models, Li et al. [22] and Hong et al. [23], can improve aesthetic frame composition beyond Yu et al. [31], though only in specific scenes. In contrast, our method consistently earns the highest aesthetic scores by incorporating the rule-of-thirds aesthetic principle. It is worth noting that, although our aesthetic adjustments primarily focus on the initial camera placements, the use of spatial tracking preserves these adjusted compositions throughout the entire cinematic production, which effectively contributes to aesthetic immersion. We also present examples in Fig. 13 to qualitatively verify the aesthetic improvements of our method, which strictly follows one-third alignments, compared to other models.

Figure 13: Qualitative examples for aesthetic comparison. We display frames from different models with the corresponding aesthetic scores predicted by [53].

The quantitative experiments described above have separately demonstrated the effectiveness of our method in facilitating actor-camera synchronization through spatial tracking, emotional styling, and aesthetic framing, thereby jointly enhancing the overall immersive viewer experiences. In the next subsection, we conduct a user study to further evaluate the perceptual immersion of our generated cinematic videos in a qualitative way.

IV-F Qualitative evaluation of immersion

Figure 14: Rating results of immersive performance compared to the baseline Yu et al. [31], with higher scores being better. Left: Average scores for different models. Middle: Histogram of score distributions. Right: Comparison of scores between professional and non-professional participants.

A user study is additionally carried out to qualitatively assess the immersive performance of our camera control system. We compare our system against Yu et al. [31] and three variants of our method, each emphasizing different actor-camera synchronization aspects for creating immersion: spatial (S.I.), spatial-emotional (S.I. + E.I.), and spatial-aesthetic (S.I. + A.I.). Twelve participants are invited, half with cinematography knowledge and half without. Given screenplays, these participants play the role of directors, watching and blindly rating 10 sets of generated cinematic videos on a scale from 1 to 5. Yu et al. [31] serves as the baseline for the judgment, with scores $\geq 3$ indicating that our method or its variant models possess a greater capability to immersively convey the story. Conversely, a score below 3 indicates lesser effectiveness. The detailed procedures of this study are available in the supplementary materials.

Fig. 14 presents the user study outcomes. The left part shows that all four models score above 3 on average, outperforming the baseline Yu et al. [31], with ratings increasing as the models integrate additional actor-camera synchronization techniques. This underscores the importance of each shooting technique and the collective impact of our comprehensive method in enhancing immersion. The middle graph presents the detailed score distribution over the number of votes. Unlike the variant models, our full method achieves the highest and most stable viewer satisfaction by addressing immersion across several perceptual dimensions. On the right, the comparison between professional and non-professional participants suggests that professionals are relatively more sensitive to cinematographic changes, while both groups agree that our model provides the greatest enhancement of immersion.

Fig. 15 presents two cinematic examples used in our user study for qualitative comparison. Model S.I. provides smooth and continuous spatial tracking of the focused character, which surpasses the baseline Yu et al. [31]. The addition of E.I. strengthens the amplitude of camera movement to express the desired character emotion more vividly. By incorporating A.I. for aesthetic control, our method further achieves a comprehensive balance among the three aspects of actor-camera synchronization. This helps alleviate potential out-of-frame issues (like those observed with Model S.I. + E.I. in Screenplay A), and all the shooting techniques used jointly contribute to the overall enhancement of the immersive viewer experience. For live demonstrations, please refer to the supplementary video.

Figure 15: Qualitative comparison of immersive performance across cinematic videos from different models. Frame sequences should be read from left to right.

V Discussion

Our camera control system shows outstanding immersive enhancement compared to other methods. When deployed as a plugin in the 3D virtual environment, it allows users to adjust variable settings and repeatedly generate camera movements until they obtain the desired results. Our system is also adaptable to long-sequence and multi-person scenarios, treating them as independent footage segments according to the user-approved expression flow and shifts of the focused character. Because our method is built on well-recognized empirical rules, the diversity of cinematographic principles and users’ personal preferences might affect its effectiveness. Moving forward, we plan to extend our framework with a broader range of shooting techniques to enrich user selection. Additionally, considering the impact of other cinematographic factors such as camera intrinsics and lighting on immersion, we aim to introduce more degrees of freedom, such as screen-based representations, to further enhance our toolkit and approach the capabilities of professional cinematic production.

VI Conclusion

In this paper, we propose a novel auto-cinematographic method for facilitating user-generated cinematic videos with an enhanced sense of immersion. This is achieved by planning immersive camera movements following real-world cinematographic rules. More specifically, given the user-preferred setups, our camera control system synchronizes the camera with the focused actor across the aesthetic, spatial, and emotional levels. In the 3D stage, we design a deep camera control framework comprising an aesthetic adjustor and a camera trajectory synthesis model. The adjustor leverages the rule-of-thirds principle to conduct composition-based aesthetic framing in a self-supervised manner through camera projection analysis. Building on this aesthetic initialization, our GAN-based trajectory generator employs an encoder-decoder architecture, mapping actor kinematics and emotional variables into camera movements. This ensures precise spatial tracking and emotional styling, with constraints controlling trajectory accuracy and stylistic variances. The experiments demonstrate that our method significantly outperforms other competitive models in enhancing the immersive viewer experience under both quantitative and qualitative assessments.

Figure 16: Real-world shooting techniques and their corresponding cinematic samples. (a), (b), and (c) exemplify how the actor-camera synchronization is addressed using camera movement at spatial, emotional, and aesthetic levels, respectively. See text above for detailed explanation. The live clips are available in the supplementary video.

VII Supplementary Materials

VII-A Cinematographic knowledge

In cinematography, directors strive to create an immersive experience so as to enhance viewer engagement and foster resonance between the actor and the audience. To produce high-quality immersive works, creators have to consider two fundamental questions: (1) What factors contribute to cinematographic immersion? and (2) How can these factors be effectively achieved?

Traditionally, immersion is often achieved by dynamic camera movements that simulate the perspective of an invisible character within the scene [17]. Directors typically focus on three key perceptual factors—spatial, emotional, and aesthetic—to enhance the immersive experience in practical productions [54, 19]. Building upon the well-recognized actor-camera synchronization principle [18], directors have broadened its applicability and adapted it to handle different perceptually critical aspects, which can be summarized as follows:

  • Synchronize camera with actor’s physical movements for spatial-level immersion

  • Synchronize camera with actor’s mental state for emotional-level immersion

  • Synchronize camera with pleasing on-frame locations of the actor for aesthetic-level immersion

Various shooting techniques have been proposed to implement these rules. Among them, the most general and straightforward are spatial tracking, emotional styling, and aesthetic framing. We detail each of them with samples demonstrated in Fig. 16.

Spatial tracking: A widely used technique to achieve spatial-level immersion, which requires the camera to closely follow the actor’s movements in a scene [1]. This involves capturing simple transition-based actions, such as running and walking, as well as complex fine-grained actor behaviors (the primary focus of this paper) such as agreeing, refusing, and cursing. We show an example from Mindhunter, directed by David Fincher, in Fig. 16(a), where a camera tilt is employed to synchronize with the rising of the actor’s upper body, creating an observational view for the audience.

Emotional styling: A popular technique for creating emotional-level immersion, which involves styling the camera behavior to reflect the psychological state of the actor [55]. This styling can manifest as variations in the amplitude of camera movements, either intensified or weakened, to match the intensity and specific type of the actor’s emotion, thereby conveying the corresponding mental feeling effectively to the audience [41]. An illustrative example can be seen in Fig. 16(b) from the series Succession, directed by Mark Mylod. In a scene where an actor trembles with extreme anger, Mylod uses exaggerated camera shakiness to make the audience visually struck and emotionally engaged with the actor.

Aesthetic framing: Achieving aesthetic-level immersion often requires control over frame composition according to aesthetic principles. For instance, following the rule-of-thirds principle [56], the actor’s body should be positioned at one-third of the frame throughout the shooting, taking into account compositional factors like the actor’s pose and leading room [57]. In Fig. 16(c), an example is presented from Mission Impossible - Fallout, directed by Christopher McQuarrie. In this scene, despite dynamic camera movements, the actor’s body remains aligned with a certain third of the frame, creating perceptually pleasing cinematic sequences that hold the audience’s attention and affection.

In practice, a combination of these techniques is usually employed to optimize the immersive viewer experience. Beyond what has been discussed in this section, many other cinematographic rules and techniques, both camera-driven and not, can also contribute to the enhancement of immersion. While our paper mainly addresses the three crucial aspects of immersion using spatial, emotional, and aesthetic actor-camera synchronization, this foundational approach can be expanded with additional cinematographic modules to further benefit cinematic production comprehensively in the future.

VII-B Implementation of aesthetic loss

In this section, we provide implementation details related to camera projection, decision tree, and the derivation process of the variables used in our approach.

Camera projection: To simplify the explanation, we denote the process of camera projection as $f_{p}$. Suppose there is an arbitrary pair of a single actor joint $P_{j}$ and a camera placement $C_{0}$, both of which are in $\mathbb{R}^{Q=6}$ based on the 3D virtual environment. By performing $f_{p}(P_{j},C_{0})$, we can obtain the projection result $P_{2d}\in\mathbb{R}^{2}$. This allows us to map the coordinate from the 3D world to the 2D shot.

The entire process can be divided into three main steps. Initially, the first three elements of $P_{j}$, indicating positions on the $x,y,z$ axes, are extracted to construct the world coordinate $P_{w}=[X_{w},Y_{w},Z_{w},1]^{T}$. This coordinate is then transformed into the camera coordinate $P_{c}=[X_{c},Y_{c},Z_{c},1]^{T}$ via a transformation matrix constructed according to the corresponding rotation and position of $C_{0}$. Subsequently, $P_{c}$ is projected onto the sensor plane as $P_{i}=[x,y,1]^{T}$ based on the focal length of the camera. The final stage involves scaling and translating $P_{i}$ using intrinsic camera parameters to yield the shot coordinate $P_{s}=[u,v,1]^{T}$. The resulting 2D projection $P_{2d}=(u,v)$ can then be derived by extracting the corresponding values from $P_{s}$. Utilizing this process, we can project the entire actor pose $M_{0}\in\mathbb{R}^{J\times Q}$, given the camera placement $C_{0}$, to $M_{p}\in\mathbb{R}^{J\times 2}$ through multi-dimensional matrix operations.
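The three projection steps can be illustrated with a small pinhole-camera sketch. Several details here are assumptions rather than the paper's specification: the last three elements of each 6-vector are taken to be Euler angles in degrees composed as $R = R_z R_y R_x$, the camera is assumed to look along its local $+Z$ axis, and the names `focal`, `scale`, and `principal` are hypothetical stand-ins for the focal length and intrinsic parameters.

```python
import math

def rotation_matrix(rx, ry, rz):
    """Camera-to-world rotation R = Rz * Ry * Rx (radians; assumed convention)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def project_joint(p_j, c_0, focal, scale, principal):
    """Sketch of f_p: map one joint to shot coordinates (u, v).

    p_j, c_0: 6-vectors (x, y, z, rx, ry, rz), rotations in degrees (assumed).
    Returns None when the joint lies behind the camera.
    """
    rot = rotation_matrix(*(math.radians(a) for a in c_0[3:6]))
    # Step 1: world -> camera coordinates, P_c = R^T (P_w - t).
    d = [p_j[i] - c_0[i] for i in range(3)]
    xc, yc, zc = (sum(rot[r][c] * d[r] for r in range(3)) for c in range(3))
    if zc <= 0:
        return None
    # Step 2: camera -> sensor plane, using the focal length.
    x, y = focal * xc / zc, focal * yc / zc
    # Step 3: sensor plane -> shot, scaling and translating with intrinsics.
    return (scale[0] * x + principal[0], scale[1] * y + principal[1])

def project_pose(m_0, c_0, focal, scale, principal):
    """Project a whole pose (J x 6) to J shot coordinates."""
    return [project_joint(p, c_0, focal, scale, principal) for p in m_0]

# Toy usage: camera at the origin, one joint 5 units straight ahead.
camera = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
pose = [[0.0, 0.0, 5.0, 0.0, 0.0, 0.0], [1.0, 0.0, 5.0, 0.0, 0.0, 0.0]]
shot = project_pose(pose, camera, focal=1.0, scale=(100.0, 100.0), principal=(320.0, 180.0))
```

A joint on the optical axis lands on the principal point, and lateral offsets shrink with depth through the division by $Z_{c}$.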

Projection-related variables: Based on the camera projection process $f_{p}$, the variables utilized in our loss components are derived as follows:

  • $M_{p}^{*}\in\mathbb{R}^{J\times 2}$ is obtained by $f_{p}(M_{0},C_{0}^{*})$ using the aesthetically adjusted camera placement $C_{0}^{*}$. Based on the actual size of the shot, joints detected off-frame are set to $(0,0)$ in $M_{p}^{*}$.

  • $\overline{M_{p}^{*}}$ is obtained by computing the weighted mean of $M_{p}^{*}$, with the weights following a normal distribution. The torso center joint is assigned a weight corresponding to the peak of the distribution, while the weights of the other joints decrease with their distances to the torso center. Note that off-frame joints are excluded from this operation. This keeps the body center robustly close to the torso regardless of how much of the actor pose is visible.

  • $M_{b}^{*}$ is derived by binarizing $M_{p}^{*}$ to indicate whether a certain joint is on-frame (labeled as 1) or off-frame (labeled as 0). This is achieved by multiplying the projected $u$ and $v$ coordinates of each joint in $M_{p}^{*}$ and using 0 as a threshold for binarization.

  • $M_{b}$ is obtained following the same process as for $M_{b}^{*}$. Instead of using the adjusted camera placement $C_{0}^{*}$, its derivation is based on the initial camera placement $C_{0}$ to obtain the original shot preference from user input.
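The derivations listed above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the joint-to-torso distance is assumed to be the Euclidean distance in shot coordinates, the Gaussian width `sigma` is a hypothetical parameter, and the fallback used when the torso itself is off-frame (centering the weights on the plain mean of the visible joints) is an added assumption.

```python
import math

def mask_off_frame(joints_2d, width, height):
    """Build M_p^*: joints projected outside the shot are set to (0, 0)."""
    return [
        (u, v) if 0 < u <= width and 0 < v <= height else (0.0, 0.0)
        for u, v in joints_2d
    ]

def binarize(m_p):
    """Build M_b: 1 if u * v > 0 (on-frame), else 0 (off-frame)."""
    return [1 if u * v > 0 else 0 for u, v in m_p]

def body_center(m_p, torso, sigma):
    """Gaussian-weighted mean of the on-frame joints, peaked at the torso joint."""
    visible = [i for i, b in enumerate(binarize(m_p)) if b]
    if not visible:
        return None  # nothing on-frame, no body center
    if torso in visible:
        ref = m_p[torso]
    else:
        # Assumed fallback: center the weights on the mean of visible joints.
        ref = (
            sum(m_p[i][0] for i in visible) / len(visible),
            sum(m_p[i][1] for i in visible) / len(visible),
        )
    weights = [
        math.exp(-math.dist(m_p[i], ref) ** 2 / (2 * sigma ** 2)) for i in visible
    ]
    total = sum(weights)
    u = sum(w * m_p[i][0] for w, i in zip(weights, visible)) / total
    v = sum(w * m_p[i][1] for w, i in zip(weights, visible)) / total
    return (u, v)

# Toy usage: three projected joints, the second one falls outside a 640x360 shot.
m_p_star = mask_off_frame([(100.0, 100.0), (-5.0, 50.0), (120.0, 100.0)], 640, 360)
m_b_star = binarize(m_p_star)
center = body_center(m_p_star, torso=0, sigma=20.0)
```

Because the weight falls off with distance from the torso, a distant limb that happens to be visible pulls the body center far less than the joints around the torso do.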

From decision tree to $A_{l}$: Referring to the main text, we use a decision tree to determine the alignment lines $A_{l}$ that should be followed to improve aesthetic composition based on the rule-of-thirds principle. In this subsection, we provide the detailed derivation process for obtaining $A_{l}$, taking the actor pose and shot side into consideration. To indicate whether the actor is in a lying or standing pose, we compute the height difference between the actor’s head and pelvis joints, denoted as $hpd$. Additionally, the side of the shot is represented by the relative angle $ra$ between the camera and the actor. Given that the on-frame body center $\overline{M_{p}^{*}}=(u_{m},v_{m})$ has been obtained and each shot is of size $(width,height)$, the alignment candidates $A_{l}=\{(u_{1},v_{1}),(u_{2},v_{2})\}$ can be derived as shown in Algorithm 1.

$\{(u_{1},v_{1}),(u_{2},v_{2})\}\leftarrow\{(0,0),(0,0)\}$;
$thres\leftarrow$ Lie-to-stand threshold;
if $hpd\geq thres$ then // Confirmed stand pose
       if $ra\in[45^{\circ},135^{\circ}]$ then // Confirmed right shot
             $u_{1},u_{2}\leftarrow\frac{1}{3}width$;
             $v_{1},v_{2}\leftarrow v_{m}$;
       else if $ra\in[225^{\circ},315^{\circ}]$ then // Confirmed left shot
             $u_{1},u_{2}\leftarrow\frac{2}{3}width$;
             $v_{1},v_{2}\leftarrow v_{m}$;
       else // Confirmed front or back shot
             $u_{1}\leftarrow\frac{1}{3}width$;
             $u_{2}\leftarrow\frac{2}{3}width$;
             $v_{1},v_{2}\leftarrow v_{m}$;
       end if
else
       if $ra\in[45^{\circ},135^{\circ}]\cup[225^{\circ},315^{\circ}]$ then
             $v_{1}\leftarrow\frac{1}{3}height$;
             $v_{2}\leftarrow\frac{2}{3}height$;
             $u_{1},u_{2}\leftarrow u_{m}$;
       else
             $u_{1}\leftarrow\frac{1}{3}width$;
             $u_{2}\leftarrow\frac{2}{3}width$;
             $v_{1},v_{2}\leftarrow v_{m}$;
       end if
end if
Algorithm 1 Deriving Alignment Candidates
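Algorithm 1 translates directly into a short function. The sketch below mirrors its branches; the function name and argument layout are hypothetical, the lie-to-stand threshold is passed in by the caller because its value is not stated here, and wrapping $ra$ into $[0^{\circ},360^{\circ})$ is an added convenience.

```python
def alignment_candidates(hpd, ra, center, size, thres):
    """Return A_l = [(u1, v1), (u2, v2)] following Algorithm 1.

    hpd:    head-pelvis height difference
    ra:     relative camera-actor angle in degrees
    center: on-frame body center (u_m, v_m)
    size:   shot size (width, height)
    thres:  lie-to-stand threshold (value chosen by the caller)
    """
    u_m, v_m = center
    width, height = size
    ra = ra % 360.0  # wrap into [0, 360); an added convenience
    right_shot = 45.0 <= ra <= 135.0
    left_shot = 225.0 <= ra <= 315.0

    if hpd >= thres:  # standing pose
        if right_shot:
            u1 = u2 = width / 3
        elif left_shot:
            u1 = u2 = 2 * width / 3
        else:  # front or back shot: both vertical third lines are candidates
            u1, u2 = width / 3, 2 * width / 3
        v1 = v2 = v_m
    else:  # lying pose
        if right_shot or left_shot:
            # Side view of a lying actor: use the horizontal third lines.
            v1, v2 = height / 3, 2 * height / 3
            u1 = u2 = u_m
        else:
            u1, u2 = width / 3, 2 * width / 3
            v1 = v2 = v_m
    return [(u1, v1), (u2, v2)]

# Toy usage on a 900x600 shot with the body center at (300, 200).
standing_right = alignment_candidates(1.0, 90.0, (300.0, 200.0), (900, 600), thres=0.5)
lying_side = alignment_candidates(0.1, 270.0, (300.0, 200.0), (900, 600), thres=0.5)
```

When the two candidates coincide, a single third line is prescribed; when they differ, the adjustor is left to choose between the two lines.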

VII-C Platform and Application

To facilitate cinematic production and test the effectiveness of our proposed method, we have integrated our camera control system as a plug-in within a 3D virtual filmmaking application. As a supplement to the main text, this integration is further detailed in this section. Given scripts from the user, the application first conducts an automatic script analysis and staging, according to the methodologies described in [31]. Such processes efficiently pre-initialize scenarios, character behaviors—including actions and emotional variables—and camera placements for users. Notably, all these elements are customizable through interfaces, allowing the user to make adjustments based on their own preferences.

Figure 17: Instructions for using our integrated camera movement generation module in the application. See the text for details.

In the application, users are provided with three distinct camera tracks and multiple character tracks for designing camera and character behaviors. As indicated by the yellow box in Fig. 17, the “Manual Camera” track is user-editable, whereas the “Auto Camera” and “Moving Camera” tracks are reserved for automatically generated cameras, produced by [31] and our proposed method, respectively. The configuration of our camera movement generation module is displayed in the light blue box in Fig. 17. Here, flag 1 toggles camera trajectory synthesis and flag 2 toggles visual aesthetic adjustment, so users can flexibly tailor the generation result to their specific needs. Additionally, if the “AUTOCAM” checkbox is selected, the camera from the “Auto Camera” track is used as the initial placement for the subsequent camera movement generation; otherwise, the system defaults to the “Manual Camera” track.
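
The configuration logic described above can be summarized in a small Python sketch. All identifiers here (`MovementConfig`, `initial_camera_track`, `enabled_stages`, and the stage names) are hypothetical and chosen for illustration only, and the default values are assumptions; the sketch does not reflect the application's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MovementConfig:
    # Default values below are assumptions made for this sketch.
    synthesize_trajectory: bool = True  # "flag 1" in Fig. 17
    adjust_aesthetics: bool = True      # "flag 2" in Fig. 17
    autocam: bool = False               # the "AUTOCAM" checkbox


def initial_camera_track(cfg: MovementConfig) -> str:
    """Pick the track that supplies the initial camera placement."""
    return "Auto Camera" if cfg.autocam else "Manual Camera"


def enabled_stages(cfg: MovementConfig) -> List[str]:
    """List the generation stages switched on by the two flags."""
    stages = []
    if cfg.adjust_aesthetics:
        stages.append("aesthetic_adjustment")
    if cfg.synthesize_trajectory:
        stages.append("trajectory_synthesis")
    return stages
```

With the checkbox cleared, `initial_camera_track` returns the “Manual Camera” track, matching the default behavior described in the text.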

To synthesize camera movement, users can right-click on a deep blue behavior block on the character track and select the “Moving Camera Generation” option, as depicted in step 2 of Fig. 17. Generation then starts with the current settings, taking as input the initialization from the nearest camera location, the actor poses, and an emotion variable. Once the camera trajectory has been generated, it is displayed as a new camera behavior block on the “Moving Camera” track, highlighted with a purple box in Fig. 17. This type of camera is given the highest priority for playback in the monitor view. We also offer live usage instructions in the supplementary video.

VII-D Process of User Study

Figure 18: The overall testing flow for the user study. See the text for details.

In this section, we elaborate on each step of the user study. The whole testing flow is illustrated in Fig. 18. Before the test, we briefly introduce participants to basic cinematographic knowledge, including the actor-camera synchronization principle and the shooting techniques of spatial tracking, emotional styling, and aesthetic framing. This helps all participants become familiar with the test content quickly.

We begin the test by asking whether participants have any prior experience or background in cinematography, which allows us to categorize them as either professional or non-professional testers. Participants are then given 10 groups of blind tests. In each test group, participants are first presented with scenario scripts, in which critical descriptions are highlighted in distinct colors, together with the configuration settings for conducting auto-cinematography. After becoming familiar with the scripts, participants are shown a video generated by the baseline method [31], denoted as V000. They then watch four more videos produced using our method and its variant models, which are randomly labeled from V001 to V004 without the actual model names.
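
The random labeling step can be sketched as follows. This is a minimal illustration, assuming a simple shuffle; the function name `blind_labels`, the optional `seed` argument, and the placeholder model names in the usage note are hypothetical and are not taken from our test platform.

```python
import random
from typing import Dict, List, Optional


def blind_labels(model_names: List[str], seed: Optional[int] = None) -> Dict[str, str]:
    """Map anonymous labels V001, V002, ... to a random ordering of the models."""
    rng = random.Random(seed)      # local generator, so global state is untouched
    shuffled = list(model_names)   # copy to avoid mutating the caller's list
    rng.shuffle(shuffled)
    return {f"V{i:03d}": name for i, name in enumerate(shuffled, start=1)}
```

For example, `blind_labels(["model_a", "model_b", "model_c", "model_d"])` returns a dictionary with the keys V001 to V004, each mapped to one of the four placeholder names in random order. The label V000 is reserved for the baseline video and is therefore not produced by this function.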

After viewing all videos, participants are asked to rate the immersion enhancement of videos V001 to V004 relative to V000, using a 1 to 5 scale with 0.5 increments. Scores below 3 indicate that the method does not enhance immersion beyond the baseline [31], while scores above 3 indicate increasing improvement in immersive performance. To facilitate precise qualitative evaluations, during the test we offer both independent video assessments and side-by-side video comparisons, where the latter are available in one-pair and four-pair formats. This ensures that participants can clearly observe the perceptual quality of the videos from each method. Finally, all submitted test results are collected for the data analysis described in the main text.
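
The rating scale can be made concrete with a short Python sketch. The function names `is_valid_score` and `interpret` and the returned strings are hypothetical. The text above does not state how a score of exactly 3 is read, so the sketch reports it as "unspecified" rather than assuming a meaning.

```python
def is_valid_score(score: float) -> bool:
    """Check that a score lies on the 1 to 5 scale with 0.5 increments."""
    value = float(score)
    return 1.0 <= value <= 5.0 and (value * 2).is_integer()


def interpret(score: float) -> str:
    """Read a score relative to the baseline video V000."""
    if not is_valid_score(score):
        raise ValueError(f"invalid score: {score}")
    if score > 3:
        return "improved"      # increasing improvement over the baseline
    if score < 3:
        return "not improved"  # no enhanced immersion beyond the baseline
    return "unspecified"       # boundary case is not defined in the text
```

Under this sketch, a score of 4.5 is read as an improvement over the baseline, while a value such as 3.2 is rejected because it does not fall on a 0.5 increment.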

References

  • [1] B. Brown, Cinematography Theory and Practice: Imagemaking for Cinematographers & Directors.   Routledge, 2016.
  • [2] A. Truong, F. Berthouzoz, W. Li, and M. Agrawala, “Quickcut: An interactive tool for editing narrated video,” in Proceedings of the 29th Annual Symposium on User Interface Software and Technology, 2016, pp. 497–507.
  • [3] M. Wang, G.-W. Yang, S.-M. Hu, S.-T. Yau, and A. Shamir, “Write-a-video: computational video montage from themed text.” ACM Trans. Graph., vol. 38, no. 6, pp. 177–1, 2019.
  • [4] I. Arev, H. S. Park, Y. Sheikh, J. Hodgins, and A. Shamir, “Automatic editing of footage from multiple social cameras,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–11, 2014.
  • [5] C. Liang, C. Xu, J. Cheng, W. Min, and H. Lu, “Script-to-movie: a computational framework for story movie composition,” IEEE transactions on multimedia, vol. 15, no. 2, pp. 401–414, 2012.
  • [6] A. Louarn, M. Christie, and F. Lamarche, “Automated staging for virtual cinematography,” in Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games, 2018, pp. 1–10.
  • [7] I. Karakostas, I. Mademlis, N. Nikolaidis, and I. Pitas, “Shot type constraints in uav cinematography for autonomous target tracking,” Information Sciences, vol. 506, pp. 273–294, 2020.
  • [8] Q. Galvane, “Automatic cinematography and editing in virtual environments.” Ph.D. dissertation, Université Grenoble Alpes (ComUE), 2015.
  • [9] H. Subramonyam, W. Li, E. Adar, and M. Dontcheva, “Taketoons: Script-driven performance animation,” in Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, 2018, pp. 663–674.
  • [10] M. Gschwindt, E. Camci, R. Bonatti, W. Wang, E. Kayacan, and S. Scherer, “Can a robot become a movie director? learning artistic principles for aerial cinematography,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 1107–1114.
  • [11] Z. Yu, C. Yu, H. Wang, and J. Ren, “Enabling automatic cinematography with reinforcement learning,” in 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR).   IEEE, 2022, pp. 103–108.
  • [12] C. Huang, C.-E. Lin, Z. Yang, Y. Kong, P. Chen, X. Yang, and K.-T. Cheng, “Learning to film from professional human motion videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4244–4253.
  • [13] C. Huang, Y. Dang, P. Chen, X. Yang, and K.-T. Cheng, “One-shot imitation drone filming of human motion videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5335–5348, 2021.
  • [14] H. Jiang, B. Wang, X. Wang, M. Christie, and B. Chen, “Example-driven virtual cinematography by learning camera behaviors,” ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 45–1, 2020.
  • [15] H. Jiang, M. Christie, X. Wang, L. Liu, B. Wang, and B. Chen, “Camera keyframing with style and control,” ACM Transactions on Graphics (TOG), vol. 40, no. 6, pp. 1–13, 2021.
  • [16] H. Wang, D. Smith, and M. Kudelska, “Enabling automatic cinematography with reinforcement learning,” in 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR).
  • [17] T. Flight. (2021) The succession character you never see. [Online]. Available: https://www.youtube.com/watch?v=_lU91279xZk
  • [18] E. Puschak. (2017) How david fincher hijacks your eyes. [Online]. Available: https://www.youtube.com/watch?v=GfqD5WqChUY
  • [19] H. Mäcklin et al., “Going elsewhere: A phenomenology of aesthetic immersion,” 2019.
  • [20] P. Burelli, “Game cinematography: From camera control to player emotions,” in Emotion in Games.   Springer, 2016, pp. 181–195.
  • [21] B. Krages, Photography: the art of composition.   Simon and Schuster, 2012.
  • [22] D. Li, H. Wu, J. Zhang, and K. Huang, “A2-rl: Aesthetics aware reinforcement learning for image cropping,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8193–8201.
  • [23] C. Hong, S. Du, K. Xian, H. Lu, Z. Cao, and W. Zhong, “Composing photos like a photographer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7057–7066.
  • [24] H. Huang, Y. Feng, C. Shi, L. Xu, J. Yu, and S. Yang, “Free-bloom: Zero-shot text-to-video generator with llm director and ldm animator,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [25] J. Wang, M. Xu, L. Jiang, and Y. Song, “Attention-based deep reinforcement learning for virtual cinematography of 360 degree videos,” IEEE Transactions on Multimedia, vol. 23, pp. 3227–3238, 2020.
  • [26] C. D. Manning, An introduction to information retrieval.   Cambridge university press, 2009.
  • [27] X. Yang, T. Zhang, and C. Xu, “Text2video: An end-to-end learning framework for expressing text with videos,” IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2360–2370, 2018.
  • [28] M. Hayashi, S. Inoue, M. Douke, N. Hamaguchi, H. Kaneko, S. Bachelder, and M. Nakajima, “T2v: New technology of converting text to cg animation,” ITE Transactions on Media Technology and Applications, vol. 2, no. 1, pp. 74–81, 2014.
  • [29] P. Pueyo, J. Dendarieta, E. Montijano, A. C. Murillo, and M. Schwager, “Cinempc: A fully autonomous drone cinematography system incorporating zoom, focus, pose, and scene composition,” IEEE Transactions on Robotics, 2024.
  • [30] R. Bonatti, W. Wang, C. Ho, A. Ahuja, M. Gschwindt, E. Camci, E. Kayacan, S. Choudhury, and S. Scherer, “Autonomous aerial cinematography in unstructured environments with learned artistic decision-making,” Journal of Field Robotics, vol. 37, no. 4, pp. 606–641, 2020.
  • [31] Z. Yu, H. Wang, A. K. Katsaggelos, and J. Ren, “A novel automatic content generation and optimization framework,” IEEE Internet of Things Journal, 2023.
  • [32] Y. Ren, N. Yan, X. Yu, F. Tang, Q. Tang, Y. Wang, and W. Lu, “On automatic camera shooting systems via ptz control and dnn-based visual sensing,” Intelligent Service Robotics, pp. 1–21, 2023.
  • [33] C. Xie, I. Hemmi, H. Shishido, and I. Kitahara, “Camera motion generation method based on performer’s position for performance filming,” in 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE).   IEEE, 2023, pp. 957–960.
  • [34] Z. Yu, X. Wu, H. Wang, A. K. Katsaggelos, and J. Ren, “Automated adaptive cinematography for user interaction in open world,” IEEE Transactions on Multimedia, 2023.
  • [35] Y. Dang, C. Huang, P. Chen, R. Liang, X. Yang, and K.-T. Cheng, “Path-analysis-based reinforcement learning algorithm for imitation filming,” IEEE Transactions on Multimedia, 2022.
  • [36] X. Wang, R. Courant, J. Shi, E. Marchand, and M. Christie, “Jaws: Just a wild shot for cinematic transfer in neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16 933–16 942.
  • [37] H. Jung, H.-J. Lee, and C. E. Rhee, “Immersive virtual reality content supporting a wide and free viewpoint made with a single 360° camera,” IEEE Access, 2023.
  • [38] T. Horiuchi, S. Okubo, and T. Kobayashi, “Augmented immersive viewing and listening experience based on arbitrarily angled interactive audiovisual representation,” in Proceedings of the 25th International Conference on Multimodal Interaction, 2023, pp. 79–83.
  • [39] D. Traparic, M.-C. Larabi, and L. Bellatreche, “Towards automatic content generation for immersive cinema theater based on artificial intelligence,” in 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP).   IEEE, 2023, pp. 1–6.
  • [40] Z. Wei, X. Xu, L.-H. Lee, W. Tong, H. Qu, and P. Hui, “Feeling present! from physical to virtual cinematography lighting education with metashadow,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1127–1136.
  • [41] Morgan. (2013) Camera movement tutorial: How to create emotion. [Online]. Available: https://theslantedlens.com/2013/camera-movement-tutorial-how-to-create-emotion/
  • [42] M. Sayed, R. Cinca, E. Costanza, and G. Brostow, “Lookout! interactive camera gimbal controller for filming long takes,” ACM Transactions on Graphics (TOG), vol. 41, no. 3, pp. 1–16, 2022.
  • [43] R. Bonatti, A. Bucker, S. Scherer, M. Mukadam, and J. Hodgins, “Batteries, camera, action! learning a semantic control space for expressive robot cinematography,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 7302–7308.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [45] H.-Y. Wu, F. Palù, R. Ranon, and M. Christie, “Thinking like a director: Film editing patterns for virtual cinematographic storytelling,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 14, no. 4, pp. 1–22, 2018.
  • [46] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [47] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1.   IEEE, 2005, pp. 539–546.
  • [48] H.-Y. Lee, X. Yang, M.-Y. Liu, T.-C. Wang, Y.-D. Lu, M.-H. Yang, and J. Kautz, “Dancing to music,” Advances in neural information processing systems, vol. 32, 2019.
  • [49] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.
  • [50] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [52] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [53] L. Zhao, M. Shang, F. Gao, R. Li, F. Huang, and J. Yu, “Representation learning of image composition for aesthetic prediction,” Computer Vision and Image Understanding, vol. 199, p. 103024, 2020.
  • [54] S. Björk and J. Holopainen, “Games and design patterns,” The game design reader: A rules of play anthology, pp. 410–437, 2005.
  • [55] J. Mekas, “A note on the shaky camera,” Film Culture, no. 24-27, p. 40, 1962.
  • [56] S. Wright. (2017) The rule of thirds: What is it? filmmaking and photography training. [Online]. Available: https://www.youtube.com/watch?v=A7wnhDKyBuM
  • [57] P. May, Essential Digital Video Handbook: A Comprehensive Guide to Making Videos That Make Money.   Routledge, 2020.