
¹ LIX, Ecole Polytechnique, IP Paris
² LIGM, Ecole des Ponts, CNRS, UGE
³ Inria, IRISA, CNRS, Univ. Rennes

E.T. the Exceptional Trajectories:
Text-to-camera-trajectory generation
with character awareness

Robin Courant¹, Nicolas Dufour¹,², Xi Wang¹, Marc Christie³, Vicky Kalogeiton¹
Abstract

Stories and emotions in movies emerge through the effect of well-thought-out directing decisions, in particular camera placement and movement over time. Crafting compelling camera trajectories remains a complex iterative process, even for skilful artists. To tackle this, in this paper, we propose a dataset called the Exceptional Trajectories (E.T.) with camera trajectories along with character information and textual captions encompassing descriptions of both camera and character. To our knowledge, this is the first dataset of its kind. To show the potential applications of the E.T. dataset, we propose a diffusion-based approach, named Director, which generates complex camera trajectories from textual captions that describe the relation and synchronisation between the camera and characters. To ensure robust and accurate evaluations, we train on the E.T. dataset CLaTr, a Contrastive Language-Trajectory embedding for evaluation metrics. We posit that our proposed dataset and method significantly advance the democratization of cinematography, making it more accessible to common users.

Figure 1: Different results generated by our camera trajectory diffusion system. Project page https://www.lix.polytechnique.fr/vista/projects/2024_et_courant.

1 Introduction

Cinematography is a collaborative and complex crafting process that mixes technical, artistic and storytelling skills. The ultimate objective is to communicate a distinct message to the audience, at a cognitive (e.g., revealing facts), emotional and aesthetic level, through tasks such as laying out the scene (mise-en-scène), setting up the lighting and making decisions to place and move the camera in relation to the characters, their actions or the overall scene content. In this context, the camera is the only window into this staged world and therefore plays a critical role in conveying the director’s intention. Through more than a hundred years of practice, cinematography has forged a common language for directors – the film grammar – that prescribes how to place and move the camera to achieve intended effects. Yet mastering camera placements and motions remains challenging, especially for novice users confronted with hundreds of possibilities and little insight into how to generate the best ones.

To lower the barriers in handling camera placement and camera motion, researchers have introduced a variety of methods. These include purely geometric approaches [4, 30], optimization- and control-based strategies [11, 12], as well as deep learning-grounded methodologies [23, 5, 20, 11] to interactively or automatically compute the parameters of camera trajectories. Typically, these methods address cinematographic tasks as either cinematic-rule-based control [20, 5, 12] or example-based imitation [23, 22, 45], conceptually resembling discriminative and regression models or registration and adaptation methods, respectively. Such techniques, however, suffer from the need to either design the underlying geometric model for each type of motion, or to design carefully crafted cost functions for each motion, and are often limited in their capacity to combine mixed motions creatively.

Recent advances in video generation [46, 52] enable users to explore more creative possibilities by capturing and reproducing camera motion in their generated videos. Jiang et al. [24] followed this path and addressed camera trajectory generation using diffusion models, which incorporate a high degree of controllability. Yet, this work displayed two main drawbacks: first, it relied on a character-centric coordinate system to simplify the problem, thus limiting its generation capabilities; and second, its evaluation metrics relied on camera trajectory features with oversimplified assumptions.

In other domains, generative techniques often rely on the availability of large datasets enriched with textual descriptions, such as language-motion datasets obtained via motion capture (mocap) [36, 14] or language-vision datasets [29, 40]. Yet in cinematography, there are no movie datasets where crucial cinematic information such as camera and character trajectories is available. Most recent approaches build on synthetic data [23, 22, 24], or general videos from streaming platforms (see [20] for drone trajectory generation, or [53] for dedicated real-estate videos) without the cinematic features that conform to the film grammar. Some example-based approaches address cinematic transfer tasks from real film clips [45, 25]; however, these approaches only retarget and adapt the camera trajectory with little control or variability in the results and do not encode cinematographic knowledge.

In this work, we propose a new camera trajectory dataset extracted from real movie clips, called E.T. the Exceptional Trajectories. It comprises camera trajectories together with textual descriptions of both camera and character trajectory over time (see Figure 2). E.T. contains more than 11M frames with the corresponding camera and character trajectories, as well as two types of captions: camera-only and camera-character, describing the trajectory of the camera with respect to the trajectory of the character. To our knowledge, E.T. is the first extensive dataset with geometric information on both camera and character trajectories accompanied by textual descriptions.

To exploit this dataset, we also propose Director (DiffusIon tRansformEr Camera TrajectORy), a diffusion-based model that generates camera trajectories by leveraging text descriptions and character information, as shown in Figure 1. This allows us to better encode the correlation between character and camera trajectories. Moreover, unlike previous methods [24] that use a constrained character-relative coordinate system, we propose to use a global coordinate system. Director relies on a classical diffusion framework with three distinct architectures for conditioning: in-context, AdaLN and cross-attention settings. Furthermore, we propose a language-trajectory embedding, CLaTr (Contrastive Language-Trajectory), trained at scale using the E.T. dataset. CLaTr serves as a foundation for computing default generative metrics for generated trajectories, similar to the Fréchet Inception Distance (FID) [16]. Our experiments show that all three architectures of Director successfully leverage the combination of input captions and character trajectories as conditions. Overall, Director sets the new state-of-the-art on the camera trajectory generation task.

Our contributions are: (1) We introduce the E.T. camera trajectory dataset extracted from real movie clips. We complement camera trajectories with character trajectories and captions for both camera and character. (2) We present Director, a camera trajectory diffusion model that exploits both character trajectories and textual descriptions. It offers higher controllability and granularity for users than existing approaches [24] and achieves state-of-the-art performances. (3) We propose CLaTr, a robust and accurate language-trajectory embedding, which facilitates the evaluation of camera trajectory generation models.

2 Related work

Figure 2: Example E.T. samples. Each subfigure presents frames from the original movie shot on the left, while the right side depicts the extracted and processed camera and character trajectories. Additionally, the bottom part showcases the generated camera trajectory caption with or without the character trajectory.

Camera control. Over the past twenty years, there have been several paradigm shifts in camera planning and control. Initial studies [4] predominantly focused on geometric modeling [30] and rule-based trajectory controls [11] to direct and create camera trajectories that comply with either hand-crafted cinematic rules or image-based criteria. With the progress of deep learning, [23] introduced a method to synthesize camera trajectory for 3D animations in two stages: (i) capturing cinematic styles from a reference clip using a Mixture-of-Experts model, and (ii) generating trajectories based on 3D character animations autoregressively. Subsequent research [22], building on this, incorporates keyframing to provide extra control such as positional and velocity constraints. More recently, JAWS [45] pioneered the direction for example-based camera retargeting within a Neural Radiance Field (NeRF) [31] setting, by optimising camera trajectory directly given the 2D reference clip in a 3D NeRF. All these example-based methods share a common limitation: they struggle with generalization because they require carefully selected reference videos to ensure high quality.

Unlike example-based methods, many cinematic-rule-based methods readily integrate with Deep Reinforcement Learning (DRL) and Imitation Learning (IL) techniques, particularly in the drone cinematography domain: [20] exploit optical flow and human poses to guide drone controls via an IL framework. Similarly, [5] use DRL to control drone actions for multiple rewards, including obstacle avoidance, target tracking, shooting style etc. Recently, GAIT [48] employs an aesthetic score-based RL method instead of handcrafted rewards to control the camera in the virtual 3D environment. However, these RL-based camera control approaches also have limitations: (1) they need environment-specific training; (2) they inherently restrict the diversity of results, often leading to collapsed trajectory styles. Instead, we leverage the generalization capabilities of generative models to address the camera control task.

Camera diffusion. Generative models have recently gained much progress and attention in domains such as textual-conditioned image generation [39, 37, 33], video synthesis [41, 3] and human motion generation [42, 7, 51]. Among these, diffusion models stand out for their strong ability to produce high-fidelity and diverse generative samples [47, 10], making them particularly well-suited for camera trajectory generation tasks.

The first application of diffusion models in camera control is the Cinematographic Camera Diffusion (CCD) [24], which relies on the MDM architecture (human Motion Diffusion Model) [42] and is trained on synthetic data. However, CCD simplifies the task by expressing all the camera trajectories in character-centric relative coordinates. Its small-scale synthetic training dataset also limits the broader application of the method (e.g., only a 48-word vocabulary is used during training), thus making it unable to generate camera trajectories from real datasets and, in turn, impractical for common users. In contrast, in our proposed E.T. dataset, we represent camera trajectories in a global coordinate system, distinct from character trajectories. This approach allows for more diverse correlations between character and camera movements. Additionally, E.T. offers a rich vocabulary ($\sim$5.4k) and extensive camera trajectory data.

Recent literature also includes several text-to-video generation techniques that can handle different categories of camera motions [46, 52]. These methods, however, assume access to 3D camera trajectories, whereas our approach generates them. Furthermore, they typically overlook the camera’s primary targets (i.e., the characters), which are essential for defining camera trajectories. In contrast, our dataset contains character information, and we leverage it to generate camera trajectories that focus on a specific target character.

Camera trajectory datasets. Many modern generative methods leverage large multimodal datasets. For instance, in text-to-image generation, the default dataset is LAION [40] with around 400 million image-text pairs. Similarly, in human motion synthesis, the large-scale KIT [36] and HumanML3D [14] datasets offer detailed textual captions that enhance comprehension of human motion. Yet, for camera control, only a few datasets are available [53, 24]. This is largely due to the intricacies involved in extracting camera poses from real-world videos, especially in cinematic contexts due to the presence of stylistic elements (e.g. motion blur or depth-of-field). Zhou et al. [53] applied Structure-from-Motion (SfM) methods to YouTube real-estate videos, creating the RealEstate10K dataset. This dataset, designed primarily for 3D reconstruction, comprises solely smooth camera movements and limited scene variation, lacking the nuanced complexity of cinematic camera motion and human presence. More recently Jiang et al. [24] introduced a synthetic cinematic camera trajectory dataset, aiming to circumvent extraction challenges. However, this dataset oversimplifies the intricate cinematic dynamics present in real-world movies.

A recent breakthrough in 3D human pose estimation for videos, termed SLAHMR [13], offers a compelling trade-off between robustness and accuracy by jointly optimizing camera and character trajectory estimations. Motivated by the lack of camera trajectory datasets, the capabilities of SLAHMR and the recent advances in other domains, we propose a new multi-modal camera trajectory dataset E.T. extracted from cinematic content, which we enhance with automatically generated captions for camera and character trajectories.

3 Exceptional Trajectories (E.T.)

Dataset | #Samples | #Frames | #Hours | Domain | Character Traj | Character #Captions | Camera Traj | Camera #Captions | #Vocabulary
KIT Motion-Language [36] | 4K | 0.8M | 11.23 | Mocap | | 6K | | - | 1,623
HumanML3D [14] | 14K | 2M | 28.59 | Mocap | | 45K | | - | 5,371
RealEstate10k [53] | 79K | 11M | 121 | Youtube | | - | | - | -
CCD [24] | 25K | 4.5M | 50 | Synthetic | | - | | 25K | 48
E.T. (Ours) | 115K | 11M | 120 | Movie | | 115K | | 230K | 1,790

Table 1: Dataset comparison. We compare the E.T. dataset to (i) two human motion datasets, KIT [36] and HumanML3D [14]; and (ii) two camera trajectory datasets, RealEstate10K [53] and CCD [24]. Here the notion of sample is common across all datasets and corresponds to data associated with a continuous temporal sequence.

We introduce a camera trajectory dataset called Exceptional Trajectories (E.T.), extracted from real movies. E.T. is built upon the Condensed Movies Dataset (CMD) [1]. Each sample in E.T. represents a camera trajectory at the shot level together with a character trajectory and two types of textual captions: a camera-only caption, which describes the camera motion; and a joint camera-character trajectory caption, which describes the motion of the camera according to the motion of the character (see Figure 2). Below, we describe the key properties and statistics of E.T. (Section 3.1) followed by the creation pipeline (Section 3.2).

3.1 E.T. properties and statistics

The key properties of E.T. are as follows:

Cinematic content. The camera trajectories in E.T. are both realistic and cinematic, since they are extracted from real-world movies (Table 1). This dual nature allows for effective modelling of various visual styles, in contrast to RealEstate10k’s [53] focus on shots characterized by smooth camera trajectories and limited scene variation. Furthermore, by extracting data from real-world movies, E.T. sets itself apart from CCD [24], which only relies on synthetic camera trajectories.

Scale. E.T. is built upon 16,210 different scenes from CMD [1]. It comprises 115K samples spanning 11M frames and totalling 120 hours of footage, offering extensive and diverse camera and character (human) trajectories based on real movies. In contrast, existing human motion datasets are much smaller, with only 11.23 hours for KIT [36] and 28.59 hours for HumanML3D [14] (see Table 1). When compared against datasets with camera trajectories, it far exceeds CCD [24] in terms of hours, frames and samples. Although its scale is comparable to RealEstate10k [53], it provides additional character trajectories and captions referring to real movies as opposed to RealEstate10k, which focuses only on camera trajectories in another domain.

Controllability. E.T. stands out by comprising not only camera and character trajectories but also camera-only and camera-character captions (see Figure 2). Incorporating caption information into the model offers multiple advantages: (1) it democratizes the input format for general users; and (2) it adds complementary semantic information to the trajectory data. In comparison, RealEstate10k lacks captions entirely. CCD’s captions are limited by a small vocabulary size and focus only on camera while lacking character information¹. The richness and complexity of E.T.’s captions are on par in terms of vocabulary size (above a thousand) with human motion datasets such as KIT and HumanML3D, which provide detailed, hand-crafted human motion descriptions².

¹ Note that CCD indirectly comprises camera trajectories through the character-relative coordinate system.
² Note that E.T. has no overlap with human motion datasets. E.T.’s extracted 3D poses (see Section 3.2) are less accurate than the ones in motion capture, while its captions describe camera trajectory relative to character trajectory, as opposed to describing exact human motions targeted by these datasets.

Statistics. Figures 4(a) and 4(b) display the statistics of the E.T. dataset, confirming the diversity and the coverage of all six degrees of freedom for both camera and character trajectories (see more in Appendix B.1).

3.2 Dataset creation pipeline

Figure 3: Dataset creation pipeline. Given RGB frames from a video, we first extract and pre-process camera and character poses, then tag resulting camera and character trajectories (sequence of poses) to obtain rough independent descriptions (middle part). Finally, we translate these descriptions into rich textual captions, aligning the camera trajectory with that of the character (right part).

E.T. is constructed by a three-step process (see Figure 3). First, we extract the 3D coordinates of cameras and characters over time, which we further refine to form uniform trajectories. Second, we perform motion tagging, i.e. partition each trajectory into segments with each segment comprising a pure camera motion that we label (tag). Third, we generate captions that describe both the camera and the character trajectory over time. We detail each step below.

Data extraction and pre-processing.

To extract camera and character poses, we apply the joint camera and 3D human pose estimator SLAHMR [50] to each shot. Given the complexity of estimating 3D poses from 2D data, the raw outputs tend to be noisy. To address this, we perform various pre-processing steps such as alignment, filtering, smoothing and cropping to a maximum length of 300 frames as in [14]. Refer to Appendix B.2 for further details.

Motion tagging.

Our objective is to partition camera or character trajectories into segments of pure motion: tags. Besides static, we consider the six fundamental motions across three degrees of freedom: lateral movements (left and right), vertical movements (up and down) and depth movements (forward and backward). Each trajectory is partitioned into motion tags with one, two, or three pure motions, totalling 27 combinations (see Figure 4(a)).

We propose a thresholding-based method that uses trajectory velocity for motion tagging. This method consists of two stages: (i) for each dimension (XYZ), we use an initial threshold on velocity to detect whether the camera or character remains static along the dimension; (ii) when multiple dimensions are non-static, we calculate pairwise velocity rates and use a threshold to pinpoint dominant velocities. A dimension is classified as static if its velocity is outmatched. The tag of motion between two points is then determined by the combination of non-static dimensions. Finally, we apply smoothing to avoid noisy and sparse tags and hence enhance the overall trajectory-level tagging.
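The two-stage tagging described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the threshold values, the axis convention, the finite-difference velocity and the majority-vote smoothing are all assumptions, since the text does not specify them. It uses linear velocity only, which corresponds to the character case.

```python
from collections import Counter

# Hypothetical thresholds: the text does not state the values actually used.
STATIC_THRESHOLD = 0.02   # minimum |velocity| for a dimension to count as moving
DOMINANCE_RATIO = 3.0     # a dimension is outmatched if another is this many times faster

# Assumed axis convention: X = right/left, Y = up/down, Z = forward/backward.
AXIS_LABELS = (("right", "left"), ("up", "down"), ("forward", "backward"))


def tag_velocity(velocity):
    """Return the motion tag for one (vx, vy, vz) velocity."""
    speeds = [abs(v) for v in velocity]
    # Stage (i): per-dimension static test.
    moving = [s > STATIC_THRESHOLD for s in speeds]
    # Stage (ii): pairwise ratios; drop dimensions outmatched by a faster one.
    kept = list(moving)
    for i in range(3):
        for j in range(3):
            if i != j and moving[i] and moving[j]:
                if speeds[j] >= DOMINANCE_RATIO * speeds[i]:
                    kept[i] = False
    parts = [AXIS_LABELS[k][0 if velocity[k] > 0 else 1] for k in range(3) if kept[k]]
    return "+".join(parts) if parts else "static"


def smooth_tags(tags, window=5):
    """Majority vote over a sliding window (one possible smoothing choice)."""
    half = window // 2
    smoothed = []
    for i in range(len(tags)):
        neighbourhood = tags[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighbourhood).most_common(1)[0][0])
    return smoothed


def tag_trajectory(positions, window=5):
    """Tag a list of 3D positions using per-frame finite-difference velocities."""
    velocities = [tuple(b - a for a, b in zip(p, q))
                  for p, q in zip(positions, positions[1:])]
    return smooth_tags([tag_velocity(v) for v in velocities], window)
```

For example, `tag_velocity((1.0, 0.1, 0.0))` yields `"right"` under these assumed thresholds, because the vertical component is more than three times slower than the lateral one and is therefore treated as static.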

For camera trajectory tagging, we use the rigid body velocity $\in SE(3)$, derived from rotation and translation, to account for the camera’s facing direction. This enables us to differentiate between similar motions, such as ‘trucking’, where the camera moves along an axis with a perpendicular facing direction, and ‘depth’, where the facing direction aligns with the movement axis. For character trajectory tagging, we assume that characters face the direction of their movement. Hence, we represent character trajectory using only the linear velocity, as derived from the translation of their hip centres.

These steps result in a coarse description of both camera and character trajectories over time as shown in Figure 3 (left).

Caption generation.

Our objective is to provide rich textual descriptions of the extracted camera trajectories according to the character trajectory. In movies, cameras typically move relative to the subject being filmed, i.e., the main character. Therefore, for each shot, we first identify the main character following [43]³ based on the temporal and spatial coverage of their bounding boxes within the shot. Then, for both camera and main character trajectories, we generate captions for each motion tag, as shown in the center of Figure 3. Next, inspired by [9], we convert the descriptions obtained via motion tagging for camera and character trajectories into detailed textual annotations. For this, we prompt an LLM, Mistral-7B [21], to generate camera trajectory captions by referencing the main character’s trajectory as anchor points. Our prompt formulation follows a structured approach with context, instruction, constraint, and example. Further details can be found in Appendix B.3.

³ Hitchcock’s rule: ‘the size of an object in the frame should equal its importance in the story at the moment’ [43].

This step results in a rich description of both camera and character trajectories over time as shown in Figure 3 (right).

Figure 4: E.T. statistics. (a) Camera segment distribution. (b) Character segment distribution. Each panel shows the number of frames (log scale) per motion tag.

4 Method

Here, we introduce our proposed DiffusIon tRansformEr Camera TrajectORy (Director) method for camera trajectory generation (Section 4.1). Director takes as input the character trajectory with the camera-character caption and generates a camera trajectory. Additionally, we present the Contrastive Language-Trajectory embedding (CLaTr) that serves as a basis for creating a common space between text and trajectories (Section 4.2), enabling the computation of evaluation metrics.

4.1 Camera trajectory diffusion

Problem formulation.

We consider a camera trajectory $\mathbf{x}_{1:N}$ as a sequence of $N$ consecutive camera poses. Each camera pose $\mathbf{x}=[\mathbf{R}|\mathbf{t}]$ comprises a rotation $\mathbf{R}$ representing the camera’s orientation and a translation $\mathbf{t}$ indicating its position. We aim at generating camera trajectories under two conditions: (i) a target character trajectory $\mathbf{h}_{1:N}$ capturing the 3D positions of the main character; and (ii) a textual description $c$ specifying the desired camera movement relative to the character movement.

Diffusion framework.

We follow the general diffusion paradigm established in EDM [26]. In essence, diffusion models consist of randomly sampling $\mathbf{x}^{0}\sim\mathcal{N}(\mathbf{0},\sigma_{\text{max}}^{2}\mathbf{I})$ and progressively denoising it to reach the endpoint $\mathbf{x}^{K}$ of this process, distributed according to the initial data distribution. During the training stage, we perturb an initial data distribution with standard deviation $\sigma_{\text{data}}$ with i.i.d. Gaussian noise of standard deviation $\sigma$. When $\sigma_{\text{max}}\gg\sigma_{\text{data}}$, the noise distribution is equivalent to a normal distribution $\mathcal{N}(\mathbf{0},\sigma_{\text{max}}^{2}\mathbf{I})$. We use these modified versions of the initial data distribution to train a denoiser module $D$, which takes as input a sample $\mathbf{x}$ to denoise, the two conditions (the character trajectory $\mathbf{h}$ and the caption $c$), and the corresponding standard deviation $\sigma$. Then, $D$ is trained using the denoising score matching loss:

$$\mathcal{L}_{\text{score}}=\big(D(\mathbf{x},\mathbf{h},c;\,\sigma)-\mathbf{x}\big)/\sigma^{2}. \tag{1}$$

During the sampling phase, we apply the 2nd order deterministic sampling introduced in EDM [26] with classifier-free guidance [19].
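The sampling procedure can be illustrated with a short sketch of a second-order (Heun) deterministic step combined with classifier-free guidance. The guidance formula used here, unconditional output plus $\omega$ times the difference between conditional and unconditional outputs, is the common formulation and is an assumption, as the exact form is not stated in the text. The `toy_denoiser` and the noise schedule are hypothetical stand-ins for the trained model $D$ and its actual schedule.

```python
def guided_denoise(denoiser, x, sigma, cond, omega):
    """Classifier-free guidance: blend conditional and unconditional outputs."""
    d_cond = denoiser(x, sigma, cond)
    d_uncond = denoiser(x, sigma, None)  # None stands for dropped conditioning
    return [u + omega * (c - u) for c, u in zip(d_cond, d_uncond)]


def heun_sample(denoiser, x, sigmas, cond, omega):
    """Deterministic 2nd-order sampling over a decreasing noise schedule."""
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = guided_denoise(denoiser, x, sigma, cond, omega)
        # Slope of the probability-flow ODE at the current noise level.
        d = [(xi - di) / sigma for xi, di in zip(x, denoised)]
        x_euler = [xi + (sigma_next - sigma) * di for xi, di in zip(x, d)]
        if sigma_next > 0:
            # Heun correction: average the slopes at both ends of the step.
            denoised_next = guided_denoise(denoiser, x_euler, sigma_next, cond, omega)
            d_next = [(xi - di) / sigma_next for xi, di in zip(x_euler, denoised_next)]
            x = [xi + (sigma_next - sigma) * 0.5 * (a + b)
                 for xi, a, b in zip(x, d, d_next)]
        else:
            x = x_euler  # last step reaches sigma = 0 with a plain Euler update
    return x


def toy_denoiser(x, sigma, cond):
    """Hypothetical denoiser: predicts 1.0 when conditioned, 0.0 otherwise."""
    target = 1.0 if cond is not None else 0.0
    return [target for _ in x]


sample = heun_sample(toy_denoiser, [5.0, -3.0], [80.0, 10.0, 1.0, 0.0],
                     cond="caption", omega=1.4)
```

With this toy denoiser, a guidance weight of 1 recovers the conditional prediction, while larger weights extrapolate beyond it, which is why the guidance weight trades off sample quality against adherence to the conditioning.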

Figure 5: DiffusIon tRansformEr Camera TrajectORy (Director). We display 3 variants of our diffusion model Director: (a) Director A, (b) Director B and (c) Director C. Director A incorporates the conditioning as in-context tokens. Director B leverages AdaLN modulation of the transformer block to add the conditioning. Director C uses the full text and character trajectory sequences by relying on cross-attention.

Director architecture.

Director (DiffusIon tRansformEr Camera TrajectORy) takes as input the character trajectory and the caption and generates a camera trajectory. Its architecture is illustrated in Figure 5. The base of Director is a pre-norm Transformer [44, 49]. We condition the transformer on the diffusion timestep, the character trajectory, and a textual description that describes the relative movement between the camera and character trajectories (see Figure 2). The timestep is tokenized using a sinusoidal positional embedding [44] and then mapped with an MLP.

Inspired by the DiT architecture variants [34], we explore three distinct ways to include the conditioning in the denoising process (Figure 5).

Director A (Figure 5(a)). The conditioning is added to the context of the transformer input. We only use the global CLIP token for the text, and we linearly embed the character trajectories, which are then average-pooled into a single token.

Director B (Figure 5(b)). Both conditionings (character trajectory and caption) are concatenated into a single token which gets mapped at each layer into 6 vectors, $\gamma_{1},\beta_{1},\lambda_{1},\gamma_{2},\beta_{2},\lambda_{2}$. Then, the layer-norm of the transformer is replaced by the following AdaLN operation:

$$\text{AdaLN}(\gamma,\beta,x)=(1+\gamma)\,\text{LN}(x)+\beta, \tag{2}$$

where LN refers to Layer Normalization, and $\gamma$ and $\beta$ are the scale and bias, respectively. The AdaLN operation is performed before each self-attention and feed-forward layer in the transformer. The output of each self-attention and cross-attention is rescaled by $\lambda$. Following [34], we initialize the modulation such that the output is zero.
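Equation (2) and the $\lambda$ rescaling can be sketched on a single token vector. This is an illustrative sketch under assumptions: the function names are hypothetical, the layer normalization has no learned affine parameters, and the residual form `x + lam * sublayer(...)` is the usual DiT-style arrangement rather than something stated explicitly in the text.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization of one token vector (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(gamma, beta, x):
    """AdaLN(gamma, beta, x) = (1 + gamma) * LN(x) + beta, element-wise."""
    return [(1.0 + g) * n + b for g, b, n in zip(gamma, beta, layer_norm(x))]


def modulated_residual(x, gamma, beta, lam, sublayer):
    """Residual branch: modulate the input, apply the sublayer, rescale by lam."""
    branch = sublayer(adaln(gamma, beta, x))
    return [xi + l * bi for xi, l, bi in zip(x, lam, branch)]


# With zero-initialized modulation the branch contributes nothing,
# so the block starts out as the identity mapping.
token = [1.0, 2.0, 3.0, 4.0]
zeros = [0.0] * len(token)
out = modulated_residual(token, zeros, zeros, zeros, lambda h: [2.0 * v for v in h])
```

The zero initialization means each residual branch is switched off at the start of training, so the conditioning signal is introduced gradually as the modulation vectors are learned.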

Director C (Figure 5(c)). We leverage the full sequence length of the conditioning. We retrieve the CLIP-embedded text sequence and the linearly projected trajectory and concatenate them into a single sequence. We then use 2 layers of transformer encoders to pre-process this sequence, which is then incorporated into the Director transformer with a cross-attention block.

4.2 Contrastive Language-Trajectory embedding (CLaTr)

Given the scarcity of relevant camera trajectory methods and datasets, the community has not introduced adequate metrics for this task. In the concurrent cinematic camera trajectory diffusion work [24], the authors evaluate their model with metrics from the human motion community. For this, they train a dedicated camera trajectory classifier to extract features. However, their classifier is trained on a simplistic task, comprising only six basic camera motion classes on synthetic data, which fails to capture the true complexity of camera trajectories.

To address this lack of proper evaluation metrics, in this section, we propose to extend existing metrics from text-image-based and text-motion-based generation (which rely on feature embeddings to measure the generation quality) to text-trajectory generation. The main obstacle is that no commonly accepted text-trajectory feature embedding exists. Therefore, we propose to learn a general text-trajectory embedding in a contrastive CLIP-like manner to acquire an accurate and robust feature representation, which can serve as a foundation for computing camera trajectory evaluation metrics.

We introduce Contrastive Language-Trajectory embedding (CLaTr) by capitalizing on our multi-modal dataset E.T. with a CLIP-like approach [38]. Our language-trajectory embedding follows the methodology outlined in [35], originally designed for human motion. CLaTr consists of a VAE [27] framework with trajectory and text encoders and a shared feature decoder. CLaTr is trained with three losses: (a) a reconstruction loss $\mathcal{L}_{R}$, quantifying trajectory reconstruction from both trajectory and text features; (b) four KL loss terms $\mathcal{L}_{KL}$, which regularize each modality distribution and also enforce inter-modality similarity; and (c) a cross-modal embedding similarity loss $\mathcal{L}_{E}$, ensuring alignment between text and trajectory features. See Appendix C for more details.
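To make the "contrastive CLIP-like" idea concrete, the sketch below computes a symmetric InfoNCE loss over a batch of paired text and trajectory embeddings. This is an assumption about what such an objective typically looks like, not the CLaTr training code: the text does not give the exact formula, and the temperature value and function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(r - m) for r in row))
    return [r - lse for r in row]


def symmetric_contrastive_loss(text_emb, traj_emb, temperature=0.1):
    """Pair i of text and trajectory is the positive; all others are negatives."""
    text = [normalize(t) for t in text_emb]
    traj = [normalize(t) for t in traj_emb]
    n = len(text)
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(text[i], traj[j])) / temperature
               for j in range(n)] for i in range(n)]
    text_to_traj = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    traj_to_text = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_traj + traj_to_text)


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each caption embedding is closest to its own trajectory embedding and large when the pairs are mismatched, which is the property that makes the resulting space usable for retrieval-style and distribution-level evaluation metrics.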

5 Experiments

| Set | Method | ω | FDCLaTr ↓ | P ↑ | R ↑ | D ↑ | C ↑ | CS ↑ | C-P ↑ | C-R ↑ | C-F1 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| E.T. pure trajectories | CCD [24] | 5.5 | 31.33 | 0.79 | 0.55 | 0.83 | 0.72 | 3.21 | 0.53 | 0.28 | 0.27 |
| E.T. pure trajectories | MDM [42] | 1.8 | 6.10 | 0.77 | 0.68 | 0.89 | 0.80 | 21.26 | 0.81 | 0.75 | 0.76 |
| E.T. pure trajectories | Director A | 1.6 | 5.16 | 0.82 | 0.67 | 1.00 | 0.86 | 21.88 | 0.84 | 0.78 | 0.80 |
| E.T. pure trajectories | Director B | 1.8 | 6.61 | 0.80 | 0.72 | 0.92 | 0.82 | 23.10 | 0.85 | 0.80 | 0.86 |
| E.T. pure trajectories | Director C | 1.6 | 4.57 | 0.83 | 0.65 | 1.00 | 0.87 | 21.49 | 0.83 | 0.78 | 0.80 |
| E.T. mixed trajectories | CCD [24] | 6.0 | 35.81 | 0.73 | 0.55 | 0.75 | 0.67 | 6.26 | 0.37 | 0.20 | 0.17 |
| E.T. mixed trajectories | MDM [42] | 2.0 | 6.79 | 0.78 | 0.65 | 0.85 | 0.76 | 18.32 | 0.36 | 0.36 | 0.34 |
| E.T. mixed trajectories | Director A | 1.4 | 3.88 | 0.82 | 0.68 | 0.98 | 0.85 | 20.76 | 0.43 | 0.43 | 0.42 |
| E.T. mixed trajectories | Director B | 1.6 | 6.10 | 0.78 | 0.74 | 0.85 | 0.78 | 20.78 | 0.41 | 0.40 | 0.39 |
| E.T. mixed trajectories | Director C | 1.4 | 3.76 | 0.83 | 0.67 | 1.00 | 0.86 | 21.95 | 0.49 | 0.49 | 0.48 |

Table 2: Quantitative Results. Comparison of Director and concurrent methods on E.T. pure and mixed subsets, evaluating camera trajectory quality (left: FDCLaTr, P, R, D, C) and text-camera coherence (right: CS, C-P, C-R, C-F1). ω denotes the guidance weight. First best and second best.
Figure 6: FDCLaTr vs CLaTr-Score. Guidance range between 0.6 and 2.2 on E.T. mixed subset.

Implementation details.

We train Director with a batch size of 128 using the AdamW optimizer with a learning rate of 1e-4, $(\beta_{1},\beta_{2})=(0.9,0.95)$ and a weight decay of $0.1$. We use a cosine decay learning rate scheduler with 5k steps of warmup for a total of 170k steps in bfloat16 mixed precision. The model has 8 layers with a hidden dim of 512 and 16 attention heads. We use dropout and stochastic depth of 0.1. We set the default temporal input size to 300 to match the E.T. sample size (see Section 3.2) and use masking to handle inputs with fewer than 300 frames. For the camera trajectory, we use the 6D continuous representation for rotation [54] combined with the 3D translation component. For the character trajectory, we use the 3D position of the character’s hip center.
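As an illustration of the camera representation, the sketch below builds a 9-dimensional per-frame feature from a rotation matrix and a translation, and recovers the rotation with the Gram-Schmidt procedure of [54]. It is a sketch under assumptions: the 6D vector is taken as the first two columns of the rotation matrix, and the rotation part is placed before the translation. The actual ordering in Director is not stated here, and the function names are hypothetical.

```python
import math


def rotation_to_6d(R):
    """Keep the first two columns of a 3x3 rotation matrix, flattened column by column."""
    return [R[0][0], R[1][0], R[2][0], R[0][1], R[1][1], R[2][1]]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(d6):
    """Rebuild a rotation matrix from a 6D vector by Gram-Schmidt orthonormalisation."""
    a1, a2 = d6[:3], d6[3:]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])  # remove the b1 component
    b3 = _cross(b1, b2)                                      # third axis completes the frame
    # b1, b2, b3 are the columns of the rotation matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def camera_feature(R, t):
    """Per-frame camera feature: 6D rotation followed by 3D translation (assumed order)."""
    return rotation_to_6d(R) + list(t)
```

Because `rotation_from_6d` orthonormalises its input, any 6D vector with two linearly independent halves maps to a valid rotation, which is the property that makes this representation convenient as a network output.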

5.1 Quantitative results

Metrics.

We use two sets of metrics.
First, we assess camera trajectory quality, specifically how well the generated camera trajectories match the distribution of the ground truth camera trajectories. For this, we use the CLaTr-based metrics described in Section 4.2: the Fréchet CLaTr Distance (FDCLaTr), similar to FID [17], and Precision (P), Recall (R), Density (D) and Coverage (C) [32]. As the validation set comprises only a few samples and these metrics need a sufficiently large number of samples (10k+), we compare against the train set, as is common practice for generative models on small datasets (e.g. CIFAR image generation [28, 18]).
Second, we use text-camera coherence metrics, which measure the coherence between the given caption (text) and the generated camera trajectory. For this, we use the CLaTr-Score (CS) (see Section 4.2), similar to CLIP-Score [15]. Additionally, we derive Classifier Precision (C-P), Classifier Recall (C-R) and Classifier F1-Score (C-F1) by performing motion tagging (described in Section 3.2) on generated camera trajectories and compare them to the ground truth.
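The classifier-based scores can be sketched as follows. This is an illustrative example rather than the evaluation code: it assumes each trajectory is tagged with a set of motion labels and that counts are pooled over all samples (micro-averaging), whereas the averaging scheme actually used is not specified here. The tag names in the usage note are hypothetical.

```python
def tag_precision_recall_f1(predicted_tags, reference_tags):
    """Micro-averaged precision, recall and F1 between per-sample tag sets."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted_tags, reference_tags):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # tags found in both generated and ground-truth trajectory
        fp += len(pred - ref)   # tags found only in the generated trajectory
        fn += len(ref - pred)   # ground-truth tags missing from the generated trajectory
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For instance, with generated tags `[{"truck_right"}, {"static", "push_in"}]` and ground-truth tags `[{"truck_right"}, {"static"}]`, two tags match, one is spurious and none is missed, so recall is 1 while precision is 2/3.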

Dataset.

In our experiments, we train and evaluate our model on two different subsets of the E.T. dataset. First, the pure camera trajectory subset, where we only keep the samples having a single camera motion trajectory (e.g. “the camera trucks right”). Second, the mixed camera trajectories subset, which excludes some static-only camera trajectories to create a balanced subset. In this way, we can both correctly compare against methods suited for simple, pure trajectories and emphasize the difficulty of the mixed compositional camera trajectories. We compare in Table 2 Director with concurrent methods on the pure subset (top) and mixed subset (bottom).

Comparison to the state of the art.

We report in Table 2 and Figure 6 quantitative results of the different Director architectures against the previous state-of-the-art CCD [24], and MDM [42], a default modern method in human motion. We observe that overall we outperform both works on all metrics and both subsets. Particularly, in the mixed trajectory subset (bottom of Table 2), we demonstrate superior camera trajectory quality metrics (left section of Table 2) with a margin of $-3.0$ FDCLaTr against MDM and $-32.1$ against CCD. Additionally, our method excels in text-camera coherence (right section of Table 2) within the same subset, achieving a substantial improvement of $+3.6$ CLaTr-Score against MDM and $+15.7$ against CCD.

Additionally, we show in Figure 6 the trade-off between FDCLaTr (trajectory quality) and CLaTr-Score (conditioning coherence) for varying guidance weights. The optimal point is at the bottom right, where FDCLaTr is lowest and CLaTr-Score is highest. We observe that the MDM curve (blue) consistently lies above Director’s curves, indicating that MDM performs worse.

These results reveal the effectiveness of our method both in generating high-quality camera trajectories and in handling the input caption conditioning.

Ablation of Director architectures.

We observe in Table 2 and Figure 6 that Director C outperforms other variants, followed closely by Director A. The cross-attention mechanism in Director C enables effective incorporation of conditioning into the model, leading to its superior performance. Director A offers a compelling balance of efficiency and performance: it exhibits comparable results to Director C with a simpler concept and fewer parameters. In contrast, Director B excels in text-camera coherence on the pure trajectory subset (top-right of Table 2) but struggles on the mixed trajectory subset (bottom-right of Table 2). We attribute this to the AdaLN’s ability to condition the model in simple setups, but its failure to capture sequential complexity in harder scenarios.

5.2 Qualitative results

Figure 7: Qualitative results. Generated camera trajectories with corresponding prompts and character trajectories, highlighting (a) controllability, (b) diversity, (c) complexity, and (d) character awareness. Darker shades indicate later frames.

Figure 7 shows generated camera trajectories from Director (architecture C). Each sub-figure displays the trajectories with pyramid markers for keyframes, along with character meshes and corresponding captions. The output trajectories are smooth and consistent with the input conditions. We highlight four key strengths of our method:

Controllability (Figure 7(a)). Director offers high controllability: by modifying only two words in the caption, the user can generate all kinds of camera trajectories, e.g. “trucks right”, “trucks left”, “booms top” and “booms bottom”.

Diversity (Figure 7(b)). Given the same input conditions (i.e. character trajectory and caption), Director generates diverse camera trajectories, allowing users to explore a wide range of creative and unique outputs.

Complexity (Figure 7(c)). Director can handle complex input conditions, including character trajectories (e.g., “moves right” then “stops”) and camera trajectory descriptions (e.g., “stays static and pushes-in” and “trucks right and remains static”).

Character-awareness (Figure 7(d)). Director effectively considers the character, generating camera trajectories that follow the character’s movement when the prompt and character trajectory are mirrored.

6 Conclusion

We designed and implemented E.T., a dataset of camera and character trajectories extracted from movie sequences that we believe will be very beneficial to the community. In addition to their trajectories, E.T. comes with text captions that describe the camera and character trajectories over time. We showed how E.T. can be exploited to train a diffusion-based approach to generate complex camera trajectories from high-level textual descriptions which correlate the trajectory of the camera with the trajectory of the characters. For this, we propose the diffusion-based method Director, which sets the new state of the art on camera trajectory generation. In the future, we plan to address the expressiveness of the trajectory captions, by including more information about modifiers and the exact position on the screen where the characters should be located.

Acknowledgements

This work was supported by ANR-22-CE23-0007, ANR-22-CE39-0016, Hi!Paris grant and fellowship, and was granted access to the HPC resources of IDRIS under the allocation 2023-AD011013951 made by GENCI. We would like to thank Hongda Jiang, Mathis Petrovich, Pierre Vassal and the anonymous reviewers for their insightful comments and suggestions.

References

  • [1] Bain, M., Nagrani, A., Brown, A., Zisserman, A.: Condensed movies: Story based retrieval with contextual embeddings. In: ACCV (2020)
  • [2] Björck, Å.: Least squares methods. Handbook of numerical analysis (1990)
  • [3] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  • [4] Blinn, J.: Where am I? what am I looking at? (cinematography). IEEE Computer Graphics and Applications (1988)
  • [5] Bonatti, R., Wang, W., Ho, C., Ahuja, A., Gschwindt, M., Camci, E., Kayacan, E., Choudhury, S., Scherer, S.: Autonomous aerial cinematography in unstructured environments with learned artistic decision-making. J. Field Robotics. (2020)
  • [6] Castellano, B.: Pyscenedetect. https://github.com/Breakthrough/PySceneDetect (2014)
  • [7] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR (2023)
  • [8] Courant, R., Lino, C., Christie, M., Kalogeiton, V.: High-level features for movie style understanding. In: ICCV-W (2021)
  • [9] Delmas, G., Weinzaepfel, P., Lucas, T., Moreno-Noguer, F., Rogez, G.: Posescript: 3d human poses from natural language. In: ECCV (2022)
  • [10] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS (2021)
  • [11] Drucker, S.M., Galyean, T.A., Zeltzer, D.: Cinema: A system for procedural camera movements. In: Symposium on Interactive 3D graphics (1992)
  • [12] Galvane, Q., Christie, M., Lino, C., Ronfard, R.: Camera-on-rails: automated computation of constrained camera paths. In: ACM Motion In Games (2015)
  • [13] Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4D: Reconstructing and tracking humans with transformers. In: ICCV (2023)
  • [14] Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: CVPR (2022)
  • [15] Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., Choi, Y.: CLIPScore: A reference-free evaluation metric for image captioning. In: EMNLP (2021)
  • [16] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017)
  • [17] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS (2017)
  • [18] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS (2020)
  • [19] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: NeurIPS-W (2021)
  • [20] Huang, C., Lin, C., Yang, Z., Kong, Y., Chen, P., Yang, X., Cheng, K.: Learning to film from professional human motion videos. In: CVPR (2019)
  • [21] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)
  • [22] Jiang, H., Christie, M., Wang, X., Liu, L., Wang, B., Chen, B.: Camera keyframing with style and control. ACM TOG (2021)
  • [23] Jiang, H., Wang, B., Wang, X., Christie, M., Chen, B.: Example-driven virtual cinematography by learning camera behaviors. ACM TOG (2020)
  • [24] Jiang, H., Wang, X., Christie, M., Liu, L., Chen, B.: Cinematographic camera diffusion model. Computer Graphics Forum (2024)
  • [25] Jiang, X., Rao, A., Wang, J., Lin, D., Dai, B.: Cinematic behavior transfer via nerf-based differentiable filming. arXiv preprint arXiv:2311.17754 (2023)
  • [26] Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. NeurIPS (2022)
  • [27] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. stat (2014)
  • [28] Krizhevsky, A., et al.: Learning multiple layers of features from tiny images. Toronto, ON, Canada (2009)
  • [29] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
  • [30] Lino, C., Christie, M.: Intuitive and efficient camera control with the toric space. ACM TOG (2015)
  • [31] Mildenhall, B., Srinivasan, P., Tancik, M., Barron, J., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
  • [32] Naeem, M.F., Oh, S.J., Uh, Y., Choi, Y., Yoo, J.: Reliable fidelity and diversity metrics for generative models. In: ICML (2020)
  • [33] Nichol, A.Q., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., Mcgrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In: ICML (2022)
  • [34] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)
  • [35] Petrovich, M., Black, M.J., Varol, G.: TMR: Text-to-motion retrieval using contrastive 3d human motion synthesis. In: ICCV (2023)
  • [36] Plappert, M., Mandery, C., Asfour, T.: The KIT motion-language dataset. Big Data (2016)
  • [37] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
  • [38] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  • [39] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
  • [40] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: LAION-400M: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
  • [41] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
  • [42] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. In: ICLR (2023)
  • [43] Truffaut, F., Scott, H.: Hitchcock/truffaut. revised edition. Simon and Schuster (1985)
  • [44] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
  • [45] Wang, X., Courant, R., Shi, J., Marchand, E., Christie, M.: JAWS: Just A Wild Shot for cinematic transfer in neural radiance fields. In: CVPR (2023)
  • [46] Wang, Z., Yuan, Z., Wang, X., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641 (2023)
  • [47] Xiao, Z., Kreis, K., Vahdat, A.: Tackling the generative learning trilemma with denoising diffusion GANs. In: ICLR (2021)
  • [48] Xie, D., Hu, P., Sun, X., Pirk, S., Zhang, J., Mech, R., Kaufman, A.E.: GAIT: Generating aesthetic indoor tours with deep reinforcement learning. In: ICCV (2023)
  • [49] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., Liu, T.: On layer normalization in the transformer architecture. In: ICML (2020)
  • [50] Ye, V., Pavlakos, G., Malik, J., Kanazawa, A.: Decoupling human and camera motion from videos in the wild. In: CVPR (2023)
  • [51] Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: MotionDiffuse: Text-driven human motion generation with diffusion model. IEEE TPAMI (2024)
  • [52] Zhao, R., Gu, Y., Wu, J.Z., Zhang, D.J., Liu, J., Wu, W., Keppo, J., Shou, M.Z.: Motiondirector: Motion customization of text-to-video diffusion models. arXiv preprint arXiv:2310.08465 (2023)
  • [53] Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. ACM TOG (2018)
  • [54] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: CVPR (2019)

Appendix

Appendix 0.A Ethical discussion

We discuss the ethical impact of our method across several aspects:

  • Creative Integrity: There is a fine line between using an AI tool to enhance human creativity and allowing it to replace the human creative process. If misused, the proposed method could diminish artistic expression instead of supporting it.

  • Intellectual Property: The use of AI-generated content raises questions about ownership and copyright. The intellectual property ownership of the generated content is debatable.

  • Job Displacement or Creation: The automation of certain aspects of filmmaking could lead to concerns about job displacement within the industry, or under proper usage, may also help to create new types of jobs in the domain.

Appendix 0.B Exceptional Trajectories dataset (E.T.)

Figure 8: Example E.T. samples. Each subfigure presents frames from the original movie shot (left), and processed camera and character trajectories (right). Additionally, the bottom part showcases the generated camera trajectory caption with or without the character trajectory caption.

0.B.1 Additional statistics

Figure 9: E.T. statistics. (a) Trajectory length (in #frames): number of samples against duration in frames. (b) Camera length (in meters) and (c) character length (in meters): number of samples (log scale) against length in meters.

We build our E.T. dataset on the Condensed Movies Dataset [1] (CMD), encompassing over $30{,}000$ scenes from $3{,}000$ diverse movies, totaling more than $1{,}000$ hours of video. We segment each movie scene into continuous shots by leveraging changes in color and intensity between frames [6].

We show additional statistics of E.T. in Figure 9. We observe that for both camera and character, the majority of trajectories are shorter than 20 meters, which for a 300-frame sample at 25 fps corresponds to a velocity of $20\text{ meters}/(300\text{ frames}/25\text{ fps}) \approx 1.67\ m.s^{-1}$.

Additionally, in Figure 8, we show extensive examples of E.T. samples.

0.B.2 Data pre-processing

Figure 10: Raw chunk alignment. We show in (a) the raw independent chunks just after the SLAHMR [50] extraction. In (b) we display the result of the chunk alignment process. Each color (red, blue, green) corresponds to a different chunk.

Chunk alignment.

A limitation of SLAHMR [50] is its inability to handle long videos (exceeding 100 frames). Consequently, we divide each shot into chunks of 100 frames and process them independently. However, this produces inconsistent outputs across chunks: they exhibit translational offsets and different scales, as shown in Figure 10(a).

To address this issue, we propose the following alignment method: dividing shots into overlapping chunks, where consecutive chunks share frames, and performing alignment on these overlapping frames. A chunk contains camera trajectories with $SE(3)$ poses represented as $[\mathbf{R}\,|\,\mathbf{t}]$ (where $\mathbf{R}$ denotes rotation and $\mathbf{t}$ translation), and 3D human poses described by $\mathbf{V}$ (vertices of a 3D mesh).
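The chunking step can be sketched as below. The chunk size of 100 frames comes from the text, while the number of shared frames is not given: the `overlap=10` default is a placeholder assumption, and the function name is hypothetical.

```python
def overlapping_chunks(num_frames, chunk_size=100, overlap=10):
    """Split frame indices [0, num_frames) into chunks that share `overlap` frames.

    Returns a list of (start, end) pairs with `end` exclusive.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_frames <= 0:
        return []
    stride = chunk_size - overlap  # how far each new chunk starts after the previous one
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += stride
    return chunks
```

With `overlapping_chunks(250, 100, 20)`, consecutive chunks share frames 80 to 99 and 160 to 179; these shared frames are the ones on which the alignment is computed.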

Given two consecutive chunks at $k$ and $k+1$, we initially align the cameras. The alignment involves determining a scale parameter $s$ and a $SE(3)$ rigid transformation $[\mathbf{B}\;|\;\mathbf{b}]$:

$$[\mathbf{R}_{k}\;|\;\mathbf{t}_{k}] = [\mathbf{B}_{k}\;|\;\mathbf{b}_{k}]\,[\mathbf{R}_{k+1}\;|\;s_{k}\,\mathbf{t}_{k+1}], \tag{3}$$
$$[\mathbf{R}_{k}\;|\;\mathbf{t}_{k}] = [\mathbf{B}_{k}\,\mathbf{R}_{k+1}\;|\;s_{k}\,\mathbf{B}_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k}], \tag{4}$$

which simplifies to:

$$(a)\quad \mathbf{R}_{k} = \mathbf{B}_{k}\,\mathbf{R}_{k+1}, \tag{5}$$
$$(b)\quad \mathbf{t}_{k} = s_{k}\,\mathbf{B}_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k}. \tag{6}$$

Notably, the rotation estimated by SLAHMR remains consistent across chunks, implying $\mathbf{B}_{k}=\mathbf{I}$, and simplifying Equations 5 and 6:

$$(a)\quad \mathbf{R}_{k} = \mathbf{R}_{k+1}, \tag{7}$$
$$(b)\quad \mathbf{t}_{k} = s_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k}. \tag{8}$$

Subsequently, alignment entails determining the scaling factor $s$ and translational bias $\mathbf{b}$. Following Equation 8, these parameters can be estimated over the overlapping frames using the least-squares method [2], as represented by:

$$\begin{bmatrix}\mathbf{t}_{k+1} & \mathbf{I}\end{bmatrix}\begin{bmatrix}s_{k}\\ \mathbf{b}_{k}\end{bmatrix} = \mathbf{t}_{k}, \tag{9}$$

which can be further expressed as:

$$\begin{bmatrix}t_{k+1}^{x} & 1 & 0 & 0\\ t_{k+1}^{y} & 0 & 1 & 0\\ t_{k+1}^{z} & 0 & 0 & 1\end{bmatrix}\begin{bmatrix}s_{k}\\ b^{x}_{k}\\ b^{y}_{k}\\ b^{z}_{k}\end{bmatrix} = \begin{bmatrix}t_{k}^{x}\\ t_{k}^{y}\\ t_{k}^{z}\end{bmatrix}. \tag{10}$$
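A minimal sketch of this estimation is given below. It follows the convention of Equation 8, in which $\mathbf{t}_{k} \approx s_{k}\,\mathbf{t}_{k+1} + \mathbf{b}_{k}$ on the frames shared by chunks $k$ and $k+1$, and uses the closed-form solution of that least-squares problem instead of a generic solver. The function name and the input layout (one 3D translation per shared frame) are assumptions made for illustration.

```python
def fit_scale_and_offset(t_next, t_prev):
    """Least-squares (s, b) minimising sum_i || s * t_next[i] + b - t_prev[i] ||^2.

    t_next: translations of chunk k+1 on the shared frames (list of 3D points).
    t_prev: translations of chunk k on the same frames.
    """
    n = len(t_next)
    if n < 2 or n != len(t_prev):
        raise ValueError("need at least two shared frames with matching lengths")
    mean_next = [sum(p[d] for p in t_next) / n for d in range(3)]
    mean_prev = [sum(p[d] for p in t_prev) / n for d in range(3)]
    num = 0.0
    den = 0.0
    for q, p in zip(t_next, t_prev):
        for d in range(3):
            qc = q[d] - mean_next[d]           # centred chunk k+1 coordinate
            num += qc * (p[d] - mean_prev[d])  # covariance with chunk k
            den += qc * qc                     # variance of chunk k+1
    if den == 0.0:
        raise ValueError("degenerate overlap: the camera does not move")
    s = num / den
    b = [mean_prev[d] - s * mean_next[d] for d in range(3)]
    return s, b
```

The degenerate case matters in practice: if the camera is static over the shared frames, the scale cannot be identified from translations alone, and the sketch raises an error rather than returning an arbitrary value.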

We also seek the alignment transform $\mathbf{\Delta}_{b}$, such that:

$$[\mathbf{R}_{k+1}\;|\;s_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k}]\,\mathbf{\Delta}_{b} = [\mathbf{R}_{k+1}\,|\,\mathbf{t}_{k+1}], \tag{11}$$

resulting in:

$$\mathbf{\Delta}_{b} = [\mathbf{R}_{k+1}\;|\;s_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k}]^{-1}\,[\mathbf{R}_{k+1}\,|\,\mathbf{t}_{k+1}]. \tag{12}$$

Considering the inverse of a $4 \times 4$ transformation matrix representing a rigid transformation $[\mathbf{R}\,|\,\mathbf{t}]$:

$$[\mathbf{R}\,|\,\mathbf{t}]^{-1} = \begin{bmatrix}\mathbf{R}^{T} & -\mathbf{R}^{T}\mathbf{t}\\ \mathbf{0} & 1\end{bmatrix}, \tag{13}$$

we obtain from Eq. 12:

$$\mathbf{\Delta}_{b} = \begin{bmatrix}\mathbf{R}_{k+1}^{T} & -\mathbf{R}_{k+1}^{T}(s_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k})\\ \mathbf{0} & 1\end{bmatrix}\,\begin{bmatrix}\mathbf{R}_{k+1} & \mathbf{t}_{k+1}\\ \mathbf{0} & 1\end{bmatrix}, \tag{14}$$
$$\mathbf{\Delta}_{b} = \begin{bmatrix}\mathbf{I} & \mathbf{R}_{k+1}^{T}(\mathbf{t}_{k+1}-(s_{k}\,\mathbf{t}_{k+1}+\mathbf{b}_{k}))\\ \mathbf{0} & 1\end{bmatrix}. \tag{15}$$

Ultimately, we align the 3D human poses through their vertices $V$:

$$\begin{bmatrix}\mathbf{V}_{k}^{T}\\ 1\end{bmatrix}=\mathbf{\Delta}_{b}\,\begin{bmatrix}\mathbf{V}_{k+1}^{T}\\ 1\end{bmatrix}=\begin{bmatrix}\mathbf{V}_{k+1}^{T}+\mathbf{R}_{k+1}^{T}(\mathbf{t}_{k+1}-(s_{k}\mathbf{t}_{k+1}+\mathbf{b}_{k}))\\ 1\end{bmatrix}, \tag{16}$$
$$\mathbf{V}_{k}=\mathbf{V}_{k+1}+(\mathbf{t}_{k+1}-(s_{k}\mathbf{t}_{k+1}+\mathbf{b}_{k}))^{T}\mathbf{R}_{k+1}. \tag{17}$$

The alignment process outcome is illustrated in Figure 10(b).
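As an illustration of Equations (15) and (17), the following NumPy sketch builds the correction transform and applies it to a set of vertices. The function names and the row-per-vertex layout are assumptions made for this example; the scale and offset arguments stand for the scalar and vector that multiply and shift the translation in the equations above.

```python
import numpy as np

def correction_transform(R_next, t_next, s, b):
    """Build the 4x4 matrix Delta_b of Eq. (15).

    R_next: (3, 3) rotation R_{k+1}; t_next: (3,) translation t_{k+1};
    s: scalar scale; b: (3,) offset b_k.
    """
    delta = np.eye(4)
    delta[:3, 3] = R_next.T @ (t_next - (s * t_next + b))
    return delta

def align_vertices(V_next, R_next, t_next, s, b):
    """Apply Eq. (17) to vertices stored one per row, shape (N, 3)."""
    # x @ R is the row-vector form of R^T x, matching the transposed Eq. (17).
    offset = (t_next - (s * t_next + b)) @ R_next
    return V_next + offset
```

Because the rotation block of the correction transform is the identity, aligning the vertices reduces to adding one constant 3D offset to every vertex.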

Data cleaning.

The extracted trajectories inherit limitations from the data extraction method [50], including discontinuities, ruptures and jerky motions. To address this, we first remove outliers (i.e., discontinuous segments) using a velocity threshold. Specifically, we discard trajectory points whose velocity exceeds the 95th percentile of the overall trajectory velocity multiplied by a scaling factor. We then partition the trajectory into outlier-free sub-trajectories. Finally, we apply a Kalman filter to each sub-trajectory to reduce residual jerkiness and improve overall smoothness.
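The cleaning steps above can be sketched as follows. This is a minimal illustration and not the exact pipeline: the scaling factor, the constant-velocity motion model and the noise variances are all hypothetical choices, and only camera positions are handled.

```python
import numpy as np

def split_on_velocity_outliers(positions, scale=2.0, percentile=95.0):
    """Drop points faster than scale * percentile(speed); return clean chunks.

    positions: (T, 3) camera positions. `scale` is a hypothetical value.
    """
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    speed = np.concatenate([[0.0], speed])  # first frame has no predecessor
    keep = speed <= scale * np.percentile(speed, percentile)
    chunks, start = [], None
    for i, ok in enumerate(keep):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            chunks.append(positions[start:i])
            start = None
    if start is not None:
        chunks.append(positions[start:])
    return chunks

def kalman_smooth(chunk, process_var=1e-3, meas_var=1e-1):
    """Forward constant-velocity Kalman filter, run independently per axis.

    The noise variances are illustrative assumptions.
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # transition for [position, velocity]
    H = np.array([[1.0, 0.0]])              # only the position is observed
    Q = process_var * np.eye(2)
    R = np.array([[meas_var]])
    out = np.empty_like(chunk, dtype=float)
    for axis in range(chunk.shape[1]):
        x = np.array([chunk[0, axis], 0.0])
        P = np.eye(2)
        for k, z in enumerate(chunk[:, axis]):
            if k > 0:                        # predict
                x = F @ x
                P = F @ P @ F.T + Q
            y = z - H @ x                    # innovation
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
            x = x + K @ y                    # update
            P = (np.eye(2) - K @ H) @ P
            out[k, axis] = x[0]
    return out
```

A trajectory would then be processed as `[kalman_smooth(c) for c in split_on_velocity_outliers(positions)]`. Multiplying the percentile by a factor above 1 keeps the threshold from systematically discarding the fastest 5% of points in an otherwise clean trajectory.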

0.B.3 Dataset creation pipeline

Motion tagging.

We tune the parameters of our motion tagging method using the dataset introduced in [8]. This small dataset of 75 short clips includes annotated sequences of pure camera motion. For the character trajectory tagging, we extend this dataset by annotating human trajectories. We select the parameters (mainly threshold values) that yield the best classification metrics described in Section 5 of the main manuscript.

Caption generation.

We show the prompt used for caption generation (see Section 3.2 of the main manuscript):

You act as a camera operator writing a technical script for camera
motion descriptions.

Given a rough outline of the camera motion and main character motion,
write the camera motion description according to the main character
motion.

The sentence should be short, and factual. Do not mention frame
indices.

# Examples
Outline: Total frames 209.
    [Camera motion] Between frames 0 and 154: boom top, Between
    frames 155 and 209: static.
    [Main character motion] Between frames 0 and 146: move up,
    Between frames 147 and 209: static.
Description: While the character climbs up, the camera follows them
with a boom top, and as soon as the character stops, it remains
static.
# End of examples

Outline: Total frames {CURRENT_NUM_FRAME}.
    [Camera motion] {CURRENT_CAMERA_DESCRIPTION}.
    [Main character motion] {CURRENT_CAMERA_DESCRIPTION}.
Description:
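The query part of this prompt can be assembled programmatically from the tagged segments. The sketch below is only an illustration: the helper names, the `(start, end, tag)` segment format and the placeholder names are hypothetical, and it assumes that the second slot receives the character motion description.

```python
def describe_segments(segments):
    """Format (start, end, tag) triples as in the example outline."""
    parts = [f"Between frames {start} and {end}: {tag}"
             for start, end, tag in segments]
    return ", ".join(parts)

# Hypothetical placeholder names; only the layout follows the prompt above.
QUERY_TEMPLATE = (
    "Outline: Total frames {num_frames}.\n"
    "    [Camera motion] {camera}.\n"
    "    [Main character motion] {character}.\n"
    "Description:"
)

def build_query(num_frames, camera_segments, character_segments):
    """Fill the final outline of the prompt for one sample."""
    return QUERY_TEMPLATE.format(
        num_frames=num_frames,
        camera=describe_segments(camera_segments),
        character=describe_segments(character_segments),
    )
```

For instance, `build_query(209, [(0, 154, "boom top"), (155, 209, "static")], [(0, 146, "move up"), (147, 209, "static")])` reproduces the outline of the in-context example.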

Appendix 0.C Contrastive Language-Trajectory embedding (CLaTr)

Refer to caption
(a) Overview of CLaTr framework. CLaTr projects both text and camera trajectories into a common latent space using encoders. Self-similarity is then computed, and a shared-weight decoder decodes both text and camera trajectory features back into a camera trajectory.
Refer to caption
(b) t-SNE visualization of CLaTr embedding of text (vivid colors) and trajectory (pastel colors). Each color corresponds to a K-Means cluster of the text embedding.
Retrieval direction    R@1 ↑   R@2 ↑   R@3 ↑   R@5 ↑   R@10 ↑   MedR ↓
Text-trajectory        19.73   31.67   40.8    52.08   64.69    5.0
Trajectory-text        11.15   17.25   20.91   26.5    34.66    28.0
Table 3: CLaTr evaluation. We report the retrieval scores of CLaTr on the E.T. dataset.

We show in Figure 11(a) the overview of the CLaTr framework as described in Section 4.2 of the main manuscript.

Implementation details.

We train CLaTr with a batch size of 32 using the AdamW optimizer with a learning rate of 1e-5. We set the weight of the reconstruction loss to 1.0, of the latent loss to 1.0e-5, of the KL loss to 1.0e-5, and of the contrastive loss to 0.1. The model has 6 layers with a hidden dimension of 256 and 4 attention heads. We use a dropout of 0.1. Similar to Director, we set the default temporal input size to 300 and use masking to handle inputs with fewer than 300 frames. We represent the camera trajectory with the 6D continuous representation for rotation [54] combined with the 3D translation component.
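To make the input format concrete, the sketch below packs a trajectory into 9D per-frame features (6D rotation plus 3D translation) padded to 300 frames with a validity mask. It assumes one common convention for the 6D representation (the first two columns of the rotation matrix), a rotation-then-translation feature order and zero padding; none of these details are confirmed by the text above.

```python
import numpy as np

MAX_FRAMES = 300  # default temporal input size stated above

def rotation_to_6d(R):
    """(T, 3, 3) rotations -> (T, 6): first two columns of each matrix."""
    return np.concatenate([R[..., :, 0], R[..., :, 1]], axis=-1)

def rotation_from_6d(d6):
    """Invert rotation_to_6d with Gram-Schmidt orthonormalisation."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 become the columns

def encode_trajectory(R, t, max_frames=MAX_FRAMES):
    """Return a zero-padded (max_frames, 9) array and a boolean frame mask."""
    feats = np.concatenate([rotation_to_6d(R), t], axis=-1)
    n = feats.shape[0]
    if n > max_frames:
        raise ValueError("trajectory is longer than max_frames")
    padded = np.zeros((max_frames, 9))
    padded[:n] = feats
    mask = np.arange(max_frames) < n  # True on real frames, False on padding
    return padded, mask
```

The mask lets attention layers and losses ignore the padded frames, so shorter trajectories can share a batch with longer ones.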

CLaTr Evaluation.

Table 3 presents standard retrieval performance measures from [35, 14]. Recall at rank k (R@k) indicates the percentage of queries for which the correct match is within the top k results (higher is better). Median rank (MedR) is also reported, where lower values are better.
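For reference, both measures can be computed from a query-by-candidate similarity matrix as in the sketch below. It assumes that the ground-truth match of query i is candidate i; the function name and the use of a dense similarity matrix are choices made for this example.

```python
import numpy as np

def retrieval_metrics(similarity, ks=(1, 2, 3, 5, 10)):
    """Recall@k (in %) and median rank, with queries along the rows.

    similarity[i, j] scores query i against candidate j; the correct
    match of query i is assumed to be candidate i.
    """
    order = np.argsort(-similarity, axis=1)          # best candidate first
    target = np.arange(similarity.shape[0])[:, None]
    ranks = np.argmax(order == target, axis=1) + 1   # 1-based rank of the match
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics
```

With text embeddings along the rows, `retrieval_metrics(sim)` gives the text-trajectory direction and `retrieval_metrics(sim.T)` the trajectory-text direction.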

As shown in Table 3, text-to-trajectory metrics outperform trajectory-to-text metrics. This may be because text descriptions are more ambiguous and varied in describing trajectories, making it easier to match a text description to a unique trajectory than to match a trajectory to a specific description among many possibilities.

CLaTr embedding.

We show in Figure 11(b) a t-SNE visualization of CLaTr text (vivid colors) and trajectory (pastel colors) embeddings. We applied K-Means clustering to the text embeddings and visualized the corresponding clusters on the trajectory embeddings to assess the consistency of the joint embedding. Notably, we find that text clusters are preserved in the trajectory space, with vivid and pastel clusters overlapping, indicating a robust alignment between text and trajectory representations.