M3Act: Learning from Synthetic Human Group Activities

Che-Jui Chang, Rutgers University
Danrui Li, Rutgers University
Deep Patel, NEC Laboratories
Parth Goel, Rutgers University
Honglu Zhou, NEC Laboratories
Seonghyeon Moon, Rutgers University
Samuel S. Sohn, Rutgers University
Sejong Yoon, The College of New Jersey
Vladimir Pavlovic, Rutgers University
Mubbasir Kapadia, Roblox

Abstract

The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address this limitation, we introduce M3Act, a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by Unity Engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across single-person, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest our synthetic dataset can significantly improve the performance of several downstream methods and replace real-world datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on the DanceTrack dataset, leading to a hop on the leaderboard from 10th to 2nd place. Moreover, M3Act opens new research for controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for the novel task. Our code and data are available at our project page: http://cjerry1243.github.io/M3Act.

Work done during internship at Roblox.

Figure 1: M3Act is a large-scale synthetic data generator designed to support multi-person and multi-group research topics. M3Act features multiple semantic groups and produces highly diverse and photorealistic videos with a rich set of annotations suitable for human-centered tasks including multi-person tracking, group activity recognition, and controllable human group activity generation.

1 Introduction

Understanding collective human activities and social groups carries significant implications across diverse domains, as it contributes to bolstering public safety within surveillance systems, ensuring safe navigation for autonomous robots and vehicles amidst human crowds, and enriching social awareness in human-robot interactions [57, 12, 11, 39, 9, 8, 21, 54]. However, the advancement in related tasks is often impeded by the challenges of obtaining large-scale human group activity datasets in real-world scenarios with fine-grained multifaceted annotations.

Generating synthetic data is an emerging alternative to collecting real-world data due to its capability of producing large-scale datasets with perfect annotations. Nonetheless, most synthetic datasets [52, 42, 20, 59, 4] are primarily designed to facilitate human pose and shape estimation. They can only provide data with independently-animated persons, which is unsuitable for tasks in single-group and multi-group conditions [57]. To address the limitation, we propose M3Act, a synthetic data generator, with multi-view multi-group multi-person human actions and group activities. As presented in Tab. 1, M3Act stands out by offering comprehensive annotations including both 2D and 3D annotations as well as fine-grained person-level and group-level labels, thereby making it an ideal synthetic dataset generator to support tasks such as human activity recognition and multi-person tracking across all listed real-world datasets.

As illustrated in Fig. 1, our synthetic data generator features multiple semantic groups, highly diverse and photorealistic images, and a rich set of annotations. It encompasses 25 photometric 3D scenes, 104 HDRIs (High Dynamic Range Images), 5 lighting volumes, 2200 human models, 384 animations (categorized into 14 atomic action classes), and 6 group activities. For our experiments, we generated two datasets, M3ActRGB and M3Act3D. M3ActRGB contains both single-group and multi-group data with a total of 6M frames of RGB images and 48M bounding boxes, rendered at 20 FPS. M3Act3D is a 3D-only, single-group dataset that contains the 3D motions of all persons within a group. It has large group sizes (up to 27 people) and an average of 6.7 persons per group. In total, M3Act3D contains 87.6 hours of group activities, captured at 30 FPS.

Columns: Dataset | Image Type | Avatar Num. | Video, Multi-View, Multi-Person, Multi-Group, Annotations (2D, 3D, Atomic Actn., Group Act.)
SURREAL, 2017 [52] | Composite | 145 | ✓ ✓ ✓
AGORA, 2021 [42] | HDRI | 350 | ✓ ✓ ✓
HSPACE, 2021 [2] | 3D Scene | 1600 | ✓ ✓ ✓ ✓ ✓
GTA-Humans, 2021 [6] | 3D Scene | 600 | ✓ ✓ ✓ ✓
PSP-HDRI+, 2022 [20] | HDRI | 28 | ✓ ✓ ✓
SynBody, 2023 [59] | 3D Scene | 10k | ✓ ✓ ✓ ✓ ✓
BEDLAM, 2023 [4] | 3D Scene | 271 | ✓ ✓ ✓ ✓
M3Act (Ours) | Photometric 3D + HDRI | 2200 | ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
CAD, 2011 [15] | Real | - | ✓ ✓ ✓ ✓ ✓ ✓
Volleyball Dataset, 2016 [31] | Real | - | ✓ ✓ ✓ ✓ ✓ ✓
NTU-RGBD 120, 2019 [35] | Real | - | ✓ ✓ ✓ ✓ ✓ ✓
HiEve, 2020 [34] | Real | - | ✓ ✓ ✓ ✓ ✓
PoseTrack, 2021 [18] | Real | - | ✓ ✓ ✓ ✓
MOT, 2020 [17] | Real | - | ✓ ✓ ✓ ✓
DanceTrack, 2022 [48] | Real | - | ✓ ✓ ✓
JRDB, 2023 [39, 54, 21] | Real | - | ✓ ✓ ✓ ✓ ✓ ✓
Table 1: A comparison of synthetic datasets as well as commonly-used real datasets for activity understanding and person tracking. We refer to JRDB as a union set of JRDB, JRDB-Act, and JRDB-Pose datasets. Note that it offers 3D bounding boxes, but not poses.

We first demonstrate the merit of M3Act via synthetic data pre-training and mixed training on multi-person tracking and group activity recognition. For multi-person tracking, training with our synthetic data yields significant performance gains on several downstream methods [65, 63, 23, 58]. We also demonstrate notable improvements in the state-of-the-art MOTRv2 method [65] and observe that our synthetic data can substitute for 62.5% more real-world data without compromising performance. In terms of group activity recognition, results indicate that pre-training with M3ActRGB greatly improves both person-level and group-level accuracy for the Composer [66] and Actor Transformer [25] methods. Based on our generated data, we then introduce a novel task, controllable 3D group activity generation, which aims to synthesize a group of 3D human motions given control signals such as activity labels and group sizes. We systematically approach the new task by introducing both learning-based and heuristics-based metrics, along with a competitive baseline to generate meaningful human activities.

This paper makes the following contributions:

  • We propose a novel synthetic data generator, M3Act, and provide two large-scale synthetic datasets with highly diverse human activities, photorealistic multi-view videos, and comprehensive annotations.

  • We demonstrate that M3Act can significantly improve benchmark performances for multi-person tracking and group activity recognition and replace a large portion of real-world training data to reduce cost.

  • M3Act promotes new research initiatives for controllable 3D group activity generation, suggesting that synthetic data can not only support existing tasks but also create datasets for novel research.

2 Related Works

Human-Centered Synthetic Datasets. The use of synthetic datasets for human-centered tasks has become increasingly prominent due to their diversity, scalability, and perfect annotations, with proven merits connected to various fields in machine learning, including domain adaptation [49, 36], heterogeneous multitask learning [60], and sim2real [24] or task2sim [41] transfer. Most previous synthetic datasets are constructed to support human pose estimation. For example, SURREAL [52] contains renderings of human motions from 145 avatars composited onto a background image. Subsequent works [42, 2, 20] improved the image quality by leveraging realistic 3D scenes, high-quality renderings, and HDRI images. Recently, synthetic datasets have been proposed to tackle human shape and mesh estimation. SynBody [59] constructs layered human assets to increase character diversity. BEDLAM [4] adds physically simulated hair and clothes to achieve state-of-the-art performance on shape and mesh estimation. Nonetheless, data with collective human motions and group activities cannot be obtained from them. Our work, M3Act, is constructed with animated human groups tailored to multi-person and multi-group research.

Real Multi-Person Datasets. Real-world datasets [15, 31, 35, 34, 17, 48, 18, 39] with multiple persons are usually collected for tasks such as group activity understanding, multi-person tracking, and human trajectory prediction. Recognizing and parsing collective human activities [66, 25] relies primarily on multiple modalities (RGB, bounding box, pose) and hierarchical action and activity labels. These fine-grained labels are provided by datasets like CAD [15] and Volleyball Dataset [31]. On the other hand, datasets for person tracking, such as HiEve [34], MOT [40, 17], and DanceTrack [48], require not only 2D annotations (e.g., bounding boxes) for individual frames, but also the association of the objects between them. Specifically, DanceTrack provides multiple persons in a group with the same clothing, making it difficult to associate the individuals. MOT datasets target tracking for human crowds and contain mostly outdoor scenes from a bird's-eye view. Recently, JRDB [39, 21, 54] was released with rich annotations. Its images are captured by a social robot navigating around daily scenes. It provides fine-grained annotations that support various tasks, including person detection, pose estimation, tracking, collective activity detection, and understanding. M3Act not only offers the same modalities and annotations for supporting the aforementioned tasks, but it also provides full 3D annotations, making it suitable for a wide range of applications beyond the 2D domain.

3 M3Act

M3Act is a multi-view multi-group multi-person human atomic action and group activity data generator built with Unity Engine and the Perception [5] library. Inspired by PeopleSansPeople [19], which populates a scene with randomly posed human avatars and renders static images, M3Act not only offers the same functionalities for human poses but also extends them to the spatio-temporal domain. It generates RGB videos for dynamic human motions and produces a rich set of annotations simultaneously, including (a) 2D and 3D joints/meshes, (b) 2D and 3D bounding boxes for individual persons, (c) atomic action and group activity categories, (d) tracking information such as individual and group IDs, (e) segmentation, depth, and normal images, and (f) scene description.

Figure 2: The data generation process of M3Act. It consists of multiple data simulations with scene instantiation, group activity authoring, and a data capture module. A high degree of randomization is involved in all aspects of the process to ensure diverse data.

3.1 Data Generation

The process of our data generation is illustrated in Fig. 2. First, the generation process is configured by the simulation scenario that manages multiple independent simulations of human activities. Then for each simulation, a 3D scene with background objects, lights, and cameras is set up, and groups of human characters are instantiated to be animated. Lastly, the multi-view RGB image frames are rendered and the annotations are exported at the end of the simulation.

Scene Instantiation. We represent the environment through 25 photometric 3D scenes and 104 panoramic HDRIs. Each scene is initiated with randomized lighting and camera configuration. To attain a balance between realistic environmental illumination and pronounced shadow detail, our lighting schema integrates HDRI Sky lighting with a directional light. This directional light is subject to random variations in its direction, color temperature, and intensity. Regarding camera placement, we always point cameras towards the center of avatar groups and introduce variability by randomizing both the field of view and the camera’s distance to these groups.
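The camera placement described above can be sketched as follows. This is a minimal illustration, not the generator's Unity code: the function name `sample_camera`, the distance, elevation, and field-of-view ranges, and the y-up axis convention are all assumptions chosen for the example. Only the behavior (camera aimed at the group center, with randomized distance and field of view) comes from the text.

```python
import math
import random


def sample_camera(group_positions, rng, dist_range=(4.0, 12.0), fov_range=(30.0, 70.0)):
    """Sample a camera aimed at the centroid of an avatar group.

    All ranges are hypothetical; the paper's distributions are in its Sup. Mat.
    """
    n = len(group_positions)
    center = tuple(sum(p[k] for p in group_positions) / n for k in range(3))

    # Randomize the camera-to-group distance and the viewing direction.
    distance = rng.uniform(*dist_range)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = rng.uniform(math.radians(5.0), math.radians(45.0))

    # Spherical offset from the group center (y is "up" in this sketch).
    offset = (
        distance * math.cos(elevation) * math.cos(azimuth),
        distance * math.sin(elevation),
        distance * math.cos(elevation) * math.sin(azimuth),
    )
    position = tuple(c + o for c, o in zip(center, offset))

    # Unit vector pointing from the camera back to the group center.
    forward = tuple(-o / distance for o in offset)

    return {
        "position": position,
        "forward": forward,
        "distance": distance,
        "fov_deg": rng.uniform(*fov_range),
    }


camera = sample_camera([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], random.Random(0))
```

Moving `distance` units along `forward` from `position` lands on the group centroid, which is the look-at property the paragraph describes.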

Human Models and Motion Assets. M3Act leverages 2000 human models generated by Synthetic Humans [45], ranging across all ages (1 to 100), genders, ethnicities (such as Caucasian, Asian, Latin American, African, Middle Eastern), diverse body shapes, hair, and clothing. We also incorporated 200 widely-used human characters from RenderPeople [1]. For the human motions, we collected 384 animation clips from AMASS [38], categorized into 14 atomic action classes. We created a universal animation controller that blends styles, including arm space and stride size, to create diverse motions from the collected clips.

Modular Group Activities. Each group activity is structured as a parameterized module, allowing for the customization of numerous variables. These variables include the number of individuals in the group and the specific atomic actions permitted within the group activity. This modularization ensures easy duplication, repositioning, and reuse of the group activities, enabling simulations of multiple groups at the same time. To procedurally animate a group of humans within a modular group, we establish the positions and orientations of the selected characters while choosing the appropriate animation clip for each character through the activity script. It’s important to note that, despite drawing characters from the same set of avatars, the configurations and animations of these characters can vary significantly from one group to another. For example, animating a queueing activity may require all characters to be aligned in a straight line, while those in a walking group may form various shapes. The atomic actions that a person can perform also depend on the specific group activity. We carefully consider all these factors when authoring activities and provide a summary in the Sup. Mat.
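One way to picture a parameterized group activity module is the sketch below. It is illustrative only: the class name `GroupActivityModule`, its fields, the two formations, the spacing value, and the action names `"stand"` and `"idle"` are hypothetical and are not taken from the M3Act code base. The sketch only mirrors the idea that a module fixes the group size and the permitted atomic actions, then assigns a position, an orientation, and an action to each character.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class GroupActivityModule:
    """Hypothetical stand-in for one parameterized group activity."""
    activity: str
    group_size: int
    allowed_actions: tuple
    formation: str          # "line" or "cluster" in this sketch
    spacing: float = 0.8    # assumed spacing between characters, in meters

    def place(self, rng):
        placements = []
        for i in range(self.group_size):
            if self.formation == "line":
                # Queue-like layout: characters aligned on a straight line.
                x, z = i * self.spacing, 0.0
                heading_deg = 0.0
            else:
                # Cluster layout: uniform sampling inside a disc.
                radius = self.spacing * math.sqrt(self.group_size) * math.sqrt(rng.random())
                theta = rng.uniform(0.0, 2.0 * math.pi)
                x, z = radius * math.cos(theta), radius * math.sin(theta)
                heading_deg = rng.uniform(0.0, 360.0)
            placements.append({
                "position": (x, z),
                "heading_deg": heading_deg,
                # The atomic action is drawn only from the actions this activity permits.
                "action": rng.choice(self.allowed_actions),
            })
        return placements


queue = GroupActivityModule("queueing", 5, ("stand", "idle"), "line")
queue_placements = queue.place(random.Random(0))
```

Because each module is self-contained, several instances with different parameters can be placed in one scene, which is how the text describes multi-group simulations.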

Domain Randomization. M3Act provides domain randomization for almost all aspects of the data generation process to ensure the simulation data is highly diverse. These aspects include the number of groups in a scene, the number of persons in each group, the positions of groups, the alignment of persons in a group, the positions of individuals, the textures for the instantiated characters, and the selection of scenes, lighting conditions, camera positions, characters, group activities, atomic actions, and animation clips. Although animating group activities inherently limits the degrees of freedom in the placement of characters, M3Act alters the shapes in which characters align (e.g., in a cluster, a straight line, or a curve) and thereby still generates diverse activities and achieves sufficient randomization for downstream model generalization. More details regarding the randomization variables and their distributions are provided in Sup. Mat.
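A per-simulation randomization step might look like the sketch below. The asset counts (25 scenes, 104 HDRIs, 5 lighting volumes, 2200 human models, 6 group activities) are the ones stated in this paper; everything else is an assumption made for illustration, including the function name `sample_simulation`, the uniform distributions, the default limits on group count and group size, and the choice to draw a scene and an HDRI together. The actual variables and distributions are described in the Sup. Mat.

```python
import random

# Asset counts reported in the paper.
NUM_SCENES = 25
NUM_HDRIS = 104
NUM_LIGHTING_VOLUMES = 5
NUM_HUMAN_MODELS = 2200
NUM_GROUP_ACTIVITIES = 6

# Alignment shapes mentioned in the text.
FORMATIONS = ("cluster", "line", "curve")


def sample_simulation(rng, max_groups=3, max_group_size=27):
    """Draw one hypothetical simulation configuration (uniform sampling assumed)."""
    groups = []
    for _ in range(rng.randint(1, max_groups)):
        size = rng.randint(2, max_group_size)
        groups.append({
            "activity_id": rng.randrange(NUM_GROUP_ACTIVITIES),
            "formation": rng.choice(FORMATIONS),
            # Characters within a group are sampled without replacement.
            "avatar_ids": rng.sample(range(NUM_HUMAN_MODELS), size),
        })
    return {
        "scene_id": rng.randrange(NUM_SCENES),
        "hdri_id": rng.randrange(NUM_HDRIS),
        "lighting_volume_id": rng.randrange(NUM_LIGHTING_VOLUMES),
        "groups": groups,
    }


config = sample_simulation(random.Random(42))
```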

Rendering and Annotations. M3Act utilizes the Unity high-definition render pipeline for the creation of photorealistic RGB images and leverages the Perception library for capturing annotations. On average, data is generated at a rate of 4.2 FPS on one NVIDIA RTX 3070 Ti graphics card at a resolution of 1920x1080 with all annotations enabled. Similar to PeopleSansPeople, the 2D skeleton follows the COCO [33] format, with additional labelers for exporting 3D joints, meshes, group IDs, and activity classes. After the data generation, the 3D joints and meshes are fitted with SMPL parameters [37, 43].

3.2 Dataset Statistics

M3Act comprises 25 photometric 3D scenes, 104 HDRIs, 5 lighting volumes, 2200 human models, 384 animations (categorized into 14 atomic action classes), and 6 group activities. Using the generator, we first generated our synthetic dataset, M3ActRGB. It contains 6K simulations of every single-group activity and 9K simulations of multi-group configuration, with 4 camera views. Each simulation produces a 5-second video clip, captured in 20 FPS and FHD resolution. In total, M3ActRGB contains 6M RGB images and 48M bounding boxes. The average track length is 4.65 seconds.
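The reported M3ActRGB totals can be cross-checked with a few lines of arithmetic. The sketch assumes that 6K and 9K are the total numbers of single-group and multi-group simulations, and that every camera view of every simulation contributes one full 5-second clip; under those assumptions the frame count matches the stated 6M images.

```python
# Figures stated in the text.
single_group_sims = 6_000
multi_group_sims = 9_000
camera_views = 4
clip_seconds = 5
fps = 20
reported_boxes = 48_000_000

# Assumption: each view of each simulation yields clip_seconds * fps frames.
frames_per_view = clip_seconds * fps
total_frames = (single_group_sims + multi_group_sims) * camera_views * frames_per_view
boxes_per_frame = reported_boxes / total_frames

print(total_frames)     # 6000000
print(boxes_per_frame)  # 8.0
```

Under the same assumptions, the 48M bounding boxes correspond to an average of 8 annotated persons per frame.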

Additionally, we generated M3Act3D, a large-scale 3D-only dataset. It consists of more than 65K simulations of single-group activity. Each contains 150 frames of multi-person collective interactions in 30 FPS, resulting in a total duration of 87.6 hours. As shown in Tab. 2, both the group size and the interaction complexity are significantly higher than those in previous multi-person motion datasets. Specifically, when compared with GTA-Combat [47], M3Act3D contains more persons in a group, has more group activities, and provides a variable number of persons in a group. To the best of our knowledge, M3Act3D is the first large-scale 3D dataset for human group activities with large group sizes as well as per-frame individual action labels. See Sup. Mat. for detailed statistics of both datasets.

Dataset | FPS | # of Acty. | # of Persons (Avg) | # of Persons (Max) | # of Actn. | Duration
SBU [61] | 15 | 8 | 2.0 | 2 | - | 7.6 mins
Duet Dance [32] | 25 | 5 | 2.0 | 2 | - | 2.3 hrs
CHI3D [22] | 50 | 8 | 2.0 | 2 | - | 2.7 hrs
NTU RGBD 120 [35] | 30 | 26 | 2.0 | 2 | - | 15.0 hrs
GTA-Combat [47] | 15 | 1 | 3.2 | 5 | - | 39.0 hrs
M3Act3D | 30 | 6 | 6.7 | 27 | 8 | 87.6 hrs
Table 2: A comparison of datasets for 3D multi-person human activities. M3Act3D is the largest dataset with labels for atomic actions and more persons in a group.

4 Experiments

We showcase the practical utilities of M3Act through three core experiments: Multi-Person Tracking (MPT), Group Activity Recognition (GAR), and controllable Group Activity Generation (GAG). The experiments are carefully designed to cover the following three perspectives:

  • Multi-modality: Our experiments cover various modalities contained within our dataset, including RGB videos, 2D keypoints, and 3D joints. We leverage the rich annotations including bounding boxes, tracklets, group activities, and person action labels.

  • Performance: We conduct the ablation study by altering the amount and the type of synthetic data used for training to see its effect on the model performance.

  • Novel task: We introduce a novel generative task (GAG), showing that synthetic data can not only support existing CV tasks but also create datasets for new research.

4.1 Multi-Person Tracking

The objective of MPT is to predict the trajectories of all persons from a dynamic video stream. Typically, person tracking involves two separate processes: person detection and association. While some prior works approach the tracking task with the tracking-by-detection method [3, 56, 7], we consider end-to-end approaches [63, 65, 58, 23] so that we evaluate the effect of synthetic data on MPT as a whole, rather than an improvement caused only by refined detection.

Real-world Dataset: DanceTrack [48] (DT) is a challenging MPT dataset characterized by dynamic movements with human subjects in uniform appearances. It has a total of 100 videos with over 105K frames.

Synthetic Dataset. Given the motion categories in the real-world dataset, we select a subset of M3ActRGB with groups of people dancing, walking, and running. We use 1000 video clips with a single “dance” group as well as 1500 videos with a “walk” group and a “run” group simulated at the same time (denoted as WalkRun). We alter the use of the synthetic group activities (Dance, WalkRun, and Dance+WalkRun) in our experiments.

Training Data | HOTA ↑ | DetA ↑ | AssA ↑ | IDF1 ↑ | MOTA ↑
DT | 69.8 | 83.0 | 58.9 | 71.6 | 89.3
DT | 68.8 (10) | 82.5 | 57.4 | 70.3 | 90.8
DT + Syn (D) | 59.0 | 75.5 | 46.1 | 59.0 | 82.6
DT + Syn (WR) | 70.1 | 83.1 | 59.4 | 72.5 | 92.0
DT + Syn (WR+D) | 71.9 (2) | 83.6 | 62.0 | 74.7 | 92.6
DT + Syn (WR+D) | 72.2 | 83.4 | 62.6 | 75.5 | 92.7
DT (MOTRv2*) | 73.4 | 83.7 | 64.4 | 76.0 | 92.1
DT + BEDLAM [4] | 55.9 | 68.7 | 44.5 | 53.8 | 79.1
DT + GTA-Humans [6] | 54.1 | 66.8 | 44.2 | 52.1 | 78.8
Table 3: MPT results on DanceTrack with MOTRv2. “D” means synthetic dance group. “WR” means walk and run groups. “WR+D” refers to “D” and “WR” combined. The symbol ⊛ represents the author-provided checkpoint. The symbol † marks the same model with additional association at inference. Numbers in parentheses represent the rank in the DanceTrack leaderboard.
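Model | Syn. Data | HOTA | DetA | AssA | IDF1 | MOTA
MOTR [63] | - | 54.2 | 73.5 | 40.2 | 51.5 | 79.7
MOTR [63] | ✓ | 60.0 | 76.4 | 48.1 | 56.0 | 83.8
MeMOTR [23] | - | 68.5 | 80.5 | 58.4 | 71.2 | 89.9
MeMOTR [23] | ✓ | 71.1 | 81.8 | 62.3 | 74.1 | 92.2
CO-MOT [58] | - | 69.4 | 82.1 | 58.9 | 71.9 | 91.2
CO-MOT [58] | ✓ | 72.5 | 83.6 | 63.3 | 75.9 | 92.8
MOTRv2 [65] | - | 68.8 | 82.5 | 57.4 | 70.3 | 90.8
MOTRv2 [65] | ✓ | 71.9 | 83.6 | 62.0 | 74.7 | 92.6
MOTRv2* [65] | - | 73.4 | 83.7 | 64.4 | 76.0 | 92.1
MOTRv2* [65] | ✓ | 74.6 | 84.1 | 64.9 | 76.4 | 93.1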
Table 4: MPT results on DanceTrack using different methods trained with our synthetic data.

Results. We mix synthetic and real data during training and present the results in Tab. 3. First, adding our synthetic data yields significant improvement in all 5 tracking metrics as well as a hop in ranking on HOTA from 10th to 2nd place. The model trained with our synthetic data plus the extra association, marked as DT+Syn (WR+D), achieves similar performance to MOTRv2*, the same model that is trained with additional validation data with an ensemble of 4 models [65]. This suggests that the synthetic data used in our experiment is equivalent to at least 62.5% more real data. Second, compared with other synthetic data sources, such as BEDLAM and GTA-Humans, M3Act demonstrates superior performance, indicating its better suitability for multi-person dynamic conditions. Third, we observe that the type of synthetic groups affects the model performance on real data. Adding the “WalkRun” groups to the training data is more effective than adding the “Dance” group. This is because, while the DanceTrack dataset contains dynamic dance movements, the real challenge lies in detection and tracking when the subjects switch positions. By design, the positions of the characters in our dance group are well-staged and the movements are nearly synchronous. (See Sup. Mat. for the design.) In contrast, having a walk and a run group together in a scene leads to frequent position switches relative to the camera view and thus improves the model performance. Lastly, Tab. 4 presents the tracking results using different methods. Results indicate that our synthetic data is effective across various models.
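As a reading aid for the tracking metrics, the sketch below relates the HOTA, DetA, and AssA columns of Tab. 3. In the standard definition of HOTA, the score at each localization threshold is the geometric mean of the detection and association accuracies at that threshold, and the final score averages over thresholds. The geometric mean of the reported (already averaged) DetA and AssA is therefore only an approximation of the reported HOTA; the check below is a sanity test on the table, not a re-evaluation, and the variable names are illustrative.

```python
import math

# (HOTA, DetA, AssA) triples copied from the rows of Tab. 3, top to bottom.
table3_rows = [
    (69.8, 83.0, 58.9),
    (68.8, 82.5, 57.4),
    (59.0, 75.5, 46.1),
    (70.1, 83.1, 59.4),
    (71.9, 83.6, 62.0),
    (72.2, 83.4, 62.6),
    (73.4, 83.7, 64.4),
    (55.9, 68.7, 44.5),
    (54.1, 66.8, 44.2),
]

# Gap between the reported HOTA and sqrt(DetA * AssA) for each row.
gaps = [abs(hota - math.sqrt(det_a * ass_a)) for hota, det_a, ass_a in table3_rows]
largest_gap = max(gaps)
```

For these rows the approximation stays within one HOTA point, which illustrates why the gains in HOTA track the gains in association accuracy (AssA) when detection accuracy (DetA) changes little.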

4.2 Group Activity Recognition

The goal of GAR is to determine the class of the group activity performed by the dominant group as well as the action class of each person. We consider Composer [66] and Actor Transformer [25] as the benchmark models. The former is a multi-scale transformer-based model and accepts only 2D keypoints as input. The latter can take combinations of multiple input modalities.

Real-world Dataset: CAD2 [15] and Volleyball Dataset [31]. CAD2 is an extended version of the Collective Activity Dataset [14] that records human group activities and is widely used for GAR benchmarks [57]. Volleyball Dataset (VD) is an action recognition dataset. It has 55 videos with 9 player action labels and 8 team activity labels.

Synthetic Dataset. We use a subset of all single-group data from M3ActRGB. It contains a total of 10K videos and over 600K frames. It contains all group activity and individual action classes that CAD2 provides and further includes 7 more action types.

Model | Pretrained Syn. Data | Group Activity Top 1 Acc (%) ↑ | Person Action Top 1 Acc (%) ↑
Composer [66] | N/A | 84.87 ± 2.3 (88.20) | 81.31 ± 2.4 (83.13)
Composer [66] | 10% | 86.12 ± 1.8 (87.87) | 84.16 ± 1.8 (86.03)
Composer [66] | 25% | 87.65 ± 1.2 (89.01) | 86.36 ± 1.3 (86.81)
Composer [66] | 50% | 89.39 ± 0.4 (90.14) | 86.68 ± 1.5 (87.99)
Composer [66] | 100% | 89.74 ± 1.0 (91.51) | 88.74 ± 1.7 (89.05)
Composer [66] | Gains | +4.87 (+3.31) | +7.43 (+5.92)
Actor Transformer [25] | N/A | 78.08 ± 1.0 (79.47) | 76.22 ± 2.2 (78.07)
Actor Transformer [25] | 10% | 77.59 ± 2.4 (81.00) | 76.01 ± 3.2 (79.76)
Actor Transformer [25] | 25% | 81.36 ± 2.1 (83.19) | 78.86 ± 2.4 (80.05)
Actor Transformer [25] | 50% | 82.72 ± 1.3 (84.56) | 79.95 ± 1.6 (81.47)
Actor Transformer [25] | 100% | 83.67 ± 1.2 (84.88) | 81.65 ± 1.2 (82.22)
Actor Transformer [25] | Gains | +5.59 (+5.41) | +5.43 (+4.15)
Table 5: Results of 2D keypoint-based group activity and person action recognition on CAD2 dataset. The best results are shown in parentheses. The results suggest that pre-training with our synthetic data largely increases the accuracy for both group activity and person actions. Note that group accuracy saturates at 93.4% and 86.2% for Composer and Actor Transformer respectively.
Model 2D RGB Flow CAD2 Syn+CAD2 VD Syn+VD
Composer ✓ 88.2 91.5 94.6 95.1
Actor Transformer ✓ 79.5 84.9 92.3 93.7
✓ 78.2 80.7 91.4 92.5
✓ ✓ 81.0 85.2 93.5 94.3
✓ ✓ 81.3 85.0 94.4 95.0
✓ ✓ 79.5 81.9 93.0 94.1
Table 6: The group activity recognition accuracy on CAD2 and Volleyball Dataset using different input modalities.

Results. We experimentally study how the size of our pre-training synthetic dataset and the capacity of a GAR model affect generalization from the synthetic to the real domain. We first train the models on different amounts of synthetic data. Then we fine-tune them on CAD2 and report the top-1 accuracy of both group activity and person action recognition on the test set of CAD2. Tab. 5 presents the results using only 2D keypoints as input. We see a common trend for both GAR models: the recognition accuracy increases as more synthetic data is used for pre-training. With 100% of the synthetic data, the accuracy of Composer increases by an average of 4.87% at the group level and 7.43% at the person level, whereas Actor Transformer sees a 5.59% increase at the group level and a 5.43% increase at the person level. Moreover, Tab. 6 shows the group recognition accuracy using different input modalities on CAD2 and VD. The performance gains in the experiment indicate that our synthetic data can effectively benefit the downstream GAR task across different methods, input modalities, and datasets.
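The gains quoted above follow directly from Tab. 5 as the difference between the 100% pre-training row and the row without synthetic pre-training. The short sketch below reproduces them; the dictionary layout and the helper name `gains` are illustrative, and each pair holds the mean accuracy followed by the best accuracy (the value in parentheses in the table).

```python
# (mean, best) top-1 accuracies from Tab. 5, without and with 100% synthetic pre-training.
table5 = {
    "Composer": {
        "N/A": {"group": (84.87, 88.20), "person": (81.31, 83.13)},
        "100%": {"group": (89.74, 91.51), "person": (88.74, 89.05)},
    },
    "Actor Transformer": {
        "N/A": {"group": (78.08, 79.47), "person": (76.22, 78.07)},
        "100%": {"group": (83.67, 84.88), "person": (81.65, 82.22)},
    },
}


def gains(model):
    """Return the (mean, best) accuracy gain per level, in percentage points."""
    baseline = table5[model]["N/A"]
    pretrained = table5[model]["100%"]
    return {
        level: tuple(round(after - before, 2)
                     for before, after in zip(baseline[level], pretrained[level]))
        for level in baseline
    }


composer_gains = gains("Composer")
actor_transformer_gains = gains("Actor Transformer")
```

The differences are absolute changes in accuracy, that is, percentage points rather than relative improvements.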

4.3 Controllable 3D Group Activity Generation

While the procedural generation of our group activities in M3Act yields realistic and diverse human activities, the implementation requires considerable effort and involves the design and application of specific heuristics. Learning a generative model for human group activities, instead, encodes the heuristics to the architecture inherently and encompasses the capabilities of probabilistically generating diverse activities, with control over the entire group of human motions from various signals. To this end, we introduce controllable 3D group activity generation (GAG).

Definition. Let $G_t^p = \{m_i^n\}_{i=1\sim t,\, n=1\sim p}$ be a group of human motions with $t$ frames and $p$ persons. The individual pose is denoted as $m_i^n \in \mathbb{R}^{j \times d}$, where $j$ is the number of joints in the skeleton and $d$ is the joint dimension. The goal of GAG is to synthesize a group of 3D human motions $G_t^p$ from Gaussian noise, given an activity label and an arbitrary group size as input conditions. It requires a model capable of learning the temporal and spatial motion dependencies among persons within the same group and generating human motions with any group size simultaneously. GAG is related to dyadic motion generation [16, 10] and partner-conditioned reaction generation [46], but involves the motion interactions of more than two persons.

Baselines. Although previous works [16, 46, 47] can generate motions for multiple persons, they are limited to dyadic scenarios or groups with a fixed number of persons. Therefore, we present two baselines. The first one is the vanilla motion diffusion model, MDM [50]. It was proposed for probabilistic single-person motion generation from an input condition. We adopt their action-to-motion architecture for conditional synthesis and train the model on M3Act3D for generating an individual person’s motion from an input group activity class label. In order to generate a group of human motions from a given group size, we repeat the single-person inference several times. In other words, the individual motions are generated independently by MDM. Our second baseline, MDM+IFormer is extended from MDM and includes an additional interaction transformer (IFormer) that works along the dimension of the persons. The interaction transformer encourages the model to learn the inter-person motion dependencies. At inference, MDM+IFormer is capable of producing coordinated group activities in one forward pass, due to its modeling of human interactions.
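To illustrate what "working along the dimension of the persons" can mean, the sketch below applies one step of scaled dot-product self-attention across the persons of a single frame. It is a simplified stand-in and not the authors' IFormer: it uses a single head, no learned projections, and plain Python lists, and the function name `person_attention` is hypothetical. It is meant only to show why such a layer accepts any group size and lets each person's features depend on everyone else's.

```python
import math


def person_attention(features):
    """Mix per-person feature vectors with single-head self-attention.

    features: list of P vectors (one per person, same length) for one frame.
    Returns P vectors, each a softmax-weighted average over all persons.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in features:
        # Similarity of this person to every person in the group.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in features]
        # Numerically stable softmax over the person dimension.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of all persons' features.
        mixed.append([
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(dim)
        ])
    return mixed


group = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed_group = person_attention(group)
```

Nothing in the function depends on the number of persons, so the same operation applies to a group of 2 or of 27. By contrast, sampling each person independently, as in the plain MDM baseline, never exposes one person's motion to another's.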

Implementation. We utilize a common skeleton for all individual persons with 25 joints. We process the data so each motion is represented as both the 6D joint rotations [67] and the root positions. The final representation of a collective group activity with multiple persons is a tensor with shape (#persons × #frames × 26 × 6). For a fair comparison, both baseline models were trained on an NVIDIA RTX 3090 graphics card with 90% data from M3Act3D for 320K iterations and then tested on the other 10%.
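As a rough illustration of this representation, the sketch below converts a 3×3 rotation matrix to the 6D form of [67] and back via Gram-Schmidt orthogonalization, then packs one pose into a 26 × 6 array. The column ordering of the 6D vector, the zero-padding of the 3D root position to six values, and all function names are assumptions made for illustration; they are not taken from the M3Act code.

```python
import math

NUM_JOINTS = 25  # joints in the shared skeleton (from the text)

def matrix_to_6d(rot):
    """Keep the first two columns of a 3x3 rotation matrix (assumed ordering)."""
    col0 = [rot[r][0] for r in range(3)]
    col1 = [rot[r][1] for r in range(3)]
    return col0 + col1

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def sixd_to_matrix(six):
    """Recover a rotation matrix from a 6D vector via Gram-Schmidt."""
    a, b = six[:3], six[3:]
    x = _normalize(a)
    dot = sum(xi * bi for xi, bi in zip(x, b))
    y = _normalize([bi - dot * xi for xi, bi in zip(x, b)])
    z = _cross(x, y)
    # x, y, z become the columns of the recovered matrix.
    return [[x[r], y[r], z[r]] for r in range(3)]

def pack_pose(joint_rotations, root_position):
    """Build one 26 x 6 pose: 25 joint rows plus one root-position row."""
    assert len(joint_rotations) == NUM_JOINTS
    rows = [matrix_to_6d(rot) for rot in joint_rotations]
    # Zero-padding the 3D root position to six values is an assumption.
    rows.append(list(root_position) + [0.0, 0.0, 0.0])
    return rows
```

Stacking such poses over frames and persons would give the (#persons × #frames × 26 × 6) tensor described above.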

| Method | Group Acc ↑ | Group FID ↓ | Group Diversity → | Group Multimodality → | Person FID ↓ | Person Diversity → | Person Multimodality → |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GT | 99.937 | 0.001 ± 0.000 | 17.752 ± 0.025 | 3.491 ± 0.012 | 0.001 ± 0.000 | 14.506 ± 0.013 | 7.546 ± 0.010 |
| MDM | 97.367 | 3.909 ± 0.019 | 17.683 ± 0.037 | **4.155** ± 0.019 | 4.434 ± 0.010 | 14.158 ± 0.035 | **7.588** ± 0.013 |
| MDM+IFormer | 98.100 | **3.242** ± 0.016 | 17.855 ± 0.040 | 4.198 ± 0.021 | **3.066** ± 0.007 | 14.827 ± 0.031 | 6.945 ± 0.011 |
Table 7: The results of the generated group activities with the learning-based metrics at both levels. An up arrow means the result is better when the metric score is higher. A right arrow means the metric score should be close to ground truth (GT).

4.3.1 Evaluation

Metrics. Due to the probabilistic nature of the task, we consider the following learning-based metrics: recognition accuracy, Fréchet Inception Distance (FID), diversity, and multimodality, as defined in [27]. These metrics, however, were originally designed for single-person motion generation. To evaluate the generated group activities, we report them at both the group and person levels because they account for the fidelity and variation of both the groups and the individuals. We train a multi-scale group activity recognition model using the Composer [66] architecture for the metrics. See Sup. Mat. for detailed explanations of how to construct the learning-based metrics, including the recognition model as well as the latent representations at both levels.
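A minimal sketch of the diversity and multimodality scores, assuming the pairwise-distance definitions of [27]: diversity averages the distance between randomly paired features drawn from all samples, and multimodality does the same within each activity class before averaging over classes. The feature vectors would come from the recognition model; the function names, the number of pairs, and sampling with replacement are illustrative choices, not the evaluation code used for Tab. 7.

```python
import math
import random

def _dist(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def diversity(features, num_pairs, rng):
    """Mean distance between randomly paired samples from the whole set."""
    first = [rng.choice(features) for _ in range(num_pairs)]
    second = [rng.choice(features) for _ in range(num_pairs)]
    return sum(_dist(u, v) for u, v in zip(first, second)) / num_pairs

def multimodality(features_by_class, num_pairs, rng):
    """Mean distance between random pairs sharing a label, averaged over classes."""
    per_class = [diversity(feats, num_pairs, rng)
                 for feats in features_by_class.values()]
    return sum(per_class) / len(per_class)
```

Applying the same functions to group-level and person-level latent features would yield the two sets of columns reported for these metrics.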

Figure 3: The qualitative comparison of two group activities from ground truth (GT), MDM, and MDM+IFormer. The distribution of the persons from MDM+IFormer is closer to GT.

In addition to the learning-based metrics, we tailor four position-based metrics, collision frequency, repulsive interaction force, contact repulsive force, and total repulsive force, to the evaluation of human groups. The latter three are based on the social force model [28, 29], which explains crowd behaviors using socio-psychological and physical forces. Here we describe the four metrics:

Collision frequency indicates how often a collision (or an invalid interaction) would occur within a group. The collision count is calculated based on a distance threshold between any two persons in a group. It is then normalized by the total number of interactions to obtain the frequency. In other words, if $N$ persons are in a group, the normalization denominator is $N \cdot (N-1)/2$.

Repulsive interaction force describes the psychological tendency of two persons to stay away from each other. As the distance between two persons decreases, the repulsive force increases exponentially.

Contact repulsive force represents the compression body force when two persons collide with each other. The contact force is nonzero only when two persons collide. A larger contact force means the interaction is less likely to occur.

Total repulsive force is the accumulation of interaction and contact forces.

All four position-based metrics are calculated using the Euclidean distances of the persons’ positions, with the shoulder width as the collision threshold. The social forces are calculated by averaging the magnitude of each individual’s force accumulated through all its interactions with other persons. A well-performing model should generate group activities with low collision frequency and similar social force values to the ground truth. For the evaluation, we generated 500 samples for each group activity using the two well-trained baseline models. Each generated sequence contains 60 frames. We use the test split as the ground truth and randomly extract the group activity of the same length for evaluation. To ensure the distributions of group sizes are similar to the ground truth, we calculate the minimum and maximum group sizes from the training split and uniformly sample a group size from that range to generate group activities. Please refer to Sup. Mat. for more details regarding the baseline architectures, metrics formulas, and evaluation.
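The four position-based metrics can be sketched as follows. The exact formulas are given in Sup. Mat.; this sketch only assumes the general form of the social force model [29], with an exponential interaction term and a linear contact term. The shoulder width value, the constants A, B, and K, the use of 2D ground-plane positions, and the accumulation of collisions over frames are all assumptions for illustration.

```python
import math
from itertools import combinations

SHOULDER_WIDTH = 0.45           # collision threshold in meters (assumed value)
A, B, K = 2000.0, 0.08, 1.2e5   # strength, range, stiffness (assumed, after [29])

def _distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def collision_frequency(frames):
    """Collisions summed over frames, divided by N(N-1)/2 pairs (needs N >= 2)."""
    num_persons = len(frames[0])
    num_pairs = num_persons * (num_persons - 1) / 2
    collisions = sum(
        1
        for positions in frames
        for p, q in combinations(positions, 2)
        if _distance(p, q) < SHOULDER_WIDTH
    )
    return collisions / num_pairs

def social_forces(positions):
    """Mean per-person magnitude of (interaction, contact, total) force in one frame."""
    totals = [0.0, 0.0, 0.0]
    for i, p in enumerate(positions):
        inter, contact = [0.0, 0.0], [0.0, 0.0]
        for j, q in enumerate(positions):
            d = _distance(p, q)
            if i == j or d == 0.0:
                continue  # skip self and coincident points (direction undefined)
            direction = ((p[0] - q[0]) / d, (p[1] - q[1]) / d)  # points away from q
            overlap = SHOULDER_WIDTH - d
            f_inter = A * math.exp(overlap / B)   # grows exponentially as d shrinks
            f_contact = K * max(overlap, 0.0)     # nonzero only during a collision
            for axis in range(2):
                inter[axis] += f_inter * direction[axis]
                contact[axis] += f_contact * direction[axis]
        total = [inter[0] + contact[0], inter[1] + contact[1]]
        for idx, vec in enumerate((inter, contact, total)):
            totals[idx] += math.hypot(vec[0], vec[1])
    return tuple(t / len(positions) for t in totals)
```

Because each person's forces are summed as vectors before the magnitude is taken, the total force need not equal the sum of the interaction and contact magnitudes.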

| Method | Collision Freq. ↓ | Interaction Force → | Contact Force → | Total Force → |
| --- | --- | --- | --- | --- |
| GT | 0.037 | 65.79 | 46.55 | 112.33 |
| MDM | 3.643 | 7,121.50 | 3,822.47 | 10,903.57 |
| MDM+IFormer | 1.157 | 1,796.25 | 1,373.40 | 3,167.80 |
Table 8: Results of the generated human activities with position-based metrics. The collision frequency is calculated on a 60-frame group activity and normalized by the total number of interactions in a group.

4.3.2 Results

MDM+IFormer is capable of generating group activities with well-aligned character positions. As shown in Fig. 3, MDM generates human groups that are poorly positioned. For example, the persons in a walking group do not walk in the same direction and the persons are poorly placed in a queueing group. This is because MDM generates the group activities by inferring the individual motions independently. The placement of the individuals simply follows the probabilistic distribution of all persons’ positions in the dataset. On the other hand, MDM+IFormer successfully learns the probabilistic distribution for the entire group due to its interaction transformer. The persons are better aligned in a group and they have coordinated motions.

Both baselines are capable of generating diverse activities that match the input condition, but MDM+IFormer obtains better FID scores. Tab. 7 shows the results for the learning-based metrics. When compared with ground truth, both baselines obtain similar scores on recognition accuracy, diversity, and multimodality at both levels. The results indicate that both models successfully learn to generate distinguishable individual motions and group activities. The generated motions are also as diverse as the ground truth. The observations align with the results on action-to-motion generation in MDM [50]. MDM+IFormer receives a lower FID score than MDM, suggesting that MDM+IFormer generates group activities with higher quality.

The interaction transformer in MDM+IFormer greatly lowers the collision frequency within the generated group activities. As shown in Tab. 8, the collision frequency of the group activities generated by MDM+IFormer is much lower than that of the vanilla MDM. This suggests that the interaction transformer better learns the inter-person dependencies and generates more valid person interactions. In fact, we observe that the group activities generated by the vanilla MDM sometimes contain overlapping person positions. The high collision frequency of the MDM baseline also inflates the repulsive forces, which makes the social forces within the group activities of MDM implausible.

5 Discussion and Conclusion

We show the merit of M3Act by conducting three core experiments with multiple modalities and enhanced performance, as well as introducing a novel generative task. In both MPT and GAR experiments, we observe positive correlations between the volume of synthetic data used for training and model performance, indicating an improved model generalizability to unseen test cases with more synthetic data. Moreover, our comparison between DT+Syn and MOTRv2* reveals that synthetic data can replace certain real-world data from the target domain without sacrificing performance [13]. Essentially, our synthetic data reduces the need for extensive real data during training, thereby effectively lowering the costs associated with data collection and annotation. This discovery represents a promising step towards achieving few-shot and potentially zero-shot sim2real transfer. In our 3D Group Activity Generation experiment, we observe that MDM+IFormer, despite being a baseline for the novel task, learns to embed the heuristics for person interactions and produces well-aligned groups given the controls. It is important to highlight that the generative approach, though it currently underperforms the procedural method (GT), demonstrates the unique potential of controlling the group motions directly from various signals, including activity class, group size, trajectory, density, speed, and text inputs. As more data become available and the capacity of generative models increases, we expect the generative method to eventually prevail, leading to broader applications for social interactions and human collective activities.

While the complexity of group behaviors in our dataset may be constrained by the heuristics used for activity authoring, M3Act offers notable flexibility for incorporating new group dynamics tailored to any specific downstream task. These new groups could be derived from expert-guided heuristics, rules generated by large language models, or outputs from our 3D GAG model. Furthermore, we recognize the existing domain gaps between synthetic and real-world data. With more assets included in our data generator in future iterations, we can enhance model generalizability and reduce these disparities.

6 Acknowledgement

The research was supported in part by NSF awards: IIS-1703883, IIS-1955404, IIS-1955365, RETTL-2119265, and EAGER-2122119. This work was also partially supported by the Center for Smart Streetscapes, an NSF Engineering Research Center, under cooperative agreement EEC-2133516. This material is based upon work supported by the U.S. Department of Homeland Security under Grant Award Number 22STESE00001 01 01. Disclaimer: The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.

References

  • ren [2023] Renderpeople. https://renderpeople.com/, 2023. Accessed: 2023-02-10.
  • Bazavan et al. [2021] Eduard Gabriel Bazavan, Andrei Zanfir, Mihai Zanfir, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Hspace: Synthetic parametric humans animated in complex environments. arXiv preprint arXiv:2112.12867, 2021.
  • Bewley et al. [2016] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE, 2016.
  • Black et al. [2023] Michael J Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023.
  • Borkman et al. [2021] Steve Borkman, Adam Crespi, Saurav Dhakad, Sujoy Ganguly, Jonathan Hogins, You-Cyuan Jhang, Mohsen Kamalzadeh, Bowen Li, Steven Leal, Pete Parisi, et al. Unity perception: Generate synthetic data for computer vision. arXiv preprint arXiv:2107.04259, 2021.
  • Cai et al. [2021] Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren, Zhengyu Lin, Haiyu Zhao, Lei Yang, Chen Change Loy, and Ziwei Liu. Playing for 3d human recovery. arXiv preprint arXiv:2110.07588, 2021.
  • Cao et al. [2023] Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9686–9696, 2023.
  • Chang [2020] Che-Jui Chang. Transfer learning from monolingual asr to transcription-free cross-lingual voice conversion. arXiv preprint arXiv:2009.14668, 2020.
  • Chang and Jeng [2021] Che-Jui Chang and Shyh-Kang Jeng. Acoustic anomaly detection using multilayer neural networks and semantic pointers. Journal of Information Science & Engineering, 37(1), 2021.
  • Chang et al. [2022a] Che-Jui Chang, Sen Zhang, and Mubbasir Kapadia. The ivi lab entry to the genea challenge 2022–a tacotron2 based method for co-speech gesture generation with locality-constraint attention mechanism. In Proceedings of the 2022 International Conference on Multimodal Interaction, pages 784–789, 2022a.
  • Chang et al. [2022b] Che-Jui Chang, Long Zhao, Sen Zhang, and Mubbasir Kapadia. Disentangling audio content and emotion with adaptive instance normalization for expressive facial animation synthesis. Computer Animation and Virtual Worlds, 33(3-4):e2076, 2022b.
  • Chang et al. [2023] Che-Jui Chang, Samuel S Sohn, Sen Zhang, Rajath Jayashankar, Muhammad Usman, and Mubbasir Kapadia. The importance of multimodal emotion conditioning and affect consistency for embodied conversational agents. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 790–801, 2023.
  • Chang et al. [2024] Che-Jui Chang, Danrui Li, Seonghyeon Moon, and Mubbasir Kapadia. On the equivalency, substitutability, and flexibility of synthetic data, 2024.
  • Choi et al. [2009] Wongun Choi, Khuram Shahid, and Silvio Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In 2009 IEEE 12th international conference on computer vision workshops, ICCV Workshops, pages 1282–1289. IEEE, 2009.
  • Choi et al. [2011] Wongun Choi, Khuram Shahid, and Silvio Savarese. Learning context for collective activity recognition. In CVPR 2011, pages 3273–3280. IEEE, 2011.
  • Chopin et al. [2023] Baptiste Chopin, Hao Tang, Naima Otberdout, Mohamed Daoudi, and Nicu Sebe. Interaction transformer for human reaction generation. IEEE Transactions on Multimedia, 2023.
  • Dendorfer et al. [2020] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.
  • Doering et al. [2022] Andreas Doering, Di Chen, Shanshan Zhang, Bernt Schiele, and Juergen Gall. Posetrack21: A dataset for person search, multi-object tracking and multi-person pose tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20963–20972, 2022.
  • Ebadi et al. [2021] Salehe Erfanian Ebadi, You-Cyuan Jhang, Alex Zook, Saurav Dhakad, Adam Crespi, Pete Parisi, Steven Borkman, Jonathan Hogins, and Sujoy Ganguly. Peoplesanspeople: a synthetic data generator for human-centric computer vision. arXiv preprint arXiv:2112.09290, 2021.
  • Ebadi et al. [2022] Salehe Erfanian Ebadi, Saurav Dhakad, Sanjay Vishwakarma, Chunpu Wang, You-Cyuan Jhang, Maciek Chociej, Adam Crespi, Alex Thaman, and Sujoy Ganguly. Psp-hdri+: A synthetic dataset generator for pre-training of human-centric computer vision models. arXiv preprint arXiv:2207.05025, 2022.
  • Ehsanpour et al. [2022] Mahsa Ehsanpour, Fatemeh Saleh, Silvio Savarese, Ian Reid, and Hamid Rezatofighi. Jrdb-act: A large-scale dataset for spatio-temporal action, social group and activity detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20983–20992, 2022.
  • Fieraru et al. [2020] Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, and Cristian Sminchisescu. Three-dimensional reconstruction of human interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7214–7223, 2020.
  • Gao and Wang [2023] Ruopeng Gao and Limin Wang. MeMOTR: Long-term memory-augmented transformer for multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9901–9910, 2023.
  • Gao et al. [2022] Ruohan Gao, Zilin Si, Yen-Yu Chang, Samuel Clarke, Jeannette Bohg, Li Fei-Fei, Wenzhen Yuan, and Jiajun Wu. Objectfolder 2.0: A multisensory object dataset for sim2real transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10598–10608, 2022.
  • Gavrilyuk et al. [2020] Kirill Gavrilyuk, Ryan Sanford, Mehrsan Javan, and Cees GM Snoek. Actor-transformers for group activity recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 839–848, 2020.
  • Ge et al. [2021] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  • Guo et al. [2020] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.
  • Helbing and Molnar [1995] Dirk Helbing and Peter Molnar. Social force model for pedestrian dynamics. Physical review E, 51(5):4282, 1995.
  • Helbing et al. [2000] Dirk Helbing, Illés Farkas, and Tamas Vicsek. Simulating dynamical features of escape panic. Nature, 407(6803):487–490, 2000.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Ibrahim et al. [2016] Mostafa S Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, and Greg Mori. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1971–1980, 2016.
  • Kundu et al. [2020] Jogendra Nath Kundu, Himanshu Buckchash, Priyanka Mandikal, Anirudh Jamkhandi, Venkatesh Babu Radhakrishnan, et al. Cross-conditioned recurrent networks for long-term synthesis of inter-person human motion interactions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2724–2733, 2020.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Lin et al. [2020] Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Rui Qian, Tao Wang, Ning Xu, Hongkai Xiong, Guo-Jun Qi, and Nicu Sebe. Human in events: A large-scale benchmark for human-centric video analysis in complex events. arXiv preprint arXiv:2005.04490, 2020.
  • Liu et al. [2019] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE transactions on pattern analysis and machine intelligence, 42(10):2684–2701, 2019.
  • Liu et al. [2022] Qing Liu, Adam Kortylewski, Zhishuai Zhang, Zizhang Li, Mengqi Guo, Qihao Liu, Xiaoding Yuan, Jiteng Mu, Weichao Qiu, and Alan Yuille. Learning part segmentation through unsupervised domain adaptation from synthetic vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19140–19151, 2022.
  • Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
  • Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, 2019.
  • Martin-Martin et al. [2021] Roberto Martin-Martin, Mihir Patel, Hamid Rezatofighi, Abhijeet Shenoi, JunYoung Gwak, Eric Frankel, Amir Sadeghian, and Silvio Savarese. Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE transactions on pattern analysis and machine intelligence, 2021.
  • Milan et al. [2016] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • Mishra et al. [2022] Samarth Mishra, Rameswar Panda, Cheng Perng Phoo, Chun-Fu Richard Chen, Leonid Karlinsky, Kate Saenko, Venkatesh Saligrama, and Rogerio S Feris. Task2sim: Towards effective pre-training and transfer from synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9194–9204, 2022.
  • Patel et al. [2021] Priyanka Patel, Chun-Hao P Huang, Joachim Tesch, David T Hoffmann, Shashank Tripathi, and Michael J Black. Agora: Avatars in geography optimized for regression analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13468–13478, 2021.
  • Pavlakos et al. [2019] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Perez et al. [2022] Mauricio Perez, Jun Liu, and Alex C Kot. Skeleton-based relational reasoning for group activity analysis. Pattern Recognition, 122:108360, 2022.
  • Picetti et al. [2023] Francesco Picetti, Shrinath Deshpande, Jonathan Leban, Soroosh Shahtalebi, Jay Patel, Peifeng Jing, Chunpu Wang, Charles Metze III, Cameron Sun, Cera Laidlaw, James Warren, Kathy Huynh, River Page, Jonathan Hogins, Adam Crespi, Sujoy Ganguly, and Salehe Erfanian Ebadi. Anthronet: Conditional generation of humans via anthropometrics. 2023.
  • Rahman et al. [2022] Md Ashiqur Rahman, Jasorsi Ghosh, Hrishikesh Viswanath, Kamyar Azizzadenesheli, and Aniket Bera. Pacmo: Partner dependent human motion generation in dyadic human activity using neural operators. arXiv preprint arXiv:2211.16210, 2022.
  • Song et al. [2022] Ziyang Song, Dongliang Wang, Nan Jiang, Zhicheng Fang, Chenjing Ding, Weihao Gan, and Wei Wu. Actformer: A gan transformer framework towards general action-conditioned 3d human motion generation. arXiv preprint arXiv:2203.07706, 2022.
  • Sun et al. [2022a] Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, and Ping Luo. Dancetrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20993–21002, 2022a.
  • Sun et al. [2022b] Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, Luc Van Gool, Bernt Schiele, Federico Tombari, and Fisher Yu. Shift: a synthetic driving dataset for continuous multi-task domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21371–21382, 2022b.
  • Tevet et al. [2022] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
  • Thilakarathne et al. [2022] Haritha Thilakarathne, Aiden Nibali, Zhen He, and Stuart Morgan. Pose is all you need: The pose only group activity recognition system (pogars). Machine Vision and Applications, 33(6):95, 2022.
  • Varol et al. [2017] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 109–117, 2017.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vendrow et al. [2023] Edward Vendrow, Duy Tho Le, Jianfei Cai, and Hamid Rezatofighi. Jrdb-pose: A large-scale dataset for multi-person pose estimation and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4811–4820, 2023.
  • Wojke and Bewley [2018] Nicolai Wojke and Alex Bewley. Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 748–756. IEEE, 2018.
  • Wojke et al. [2017] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.
  • Wu et al. [2021] Li-Fang Wu, Qi Wang, Meng Jian, Yu Qiao, and Bo-Xuan Zhao. A comprehensive review of group activity recognition in videos. International Journal of Automation and Computing, 18:334–350, 2021.
  • Yan et al. [2023] Feng Yan, Weixin Luo, Yujie Zhong, Yiyang Gan, and Lin Ma. Bridging the gap between end-to-end and non-end-to-end multi-object tracking, 2023.
  • Yang et al. [2023] Zhitao Yang, Zhongang Cai, Haiyi Mei, Shuai Liu, Zhaoxi Chen, Weiye Xiao, Yukun Wei, Zhongfei Qing, Chen Wei, Bo Dai, et al. Synbody: Synthetic dataset with layered human models for 3d human perception and modeling. arXiv preprint arXiv:2303.17368, 2023.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
  • Yun et al. [2012] Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L Berg, and Dimitris Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pages 28–35. IEEE, 2012.
  • Zappardino et al. [2021] Fabio Zappardino, Tiberio Uricchio, Lorenzo Seidenari, and Alberto Del Bimbo. Learning group activities from skeletons without individual action labels. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 10412–10417. IEEE, 2021.
  • Zeng et al. [2022] Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. In European Conference on Computer Vision, pages 659–675. Springer, 2022.
  • Zhai et al. [2022] Xiaolin Zhai, Zhengxi Hu, Dingye Yang, Lei Zhou, and Jingtai Liu. Spatial temporal network for image and skeleton based group activity recognition. In Proceedings of the Asian Conference on Computer Vision, pages 20–38, 2022.
  • Zhang et al. [2023] Yuang Zhang, Tiancai Wang, and Xiangyu Zhang. Motrv2: Bootstrapping end-to-end multi-object tracking by pretrained object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22056–22065, 2023.
  • Zhou et al. [2022] Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, and Hans Peter Graf. Composer: Compositional reasoning of group activity in videos with keypoint-only modality. Proceedings of the 17th European Conference on Computer Vision (ECCV 2022), 2022.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.

A. Data Generator: M3Act

A.1. Authoring of Group Activities

Authoring group activities in M3Act is non-trivial because people adhere to social norms while forming groups. The authoring requires nuanced adjustments that vary from group to group, including the alignment of characters, their orientations, and the permitted atomic actions. We summarize these rules and adjustments in Tab. 9. For example, characters in a talking group are positioned in a circle facing the center. For queueing groups, characters can form a straight line, curve, etc., and individuals in a queue can be texting, idling, talking, and so on.

Group walking, running (jogging), and dancing are the three group activities with drastic body movements. In particular, collision avoidance is a social norm that humans implicitly follow during collective walking or running. Therefore, we propose a simple algorithm to dynamically adjust the animation speed for each character engaged in the walking and running activities, thus mitigating avatar collisions. The detailed algorithm is shown in Alg. 1. Specifically, the animation speed of a character decreases when any character is in front of it and close to it. When no potential collision is detected, the animation speed of a character can increase up to its initial speed. Because animating complex interactions for group dancing is challenging, we instead enforce nearly synchronous movements for all individuals.
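The speed adjustment of Alg. 1 can be sketched in Python as follows. The distance threshold (0.8), angle threshold (60 degrees), scaling factors (0.96 and 1.03), and speed floor (0.1) are those of Alg. 1; the Character class, its field names, and the use of 2D ground-plane vectors are illustrative assumptions rather than the Unity implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Character:
    position: tuple    # (x, z) on the ground plane
    forward: tuple     # forward direction (x, z)
    init_speed: float  # initial animation speed
    speed: float       # current animation speed

def angle_between(u, v):
    """Angle in degrees between two 2D vectors (0 if either has zero length)."""
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    if norm == 0.0:
        return 0.0
    cos = max(-1.0, min(1.0, (u[0] * v[0] + u[1] * v[1]) / norm))
    return math.degrees(math.acos(cos))

def adjust_speeds(characters):
    """One per-frame update of every character's animation speed."""
    for ch in characters:
        blocked = False
        for other in characters:
            if other is ch:
                continue
            offset = (other.position[0] - ch.position[0],
                      other.position[1] - ch.position[1])
            dist = math.hypot(offset[0], offset[1])
            # Someone is close and roughly in front: a potential collision.
            if dist <= 0.8 and angle_between(ch.forward, offset) <= 60.0:
                blocked = True
                break
        if blocked:
            ch.speed = max(ch.speed * 0.96, 0.1)            # slow down, keep a floor
        else:
            ch.speed = min(ch.speed * 1.03, ch.init_speed)  # recover up to initial
```

Calling `adjust_speeds` once per rendered frame slows only the trailing character in a close pair, since the leading character sees the other behind it.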

| Activity Name | Alignment | Face At | Atomic Actions | Other Conditions |
| --- | --- | --- | --- | --- |
| Walking | Straight Line, Circle, Rectangle | Same Direction | walk | Adjust animation speed at runtime |
| Waiting | Multi-Row Straight Lines | Same Direction | idle, text, talk, point, wave | N/A |
| Queueing | Straight Line, One-Corner Line, Two-Corner Line, Parabola, Curve | Front of Queue | idle, text, talk, point | N/A |
| Talking | Circle | Group Center | talk | N/A |
| Dancing | Multi-Row Straight Lines | Same Direction | dance | Nearly-synchronous movements |
| Jogging | Straight Line, Circle, Rectangle | Same Direction | run | Adjust animation speed at runtime |
Table 9: Rules and adjustments for activity authoring.

| Category | Randomizer | Variable | Distribution |
| --- | --- | --- | --- |
| Scenes | Scene Selection | scene | a set of prebuilt 3D environments |
| | HDRI Selection | hdri | a set of collected HDRIs |
| Camera | Camera Position | radius | Uniform(6, 10) |
| | | camera rotation | Uniform(0, 360) |
| | | camera height | Uniform(1, 5) |
| | | perturbation | Cartesian[Uniform(-1, 1), 0, Uniform(-1, 1)] |
| Lights | Lighting Volume | volume | a set of lighting volume conditions |
| | Light Type | light type | a set of light types |
| | Light Position | XZ position | Cartesian[Uniform(-20, 20), 0, Uniform(-20, 20)] |
| | | height | Uniform(5, 10) |
| | Light Intensity | intensity range | Uniform(0.5, 3) |
| | Light Rotation | orientation range | Face at Cartesian[Uniform(-50, 50), 0, Uniform(-50, 50)] |
| Multi-Group | Group Number | group number range | UniformRange(1, MaxNumGroups) |
| | Group Selection | group activity | a set of modular groups |
| | Group Placement | group position | Cartesian[Uniform(-20, 20), 0, Uniform(-20, 20)] |
| | | group rotation | Euler[0, Uniform(0, 360), 0] |
| Activity Authoring | Character Number | character number range | UniformRange(1, MaxNumCharacters) |
| | Multi-person Subgroup | multi-person number range | UniformRange(0, MaxNum) |
| | Character Selection | character | a set of 2200 characters |
| | Character Texture | body color | RGBA[Uniform(0.4, 1), Uniform(0.4, 1), Uniform(0.4, 1), Uniform(0.6, 1)] |
| | | clothes colors | HSV[Uniform(0, 1), Uniform(0, 1), Uniform(0.4, 1)] |
| | Character Alignment | alignment method | a set of alignment methods |
| | Character Interval | interval | Uniform(MinInterval, MaxInterval) |
| | Character Perturbation | position perturbation | Uniform(-0.25*interval, 0.25*interval) |
| | | rotation perturbation | Uniform(-45, 45) |
| | Atomic Action | atomic action | a set of permitted atomic actions |
| | Animation | animation clip | a set of animation clips |
| | | blended parameter | Uniform(0, 1) |
| | | speed | Uniform(0.8, 1.2) |
| | | normalized starting time | Uniform(0, 1) |
Table 10: List of randomizers, variables, and distributions used in M3Act.

A.2. List of Variables for Domain Randomization

Domain randomization allows M3Act to generate massive-scale, diverse group activity data. Compared to PeopleSansPeople [19], M3Act contains a much higher degree of domain randomization for animating human motions and activities. M3Act consists of a total of 14 atomic action classes and 384 animation clips, each with several blended style parameters such as character arm-space and stride. The domain randomization covers the scenes, cameras, lights, multi-groups, and activity authoring, as listed in Tab. 10. We describe the randomizers in M3Act below.

Algorithm 1 Dynamic speed adjustment for collision avoidance.
Require: a list of instantiated characters and their initial animation speeds.
At every frame:
for character in Characters do
    init_speed ← Initial Speed of character
    Speed ← Current Speed of character
    pos ← Position of character
    forward ← Forward Vector of character
    flag ← False
    for other_character in Characters do
        if character is not other_character then
            pos_other ← Position of other_character
            offset ← pos_other - pos
            angle ← Angle Between forward and offset
            dist ← Length of offset
            if dist ≤ 0.8 and angle ≤ 60 then
                flag ← True
            end if
        end if
    end for
    if flag then
        Speed ← Max(Speed * 0.96, 0.1)
    else
        Speed ← Min(Speed * 1.03, init_speed)
    end if
end for
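Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' Unity implementation: the `Character` class, the 2D ground-plane positions, and the heading convention (0 degrees facing +z) are assumptions made for the example, and the angle threshold is taken to be in degrees.

```python
import math
from dataclasses import dataclass


@dataclass
class Character:
    """Hypothetical stand-in for an animated character (not the authors' Unity class)."""
    x: float
    z: float
    heading_deg: float   # facing direction on the ground plane, 0 = +z axis (assumption)
    init_speed: float    # initial animation speed
    speed: float         # current animation speed


def is_blocked(me, others, max_dist=0.8, max_angle_deg=60.0):
    """Return True if another character is within max_dist and inside the forward cone."""
    fx = math.sin(math.radians(me.heading_deg))
    fz = math.cos(math.radians(me.heading_deg))
    for other in others:
        if other is me:
            continue
        ox, oz = other.x - me.x, other.z - me.z
        dist = math.hypot(ox, oz)
        if dist == 0.0:
            return True  # coincident positions: treated as blocked (assumption)
        cos_angle = max(-1.0, min(1.0, (fx * ox + fz * oz) / dist))
        angle = math.degrees(math.acos(cos_angle))
        if dist <= max_dist and angle <= max_angle_deg:
            return True
    return False


def update_speeds(characters):
    """One per-frame update: slow down when blocked, otherwise recover toward init_speed."""
    for c in characters:
        if is_blocked(c, characters):
            c.speed = max(c.speed * 0.96, 0.1)
        else:
            c.speed = min(c.speed * 1.03, c.init_speed)
```

Because the blocking test depends only on positions and headings, the order in which characters are visited within a frame does not change the result.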
  • Scene Selection Randomizer randomizes the selection of the 3D scene.

  • HDRI Randomizer randomizes the selection of panorama HDRIs.

  • Camera Position Randomizer randomizes the camera height, distance, and angle in cylindrical coordinates.

  • Light Type Randomizer randomizes the light type.

  • Light Position Randomizer randomizes the positions of all lights.

  • Light Intensity Randomizer randomizes the intensities of all lights.

  • Light Rotation Randomizer randomizes the rotations of all lights.

  • Group Number Randomizer randomizes the number of groups being instantiated during the simulation.

  • Group Selection Randomizer randomizes the selection of the activity for each group in the scene.

  • Group Placement Randomizer randomizes the center position for each group.

  • Character Number Randomizer randomizes the number of characters being instantiated in a group.

  • Multi-person Subgroup Randomizer randomizes the number of subgroups in an activity, such as two persons talking to each other in a queueing group. This randomizer applies to queueing and waiting groups.

  • Character Selection Randomizer randomizes the selection of characters.

  • Character Texture Randomizer randomizes the clothes and body colors of all characters.

  • Character Alignment Randomizer randomizes the method used to align characters in a group.

  • Character Interval Randomizer randomizes the interval between characters.

  • Character Perturbation Randomizer adds small perturbations to the characters’ positions and rotations.

  • Atomic Action Randomizer randomizes the selection of permitted atomic actions.

  • Animation Randomizer includes the randomization of animation clips, blended style parameters, animation speeds, and playback offsets.
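To illustrate how one of these randomizers could combine its variables, the sketch below samples a camera position from the Camera Position ranges in Tab. 10. It is a hypothetical Python illustration rather than the generator's Unity code: the function name, the y-up axis convention, and the choice to add the perturbation to the final position are assumptions.

```python
import math
import random


def sample_camera_position(rng):
    """Sample a camera position using the Camera Position ranges from Tab. 10."""
    radius = rng.uniform(6.0, 10.0)        # horizontal distance from the origin
    angle_deg = rng.uniform(0.0, 360.0)    # rotation around the vertical axis
    height = rng.uniform(1.0, 5.0)         # camera height
    # Ground-plane perturbation: Cartesian[Uniform(-1, 1), 0, Uniform(-1, 1)].
    dx, dz = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    theta = math.radians(angle_deg)
    # Cylindrical to Cartesian conversion, assuming a y-up frame.
    x = radius * math.cos(theta) + dx
    z = radius * math.sin(theta) + dz
    return (x, height, z)


rng = random.Random(0)
print(sample_camera_position(rng))
```

The other uniform randomizers in Tab. 10 (for example group placement or light position) could be sampled in the same way from their listed ranges.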

B. Datasets

M3ActRGB contains 9K videos of multi-group and 6K videos of single-group activities, with a total of 6M RGB images and 48M bounding boxes. We show a collage of images from M3ActRGB in Fig. 7, which contains diverse and realistic multi-group and single-group activities. (See our supplementary video for animated data samples.) The distribution of M3ActRGB is also shown in Fig. 4. M3ActRGB contains as many as 19 persons per frame and an average of 8.1 persons per frame. Additionally, we show the comparison of several tracking datasets in Tab. 11. Even though we only selected the “WalkRun” and “Dance” data from M3ActRGB for the tracking experiments, the dataset size is much larger compared to MOT17 [40] and DanceTrack [48]. In terms of the average number of tracks per video, our dataset is closer to DanceTrack. MOT17 mostly contains crowded scenes, while DanceTrack has only one dancing group per video.

| Dataset | MOT17 [40] | DanceTrack [48] | M3ActRGB |
|---|---|---|---|
| #Videos | 14 | 100 | 2500 |
| Avg. #Tracks | 56 | 9 | 8 |
| Avg. Track Len. (s) | 35.4 | 52.9 | 4.7 |
| FPS | 30 | 20 | 20 |
| Total Frames | 11,235 | 105,855 | 250,000 |
Table 11: Comparison of multi-object tracking datasets. M3ActRGB consists of “WalkRun” and “Dance” data used in our tracking experiments.

M3Act3D has 65K simulations of 3D single-group motions with a total duration of 87.6 hours, captured at 30 FPS. Unlike M3ActRGB, which simulates every group activity in equal amounts, M3Act3D allocates a different amount of data to each semantic group based on its complexity. The complexity of a group covers its alignment methods, permitted atomic actions, animation clips, and styles. Fig. 5 shows the distribution of M3Act3D. For example, M3Act3D has more queueing groups than talking groups because the persons in a queueing group can form various shapes and perform more atomic actions. We also slightly increased the range of the number of persons per group for M3Act3D: a group has 6.7 persons on average and at most 27.

Figure 4: Distributions of M3ActRGB.
Figure 5: Distributions of M3Act3D.

C. More Details of Experiments

C.1 Multi-Person Tracking

The goal of multi-person tracking (MPT) is to predict the trajectories (bounding boxes + identification) of all persons across an image sequence from a dynamic video stream. Traditionally, multi-object tracking adds a re-identification layer, using either a trainable architecture [56, 55] or heuristics-based algorithms [3], on top of the object detection results to associate the bounding boxes across frames. Recently, end-to-end methods [63, 65] have been shown to be more effective on several challenging datasets, such as DanceTrack [48] and MOT17 [40]. To demonstrate the effectiveness of our synthetic data in enhancing real-world performance in multi-person tracking, we assess the impact primarily on MOTRv2 [65], an extension of MOTR [63] that incorporates YOLO-X [26] for bootstrapping detections. Using an end-to-end benchmark allows us to evaluate improvements with the synthetic dataset in both detection and identification.

Implementation Details. We follow the same hyperparameters and data preprocessing procedure from the author-provided MOTRv2 repository (https://github.com/megvii-research/MOTRv2) for all training jobs. For mixed training with our synthetic data, we simply combined the data from M3ActRGB and DanceTrack into one large dataset, without any additional probability sampling from the real and synthetic data. All models were trained with 16 NVIDIA A4000 GPUs, using a batch size of 1. It took roughly 7 days of training for all synthetic and real data combined.
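The mixed-training setup described above amounts to concatenating the two datasets and shuffling them uniformly. The following Python sketch shows that idea under stated assumptions: the function name and the clip identifiers are hypothetical, and the actual MOTRv2 data loader is not reproduced here.

```python
import random


def build_mixed_training_list(real_clips, synthetic_clips, seed=0):
    """Concatenate real and synthetic clips and shuffle them as one dataset."""
    mixed = list(real_clips) + list(synthetic_clips)
    # A single uniform shuffle: no per-source sampling probability is applied.
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical clip identifiers, for illustration only.
real = ["dancetrack_%04d" % i for i in range(3)]
synthetic = ["m3act_%04d" % i for i in range(5)]
print(build_mixed_training_list(real, synthetic))
```

With this scheme, the share of synthetic samples seen during training simply equals the synthetic share of the combined dataset.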

C.2 Group Activity Recognition

Understanding collective human behaviors and social groups is important to various domains, including humanoid robots, autonomous vehicles, and human-computer interaction [8, 9, 11, 12, 10, 21, 39, 57]. State-of-the-art methods for group activity recognition (GAR) leverage 2D skeletons as input due to the effectiveness and robustness gained from the less biased and more action-focused representations [51, 66, 44, 62, 64, 25]. We describe the details of the 2D skeleton-based GAR experiments that were primarily studied in our main paper. Let $[s_1, \cdots, s_t]$ denote a video with $t$ frames and $n$ persons, each frame $s_i := \{p_1, \cdots, p_n\}$ where $p_i := [(x_1, y_1, c_1), \cdots, (x_j, y_j, c_j)]$. Here, $j$ represents the number of joints in a person's skeleton, and each three-tuple $(x_j, y_j, c_j)$ denotes the $x$ and $y$ coordinates in the pixel space and the class $c$ of the joint, respectively. For the given input $[s_1, \cdots, s_t]$, the objective of GAR is to output the class of the group activity performed by the dominant group among these $n$ persons and identify the action class of each individual in the video. Usually, the task assumes each video has one dominant group, as any outlier person does not contribute to the group.
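The input and output of the task can be made concrete with a small data-structure sketch. The type aliases and the `GarSample` container below are hypothetical names chosen for illustration; they mirror the notation above (a video of t frames, n persons, and j joints, each joint a three-tuple of x, y, and class) and are not taken from the Composer code base.

```python
from dataclasses import dataclass
from typing import List, Tuple

Joint = Tuple[float, float, int]   # (x, y, c): pixel coordinates and joint class
Person = List[Joint]               # j joints
Frame = List[Person]               # n persons
Video = List[Frame]                # t frames


@dataclass
class GarSample:
    """Hypothetical container for one GAR example."""
    video: Video
    group_activity: int            # class of the dominant group's activity
    person_actions: List[int]      # one action class per person


def shape_of(video):
    """Return (t, n, j) and check that every frame and person has a consistent size."""
    t = len(video)
    n = len(video[0])
    j = len(video[0][0])
    for frame in video:
        if len(frame) != n or any(len(person) != j for person in frame):
            raise ValueError("ragged skeleton sequence")
    return (t, n, j)
```

A model for this task maps `video` to one group activity label and n per-person action labels.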

Implementation Details. Our implementations of Composer [66] and Actor Transformer [25] for the experiments are based on the open-source implementation and hyperparameter settings (https://github.com/hongluzhou/composer). The only modified hyperparameter is the batch size, reduced from 384 to 256 due to our computation constraints with an NVIDIA RTX 3090 graphics card. Both Composer and Actor Transformer are transformer-based [53] architectures. Note that both architectures are slightly modified after synthetic pre-training, due to the differences between the synthetic and real datasets in the maximum number of persons in a clip and the number of atomic action classes. Specifically, we set a different maximum sequence length for the transformer encoders of Composer and Actor Transformer and replaced the last layers (i.e., the group activity classifier and the person action classifier) with new classifiers to output the correct data shape for the target real-world dataset.

C.3 Controllable 3D Group Activity Generation

Let $G^{TP} = \{m^{in}\}_{i=1\sim T,\, n=1\sim P}$ be a group of human motions with a total of $T$ frames and $P$ persons. The 3D pose of each person is denoted as $m^{in} \in \mathbb{R}^{j \times d}$, where $j$ is the number of joints of a person and $d$ is the joint's feature dimension. We concatenate the global root position and all joints' 6D rotations [67, 10] as the pose representation. Therefore, $j = 26$ (with $25$ actual joints) and $d = 6$. The same representation is used as the input for 3D group activity recognition with Composer [66] and used as the ground truth during training for both MDM [50] and MDM+IFormer baselines for 3D group activity generation.
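A minimal sketch of how such a 26 x 6 pose could be assembled is given below. The 6D rotation form keeps the first two columns of the rotation matrix, as in the continuous representation of [67]. Two details are assumptions made for the example and are not stated in the text: the 3D root position is zero-padded to six values so that it fits d = 6, and the two columns are flattened one after the other. The function names are hypothetical.

```python
def rotation_to_6d(rot):
    """Map a 3x3 rotation matrix (nested lists, rot[row][col]) to a 6D vector:
    the first column followed by the second column."""
    col0 = [rot[r][0] for r in range(3)]
    col1 = [rot[r][1] for r in range(3)]
    return col0 + col1


def build_pose(root_position, joint_rotations):
    """Stack the root position and 25 joint rotations into a 26 x 6 pose.

    Assumption: the root position is zero-padded from 3 to 6 values.
    """
    if len(joint_rotations) != 25:
        raise ValueError("expected 25 joint rotations")
    root_row = list(root_position) + [0.0, 0.0, 0.0]
    return [root_row] + [rotation_to_6d(r) for r in joint_rotations]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose = build_pose((0.0, 0.9, 0.0), [identity] * 25)
print(len(pose), len(pose[0]))  # 26 6
```

A group motion then stacks one such pose per person and per frame, giving an array of shape T x P x 26 x 6.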

Figure 6: The architecture of MDM+IFormer. The model takes the noised group motion $G^{TP}_{k}$, an activity class, and a time step $k$ as inputs and outputs $\tilde{G}^{TP}_{0}$, an estimation of the clean group motion. The interaction transformer encoder is “added” after the transformer encoder layer of MDM [50] for modeling the person interactions.

Composer for 3D Group Activity Recognition. The learning-based metrics (recognition accuracy, FID, diversity, and multimodality) [27] that are used to evaluate the generated 3D group activities require a well-trained 3D group activity recognition model. Composer [66] is a 2D skeleton-based group activity recognition model with a multi-scale transformer-based architecture. We chose Composer because it is a hierarchical architecture with fine-grained latent group-level and person-level features. We modified the first layer of Composer so that it accepts the aforementioned 3D group representation as input. We used the same set of hyperparameters as in the GAR experiments, including a learning rate of 0.0005, a weight decay of 0.001, and a hidden dimension of 256 for the transformer encoders, except for a batch size of 64, a maximum of 27 persons, and a total of 26 joints.

After the 3D group activity recognition model is well-trained, we obtain the 3D group activity recognition accuracy, and extract latent group and person features to calculate the FID, diversity, and multi-modality metrics. The latent group representation is the learned CLS token from the last block and last scale of the Multi-Scale Transformer module of Composer, whereas the latent person representations are the learned person tokens from the last block and the second scale (i.e., the transformer encoder of the person scale) of the Multi-Scale Transformer module.
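As an illustration of how a feature-based metric can be computed from these latent representations, the sketch below estimates diversity as the mean Euclidean distance between randomly paired feature vectors. This follows a commonly used formulation and is an assumption here; the exact sampling protocol of [27] and the feature extraction from Composer are not reproduced, and the function name is hypothetical.

```python
import math
import random


def diversity(features, num_pairs, rng):
    """Mean Euclidean distance between randomly drawn pairs of feature vectors."""
    total = 0.0
    for _ in range(num_pairs):
        a, b = rng.choice(features), rng.choice(features)
        total += math.dist(a, b)
    return total / num_pairs


# Toy 2D features standing in for latent group features.
rng = random.Random(0)
toy_features = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
print(diversity(toy_features, 100, rng))
```

Multimodality can be estimated in the same way by restricting the pairs to samples generated for the same activity class.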

MDM & MDM+IFormer. We follow the same diffusion scheme as MDM [50] to obtain the noised group activity at every $k$-th diffusion time step, $G^{TP}_{k}$. Specifically, $G^{TP}_{0} := G^{TP}$, meaning no noise is added at the $0$-th diffusion time step. The reversed diffusion process is then formulated as:

$$\tilde{G}^{TP}_{0} = D(G^{TP}_{k}, k, c), \tag{1}$$

where $D$ is the MDM+IFormer network, illustrated in Fig. 6. $\tilde{G}^{TP}_{0}$ is the estimated clean group activity and $c$ is the one-hot activity label. The loss function follows the objective in [30] and is defined as:

$$L = \left\| G^{TP}_{0} - \tilde{G}^{TP}_{0} \right\|^{2}. \tag{2}$$
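The noising step and the loss in Eq. (2) can be sketched as follows. The sketch assumes the standard DDPM forward process, in which a clean sample is scaled by the square root of the cumulative noise coefficient and mixed with Gaussian noise; the coefficient `alpha_bar_k` is passed in directly because the noise schedule is not specified in this section. Group motions are treated as flat lists of floats, and the function names are hypothetical.

```python
import math
import random


def q_sample(g0, alpha_bar_k, rng):
    """Noise a flattened clean group motion g0 to diffusion step k (DDPM forward process)."""
    a = math.sqrt(alpha_bar_k)
    b = math.sqrt(1.0 - alpha_bar_k)
    return [a * v + b * rng.gauss(0.0, 1.0) for v in g0]


def simple_loss(g0, g0_pred):
    """Squared L2 distance between the clean motion and its estimate, as in Eq. (2)."""
    return sum((a - b) ** 2 for a, b in zip(g0, g0_pred))


rng = random.Random(0)
g0 = [0.5, -1.0, 2.0]
g_k = q_sample(g0, 0.5, rng)   # noised motion that would be fed to the denoiser D
print(simple_loss(g0, g_k))    # loss if D returned its input unchanged
```

During training, the denoiser receives the noised motion, the step k, and the activity label c, and its output replaces `g_k` in the loss.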

Our implementations of the MDM and MDM+IFormer baselines for 3D group activity generation are based on the author-released implementation of MDM (https://github.com/GuyTevet/motion-diffusion-model) without any hyperparameter tuning. Both models were optimized using the same loss function described above and trained on an NVIDIA RTX 3090 graphics card for 320K iterations.

Formulas of Social Repulsive Forces [29]. The three position-based metrics (in Sec. 4.3.1 of the main paper) are formulated as follows.

– Repulsive interaction force:

$$\vec{f}_{ij}^{int} = A \cdot \exp[(r_i + r_j - d_{ij}) / B] \cdot \vec{n}_{ij}. \tag{3}$$

$\vec{f}_{ij}^{int}$ is the interaction force of character $j$ applied to character $i$. $A$ and $B$ are constants ($A := 2{,}000$ and $B := 0.08$). $r_i$ and $r_j$ are the radii of the characters. $d_{ij}$ is the distance between the characters. $\vec{n}_{ij}$ is the unit vector pointing from character $j$ to character $i$.

– Contact repulsive force:

$$\vec{f}_{ij}^{cont} = k \cdot \max(0, r_i + r_j - d_{ij}) \cdot \vec{n}_{ij}, \tag{4}$$

where $k$ is a constant ($k := 120{,}000$).

– Total repulsive force:

$$\vec{f}_{ij}^{total} = \vec{f}_{ij}^{int} + \vec{f}_{ij}^{cont}. \tag{5}$$

All constants follow the social force model proposed in [29].
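Eqs. (3) to (5) can be evaluated for a pair of characters as in the sketch below. The constants come from the text; the ground-plane (x, z) positions, the default character radius of 0.3, and the function name are assumptions made for the example, and the aggregation of pairwise forces into the final metrics is not shown.

```python
import math

A = 2000.0      # interaction strength, from the text
B = 0.08        # interaction range, from the text
K = 120000.0    # contact stiffness, from the text


def repulsive_forces(pos_i, pos_j, r_i=0.3, r_j=0.3):
    """Return (interaction, contact, total) force vectors on character i due to j.

    Positions are (x, z) ground-plane coordinates. The default radius of 0.3
    is a placeholder assumption, not a value from the paper.
    """
    dx, dz = pos_i[0] - pos_j[0], pos_i[1] - pos_j[1]
    d = math.hypot(dx, dz)
    if d == 0.0:
        raise ValueError("coincident characters: direction n_ij is undefined")
    n = (dx / d, dz / d)                    # unit vector from j to i
    overlap = r_i + r_j - d
    f_int = A * math.exp(overlap / B)       # Eq. (3) magnitude
    f_cont = K * max(0.0, overlap)          # Eq. (4) magnitude
    interaction = (f_int * n[0], f_int * n[1])
    contact = (f_cont * n[0], f_cont * n[1])
    total = (interaction[0] + contact[0], interaction[1] + contact[1])  # Eq. (5)
    return interaction, contact, total


print(repulsive_forces((1.0, 0.0), (0.0, 0.0)))
```

The contact term vanishes whenever the two characters do not overlap, so only the exponentially decaying interaction term remains at larger distances.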

Generated 3D Group Activities. Please refer to our supplementary video for the rendered 3D group activities generated by MDM and MDM+IFormer.

D. Additional Experiments

| Source | Target | #Epochs required for model convergence (Composer [66] / Actor Transformer [25]) |
|---|---|---|
| CAD2 [15] | CAD2 [15] | 88 / 233 |
| M3Act | CAD2 [15] | 13 / 92 |

Table 12: Pre-training with our synthetic data leads to faster model convergence on the target domain for GAR. (Composer: 6.8× faster; Actor Transformer: 2.5× faster)

Pre-training with data from M3Act can improve convergence speed on the target dataset. We conduct the GAR experiment using 2D skeletons as the only input modality for both models and compare the number of epochs required for model convergence in Tab. 12. To automatically determine whether model training has saturated, we adopted early stopping with a maximum of 500 epochs and a stopping patience of 50 epochs. The results suggest that training Composer from scratch on CAD2 requires 88 epochs on average; with M3Act pre-training, Composer requires only 13 epochs for convergence.
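The early-stopping rule can be sketched as follows. This is an illustrative Python version under stated assumptions: the monitored quantity is a higher-is-better validation score, and the function returns both the epoch at which training stops and the epoch of the best score, since the text does not say which of the two is reported as the number of epochs required for convergence.

```python
def early_stopping_epoch(val_scores, patience=50, max_epochs=500):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a higher-is-better score."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            return epoch, best_epoch
    return min(len(val_scores), max_epochs), best_epoch


# Speed-ups implied by the epoch counts in Tab. 12.
print(round(88 / 13, 1))    # Composer
print(round(233 / 92, 1))   # Actor Transformer
```

The two printed ratios correspond to the 6.8× and 2.5× speed-ups quoted in the caption of Tab. 12.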

E. Limitations & Future Work

We demonstrate that synthetic data can replace a large amount of real data [13] and successfully mitigate the scarcity of real data for multi-person and multi-group tasks, despite the domain gap that restricts the generalizability of models trained with synthetic data. With the release of our data generator, M3Act, we encourage the community to create their own data or enhance the synthetic data generator. While collecting more assets and generating more data with adjusted camera views can increase data diversity and narrow the gap, we would also like to point out some aspects of the generator that should be addressed in the future to create more realistic data.

Publicly-available assets. Most assets we use in M3Act are publicly available, including HDRIs, human characters, and animations. However, most existing assets such as photometric 3D scenes and high-quality avatars may require additional licensing for model training. Some assets could be restricted to specific game engines, which hinders the development of synthetic data.

Simulated hair and clothes. The avatars used in M3Act do not contain hair and clothes physics that respond to their body motions. Adding cloth and strand-based hair simulation would enable realistic interactions between hair/cloth and body, and thus improve the data quality.

Finger and face Movements. Our animations do not contain finger and face movements. While it might be reasonable as human groups are generally captured from a distance, adding finger and face movements can still improve the fidelity of the human motions.

Human-environment interactions. Like most synthetic datasets [4, 59, 52, 2], M3Act lacks meaningful interactions between humans and the environment. Such interactions might include groups of humans navigating in a complex environment, a person picking up a phone while texting, or holding a suitcase. Animating human motions and activities with scene awareness is incredibly challenging. A simple solution is to polish each scene by carefully placing the avatars and staging the human behaviors. However, this would require significant manual effort and limit the scale and diversity of the synthetic data.

Complexity for human groups. Animating human groups is significantly more challenging than animating the motions of a single person because the complexity (the number of interactions) increases quadratically as the number of persons increases. We apply relatively simple heuristics when designing rules for authoring human groups, which can only reflect a certain portion of real-world activities. The underlying rules that drive the group motions could also lead to datasets that are less complex than real-world ones, limiting model generalization on downstream tasks. Creating new groups with expert-guided heuristics, LLM-generated rules, or directly from the 3D GAG method should be considered. An alternative would be using existing motion capture data in place of the procedural generation method. However, the lack of fine-grained motion capture data for large-scale collective 3D group motions is an obstacle to this development.

Societal Impacts. While we demonstrate the effectiveness of our synthetic data on several tasks, it is important to note that the use of synthetic data, in all manners, may still result in unbalanced and biased results. We strive to ensure the inclusiveness and fairness of our datasets by incorporating human avatars of all ages, genders, and ethnicities, providing a representative and equitable approach to generating data for responsible advancement in related fields.

Figure 7: Collage of images from M3ActRGB, including multi-group activities (first 3 rows) and single-group activities (last 7 rows).