Search | arXiv e-print repository

Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

Authors: Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, Richard Newcombe

Abstract: We introduce Nymeria, a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body 3D motion ground truth; b) egocentric multimodal recordings from Project Aria devices with RGB, grayscale, eye-tracking cameras, IMUs, magnetometer, barometer, and microphones; and c) an additional "observer" device providing a third-person viewpoint. We compute world-aligned 6DoF transformations for all sensors, across devices and capture sessions. The dataset also provides 3D scene point clouds and calibrated gaze estimation. We derive a protocol to annotate hierarchical language descriptions of in-context human motion, from fine-grained pose narrations, to atomic actions and activity summarization. To the best of our knowledge, the Nymeria dataset is the world's largest in-the-wild collection of human motion with natural and diverse activities; the first of its kind to provide synchronized and localized multi-device multimodal egocentric data; and the world's largest dataset with motion-language descriptions. It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km. The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545. To demonstrate the potential of the dataset we define key research tasks for egocentric body tracking, motion synthesis, and action recognition and evaluate several state-of-the-art baseline algorithms. Data and code will be open-sourced.

Submitted 14 June, 2024; originally announced June 2024.
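
As a quick sanity check on the scale figures quoted in the Nymeria abstract above, the following sketch derives two averages from them. The variable names are illustrative only, and the derived values are simple ratios of the abstract's headline numbers, not statistics reported by the authors.

```python
# Headline figures copied from the Nymeria abstract.
recordings = 1200
total_hours = 300
sentences = 310_500      # 310.5K sentences
words = 8_640_000        # 8.64M words

# Derived averages (assumption: plain ratios, not reported by the authors).
minutes_per_recording = total_hours * 60 / recordings
words_per_sentence = words / sentences

print(f"average recording length: {minutes_per_recording:.0f} min")
print(f"average sentence length: {words_per_sentence:.1f} words")
```

Under these assumptions a recording averages about a quarter of an hour, and a motion-language sentence runs to just under 28 words.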

arXiv:2406.06526 [pdf, other]

GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation

Authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

Abstract: 3D city generation with NeRF-based methods shows promising generation results but is computationally inefficient. Recently 3D Gaussian Splatting (3D-GS) has emerged as a highly efficient alternative for object-level 3D generation. However, adapting 3D-GS from finite-scale 3D objects and humans to infinite-scale 3D cities is non-trivial. Unbounded 3D city generation entails significant storage overhead (out-of-memory issues), arising from the need to expand points to billions, often demanding hundreds of gigabytes of VRAM for a city scene spanning 10 km². In this paper, we propose GaussianCity, a generative Gaussian Splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. Our key insights are two-fold: 1) Compact 3D Scene Representation: We introduce BEV-Point as a highly compact intermediate representation, ensuring that the growth in VRAM usage for unbounded scenes remains constant, thus enabling unbounded city generation. 2) Spatial-aware Gaussian Attribute Decoder: We present a spatial-aware BEV-Point decoder to produce 3D Gaussian attributes, which leverages a Point Serializer to integrate the structural and contextual characteristics of BEV points. Extensive experiments demonstrate that GaussianCity achieves state-of-the-art results in both drone-view and street-view 3D city generation. Notably, compared to CityDreamer, GaussianCity exhibits superior performance with a speedup of 60 times (10.72 FPS vs. 0.18 FPS).

Submitted 10 June, 2024; originally announced June 2024.
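
The "60 times" speedup quoted in the GaussianCity abstract above can be checked directly from the two frame rates it gives. This is only an arithmetic sketch; the variable names are illustrative and not taken from the paper or its code.

```python
# Frame rates as quoted in the GaussianCity abstract.
gaussiancity_fps = 10.72
citydreamer_fps = 0.18

# Speedup is the ratio of the two throughputs.
speedup = gaussiancity_fps / citydreamer_fps
print(f"speedup: {speedup:.1f}x")
```

The ratio comes out slightly under 60, so the abstract's figure is a rounded value.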

arXiv:2406.04872 [pdf, other]

Diversified Batch Selection for Training Acceleration

Authors: Feng Hong, Yueming Lyu, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, Yanfeng Wang

Abstract: The remarkable success of modern machine learning models on large datasets often demands extensive training time and resource consumption. To save cost, a prevalent research line, known as online batch selection, explores selecting informative subsets during the training process. Although recent efforts achieve advancements by measuring the impact of each sample on generalization, their reliance on additional reference models inherently limits their practical applications when no such ideal models are available. On the other hand, the vanilla reference-model-free methods score and select data independently in a sample-wise manner, which sacrifices diversity and induces redundancy. To tackle this dilemma, we propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples. Specifically, we define a novel selection objective that measures the group-wise orthogonalized representativeness to combat the redundancy issue of previous sample-wise criteria, and provide a principled selection-efficient realization. Extensive experiments across various tasks demonstrate the significant superiority of DivBS in the performance-speedup trade-off. The code is publicly available.

Submitted 7 June, 2024; originally announced June 2024.

Comments: ICML 2024

arXiv:2406.04354 [pdf, other]

QiandaoEar22: A high quality noise dataset for identifying specific ship from multiple underwater acoustic targets using ship-radiated noise

Authors: Xiaoyang Du, Feng Hong

Abstract: Target identification of ship-radiated noise is a crucial area in underwater target recognition. However, there is currently a lack of multi-target ship datasets that accurately represent real-world underwater acoustic conditions. To tackle this issue, we conducted experimental data acquisition, resulting in the release of QiandaoEar22, a comprehensive underwater acoustic multi-target dataset. This dataset encompasses 9 hours and 28 minutes of real-world ship-radiated noise data and 21 hours and 58 minutes of background noise data. To demonstrate the availability of QiandaoEar22, we executed two experimental tasks. The first task focuses on assessing the presence of ship-radiated noise, while the second task involves identifying specific ships within the recognized targets in the multi-ship mixed data. In the latter task, we extracted eight features from the data and employed six deep learning networks for classification, aiming to evaluate and compare the performance of various features and networks. The experimental results reveal that ship-radiated noise can be successfully identified from background noise in over 99% of cases. Additionally, for the specific identification of individual ships, the optimal recognition accuracy reaches 99.56%. Finally, based on our findings, we provide advice on selecting appropriate features and deep learning networks, which may offer valuable insights for related research. Our work not only establishes a benchmark for algorithm evaluation but also inspires the development of innovative methods to enhance UATD and UATR systems.

Submitted 15 May, 2024; originally announced June 2024.

arXiv:2406.04353 [pdf, other]

Introducing the Brand New QiandaoEar22 Dataset for Specific Ship Identification Using Ship-Radiated Noise

Authors: Xiaoyang Du, Feng Hong

Abstract: Target identification of ship-radiated noise is a crucial area in underwater target recognition. However, there is currently a lack of multi-target ship datasets that accurately represent real-world underwater acoustic conditions. To tackle this issue, we release QiandaoEar22, an underwater acoustic multi-target dataset, which can be downloaded from https://ieee-dataport.org/documents/qiandaoear22. This dataset encompasses 9 hours and 28 minutes of real-world ship-radiated noise data and 21 hours and 58 minutes of background noise data. We demonstrate the availability of QiandaoEar22 by conducting an experiment on identifying a specific ship from the multiple targets. Taking different features as the input and six deep learning networks as classifiers, we evaluate the baseline performance of different methods. The experimental results reveal that identifying the specific target of UUV from others can achieve an optimal recognition accuracy of 97.78%, and we find that using spectrum and MFCC as feature inputs and DenseNet as the classifier achieves better recognition performance. Our work not only establishes a benchmark for the dataset but also helps the further development of innovative methods for the tasks of underwater acoustic target detection (UATD) and underwater acoustic target recognition (UATR).

Submitted 15 May, 2024; originally announced June 2024.

arXiv:2405.10305 [pdf, other]

4D Panoptic Scene Graph Generation

Authors: Jingkang Yang, Jun Cen, Wenxuan Peng, Shuai Liu, Fangzhou Hong, Xiangtai Li, Kaiyang Zhou, Qifeng Chen, Ziwei Liu

Abstract: We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.

Submitted 16 May, 2024; originally announced May 2024.

Comments: Accepted at NeurIPS 2023. Code: https://github.com/Jingkang50/PSG4D Previous Series: PSG https://github.com/Jingkang50/OpenPSG and PVSG https://github.com/Jingkang50/OpenPVSG

arXiv:2405.08055 [pdf, other]

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

Authors: Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu

Abstract: Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt improved triplane to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.

Submitted 13 May, 2024; originally announced May 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2309.07920

arXiv:2405.07029 [pdf]

A framework of text-dependent speaker verification for chinese numerical string corpus

Authors: Litong Zheng, Feng Hong, Weijie Xu, Wan Zheng

Abstract: The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss including text classification loss, connectionist temporal classification (CTC) loss and decoder loss; while in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.

Submitted 21 May, 2024; v1 submitted 11 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2312.01645

arXiv:2404.19678 [pdf, ps, other]

Density-wave-like gap evolution in La$_3$Ni$_2$O$_7$ under high pressure revealed by ultrafast optical spectroscopy

Authors: Yanghao Meng, Yi Yang, Hualei Sun, Sasa Zhang, Jianlin Luo, Meng Wang, Fang Hong, Xinbo Wang, Xiaohui Yu

Abstract: We explore the quasiparticle dynamics in bilayer nickelate La$_3$Ni$_2$O$_7$ crystal using ultrafast optical pump-probe spectroscopy at high pressure up to 34.2 GPa. At ambient pressure, the temperature dependence of relaxation indicates the appearance of a phonon bottleneck effect due to the opening of a density-wave-like gap at 151 K. By analyzing the data with the RT model, we identified the energy scale of the gap to be 70 meV at ambient pressure. The relaxation bottleneck effect is suppressed gradually by the pressure and disappears around 26 GPa. At high pressure above 29.4 GPa, we discover a new density-wave-like order with a transition temperature of $\sim$130 K. Our results not only provide the first experimental evidence of the density-wave-like gap evolution under high pressure, but also offer insight into the underlying interplay between the density wave order and superconductivity in pressurized La$_3$Ni$_2$O$_7$.

Submitted 30 April, 2024; originally announced April 2024.

Comments: 6 pages, 4 figures

arXiv:2404.11062 [pdf, other]

doi 10.1103/PhysRevApplied.21.064015

Generation of a precise time scale assisted by a near-continuously operating optical lattice clock

Authors: Takumi Kobayashi, Daisuke Akamatsu, Kazumoto Hosaka, Yusuke Hisai, Akiko Nishiyama, Akio Kawasaki, Masato Wada, Hajime Inaba, Takehiko Tanabe, Feng-Lei Hong, Masami Yasuda

Abstract: We report on a reduced time variation of a time scale with respect to Coordinated Universal Time (UTC) by steering a hydrogen-maser-based time scale with a near-continuously operating optical lattice clock. The time scale is generated in a post-processing analysis for 230 days with a hydrogen maser with its fractional frequency stability limited by a flicker floor of $2\times10^{-15}$ and an Yb optical lattice clock operated with an uptime of 81.6%. During the 230-day period, the root mean square time variation of our time scale with respect to UTC is 0.52 ns, which is a better performance compared with those of time scales steered by microwave fountain clocks that exhibit root mean square variations from 0.99 ns to 1.6 ns. With the high uptime achieved by the Yb optical lattice clock, our simulation implies the potential of generating a state-of-the-art time scale with a time variation of $<0.1$ ns over a month using a better hydrogen maser reaching the mid $10^{-16}$ level. This work demonstrates that the use of an optical clock with a high uptime enhances the stability of a time scale.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.01655 [pdf, other]

FashionEngine: Interactive 3D Human Generation and Editing via Multimodal Controls

Authors: Tao Hu, Fangzhou Hong, Zhaoxi Chen, Ziwei Liu

Abstract: We present FashionEngine, an interactive 3D human generation and editing system that creates 3D digital humans via user-friendly multimodal controls such as natural languages, visual perceptions, and hand-drawing sketches. FashionEngine automates the 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks. 2) Multimodality-UV Space encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user multimodal inputs with the implicit UV latent space for controllable 3D human editing. The multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) Multimodality-UV Aligned Sampler learns to sample high-quality and diverse 3D humans from the diffusion prior. Extensive experiments validate FashionEngine's state-of-the-art performance for conditional generation/editing tasks. In addition, we present an interactive user interface for our FashionEngine that enables both conditional and unconditional generation tasks, and editing tasks including pose/view/shape control, text-, image-, and sketch-driven 3D human editing and 3D virtual try-on, in a unified framework. Our project page is at: https://taohuumd.github.io/projects/FashionEngine.

Submitted 20 May, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: Project Page: https://taohuumd.github.io/projects/FashionEngine

arXiv:2404.01284 [pdf, other]

Large Motion Model for Unified Multi-Modal Motion Generation

Authors: Mingyuan Zhang, Daisheng Jin, Chenyang Gu, Fangzhou Hong, Zhongang Cai, Jingfang Huang, Chongzhi Zhang, Xinying Guo, Lei Yang, Ying He, Ziwei Liu

Abstract: Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism ArtAttention that incorporates body part-aware modeling into Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves competitive performance across various standard motion generation tasks over state-of-the-art specialist models. Notably, LMM exhibits strong generalization capabilities and emerging properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research.

Submitted 1 April, 2024; originally announced April 2024.

Comments: Homepage: https://mingyuan-zhang.github.io/projects/LMM.html

arXiv:2404.01241 [pdf, other]

StructLDM: Structured Latent Diffusion for 3D Human Generation

Authors: Tao Hu, Fangzhou Hong, Ziwei Liu

Abstract: Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model, which is learned from 2D images. StructLDM solves the challenges imposed due to the high-dimensional growth of latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts parameterized by a set of conditional structured local NeRFs anchored to the body template, which embeds the properties learned from the 2D training data and can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space over the well-adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks including compositional generations, part-aware clothing editing, 3D virtual try-on, etc. Our project page is at: https://taohuumd.github.io/projects/StructLDM/.

Submitted 2 July, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: Accepted to ECCV 2024. Project page: https://taohuumd.github.io/projects/StructLDM/

arXiv:2404.01225 [pdf, other]

SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering

Authors: Tao Hu, Fangzhou Hong, Ziwei Liu

Abstract: Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However, existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template, which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo/

Submitted 2 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: Accepted to CVPR 2024. Project Page: https://taohuumd.github.io/projects/SurMo/

arXiv:2403.12019 [pdf, other]

LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Authors: Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, Chen Change Loy

Abstract: The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.

Submitted 18 March, 2024; originally announced March 2024.

Comments: project webpage: https://nirvanalan.github.io/projects/ln3diff/

arXiv:2403.02234 [pdf, other]

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Authors: Fangzhou Hong, Jiaxiang Tang, Ziang Cao, Min Shi, Tong Wu, Zhaoxi Chen, Shuai Yang, Tengfei Wang, Liang Pan, Dahua Lin, Ziwei Liu

Abstract: We present a two-stage text-to-3D generation system, namely 3DTopia, which generates high-quality general 3D assets within 5 minutes using hybrid diffusion priors. The first stage samples from a 3D diffusion prior directly learned from 3D data. Specifically, it is powered by a text-conditioned tri-plane latent diffusion model, which quickly generates coarse 3D samples for fast prototyping. The second stage utilizes 2D diffusion priors to further refine the texture of coarse 3D models from the first stage. The refinement consists of both latent and pixel space optimization for high-quality texture generation. To facilitate the training of the proposed system, we clean and caption the largest open-source 3D dataset, Objaverse, by combining the power of vision language models and large language models. Experiment results are reported qualitatively and quantitatively to show the performance of the proposed system. Our codes and models are available at https://github.com/3DTopia/3DTopia

Submitted 6 May, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: Code available at https://github.com/3DTopia/3DTopia

arXiv:2402.16427 [pdf]

Electronic phase transitions and superconductivity in ferroelectric Sn$_2$P$_2$Se$_6$ under pressure

Authors: He Zhang, Wei Zhong, Xiaohui Yu, Binbin Yue, Fang Hong

Abstract: Since there is both strong electron-phonon coupling during a ferroelectric/FE transition and superconducting/SC transition, it has been an important topic to explore superconductivity from the FE instability. Sn$_2$P$_2$Se$_6$ arouses broad attention due to its unique FE properties. Here, we reported the electronic phase transitions and superconductivity in this compound based on high-pressure electrical transport measurement, optical absorption spectroscopy and Raman based structural analysis. Upon compression, the conductivity of Sn$_2$P$_2$Se$_6$ was elevated monotonously, an electronic phase transition occurred near 5.4 GPa, revealed by optical absorption spectroscopy, and the insulating state is estimated to be fully suppressed near 15 GPa. Then, it started to show the signature of superconductivity near 15.3 GPa. The zero-resistance state was presented from 19.4 GPa, and the superconductivity was enhanced with pressure continuously. The magnetic field effect further confirmed the SC behavior and this compound had a $T_c$ of 5.4 K at 41.8 GPa with a zero temperature upper critical field of 6.55 T. The Raman spectra confirmed the structural origin of the electronic transition near 5.4 GPa, which should be due to the transition from the paraelectric phase to the incommensurate phase, and suggested a possible first-order phase transition when the sample underwent the semiconductor-metal transition near 15 GPa. This work demonstrates the versatile physical properties in ferroelectrics and inspires the further investigation on the correlation between FE instability and SC in M$_2$P$_2$X$_6$ family.

Submitted 26 February, 2024; originally announced February 2024.

Comments: 13 pages, 5 figures

arXiv:2312.16484 [pdf]

Emergence of superconductivity near 11 K by suppressing the 3-fold helical-chain structure in noncentrosymmetric HgS

Authors: He Zhang, Wei Zhong, Yanghao Meng, Bowen Tang, Binbin Yue, Xiaohui Yu, Fang Hong

Abstract: The trigonal $α$-HgS has a 3-fold helical chain structure, and is in form of a noncentrosymmetric $P3_121$ phase, known as the cinnabar phase. However, under pressure, the helical chains gradually approach and connect with each other, finally reconstructing into a centrosymmetric NaCl structure at 21 GPa. Superconductivity emerges just after this helical-nonhelical structural transition. The maximum critical temperature ($T_c$) reaches 11 K at 25.4 GPa, $T_c$ decreases with further compression, and is still 3.5 K at 44.8 GPa. Furthermore, the $T_c$-critical magnetic field ($B_{c2}$) relation exhibits multi-band features, with a $B_{c2}$ of 5.65 T at 0 K by two-band fitting. Raman spectra analysis demonstrates that phonon softening plays a key role in structural transition and the emergence of superconductivity. It is noted that HgS is the first reported IIB group metal sulfide superconductor and the only NaCl-type metal sulfide superconductor with a $T_c$ above 10 K. This work will inspire the exploration of superconductivity in other chiral systems and will extend our understanding of the versatile behavior in such kinds of materials.

Submitted 27 December, 2023; originally announced December 2023.

Comments: 16 pages, 6 figures

arXiv:2312.11038 [pdf, other]

UniChest: Conquer-and-Divide Pre-training for Multi-Source Chest X-Ray Classification

Authors: Tianjie Dai, Ruipeng Zhang, Feng Hong, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Abstract: Vision-Language Pre-training (VLP) that utilizes the multi-modal information to promote the training efficiency and effectiveness, has achieved great success in vision recognition of natural domains and shown promise in medical imaging diagnosis for the Chest X-Rays (CXRs). However, current works mainly pay attention to the exploration on single dataset of CXRs, which locks the potential of this powerful paradigm on larger hybrid of multi-source CXRs datasets. We identify that although blending samples from the diverse sources offers the advantages to improve the model generalization, it is still challenging to maintain the consistent superiority for the task of each source due to the existing heterogeneity among sources. To handle this dilemma, we design a Conquer-and-Divide pre-training framework, termed as UniChest, aiming to make full use of the collaboration benefit of multiple sources of CXRs while reducing the negative influence of the source heterogeneity. Specially, the "Conquer" stage in UniChest encourages the model to sufficiently capture multi-source common patterns, and the "Divide" stage helps squeeze personalized patterns into different small experts (query networks). We conduct thorough experiments on many benchmarks, e.g., ChestX-ray14, CheXpert, Vindr-CXR, Shenzhen, Open-I and SIIM-ACR Pneumothorax, verifying the effectiveness of UniChest over a range of baselines, and release our codes and pre-training models at https://github.com/Elfenreigen/UniChest.

Submitted 21 March, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: Accepted at IEEE Transactions on Medical Imaging

arXiv:2312.04559 [pdf, other]

PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation

Authors: Zhaoxi Chen, Fangzhou Hong, Haiyi Mei, Guangcong Wang, Lei Yang, Ziwei Liu

Abstract: We present PrimDiffusion, the first diffusion-based framework for 3D human generation. Devising diffusion models for 3D human generation is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D humans. To tackle these challenges, our key insight is operating the denoising diffusion process directly on a set of volumetric primitives, which models the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering. Our PrimDiffusion framework has three appealing properties: 1) compact and expressive parameter space for the diffusion model, 2) flexible 3D representation that incorporates human prior, and 3) decoder-free rendering for efficient novel-view and novel-pose synthesis. Extensive experiments validate that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Notably, compared to GAN-based methods, our PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of $512\times512$ once the denoising process is done. We also demonstrate the flexibility of our framework on training-free conditional generation such as texture transfer and 3D inpainting.

Submitted 7 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023; Project page https://frozenburning.github.io/projects/primdiffusion/ Code available at https://github.com/FrozenBurning/PrimDiffusion

arXiv:2312.01645 [pdf]

A text-dependent speaker verification application framework based on Chinese numerical string corpus

Authors: Litong Zheng, Feng Hong, Weijie Xu

Abstract: Research indicates that text-dependent speaker verification (TD-SV) often outperforms text-independent verification (TI-SV) in short speech scenarios. However, collecting large-scale fixed text speech data is challenging, and as speech length increases, factors like sentence rhythm and pauses affect TD-SV's sensitivity to text sequence. Based on these factors, we propose the hypothesis that strategies such as more fine-grained pooling methods on time scales and decoupled representations of speech speaker embedding and text embedding are more suitable for TD-SV. We have introduced an end-to-end TD-SV system based on a dataset comprising longer Chinese numerical string texts. It contains a text embedding network, a speaker embedding network, and back-end fusion. First, we recorded a dataset consisting of long Chinese numerical text named SHAL, which is publicly available on the Open-SLR website. We addressed the issue of dataset scarcity by augmenting it using Tacotron2 and HiFi-GAN. Next, we introduced a dual representation of speech with text embedding and speaker embedding. In the text embedding network, we employed an enhanced Transformer and introduced a triple loss that includes text classification loss, CTC loss, and decoder loss. For the speaker embedding network, we enhanced a sliding window attentive statistics pooling (SWASP), combined with attentive statistics pooling (ASP) to create a multi-scale pooling method. Finally, we fused text embedding and speaker embedding. Our pooling methods achieved an equal error rate (EER) performance improvement of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2310.17622 [pdf, other]

Combating Representation Learning Disparity with Geometric Harmonization

Authors: Zhihan Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Bo Han, Yanfeng Wang

Abstract: Self-supervised learning (SSL) as an effective paradigm of representation learning has achieved tremendous success on various curated datasets in diverse scenarios. Nevertheless, when facing the long-tailed distribution in real-world applications, it is still hard for existing methods to capture transferable and robust representation. Conventional SSL methods, pursuing sample-level uniformity, easily lead to representation learning disparity where head classes dominate the feature regime but tail classes passively collapse. To address this problem, we propose a novel Geometric Harmonization (GH) method to encourage category-level uniformity in representation learning, which is more benign to the minority and almost does not hurt the majority under long-tailed distribution. Specially, GH measures the population statistics of the embedding space on top of self-supervised learning, and then infers a fine-grained instance-wise calibration to constrain the space expansion of head classes and avoid the passive collapse of tail classes. Our proposal does not alter the setting of SSL and can be easily integrated into existing methods in a low-cost manner. Extensive results on a range of benchmark datasets show the effectiveness of GH with high tolerance to the distribution skewness. Our code is available at https://github.com/MediaBrain-SJTU/Geometric-Harmonization.

Submitted 26 October, 2023; originally announced October 2023.

Comments: Accepted to NeurIPS 2023 (spotlight)

arXiv:2310.16112 [pdf, other]

Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge

Authors: Gregory Holste, Yiliang Zhou, Song Wang, Ajay Jaiswal, Mingquan Lin, Sherry Zhuge, Yuzhe Yang, Dongkyun Kim, Trong-Hieu Nguyen-Mau, Minh-Triet Tran, Jaehyup Jeong, Wongi Park, Jongbin Ryu, Feng Hong, Arsh Verma, Yosuke Yamagishi, Changhyun Kim, Hyeryeong Seo, Myungjoo Kang, Leo Anthony Celi, Zhiyong Lu, Ronald M. Summers, George Shih, Zhangyang Wang, Yifan Peng

Abstract: Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed" – there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of long-tailed learning in medical image recognition, few have studied the interaction of label imbalance and label co-occurrence posed by long-tailed, multi-label disease classification. To engage with the research community on this emerging topic, we conducted an open challenge, CXR-LT, on long-tailed, multi-label thorax disease classification from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset of over 350,000 CXRs, each labeled with at least one of 26 clinical findings following a long-tailed distribution. We synthesize common themes of top-performing solutions, providing practical recommendations for long-tailed, multi-label medical image classification. Finally, we use these insights to propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.

Submitted 1 April, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: Update after major revision

arXiv:2310.05532 [pdf]

Observation of Emergent Superconductivity in the Quantum Spin Hall Insulator Ta2Pd3Te5 via Pressure Manipulation

Authors: Hui Yu, Dayu Yan, Zhaopeng Guo, Yizhou Zhou, Xue Yang, Peiling Li, Zhijun Wang, Xiaojun Xiang, Junkai Li, Xiaoli Ma, Rui Zhou, Fang Hong, Yunxiao Wuli, Youguo Shi, Jian-Tao Wang, Xiaohui Yu

Abstract: Quantum Spin Hall (QSH) insulators possess distinct helical in-gap states, enabling their edge states to act as one-dimensional conducting channels when backscattering is prohibited by time-reversal symmetry. However, it remains challenging to achieve high-performance combinations of nontrivial topological QSH states with superconductivity for applications and requires understanding of the complicated underlying mechanisms. Here, our experimental observations for a novel superconducting phase in the pressurized QSH insulator Ta2Pd3Te5 are reported, and the high-pressure phase maintains its original ambient pressure lattice symmetry up to 45 GPa. Our in-situ high-pressure synchrotron X-ray diffraction, electrical transport, infrared reflectance, and Raman spectroscopy measurements, in combination with rigorous theoretical calculations, provide compelling evidence for the association between the superconducting behavior and the abnormal densified phase. The isostructural transition was found to modify the topology of the Fermi surface directly, accompanied by a fivefold amplification of the density of states at 20 GPa compared to ambient pressure, which synergistically fosters the emergence of robust superconductivity. A profound comprehension of the fascinating properties exhibited by the compressed Ta2Pd3Te5 phase is achieved, highlighting the extraordinary potential of van der Waals (vdW) QSH insulators for exploring and investigating high-performance electronic advanced devices under extreme conditions.

Submitted 9 October, 2023; originally announced October 2023.

Comments: 16 pages, 4 figures

arXiv:2309.07920 [pdf, other]

Large-Vocabulary 3D Diffusion Model with Transformer

Authors: Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu

Abstract: Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model. Notably, there are three major challenges for this large-vocabulary 3D generation: a) the need for expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; c) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for handling challenges via three aspects. 1) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the cross-plane relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. 3) In addition, we devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality.

Submitted 15 September, 2023; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Project page at https://ziangcao0312.github.io/difftf_pages/

arXiv:2309.04410 [pdf, other]

DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields

Authors: Junzhe Zhang, Yushi Lan, Shuai Yang, Fangzhou Hong, Quan Wang, Chai Kiat Yeo, Ziwei Liu, Chen Change Loy

Abstract: In this paper, we address the challenging problem of 3D toonification, which involves transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN on the artistic domain can produce reasonable performance, this strategy has limitations in the 3D domain. In particular, fine-tuning can deteriorate the original GAN latent space, which affects subsequent semantic editing, and requires independent optimization and storage for each new style, limiting flexibility and efficient deployment. To overcome these challenges, we propose DeformToon3D, an effective toonification framework tailored for hierarchical 3D GAN. Our approach decomposes 3D toonification into subproblems of geometry and texture stylization to better preserve the original latent space. Specifically, we devise a novel StyleField that predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization. Thanks to the StyleField formulation, which already handles geometry stylization well, texture stylization can be achieved conveniently via adaptive style mixing that injects information of the artistic domain into the decoder of the pre-trained 3D GAN. Due to the unique design, our method enables flexible style degree control and shape-texture-specific style swap. Furthermore, we achieve efficient training without any real-world 2D-3D training pairs but proxy samples synthesized from off-the-shelf 2D toonification models.

Submitted 8 September, 2023; originally announced September 2023.

Comments: ICCV 2023. Code: https://github.com/junzhezhang/DeformToon3D Project page: https://www.mmlab-ntu.com/project/deformtoon3d/

arXiv:2309.00610 [pdf, other]

CityDreamer: Compositional Generative Model of Unbounded 3D Cities

Authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

Abstract: 3D city generation is a desirable yet challenging task, since humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances, and 2) background stuff, such as roads and green lands. Specifically, we adopt the bird's eye view scene representation and employ a volumetric render for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterization to suit the distinct characteristics of building instances and background stuff. Furthermore, we contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities.

Submitted 5 June, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

Comments: CVPR 2024. Project page: https://haozhexie.com/project/city-dreamer

arXiv:2308.14492 [pdf, other]

PointHPS: Cascaded 3D Human Pose and Shape Estimation from Point Clouds

Authors: Zhongang Cai, Liang Pan, Chen Wei, Wanqi Yin, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang, Ziwei Liu

Abstract: Human pose and shape estimation (HPS) has attracted increasing attention in recent years. While most existing studies focus on HPS from 2D images or videos with inherent depth ambiguity, there is a surging need to investigate HPS from 3D point clouds as depth sensors have been frequently employed in commercial devices. However, real-world sensory 3D points are usually noisy and incomplete, and also human bodies could have different poses of high diversity. To tackle these challenges, we propose a principled framework, PointHPS, for accurate 3D HPS from point clouds captured in real-world settings, which iteratively refines point features through a cascaded architecture. Specifically, each stage of PointHPS performs a series of downsampling and upsampling operations to extract and collate both local and global cues, which are further enhanced by two novel modules: 1) Cross-stage Feature Fusion (CFF) for multi-scale feature propagation that allows information to flow effectively through the stages, and 2) Intermediate Feature Enhancement (IFE) for body-aware feature aggregation that improves feature quality after each stage. To facilitate a comprehensive study under various scenarios, we conduct our experiments on two large-scale benchmarks, comprising i) a dataset that features diverse subjects and actions captured by real commercial sensors in a laboratory environment, and ii) controlled synthetic data generated with realistic considerations such as clothed humans in crowded outdoor scenes. Extensive experiments demonstrate that PointHPS, with its powerful point feature extraction and processing scheme, outperforms State-of-the-Art methods by significant margins across the board. Homepage: https://caizhongang.github.io/projects/PointHPS/.

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2308.09712 [pdf, other]

HumanLiff: Layer-wise 3D Human Generation with Diffusion Model

Authors: Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu

Abstract: 3D human generation from 2D images has achieved remarkable progress through the synergistic utilization of neural rendering and generative models. Existing 3D human generative models mainly generate a clothed 3D human as an undetectable 3D model in a single pass, while rarely considering the layer-wise nature of a clothed human body, which often consists of the human body and various clothes such as underwear, outerwear, trousers, shoes, etc. In this work, we propose HumanLiff, the first layer-wise 3D human generative model with a unified diffusion process. Specifically, HumanLiff firstly generates minimal-clothed humans, represented by tri-plane features, in a canonical space, and then progressively generates clothes in a layer-wise manner. In this way, the 3D human generation is thus formulated as a sequence of diffusion-based 3D conditional generation. To reconstruct more fine-grained 3D humans with tri-plane representation, we propose a tri-plane shift operation that splits each tri-plane into three sub-planes and shifts these sub-planes to enable feature grid subdivision. To further enhance the controllability of 3D generation with 3D layered conditions, HumanLiff hierarchically fuses tri-plane features and 3D layered conditions to facilitate the 3D diffusion model learning. Extensive experiments on two layer-wise 3D human datasets, SynBody (synthetic) and TightCap (real-world), validate that HumanLiff significantly outperforms state-of-the-art methods in layer-wise 3D human generation. Our code will be available at https://skhu101.github.io/HumanLiff.

Submitted 18 August, 2023; originally announced August 2023.

Comments: Project page: https://skhu101.github.io/HumanLiff/

arXiv:2308.08853 [pdf, other]

Bag of Tricks for Long-Tailed Multi-Label Classification on Chest X-Rays

Authors: Feng Hong, Tianjie Dai, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Abstract: Clinical classification of chest radiography is particularly challenging for standard machine learning algorithms due to its inherent long-tailed and multi-label nature. However, few attempts take into account the coupled challenges posed by both the class imbalance and label co-occurrence, which hinders their value to boost the diagnosis on chest X-rays (CXRs) in the real-world scenarios. Besides, with the prevalence of pretraining techniques, how to incorporate these new paradigms into the current framework lacks systematic study. This technical report presents a brief description of our solution in the ICCV CVAMD 2023 CXR-LT Competition. We empirically explored the effectiveness for CXR diagnosis with the integration of several advanced designs about data augmentation, feature extractor, classifier design, loss function reweighting, exogenous data replenishment, etc. In addition, we improve the performance through simple test-time data augmentation and ensemble. Our framework finally achieves 0.349 mAP on the competition test set, ranking in the top five.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted for the ICCV 2023 Workshop on Computer Vision for Automated Medical Diagnosis (CVAMD)

arXiv:2308.01698 [pdf, other]

Balanced Destruction-Reconstruction Dynamics for Memory-replay Class Incremental Learning

Authors: Yuhang Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Yanfeng Wang

Abstract: Class incremental learning (CIL) aims to incrementally update a trained model with the new classes of samples (plasticity) while retaining previously learned ability (stability). To address the most challenging issue in this goal, i.e., catastrophic forgetting, the mainstream paradigm is memory-replay CIL, which consolidates old knowledge by replaying a small number of old classes of samples saved in the memory. Despite effectiveness, the inherent destruction-reconstruction dynamics in memory-replay CIL are an intrinsic limitation: if the old knowledge is severely destructed, it will be quite hard to reconstruct the lossless counterpart. Our theoretical analysis shows that the destruction of old knowledge can be effectively alleviated by balancing the contribution of samples from the current phase and those saved in the memory. Motivated by this theoretical finding, we propose a novel Balanced Destruction-Reconstruction module (BDR) for memory-replay CIL, which can achieve better knowledge reconstruction by reducing the degree of maximal destruction of old knowledge. Specifically, to achieve a better balance between old knowledge and new classes, the proposed BDR module takes into account two factors: the variance in training status across different classes and the quantity imbalance of samples from the current phase and memory. By dynamically manipulating the gradient during training based on these factors, BDR can effectively alleviate knowledge destruction and improve knowledge reconstruction. Extensive experiments on a range of CIL benchmarks have shown that as a lightweight plug-and-play module, BDR can significantly improve the performance of existing state-of-the-art methods with good generalization.

Submitted 3 August, 2023; originally announced August 2023.

arXiv:2307.16438 [pdf]

Coexistence of Superconductivity and ferromagnetism in high entropy carbide ceramics

Authors: Huchen Shu, Wei Zhong, Jiajia Feng, Hongyang Zhao, Fang Hong, Binbin Yue

Abstract: Generally, superconductivity was expected to be absent in magnetic systems, but this perception was challenged by unconventional superconductors, such as cuprates, iron-based superconductors and the recently discovered nickelates, since their superconductivity is proposed to be related to the electron-electron interaction mediated by spin fluctuations. However, the coexistence of superconductivity and magnetism is still rare in conventional superconductors. In this work, we report the coexistence of these two quantum orderings in the high entropy carbide ceramics (Mo0.2Nb0.2Ta0.2V0.2W0.2)C0.9 and (Ta0.25Ti0.25Nb0.25Zr0.25)C, which are expected to be conventional superconductors. A clear magnetic hysteresis loop was observed in these high entropy carbides, indicating a ferromagnetic ground state. A sharp superconducting transition is observed in (Mo0.2Nb0.2Ta0.2V0.2W0.2)C0.9 with a Tc of 3.4 K and an upper critical field of ~3.35 T. Meanwhile, superconductivity is suppressed to some extent and the zero-resistance state disappears in (Ta0.25Ti0.25Nb0.25Zr0.25)C, in which stronger magnetism is present. The upper critical field of (Ta0.25Ti0.25Nb0.25Zr0.25)C is only ~1.5 T, though it shows a higher transition temperature near 5.7 K. The ferromagnetism stems from the carbon vacancies, which often occur during the high temperature synthesis process. This work not only demonstrates the observation of superconductivity in high entropy carbide ceramics, but also provides an alternative exotic platform to study the correlation between superconductivity and magnetism, and is of great benefit for the design of multifunctional electronic devices.

Submitted 31 July, 2023; originally announced July 2023.

Comments: 16 pages, 5 figures, 1 table. Suggestions and comments are welcome

arXiv:2307.09906 [pdf, other]

Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation

Authors: Fa-Ting Hong, Dan Xu

Abstract: Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on VoxCeleb1 and CelebV datasets. Please check our project page: https://github.com/harlanhong/ICCV2023-MCNET

Submitted 18 August, 2023; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV2023, update the reference and figures

arXiv:2305.16504 [pdf, other]

On the Tool Manipulation Capability of Open-source Large Language Models

Authors: Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, Jian Zhang

Abstract: Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks in exposing information to closed LLM API services. In this paper, we ask can we enhance open-source LLMs to be competitive to leading closed LLM APIs in tool manipulation, with practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in LLM literature, and demonstrate that we can adapt them as model alignment with programmatic data generation, system prompts and in-context demonstration retrievers to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create the ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 90% success rate, showing capabilities competitive to OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, rendering a recipe with practical amount of human supervision.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.06225 [pdf, other]

DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Authors: Fa-Ting Hong, Li Shen, Dan Xu

Abstract: Predominant techniques on talking head generation largely depend on 2D information, including facial appearances and motions from input face images. Nevertheless, dense 3D facial geometry, such as pixel-wise depth, plays a critical role in constructing accurate 3D facial structures and suppressing complex background noises for generation. However, dense 3D annotations for facial videos are prohibitively costly to obtain. In this work, firstly, we present a novel self-supervised method for learning dense 3D facial geometry (i.e., depth) from face videos, without requiring camera parameters and 3D geometry annotations in training. We further propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning. Secondly, we design an effective geometry-guided facial keypoint estimation module, providing accurate keypoints for generating motion fields. Lastly, we develop a 3D-aware cross-modal (i.e., appearance and depth) attention mechanism, which can be applied to each generation layer, to capture facial geometries in a coarse-to-fine manner. Extensive experiments are conducted on three challenging benchmarks (i.e., VoxCeleb1, VoxCeleb2, and HDTF). The results demonstrate that our proposed framework can generate highly realistic-looking reenacted talking videos, with new state-of-the-art performances established on these benchmarks. The codes and trained models are publicly available on the GitHub project page at https://github.com/harlanhong/CVPR2022-DaGAN

Submitted 10 December, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

Comments: Accepted at TPAMI; CVPR 2022 extension

arXiv:2304.01116 [pdf, other]

ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model

Authors: Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu

Abstract: 3D human motion generation is crucial for creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, the performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity in classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing both text-motion consistency and motion quality, especially for more diverse motion generation.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2303.15944 [pdf, other]

doi 10.1109/LSP.2023.3280851

Cluster-Guided Unsupervised Domain Adaptation for Deep Speaker Embedding

Authors: Haiquan Mao, Feng Hong, Man-wai Mak

Abstract: Recent studies have shown that pseudo labels can contribute to unsupervised domain adaptation (UDA) for speaker verification. Inspired by the self-training strategies that use an existing classifier to label the unlabeled data for retraining, we propose a cluster-guided UDA framework that labels the target domain data by clustering and combines the labeled source domain data and pseudo-labeled target domain data to train a speaker embedding network. To improve the cluster quality, we train a speaker embedding network dedicated for clustering by minimizing the contrastive center loss. The goal is to reduce the distance between an embedding and its assigned cluster center while enlarging the distance between the embedding and the other cluster centers. Using VoxCeleb2 as the source domain and CN-Celeb1 as the target domain, we demonstrate that the proposed method can achieve an equal error rate (EER) of 8.10% on the CN-Celeb1 evaluation set without using any labels from the target domain. This result outperforms the supervised baseline by 39.6% and is the state-of-the-art UDA performance on this corpus.

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2303.12791 [pdf, other]

SHERF: Generalizable Human NeRF from a Single Image

Authors: Shoukang Hu, Fangzhou Hong, Liang Pan, Haiyi Mei, Lei Yang, Ziwei Liu

Abstract: Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis.

Submitted 16 August, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: Accepted by ICCV2023. Project webpage: https://skhu101.github.io/SHERF/

arXiv:2303.09721 [pdf, ps, other]

doi 10.1103/PhysRevA.107.032608

Frequency-multiplexed Hong-Ou-Mandel interference

Authors: Mayuka Ichihara, Daisuke Yoshida, Feng-Lei Hong, Tomoyuki Horikiri

Abstract: The implementation of quantum repeaters needed for long-distance quantum communication requires the generation of quantum entanglement distributed among the elementary links. These entanglements must be swapped among the quantum repeaters through Bell-state measurements. This study aims to improve the entanglement generation rate by frequency multiplexing the Bell-state measurements. As a preliminary step of the frequency-multiplexed Bell-state measurements, three frequency modes are mapped to a temporal mode by an atomic frequency comb prepared in $\mathrm{Pr^{3+}}$ ion-doped $\mathrm{Y_2SiO_5}$ crystals using a weak coherent state, and Hong-Ou-Mandel interference, which is a measure of the indistinguishability of two inputs, is observed in each frequency mode by coincidence detection. The visibility for all the modes was 40%-42% (theoretically up to 50%). Furthermore, we show that a mixture of different modes is avoided. The present results are connected to frequency-selective Bell-state measurements and therefore frequency-multiplexed quantum repeaters.

Submitted 16 March, 2023; originally announced March 2023.

Comments: 9 pages, 6 figures

Journal ref: Physical Review A 107, 032608 (2023)

arXiv:2303.01677 [pdf]

doi 10.1103/PhysRevApplied.19.024070

Frequency-multiplexed storage and distribution of narrowband telecom photon pairs over a 10-km fiber link with long-term system stability

Authors: Ko Ito, Takeshi Kondo, Kyoko Mannami, Kazuya Niizeki, Daisuke Yoshida, Kohei Minaguchi, Mingyang Zheng, Feng-Lei Hong, Tomoyuki Horikiri

Abstract: The ability to transmit quantum states over long distances is a fundamental requirement of the quantum internet and is reliant upon quantum repeaters. Quantum repeaters involve entangled photon sources that emit and deliver photonic entangled states at high rates and quantum memories that can temporarily store quantum states. Improvement of the entanglement distribution rate is essential for quantum repeaters, and multiplexing is expected to be a breakthrough. However, limited studies exist on multiplexed photon sources and their coupling with a multiplexed quantum memory. Here, we demonstrate the storing of a frequency-multiplexed two-photon source at telecommunication wavelengths in a quantum memory accepting visible wavelengths via wavelength conversion after 10-km distribution. To achieve this, quantum systems are connected via wavelength conversion with a frequency stabilization system and a noise reduction system. The developed system was stably operated for more than 42 h. Therefore, it can be applied to quantum repeater systems comprising various physical systems requiring long-term system stability.

Submitted 2 March, 2023; originally announced March 2023.

Comments: 10 pages

Journal ref: Physical Review Applied 19, 024070 (2023)

arXiv:2302.05080 [pdf, other]

Long-Tailed Partial Label Learning via Dynamic Rebalancing

Authors: Feng Hong, Jiangchao Yao, Zhihan Zhou, Ya Zhang, Yanfeng Wang

Abstract: Real-world data usually couples the label ambiguity and heavy imbalance, challenging the algorithmic robustness of partial label learning (PLL) and long-tailed learning (LT). The straightforward combination of LT and PLL, i.e., LT-PLL, suffers from a fundamental dilemma: LT methods build upon a given class distribution that is unavailable in PLL, and the performance of PLL is severely influenced in long-tailed context. We show that even with the auxiliary of an oracle class prior, the state-of-the-art methods underperform due to an adverse fact that the constant rebalancing in LT is harsh to the label disambiguation in PLL. To overcome this challenge, we thus propose a dynamic rebalancing method, termed as RECORDS, without assuming any prior knowledge about the class distribution. Based on a parametric decomposition of the biased output, our method constructs a dynamic adjustment that is benign to the label disambiguation process and theoretically converges to the oracle class prior. Extensive experiments on three benchmark datasets demonstrate the significant gain of RECORDS compared with a range of baselines. The code is publicly available.

Submitted 10 February, 2023; originally announced February 2023.

Comments: ICLR 2023

arXiv:2210.13112 [pdf, other]

Optimization-based Motion Planning for Autonomous Parking Considering Dynamic Obstacle: A Hierarchical Framework

Authors: Xuemin Chi, Zhitao Liu, Jihao Huang, Feng Hong, Hongye Su

Abstract: This paper introduces a hierarchical framework that integrates graph search algorithms and model predictive control to facilitate efficient parking maneuvers for Autonomous Vehicles (AVs) in constrained environments. In the high-level planning phase, the framework incorporates scenario-based hybrid A* (SHA*), an optimized variant of traditional Hybrid A*, to generate an initial path while considering static obstacles. This global path serves as an initial guess for the low-level NLP problem. In the low-level optimizing phase, a nonlinear model predictive control (NMPC)-based framework is deployed to circumvent dynamic obstacles. The performance of SHA* is empirically validated through 148 simulation scenarios, and the efficacy of the proposed hierarchical framework is demonstrated via a real-time parallel parking simulation.

Submitted 14 November, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: Update some typos and references

arXiv:2210.10313 [pdf, ps, other]

doi 10.1103/PhysRevA.106.052602

Single-shot high-resolution identification of discrete frequency modes of single-photon-level optical pulses

Authors: Daisuke Yoshida, Mayuka Ichihara, Takeshi Kondo, Feng-Lei Hong, Tomoyuki Horikiri

Abstract: Frequency-multiplexed quantum communication usually requires a single-shot identification of the frequency mode of a single photon. In this paper, we propose a scheme that can identify the frequency mode with high-resolution even for spontaneously emitted photons whose generation time is unknown, by combining the time-to-space and frequency-to-time mode mapping. We also demonstrate the mapping of the frequency mode (100 MHz intervals) to the temporal mode (435 ns intervals) for weak coherent pulses using atomic frequency combs. This frequency interval is close to the minimum frequency mode interval of the atomic frequency comb quantum memory with Pr3+ ion-doped Y2SiO5 crystal, and the proposed scheme has the potential to maximize the frequency multiplexing of the quantum repeater scheme with the memory.

Submitted 19 October, 2022; originally announced October 2022.

Comments: 6 pages, 5 figures

arXiv:2210.08828 [pdf, other]

Search-Based Path Planning Algorithm for Autonomous Parking: Multi-Heuristic Hybrid A*

Authors: Jihao Huang, Zhitao Liu, Xuemin Chi, Feng Hong, Hongye Su

Abstract: This paper proposed a novel method for autonomous parking. Autonomous parking has received a lot of attention because of its convenience, but due to the complex environment and the non-holonomic constraints of the vehicle, it is difficult to get a collision-free and feasible path in a short time. To solve this problem, this paper introduced a novel algorithm called Multi-Heuristic Hybrid A* (MHHA*), which incorporates the characteristics of Multi-Heuristic A* and Hybrid A*. It could therefore provide the guarantee for completeness, the avoidance of local minima and sub-optimality, and generate a feasible path in a short time. This paper also proposed a new collision check method based on coordinate transformation which could improve the computational efficiency. The performance of the proposed method was compared with Hybrid A* in simulation experiments and its superiority has been proved.

Submitted 17 October, 2022; originally announced October 2022.

arXiv:2210.04888 [pdf, other]

EVA3D: Compositional 3D Human Generation from 2D Image Collections

Authors: Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, Ziwei Liu

Abstract: Inverse graphics aims to recover 3D models from 2D observations. Utilizing differentiable rendering, recent 3D-aware generative models have shown impressive results of rigid object generation using 2D images. However, it remains challenging to generate articulated objects, like human bodies, due to their complexity and diversity in poses and appearances. In this work, we propose EVA3D, an unconditional 3D human generative model learned from 2D image collections only. EVA3D can sample 3D humans with detailed geometry and render high-quality images (up to 512x256) without bells and whistles (e.g. super resolution). At the core of EVA3D is a compositional human NeRF representation, which divides the human body into local parts. Each part is represented by an individual volume. This compositional representation enables 1) inherent human priors, 2) adaptive allocation of network parameters, 3) efficient training and rendering. Moreover, to accommodate the characteristics of sparse 2D human image collections (e.g. imbalanced pose distribution), we propose a pose-guided sampling strategy for better GAN learning. Extensive experiments validate that EVA3D achieves state-of-the-art 3D human generation performance regarding both geometry and texture quality. Notably, EVA3D demonstrates great potential and scalability to "inverse-graphics" diverse human bodies with a clean framework.

Submitted 10 October, 2022; originally announced October 2022.

Comments: Project Page at https://hongfz16.github.io/projects/EVA3D.html

arXiv:2208.15001 [pdf, other]

MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model

Authors: Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, Ziwei Liu

Abstract: Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation. Homepage: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html

Submitted 31 August, 2022; originally announced August 2022.

arXiv:2206.11011 [pdf, other]

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Authors: Jia-Run Du, Jia-Chang Feng, Kun-Yu Lin, Fa-Ting Hong, Xiao-Ming Wu, Zhongang Qi, Ying Shan, Wei-Shi Zheng

Abstract: Weakly Supervised Temporal Action Localization (WSTAL) aims to localize and classify action instances in long untrimmed videos with only video-level category labels. Due to the lack of snippet-level supervision for indicating action boundaries, previous methods typically assign pseudo labels for unlabeled snippets. However, since some action instances of different categories are visually similar, it is non-trivial to exactly label the (usually) one action category for a snippet, and incorrect pseudo labels would impair the localization performance. To address this problem, we propose a novel method from a category exclusion perspective, named Progressive Complementary Learning (ProCL), which gradually enhances the snippet-level supervision. Our method is inspired by the fact that video-level labels precisely indicate the categories that all snippets surely do not belong to, which is ignored by previous works. Accordingly, we first exclude these surely non-existent categories by a complementary learning loss. And then, we introduce the background-aware pseudo complementary labeling in order to exclude more categories for snippets of less ambiguity. Furthermore, for the remaining ambiguous snippets, we attempt to reduce the ambiguity by distinguishing foreground actions from the background. Extensive experimental results show that our method achieves new state-of-the-art performance on two popular benchmarks, namely THUMOS14 and ActivityNet1.3.

Submitted 14 November, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

arXiv:2205.08535 [pdf, other]

AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars

Authors: Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, Ziwei Liu

Abstract: 3D avatar creation plays a crucial role in the digital age. However, the whole production process is prohibitively time-consuming and labor-intensive. To democratize this technology to a larger audience, we propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely natural languages. Our key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation. Specifically, driven by natural language descriptions, we initialize 3D human geometry generation with a shape VAE network. Based on the generated 3D human shapes, a volume rendering model is utilized to further facilitate geometry sculpting and texture generation. Moreover, by leveraging the priors learned in the motion VAE, a CLIP-guided reference-based motion synthesis method is proposed for the animation of the generated 3D avatar. Extensive qualitative and quantitative experiments validate the effectiveness and generalizability of AvatarCLIP on a wide range of avatars. Remarkably, AvatarCLIP can generate unseen 3D avatars with novel animations, achieving superior zero-shot capability.

Submitted 17 May, 2022; originally announced May 2022.

Comments: SIGGRAPH 2022; Project Page: https://hongfz16.github.io/projects/AvatarCLIP.html; Code available at https://github.com/hongfz16/AvatarCLIP

arXiv:2204.13686 [pdf, other]

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling

Authors: Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang, Ziwei Liu

Abstract: 4D human sensing and modeling are fundamental tasks in vision and graphics with numerous applications. With the advances of new sensors and algorithms, there is an increasing demand for more versatile datasets. In this work, we contribute HuMMan, a large-scale multi-modal 4D human dataset with 1000 human subjects, 400k sequences and 60M frames. HuMMan has several appealing properties: 1) multi-modal data and annotations including color images, point clouds, keypoints, SMPL parameters, and textured meshes; 2) popular mobile device is included in the sensor suite; 3) a set of 500 actions, designed to cover fundamental movements; 4) multiple tasks such as action recognition, pose estimation, parametric human recovery, and textured mesh reconstruction are supported and evaluated. Extensive experiments on HuMMan voice the need for further study on challenges such as fine-grained action recognition, dynamic human mesh reconstruction, point cloud-based parametric human recovery, and cross-device domain gaps.

Submitted 16 April, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: Homepage: https://caizhongang.github.io/projects/HuMMan/

arXiv:2204.05064 [pdf]

Quantum sensing with diamond NV centers under megabar pressures

Authors: Jian-Hong Dai, Yan-Xing Shang, Yong-Hong Yu, Yue Xu, Hui Yu, Fang Hong, Xiao-Hui Yu, Xin-Yu Pan, Gang-Qin Liu

Abstract: Megabar pressures are of crucial importance for cutting-edge studies of condensed matter physics and geophysics. With the development of diamond anvil cell, laboratory studies of high pressure have entered the megabar era for decades. However, it is still challenging to implement in-situ magnetic sensing under ultrahigh pressures. Here, we demonstrate optically detected magnetic resonance of diamond nitrogen-vacancy (NV) centers, a promising quantum sensor of strain and magnetic fields, up to 1.4 Mbar. We quantify the reduction and blueshifts of NV fluorescence under high pressures. We demonstrate coherent manipulation of NV electron spins and extend its working pressure to the megabar region. These results shed new light on our understanding of diamond NV centers and will benefit quantum sensing under extreme conditions.

Submitted 11 April, 2022; originally announced April 2022.

Comments: 9 pages, 4 figures

Showing 1–50 of 123 results for author: Hong, F