Showing 1–16 of 16 results for author: Zhang, D J

  1. arXiv:2403.07420  [pdf, other]

    cs.CV

    DragAnything: Motion Control for Anything using Entity Representation

    Authors: Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang

    Abstract: We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared with existing motion control methods, DragAnything offers several advantages. Firstly, trajectory-based control is more user-friendly for interaction, when acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive. Users only need to draw…

    Submitted 15 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: The project website is at: https://weijiawu.github.io/draganything_page/ . The code is at: https://github.com/showlab/DragAnything

  2. arXiv:2401.07781  [pdf, other]

    cs.CV

    Towards A Better Metric for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

    Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Project page: https://showlab.github.io/T2VScore/

  3. arXiv:2401.01827  [pdf, other]

    cs.CV

    Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

    Authors: David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo

    Abstract: Most existing video diffusion models (VDMs) are limited to mere text conditions. Thereby, they are usually lacking in control over visual appearance and geometry structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called multimodal video block (MVB),…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: project page: https://showlab.github.io/Moonshot/

  4. arXiv:2312.12684  [pdf, ps, other]

    nlin.SI math-ph

    On the Lagrangian multiform structure of the extended lattice Boussinesq system

    Authors: F. W. Nijhoff, D. J. Zhang

    Abstract: The lattice Boussinesq (lBSQ) equation is a member of the lattice Gel'fand-Dikii (lGD) hierarchy, introduced in \cite{NijPapCapQui1992}, which is an infinite family of integrable systems of partial difference equations labelled by an integer $N$, where $N=2$ represents the lattice Korteweg-de Vries (KdV) system, and $N=3$ the Boussinesq system. In \cite{Hiet2011} it was shown that, written as thre…

    Submitted 27 January, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: 10 pages, 0 figures. Accepted version, formatted in accordance with the publication template

    Journal ref: Open Communications in Nonlinear Mathematical Physics, Special Issue in Memory of Decio Levi (February 15, 2024) ocnmp:12759

  5. arXiv:2312.02087  [pdf, other]

    cs.CV

    VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

    Authors: Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang

    Abstract: Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change, we explore customized video subject swapping in this work, where we aim to re…

    Submitted 5 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Project page at https://videoswap.github.io

  6. arXiv:2310.08465  [pdf, other]

    cs.CV

    MotionDirector: Motion Customization of Text-to-Video Diffusion Models

    Authors: Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou

    Abstract: Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generations. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. For example, generating a video with a car moving in a prescribed manner under specific camera movements to make…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: Project Page: https://showlab.github.io/MotionDirector/

  7. arXiv:2309.15818  [pdf, other]

    cs.CV

    Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

    Authors: David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou

    Abstract: Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marri…

    Submitted 17 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: project page is https://showlab.github.io/Show-1

  8. arXiv:2309.07698  [pdf, other]

    cs.CV

    Dataset Condensation via Generative Model

    Authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou

    Abstract: Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into pixel format. However, this suffers from slow optimization speed and a large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation meth…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: old work, done in 2022

  9. arXiv:2308.06739  [pdf, other]

    cs.CV

    Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

    Authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou

    Abstract: Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been…

    Submitted 13 August, 2023; originally announced August 2023.

  10. arXiv:2305.20087  [pdf, other]

    cs.CV

    Too Large; Data Reduction for Vision-Language Pre-Training

    Authors: Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

    Abstract: This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major s…

    Submitted 18 August, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: ICCV2023. Code: https://github.com/showlab/datacentric.vlp

  11. arXiv:2303.08685  [pdf, other]

    cs.CV cs.LG

    Making Vision Transformers Efficient from A Token Sparsification View

    Authors: Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou

    Abstract: The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. I…

    Submitted 30 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted by CVPR2023

  12. arXiv:2206.00309  [pdf, other]

    cs.CV

    Label-Efficient Online Continual Object Detection in Streaming Video

    Authors: Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, Mike Zheng Shou

    Abstract: Humans can watch a continuous video stream and effortlessly perform continual acquisition and transfer of new knowledge with minimal supervision yet retaining previously learnt experiences. In contrast, existing continual learning (CL) methods require fully annotated labels to effectively learn from individual frames in a video stream. Here, we examine a more realistic and challenging problem…

    Submitted 23 August, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

    Comments: ICCV 2023

  13. arXiv:2205.15723  [pdf, other]

    cs.CV

    DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

    Authors: Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented fidelity for novel view synthesis in dynamic scenes, existing methods based on Neural Radiance Fields (NeRF) suffer from slow convergence (i.e., model training time measured in days). In this paper, we present DeVRF, a novel representation to accelerate learning dyna…

    Submitted 4 June, 2022; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: Project page: https://jia-wei-liu.github.io/DeVRF/

  14. arXiv:2204.02148  [pdf, other]

    cs.CV

    Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition

    Authors: Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang, Yu Qiao

    Abstract: Learning spatial-temporal relation among multiple actors is crucial for group activity recognition. Different group activities often show the diversified interactions between actors in the video. Hence, it is often difficult to model complex group activities from a single view of spatial-temporal actor evolution. To tackle this problem, we propose a distinct Dual-path Actor Interaction (DualAI) fr…

    Submitted 6 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: CVPR 2022 Oral presentation

  15. arXiv:2111.12527  [pdf, other]

    cs.CV

    MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

    Authors: David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou

    Abstract: Recently, MLP-Like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-Like architecture in the video domain has not been explored, due to complex spatial-temporal modeling with a large computation burden. To fill this gap, we present an efficient self-attention free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC)…

    Submitted 23 August, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: ECCV2022

  16. arXiv:2006.11588  [pdf, other]

    physics.flu-dyn physics.comp-ph

    Two-fluid discrete Boltzmann model for compressible flows: based on Ellipsoidal Statistical Bhatnagar-Gross-Krook

    Authors: D. J. Zhang, A. G. Xu, Y. D. Zhang, Y. J. Li

    Abstract: A two-fluid Discrete Boltzmann Model (DBM) for compressible flows based on Ellipsoidal Statistical Bhatnagar-Gross-Krook (ES-BGK) is presented. The model has flexible Prandtl number or specific heat ratio. Mathematically, the model is composed of two coupled Discrete Boltzmann Equations (DBE). Each DBE describes one component of the fluid. Physically, the model is equivalent to a macroscopic fluid mo…

    Submitted 14 December, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

    Journal ref: Phys. Fluids 32, 126110 (2020)