-
DragAnything: Motion Control for Anything using Entity Representation
Authors:
Weijia Wu,
Zhuang Li,
Yuchao Gu,
Rui Zhao,
Yefei He,
David Junhao Zhang,
Mike Zheng Shou,
Yan Li,
Tingting Gao,
Di Zhang
Abstract:
We introduce DragAnything, which utilizes an entity representation to achieve motion control for any object in controllable video generation. Compared to existing motion control methods, DragAnything offers several advantages. First, trajectory-based control is more user-friendly for interaction when acquiring other guidance signals (e.g., masks, depth maps) is labor-intensive. Users only need to draw a line (trajectory) during interaction. Second, our entity representation serves as an open-domain embedding capable of representing any object, enabling the control of motion for diverse entities, including the background. Lastly, our entity representation allows simultaneous and distinct motion control for multiple objects. Extensive experiments demonstrate that DragAnything achieves state-of-the-art performance for FVD, FID, and User Study, particularly in terms of object motion control, where our method surpasses previous methods (e.g., DragNUWA) by 26% in human voting.
Submitted 15 March, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Towards A Better Metric for Text-to-Video Generation
Authors:
Jay Zhangjie Wu,
Guian Fang,
Haoning Wu,
Xintao Wang,
Yixiao Ge,
Xiaodong Cun,
David Junhao Zhang,
Jia-Wei Liu,
Yuchao Gu,
Rui Zhao,
Weisi Lin,
Wynne Hsu,
Ying Shan,
Mike Zheng Shou
Abstract:
Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.
Submitted 15 January, 2024;
originally announced January 2024.
-
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
Authors:
David Junhao Zhang,
Dongxu Li,
Hung Le,
Mike Zheng Shou,
Caiming Xiong,
Doyen Sahoo
Abstract:
Most existing video diffusion models (VDMs) are limited to mere text conditions. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features, and a decoupled cross-attention layer to address image and text inputs for appearance conditioning. In addition, we carefully design the model architecture such that it can optionally integrate with pre-trained image ControlNet modules for geometry visual conditions, without extra training overhead as opposed to prior methods. Experiments show that with versatile multimodal conditioning mechanisms, Moonshot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
Submitted 3 January, 2024;
originally announced January 2024.
-
On the Lagrangian multiform structure of the extended lattice Boussinesq system
Authors:
F. W. Nijhoff,
D. J. Zhang
Abstract:
The lattice Boussinesq (lBSQ) equation is a member of the lattice Gel'fand-Dikii (lGD) hierarchy, introduced in \cite{NijPapCapQui1992}, which is an infinite family of integrable systems of partial difference equations labelled by an integer $N$, where $N=2$ represents the lattice Korteweg-de Vries (KdV) system, and $N=3$ the Boussinesq system. In \cite{Hiet2011} it was shown that, written as a three-component system, the lBSQ system allows for extra parameters, which essentially amounts to building the lattice KdV inside the lBSQ. In this paper we show that, on the level of the Lagrangian structure, this boils down to a linear combination of Lagrangians from the members of the lGD hierarchy as was established in \cite{LobbNijGD2010}. The corresponding Lagrangian multiform structure is shown to exhibit a `double zero' structure.
Submitted 27 January, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
-
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
Authors:
Yuchao Gu,
Yipin Zhou,
Bichen Wu,
Licheng Yu,
Jia-Wei Liu,
Rui Zhao,
Jay Zhangjie Wu,
David Junhao Zhang,
Mike Zheng Shou,
Kevin Tang
Abstract:
Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change, we explore customized video subject swapping in this work, where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. We also introduce various user-point interactions (e.g., removing points and dragging points) to address various semantic point correspondences. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
Submitted 5 December, 2023; v1 submitted 4 December, 2023;
originally announced December 2023.
-
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
Authors:
Rui Zhao,
Yuchao Gu,
Jay Zhangjie Wu,
David Junhao Zhang,
Jiawei Liu,
Weijia Wu,
Jussi Keppo,
Mike Zheng Shou
Abstract:
Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. For example, generating a video with a car moving in a prescribed manner under specific camera movements to make a movie, or a video illustrating how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance like subject or style, yet remain unexplored for motion. It is straightforward to extend mainstream adaptation methods for motion customization, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptations (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRA architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions. Our code and model weights will be released.
Submitted 12 October, 2023;
originally announced October 2023.
-
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Authors:
David Junhao Zhang,
Jay Zhangjie Wu,
Jia-Wei Liu,
Rui Zhao,
Lingmin Ran,
Yuchao Gu,
Difei Gao,
Mike Zheng Shou
Abstract:
Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
Submitted 17 October, 2023; v1 submitted 27 September, 2023;
originally announced September 2023.
-
Dataset Condensation via Generative Model
Authors:
David Junhao Zhang,
Heng Wang,
Chuhui Xue,
Rui Yan,
Wenqing Zhang,
Song Bai,
Mike Zheng Shou
Abstract:
Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into pixel format. However, this format suffers from slow optimization and a large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected and hence the feature distribution of condensed samples is often not diverse. To solve these problems, we propose to condense the dataset into another format, a generative model. Such a novel format allows for the condensation of large datasets because the size of the generative model remains relatively stable as the number of classes or image resolution increases. Furthermore, an intra-class and an inter-class loss are proposed to model the relation of condensed samples. Intra-class loss aims to create more diverse samples for each class by pushing each sample away from the others of the same class. Meanwhile, inter-class loss increases the discriminability of samples by widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual components. To the best of our knowledge, we are the first to successfully conduct condensation on ImageNet-1k.
Submitted 14 September, 2023;
originally announced September 2023.
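The intra-class and inter-class losses described in the abstract above can be illustrated with a minimal sketch. This is not the paper's formulation: the function names, the use of Euclidean distance, and the hinge `margin` are all assumptions made for illustration. The intra-class term rewards spread among samples of the same class, and the inter-class term penalizes class centers that sit closer than the margin.

```python
import math
from collections import defaultdict
from itertools import combinations


def _dist(a, b):
    # Euclidean distance between two feature vectors (an assumed choice).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _group_by_class(features, labels):
    groups = defaultdict(list)
    for feat, label in zip(features, labels):
        groups[label].append(feat)
    return groups


def intra_class_loss(features, labels):
    """Negative mean pairwise distance within each class.

    Minimizing this value pushes samples of the same class apart.
    """
    groups = _group_by_class(features, labels)
    dists = [
        _dist(a, b)
        for feats in groups.values()
        for a, b in combinations(feats, 2)
    ]
    return -sum(dists) / len(dists) if dists else 0.0


def inter_class_loss(features, labels, margin=1.0):
    """Mean hinge penalty on class centers closer than `margin` (hypothetical)."""
    groups = _group_by_class(features, labels)
    centers = [
        [sum(col) / len(feats) for col in zip(*feats)]
        for feats in groups.values()
    ]
    penalties = [max(0.0, margin - _dist(a, b)) for a, b in combinations(centers, 2)]
    return sum(penalties) / len(penalties) if penalties else 0.0


# Toy features: two classes with two 2-D samples each.
features = [[0.0, 0.0], [0.0, 2.0], [3.0, 0.0], [3.0, 4.0]]
labels = [0, 0, 1, 1]
print(intra_class_loss(features, labels))
print(inter_class_loss(features, labels, margin=5.0))
```

In a training loop the two terms would typically be weighted and added to the main objective; the weights are left out here because the abstract does not state them.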
-
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks
Authors:
David Junhao Zhang,
Mutian Xu,
Chuhui Xue,
Wenqing Zhang,
Xiaoguang Han,
Song Bai,
Mike Zheng Shou
Abstract:
Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. To address this, we start by uncovering that diffusion models' cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements in baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.
Submitted 13 August, 2023;
originally announced August 2023.
-
Too Large; Data Reduction for Vision-Language Pre-Training
Authors:
Alex Jinpeng Wang,
Kevin Qinghong Lin,
David Junhao Zhang,
Stan Weixian Lei,
Mike Zheng Shou
Abstract:
This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original captions for selected samples, mitigating the text-image misalignment problem while maintaining uniqueness. As a result, TL;DR enables us to reduce the large dataset into a small set of high-quality data, which can serve as an alternative pre-training dataset. This algorithm significantly speeds up the time-consuming pretraining process. Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M (~24%) and the noisy YFCC15M from 15M to 2.5M (~16.7%). Extensive experiments with three popular VLP models over seven downstream tasks show that a VLP model trained on the compressed dataset provided by TL;DR can achieve similar or even better results compared with training on the full-scale dataset. The code will be made available at https://github.com/showlab/datacentric.vlp.
Submitted 18 August, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
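The compression percentages quoted in the abstract above follow directly from the dataset sizes it reports. The short check below recomputes them; the helper name `retained_fraction` is hypothetical and the sizes are the figures stated in the abstract, in millions of samples.

```python
def retained_fraction(original_millions, compressed_millions):
    """Fraction of the original dataset kept after compression."""
    return compressed_millions / original_millions


# Sizes in millions of image-text pairs, as quoted in the abstract.
cc3m = retained_fraction(2.82, 0.67)
yfcc15m = retained_fraction(15.0, 2.5)

print(f"CC3M retained: {cc3m:.1%}")        # prints "CC3M retained: 23.8%"
print(f"YFCC15M retained: {yfcc15m:.1%}")  # prints "YFCC15M retained: 16.7%"
```

Both values agree with the rounded figures in the abstract (about 24% and about 16.7%).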
-
Making Vision Transformers Efficient from A Token Sparsification View
Authors:
Shuning Chang,
Pichao Wang,
Ming Lin,
Fan Wang,
David Junhao Zhang,
Rong Jin,
Mike Zheng Shou
Abstract:
The quadratic computational complexity with respect to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with a slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, which previous token sparsification methods cannot do. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone. Code is available at http://github.com/changsn/STViT-R
Submitted 30 March, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
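The abstract above says semantic tokens are initialized by pooling image tokens in space. A minimal sketch of that one step follows, assuming plain average pooling over equal-sized blocks of a square token grid; the function name, the divisibility restriction, and the toy 8 x 8 grid are assumptions, and the attention-based recovery step is not shown.

```python
def init_semantic_tokens(grid, out_size):
    """Average-pool an H x W grid of token vectors into out_size * out_size tokens.

    `grid[r][c]` is the feature vector of the image token at row r, column c.
    This sketch requires H and W to be divisible by out_size.
    """
    height, width = len(grid), len(grid[0])
    if height % out_size or width % out_size:
        raise ValueError("grid size must be divisible by out_size in this sketch")
    block_h, block_w = height // out_size, width // out_size
    dim = len(grid[0][0])

    tokens = []
    for i in range(out_size):
        for j in range(out_size):
            # Collect every image token that falls inside pooling block (i, j).
            block = [
                grid[r][c]
                for r in range(i * block_h, (i + 1) * block_h)
                for c in range(j * block_w, (j + 1) * block_w)
            ]
            tokens.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
    return tokens


# Toy example: an 8 x 8 grid whose token at (r, c) is the vector [r, c].
grid = [[[float(r), float(c)] for c in range(8)] for r in range(8)]
semantic_tokens = init_semantic_tokens(grid, 4)
print(len(semantic_tokens))  # prints 16
```

With `out_size=4` the grid collapses to 16 tokens, matching the token count quoted in the abstract; each token is the mean of a 2 x 2 block in this toy setting.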
-
Label-Efficient Online Continual Object Detection in Streaming Video
Authors:
Jay Zhangjie Wu,
David Junhao Zhang,
Wynne Hsu,
Mengmi Zhang,
Mike Zheng Shou
Abstract:
Humans can watch a continuous video stream and effortlessly perform continual acquisition and transfer of new knowledge with minimal supervision while retaining previously learnt experiences. In contrast, existing continual learning (CL) methods require fully annotated labels to effectively learn from individual frames in a video stream. Here, we examine a more realistic and challenging problem: Label-Efficient Online Continual Object Detection (LEOCOD) in streaming video. We propose a plug-and-play module, Efficient-CLS, that can be easily inserted into and improve existing continual learners for object detection in video streams with reduced data annotation costs and model retraining time. We show that our method has achieved significant improvement with minimal forgetting across all supervision levels on two challenging CL benchmarks for streaming real-world videos. Remarkably, with only 25% annotated video frames, our method still outperforms the base CL learners, which are trained with 100% annotations on all video frames. The data and source code will be publicly available at https://github.com/showlab/Efficient-CLS.
Submitted 23 August, 2023; v1 submitted 1 June, 2022;
originally announced June 2022.
-
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes
Authors:
Jia-Wei Liu,
Yan-Pei Cao,
Weijia Mao,
Wenqiao Zhang,
David Junhao Zhang,
Jussi Keppo,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented fidelity for novel view synthesis in dynamic scenes, existing methods based on Neural Radiance Fields (NeRF) suffer from slow convergence (i.e., model training time measured in days). In this paper, we present DeVRF, a novel representation to accelerate learning dynamic radiance fields. The core of DeVRF is to model both the 3D canonical space and 4D deformation field of a dynamic, non-rigid scene with explicit and discrete voxel-based representations. However, it is quite challenging to train such a representation, which has a large number of model parameters, often resulting in overfitting issues. To overcome this challenge, we devise a novel static-to-dynamic learning paradigm together with a new data capture setup that is convenient to deploy in practice. This paradigm unlocks efficient learning of deformable radiance fields by utilizing the 3D volumetric canonical space learnt from multi-view static images to ease the learning of the 4D voxel deformation field with only few-view dynamic sequences. To further improve the efficiency of DeVRF and the quality of its synthesized novel views, we conduct thorough explorations and identify a set of strategies. We evaluate DeVRF on both synthetic and real-world dynamic scenes with different types of deformation. Experiments demonstrate that DeVRF achieves two orders of magnitude speedup (100x faster) with on-par high-fidelity results compared to the previous state-of-the-art approaches. The code and dataset will be released at https://github.com/showlab/DeVRF.
Submitted 4 June, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
Authors:
Mingfei Han,
David Junhao Zhang,
Yali Wang,
Rui Yan,
Lina Yao,
Xiaojun Chang,
Yu Qiao
Abstract:
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition. Different group activities often show diversified interactions between actors in the video. Hence, it is often difficult to model complex group activities from a single view of spatial-temporal actor evolution. To tackle this problem, we propose a distinct Dual-path Actor Interaction (Dual-AI) framework, which flexibly arranges spatial and temporal transformers in two complementary orders, enhancing actor relations by integrating merits from different spatiotemporal paths. Moreover, we introduce a novel Multi-scale Actor Contrastive Loss (MAC-Loss) between two interactive paths of Dual-AI. Via self-supervised actor consistency at both frame and video levels, MAC-Loss can effectively distinguish individual actor representations to reduce action confusion among different actors. Consequently, our Dual-AI can boost group activity recognition by fusing such discriminative features of different actors. To evaluate the proposed approach, we conduct extensive experiments on widely used benchmarks, including the Volleyball, Collective Activity, and NBA datasets. The proposed Dual-AI achieves state-of-the-art performance on all these datasets. It is worth noting that the proposed Dual-AI with 50% training data outperforms a number of recent approaches with 100% training data. This confirms the generalization power of Dual-AI for group activity recognition, even under the challenging scenarios of limited supervision.
Submitted 6 April, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
Authors:
David Junhao Zhang,
Kunchang Li,
Yali Wang,
Yunpeng Chen,
Shashwat Chandra,
Yu Qiao,
Luoqi Liu,
Mike Zheng Shou
Abstract:
Recently, MLP-Like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-Like architecture in the video domain has not been explored, due to complex spatial-temporal modeling with a large computation burden. To fill this gap, we present an efficient self-attention free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, i.e., MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s can effectively capture core semantics in each frame, by progressive token interaction along both height and width dimensions. Alternatively, MorphFC_t can adaptively learn long-term dependency over frames, by temporal token aggregation on each spatial location. With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance. Finally, we evaluate our MorphMLP on a number of popular video benchmarks. Compared with recent state-of-the-art models, MorphMLP significantly reduces computation while achieving better accuracy, e.g., MorphMLP-S only uses 50% GFLOPs of VideoSwin-T but achieves 0.9% top-1 improvement on Kinetics400, under ImageNet1K pretraining. MorphMLP-B only uses 43% GFLOPs of MViT-B but achieves 2.4% top-1 improvement on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, our method adapted to the image domain outperforms previous SOTA MLP-Like architectures. Code is available at https://github.com/MTLab/MorphMLP.
Submitted 23 August, 2022; v1 submitted 24 November, 2021;
originally announced November 2021.
-
Two-fluid discrete Boltzmann model for compressible flows: based on Ellipsoidal Statistical Bhatnagar-Gross-Krook
Authors:
D. J. Zhang,
A. G. Xu,
Y. D. Zhang,
Y. J. Li
Abstract:
A two-fluid Discrete Boltzmann Model (DBM) for compressible flows based on Ellipsoidal Statistical Bhatnagar-Gross-Krook (ES-BGK) is presented. The model has a flexible Prandtl number or specific heat ratio. Mathematically, the model is composed of two coupled Discrete Boltzmann Equations (DBE). Each DBE describes one component of the fluid. Physically, the model is equivalent to a macroscopic fluid model based on Navier-Stokes (NS) equations, supplemented by a coarse-grained model for thermodynamic non-equilibrium behaviors. To obtain a flexible Prandtl number, a coefficient is introduced in the ellipsoidal statistical distribution function to control the viscosity. To obtain a flexible specific heat ratio, a parameter is introduced in the energy kinetic moments to control the extra degree of freedom. For a binary mixture, the correspondence between the macroscopic fluid model and the DBM may be several-to-one. Five typical benchmark tests are used to verify and validate the model. Some interesting non-equilibrium results, which are not available in the NS model or the single-fluid DBM, are presented.
Submitted 14 December, 2020; v1 submitted 20 June, 2020;
originally announced June 2020.