-
OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution
Authors:
Zidong Cao,
Hao Ai,
Yan-Pei Cao,
Ying Shan,
Xiaohu Qie,
Lin Wang
Abstract:
Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) can offer viewers the chance to freely choose the view directions in immersive environments such as virtual reality. The Möbius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it at the image level often results in blurring and aliasing. In this paper, we propose a novel deep learning-based approach, called OmniZoomer, to incorporate the Möbius transformation into the network for movement and zoom on ODIs. By learning various transformed feature maps under different conditions, the network is enhanced to handle the increasing edge curvatures, which alleviates the blurring. Moreover, to address the aliasing problem, we propose two key components. Firstly, to compensate for the lack of pixels for describing curves, we enhance the feature maps in the high-resolution (HR) space and calculate the transformed index map with a spatial index generation module. Secondly, considering that ODIs are inherently represented in the spherical space, we propose a spherical resampling module that combines the index map and HR feature maps to transform the feature maps for better spherical correlation. The transformed feature maps are decoded to output a zoomed ODI. Experiments show that our method can produce HR and high-quality ODIs with the flexibility to move and zoom in to the object of interest. The project page is available at http://vlislab22.github.io/OmniZoomer/.
Submitted 18 August, 2023; v1 submitted 15 August, 2023;
originally announced August 2023.
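As a rough illustration of the Möbius transformation mentioned in the OmniZoomer abstract, the sketch below applies the classical map f(w) = (aw + b) / (cw + d) to a single point on the unit sphere via stereographic projection. It is not the paper's network or its index generation and resampling modules; the function names and the north-pole projection convention are assumptions made for this example.

```python
import cmath
import math


def sphere_to_plane(lon, lat):
    """Stereographic projection from the north pole (undefined at lat = pi/2)."""
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return complex(x, y) / (1.0 - z)


def plane_to_sphere(w):
    """Inverse stereographic projection back to (longitude, latitude) in radians."""
    r2 = abs(w) ** 2
    z = (r2 - 1.0) / (r2 + 1.0)
    lon = cmath.phase(w)
    lat = math.asin(max(-1.0, min(1.0, z)))
    return lon, lat


def mobius_on_sphere(lon, lat, a, b, c, d):
    """Apply f(w) = (a*w + b) / (c*w + d) to one spherical point.

    Hypothetical helper: the parameters a, b, c, d may be real or complex
    and must satisfy a*d - b*c != 0. Points that map to infinity
    (c*w + d == 0) are not handled in this sketch.
    """
    if abs(a * d - b * c) < 1e-12:
        raise ValueError("degenerate Möbius transformation: a*d - b*c must be non-zero")
    w = sphere_to_plane(lon, lat)
    return plane_to_sphere((a * w + b) / (c * w + d))


if __name__ == "__main__":
    # Pure scaling (a = 2, b = c = 0, d = 1) acts as a zoom about the poles.
    print(mobius_on_sphere(0.5, 0.2, 2, 0, 0, 1))
```

Under this convention a real scale factor a > 1 pushes points toward the north pole while leaving longitude unchanged, which is the zoom behaviour; applying such a map to every pixel index of an equirectangular image is the image-level operation that the abstract says tends to blur and alias.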
-
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
Authors:
Jia-Wei Liu,
Yan-Pei Cao,
Tianyuan Yang,
Eric Zhongcong Xu,
Jussi Keppo,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for dynamic human-object-scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints. The first challenge in this task is the complex object motions in human-object interactions, which we tackle by introducing the new object bones into the conventional human skeleton hierarchy to effectively estimate large object deformations in our dynamic human-object model. The second challenge is that humans interact with different objects at different times, for which we introduce two new learnable object state embeddings that can be used as conditions for learning our human-object representation and scene representation, respectively. Extensive experiments show that HOSNeRF significantly outperforms SOTA approaches on two challenging datasets by a large margin of 40% ~ 50% in terms of LPIPS. The code, data, and compelling examples of 360° free-viewpoint renderings from single videos will be released in https://showlab.github.io/HOSNeRF.
Submitted 24 April, 2023;
originally announced April 2023.
-
MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
Authors:
Mingdeng Cao,
Xintao Wang,
Zhongang Qi,
Ying Shan,
Xiaohu Qie,
Yinqiang Zheng
Abstract:
Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously. Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency. To further alleviate the query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing.
Submitted 17 April, 2023;
originally announced April 2023.
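The mutual self-attention idea in the MasaCtrl abstract can be sketched as ordinary scaled dot-product attention in which the queries come from the image being generated while the keys and values come from the source image. The code below is a minimal pure-Python sketch under that reading; it omits the diffusion model, the mask guidance, and any layer or timestep schedule, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length float vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def mutual_self_attention(target_queries, source_keys, source_values):
    """Queries from the target branch attend to keys/values from the source branch.

    Assumption for illustration: plain self-attention would instead pass the
    target's own keys and values here.
    """
    return attention(target_queries, source_keys, source_values)


if __name__ == "__main__":
    target_q = [[10.0, 0.0]]
    source_k = [[10.0, 0.0], [0.0, 10.0]]
    source_v = [[1.0, 0.0], [0.0, 1.0]]
    print(mutual_self_attention(target_q, source_k, source_v))
```

Because each output row is a convex combination of source values, the target can only assemble its content from what the source image provides, which is one way to read the consistency claim in the abstract.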
-
VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis
Authors:
Yuan-Chen Guo,
Yan-Pei Cao,
Chen Wang,
Yu He,
Ying Shan,
Xiaohu Qie,
Song-Hai Zhang
Abstract:
With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful in expressing scene geometry but inevitably suffers from high rendering costs and can hardly be involved in further processes like editing, posing significant difficulties in combination with the existing graphics pipeline. In this paper, we present a hybrid volume-mesh representation, VMesh, which depicts an object with a textured mesh along with an auxiliary sparse volume. VMesh retains the advantages of mesh-based assets, such as efficient rendering, compact storage, and easy editing, while also incorporating the ability to represent subtle geometric structures provided by the volumetric counterpart. VMesh can be obtained from multi-view images of an object and renders at 2K 60FPS on common consumer devices with high fidelity, unleashing new opportunities for real-time immersive applications.
Submitted 28 March, 2023;
originally announced March 2023.
-
Accelerating Vision-Language Pretraining with Free Language Modeling
Authors:
Teng Wang,
Yixiao Ge,
Feng Zheng,
Ran Cheng,
Ying Shan,
Xiaohu Qie,
Ping Luo
Abstract:
State-of-the-art vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM), that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100% prediction rate with arbitrary corruption rates. FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show FLM could achieve an impressive 2.5x pretraining time reduction in comparison to the MLM-based methods, while keeping competitive performance on both vision-language understanding and generation tasks. Code will be public at https://github.com/TencentARC/FLM.
Submitted 24 March, 2023;
originally announced March 2023.
-
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
Authors:
Chong Mou,
Xintao Wang,
Liangbin Xie,
Yanze Wu,
Jian Zhang,
Zhongang Qi,
Ying Shan,
Xiaohu Qie
Abstract:
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate control (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.
Submitted 20 March, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
RILS: Masked Visual Reconstruction in Language Semantic Space
Authors:
Shusheng Yang,
Yixiao Ge,
Kun Yi,
Dian Li,
Ying Shan,
Xiaohu Qie,
Xinggang Wang
Abstract:
Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes to transform the vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets. The vision models can therefore capture useful components with structured information by predicting proper semantic of masked tokens. Better visual representations could, in turn, improve the text encoder via the image-text alignment objective, which is essential for the effective MIM target transformation. Extensive experimental results demonstrate that our method not only enjoys the best of previous MIM and CLIP but also achieves further improvements on various tasks due to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially for low-shot regimes. Code will be made available at https://github.com/hustvl/RILS.
Submitted 28 February, 2023; v1 submitted 17 January, 2023;
originally announced January 2023.
-
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
Authors:
Jiale Xu,
Xintao Wang,
Weihao Cheng,
Yan-Pei Cao,
Ying Shan,
Xiaohu Qie,
Shenghua Gao
Abstract:
Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training and random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage as a 3D shape prior. We then use it as the initialization of a neural radiance field and optimize it with the full prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods.
Submitted 3 April, 2023; v1 submitted 28 December, 2022;
originally announced December 2022.
-
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Authors:
Jay Zhangjie Wu,
Yixiao Ge,
Xintao Wang,
Weixian Lei,
Yuchao Gu,
Yufei Shi,
Wynne Hsu,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
Submitted 17 March, 2023; v1 submitted 22 December, 2022;
originally announced December 2022.
-
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
Authors:
Yuchao Gu,
Xintao Wang,
Yixiao Ge,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper, we surprisingly find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. Instead, learning to compress semantic features within VQ tokenizers significantly improves generative transformers' ability to capture textures and structures. We thus highlight two competing objectives of VQ tokenizers for image synthesis: semantic compression and details preservation. Different from previous work that only pursues better details preservation, we propose Semantic-Quantized GAN (SeQ-GAN) with two learning phases to balance the two objectives. In the first phase, we propose a semantic-enhanced perceptual loss for better semantic compression. In the second phase, we fix the encoder and codebook, but enhance and finetune the decoder to achieve better details preservation. The proposed SeQ-GAN greatly improves VQ-based generative models and surpasses the GAN and Diffusion Models on both unconditional and conditional image generation. Our SeQ-GAN (364M) achieves Frechet Inception Distance (FID) of 6.25 and Inception Score (IS) of 140.9 on 256x256 ImageNet generation, a remarkable improvement over VIT-VQGAN (714M), which obtains 11.2 FID and 97.2 IS.
Submitted 9 March, 2023; v1 submitted 6 December, 2022;
originally announced December 2022.
-
One for All, All for One: Learning and Transferring User Embeddings for Cross-Domain Recommendation
Authors:
Chenglin Li,
Yuanzhen Xie,
Chenyun Yu,
Bo Hu,
Zang Li,
Guoqiang Shu,
Xiaohu Qie,
Di Niu
Abstract:
Cross-domain recommendation is an important method to improve recommender system performance, especially when observations in target domains are sparse. However, most existing techniques focus on single-target or dual-target cross-domain recommendation (CDR) and are hard to be generalized to CDR with multiple target domains. In addition, the negative transfer problem is prevalent in CDR, where the recommendation performance in a target domain may not always be enhanced by knowledge learned from a source domain, especially when the source domain has sparse data. In this study, we propose CAT-ART, a multi-target CDR method that learns to improve recommendations in all participating domains through representation learning and embedding transfer. Our method consists of two parts: a self-supervised Contrastive AuToencoder (CAT) framework to generate global user embeddings based on information from all participating domains, and an Attention-based Representation Transfer (ART) framework which transfers domain-specific user embeddings from other domains to assist with target domain recommendation. CAT-ART boosts the recommendation performance in any target domain through the combined use of the learned global user representation and knowledge transferred from other domains, in addition to the original user embedding in the target domain. We conducted extensive experiments on a collected real-world CDR dataset spanning 5 domains and involving a million users. Experimental results demonstrate the superiority of the proposed method over a range of prior arts. We further conducted ablation studies to verify the effectiveness of the proposed components. Our collected dataset will be open-sourced to facilitate future research in the field of multi-domain recommender systems and user modeling.
Submitted 21 November, 2022;
originally announced November 2022.
-
Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems
Authors:
Guanghu Yuan,
Fajie Yuan,
Yudong Li,
Beibei Kong,
Shujie Li,
Lei Chen,
Min Yang,
Chenyun Yu,
Bo Hu,
Zang Li,
Yu Xu,
Xiaohu Qie
Abstract:
Existing benchmark datasets for recommender systems (RS) either are created at a small scale or involve very limited forms of user feedback. RS models evaluated on such datasets often lack practical values for large-scale real-world applications. In this paper, we describe Tenrec, a novel and publicly available data collection for RS that records various user feedback from four different recommendation scenarios. To be specific, Tenrec has the following five characteristics: (1) it is large-scale, containing around 5 million users and 140 million interactions; (2) it has not only positive user feedback, but also true negative feedback (vs. one-class recommendation); (3) it contains overlapped users and items across four different scenarios; (4) it contains various types of user positive feedback, in forms of clicks, likes, shares, and follows, etc; (5) it contains additional features beyond the user IDs and item IDs. We verify Tenrec on ten diverse recommendation tasks by running several classical baseline models per task. Tenrec has the potential to become a useful benchmark dataset for a majority of popular recommendation tasks.
Submitted 4 June, 2023; v1 submitted 13 October, 2022;
originally announced October 2022.
-
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes
Authors:
Jia-Wei Liu,
Yan-Pei Cao,
Weijia Mao,
Wenqiao Zhang,
David Junhao Zhang,
Jussi Keppo,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented fidelity for novel view synthesis in dynamic scenes, existing methods based on Neural Radiance Fields (NeRF) suffer from slow convergence (i.e., model training time measured in days). In this paper, we present DeVRF, a novel representation to accelerate learning dynamic radiance fields. The core of DeVRF is to model both the 3D canonical space and 4D deformation field of a dynamic, non-rigid scene with explicit and discrete voxel-based representations. However, it is quite challenging to train such a representation which has a large number of model parameters, often resulting in overfitting issues. To overcome this challenge, we devise a novel static-to-dynamic learning paradigm together with a new data capture setup that is convenient to deploy in practice. This paradigm unlocks efficient learning of deformable radiance fields via utilizing the 3D volumetric canonical space learnt from multi-view static images to ease the learning of 4D voxel deformation field with only few-view dynamic sequences. To further improve the efficiency of our DeVRF and its synthesized novel view's quality, we conduct thorough explorations and identify a set of strategies. We evaluate DeVRF on both synthetic and real-world dynamic scenes with different types of deformation. Experiments demonstrate that DeVRF achieves two orders of magnitude speedup (100x faster) with on-par high-fidelity results compared to the previous state-of-the-art approaches. The code and dataset will be released in https://github.com/showlab/DeVRF.
Submitted 4 June, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Masked Image Modeling with Denoising Contrast
Authors:
Kun Yi,
Yixiao Ge,
Xiaotong Li,
Shusheng Yang,
Dian Li,
Jianping Wu,
Ying Shan,
Xiaohu Qie
Abstract:
Since the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there is no significant difference in essence, that is, how to design proper pretext tasks for vision dictionary look-up. MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch-level visual context capturing of the network via denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a pure MIM method, ConMIM, to produce simple intra-image inter-patch contrastive constraints as the sole learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including image perturbations and model progress rates, to improve the network pre-training. ConMIM-pretrained models with various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks, e.g., on ImageNet-1K classification, we achieve 83.9% top-1 accuracy with ViT-Small and 85.3% with ViT-Base without extra data for pre-training.
Submitted 29 January, 2023; v1 submitted 19 May, 2022;
originally announced May 2022.
-
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Authors:
Yuying Ge,
Yixiao Ge,
Xihui Liu,
Alex Jinpeng Wang,
Jianping Wu,
Ying Shan,
Xiaohu Qie,
Ping Luo
Abstract:
Dominant pre-training work for video-text retrieval mainly adopts the "dual-encoder" architecture to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling, which promotes the learning of local visual context, motivates a possible solution to this limitation. In this work, we investigate, for the first time, masked visual modeling in video-text pre-training with the "dual-encoder" architecture. We perform Masked visual modeling with Injected LanguagE Semantics (MILES) by employing an extra snapshot video encoder as an evolving "tokenizer" to produce reconstruction targets for masked video patch prediction. Given the corrupted video, the video encoder is trained to recover text-aligned features of the masked patches via reasoning with the visible regions along the spatial and temporal dimensions, which enhances the discriminativeness of local visual features and the fine-grained cross-modality alignment. Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets with both zero-shot and fine-tune evaluation protocols. Our approach also surpasses the baseline models significantly on zero-shot action recognition, which can be cast as video-to-text retrieval.
Submitted 26 April, 2022;
originally announced April 2022.
-
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
Authors:
Ye Liu,
Siyuan Li,
Yang Wu,
Chang Wen Chen,
Ying Shan,
Xiaohu Qie
Abstract:
Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we pre…
▽ More
Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization while can also be easily degenerated for solving individual problems. As far as we are aware, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task, and tackles moment retrieval as a keypoint detection problem using a novel query generator and query decoder. Extensive comparisons with existing methods and ablation studies on QVHighlights, Charades-STA, YouTube Highlights, and TVSum datasets demonstrate the effectiveness, superiority, and flexibility of the proposed method under various settings. Source code and pre-trained models are available at https://github.com/TencentARC/UMT.
Submitted 27 March, 2022; v1 submitted 23 March, 2022;
originally announced March 2022.
-
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Authors:
Guanyu Cai,
Yixiao Ge,
Binjie Zhang,
Alex Jinpeng Wang,
Rui Yan,
Xudong Lin,
Ying Shan,
Lianghua He,
Xiaohu Qie,
Jianping Wu,
Mike Zheng Shou
Abstract:
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research has become extremely expensive, requiring massive data and long training times, which prevents further exploration. In this work, we revitalize region features of sparsely sampled video clips to significantly reduce both spatial and temporal visual redundancy, democratizing VLP research while at the same time achieving state-of-the-art results. Specifically, to fully explore the potential of region features, we introduce a novel bidirectional region-word alignment regularization that properly optimizes the fine-grained relations between regions and certain words in sentences, eliminating the domain/modality disconnections between pre-extracted region features and text. Extensive results of downstream video-language retrieval tasks on four datasets demonstrate the superiority of our method in both effectiveness and efficiency, e.g., our method achieves competitive results with 80% less data and 85% less pre-training time compared to the most efficient VLP method so far \cite{lei2021less}. The code will be available at https://github.com/showlab/DemoVLP.
Submitted 7 February, 2023; v1 submitted 15 March, 2022;
originally announced March 2022.
-
All in One: Exploring Unified Video-Language Pre-training
Authors:
Alex Jinpeng Wang,
Yixiao Ge,
Rui Yan,
Yuying Ge,
Xudong Lin,
Guanyu Cai,
Jianping Wu,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
Mainstream Video-Language Pre-training models \cite{actbert,clipbert,violet} consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data turns out to be a key barrier hindering the design of a modality-agnostic Transformer. To overcome the challenge, we introduce a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner. The careful design enables the representation learning of both video-text multimodal inputs and unimodal inputs using a unified backbone model. Our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video-question answering, multiple choice and visual commonsense reasoning. State-of-the-art performance with the minimal model FLOPs on nine datasets demonstrates the superiority of our method compared to the competitive counterparts. The code and pretrained model have been released at https://github.com/showlab/all-in-one.
Submitted 14 March, 2022;
originally announced March 2022.
-
Bridging Video-text Retrieval with Multiple Choice Questions
Authors:
Yuying Ge,
Yixiao Ge,
Xihui Liu,
Dian Li,
Ying Shan,
Xiaohu Qie,
Ping Luo
Abstract:
Pre-training a model to learn transferable video-text representation for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to let videos interact with texts, but this results in low efficiency since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed by the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, rendering an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets with different experimental setups (i.e., zero-shot and fine-tune), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Submitted 17 March, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.
-
Object-aware Video-language Pre-training for Retrieval
Authors:
Alex Jinpeng Wang,
Yixiao Ge,
Guanyu Cai,
Rui Yan,
Xudong Lin,
Ying Shan,
Xiaohu Qie,
Mike Zheng Shou
Abstract:
Recently, by introducing large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet, existing video-language transformer models do not explicitly model fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends the video-language transformer to incorporate object representations. The key idea is to leverage the bounding boxes and object tags to guide the training process. We evaluate our model on three standard sub-tasks of video-text matching on four widely used benchmarks. We also provide deep analysis and detailed ablation of the proposed method. We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture. The code will be released at https://github.com/FingerRec/OA-Transformer.
Submitted 18 May, 2022; v1 submitted 1 December, 2021;
originally announced December 2021.
-
Graph-Based Equilibrium Metrics for Dynamic Supply-Demand Systems with Applications to Ride-sourcing Platforms
Authors:
Fan Zhou,
Shikai Luo,
Xiaohu Qie,
Jieping Ye,
Hongtu Zhu
Abstract:
How to dynamically measure the local-to-global spatio-temporal coherence between demand and supply networks is a fundamental task for ride-sourcing platforms, such as DiDi. Such coherence measurement is critically important for the quantification of market efficiency and the comparison of different platform policies, such as dispatching. The aim of this paper is to introduce a graph-based equilibrium metric (GEM) to quantify the distance between demand and supply networks based on a weighted graph structure. We formulate GEM as the optimal objective value of an unbalanced transport problem, which can be efficiently solved by optimizing an equivalent linear program. We examine how GEM can help solve three operational tasks of ride-sourcing platforms. First, GEM achieves up to a 70.6% reduction in root-mean-square error over the second-best distance measurement in terms of prediction accuracy. Second, the use of GEM for designing the order dispatching policy increases the answer rate and drivers' revenue by more than 1%, representing a huge improvement in absolute numbers. Third, GEM serves as an endpoint for comparing different platform policies in A/B tests.
Submitted 23 March, 2021; v1 submitted 10 February, 2021;
originally announced February 2021.
-
Spatio-Temporal Hierarchical Adaptive Dispatching for Ridesharing Systems
Authors:
Chang Liu,
Jiahui Sun,
Haiming Jin,
Meng Ai,
Qun Li,
Cheng Zhang,
Kehua Sheng,
Guobin Wu,
Xiaohu Qie,
Xinbing Wang
Abstract:
Nowadays, ridesharing has become one of the most popular services offered by online ride-hailing platforms (e.g., Uber and Didi Chuxing). Existing ridesharing platforms adopt the strategy that dispatches orders over the entire city at a uniform time interval. However, the uneven spatio-temporal order distributions in real-world ridesharing systems indicate that such an approach is suboptimal in practice. Thus, in this paper, we exploit adaptive dispatching intervals to boost the platform's profit under a guarantee of the maximum passenger waiting time. Specifically, we propose a hierarchical approach, which generates clusters of geographical areas suitable to share the same dispatching intervals, and then makes online decisions of selecting the appropriate time instances for order dispatch within each spatial cluster. Technically, we prove the impossibility of designing constant-competitive-ratio algorithms for the online adaptive interval problem, and propose online algorithms under partial or even zero future order knowledge that significantly improve the platform's profit over existing approaches. We conduct extensive experiments with a large-scale ridesharing order dataset, which contains all of the over 3.5 million ridesharing orders in Beijing, China, received by Didi Chuxing from October 1st to October 31st, 2018. The experimental results demonstrate that our proposed algorithms outperform existing approaches.
Submitted 4 September, 2020;
originally announced September 2020.
-
Weakly Supervised Learning Meets Ride-Sharing User Experience Enhancement
Authors:
Lan-Zhe Guo,
Feng Kuang,
Zhang-Xun Liu,
Yu-Feng Li,
Nan Ma,
Xiao-Hu Qie
Abstract:
Weakly supervised learning aims at coping with scarce labeled data. Previous weakly supervised studies typically assume that there is only one kind of weak supervision in data. In many applications, however, raw data usually contains more than one kind of weak supervision at the same time. For example, in user experience enhancement from Didi, one of the largest online ride-sharing platforms, the ride comment data contains severe label noise (due to the subjective factors of passengers) and severe label distribution bias (due to the sampling bias). We call such a problem "compound weakly supervised learning". In this paper, we propose the CWSL method to address this problem based on Didi ride-sharing comment data. Specifically, an instance reweighting strategy is employed to cope with severe label noise in comment data, where the weights for harmful noisy instances are small. Robust criteria like AUC rather than accuracy and the validation performance are optimized for the correction of biased data labels. Alternating optimization and stochastic gradient methods accelerate the optimization on large-scale data. Experiments on Didi ride-sharing comment data clearly validate the effectiveness. We hope this work may shed some light on applying weakly supervised learning to complex real situations.
Submitted 19 January, 2020;
originally announced January 2020.
-
The Large High Altitude Air Shower Observatory (LHAASO) Science Book (2021 Edition)
Authors:
Zhen Cao,
D. della Volpe,
Siming Liu,
Editors,
:,
Xiaojun Bi,
Yang Chen,
B. D'Ettorre Piazzoli,
Li Feng,
Huanyu Jia,
Zhuo Li,
Xinhua Ma,
Xiangyu Wang,
Xiao Zhang,
External Referees,
:,
Xiushu Qie,
Hongbo Hu,
Internal Referees,
:,
Alejandro Sáiz,
Ruizhi Yang,
Contributors,
:,
Andrea Addazi
, et al. (69 additional authors not shown)
Abstract:
Since the science white paper of the Large High Altitude Air Shower Observatory (LHAASO) was published on arXiv in 2019 [e-Print: 1905.02773 (astro-ph.HE)], LHAASO has completed the transition from a project to an operational gamma-ray astronomical observatory. LHAASO is a new generation multi-component facility located in Daocheng, Sichuan province of China, at an altitude of 4410 meters. It aims at measuring with unprecedented sensitivity the spectrum, composition, and anisotropy of cosmic rays (CR) in the energy range between $10^{12}$ and $10^{18}$ eV, and acting simultaneously as a wide aperture (one steradian), continuously operating gamma-ray telescope in the energy range between $10^{11}$ and $10^{15}$ eV with the designed sensitivity of 1.3% of the Crab Unit (CU) above 100 TeV. LHAASO's capability of simultaneously measuring different shower components (electrons, muons, and Cherenkov/fluorescence light) will allow it to investigate the origin, acceleration, and propagation of CR through measurement of the energy spectrum, elemental composition, and anisotropy with unprecedented resolution. The remarkable sensitivity of LHAASO will play a key role in CR physics and gamma-ray astronomy for a general and comprehensive exploration of the high energy universe and will allow important studies of fundamental physics (such as indirect dark matter search, Lorentz invariance violation, quantum gravity) and solar and heliospheric physics. The LHAASO Collaboration organized an editorial working group and finished all editorial work of this science book, to summarize the instrumental features and outline the prospects of scientific research with the LHAASO experiment.
Submitted 18 February, 2022; v1 submitted 7 May, 2019;
originally announced May 2019.