Showing 1–24 of 24 results for author: Qie, X

  1. arXiv:2308.08114 [pdf, other]

    cs.CV cs.AI

    OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution

    Authors: Zidong Cao, Hao Ai, Yan-Pei Cao, Ying Shan, Xiaohu Qie, Lin Wang

    Abstract: Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) can offer viewers the chance to freely choose the view directions in immersive environments such as virtual reality. The Möbius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it to the image level often results in blurry effect and…

    Submitted 18 August, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV 2023

  2. arXiv:2304.12281 [pdf, other]

    cs.CV

    HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

    Authors: Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Eric Zhongcong Xu, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for dynamic human-object-scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints. The first challenge in this task is the complex object motions in…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: Project page: https://showlab.github.io/HOSNeRF

  3. arXiv:2304.08465 [pdf, other]

    cs.CV

    MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

    Authors: Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, Yinqiang Zheng

    Abstract: Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective compl…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: Project available at https://ljzycmd.github.io/projects/MasaCtrl

  4. arXiv:2303.16184 [pdf, other]

    cs.CV cs.GR

    VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis

    Authors: Yuan-Chen Guo, Yan-Pei Cao, Chen Wang, Yu He, Ying Shan, Xiaohu Qie, Song-Hai Zhang

    Abstract: With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful in expressing scene geometry but inevitably suffers from high rendering costs and can hardly be involved in further processes like editing, posing significant difficulties in combination with the…

    Submitted 28 March, 2023; originally announced March 2023.

    Comments: Project page: https://bennyguo.github.io/vmesh/

  5. arXiv:2303.14038 [pdf, other]

    cs.CV

    Accelerating Vision-Language Pretraining with Free Language Modeling

    Authors: Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo

    Abstract: The state of the arts in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens)…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: To appear in CVPR 2023

  6. arXiv:2302.08453 [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

    Authors: Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie

    Abstract: The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the c…

    Submitted 20 March, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: Tech Report. GitHub: https://github.com/TencentARC/T2I-Adapter

  7. arXiv:2301.06958 [pdf, other]

    cs.CV

    RILS: Masked Visual Reconstruction in Language Semantic Space

    Authors: Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, Xiaohu Qie, Xinggang Wang

    Abstract: Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sent…

    Submitted 28 February, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

  8. arXiv:2212.14704 [pdf, other]

    cs.CV

    Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

    Authors: Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, Shenghua Gao

    Abstract: Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training and random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explici…

    Submitted 3 April, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

    Comments: Accepted by CVPR 2023. Project page: https://bluestyle97.github.io/dream3d/

  9. arXiv:2212.11565 [pdf, other]

    cs.CV

    Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-a…

    Submitted 17 March, 2023; v1 submitted 22 December, 2022; originally announced December 2022.

    Comments: Preprint

  10. arXiv:2212.03185 [pdf, other]

    cs.CV

    Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

    Authors: Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper, we surprisingly find that improving the reconstruct…

    Submitted 9 March, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

  11. One for All, All for One: Learning and Transferring User Embeddings for Cross-Domain Recommendation

    Authors: Chenglin Li, Yuanzhen Xie, Chenyun Yu, Bo Hu, Zang Li, Guoqiang Shu, Xiaohu Qie, Di Niu

    Abstract: Cross-domain recommendation is an important method to improve recommender system performance, especially when observations in target domains are sparse. However, most existing techniques focus on single-target or dual-target cross-domain recommendation (CDR) and are hard to be generalized to CDR with multiple target domains. In addition, the negative transfer problem is prevalent in CDR, where the…

    Submitted 21 November, 2022; originally announced November 2022.

    Comments: 9 pages, accepted by WSDM 2023

  12. arXiv:2210.10629 [pdf, other]

    cs.IR

    Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems

    Authors: Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, Yu Xu, Xiaohu Qie

    Abstract: Existing benchmark datasets for recommender systems (RS) either are created at a small scale or involve very limited forms of user feedback. RS models evaluated on such datasets often lack practical values for large-scale real-world applications. In this paper, we describe Tenrec, a novel and publicly available data collection for RS that records various user feedback from four different recommend…

    Submitted 4 June, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

  13. arXiv:2205.15723 [pdf, other]

    cs.CV

    DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

    Authors: Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Modeling dynamic scenes is important for many applications such as virtual reality and telepresence. Despite achieving unprecedented fidelity for novel view synthesis in dynamic scenes, existing methods based on Neural Radiance Fields (NeRF) suffer from slow convergence (i.e., model training time measured in days). In this paper, we present DeVRF, a novel representation to accelerate learning dyna…

    Submitted 4 June, 2022; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: Project page: https://jia-wei-liu.github.io/DeVRF/

  14. arXiv:2205.09616 [pdf, other]

    cs.CV

    Masked Image Modeling with Denoising Contrast

    Authors: Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie

    Abstract: Since the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there is no significant difference in essence, that is, how to design proper pretext tasks for vision dictionary look-up. MIM recently dominates this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the patch…

    Submitted 29 January, 2023; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: Accepted by ICLR 2023. The code will be available at https://github.com/TencentARC/ConMIM

  15. arXiv:2204.12408 [pdf, other]

    cs.CV

    MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

    Authors: Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, Xiaohu Qie, Ping Luo

    Abstract: Dominant pre-training work for video-text retrieval mainly adopt the "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignore detailed local semantics. The recent success of image BERT pre-training with masked visual modeling that promotes the learning of local visual context, motivates a possible…

    Submitted 26 April, 2022; originally announced April 2022.

  16. arXiv:2203.12745 [pdf, other]

    cs.CV

    UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

    Authors: Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, Xiaohu Qie

    Abstract: Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we pre…

    Submitted 27 March, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

    Comments: Accepted to Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)

  17. arXiv:2203.07720 [pdf, other]

    cs.CV

    Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

    Authors: Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, Mike Zheng Shou

    Abstract: Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval. Despite the impressive results, VLP research becomes extremely expensive with the need for massive data and a long training time, preventing further explorations. In this work, we revital…

    Submitted 7 February, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

  18. arXiv:2203.07303 [pdf, other]

    cs.CV

    All in One: Exploring Unified Video-Language Pre-training

    Authors: Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Mainstream Video-Language Pre-training models \cite{actbert,clipbert,violet} consist of three parts, a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance via utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: 18 pages. 11 figures. Code: https://github.com/showlab/all-in-one

  19. arXiv:2201.04850 [pdf, other]

    cs.CV

    Bridging Video-text Retrieval with Multiple Choice Questions

    Authors: Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, Xiaohu Qie, Ping Luo

    Abstract: Pre-training a model to learn transferable video-text representation for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact video with texts, but results in low efficiency since each text-video…

    Submitted 17 March, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: Accepted by CVPR 2022

  20. arXiv:2112.00656 [pdf, other]

    cs.CV cs.CL

    Object-aware Video-language Pre-training for Retrieval

    Authors: Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou

    Abstract: Recently, by introducing large-scale dataset and strong transformer network, video-language pre-training has shown great success especially for retrieval. Yet, existing video-language transformer models do not explicitly fine-grained semantic align. In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object represent…

    Submitted 18 May, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

    Comments: CVPR2022; Code: https://github.com/FingerRec/OA-Transformer

  21. arXiv:2102.05805 [pdf, other]

    math.OC stat.AP

    Graph-Based Equilibrium Metrics for Dynamic Supply-Demand Systems with Applications to Ride-sourcing Platforms

    Authors: Fan Zhou, Shikai Luo, Xiaohu Qie, Jieping Ye, Hongtu Zhu

    Abstract: How to dynamically measure the local-to-global spatio-temporal coherence between demand and supply networks is a fundamental task for ride-sourcing platforms, such as DiDi. Such coherence measurement is critically important for the quantification of the market efficiency and the comparison of different platform policies, such as dispatching. The aim of this paper is to introduce a graph-based equi…

    Submitted 23 March, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

    Comments: Accepted by Journal of the American Statistical Association

  22. arXiv:2009.02080 [pdf, other]

    eess.SP

    Spatio-Temporal Hierarchical Adaptive Dispatching for Ridesharing Systems

    Authors: Chang Liu, Jiahui Sun, Haiming Jin, Meng Ai, Qun Li, Cheng Zhang, Kehua Sheng, Guobin Wu, Xiaohu Qie, Xinbing Wang

    Abstract: Nowadays, ridesharing has become one of the most popular services offered by online ride-hailing platforms (e.g., Uber and Didi Chuxing). Existing ridesharing platforms adopt the strategy that dispatches orders over the entire city at a uniform time interval. However, the uneven spatio-temporal order distributions in real-world ridesharing systems indicate that such an approach is suboptimal in pr…

    Submitted 4 September, 2020; originally announced September 2020.

  23. arXiv:2001.09027 [pdf, other]

    cs.LG stat.ML

    Weakly Supervised Learning Meets Ride-Sharing User Experience Enhancement

    Authors: Lan-Zhe Guo, Feng Kuang, Zhang-Xun Liu, Yu-Feng Li, Nan Ma, Xiao-Hu Qie

    Abstract: Weakly supervised learning aims at coping with scarce labeled data. Previous weakly supervised studies typically assume that there is only one kind of weak supervision in data. In many applications, however, raw data usually contains more than one kind of weak supervision at the same time. For example, in user experience enhancement from Didi, one of the largest online ride-sharing platforms, the…

    Submitted 19 January, 2020; originally announced January 2020.

    Comments: AAAI 2020

  24. arXiv:1905.02773 [pdf]

    astro-ph.HE astro-ph.GA astro-ph.IM astro-ph.SR

    The Large High Altitude Air Shower Observatory (LHAASO) Science Book (2021 Edition)

    Authors: Zhen Cao, D. della Volpe, Siming Liu, Editors: Xiaojun Bi, Yang Chen, B. D'Ettorre Piazzoli, Li Feng, Huanyu Jia, Zhuo Li, Xinhua Ma, Xiangyu Wang, Xiao Zhang, External Referees: Xiushu Qie, Hongbo Hu, Internal Referees: Alejandro Sáiz, Ruizhi Yang, Contributors: Andrea Addazi, et al. (69 additional authors not shown)

    Abstract: Since the science white paper of the Large High Altitude Air Shower Observatory (LHAASO) published on arXiv in 2019 [e-Print: 1905.02773 (astro-ph.HE)], LHAASO has completed the transition from a project to an operational gamma-ray astronomical observatory. LHAASO is a new generation multi-component facility located in Daocheng, Sichuan province of China, at an altitude of 4410 meters. It aims at m…

    Submitted 18 February, 2022; v1 submitted 7 May, 2019; originally announced May 2019.

    Comments: This document is a collaborative effort, 185 pages, 110 figures

    Journal ref: Chinese Physics C, Vol. 46, No. 3 (2022) 035001-035007