Showing 1–50 of 120 results for author: Gong, B

  1. arXiv:2407.01606  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV stat.ML

    On Discrete Prompt Optimization for Diffusion Models

    Authors: Ruochen Wang, Ting Liu, Cho-Jui Hsieh, Boqing Gong

    Abstract: This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently finding a solution to this problem: (1) Enormous Domain Space: Setting the domain to the entire language space poses significant difficulty to the opt…

    Submitted 26 June, 2024; originally announced July 2024.

    Comments: ICML 2024. Code available at https://github.com/ruocwang/dpo-diffusion

    MSC Class: 68T01

    Journal ref: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

  2. arXiv:2406.16476  [pdf, other]

    cs.CV

    ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

    Authors: Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, Yinqiang Zheng

    Abstract: Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMast…

    Submitted 24 June, 2024; originally announced June 2024.

  3. arXiv:2406.02965  [pdf, other]

    cs.CV

    Understanding the Impact of Negative Prompts: When and How Do They Take Effect?

    Authors: Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Boqing Gong, Cho-Jui Hsieh

    Abstract: The concept of negative prompts, emerging from conditional generation models like Stable Diffusion, allows users to specify what to exclude from the generated images. Despite the widespread use of negative prompts, their intrinsic mechanisms remain largely unexplored. This paper presents the first comprehensive study to uncover how and when negative…

    Submitted 5 June, 2024; originally announced June 2024.

  4. arXiv:2406.01970  [pdf, other]

    cs.CV cs.AI

    The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise

    Authors: Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, Minhao Cheng

    Abstract: Diffusion models have achieved remarkable success in text-to-image generation tasks; however, the role of initial noise has been rarely explored. In this study, we identify specific regions within the initial noise image, termed trigger patches, that play a key role for object generation in the resulting images. Notably, these patches are "universal" and can be generalized across various positio…

    Submitted 4 June, 2024; originally announced June 2024.

  5. arXiv:2406.00448  [pdf, other]

    cs.CV cs.GR

    Bilateral Guided Radiance Field Processing

    Authors: Yuehao Wang, Chaoyi Wang, Bingchen Gong, Tianfan Xue

    Abstract: Neural Radiance Fields (NeRF) achieves unprecedented performance in synthesizing novel view synthesis, utilizing multi-view consistency. When capturing multiple inputs, image signal processing (ISP) in modern cameras will independently enhance them, including exposure adjustment, color correction, local tone mapping, etc. While these processings greatly improve image quality, they often break the…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: SIGGRAPH (ACM TOG), 2024. Project page: https://bilarfpro.github.io

  6. arXiv:2405.17835  [pdf, other]

    cs.CV

    Deform3DGS: Flexible Deformation for Fast Surgical Scene Reconstruction with Gaussian Splatting

    Authors: Shuojue Yang, Qian Li, Daiyun Shen, Bingchen Gong, Qi Dou, Yueming Jin

    Abstract: Tissue deformation poses a key challenge for accurate surgical scene reconstruction. Despite yielding high reconstruction quality, existing methods suffer from slow rendering speeds and long training times, limiting their intraoperative applicability. Motivated by recent progress in 3D Gaussian Splatting, an emerging technology in real-time 3D rendering, this work presents a novel fast reconstruct…

    Submitted 30 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Early accepted at MICCAI 2024, 10 pages, 2 figures

  7. arXiv:2405.16567  [pdf, other]

    cs.AI cs.CR

    Automatic Jailbreaking of the Text-to-Image Generative AI Systems

    Authors: Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang

    Abstract: Recent AI systems have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation based on large language models (LLMs). At the same time, there are diverse safety risks that can cause the generation of malicious contents by circumventing the alignment in LLMs, which are often referred to as jai…

    Submitted 28 May, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

    Comments: Under review

  8. arXiv:2405.12367  [pdf, other]

    eess.IV cs.CV

    Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning

    Authors: Zheyuan Zhang, Elif Keles, Gorkem Durak, Yavuz Taktak, Onkar Susladkar, Vandan Gorade, Debesh Jha, Asli C. Ormeci, Alpay Medetalibeyoglu, Lanhong Yao, Bin Wang, Ilkin Sevgi Isler, Linkai Peng, Hongyi Pan, Camila Lopes Vendrami, Amir Bourhani, Yury Velichko, Boqing Gong, Concetto Spampinato, Ayis Pyrros, Pallavi Tiwari, Derk C. F. Klatte, Megan Engels, Sanne Hoogenboom, Candice W. Bolan, et al. (13 additional authors not shown)

    Abstract: Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, largely due to a lack of publicly available datasets, benchmarking research efforts, and domain-specific deep learning methods. In this retrospective st…

    Submitted 25 May, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

    Comments: under review version

  9. arXiv:2402.13217  [pdf, other]

    cs.CV cs.AI

    VideoPrism: A Foundational Visual Encoder for Video Understanding

    Authors: Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong

    Abstract: We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic…

    Submitted 15 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024. v2: added retrieval results on MSRVTT (1K-A), more data analyses, and ablation studies

  10. arXiv:2401.06129  [pdf, other]

    cs.CV

    Distilling Vision-Language Models on Millions of Videos

    Authors: Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan

    Abstract: The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-i…

    Submitted 15 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: CVPR 2024. Project page: https://zhaoyue-zephyrus.github.io/video-instruction-tuning

  11. arXiv:2401.01952  [pdf, other]

    cs.CV cs.AI cs.CL

    Instruct-Imagen: Image Generation with Multi-modal Instruction

    Authors: Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia

    Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant gener…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: 20 pages, 18 figures

  12. arXiv:2312.15770  [pdf, other]

    cs.CV cs.AI

    A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

    Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang

    Abstract: Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from v…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: Project page: https://tf-t2v.github.io/

  13. arXiv:2311.17002  [pdf, other]

    cs.CV

    Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

    Authors: Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou

    Abstract: Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed…

    Submitted 9 April, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

  14. SeamlessNeRF: Stitching Part NeRFs with Gradient Propagation

    Authors: Bingchen Gong, Yuehao Wang, Xiaoguang Han, Qi Dou

    Abstract: Neural Radiance Fields (NeRFs) have emerged as promising digital mediums of 3D objects and scenes, sparking a surge in research to extend the editing capabilities in this domain. The task of seamless editing and merging of multiple NeRFs, resembling the "Poisson blending" in 2D image editing, remains a critical operation that is under-explored by existing work. To fill this gap, we propose Seaml…

    Submitted 30 October, 2023; originally announced November 2023.

    Comments: To appear in SIGGRAPH Asia 2023. Project website is accessible at https://sites.google.com/view/seamlessnerf

  15. arXiv:2311.15841  [pdf, other]

    cs.CV

    Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

    Authors: Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang

    Abstract: This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actio…

    Submitted 10 May, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  16. arXiv:2311.15773  [pdf, other]

    cs.CV

    Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

    Authors: Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu

    Abstract: Diffusion models have recently achieved remarkable progress in generating realistic images. However, challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions, we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time.…

    Submitted 25 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  17. arXiv:2311.06386  [pdf, other]

    cs.CV cs.LG

    Towards A Unified Neural Architecture for Visual Recognition and Reasoning

    Authors: Calvin Luo, Boqing Gong, Ting Chen, Chen Sun

    Abstract: Recognition and reasoning are two pillars of visual understanding. However, these tasks have an imbalance in focus; whereas recent advances in neural networks have shown strong empirical performance in visual recognition, there has been comparably much less success in solving visual reasoning. Intuitively, unifying these two tasks under a singular framework is desirable, as they are mutually depen…

    Submitted 10 November, 2023; originally announced November 2023.

  18. arXiv:2310.05737  [pdf, other]

    cs.CV cs.AI cs.MM

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

    Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer…

    Submitted 29 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  19. arXiv:2310.04550  [pdf, other]

    cs.CV cs.CL cs.LG

    Module-wise Adaptive Distillation for Multimodality Foundation Models

    Authors: Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, Tianyi Zhou

    Abstract: Pre-trained multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes. One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer. Motivated by our observation that certain architecture compone…

    Submitted 6 October, 2023; originally announced October 2023.

  20. arXiv:2309.13446  [pdf, other]

    cs.CV

    Video Timeline Modeling For News Story Understanding

    Authors: Meng Liu, Mingda Zhang, Jialu Liu, Hanjun Dai, Ming-Hsuan Yang, Shuiwang Ji, Zheyun Feng, Boqing Gong

    Abstract: In this paper, we present a novel problem, namely video timeline modeling. Our objective is to create a video-associated timeline from a set of videos related to a specific topic, thereby facilitating the content and structure understanding of the story being told. This problem has significant potential in various real-world applications, for instance, news story summarization. To bootstrap resear…

    Submitted 27 October, 2023; v1 submitted 23 September, 2023; originally announced September 2023.

    Comments: Accepted as a spotlight by NeurIPS 2023, Track on Datasets and Benchmarks

  21. arXiv:2309.13247  [pdf, other]

    cs.CV

    Multi-modal Domain Adaptation for REG via Relation Transfer

    Authors: Yifan Ding, Liqiang Wang, Boqing Gong

    Abstract: Domain adaptation, which aims to transfer knowledge between domains, has been well studied in many areas such as image classification and object detection. However, for multi-modal tasks, conventional approaches rely on large-scale pre-training. But due to the difficulty of acquiring multi-modal data, large-scale pre-training is often impractical. Therefore, domain adaptation, which can efficientl…

    Submitted 23 September, 2023; originally announced September 2023.

  22. arXiv:2308.13280  [pdf, other]

    physics.ao-ph cs.AI cs.LG physics.comp-ph

    AtmoRep: A stochastic model of atmosphere dynamics using large scale representation learning

    Authors: Christian Lessig, Ilaria Luise, Bing Gong, Michael Langguth, Scarlet Stadtler, Martin Schultz

    Abstract: The atmosphere affects humans in a multitude of ways, from loss of life due to adverse weather effects to long-term social and economic impacts on societies. Computer simulations of atmospheric dynamics are, therefore, of great importance for the well-being of our and future generations. Here, we propose AtmoRep, a novel, task-independent stochastic computer model of atmospheric dynamics that can…

    Submitted 7 September, 2023; v1 submitted 25 August, 2023; originally announced August 2023.

  23. arXiv:2307.03166  [pdf, other]

    cs.CV

    VideoGLUE: Video General Understanding Evaluation of Foundation Models

    Authors: Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

    Abstract: We evaluate existing foundation models' video understanding capabilities using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoG…

    Submitted 1 December, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Fixes some typos and include project open-source page: https://github.com/tensorflow/models/tree/master/official/projects/videoglue

  24. arXiv:2306.03515  [pdf, other]

    cs.LG cs.AI cs.LO

    Logic Diffusion for Knowledge Graph Reasoning

    Authors: Xiaoying Xie, Biao Gong, Yiliang Lv, Zhen Han, Guoshuai Zhao, Xueming Qian

    Abstract: Most recent works focus on answering first order logical queries to explore the knowledge graph reasoning via multi-hop logic predictions. However, existing reasoning models are limited by the circumscribed logical paradigms of training samples, which leads to a weak generalization of unseen logic. To address these issues, we propose a plug-in module called Logic Diffusion (LoD) to discover unseen…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: 10 pages, 6 figures

  25. arXiv:2304.10199  [pdf, other]

    cs.IR

    Selective and Collaborative Influence Function for Efficient Recommendation Unlearning

    Authors: Yuyuan Li, Chaochao Chen, Xiaolin Zheng, Yizhao Zhang, Biao Gong, Jun Wang

    Abstract: Recent regulations on the Right to be Forgotten have greatly influenced the way of running a recommender system, because users now have the right to withdraw their private data. Besides simply deleting the target data in the database, unlearning the associated data lineage e.g., the learned personal features and preferences in the model, is also necessary for data withdrawal. Existing unlearning m…

    Submitted 20 April, 2023; originally announced April 2023.

  26. arXiv:2304.07882  [pdf, other]

    cs.CV

    Federated Learning of Shareable Bases for Personalization-Friendly Image Classification

    Authors: Hong-You Chen, Jike Zhong, Mingda Zhang, Xuhui Jia, Hang Qi, Boqing Gong, Wei-Lun Chao, Li Zhang

    Abstract: Personalized federated learning (PFL) aims to harness the collective wisdom of clients' data while building personalized models tailored to individual clients' data distributions. Existing works offer personalization primarily to clients who participate in the FL process, making it hard to encompass new clients who were absent or newly show up. In this paper, we propose FedBasis, a novel PFL frame…

    Submitted 31 October, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

    Comments: Preprint

  27. arXiv:2304.07429  [pdf, other]

    cs.CV

    Identity Encoder for Personalized Diffusion

    Authors: Yu-Chuan Su, Kelvin C. K. Chan, Yandong Li, Yang Zhao, Han Zhang, Boqing Gong, Huisheng Wang, Xuhui Jia

    Abstract: Many applications can benefit from personalized image generation models, including image enhancement, video conferences, just to name a few. Existing works achieved personalization by fine-tuning one model for each person. While being successful, this approach incurs additional computation and storage overhead for each new identity. Furthermore, it usually expects tens or hundreds of examples per…

    Submitted 14 April, 2023; originally announced April 2023.

  28. arXiv:2304.02720  [pdf, other]

    eess.IV cs.CR cs.CV

    Domain Generalization with Adversarial Intensity Attack for Medical Image Segmentation

    Authors: Zheyuan Zhang, Bin Wang, Lanhong Yao, Ugur Demir, Debesh Jha, Ismail Baris Turkbey, Boqing Gong, Ulas Bagci

    Abstract: Most statistical learning algorithms rely on an over-simplified assumption, that is, the train and test data are independent and identically distributed. In real-world scenarios, however, it is common for models to encounter data from new and different domains to which they were not exposed to during training. This is often the case in medical imaging applications due to differences in acquisition…

    Submitted 5 April, 2023; originally announced April 2023.

    Comments: Code is available upon publication

  29. arXiv:2304.02642  [pdf, other]

    cs.CV

    Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

    Authors: Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su

    Abstract: This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only…

    Submitted 5 April, 2023; originally announced April 2023.

  30. arXiv:2303.16341  [pdf, other]

    cs.CV

    Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding

    Authors: Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan

    Abstract: Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is of importance to downstream tasks requiring temporal localization and semantic reasoning. A powerful model is expected to be capable of capturing region-object corr…

    Submitted 8 March, 2024; v1 submitted 28 March, 2023; originally announced March 2023.

  31. arXiv:2303.15230  [pdf, other]

    cs.CV cs.CL cs.LG

    Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning

    Authors: Siteng Huang, Biao Gong, Yutong Feng, Min Zhang, Yiliang Lv, Donglin Wang

    Abstract: Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, thus limiting the exploitation of pre-trained knowledge and generalization to unseen compo…

    Submitted 25 March, 2024; v1 submitted 27 March, 2023; originally announced March 2023.

    Comments: CVPR 2024

  32. arXiv:2303.08998  [pdf, other]

    cs.CV

    Unified Visual Relationship Detection with Vision and Language Models

    Authors: Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

    Abstract: This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propos…

    Submitted 20 August, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023. Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/univrd

  33. arXiv:2303.08561  [pdf, other]

    cs.SD eess.AS

    Enhancing Unsupervised Audio Representation Learning via Adversarial Sample Generation

    Authors: Yulin Pan, Xiangteng He, Biao Gong, Yuxin Peng, Yiliang Lv

    Abstract: Existing audio analysis methods generally first transform the audio stream to spectrogram, and then feed it into CNN for further analysis. A standard CNN recognizes specific visual patterns over feature map, then pools for high-level representation, which overlooks the positional information of recognized patterns. However, unlike natural image, the semantic of an audio spectrogram is sensitive to…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: 8 pages, 4 figures

  34. Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

    Authors: Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, Deli Zhao

    Abstract: Video temporal grounding aims to pinpoint a video segment that matches the query description. Despite the recent advance in short-form videos (e.g., in minutes), temporal grounding in long videos (e.g., in hours) is still at its early stage. To address this challenge, a common practice is to employ a sliding window, yet can be inefficient and inflexible due to the limited number…

    Submitted 22 March, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: 11 pages, 8 figures

    Journal ref: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

  35. arXiv:2303.06911  [pdf, other]

    cs.CV

    ViM: Vision Middleware for Unified Downstream Transferring

    Authors: Yutong Feng, Biao Gong, Jianwen Jiang, Yiliang Lv, Yujun Shen, Deli Zhao, Jingren Zhou

    Abstract: Foundation models are pre-trained on massive data and transferred to downstream tasks via fine-tuning. This work presents Vision Middleware (ViM), a new learning paradigm that targets unified transferring from a single foundation model to a variety of downstream tasks. ViM consists of a zoo of lightweight plug-in modules, each of which is independently learned on a midstream dataset with a shared…

    Submitted 13 March, 2023; originally announced March 2023.

  36. arXiv:2302.06891  [pdf, other]

    cs.CV

    UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training

    Authors: Biao Gong, Xiaoying Xie, Yutong Feng, Yiliang Lv, Yujun Shen, Deli Zhao

    Abstract: This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Particularly focusing on visual and linguistic modalities, we categorize data knowledge into five unit types, namely, in-image, in-text, cross-image, cross-text, and image-text, and set up an efficient pipeline to help construct the multimodal knowledge graph from…

    Submitted 21 March, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

  37. RecolorNeRF: Layer Decomposed Radiance Fields for Efficient Color Editing of 3D Scenes

    Authors: Bingchen Gong, Yuehao Wang, Xiaoguang Han, Qi Dou

    Abstract: Radiance fields have gradually become a main representation of media. Although its appearance editing has been studied, how to achieve view-consistent recoloring in an efficient manner is still under explored. We present RecolorNeRF, a novel user-friendly color editing approach for the neural radiance fields. Our key idea is to decompose the scene into a set of pure-colored layers, forming a palet…

    Submitted 18 September, 2023; v1 submitted 19 January, 2023; originally announced January 2023.

    Comments: To appear in ACM Multimedia 2023. Project website is accessible at https://sites.google.com/view/recolornerf

  38. arXiv:2212.12053  [pdf, other]

    cs.CV cs.AI cs.LG

    On Calibrating Semantic Segmentation Models: Analyses and An Algorithm

    Authors: Dongdong Wang, Boqing Gong, Liqiang Wang

    Abstract: We study the problem of semantic segmentation calibration. Lots of solutions have been proposed to approach model miscalibration of confidence in image classification. However, to date, confidence calibration research on semantic segmentation is still limited. We provide a systematic study on the calibration of semantic segmentation models and propose a simple yet effective approach. First, we fin…

    Submitted 25 March, 2023; v1 submitted 22 December, 2022; originally announced December 2022.

    Comments: Accepted to CVPR-2023 (8 pages, 4 figures)

  39. arXiv:2211.12764  [pdf, other]

    cs.CV cs.AI cs.CL

    VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval

    Authors: Siteng Huang, Biao Gong, Yulin Pan, Jianwen Jiang, Yiliang Lv, Yuyuan Li, Donglin Wang

    Abstract: Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings huge computational burdens with much more parameters, but also leads to the knowledge forgetting from upstream models. In this work, we propose the VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retriev…

    Submitted 21 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted by CVPR 2023

  40. arXiv:2210.08064  [pdf, other]

    cs.CV cs.RO

    LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds

    Authors: Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, Dragomir Anguelov

    Abstract: Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor sce…

    Submitted 14 October, 2022; originally announced October 2022.

  41. arXiv:2208.08349  [pdf, other]

    cs.CV cs.LG

    Open Long-Tailed Recognition in a Dynamic World

    Authors: Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, Stella X. Yu

    Abstract: Real world data often exhibits a long-tailed and open-ended (with unseen classes) distribution. A practical recognition system must balance between majority (head) and minority (tail) classes, generalize across the distribution, and acknowledge novelty upon the instances of unseen classes (open classes). We define Open Long-Tailed Recognition++ (OLTR++) as learning from such naturally distributed…

    Submitted 17 August, 2022; originally announced August 2022.

    Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022. Extended version of our previous CVPR oral paper (arXiv:1904.05160)

  42. arXiv:2204.05376  [pdf, other]

    cs.CV

    medXGAN: Visual Explanations for Medical Classifiers through a Generative Latent Space

    Authors: Amil Dravid, Florian Schiffers, Boqing Gong, Aggelos K. Katsaggelos

    Abstract: Despite the surge of deep learning in the past decade, some users are skeptical to deploy these models in practice due to their black-box nature. Specifically, in the medical space where there are severe potential repercussions, we need to develop methods to gain confidence in the models' decisions. To this end, we propose a novel medical imaging generative adversarial framework, medXGAN (medical…

    Submitted 17 April, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: 10 pages, 11 figures, accepted to CVPR TCV workshop

    ACM Class: I.5.4; I.5.1; I.4.9; I.4.5; I.2.10

  43. arXiv:2203.08065  [pdf, other]

    cs.LG cs.AI

    Surrogate Gap Minimization Improves Sharpness-Aware Training

    Authors: Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, Ting Liu

    Abstract: The recently proposed Sharpness-Aware Minimization (SAM) improves generalization by minimizing a \textit{perturbed loss} defined as the maximum loss within a neighborhood in the parameter space. However, we show that both sharp and flat minima can have a low perturbed loss, implying that SAM does not always prefer flat minima. Instead, we define a \textit{surrogate gap}, a measure equivalent to th…

    Submitted 19 March, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: Paper accepted by ICLR22, https://openreview.net/forum?id=edONMAnhLu-

  44. arXiv:2112.07074  [pdf, other]

    cs.CV cs.LG

    Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

    Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown

    Abstract: In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose…

    Submitted 13 December, 2021; originally announced December 2021.

    Comments: preliminary work

  45. arXiv:2112.05181  [pdf, other]

    cs.CV

    Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision

    Authors: Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

    Abstract: Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While being very effective on learning holistic image and video representations, such an objective becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spa…

    Submitted 1 April, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

    Comments: CVPR 2022

  46. arXiv:2112.04480  [pdf, other]

    cs.CV cs.LG

    Exploring Temporal Granularity in Self-Supervised Video Representation Learning

    Authors: Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui

    Abstract: This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between co…

    Submitted 8 December, 2021; originally announced December 2021.

  47. arXiv:2109.09023  [pdf, other]

    cs.CR cs.LG cs.MM

    Anti-Neuron Watermarking: Protecting Personal Data Against Unauthorized Neural Networks

    Authors: Zihang Zou, Boqing Gong, Liqiang Wang

    Abstract: We study protecting a user's data (images in this work) against a learner's unauthorized use in training neural networks. It is especially challenging when the user's data is only a tiny percentage of the learner's complete training set. We revisit the traditional watermarking under modern deep learning settings to tackle the challenge. We show that when a user watermarks images using a specialize…

    Submitted 1 August, 2022; v1 submitted 18 September, 2021; originally announced September 2021.

    Comments: Accepted to ECCV 2022

  48. arXiv:2108.11976  [pdf, other]

    cs.DC cs.LG

    JUWELS Booster -- A Supercomputer for Large-Scale AI Research

    Authors: Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, Gabriele Cavallaro, Rocco Sedona, Alexander Schug, Alexandre Strube, Roshni Kamath, Martin G. Schultz, Morris Riedel, Thomas Lippert

    Abstract: In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Center. With its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast interconnect via InfiniBand, it is an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its s…

    Submitted 30 June, 2021; originally announced August 2021.

    Comments: 12 pages, 5 figures. Accepted at ISC 2021, Workshop Deep Learning on Supercomputers. This is a duplicate submission as my previous submission is on hold for several weeks now and my attempts to contact the moderators failed

    Report number: 1234567Dummy

  49. ME-PCN: Point Completion Conditioned on Mask Emptiness

    Authors: Bingchen Gong, Yinyu Nie, Yiqun Lin, Xiaoguang Han, Yizhou Yu

    Abstract: Point completion refers to completing the missing geometries of an object from incomplete observations. Main-stream methods predict the missing shapes by decoding a global feature learned from the input point cloud, which often leads to deficient results in preserving topology consistency and surface details. In this work, we present ME-PCN, a point completion network that leverages `emptiness' in…

    Submitted 14 October, 2021; v1 submitted 18 August, 2021; originally announced August 2021.

    Comments: Accepted to ICCV 2021; typos corrected

  50. arXiv:2108.07792  [pdf, other]

    cs.CV

    Federated Multi-Target Domain Adaptation

    Authors: Chun-Han Yao, Boqing Gong, Yin Cui, Hang Qi, Yukun Zhu, Ming-Hsuan Yang

    Abstract: Federated learning methods enable us to train machine learning models on distributed user data while preserving its privacy. However, it is not always feasible to obtain high-quality supervisory signals from users, especially for vision tasks. Unlike typical federated settings with labeled client data, we consider a more practical scenario where the distributed client data is unlabeled, and a cent…

    Submitted 17 August, 2021; originally announced August 2021.